Auto-Scaling Dedicated Servers with a Script

Auto-scaling is a common feature in the cloud, but it is rarely explained for dedicated servers. Scaling dedicated servers is harder: new machines take longer to become ready, workloads often store data locally, and you have to manage IPs, DNS, and the load balancer yourself. With the right setup, however, you can build dedicated server auto scaling scripts that handle all of this for you.

In this guide, you will learn how to build a script that watches your metrics, decides when to add or remove servers, and then updates your infrastructure automatically. We will configure the autoscaler to run on PerLod's high-performance dedicated servers.

Prerequisites for Dedicated Server Auto Scaling Scripts

First, make sure you have a control host: a Linux machine that will run your script. The following tools must be installed on the control host:

  • Python 3.10 or newer, with the requests and PyYAML packages: The main language and libraries for your auto scaling script.
  • Terraform 1.6 or newer (optional): Used to create and destroy servers automatically.
  • Ansible 2.15 or newer: Used to set up and configure servers after they are created.
  • Prometheus: The autoscaler needs to reach a Prometheus endpoint to read metrics such as CPU usage or request rates.
  • socat: Used by the HAProxy Runtime API script later in this guide to talk to the admin socket.

You will also need one of the following ways to create new servers:

  • Vendor API access: If you use a hosting provider like PerLod Hosting, you will need their API keys.
  • PXE or MAAS access: If you use your own bare-metal servers, you will need a way to boot and install them automatically.

You must also prepare an installation image or automated install process for your servers. For vendor servers, use the provider's pre-built templates with cloud-init or a post-install script. For your own servers, set up PXE boot with a preseed or kickstart file plus cloud-init so they can install unattended.

Finally, you need a load balancer that your script can update, such as HAProxy or Nginx.

Design Dedicated Server Auto Scaling Policy

A scaling policy is the logic that decides when to add or remove servers. Choose one of the following main rules and combine it with the safety checks listed below; a short Python sketch of the combined logic follows the lists.

CPU policy:

  • If average CPU usage is above 75% for 5 minutes, add one server.
  • If it’s below 30% for 15 minutes, remove one server.

RPS policy:

  • If requests per second per healthy server are above 1500, add a server.
  • If they are below 600, remove one.

Queue policy:

  • If each server has more than 5000 jobs in the queue, add one.
  • If fewer than 1000, remove one.

Safety Checks:

  • Min and Max servers: Never go below or above certain limits.
  • Cooldown window: Wait at least 10 minutes before scaling again.
  • Surge buffer: Keep a few extra servers ready during busy times.
  • Graceful drain: Wait for running jobs to finish before shutting down a server.
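
Before wiring anything to real hardware, it helps to see how these rules combine. The function below is a minimal, illustrative sketch of the CPU policy plus the min/max and cooldown checks; the names and default thresholds are placeholders, and the full controller later in this guide implements the same idea against the real config file.

import time

def decide(cpu_avg, current_nodes, last_action_ts,
           min_nodes=3, max_nodes=20,
           scale_out_cpu=0.75, scale_in_cpu=0.30,
           cooldown_seconds=600):
    """Return the desired node count from the CPU policy plus safety checks."""
    # Cooldown window: do nothing if the last action was too recent
    if time.time() - last_action_ts < cooldown_seconds:
        return current_nodes
    # Scale out: high CPU and still below the maximum
    if cpu_avg > scale_out_cpu and current_nodes < max_nodes:
        return current_nodes + 1
    # Scale in: low CPU and still above the minimum
    if cpu_avg < scale_in_cpu and current_nodes > min_nodes:
        return current_nodes - 1
    return current_nodes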

Autoscaling System Configuration for a Web Server Pool

You need a central configuration file for the autoscaling system that dynamically manages the pool of web servers. Create the working directory and the YAML configuration file with the following commands:

sudo mkdir -p /opt/autoscaler
sudo nano /opt/autoscaler/config.yaml

Add the following configuration to the file:

prometheus:
  url: "http://prometheus.internal:9090"

pool:
  name: "web-pool"
  min_nodes: 3
  max_nodes: 20
  scale_out_threshold_cpu: 0.75
  scale_in_threshold_cpu: 0.30
  evaluation_seconds: 300        # 5 minutes
  cooldown_seconds: 600          # 10 minutes between actions
  drain_timeout_seconds: 300     # 5 minutes to drain connections

provider:
  # choose: "maas" or "vendor"
  type: "vendor"
  # API token used by the vendor adapter (not needed for MAAS)
  token: "REPLACE_WITH_VENDOR_API_TOKEN"
  # common metadata passed to the adapter
  region: "eu-west"
  image: "ubuntu-22-04-cloud"
  plan: "performance-1"
  ssh_key_name: "autoscaler-key"

lb:
  haproxy_socket: "/run/haproxy/admin.sock"
  backend_name: "bk_web"

dns:
  enabled: true
  zone: "example.com"
  prefix: "web-"
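
With the file saved, you can run a quick pre-flight check from /opt/autoscaler to confirm that the configuration parses and that the Prometheus endpoint is reachable. This is only a sanity-check sketch; it queries the built-in up metric, so swap in your own CPU metric once you know its name.

#!/usr/bin/env python3
# Pre-flight check: parse config.yaml and query Prometheus once.
import requests, yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

print(f"Pool {cfg['pool']['name']}: {cfg['pool']['min_nodes']}-{cfg['pool']['max_nodes']} nodes")

# 'up' exists in every Prometheus; replace it with your CPU metric when ready
r = requests.get(f"{cfg['prometheus']['url']}/api/v1/query",
                 params={"query": "up"}, timeout=10)
r.raise_for_status()
print("Prometheus reachable, sample result:", r.json()["data"]["result"][:1])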

Set up Provider Adapter Interface For Auto Scaling

To let the autoscaling system talk to any provider or infrastructure management platform, such as MAAS, through a unified API, you must define a provider adapter interface.

Create the providers directory and the base adapter file with the commands below:

sudo mkdir -p /opt/autoscaler/providers
sudo nano /opt/autoscaler/providers/base.py

Add the following code to the file:

from abc import ABC, abstractmethod

class ProviderAdapter(ABC):
    @abstractmethod
    def list_active_nodes(self) -> list:
        """Return [{'id': 'node1', 'ip': 'X.X.X.X', 'hostname': '...'}, ...]"""

    @abstractmethod
    def provision(self, count: int) -> list:
        """Provision N servers. Return same shape as list_active_nodes()."""

    @abstractmethod
    def deprovision(self, node_ids: list[str]) -> None:
        """Destroy given node IDs."""

    @abstractmethod
    def wait_ready(self, node_ids: list[str], timeout: int = 3600) -> list:
        """Block until nodes SSH-ready and cloud-init done. Return infos."""

MAAS or iPXE Provider Implementation

If you provision machines from your own pool with MAAS or iPXE, create the following file and replace the example calls with your own MAAS CLI or API calls that allocate a machine, set the image, commission, deploy, and fetch the IP address. The examples assume a MAAS CLI profile named admin, created with maas login:

sudo nano /opt/autoscaler/providers/provider_maas.py

Add the following code, adapted to your MAAS setup:

import time, subprocess, json
from .base import ProviderAdapter

class MAASProvider(ProviderAdapter):
    def __init__(self, cfg):
        self.cfg = cfg

    def list_active_nodes(self):
        # Example: query MAAS for deployed machines in the "web" tag
        out = subprocess.check_output(["maas", "admin", "machines", "read"])
        machines = json.loads(out)
        result = []
        for m in machines:
            if m.get("status_name") == "Deployed" and "web" in m.get("tags", []):
                ip = m["ip_addresses"][0] if m["ip_addresses"] else None
                result.append({"id": m["system_id"], "ip": ip, "hostname": m["hostname"]})
        return result

    def provision(self, count):
        new_ids = []
        for _ in range(count):
            # Find a ready machine with tag "web" and deploy
            # Replace with your MAAS selection logic
            sys_id = self._allocate_one()
            subprocess.check_call(["maas", "admin", "machine", "deploy", sys_id,
                                   f"osystem=ubuntu", "distro_series=jammy"])
            new_ids.append(sys_id)
        return [{"id": i, "ip": None, "hostname": None} for i in new_ids]

    def _allocate_one(self):
        # Pick a machine from a pool. Replace with real selection.
        out = subprocess.check_output(["maas", "admin", "machines", "read"])
        ms = json.loads(out)
        for m in ms:
            if m["status_name"] == "Ready" and "web" in m.get("tags", []):
                return m["system_id"]
        raise RuntimeError("No READY machines in pool")

    def wait_ready(self, node_ids, timeout=3600):
        deadline = time.time() + timeout
        ready = []
        while time.time() < deadline:
            current = {m["id"]: m for m in self.list_active_nodes()}
            for nid in list(node_ids):
                if nid in current and current[nid]["ip"]:
                    ready.append(current[nid])
                    node_ids.remove(nid)
            if not node_ids:
                break
            time.sleep(15)
        if node_ids:
            raise TimeoutError(f"Not ready: {node_ids}")
        return ready

    def deprovision(self, node_ids):
        for nid in node_ids:
            subprocess.call(["maas", "admin", "machine", "release", nid, "erase=true"])

Vendor API Provider Adapter

If you are using a remote vendor API, create the adapter file and fill in your vendor's endpoints:

sudo nano /opt/autoscaler/providers/provider_vendor_api.py

Add the following code to the file:

import time, requests
from .base import ProviderAdapter

class VendorAPIProvider(ProviderAdapter):
    def __init__(self, cfg):
        self.cfg = cfg
        self.base = "https://api.vendor.example/v1"
        self.headers = {"Authorization": f"Bearer {self.cfg['token']}"}

    def list_active_nodes(self):
        r = requests.get(f"{self.base}/servers?tag=web", headers=self.headers, timeout=20)
        r.raise_for_status()
        result = []
        for s in r.json():
            if s["status"] == "active":
                result.append({"id": s["id"], "ip": s["public_ip"], "hostname": s["hostname"]})
        return result

    def provision(self, count):
        new = []
        for _ in range(count):
            payload = {
                "plan": self.cfg["plan"],
                "region": self.cfg["region"],
                "image": self.cfg["image"],
                "ssh_key": self.cfg["ssh_key_name"],
                "tags": ["web"]
            }
            r = requests.post(f"{self.base}/servers", json=payload, headers=self.headers, timeout=30)
            r.raise_for_status()
            s = r.json()
            new.append({"id": s["id"], "ip": None, "hostname": s["hostname"]})
        return new

    def wait_ready(self, node_ids, timeout=3600):
        ready, deadline = [], time.time() + timeout
        pending = set(node_ids)
        while time.time() < deadline and pending:
            for nid in list(pending):
                r = requests.get(f"{self.base}/servers/{nid}", headers=self.headers, timeout=20)
                r.raise_for_status()
                s = r.json()
                if s["status"] == "active" and s.get("public_ip"):
                    ready.append({"id": s["id"], "ip": s["public_ip"], "hostname": s["hostname"]})
                    pending.remove(nid)
            time.sleep(15)
        if pending:
            raise TimeoutError(f"Not ready: {list(pending)}")
        return ready

    def deprovision(self, node_ids):
        for nid in node_ids:
            requests.delete(f"{self.base}/servers/{nid}", headers=self.headers, timeout=20).raise_for_status()
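
You can exercise the adapter on its own before handing it to the controller. The snippet below is a small smoke test run from /opt/autoscaler; it assumes the token field shown in config.yaml above and that your vendor actually exposes the endpoints used by the adapter.

#!/usr/bin/env python3
# Standalone smoke test for the vendor adapter (run from /opt/autoscaler).
import yaml
from providers.provider_vendor_api import VendorAPIProvider

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

provider = VendorAPIProvider(cfg["provider"])
for node in provider.list_active_nodes():
    print(node["id"], node["ip"], node["hostname"])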

Bootstrap and Service Registration Automation

After a new server is created, you can use Ansible for setup and custom scripts for service registration. The process makes sure every new server gets the same configuration as the others, starts all the required services, and is added to the load balancer so it can start handling traffic right away.

Ansible Bootstrap Playbook: Automated Server Configuration and Service Setup

This Ansible playbook runs automatically on every new server as soon as it’s created. It handles the basic setup tasks, often called “day-1 configuration.” It installs Prometheus Node Exporter, deploys your app, and registers the new server with the load balancer.

This process makes sure that every new server is configured the same way, fully monitored, and ready to serve requests right after provisioning.

Create the bootstrap directory and the Ansible playbook with the commands below:

sudo mkdir -p /opt/autoscaler/bootstrap
sudo nano /opt/autoscaler/bootstrap/site.yml

Add the following configuration to the file:

---
- hosts: new_nodes
  become: true
  tasks:
    - name: Ensure base packages
      apt:
        name:
          - curl
          - jq
        state: present
        update_cache: yes

    - name: Install node exporter (example)
      apt:
        name: prometheus-node-exporter
        state: present

    - name: Deploy app service unit
      copy:
        dest: /etc/systemd/system/webapp.service
        content: |
          [Unit]
          Description=Web App
          After=network-online.target
          [Service]
          ExecStart=/usr/local/bin/webapp
          Restart=always
          [Install]
          WantedBy=multi-user.target

    - name: Enable and start services
      systemd:
        name: "{{ item }}"
        daemon_reload: yes
        enabled: yes
        state: started
      loop:
        - prometheus-node-exporter
        - webapp

    - name: Register to load balancer
      shell: "/opt/register_to_lb.sh {{ inventory_hostname }} {{ ansible_default_ipv4.address }}"

A dynamic inventory file is used for the Ansible bootstrapping process. It defines a host group called [new_nodes], where the autoscaler writes the IP addresses of newly provisioned servers right before running the playbook. Create it with the command below:

sudo nano /opt/autoscaler/bootstrap/ansible_inventory.ini

Add the following to the file:

[new_nodes]
# autoscaler will write IPs here temporarily per run

HAProxy Runtime API Script: Adding and Removing Backend Servers

This script uses HAProxy's Runtime API, through the admin socket and socat, to dynamically add or remove backend servers without reloading the entire service, which prevents connection drops.

Create the load balancer directory and the HAProxy script file:

sudo mkdir -p /opt/autoscaler/lb
sudo nano /opt/autoscaler/lb/haproxy_reconfigure.sh

Add the following script to the file:

#!/usr/bin/env bash
set -euo pipefail
SOCK="${1:-/run/haproxy/admin.sock}"
BACKEND="${2:-bk_web}"
ACTION="${3:-add}"   # add|del
NAME="${4:?backend server name}"
IP="${5:-}"

if [ "$ACTION" = "add" ]; then
  echo "set server ${BACKEND}/${NAME} addr ${IP} port 80" | socat stdio "unix-connect:${SOCK}"
  echo "set server ${BACKEND}/${NAME} state ready" | socat stdio "unix-connect:${SOCK}"
elif [ "$ACTION" = "del" ]; then
  echo "set server ${BACKEND}/${NAME} state maint" | socat stdio "unix-connect:${SOCK}"
  # Optional: keep server object but mark maint, or remove from config on next reload
fi

DNS Update Script: Cloudflare API Integration (Optional)

This script automates service discovery by managing DNS records via the Cloudflare API. Create the DNS directory and the Cloudflare update script file:

sudo mkdir -p /opt/autoscaler/dns
sudo nano /opt/autoscaler/dns/cloudflare_update.sh

Add the following script to the file:

#!/usr/bin/env bash
# CF_API_TOKEN env var required
# Usage: ./cloudflare_update.sh add web-123 203.0.113.10 zone.com
set -euo pipefail
ACTION="$1" ; NAME="$2" ; IP="$3" ; ZONE="$4"
# Implement with curl to Cloudflare API v4. Left as a placeholder.
echo "[DNS] $ACTION $NAME.$ZONE -> $IP"

Autoscaling Controller Script: Core Decision Engine

This script is the core of the autoscaling system. It implements a continuous control loop that monitors application load, makes scaling decisions based on CPU metrics, and executes the complete lifecycle operations for servers.

Create the autoscaler script file with the command below:

sudo nano /opt/autoscaler/autoscaler.py

Add the following script to the file:

#!/usr/bin/env python3
import time, yaml, subprocess, requests, importlib

def prom_query(url, q):
    r = requests.get(f"{url}/api/v1/query", params={"query": q}, timeout=10)
    r.raise_for_status()
    data = r.json()["data"]["result"]
    if not data: return None
    return float(data[0]["value"][1])

def choose_provider(cfg):
    if cfg["provider"]["type"] == "maas":
        mod = importlib.import_module("providers.provider_maas")
        return mod.MAASProvider(cfg["provider"])
    elif cfg["provider"]["type"] == "vendor":
        mod = importlib.import_module("providers.provider_vendor_api")
        return mod.VendorAPIProvider(cfg["provider"])
    else:
        raise ValueError("Unknown provider")

def write_inventory(new_nodes):
    path = "bootstrap/ansible_inventory.ini"
    lines = ["[new_nodes]"]
    for n in new_nodes:
        lines.append(n["ip"])
    with open(path, "w") as f:
        f.write("\n".join(lines)+"\n")
    return path

def register_in_lb(lb_sock, backend, nodes):
    for n in nodes:
        name = n["hostname"] or f"node-{n['ip'].replace('.','-')}"
        subprocess.check_call(["bash", "lb/haproxy_reconfigure.sh", lb_sock, backend, "add", name, n["ip"]])

def drain_from_lb(lb_sock, backend, nodes):
    for n in nodes:
        name = n["hostname"] or f"node-{n['ip'].replace('.','-')}"
        subprocess.call(["bash", "lb/haproxy_reconfigure.sh", lb_sock, backend, "del", name])

def main():
    with open("config.yaml") as f:
        cfg = yaml.safe_load(f)

    provider = choose_provider(cfg)
    # The controller runs as a oneshot unit from a systemd timer, so the time
    # of the last scaling action is persisted to a small state file between runs.
    state_file = "last_action_ts"
    try:
        with open(state_file) as f:
            last_action_ts = float(f.read().strip())
    except (OSError, ValueError):
        last_action_ts = 0.0

    # 1) Read state
    nodes = provider.list_active_nodes()
    current = len(nodes)

    # 2) Query metrics
    cpu_avg = prom_query(cfg["prometheus"]["url"],
                         'avg(instance:node_cpu_utilization:rate5m{job="node"})')

    # Fallback if metric missing
    if cpu_avg is None:
        print("No CPU metric, exiting without change")
        return

    # 3) Decide desired
    desired = current
    if cpu_avg > cfg["pool"]["scale_out_threshold_cpu"] and current < cfg["pool"]["max_nodes"]:
        desired = current + 1
    elif cpu_avg < cfg["pool"]["scale_in_threshold_cpu"] and current > cfg["pool"]["min_nodes"]:
        desired = current - 1

    if desired == current:
        print(f"No change. CPU={cpu_avg:.2f} nodes={current}")
        return

    # cooldown: skip if the last scaling action was too recent
    if time.time() - last_action_ts < cfg["pool"]["cooldown_seconds"]:
        print("Cooldown in effect, skipping")
        return

    if desired > current:
        to_add = desired - current
        print(f"Scale OUT +{to_add}")
        new = provider.provision(to_add)
        ready = provider.wait_ready([n["id"] for n in new], timeout=3600)
        # Bootstrap new nodes
        write_inventory(ready)
        subprocess.check_call(["ansible-playbook", "-i", "bootstrap/ansible_inventory.ini", "bootstrap/site.yml"])
        # Register in LB
        register_in_lb(cfg["lb"]["haproxy_socket"], cfg["lb"]["backend_name"], ready)
        # Optional DNS
        # for n in ready: subprocess.call(["bash","dns/cloudflare_update.sh","add", f"{cfg['dns']['prefix']}{n['hostname']}", n["ip"], cfg["dns"]["zone"]])

    if desired < current:
        to_remove = current - desired
        # Pick least busy nodes or newest nodes. Here we pick last ones.
        victims = nodes[-to_remove:]
        print(f"Scale IN -{to_remove}: {[v['id'] for v in victims]}")
        # Drain
        drain_from_lb(cfg["lb"]["haproxy_socket"], cfg["lb"]["backend_name"], victims)
        # Wait drain timeout
        time.sleep(cfg["pool"]["drain_timeout_seconds"])
        # Deprovision
        provider.deprovision([v["id"] for v in victims])

    # Record the action time so the cooldown applies to the next run
    with open(state_file, "w") as f:
        f.write(str(time.time()))

if __name__ == "__main__":
    main()

Notes:

  • The Prometheus query shown uses a common recording-rule pattern. You may need to adjust it to match your own metric names and labels. A general query that works with a standard node_exporter setup and returns utilization in the 0–1 range is:
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
  • The autoscaler script creates a temporary Ansible inventory file with the new server IPs. Then, it runs Ansible once for each scale-out event to configure the new nodes and register them with HAProxy.
  • During scale-in, the script first drains traffic from the selected nodes to let them finish current requests, and then destroys those servers safely.

Automated Execution with systemd for Controller

Create a systemd unit file so the autoscaler controller runs as a service:

sudo nano /etc/systemd/system/dedicated-autoscaler.service

Add the following configuration to the file:

[Unit]
Description=Dedicated Autoscaler Controller
After=network-online.target

[Service]
Type=oneshot
WorkingDirectory=/opt/autoscaler
ExecStart=/usr/bin/python3 autoscaler.py
User=autoscaler
Group=autoscaler

Then, create a systemd timer that runs the service automatically every 2 minutes:

sudo nano /etc/systemd/system/dedicated-autoscaler.timer

Add the following configuration to the file:

[Unit]
Description=Run autoscaler every 2 minutes

[Timer]
OnBootSec=30s
OnUnitActiveSec=2min
AccuracySec=10s
Unit=dedicated-autoscaler.service

[Install]
WantedBy=timers.target

Reload systemd and enable the timer with the commands below:

sudo systemctl daemon-reload
sudo systemctl enable --now dedicated-autoscaler.timer

And check it with the following command:

systemctl list-timers | grep dedicated-autoscaler

Python-Driven Terraform Execution (Optional)

This setup combines Python’s decision-making logic with Terraform’s powerful infrastructure management. Instead of making API calls directly from Python, the autoscaler creates Terraform variable files and runs Terraform plans.

Create the Terraform directory and the main file:

sudo mkdir -p /opt/autoscaler/terraform
sudo nano /opt/autoscaler/terraform/main.tf

Add the following to the file:

# Add your provider and a "count = var.desired_count" resource for servers.
variable "desired_count" { type = number }

This small Python script connects the autoscaler with Terraform:

open("terraform/terraform.tfvars", "w").write(f"desired_count = {desired}\n")
subprocess.check_call(["terraform","-chdir=terraform","init","-upgrade"])
subprocess.check_call(["terraform","-chdir=terraform","apply","-auto-approve"])

After Terraform finishes, the script runs Ansible using the list of new server IPs that Terraform produced.
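
A hedged sketch of that glue is shown below. It assumes main.tf defines an output named server_ips containing the new servers' addresses; the output name is illustrative and must match whatever your Terraform code actually exports.

# Read Terraform outputs and write them into the Ansible inventory (sketch).
import json, subprocess

out = subprocess.check_output(["terraform", "-chdir=terraform", "output", "-json"])
ips = json.loads(out)["server_ips"]["value"]   # assumes an output named "server_ips"

with open("bootstrap/ansible_inventory.ini", "w") as f:
    f.write("[new_nodes]\n" + "\n".join(ips) + "\n")

subprocess.check_call(["ansible-playbook", "-i", "bootstrap/ansible_inventory.ini",
                       "bootstrap/site.yml"])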

For a comprehensive guide on combining Terraform with Ansible for dedicated server automation, see Terraform and Ansible for Dedicated Server Automation.

FAQs

Can dedicated servers be auto-scaled like cloud servers?

Yes, but with differences. Dedicated servers take longer to provision, so auto-scaling happens on a slower time scale.

What are the biggest challenges in auto-scaling dedicated servers?

  • Provisioning time: New servers can take 10–30 minutes before they are ready to use.
  • Stateful workloads: You need to drain traffic and copy data safely before removing a server.
  • Hardware differences: Servers may have different CPUs or disks, so the configuration must adjust automatically.
  • Cost impact: Each server costs money, so scaling decisions should be made carefully.

Is Terraform required for auto-scaling dedicated servers?

No. Terraform simplifies provider interaction, but the Python script can call any API directly. If you already use Ansible or MAAS, the script can control those tools instead.

Conclusion

Implementing dedicated server auto scaling scripts combines the flexibility of the cloud with the performance and control of bare metal. With clear scaling policies, reliable provisioning APIs, and automation tools such as Python, Ansible, and Terraform, you can build a self-managing infrastructure that adjusts to workload demand without manual intervention.

We hope you enjoyed this guide. Follow our X and Facebook channels to get the latest articles and updates on automation and scaling.

For further reading:

Automated ML Workflows on Dedicated Servers

Run the Milvus vector database on a VPS

Explore Web Hosting Trends in 2025
