loadBalance

From OuroDev

loadBalance is the name for two configuration files, loadBalanceDefault.cfg and loadBalanceShardSpecific.cfg. These configuration files control how the dbserver tries to balance workload amongst the various host machines made available to it through launcher connections.

The dbserver uses these settings, together with the host utilization and status metrics supplied by the launchers, to decide where best to launch server processes such as mission maps and static zones. The load balancing settings in these files should reflect the provisioning of the launcher host machine and its environment, e.g., network capacity, CPU cores, available virtual memory, and disk speed.

The primary goal of load balancing is to ensure survivability of the service and host stability. These settings, in conjunction with servers.cfg settings, can be used to set up occupancy limits on host machine resources. When those limits are exceeded, the service will stop using individual hosts and even enter a system-wide overload protection mode. The secondary goal is to provide adequate Quality of Service (QoS) to the players by distributing load so that the service remains sufficiently responsive.

A live environment consists of a pool of host machines that are shared by all the dbservers. Load balancing is more challenging in this scenario because an individual dbserver is unaware of the impact of the other dbservers' actions until it receives new status updates from machines in the host pool. The quality of the status snapshot last received for a particular host degrades as the number of dbservers sharing that host increases, and also during periods of high map launch rates.

Thus, it is best for an individual dbserver working in this configuration to treat its knowledge of host status as imperfect and only an approximation. Load balancing strategies which employ some randomization can be used to improve balancing in this situation.

Load balancing modes

A load balancing "mode" is set by selecting a balancing strategy and an associated heuristic for the strategy to use if applicable.

It is useful to customize the load balancing strategy according to the role a new server process will play. Mission maps and bases are inherently transient and usually service a small number of players. Static map zone instances, on the other hand, can persist indefinitely and can grow to service a large number of players.

Supported strategies are:

  • Sequential - balance by round robin assignment to the set of available launchers
  • Random - balance by randomly choosing amongst available launchers
  • RandomChoice <heuristic> - balance by randomly choosing a set of launchers and selecting the one with the lowest heuristic value
  • Search <heuristic> - balance by walking the full set of launchers and selecting the one with the lowest heuristic value

Supported heuristics are:

  • Utilization - a measure of host machine resource utilization (e.g., CPU)
  • TotalOccupancy - total hosted server count
  • TypeOccupancy - total hosted server count of a given type (e.g., static or mission)
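
The strategies and heuristics listed above can be sketched roughly as follows. This is only an illustration, not the dbserver's actual code: the Launcher record, the pick_launcher function, and the sample_size parameter (how many launchers RandomChoice draws) are hypothetical names, and the size of the random set is not stated on this page.

```python
import random
from dataclasses import dataclass
from itertools import count


@dataclass
class Launcher:
    """Hypothetical snapshot of one launcher host as seen by a dbserver."""
    name: str
    utilization: float  # host utilization estimate, in percent
    static_maps: int    # hosted static/zone map count
    mission_maps: int   # hosted mission/base map count


# Each heuristic returns a score for a launcher; lower is better.
HEURISTICS = {
    "Utilization": lambda l, kind: l.utilization,
    "TotalOccupancy": lambda l, kind: l.static_maps + l.mission_maps,
    "TypeOccupancy": lambda l, kind: (
        l.static_maps if kind == "static" else l.mission_maps
    ),
}

_round_robin = count()  # shared cursor for the Sequential strategy


def pick_launcher(launchers, strategy, heuristic=None, kind="mission",
                  sample_size=2, rng=random):
    """Choose a launcher for a new map of the given kind.

    sample_size is an assumption made for this sketch only.
    """
    if not launchers:
        raise ValueError("no launchers available")
    if strategy == "Sequential":
        # Round robin over the available launchers.
        return launchers[next(_round_robin) % len(launchers)]
    if strategy == "Random":
        return rng.choice(launchers)
    score = HEURISTICS[heuristic]
    if strategy == "RandomChoice":
        # Draw a random subset, then take the best-scoring member.
        candidates = rng.sample(launchers, min(sample_size, len(launchers)))
    elif strategy == "Search":
        # Walk every launcher and take the best-scoring one.
        candidates = launchers
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return min(candidates, key=lambda l: score(l, kind))
```

Search always picks the launcher that looks best in the last snapshot, so several dbservers sharing a pool tend to pick the same host at once. RandomChoice trades some accuracy for spread, which suits the imperfect host knowledge described earlier.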

Directives

The directives in both files are the same. The default file is mandatory; the shard-specific file is optional. Any directive given in the shard-specific file overrides the same directive in the default file. Server roles from the default file are appended to those in the shard-specific file.
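
The override and append behaviour can be pictured with a small sketch. It assumes each file has already been parsed into a dictionary keyed by directive name, with ServerRole held as a list. The on-disk syntax is not covered here, and merge_load_balance is a hypothetical helper, not a dbserver function.

```python
def merge_load_balance(default_cfg, shard_cfg=None):
    """Combine parsed loadBalanceDefault and loadBalanceShardSpecific settings."""
    shard_cfg = shard_cfg or {}  # the shard-specific file is optional
    merged = dict(default_cfg)
    merged.update(shard_cfg)     # shard-specific directives win
    # Server roles are not overridden: the default roles are appended
    # after the shard-specific ones.
    merged["ServerRole"] = (
        list(shard_cfg.get("ServerRole", []))
        + list(default_cfg.get("ServerRole", []))
    )
    return merged
```

For example, if the default file sets MaxMapservers to 40 and the shard-specific file sets it to 25, the merged value is 25, while the role list holds the shard-specific roles followed by the default ones.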

BalanceModeZone <strategy> [<heuristic>]

The load balancing mode for static maps.

BalanceModeMission <strategy> [<heuristic>]

The load balancing mode for mission and base maps.

MaxMapservers <n>

Maximum number of maps we allow to start on a given machine.

Once this limit is reached the host will be suspended and no more launches will be permitted. The limit applies to the combined static and mission maps counts, including maps that are starting up. A value of zero disables this check.

MinAvailVirtualMemory <mb>

If the amount of virtual memory available to commit (in MB) on the host drops below this value the host will be suspended from launching. For system stability a generous amount of virtual memory should be available at all times.

A value of zero disables this check.

MaxHostUtilization <n>

Maximum host utilization estimate allowed on a host before it is suspended from launching any more server processes.

Individual launcher host utilization is calculated from host performance metrics and is influenced by the settings that follow. In general, the goal is to have a host load of 100 represent a host that is humming along at capacity doing 100% useful work. However, a host can be overloaded to the point where it spends resources paging and servicing too many processes. In this case the host utilization values will climb above 100. Host utilization should generally be in the 0-200% range.

A value of zero disables this form of capacity suspension.
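
Taken together, MaxMapservers, MinAvailVirtualMemory and MaxHostUtilization act as three independent suspension checks, each disabled by a value of zero. A minimal sketch follows, assuming a hypothetical HostStatus record. Whether the utilization check is strict (greater than) or inclusive is not stated above and is a guess here.

```python
from dataclasses import dataclass


@dataclass
class HostStatus:
    """Hypothetical per-host snapshot used only for this sketch."""
    running_maps: int      # static + mission maps already running
    starting_maps: int     # maps still starting up
    avail_virtual_mb: int  # virtual memory available to commit, in MB
    utilization: float     # host utilization estimate, in percent


def suspension_reasons(status, max_mapservers=0, min_avail_virtual_mb=0,
                       max_host_utilization=0):
    """Return the directives that would suspend launches on this host."""
    reasons = []
    # Starting maps count toward the limit; reaching it suspends the host.
    if max_mapservers and (
        status.running_maps + status.starting_maps >= max_mapservers
    ):
        reasons.append("MaxMapservers")
    if min_avail_virtual_mb and status.avail_virtual_mb < min_avail_virtual_mb:
        reasons.append("MinAvailVirtualMemory")
    # Assumption: the host is suspended once utilization exceeds the maximum.
    if max_host_utilization and status.utilization > max_host_utilization:
        reasons.append("MaxHostUtilization")
    return reasons
```

An empty result means the host is still eligible for launches. With all three limits left at zero, no host is ever suspended by these checks.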

PagingLoadLow <n>

The lower bound of a range of hard page faults per second that is used to map the current paging rate to a percentage which is then added to host utilization.

This updates utilization when the system is busy paging instead of doing real work. Hard faults will generally occur as new maps load data and will increase significantly once physical memory is exhausted and there are active processes that need to have pages swapped into their working sets to operate.

PagingLoadHigh <n>

The upper bound of the hard page faults per second.
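
One plausible reading of the PagingLoadLow and PagingLoadHigh range is a clamped linear mapping, sketched below. The linear shape and the 0 to 100 output scale are assumptions; the page only says that the paging rate is mapped to a percentage using these two bounds.

```python
def paging_load_percent(hard_faults_per_sec, paging_load_low, paging_load_high):
    """Map a hard page fault rate to a 0-100 percentage (assumed linear)."""
    if paging_load_high <= paging_load_low:
        raise ValueError("PagingLoadHigh must be greater than PagingLoadLow")
    span = paging_load_high - paging_load_low
    fraction = (hard_faults_per_sec - paging_load_low) / span
    # Below the low bound contributes nothing; above the high bound saturates.
    return 100.0 * min(1.0, max(0.0, fraction))
```

With bounds of 100 and 1100 faults per second, a host seeing 600 hard faults per second would add 50 points to its utilization under this reading.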

MinAvailPhysicalMemory <n>

If the amount of available physical memory (in MB) on the host drops below this threshold, the bias set by MinAvailPhysicalMemoryBias is applied to the host utilization calculation. For example, a bias of 0.1 implies a 10% increase in host utilization.

MinAvailPhysicalMemoryBias <decimal>

The applied physical memory utilization bias.
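
As a rough illustration of the two physical memory directives, the sketch below scales utilization by (1 + bias) when available memory is under the threshold. Treating the "10% increase" as multiplicative is an assumption; it could equally mean adding 10 percentage points.

```python
def apply_physical_memory_bias(utilization, avail_physical_mb,
                               min_avail_physical_mb, bias):
    """Raise the utilization estimate when physical memory runs low."""
    if avail_physical_mb < min_avail_physical_mb:
        # Assumed multiplicative: a bias of 0.1 turns 80 into 88.
        return utilization * (1.0 + bias)
    return utilization
```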

StartingStaticMemBias <mb>

Amount of memory (in MB) we assume a static/zone map will take once it finishes starting. Good default is 600.

StartingStaticCPUBias <decimal>

Amount of CPU (1 = 100%) we assume a static/zone map will take once it finishes starting. Good default is 0.3.

StartingMissionMemBias <mb>

Amount of memory (in MB) we assume a mission map will take once it finishes starting. Good default is 150.

StartingMissionCPUBias <decimal>

Amount of CPU (1 = 100%) we assume a mission map will take once it finishes starting. Good default is 0.01.
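
The four Starting* biases let the dbserver account for maps that have been launched but do not yet show up in the host metrics. A minimal sketch of that bookkeeping, using the suggested defaults, is below. The function name and the exact way the estimates are folded into the host figures are assumptions.

```python
def projected_host_load(avail_mem_mb, cpu_used, starting_static,
                        starting_mission, static_mem_bias=600,
                        static_cpu_bias=0.3, mission_mem_bias=150,
                        mission_cpu_bias=0.01):
    """Estimate memory left and CPU used once starting maps finish loading.

    cpu_used is in CPU units, where 1 = one CPU at 100%.
    """
    mem_left = (avail_mem_mb
                - starting_static * static_mem_bias
                - starting_mission * mission_mem_bias)
    cpu_after = (cpu_used
                 + starting_static * static_cpu_bias
                 + starting_mission * mission_cpu_bias)
    return mem_left, cpu_after
```

For a host with 4000 MB available and 1.0 CPU in use, two starting static maps and four starting mission maps would leave a projected 2200 MB free and about 1.64 CPU in use.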

SecondaryRoleBias <decimal>

Amount of CPU (1 = 100%) to bias a server by when considering it for its secondary role. If a secondary machine has this many more CPUs available (1.00 = 1 CPU at 100% usage) than the primary, the secondary machine will be used instead.
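
The comparison this implies might look like the sketch below. The function and its arguments are hypothetical, and whether the margin must be met or strictly exceeded is a guess.

```python
def prefer_secondary(primary_cpu_avail, secondary_cpu_avail, secondary_role_bias):
    """Return True if the secondary-role host should be used instead.

    CPU values are in CPU units, where 1.00 = one CPU at 100%.
    """
    # Assumption: meeting the margin exactly is enough to switch.
    return secondary_cpu_avail - primary_cpu_avail >= secondary_role_bias
```

With a bias of 1.0, a secondary host with 2.5 CPUs free is chosen over a primary with 1.0 free, but one with only 1.5 free is not.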

StaticCPUBias <decimal>

How much CPU we add on for each static mapserver understanding that they will probably need it. This doesn't seem to be needed any more and can be 0.

TroubleSuspensionTime <n>

Amount of time (in seconds) a launcher is suspended if it has a large number of consecutive crashes and/or delinks. Launchers are also suspended if they stop responding, which gets cleared upon a dbserver restart. Good default is 1800.

ServerRole

TODO