Name

distreboot — distributed cluster reboot

Synopsis

distreboot [ --start [--node=directory] | --stop | --reboot | --dry-run | --cancel | --drop=nodename ]

Description

distreboot gets started on all nodes in an object repository cluster, and coordinates an orderly reboot of the nodes, in the least disruptive manner, so that the object repository cluster remains in quorum throughout the entire reboot process. The distreboot command requires root privileges.

System startup

[--start] starts the distreboot daemon. This should be done as part of the system startup, right after the stasher daemon server starts.

Note

Run distreboot --start in the background. [--start] does not fork off a daemon process.

Most servers do not need the --node option: they are part of only one object repository cluster, and that cluster is configured in stasher's default object repository cluster directory.

Use --node to specify an explicit object repository cluster directory on a server that has more than one, or if it's not in the default stasher object repository node directory.

Note

Running distreboot on a server that's a part of multiple object repository clusters is not recommended. Only one instance of distreboot can run on a server, and distreboot can only make sure that the object repository cluster it's running on will not have to transfer the cluster's master node more than once during a distributed reboot.

System shutdown

distreboot --stop stops the currently running distreboot daemon. This should be done as part of the system shutdown script, just before stopping stasher.
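
As a rough illustration of the startup and shutdown steps described above, here is a minimal sketch of an rc-style shell helper. The function name distreboot_rc and the NODE_DIR variable are hypothetical, not part of distreboot. The sketch assumes that the surrounding startup script has already started stasher, and that the shutdown script stops stasher afterwards.

```shell
# Hypothetical rc-style helper; distreboot_rc and NODE_DIR are illustrative names.
distreboot_rc() {
    case "$1" in
        start)
            # Call this right after stasher has started. --start does not
            # fork off a daemon, so the shell runs it in the background.
            if [ -n "${NODE_DIR:-}" ]; then
                # Explicit object repository cluster directory.
                distreboot --start --node="$NODE_DIR" &
            else
                # Default stasher object repository cluster directory.
                distreboot --start &
            fi
            ;;
        stop)
            # Call this just before stopping stasher.
            distreboot --stop
            ;;
        *)
            echo "Usage: distreboot_rc {start|stop}" >&2
            return 1
            ;;
    esac
}
```

A startup script would call distreboot_rc start, and a shutdown script would call distreboot_rc stop. Leave NODE_DIR unset unless the server needs the --node option.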

Current status

Running distreboot with no options shows the daemon's current status: whether it's connected to its object repository node, the status of the node, and the status of other distreboot daemons on other nodes in the object repository cluster. Each instance of distreboot periodically posts a heartbeat, or a keepalive message. With no option, distreboot shows a list of all the nodes, and each node's heartbeat expiration.

If a node's timestamp is in the past, this indicates that distreboot is no longer running on that node. This node is down, unreachable, or is in the process of getting rebooted (possibly triggered by something else, other than distreboot). Timestamps more than a day old get dropped from the list.
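
The rule above can be sketched as a small shell function. This only illustrates the logic and is not how distreboot itself implements it. It assumes, purely for the example, that heartbeat expirations and the current time are given as seconds since the epoch; the function name heartbeat_state is hypothetical.

```shell
# heartbeat_state EXPIRATION NOW
# Both arguments are seconds since the epoch (an assumption of this sketch).
heartbeat_state() {
    expiration=$1
    now=$2
    day=86400   # seconds in one day

    if [ "$expiration" -ge "$now" ]; then
        # The expiration is not in the past: distreboot is running on the node.
        echo current
    elif [ $((now - expiration)) -le "$day" ]; then
        # In the past, but no more than a day old: still listed, and stale.
        echo stale
    else
        # More than a day old: dropped from the list.
        echo dropped
    fi
}
```

For example, heartbeat_state 1000 1500 reports a stale heartbeat, because the expiration is 500 seconds in the past.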

Forcibly ejecting a node from the distreboot status

In the event a node gets removed from the cluster, or gets shut down for a prolonged period of time, and its distreboot daemon stops running, its heartbeat status goes stale, but won't get removed from the distreboot distributed heartbeat status listing for at least a day.

A stale heartbeat prevents a distributed reboot from getting triggered. If a distributed reboot is needed before a stale node gets purged from distreboot's status list, use [--drop] to remove the stale node from the list manually.

Rebooting, for real or just kidding

The [--dry-run] option shows in what order the nodes in the object repository cluster will get rebooted, if a reboot gets triggered now. [--reboot] starts the reboot process.

[--reboot] will trigger a reboot only if the following conditions are met:

  • None of the distreboot heartbeat timestamps are stale, indicating that all instances of distreboot on all nodes are running.

  • The stasher object repository is in quorum.
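
A cautious operator might want to preview the order before triggering the real thing. The following is a minimal sketch of such a wrapper. The function name confirm_and_reboot is hypothetical, and the sketch assumes that distreboot exits with a nonzero status when [--dry-run] fails, which this page does not state.

```shell
# Hypothetical helper: preview the reboot order, then ask before triggering it.
confirm_and_reboot() {
    # Show the order in which the nodes would get rebooted.
    # Assumption: a nonzero exit status means the dry run failed.
    distreboot --dry-run || return 1

    printf 'Proceed with the distributed reboot? [y/N] '
    read -r answer
    case "$answer" in
        y|Y)
            distreboot --reboot
            ;;
        *)
            echo "Not rebooting."
            return 1
            ;;
    esac
}
```

Run confirm_and_reboot as root on one of the nodes. Anything other than "y" leaves the cluster alone.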

[--reboot] calculates the order in which the nodes get rebooted so that the current object repository cluster master node gets rebooted last, thus making sure that the object repository cluster master gets transitioned only once, as part of the distributed reboot cycle.
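
The master-last property can be illustrated with a short shell function. This is only a sketch: the function name reboot_order is hypothetical, and keeping the remaining nodes in the order given is an assumption of the example, since this page only states that the master node goes last.

```shell
# reboot_order MASTER NODE...
# Prints one node name per line, with the master node last.
# Assumption: MASTER also appears somewhere in the NODE list.
reboot_order() {
    master=$1
    shift
    for node in "$@"; do
        # Print every node except the master, in the order given.
        [ "$node" = "$master" ] || echo "$node"
    done
    # The master node gets rebooted last.
    echo "$master"
}
```

For example, reboot_order b a b c prints a, c and b, one per line.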

The distributed reboot starts with the reboot getting initiated on the first node in the list. When the node finishes rebooting, comes back up, rejoins the cluster, resynchronizes with the cluster, and its instance of the distreboot daemon comes back up (since distreboot should be a part of each server's startup script, after all), a reboot gets triggered on the next node in the list, and so on, ending with the master node.

distreboot relies on stasher's shutdown script to take care of transitioning the master node. stasher's default shutdown script checks if the server is the current master node; if so, it tries to execute an orderly transition of the master node to another node in the cluster.

Cancelling a distributed reboot in progress

If something goes wrong and a node fails to reboot, use [--cancel] to cancel the reboot list. Otherwise, when the ailing server gets patched up and finally boots back up, you'll be surprised when another object repository node, which was originally next in the reboot list, starts rebooting itself.

Hopefully, the ailing server comes back up within 24 hours. Otherwise its heartbeat status disappears from the list, and the distributed reboot resumes right where it left off, 24 hours earlier!

Properties

rebootcmd

The command that distreboot runs in order to reboot the server. This is reboot by default.

Use --set-property rebootcmd=command with [--start] to execute a custom command, or set it in distreboot.properties (installed in /usr/local/etc or /etc).

distreboot forks and execs the given command, as is, in order to trigger the server's reboot.
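
As a minimal sketch of the command-line form, the following shell function starts the daemon with a custom reboot command. The function name start_with_rebootcmd is hypothetical, and so is any command passed to it. Because this page does not say how a command containing spaces gets handled, the sketch passes a single path with no arguments.

```shell
# Hypothetical wrapper: start distreboot with a custom reboot command.
start_with_rebootcmd() {
    # $1: command that distreboot runs instead of the default "reboot".
    # --start does not fork off a daemon, so run it in the background.
    distreboot --start --set-property rebootcmd="$1" &
}
```

A startup script could call, for example, start_with_rebootcmd /usr/local/sbin/site-reboot, where site-reboot stands for a made-up site-specific script.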