Restart a workload based on health checks
The check_restart stanza instructs Nomad when to restart tasks with unhealthy service checks. When a health check in Consul has been unhealthy for the limit specified in a check_restart stanza, the task is restarted according to the task group's restart policy. These restarts are local to the node running the task.
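As a rough sketch, a task group restart policy consistent with the CLI output at the end of this section might look like the following. The attempts, interval, and mode values are taken from that output. The delay value is an assumption (Nomad's default for service jobs), and the group name is inferred from the allocation name demo.demo[0].

```hcl
group "demo" {
  # Restart policy that check_restart-triggered restarts count against.
  restart {
    attempts = 3      # from "Exceeded allowed attempts 3" in the output
    interval = "30m"  # from "in interval 30m0s"
    delay    = "15s"  # assumption: Nomad's default delay for service jobs
    mode     = "fail" # from 'mode is "fail"': stop restarting once attempts are used up
  }

  # ... task definitions go here ...
}
```

With mode set to "fail", the allocation is marked failed once the attempts are exhausted within the interval, rather than waiting and retrying on the same node.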
The limit field specifies the number of consecutive failing health checks seen before a local restart is attempted. Operators can also specify a grace duration to wait after a task restarts before checking its health.
You should configure check_restart on services when it's likely that a restart would resolve the failure. For example, a restart might correct a transient connection issue in the service.
The following check_restart stanza waits for two consecutive health check failures, allows a 10-second grace period, and considers both critical and warning statuses as failures.
check_restart {
  limit           = 2
  grace           = "10s"
  ignore_warnings = false
}
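For context, a check_restart stanza is placed inside a service check. The fragment below is a minimal sketch of where it might sit in a task. The service name and port label are taken from the CLI output in this section, while the check type, path, interval, and timeout are hypothetical values chosen for illustration.

```hcl
task "test" {
  # ... driver and config omitted ...

  service {
    name = "demo-service-test" # from the check name in the CLI output
    port = "p1"                # port label shown under Addresses

    check {
      type     = "http"    # hypothetical: any supported check type works
      path     = "/health" # hypothetical health endpoint
      interval = "10s"     # hypothetical
      timeout  = "2s"      # hypothetical

      # Restart the task after two consecutive failures,
      # counting warning statuses as failures.
      check_restart {
        limit           = 2
        grace           = "10s"
        ignore_warnings = false
      }
    }
  }
}
```

Scoping check_restart to a single check keeps other checks on the same service from triggering restarts.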
The following CLI example output shows health check failures triggering restarts until the task's restart limit is reached.
$ nomad alloc status e1b43128-2a0a-6aa3-c375-c7e8a7c48690
ID                  = e1b43128
Eval ID             = 249cbfe9
Name                = demo.demo[0]
Node ID             = 221e998e
Job ID              = demo
Job Version         = 0
Client Status       = failed
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 2m59s ago
Modified            = 39s ago

Task "test" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  300 MiB  300 MiB  p1: 127.0.0.1:28422

Task Events:
Started At     = 2018-04-12T22:50:32Z
Finished At    = 2018-04-12T22:50:54Z
Total Restarts = 3
Last Restart   = 2018-04-12T17:50:15-05:00

Recent Events:
Time                       Type              Description
2018-04-12T17:50:54-05:00  Not Restarting    Exceeded allowed attempts 3 in interval 30m0s and mode is "fail"
2018-04-12T17:50:54-05:00  Killed            Task successfully killed
2018-04-12T17:50:54-05:00  Killing           Sent interrupt. Waiting 5s before force killing
2018-04-12T17:50:54-05:00  Restart Signaled  health check: check "service: \"demo-service-test\" check" unhealthy
2018-04-12T17:50:32-05:00  Started           Task started by client
2018-04-12T17:50:15-05:00  Restarting        Task restarting in 16.887291122s
2018-04-12T17:50:15-05:00  Killed            Task successfully killed
2018-04-12T17:50:15-05:00  Killing           Sent interrupt. Waiting 5s before force killing
2018-04-12T17:50:15-05:00  Restart Signaled  health check: check "service: \"demo-service-test\" check" unhealthy
2018-04-12T17:49:53-05:00  Started           Task started by client