Define reschedule behaviors for a job
Tasks can sometimes fail due to network, CPU, or memory issues on the node
running the task. In such situations, Nomad can reschedule the task on another
node. The reschedule stanza can be used to configure how Nomad should try
placing failed tasks on another node in the cluster. Reschedule attempts have a
delay between each attempt, and the delay can be configured to increase between
each rescheduling attempt according to a configurable delay_function. Consult
the reschedule stanza documentation for more information.
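As a rough illustration of the options described above, the fragment below sketches a reschedule stanza with a bounded number of attempts and an increasing delay. The group name demo is borrowed from the CLI example in this guide, and every value shown is an assumption chosen for illustration, not a recommendation.

```hcl
# Illustrative fragment only: values are assumptions, task definitions omitted.
group "demo" {
  reschedule {
    attempts       = 5             # allow at most 5 reschedules...
    interval       = "1h"          # ...within any one-hour window
    delay          = "5s"          # wait before the first reschedule attempt
    delay_function = "exponential" # double the delay after each attempt
    max_delay      = "2m"          # upper bound on the computed delay
    unlimited      = false         # enforce the attempts/interval limit
  }
}
```

With a limit like this in place, the allocation status output can report progress against it, for example as a count of attempts used out of the total allowed.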
Service jobs are configured by default to have unlimited reschedule attempts. You should use the reschedule stanza to ensure that failed tasks are automatically reattempted on another node without needing operator intervention.
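For reference, the default behavior for service jobs is believed to be roughly equivalent to the stanza sketched below. The exact default values are an assumption based on the reschedule stanza documentation and may differ between Nomad versions, so check them against the version you run.

```hcl
# Assumed service-job defaults; verify against your Nomad version.
reschedule {
  delay          = "30s"         # initial delay before rescheduling
  delay_function = "exponential" # delay doubles after each attempt
  max_delay      = "1h"          # delay never grows beyond one hour
  unlimited      = true          # no cap on the number of attempts
}
```

Setting the stanza explicitly, even to values that match the defaults, makes the intended behavior visible in the job file.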
The following CLI example shows job and allocation statuses for a task being rescheduled by Nomad. The CLI shows the number of previous attempts if there is a limit on the number of reschedule attempts. The CLI also shows when the next reschedule will be attempted.
$ nomad job status demo
ID            = demo
Name          = demo
Submit Date   = 2018-04-12T15:48:37-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
demo        0       0         0        2       0         0

Future Rescheduling Attempts
Task Group  Eval ID   Eval Time
demo        ee3de93f  5s from now

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created  Modified
39d7823d  f2c2eaa6  demo        0        run      failed  5s ago   5s ago
fafb011b  f2c2eaa6  demo        0        run      failed  11s ago  10s ago
$ nomad alloc status 3d0b
ID                     = 3d0bbdb1
Eval ID                = 79b846a9
Name                   = demo.demo[0]
Node ID                = 8a184f31
Job ID                 = demo
Job Version            = 0
Client Status          = failed
Client Description     = <none>
Desired Status         = run
Desired Description    = <none>
Created                = 15s ago
Modified               = 15s ago
Reschedule Attempts    = 3/5
Reschedule Eligibility = 25s from now

Task "demo" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  300 MiB  300 MiB  p1: 127.0.0.1:27646

Task Events:
Started At     = 2018-04-12T20:44:25Z
Finished At    = 2018-04-12T20:44:25Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type            Description
2018-04-12T15:44:25-05:00  Not Restarting  Policy allows no restarts
2018-04-12T15:44:25-05:00  Terminated      Exit Code: 127
2018-04-12T15:44:25-05:00  Started         Task started by client
2018-04-12T15:44:25-05:00  Task Setup      Building Task Directory
2018-04-12T15:44:25-05:00  Received        Task received by client