Systems / WorkloadManagement / <INSTANCE> / Agents / StalledJobAgent - Sub-subsection

The StalledJobAgent hunts for stalled jobs in the Job database. Jobs in the “Running” state that have not sent a heartbeat signal for more than stalledTime seconds are assigned the “Stalled” state.

Despite their names, FailedTimeHours and StalledTimeHours are given as a number of cycles, not hours. One cycle is 30 minutes by default; its length can be changed in the Systems/WorkloadManagement/<Instance>/JobWrapper section with the CheckingTime and MinCheckingTime options.
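As a rough illustration of the cycle arithmetic, the sketch below converts a cycle count into wall-clock time. It assumes the 30-minute cycle stated above; the helper name `cycles_to_timedelta` is hypothetical and not part of DIRAC, and the real agent may apply additional margins on top of this simple product.

```python
from datetime import timedelta

# Assumed cycle length: 30 minutes, as stated in the text above.
CYCLE_SECONDS = 30 * 60


def cycles_to_timedelta(cycles, cycle_seconds=CYCLE_SECONDS):
    """Convert a number of checking cycles to wall-clock time (hypothetical helper)."""
    return timedelta(seconds=cycles * cycle_seconds)


print(cycles_to_timedelta(6))  # FailedTimeHours = 6  -> 3:00:00
print(cycles_to_timedelta(2))  # StalledTimeHours = 2 -> 1:00:00
```

Under this assumption, the example values FailedTimeHours = 6 and StalledTimeHours = 2 correspond to three hours and one hour respectively.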

Name

Description

Example

FailedTimeHours

Time that must pass before a stalled job is declared failed. Note: despite the name, the value is given in cycles, not hours.

FailedTimeHours = 6

StalledTimeHours

Time that must pass before a running job is declared stalled. Note: despite the name, the value is given in cycles, not hours.

StalledTimeHours = 2

MatchedTime

Age in seconds until matched jobs are rescheduled

MatchedTime = 7200

RescheduledTime

Age in seconds until jobs in the “Rescheduled” state are rescheduled again

RescheduledTime = 600

CompletedTime

Age in seconds until completed jobs are declared failed, unless their minor status is “Pending Requests”

CompletedTime = 86400

StalledJobsTolerantSites

List of sites for which the StalledJobAgent will increase the tolerance for stalled jobs

StalledJobsTolerantSites = siteA.cern.ch, siteB.cern.ch

StalledJobsToleranceTime

Time in seconds to be added to the StalledTimeHours in order to increase the time tolerance for stalled jobs.

StalledJobsToleranceTime = 3000
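The interaction between StalledTimeHours, StalledJobsTolerantSites and StalledJobsToleranceTime can be sketched as follows. This is a minimal illustration, not the agent's actual code: it assumes a 30-minute cycle, assumes the tolerance is simply added to the stalled time once that time is expressed in seconds, and uses a hypothetical function name and a hypothetical non-tolerant site (`siteC.example.org`).

```python
# Assumed cycle length: 30 minutes (the default described earlier in this section).
CYCLE_SECONDS = 30 * 60


def stalled_threshold_seconds(
    site,
    stalled_cycles=2,                                    # StalledTimeHours
    tolerant_sites=("siteA.cern.ch", "siteB.cern.ch"),   # StalledJobsTolerantSites
    tolerance_seconds=3000,                              # StalledJobsToleranceTime
    cycle_seconds=CYCLE_SECONDS,
):
    """Return the assumed time without a heartbeat before a job counts as stalled."""
    threshold = stalled_cycles * cycle_seconds
    # Tolerant sites get extra time before their jobs are considered stalled.
    if site in tolerant_sites:
        threshold += tolerance_seconds
    return threshold


print(stalled_threshold_seconds("siteA.cern.ch"))      # 6600
print(stalled_threshold_seconds("siteC.example.org"))  # 3600
```

With the example values, a job at a tolerant site would be given 6600 seconds instead of 3600 seconds under these assumptions.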