StalledJobAgent

The StalledJobAgent hunts for stalled jobs in the Job database. Jobs in the “Running” state that have not sent a heartbeat signal for more than stalledTime seconds are assigned the “Stalled” state.
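The rule above can be illustrated with a minimal, standalone sketch. It is not the agent's actual code: the function name classify_running_job and its parameters are hypothetical, and only the comparison of heartbeat silence against a stalledTime threshold is taken from the description.

```python
from datetime import datetime, timedelta


def classify_running_job(last_heartbeat, now, stalled_time_seconds):
    """Return "Stalled" if the job has been silent longer than the threshold.

    Hypothetical helper: the name and signature are illustrative only.
    """
    silence = (now - last_heartbeat).total_seconds()
    return "Stalled" if silence > stalled_time_seconds else "Running"


# Fixed reference time so the example is deterministic.
now = datetime(2024, 1, 1, 12, 0, 0)
threshold = 2 * 3600  # two hours expressed in seconds

fresh = classify_running_job(now - timedelta(minutes=30), now, threshold)
stale = classify_running_job(now - timedelta(hours=3), now, threshold)
print(fresh, stale)
```

A job whose last heartbeat is 30 minutes old stays Running, while one silent for three hours exceeds the two-hour threshold and is classified as Stalled.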

StalledJobAgent options
StalledJobAgent
{
  StalledTimeHours = 2
  FailedTimeHours = 6
  PollingTime = 3600
  MaxNumberOfThreads = 15
  # List of sites for which we want to be more tolerant before declaring the job stalled
  StalledJobsTolerantSites =
  StalledJobsToleranceTime = 0
  # List of sites for which we want to Reschedule (instead of declaring Failed) the Stalled jobs
  StalledJobsToRescheduleSites =
  SubmittingTime = 300
  MatchedTime = 7200
  RescheduledTime = 600
  Enable = True
}
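As a rough illustration of how these options might combine, the sketch below computes a per-site stalled threshold. Several points are assumptions rather than facts from this page: that StalledTimeHours is converted to seconds, that StalledJobsToleranceTime is expressed in seconds and added to the threshold for sites listed in StalledJobsTolerantSites, and the site names and the non-default tolerance value used here, which are hypothetical.

```python
SECONDS_PER_HOUR = 3600

# Option values mirror the defaults shown above, except for the tolerant-site
# list and tolerance time, which are hypothetical values chosen for the example.
options = {
    "StalledTimeHours": 2,
    "FailedTimeHours": 6,
    "StalledJobsTolerantSites": ["LCG.Example.org"],  # hypothetical site name
    "StalledJobsToleranceTime": 1800,  # assumed to be in seconds
}


def stalled_threshold_seconds(opts, site):
    """Seconds of silence allowed at a site before a job counts as stalled.

    Assumption: tolerant sites get the tolerance time added on top of the
    base threshold. This helper is illustrative, not part of the agent.
    """
    threshold = opts["StalledTimeHours"] * SECONDS_PER_HOUR
    if site in opts["StalledJobsTolerantSites"]:
        threshold += opts["StalledJobsToleranceTime"]
    return threshold


tolerant = stalled_threshold_seconds(options, "LCG.Example.org")
regular = stalled_threshold_seconds(options, "LCG.Other.org")
print(tolerant, regular)
```

With the default StalledJobsToleranceTime = 0, both kinds of site would share the same threshold under this reading.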
class DIRAC.WorkloadManagementSystem.Agent.StalledJobAgent.StalledJobAgent(*args, **kwargs)

Bases: DIRAC.Core.Base.AgentModule.AgentModule

Agent for setting Running jobs Stalled and Stalled jobs Failed, and a few more.

__init__(*args, **kwargs)

Constructor.

am_Enabled()
am_checkStopAgentFile()
am_createStopAgentFile()
am_disableMonitoring()
am_getBasePath()
am_getControlDirectory()
am_getCyclesDone()
am_getMaxCycles()
am_getModuleParam(optionName)
am_getOption(optionName, defaultValue=None)

Gets an option from the agent’s configuration section. The section will be a subsection of the /Systems section in the CS.

am_getPollingTime()
am_getShifterProxyLocation()
am_getStopAgentFile()
am_getWatchdogTime()
am_getWorkDirectory()
am_go()
am_initialize(*initArgs)

Common initialization for all the agents.

This is executed every time an agent (re)starts. It is called by the AgentReactor and should not be overridden.

am_monitoringEnabled()
am_removeStopAgentFile()
am_secureCall(functor, args=(), name=False)
am_setModuleParam(optionName, value)
am_setOption(optionName, value)
am_stopExecution()
beginExecution()
endExecution()
execute()

The main agent execution method

finalize()
initialize()

Sets default parameters
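
To show how initialize() could set default parameters through am_getOption(optionName, defaultValue=None), here is a minimal sketch that runs without DIRAC installed. StandInAgentModule is a hypothetical dictionary-backed substitute for AgentModule, not DIRAC's implementation. The attribute names and the hours-to-seconds conversion are likewise assumptions; only the method signatures and option names come from this page.

```python
class StandInAgentModule:
    """Hypothetical stand-in for AgentModule, backed by a plain dict."""

    def __init__(self, section=None):
        # In DIRAC the options come from the Configuration Service; here a
        # dict plays that role so the example is self-contained.
        self._section = dict(section or {})

    def am_getOption(self, optionName, defaultValue=None):
        # Return the configured value, or the default if the option is unset.
        return self._section.get(optionName, defaultValue)


class SketchStalledJobAgent(StandInAgentModule):
    """Illustrative agent; attribute names are assumptions."""

    def initialize(self):
        # Read each option, falling back to the defaults listed in the
        # options block above. The conversion to seconds is an assumption.
        self.stalledTime = self.am_getOption("StalledTimeHours", 2) * 3600
        self.failedTime = self.am_getOption("FailedTimeHours", 6) * 3600
        self.matchedTime = self.am_getOption("MatchedTime", 7200)
        self.enabled = self.am_getOption("Enable", True)
        return True


# Only StalledTimeHours is overridden; everything else uses its default.
agent = SketchStalledJobAgent({"StalledTimeHours": 4})
agent.initialize()
print(agent.stalledTime, agent.failedTime, agent.matchedTime, agent.enabled)
```

Passing the default as the second argument keeps the agent usable when an option is missing from its configuration section.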