Watchdog

The Watchdog class is used by the Job Wrapper to resolve and monitor system resource consumption. The Watchdog can determine whether a running job is stalled and indicate this to the Job Wrapper. It also identifies when the job CPU limit has been exceeded and fails such jobs in a meaningful way.
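The two checks described above can be illustrated with a minimal sketch. This is not the DIRAC implementation: the function names, the `min_progress_s` threshold, and the idea of judging a stall from sampled CPU time are assumptions made purely for illustration.

```python
from typing import Sequence


def cpu_limit_exceeded(consumed_cpu_s: float, job_cpu_limit_s: float) -> bool:
    """Return True when the consumed CPU time is beyond the job limit.

    Both values are assumed to be in the same (un-normalized) seconds.
    """
    return consumed_cpu_s > job_cpu_limit_s


def looks_stalled(cpu_samples_s: Sequence[float], min_progress_s: float = 1.0) -> bool:
    """Return True when CPU time barely advanced across the sampled window.

    cpu_samples_s is a hypothetical series of cumulative CPU-time readings,
    oldest first. min_progress_s is an illustrative threshold, not a DIRAC default.
    """
    if len(cpu_samples_s) < 2:
        return False  # not enough data to judge
    return (cpu_samples_s[-1] - cpu_samples_s[0]) < min_progress_s
```

A caller would sample the job's cumulative CPU time periodically and pass the recent window to `looks_stalled`, while comparing the latest reading against the value supplied as `jobCPUTime`.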

Information is returned to the WMS via the heart-beat mechanism, which also interprets control signals from the WMS, e.g. to kill a running job.

  • Still to implement:
    • CPU normalization for correct comparison with job limit

class DIRAC.WorkloadManagementSystem.JobWrapper.Watchdog.Watchdog(pid, exeThread, spObject, jobCPUTime, memoryLimit=0, processors=1, systemFlag='linux', jobArgs={})

Bases: object

__init__(pid, exeThread, spObject, jobCPUTime, memoryLimit=0, processors=1, systemFlag='linux', jobArgs={})

Constructor; takes the system flag as an argument.

calibrate()

The calibrate method obtains the initial values for system memory and load and calculates the margin for error for the rest of the Watchdog cycle.

execute()

The main agent execution method of the Watchdog.

getDiskSpace()

Attempts to get the available disk space. Should be overridden in a subclass.

getLoadAverage()

Attempts to get the load average. Should be overridden in a subclass.

getMemoryUsed()

Attempts to get the memory used. Should be overridden in a subclass.

getNodeInformation()

Attempts to retrieve all static system information. Should be overridden in a subclass.
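The override pattern for the resource getters might look like the following sketch. To keep it runnable without DIRAC installed, it uses a stand-in base class rather than the real Watchdog; the class names, the `path` parameter, and the dictionary return convention shown here are assumptions for illustration and may differ from what the actual methods return.

```python
import os
import shutil


class BaseWatchdogSketch:
    """Stand-in for the documented base class (not the DIRAC implementation)."""

    def getLoadAverage(self):
        # The base class cannot know the platform, so it reports a failure.
        return {"OK": False, "Message": "getLoadAverage not implemented"}

    def getDiskSpace(self):
        return {"OK": False, "Message": "getDiskSpace not implemented"}


class PosixWatchdogSketch(BaseWatchdogSketch):
    """Hypothetical platform-specific subclass overriding the getters."""

    def getLoadAverage(self):
        try:
            one_minute, _, _ = os.getloadavg()  # unavailable on some platforms
        except (OSError, AttributeError) as exc:
            return {"OK": False, "Message": str(exc)}
        return {"OK": True, "Value": one_minute}

    def getDiskSpace(self, path="."):
        usage = shutil.disk_usage(path)
        # Report free space in MB (the unit is an assumption of this sketch).
        return {"OK": True, "Value": usage.free / 1024**2}
```

Keeping the platform-specific calls in a subclass lets the monitoring loop stay identical across systems, with the `systemFlag` constructor argument presumably selecting which subclass is appropriate.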

initialize(loops=0)

Watchdog initialization.

run()

The main watchdog execution method.