Watchdog

The Watchdog class is used by the Job Wrapper to determine and monitor system resource consumption. The Watchdog can detect when a running job has stalled and report this to the Job Wrapper. It also detects when the job CPU limit has been exceeded and fails the job with a meaningful error.
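The two checks described above can be sketched as follows. This is a minimal illustration, not the Watchdog's actual implementation: the Sample structure, the checkJob function, the stallWindow and minCPUDelta parameters and the returned status strings are all assumptions made for this example. It compares raw CPU seconds with the limit, without any CPU normalization.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One monitoring snapshot (hypothetical structure for this sketch)."""
    wallClock: float  # seconds since the job started
    cpuTime: float    # CPU seconds consumed so far


def checkJob(samples, jobCPUTime, stallWindow=3, minCPUDelta=1.0):
    """Return 'CPULimitExceeded', 'Stalled' or 'OK' for a list of samples."""
    if not samples:
        return "OK"
    # CPU limit: compare the consumed CPU seconds with the job limit.
    if samples[-1].cpuTime > jobCPUTime:
        return "CPULimitExceeded"
    # Stall: almost no CPU consumed over the last stallWindow intervals.
    if len(samples) > stallWindow:
        consumed = samples[-1].cpuTime - samples[-1 - stallWindow].cpuTime
        if consumed < minCPUDelta:
            return "Stalled"
    return "OK"


# A job that stopped consuming CPU after its first sample.
history = [Sample(60.0 * i, 10.0) for i in range(5)]
print(checkJob(history, jobCPUTime=1000.0))
```

A real monitor would also need a policy for short jobs that have not yet produced enough samples; here they are simply reported as OK.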

Information is returned to the WMS (Workload Management System) via the heartbeat mechanism. The Watchdog also interprets control signals received from the WMS, e.g. to kill a running job.
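The heartbeat exchange can be pictured as a request that carries monitoring values to the server and a reply that may carry control signals. The sketch below is illustrative only: processHeartbeat, the sendFunc callable, the "Kill" signal name and the metric key are hypothetical and do not reproduce the real WMS interface.

```python
def processHeartbeat(sendFunc, jobID, metrics):
    """Send one heartbeat and return the actions requested in the reply.

    sendFunc is a hypothetical callable (jobID, metrics) -> dict of control
    signals; it stands in for the real WMS call, which is not shown here.
    """
    controlSignals = sendFunc(jobID, metrics) or {}
    actions = []
    # "Kill" is an assumed signal name used only for illustration.
    if controlSignals.get("Kill"):
        actions.append("kill")
    return actions


def fakeSend(jobID, metrics):
    """Stand-in server that always asks for the job to be killed."""
    return {"Kill": True}


print(processHeartbeat(fakeSend, 123, {"CPUConsumed": 42.0}))
```

Passing the transport in as a callable keeps the signal handling testable without a running server.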

  • Still to implement:
    • CPU normalization for correct comparison with job limit

class DIRAC.WorkloadManagementSystem.JobWrapper.Watchdog.Watchdog(pid, exeThread, spObject, jobCPUTime, memoryLimit=0, processors=1, jobArgs={})

Bases: object

__init__(pid, exeThread, spObject, jobCPUTime, memoryLimit=0, processors=1, jobArgs={})

Constructor. Takes the arguments shown in the signature above.

calibrate()

The calibrate method obtains the initial values for system memory and load and calculates the margin for error for the rest of the Watchdog cycle.
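One way to picture this calibration step is to record a baseline for each monitored quantity and derive an allowed band around it. The sketch below assumes a simple relative tolerance; the function names, the tolerance value and the sample numbers are hypothetical, and the actual margin calculation used by the Watchdog may differ.

```python
def calibrateBaseline(initialValues, tolerance=0.2):
    """Return {name: (baseline, lowerBound, upperBound)} for each quantity.

    initialValues maps a quantity name (e.g. "load") to its initial reading.
    tolerance is an assumed relative margin, not a value taken from DIRAC.
    """
    calibration = {}
    for name, baseline in initialValues.items():
        margin = abs(baseline) * tolerance
        calibration[name] = (baseline, baseline - margin, baseline + margin)
    return calibration


def outsideMargin(calibration, name, value):
    """True if a later reading falls outside the calibrated band."""
    _, lower, upper = calibration[name]
    return value < lower or value > upper


# Illustrative initial readings: load average and memory in MB.
calibration = calibrateBaseline({"load": 2.0, "memory": 4000.0}, tolerance=0.5)
print(outsideMargin(calibration, "load", 3.5))
```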

execute()

The main agent execution method of the Watchdog.

getDiskSpace(exclude=None)

Obtains the available disk space.
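For illustration, available disk space can be read with the Python standard library as sketched below. This is not the body of getDiskSpace: the helper name, the choice of megabytes as the unit and the default path are assumptions, and the exclude argument is not modelled.

```python
import shutil


def availableDiskSpaceMB(path="."):
    """Return the free space, in MB, of the file system containing path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / (1024 * 1024)


print("Free space: %.1f MB" % availableDiskSpaceMB("."))
```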

getNodeInformation()

Retrieves all static system information.

initialize()

Watchdog initialization.

run()

The main Watchdog execution method.

DIRAC.WorkloadManagementSystem.JobWrapper.Watchdog.kill_proc_tree(pid, sig=Signals.SIGTERM, includeParent=True)

Kill a process tree (including grandchildren) with signal “sig” and return a (gone, still_alive) tuple.

Taken from https://psutil.readthedocs.io/en/latest/index.html#kill-process-tree