ComponentSupervisionAgent

Component Supervision Agent for monitoring agent, executor or service behaviour and intervening if necessary.

This agent supervises Agents, Executors and Services, and restarts them if they get stuck. It can only control components running on the same machine as itself, so one ComponentSupervisionAgent per server is needed.

  • For each supervised agent, the age of its log file is checked; if the log file is deemed too old, the supervised agent is killed so that it is restarted automatically. (Option RestartAgents)

  • Executors whose log file is too old are restarted only if there are jobs in checking status. (Option RestartExecutors)

  • Services will be restarted if they don’t answer a ping RPC (Option RestartServices)

  • Check for running and stopped components and ensure they have the proper status as defined in the CS Registry/Hosts/_HOST_/[Running|Stopped] sections. (Option ControlComponents)

  • If desired, service URLs can also be added to or removed from the Configuration automatically. (Option CommitURLs)

The Running and Stopped components are configured in two sub-sections of Registry/Hosts/<Host>:

Running
{
  Configuration__Server =
  Framework__SystemAdministrator =
  Framework__ComponentSupervisionAgent =
}
Stopped
{
  DataManagement__FileCatalog2 =
  Framework__Monitoring =
}

Moving an entry from one section to the other makes the ComponentSupervisionAgent stop or start the given component. The values of the entries are ignored. The syntax of an entry is <System>__<ComponentName>.
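The entry syntax can be illustrated with a short sketch. It is not taken from the agent's code: the helper name split_component_entry and the desired-state dictionary are hypothetical, and the entry names are the ones from the example sections above.

```python
def split_component_entry(entry):
    """Split '<System>__<ComponentName>' into (system, component).

    Hypothetical helper for illustration; not part of the DIRAC API.
    """
    system, separator, component = entry.partition("__")
    if not separator or not system or not component:
        raise ValueError(f"Expected <System>__<ComponentName>, got {entry!r}")
    return system, component


# Entries as they appear in the example Running and Stopped sections.
running = [
    "Configuration__Server",
    "Framework__SystemAdministrator",
    "Framework__ComponentSupervisionAgent",
]
stopped = ["DataManagement__FileCatalog2", "Framework__Monitoring"]

# Map each (system, component) pair to the section it is listed in.
desired = {split_component_entry(entry): "Running" for entry in running}
desired.update({split_component_entry(entry): "Stopped" for entry in stopped})

for (system, component), state in sorted(desired.items()):
    print(f"{system}/{component}: {state}")
```

Moving an entry between the two lists changes only the state it maps to, which mirrors how moving an entry between the sections changes what the agent is asked to enforce.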

For full functioning of the Agent a few additional permissions have to be granted for the Operator role. In the SystemAdministrator/Authorization section of the relevant setup:

getOverallStatus = Operator
stopComponent = Operator
startComponent = Operator

getOverallStatus is needed for basic functioning of the Agent. stopComponent and startComponent are only needed if ControlComponents is enabled.

ComponentSupervisionAgent options
ComponentSupervisionAgent
{
  # Time in seconds between start of cycles
  PollingTime = 600
  # Overall enable or disable
  EnableFlag = False
  # Email addresses receiving notifications
  MailTo =
  # Sender email address
  MailFrom =
  # If True automatically restart stuck agents
  RestartAgents = False
  # if True automatically restart stuck services
  RestartServices = False
  # if True automatically restart stuck executors
  RestartExecutors = False
  # if True automatically start or stop components based on host configuration
  ControlComponents = False
  # if True automatically add or remove service URLs
  CommitURLs = False
  # List of patterns in instance names for which restart is disabled
  DoNotRestartInstancePattern = RequestExecutingAgent
}
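How DoNotRestartInstancePattern might be applied can be sketched as follows. This is an assumption-laden illustration: it assumes a plain substring match against the instance name, and both the helper name restart_allowed and the instance names used below are hypothetical.

```python
def restart_allowed(instance_name, do_not_restart_patterns):
    """Return False if any non-empty pattern occurs in the instance name.

    Assumption: patterns are matched as plain substrings. Empty patterns
    are skipped, because an empty string is contained in every name.
    """
    return not any(
        pattern in instance_name for pattern in do_not_restart_patterns if pattern
    )


# Pattern taken from the default option value above.
patterns = ["RequestExecutingAgent"]

# Hypothetical instance names, written in the <System>__<ComponentName> style.
for name in ("RequestManagement__RequestExecutingAgent", "Framework__Monitoring"):
    print(name, restart_allowed(name, patterns))
```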
class DIRAC.FrameworkSystem.Agent.ComponentSupervisionAgent.ComponentSupervisionAgent(*args, **kwargs)

Bases: AgentModule

ComponentSupervisionAgent class.

__init__(*args, **kwargs)

Initialize the agent, clients, default values.

am_Enabled()
am_checkStopAgentFile()
am_createStopAgentFile()
am_getControlDirectory()
am_getCyclesDone()
am_getMaxCycles()
am_getModuleParam(optionName)
am_getOption(optionName, defaultValue=None)

Gets an option from the agent’s configuration section. The section will be a subsection of the /Systems section in the CS.

am_getPollingTime()
am_getShifterProxyLocation()
am_getStopAgentFile()
am_getWatchdogTime()
am_getWorkDirectory()
am_go()
am_initialize(*initArgs)

Common initialization for all the agents.

This is executed every time an agent (re)starts. It is called by the AgentReactor and should not be overridden.

am_removeStopAgentFile()
am_secureCall(functor, args=(), name=False)
am_setModuleParam(optionName, value)
am_setOption(optionName, value)
am_stopExecution()
beginExecution()

Reload the configurations before every cycle.

checkAgent(agentName, options)

Check the age of the agent’s log file; if it is too old, restart the agent.

checkExecutor(executor, options)

Check the age of the executor’s log file; if it is too old, check for jobs in checking status and, if there are any, restart the executor.

checkForCheckingJobs(executorName)

Check whether there are jobs in checking status that have executorName as their current MinorStatus.

checkService(serviceName, options)

Ping the service and restart it if it does not respond.

checkURLs()

Ensure that the running services have their URL in the Config.

componentControl()

Monitor and control component status as defined in the CS.

Check for running and stopped components and ensure they have the proper status as defined in the CS Registry/Hosts/_HOST_/[Running|Stopped] sections.

Returns:

S_OK(), S_ERROR()

endExecution()
execute()

Execute checks for agents, executors, services.

finalize()
static getLastAccessTime(logFileLocation)

Return the age of the log file.
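A minimal sketch of a log-age check, assuming the age is derived from the file's modification time. The helper names log_file_age and is_stale and the max_age threshold are hypothetical and only illustrate the idea; they are not the agent's actual code.

```python
import os
from datetime import datetime, timedelta


def log_file_age(log_file_location):
    """Return a timedelta since the file was last modified.

    Assumption: the modification time is the relevant timestamp.
    Raises OSError if the file does not exist or cannot be read.
    """
    modified = datetime.fromtimestamp(os.path.getmtime(log_file_location))
    return datetime.now() - modified


def is_stale(log_file_location, max_age):
    """Return True if the log file has not been written to within max_age."""
    return log_file_age(log_file_location) > max_age


if __name__ == "__main__":
    # This script's own file serves as a stand-in for a component log file.
    print(is_stale(__file__, timedelta(hours=1)))
```

A component that is still writing to its log keeps the modification time fresh, so a log older than the chosen threshold is a cheap signal that the component may be stuck.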

getRunningInstances(instanceType='Agents', runitStatus='Run')

Return a dict of running agents, executors or services.

Each key is a component’s name; each value is a dict with PollingTime, PID, Port, Module, RunitStatus and LogFileLocation.

Parameters:
  • instanceType (str) – ‘Agents’, ‘Executors’, ‘Services’

  • runitStatus (str) – Return only those instances with given RunitStatus or ‘All’

Returns:

Dictionary of running instances
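The shape of the returned dictionary can be pictured with a small sketch. The field names come from the description above, while every value, the component names, the log paths and the helper filter_by_runit_status are hypothetical and only illustrate how a RunitStatus filter with an 'All' option could behave.

```python
# Hypothetical data: field names follow the description above, values are made up.
instances = {
    "ExampleAgent": {
        "PollingTime": 600,
        "PID": 12345,
        "Port": None,
        "Module": "ExampleAgent",
        "RunitStatus": "Run",
        "LogFileLocation": "/hypothetical/ExampleAgent/log/current",
    },
    "OtherAgent": {
        "PollingTime": 120,
        "PID": None,
        "Port": None,
        "Module": "OtherAgent",
        "RunitStatus": "Down",  # assumed name for a non-running status
        "LogFileLocation": "/hypothetical/OtherAgent/log/current",
    },
}


def filter_by_runit_status(all_instances, runit_status="Run"):
    """Keep only instances with the given RunitStatus; 'All' keeps everything."""
    if runit_status == "All":
        return dict(all_instances)
    return {
        name: info
        for name, info in all_instances.items()
        if info["RunitStatus"] == runit_status
    }


print(sorted(filter_by_runit_status(instances)))
print(sorted(filter_by_runit_status(instances, "All")))
```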

initialize(*args, **kwargs)

Agents should override this method for specific initialization. Executed at every agent (re)start.

logError(errStr, varMsg='')

Append errors to a list that is sent in the email notification.

on_terminate(componentName, process)

Execute callback when a process terminates gracefully.

restartInstance(pid, instanceName, enabled)

Kill a process which is then restarted automatically.
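The kill-then-restart idea can be sketched with the standard library alone. This sketch makes several assumptions: it acts on a child process that it started itself rather than on the PID of a separately supervised component, it does not perform the restart (which is left to an external supervisor), and the enabled flag is an assumed dry-run switch rather than a statement about how the agent uses its parameter of the same name. The function name terminate_for_restart is hypothetical.

```python
import subprocess
import sys


def terminate_for_restart(process, enabled=True, timeout=5.0):
    """Stop a process so that an external supervisor can start it again.

    Sends a terminate request first and escalates to kill if the process
    has not exited within `timeout` seconds. Returns True if the process
    was stopped, False if `enabled` is False (assumed dry-run behaviour).
    """
    if not enabled:
        return False
    process.terminate()
    try:
        process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()
        process.wait()
    return True


if __name__ == "__main__":
    # Stand-in for a stuck component: a child that sleeps for a minute.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(terminate_for_restart(child), child.poll() is not None)
```

Asking the process to terminate before killing it gives it a chance to shut down cleanly, which fits the on_terminate callback described for graceful termination.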

sendNotification()

Send an email notification about the changes made in the last cycle.