CloudComputingElement

Cloud Computing Element

This allows submission to cloud sites using libcloud (via the standard SiteDirector agent). The instances are contextualised using cloud-init.

Running cloud VM instances containing pilots is closely analogous to running classic grid jobs. There are, however, some things that work differently:

  • File I/O: A small amount of input may be transferred through the instance metadata, but after that the VM is inaccessible.

  • Authentication: Most cloud endpoints use a password or API-style credentials rather than grid-style proxy-based authentication. The pilot still requires a suitable proxy, but this cannot be renewed via the cloud interface due to the I/O limitations.

  • Pilot (VM) Tidy-up: Cloud providers will not remove stopped instances by default.

To avoid the proxy renewal limitations, an alternate pilot proxy is used within the instances. This can either be a longer-lived version of the usual pilot proxy or a pilot proxy generated from another dedicated cert/user. The proxy contains the DIRAC group, but no VOMS extension (as this would likely expire too quickly).

By default it is assumed that a generic CentOS7 base image is being used. This will be fully contextualised using cloud-init:

  • CVMFS & Singularity will be installed.

  • A dirac user will be created to run the jobs.

  • Pilot proxy and start-up scripts will be installed in /mnt.

  • The usual pilot script will be placed in the dirac home directory and the start-up scripts are run (as the dirac user).

  • After the pilot terminates, the machine is stopped by calling halt.

A partially or fully pre-configured image may be used instead, and the cloud-init template can be customised as necessary for this or any other use case. This is recommended on production systems to cut down on the overhead when starting many new instances.

The majority of cloud providers identify instances with some form of unique identifier (generally a UUID); this is used in the pilot references. Each instance can generally also have a “friendly name” associated with it, which may not be unique. We set the friendly name to a string with a recognisable prefix that can be pattern matched; this allows any stopped instances to be found and removed automatically without affecting other VMs potentially running as the same user.

Instances that match the “friendly name” prefix and have been running for longer than the maximum lifetime are assumed to be stuck or lost and will be removed. This ensures that instances don’t reserve or consume resources indefinitely.
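The selection rule described above can be sketched as follows. This is a minimal illustration and not the CloudComputingElement implementation: the `Instance` record, the `DIRAC_` prefix, the state names and the `select_for_removal` helper are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical prefix and limit, chosen only for this illustration.
NAME_PREFIX = "DIRAC_"
MAX_LIFETIME = 14 * 24 * 3600  # two weeks, in seconds


@dataclass
class Instance:
    uuid: str       # unique identifier assigned by the provider
    name: str       # "friendly name", not guaranteed to be unique
    state: str      # assumed state names: "running" or "stopped"
    created: float  # creation time as a Unix timestamp


def select_for_removal(instances, now, prefix=NAME_PREFIX, max_lifetime=MAX_LIFETIME):
    """Return the UUIDs of instances that should be removed."""
    doomed = []
    for inst in instances:
        if not inst.name.startswith(prefix):
            continue  # not one of ours: leave other VMs of the same user alone
        too_old = (now - inst.created) > max_lifetime
        if inst.state == "stopped" or too_old:
            doomed.append(inst.uuid)
    return doomed


now = 2_000_000.0
instances = [
    Instance("a1", "DIRAC_pilot_1", "stopped", now - 3600),
    Instance("b2", "DIRAC_pilot_2", "running", now - 3600),
    Instance("c3", "DIRAC_pilot_3", "running", now - MAX_LIFETIME - 1),
    Instance("d4", "other_vm", "stopped", now - 3600),
]
to_remove = select_for_removal(instances, now)
print(to_remove)
```

Matching on the name prefix first is what keeps unrelated VMs safe, even when they are stopped or old.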

Most cloud authentication systems require some form of static secret such as a password or token. To store these securely we load them from an ini format file, which should only be readable by the dirac service user on the host. The values can be stored in the DEFAULT section of the ini file, or a more specific section using the CE hostname can be used.

The special value PROXY will cause the secret to be replaced with the path to the proxy that the site director would normally use to submit a job. This is typically used for FedCloud sites using the libcloud OpenStack VOMS auth plugin.

[DEFAULT]
key = "myusername"
secret = "mypassword"

[cloudprov.mysite.example]
key = "cloudprovuser"
secret = "01234567"

[fedcloud.othersite.example]
key = "fedclouduser"
secret = "PROXY"
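As a rough sketch of how such a file could be read, the snippet below uses Python's standard `configparser`. The `load_cloud_auth` helper, the stripping of the surrounding double quotes and the example proxy path are assumptions made for illustration; the actual loading code may differ.

```python
import configparser
import os
import tempfile


def load_cloud_auth(path, ce_host, proxy_path):
    """Return (key, secret) for ce_host from an ini-style auth file."""
    parser = configparser.ConfigParser()
    with open(path) as handle:
        parser.read_file(handle)
    # Use the CE-specific section if present, otherwise fall back to [DEFAULT].
    section = parser[ce_host] if parser.has_section(ce_host) else parser.defaults()
    # Assumption: the surrounding double quotes are not part of the value.
    key = section["key"].strip('"')
    secret = section["secret"].strip('"')
    if secret == "PROXY":
        # Special value: substitute the path of the submission proxy.
        secret = proxy_path
    return key, secret


AUTH_TEXT = """\
[DEFAULT]
key = "myusername"
secret = "mypassword"

[cloudprov.mysite.example]
key = "cloudprovuser"
secret = "01234567"

[fedcloud.othersite.example]
key = "fedclouduser"
secret = "PROXY"
"""

with tempfile.TemporaryDirectory() as tmp:
    auth_file = os.path.join(tmp, "cloud.auth")
    with open(auth_file, "w") as handle:
        handle.write(AUTH_TEXT)
    os.chmod(auth_file, 0o600)  # readable and writable by the owner only
    hosts = ("cloudprov.mysite.example", "fedcloud.othersite.example", "unknown.example")
    # "/tmp/x509up_example" is a made-up proxy path for the demonstration.
    results = {host: load_cloud_auth(auth_file, host, "/tmp/x509up_example") for host in hosts}
```

A host without its own section, such as `unknown.example` here, picks up the values from `[DEFAULT]`.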

Configuration

The configuration options fall into a number of categories, distinguished by their prefix. The options are loaded from the CE level, but can be overridden at the queue level.

CloudType:

(Required) This should match the libcloud driver name for the cloud you’re trying to access. For example, for OpenStack this should be “OPENSTACK”. You can also specify a fully qualified class name to register and use as a driver: for example, if your class is “MyNodeDriver” in “MyPkg/Prov/Driver.py”, use “MyPkg.Prov.Driver.MyNodeDriver” here.
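The two accepted forms of this value can be told apart as in the sketch below. The `split_cloud_type` and `load_class` helpers are hypothetical, and treating any value containing a dot as a fully qualified class name is an assumption. For a plain driver name, the lookup would be left to libcloud itself (for instance through `get_driver` in `libcloud.compute.providers`), which is not exercised here.

```python
import collections
import importlib


def split_cloud_type(cloud_type):
    """Split a CloudType value into (module path, name).

    Assumption: a value without dots is a built-in driver name and is
    returned as (None, name).
    """
    if "." not in cloud_type:
        return None, cloud_type
    module_path, _, class_name = cloud_type.rpartition(".")
    return module_path, class_name


def load_class(cloud_type):
    """Import and return the class named by a fully qualified CloudType."""
    module_path, class_name = split_cloud_type(cloud_type)
    if module_path is None:
        raise ValueError("plain driver names are resolved by libcloud, not imported")
    return getattr(importlib.import_module(module_path), class_name)


print(split_cloud_type("OPENSTACK"))
print(split_cloud_type("MyPkg.Prov.Driver.MyNodeDriver"))
# MyPkg does not exist here, so a standard library class stands in for the demo.
loaded = load_class("collections.OrderedDict")
```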

CloudAuth:

(Optional) This sets the path to the authentication ini file as described above. Should be an absolute path but may use environment variables. Defaults to (DIRAC.rootPath)/etc/cloud.auth.

Driver_*:

(Required) All options starting with Driver_ will have the prefix stripped and be passed to the libcloud Driver object constructor. See the libcloud manual/examples for the options required for any given driver.
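The prefix handling can be pictured with the short sketch below. `driver_kwargs` is a hypothetical helper, not part of DIRAC or libcloud; it only shows which options would end up as keyword arguments of the driver constructor.

```python
def driver_kwargs(ce_options, prefix="Driver_"):
    """Collect the options starting with prefix, with the prefix removed."""
    return {
        name[len(prefix):]: value
        for name, value in ce_options.items()
        if name.startswith(prefix)
    }


ce_options = {
    "CloudType": "OPENSTACK",
    "Driver_ex_force_auth_url": "https://cloudprov.mysite.example:5000",
    "Driver_ex_force_auth_version": "3.x_password",
    "Driver_ex_tenant_name": "clouduser",
    "Instance_Flavor": "name:m1.medium",
}
kwargs = driver_kwargs(ce_options)
print(kwargs)
```

Options without the prefix, such as `CloudType` and `Instance_Flavor`, are left out of the result.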

Instance_Image:

(Required) The raw ID of the image to use or the name of the image prefixed by “name:”.

Instance_Flavor:

(Required) The raw ID of the flavor to use or the name of a flavor prefixed by “name:”.

Instance_Networks:

(Optional) A comma-separated list of the networks to use, each given either as a raw ID or as a name prefixed by “name:”.
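The “name:” convention shared by Instance_Image, Instance_Flavor and Instance_Networks could be parsed as in this sketch. The helper names, the whitespace handling and the example raw ID are assumptions for illustration only.

```python
def parse_ref(value):
    """Return ("name", x) for "name:x", otherwise ("id", value)."""
    value = value.strip()
    if value.startswith("name:"):
        return "name", value[len("name:"):]
    return "id", value


def parse_networks(option):
    """Parse a comma-separated Instance_Networks value into a list of refs."""
    return [parse_ref(item) for item in option.split(",") if item.strip()]


print(parse_ref("name:m1.medium"))
print(parse_ref("3f2b8c1e"))  # made-up raw ID
networks = parse_networks("name:my_public_net,name:my_private_net")
print(networks)
```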

Instance_SSHKey:

(Optional) The ID of an SSH key (on OpenStack this is just a plain name). If not specified the node will be booted without an extra key.

Context_Template:

(Optional) The path to the cloudinit.template file to use for these instances. If unset the default template file will be used.

Context_ExtPackages:

(Optional) Comma separated list of extra packages to install on the VM. Note: It is highly recommended to use SingularityCE with a container image with the required packages instead.

Context_ProxyLifetime:

(Optional) When submitting an instance, it will be provisioned with a new proxy with the same properties as the one provided by the SiteDirector but with an extended lifetime. This option sets the lifetime of the new proxy in seconds: It must be greater than the maximum time jobs can run for in the instance. Defaults to two weeks.

Context_MaxLifetime:

(Optional) The maximum lifetime of an instance in seconds. Any instances older than this will be removed regardless of state. Defaults to two weeks.
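Both lifetimes are given in seconds, so the two-week default corresponds to 1209600. The sketch below works that out and checks the one constraint stated above, that the proxy must outlive the longest job. `check_lifetimes` and the `max_job_time` argument are hypothetical names used only for this illustration.

```python
TWO_WEEKS = 14 * 24 * 60 * 60  # 1209600 seconds


def check_lifetimes(proxy_lifetime=TWO_WEEKS, max_job_time=0):
    """Return a list of problems found in the lifetime settings."""
    problems = []
    if proxy_lifetime <= max_job_time:
        problems.append("Context_ProxyLifetime must exceed the longest job run time")
    return problems


three_days = 3 * 24 * 60 * 60
print(TWO_WEEKS)
print(check_lifetimes(max_job_time=three_days))
print(check_lifetimes(proxy_lifetime=3600, max_job_time=7200))
```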

Example

The following is an example set of settings for an OpenStack based cloud:

CE = cloudprov.mysite.example
CEType = Cloud
CloudType = OPENSTACK
Driver_ex_force_auth_url = https://cloudprov.mysite.example:5000
Driver_ex_force_auth_version = 3.x_password
Driver_ex_tenant_name = clouduser
Instance_Image = name:CentOS-7-x86_64-GenericCloud-1905
Instance_Flavor = name:m1.medium
Instance_Networks = name:my_public_net,name:my_private_net
Instance_SSHKey = mysshkey

class DIRAC.Resources.Computing.CloudComputingElement.CloudComputingElement(*args, **kwargs)

Bases: ComputingElement

Cloud computing element class. Submits pilot jobs as VMs with libcloud.

__init__(*args, **kwargs)

Constructor. Takes the standard CE parameters. See the ComputingElement base class for details.

available()

This method returns the number of available slots in the target CE. The CE instance polls for waiting and running jobs and compares to the limits in the CE parameters.

cleanupPilots()

Removes all stopped instances. As a fallback to remove instances that have been lost, also removes all instances with a lifetime above the threshold set in the config.

Returns:

S_OK

getCEStatus()

Counts number of running jobs

Returns:

S_OK

Return type:

dict

getDescription()

Get CE description as a dictionary.

This is called by the JobAgent for the case of “inner” CEs.

getJobOutput(*args, **kwargs)

Not implemented: There is no standard way of getting files back from an instance. We rely on remote pilot logging to collect logs for debugging (or an admin logging on to an instance manually).

Returns:

S_ERROR, not implemented.

getJobStatus(jobIDList)

Lookup the status of the given pilot job IDs

Returns:

S_OK(dict(jobID -> str(state)))

Return type:

dict
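A caller might consume this return value as sketched below. The `S_OK` and `S_ERROR` functions here are simplified stand-ins that mimic the usual DIRAC return structure (a dict with an `OK` flag and either `Value` or `Message`) so that the snippet runs without DIRAC installed. The pilot IDs, the state names and the `summarise` helper are made up for the example.

```python
def S_OK(value=None):
    """Stand-in for DIRAC's S_OK: a successful result wrapping a value."""
    return {"OK": True, "Value": value}


def S_ERROR(message=""):
    """Stand-in for DIRAC's S_ERROR: a failed result carrying a message."""
    return {"OK": False, "Message": message}


def summarise(result):
    """Count pilots per state, or return an empty dict on error."""
    if not result["OK"]:
        return {}
    counts = {}
    for state in result["Value"].values():
        counts[state] = counts.get(state, 0) + 1
    return counts


# Hypothetical instance IDs and states, in the shape getJobStatus documents.
status = S_OK({"uuid-1": "Running", "uuid-2": "Running", "uuid-3": "Done"})
summary = summarise(status)
print(summary)
print(summarise(S_ERROR("endpoint unreachable")))
```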

initializeParameters()

Initialize the CE parameters after they are collected from various sources

isValid()

Check the sanity of the Computing Element definition

killJob(jobIDList)

Stops VM instances

Parameters:

jobIDList (list) – Instance IDs to delete.

Returns:

S_OK

loadBatchSystem(batchSystemName)

Instantiate object representing the backend batch system

Parameters:

batchSystemName (str) – name of the batch system

sendOutput(stdid, line)

Callback function such that the results from the CE may be returned.

setCPUTimeLeft(cpuTimeLeft=None)

Update the CPUTime parameter of the CE classAd, necessary for running in filling mode

setParameters(ceOptions)

Add parameters from the given dictionary overriding the previous values

Parameters:

ceOptions (dict) – CE parameters dictionary to update already defined ones

setProxy(proxy, valid=0)

Take existing proxy, and extract group name. Then create new proxy for the cloud pilot user bound to the same group with the lifetime set to the value specified in the CE config.

Returns:

S_OK() or S_ERROR(error string)

setToken(token)

shutdown()

Optional method to shutdown the (Inner) Computing Element

submitJob(executableFile, proxy, numberOfJobs=1)

Creates VM instances

Parameters:
  • executableFile (str) – Path to pilot job wrapper file to use

  • proxy (str) – Unused, see setProxy()

  • numberOfJobs (int) – Number of instances to start

Returns:

S_OK/S_ERROR

writeProxyToFile(proxy)

CE helper function to write a CE proxy string to a file.