CloudComputingElement
Cloud Computing Element
This allows submission to cloud sites using libcloud (via the standard SiteDirector agent). The instances are contextualised using cloud-init.
Running cloud VM instances containing pilots is closely analogous to running classic grid jobs. There are, however, some things that work differently:
File I/O: A small amount of input may be transferred through the instance metadata, but after that the VM is inaccessible.
Authentication: Most cloud endpoints use a password or API-style credentials rather than grid-style proxy-based authentication.
Pilot (VM) Tidy-up: Cloud providers will not remove stopped instances by default.
The cloud instances now use the standard pilot proxy bundled in the job wrapper script. The extended lifetime proxy that was generated and included in the cloud user_data is no longer required and has been removed.
By default it is assumed that a generic CentOS7 base image is being used. This will be fully contextualised using cloud-init:
CVMFS & Singularity will be installed.
A dirac user will be created to run the jobs.
Pilot start-up scripts will be installed in /mnt.
The usual pilot script will be placed in the dirac home directory and the start-up scripts are run (as the dirac user).
After the pilot terminates, the machine is stopped by calling halt.
A partially or fully pre-configured image may be used instead and the cloud-init template can be customised as necessary for this or any use case. This is recommended on production systems to cut down on the overhead when starting many new instances.
The majority of cloud providers identify instances with some form of unique identifier (generally a UUID); this is used in the pilot references. Each instance can generally also have a “friendly name” associated with it, which may not be unique. We set the friendly name to a string that can be pattern matched; this allows any stopped instances to be found and removed automatically without affecting other VMs potentially running as the same user.
Instances that match the “friendly name” prefix and have been running for longer than the maximum lifetime are assumed to be stuck or lost and will be removed. This ensures that instances don’t reserve or consume resources indefinitely.
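The clean-up rule described above can be sketched in a few lines of Python. This is an illustration only, not the element's actual code: the Instance record, the NAME_PREFIX value and the select_for_removal helper are all hypothetical names introduced here, and the real prefix and threshold come from the CE configuration.

```python
from dataclasses import dataclass

# Hypothetical values for illustration; the real ones come from the CE settings.
NAME_PREFIX = "DIRAC_"
MAX_LIFETIME = 14 * 24 * 3600  # two weeks, in seconds


@dataclass
class Instance:
    """Minimal stand-in for a cloud provider's node record."""
    uuid: str       # unique identifier, used in the pilot reference
    name: str       # "friendly name", not guaranteed to be unique
    state: str      # e.g. "running" or "stopped"
    created: float  # creation time, seconds since the epoch


def select_for_removal(instances, now, prefix=NAME_PREFIX, max_lifetime=MAX_LIFETIME):
    """Return the UUIDs of instances that the clean-up should remove."""
    doomed = []
    for inst in instances:
        if not inst.name.startswith(prefix):
            continue  # belongs to someone else: never touch it
        too_old = now - inst.created > max_lifetime
        if inst.state == "stopped" or too_old:
            doomed.append(inst.uuid)
    return doomed
```

Filtering on the name prefix first is what keeps unrelated VMs owned by the same cloud user safe, even if they are stopped or very old.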
Most cloud authentication systems require some form of static secret such as a password or token. To store these securely we load them from an ini format file, which should only be readable by the dirac service user on the host. The values can be stored in the DEFAULT section of the ini file, or a more specific section using the CE hostname can be used.
The special value PROXY will cause the secret to be replaced with the path to the proxy that the site director would normally use to submit a job. This is typically used for FedCloud sites using the libcloud OpenStack VOMS auth plugin.
[DEFAULT]
key = "myusername"
secret = "mypassword"
[cloudprov.mysite.example]
key = "cloudprovuser"
secret = "01234567"
[fedcloud.othersite.example]
key = "fedclouduser"
secret = "PROXY"
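As a rough sketch of how such a file could be read, the following uses Python's standard configparser module. The load_credentials helper and the SAMPLE_AUTH string are hypothetical, and stripping the surrounding double quotes from the values is an assumption made here to match the sample above; the element's own loader may behave differently.

```python
import configparser

# Sample content mirroring the ini file shown above.
SAMPLE_AUTH = """
[DEFAULT]
key = "myusername"
secret = "mypassword"
[cloudprov.mysite.example]
key = "cloudprovuser"
secret = "01234567"
[fedcloud.othersite.example]
key = "fedclouduser"
secret = "PROXY"
"""


def load_credentials(auth_text, ce_host, proxy_path):
    """Return (key, secret) for a CE, falling back to the DEFAULT section."""
    parser = configparser.ConfigParser()
    parser.read_string(auth_text)
    # Prefer the section named after the CE hostname, if there is one.
    section = parser[ce_host] if parser.has_section(ce_host) else parser["DEFAULT"]
    # Assumption: values may be wrapped in double quotes, as in the sample.
    key = section["key"].strip('"')
    secret = section["secret"].strip('"')
    if secret == "PROXY":
        # The special value is replaced by the path to the submission proxy.
        secret = proxy_path
    return key, secret
```

In practice the text would be read from the file named by the CloudAuth option, which should be readable only by the dirac service user.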
Configuration
The configuration options fall into a number of categories. They are loaded from the CE level, but can be overridden at the queue level.
- CloudType:
(Required) This should match the libcloud driver name for the cloud you’re trying to access, e.g. for OpenStack this should be “OPENSTACK”. You can also specify a fully qualified class name to register and use as a driver. For example, if your class is “MyNodeDriver” in “MyPkg/Prov/Driver.py”, use “MyPkg.Prov.Driver.MyNodeDriver” here.
- CloudAuth:
(Optional) This sets the path to the authentication ini file as described above. Should be an absolute path but may use environment variables. Defaults to (DIRAC.rootPath)/etc/cloud.auth.
- Driver_*:
(Required) All options starting with Driver_ will have the prefix stripped and be passed to the libcloud Driver object constructor. See the libcloud manual/examples for the options required for any given driver.
- Instance_Image:
(Required) The raw ID of the image to use or the name of the image prefixed by “name:”.
- Instance_Flavor:
(Required) The raw ID of the flavor to use or the name of a flavor prefixed by “name:”.
- Instance_Networks:
(Optional) A comma-separated list of either the raw IDs or the names prefixed by “name:” of the networks to use.
- Instance_SSHKey:
(Optional) The ID of an SSH key (on OpenStack this is just a plain name). If not specified the node will be booted without an extra key.
- Context_Template:
(Optional) The path to the cloudinit.template file to use for these instances. If unset the default template file will be used.
- Context_ExtPackages:
(Optional) Comma separated list of extra packages to install on the VM. Note: It is highly recommended to use SingularityCE with a container image with the required packages instead.
- Context_MaxLifetime:
(Optional) The maximum lifetime of an instance in seconds. Any instances older than this will be removed regardless of state. Defaults to two weeks.
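The “name:” convention used by Instance_Image, Instance_Flavor and Instance_Networks can be illustrated with a small parser. This is a minimal sketch under the assumption that a value is either a raw ID or a name carrying the “name:” prefix; parse_reference and parse_networks are hypothetical helper names, not part of the element's API.

```python
def parse_reference(value):
    """Classify an option value as ("name", <name>) or ("id", <raw ID>)."""
    value = value.strip()
    prefix = "name:"
    if value.startswith(prefix):
        return ("name", value[len(prefix):])
    return ("id", value)


def parse_networks(option):
    """Split a comma-separated Instance_Networks value into references."""
    return [parse_reference(item) for item in option.split(",") if item.strip()]
```

A value tagged as a name would then need to be looked up against the provider's list of images, flavors or networks, while a raw ID can be used directly.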
Example
The following is an example set of settings for an OpenStack based cloud:
CE = cloudprov.mysite.example
CEType = Cloud
CloudType = OPENSTACK
Driver_ex_force_auth_url = https://cloudprov.mysite.example:5000
Driver_ex_force_auth_version = 3.x_password
Driver_ex_tenant_name = clouduser
Instance_Image = name:CentOS-7-x86_64-GenericCloud-1905
Instance_Flavor = name:m1.medium
Instance_Networks = name:my_public_net,name:my_private_net
Instance_SSHKey = mysshkey
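To illustrate how the Driver_ options in this example are handled, the sketch below strips the prefix and collects the remainder as keyword arguments. The split_driver_options helper and the EXAMPLE_OPTIONS dictionary are hypothetical names used only for illustration; the option values are copied from the example above.

```python
# The example settings above, represented as a plain dictionary.
EXAMPLE_OPTIONS = {
    "CloudType": "OPENSTACK",
    "Driver_ex_force_auth_url": "https://cloudprov.mysite.example:5000",
    "Driver_ex_force_auth_version": "3.x_password",
    "Driver_ex_tenant_name": "clouduser",
    "Instance_Image": "name:CentOS-7-x86_64-GenericCloud-1905",
    "Instance_Flavor": "name:m1.medium",
}


def split_driver_options(ce_options, prefix="Driver_"):
    """Return the options that start with the prefix, with the prefix removed."""
    driver_kwargs = {}
    for name, value in ce_options.items():
        if name.startswith(prefix):
            driver_kwargs[name[len(prefix):]] = value
    return driver_kwargs
```

The resulting dictionary (ex_force_auth_url, ex_force_auth_version and ex_tenant_name here) is what would be passed to the libcloud driver constructor, alongside the key and secret loaded from the authentication file.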
- class DIRAC.Resources.Computing.CloudComputingElement.CloudComputingElement(*args, **kwargs)
Bases:
ComputingElement
Cloud computing element class. Submits pilot jobs as VMs with libcloud.
- __init__(*args, **kwargs)
Constructor. Takes the standard CE parameters. See the ComputingElement base class for details.
- available(jobIDList=None)
This method returns the number of available slots in the target CE. The CE instance polls for waiting and running jobs and compares to the limits in the CE parameters.
- Parameters:
jobIDList (list) – list of already existing job IDs to be checked against
- cleanupPilots()
Removes all stopped instances, and removes all instances with a lifetime above the threshold from the config as a fallback to remove instances that have been lost.
- Returns:
S_OK
- getDescription()
Get CE description as a dictionary.
This is called by the JobAgent for the case of “inner” CEs.
- getJobOutput(*args, **kwargs)
Not implemented: There is no standard way of getting files back from an instance. We rely on remote pilot logging to collect logs for debugging (or an admin logging on to an instance manually).
- Returns:
S_ERROR, not implemented.
- getJobStatus(jobIDList)
Lookup the status of the given pilot job IDs
- Returns:
S_OK(dict(jobID -> str(state)))
- Return type:
- initializeParameters()
Initialize the CE parameters after they are collected from various sources
- isProxyValid(valid=1000)
Check if the stored proxy is valid
- isValid()
Check the sanity of the Computing Element definition
- killJob(jobIDList)
Stops VM instances
- Parameters:
jobIDList (list) – Instance IDs to delete.
- Returns:
S_OK
- loadBatchSystem(batchSystemName)
Instantiate object representing the backend batch system
- Parameters:
batchSystemName (str) – name of the batch system
- sendOutput(stdid, line)
Callback function such that the results from the CE may be returned.
- setCPUTimeLeft(cpuTimeLeft=None)
Update the CPUTime parameter of the CE classAd, necessary for running in filling mode
- setParameters(ceOptions)
Add parameters from the given dictionary overriding the previous values
- Parameters:
ceOptions (dict) – CE parameters dictionary to update already defined ones
- setProxy(proxy, valid=0)
Set proxy for this instance
- setToken(token)
- shutdown()
Optional method to shutdown the (Inner) Computing Element
- submitJob(executableFile, proxy, numberOfJobs=1)
Creates VM instances
- writeProxyToFile(proxy)
CE helper function to write a CE proxy string to a file.