7. Large Scale DataManagement with the Transformation System
7.1. Pre-Requisite
You should:
have a machine set up as described in Basic Tutorial setup
have installed two DIRAC SEs using the tutorial (Install a DIRAC Storage Element)
have installed the DFC using the tutorial (Installing the DIRAC File Catalog)
have followed the tutorial on identity management (Managing identities)
have installed the RMS using the tutorial (Installing the RequestManagement System)
have installed the TS using the tutorial (Installing the TransformationSystem)
7.2. Tutorial Goal
The aim of the tutorial is to demonstrate how large scale data management operations (removals, replications, etc.) can be achieved using the Transformation System. By the end of the tutorial, you will be able to:
Submit a simple transformation to manipulate a given list of files
Have transformations fed automatically through metadata queries
Write your own plugin for the TransformationSystem
The transformations can be monitored and controlled with the Transformation Monitor in the WebApp when you use the dirac_prod group.
7.3. More Links
7.4. Creating a Transformation with a DIRAC Command
This section is to be performed as diracuser with a proxy in the dirac_prod group.
First we need to create some files and upload them to StorageElementOne:
[diracuser@dirac-tuto ~]$ for ID in {1..10}; do echo "MyContent $ID" > File_${ID} ; dirac-dms-add-file /tutoVO/data/Trans_01/File_${ID} File_${ID} StorageElementOne ; done
Then we create the list of LFNs we just uploaded:
[diracuser@dirac-tuto ~]$ dirac-dms-find-lfns Path=/tutoVO/data/Trans_01 > trans01.lfns
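Before feeding the list to a transformation, it can be worth checking what the file actually contains. The following is a minimal sketch in plain Python and not part of DIRAC: the helper name parse_lfns is hypothetical, and it assumes that trans01.lfns holds one LFN per line and that any line not starting with "/" can be ignored.

```python
def parse_lfns(lines):
    """Return the LFNs found in an iterable of text lines.

    Blank lines and lines that do not start with '/' are skipped,
    duplicates are dropped, and the original order is preserved.
    """
    lfns = []
    seen = set()
    for line in lines:
        lfn = line.strip()
        if not lfn.startswith("/") or lfn in seen:
            continue
        seen.add(lfn)
        lfns.append(lfn)
    return lfns
```

For the file created above, one would call it as parse_lfns(open("trans01.lfns")) and expect a list of 10 entries.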
The easiest way to create a transformation to replicate files is by using the dirac-transformation-replication command:
[diracuser@dirac-tuto ~]$ dirac-transformation-replication 0 StorageElementTwo --Plugin Broadcast --Enable
Created transformation NNN
Successfully created replication transformation
This created a transformation with the unique transformation ID NNN (e.g., 1).
By default this transformation uses metadata information to obtain its input files through the InputDataAgent. Instead, we can also add files manually with the dirac-transformation-add-files command, using the list we created previously. Replace NNN with the ID of the transformation that was just created:
[diracuser@dirac-tuto ~]$ dirac-transformation-add-files NNN trans01.lfns
Successfully added 10 files
Now we have to wait until the TransformationAgent runs again and creates a Task for each of the files. Once the tasks are created, the RequestTaskAgent creates a request out of each task, which is then processed by the RequestExecutingAgent of the RMS.
7.5. Creating a Transformation with a Script
In this step we want to remove the replicas of our files from StorageElementOne. For this purpose we have to write a script that creates a removal transformation:
#!/usr/bin/env python

# set up the DIRAC configuration, parse command line arguments
from DIRAC import gLogger
from DIRAC.Core.Utilities.DIRACScript import DIRACScript as Script
Script.parseCommandLine()

from DIRAC.TransformationSystem.Client.Transformation import Transformation

# create a Transformation instance
myTrans = Transformation()

# transformation names need to be unique
uniqueIdentifier = "Trans1"
transformationName = "RemoveReplicas_%s" % uniqueIdentifier
myTrans.setTransformationName(transformationName)

# describe what the transformation will do
description = "Remove replicas from StorageElementOne"
myTrans.setDescription(description)
myTrans.setLongDescription(description)

# 'Removal' is a data management transformation type
myTrans.setType('Removal')

# group transformations that belong together, these can be selected in the WebApp
transGroup = "myRemovals"
myTrans.setTransformationGroup(transGroup)

# groupSize defines the number of files each request will treat
groupSize = 1
myTrans.setGroupSize(groupSize)

# the transformation plugin defines which input files are treated and how they are grouped
plugin = 'Broadcast'
myTrans.setPlugin(plugin)

# the 'body' of the transformation defines a list of Request Operations
# that are executed in order for each file added to the transformation
targetSE = 'StorageElementOne'
transBody = [("RemoveReplica", {"TargetSE": targetSE})]

myTrans.setBody(transBody)

res = myTrans.setTargetSE(targetSE)
if not res['OK']:
    gLogger.error("TargetSE not valid: %s" % res['Message'])
    exit(1)

res = myTrans.addTransformation()
if not res['OK']:
    gLogger.error("Failed to add the transformation: %s" % res['Message'])
    exit(1)

# now activate the transformation
myTrans.setStatus('Active')
myTrans.setAgentType('Automatic')
transID = myTrans.getTransformationID()['Value']
gLogger.notice('Created RemoveReplica transformation: %r' % transID)
exit(0)
When we execute the script, the transformation is created with the ID MMM (e.g. 2):
[diracuser@dirac-tuto ~]$ python createRemoval.py
Created transformation MMM
Created RemoveReplica transformation: MMML
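Because transformation names need to be unique, running the script twice with the same uniqueIdentifier would presumably fail at the addTransformation step. One possible way around this is sketched below in plain Python. The helper make_transformation_name is hypothetical and not part of DIRAC; it simply appends a random suffix when no identifier is given.

```python
import uuid


def make_transformation_name(prefix, identifier=None):
    """Build a transformation name of the form '<prefix>_<identifier>'.

    When no identifier is given, a short random hexadecimal suffix is
    generated so that repeated calls are very unlikely to collide.
    """
    if identifier is None:
        identifier = uuid.uuid4().hex[:8]
    return "%s_%s" % (prefix, identifier)
```

The result could be passed to myTrans.setTransformationName in place of the hard-coded name. A fixed identifier remains the better choice when the name should be recognisable in the Transformation Monitor.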
To remove a replica from StorageElementOne, we just have to add files to this transformation:
[diracuser@dirac-tuto ~]$ dirac-transformation-add-files MMM /tutoVO/data/Trans_01/File_10
Successfully added 1 files
And then wait again for the TransformationAgent, RequestTaskAgent, RequestExecutingAgent chain to complete.
After a short while, you should see that the folder /opt/dirac/storageElementOne/tutoVO/data/Trans_01/ no longer contains File_10.
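The groupSize set in the script decides how many files end up in each request. The following toy model in plain Python illustrates the idea only. It is not how DIRAC plugins are implemented (a real plugin may, for instance, also take replica locations into account), and the function name group_into_tasks is hypothetical.

```python
def group_into_tasks(lfns, group_size=1):
    """Split a list of LFNs into consecutive chunks of at most group_size files."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [lfns[start:start + group_size]
            for start in range(0, len(lfns), group_size)]
```

With a group size of 1, as in the script, ten files give ten tasks. With a group size of 4 they would give three tasks holding 4, 4 and 2 files.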
7.6. Using Metadata Queries to Add Files to Transformations
Adding files manually to transformations can be useful, but if we want files to be added automatically we can make use of metadata queries in combination with the InputDataAgent, which executes the queries and adds new files to the corresponding transformation.
To benefit from metadata queries, we first have to create a metadata key and add the key to a directory. These operations can be done with the dirac-dms-filecatalog-cli:
[diracuser@dirac-tuto ~]$ dirac-dms-filecatalog-cli
Starting FileCatalog client
File Catalog Client $Revision: 1.17 $Date:
FC:/$ ls -l
drwxrwxr-x 0 ciuser dirac_user 0 2019-05-06 14:30:36 tutoVO
In the dirac-dms-filecatalog-cli, like in the other DIRAC CLIs, you can use help and help <command> to see information about the available commands.
Initially there are no metadata keys defined:
FC:/$ meta show
FileMetaFields : {}
DirectoryMetaFields : {}
We now create an integer directory metadata field called TransformationID:
FC:/$ meta index -d TransformationID int
Added metadata field TransformationID of type int
FC:/$ meta show
FileMetaFields : {}
DirectoryMetaFields : {'TransformationID': 'INT'}
Let’s set TransformationID=1 on the directory that holds the files we uploaded earlier:
FC:/$ meta set /tutoVO/data/Trans_01/ TransformationID 1
/tutoVO/data/Trans_01 {'TransformationID': '1'}
You can see the metadata set for a given directory with the meta get command, and you can use the find command inside the dirac-dms-filecatalog-cli to search for files with metadata:
FC:/$ meta get /tutoVO/data/Trans_01/
!TransformationID : 1
FC:/$ find / TransformationID=1
Query: {'TransformationID': 1}
/tutoVO/data/Trans_01/File_1
[..snip..]
/tutoVO/data/Trans_01/File_9
QueryTime 0.00 sec
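To make the behaviour of the find command more concrete, here is a toy model of a directory metadata lookup in plain Python. It is only an illustration and not the DFC implementation: the function name and data layout are hypothetical, and it assumes that files inherit the metadata of all their parent directories.

```python
import posixpath


def find_by_metadata(lfns, directory_meta, query):
    """Return the LFNs whose inherited directory metadata matches the query.

    directory_meta maps a directory path to a dict of metadata, and
    query is a dict of key/value pairs that must all match.
    """
    def inherited_meta(lfn):
        meta = {}
        path = ""
        # walk from the top-level directory down to the file's own directory,
        # so that metadata set on deeper directories takes precedence
        for part in posixpath.dirname(lfn).strip("/").split("/"):
            path += "/" + part
            meta.update(directory_meta.get(path, {}))
        return meta

    return [lfn for lfn in lfns
            if all(inherited_meta(lfn).get(key) == value
                   for key, value in query.items())]
```

In this model, the query {'TransformationID': 1} selects every file below a directory that carries TransformationID=1 and nothing else.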
Now let us create another directory and set a different metadata value, before we create another transformation that includes an input data query:
FC:/$ mkdir /tutoVO/data/Trans_02/
Successfully created directory: /tutoVO/data/Trans_02
FC:/$ meta set /tutoVO/data/Trans_02/ TransformationID 2
/tutoVO/data/Trans_02 {'TransformationID': '2'}
FC:/$ meta get /tutoVO/data/Trans_02/
!TransformationID : 2
Back in the shell, upload some files to this folder:
[diracuser@dirac-tuto ~]$ for ID in {1..10}; do echo "MyContent $ID" > File_${ID} ; dirac-dms-add-file /tutoVO/data/Trans_02/File_${ID} File_${ID} StorageElementOne ; done
We can also use the command dirac-dms-find-lfns to search for files with given metadata:
[diracuser@dirac-tuto ~]$ dirac-dms-find-lfns Path=/ TransformationID=2
Now we create a transformation, which uses the metadata to pick up the files:
[diracuser@dirac-tuto ~]$ dirac-transformation-replication 2 StorageElementTwo --Plugin=Broadcast --Enable
Created transformation LLL
Successfully created replication transformation
In fact, the command dirac-transformation-replication already uses metadata: its first argument is the value for the TransformationID metadata. Now we have to wait for the InputDataAgent, TransformationAgent, RequestTaskAgent, RequestExecutingAgent chain to run its course.
In the log file of the InputDataAgent in /opt/dirac/pro/runit/Transformation/InputDataAgent/log/current this line should eventually appear:
<SomeDate> Transformation/InputDataAgent INFO: 10 files returned for transformation LLL from the metadata catalog
You may add some more files to /tutoVO/data/Trans_02/ and see them appearing in your transformation:
[diracuser@dirac-tuto ~]$ for ID in {11..20}; do echo "MyContent $ID" > File_${ID} ; dirac-dms-add-file /tutoVO/data/Trans_02/File_${ID} File_${ID} StorageElementOne ; done
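Conceptually, each pass of the query only needs to attach the files that are not yet part of the transformation. The sketch below shows that idea in plain Python. It is an illustration under that assumption rather than the InputDataAgent's actual code, and the function name is hypothetical.

```python
def files_to_attach(query_result, already_attached):
    """Return the LFNs from query_result that are not attached yet, in order."""
    known = set(already_attached)
    return [lfn for lfn in query_result if lfn not in known]
```

After the second upload the query would return 20 files, of which only File_11 to File_20 would be new to the transformation.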
7.6.1. InputDataQuery in the Script
To add the metadata query functionality to our createRemoval.py script from above, we just need to insert a couple of lines after the myTrans.setBody(transBody) call:
metaQuery = {'TransformationID': 2}
myTrans.setInputMetaQuery(metaQuery)

...
Adapt the script by inserting the lines and changing the uniqueIdentifier, and execute it:
[diracuser@dirac-tuto ~]$ python createRemoval.py
Created transformation JJJ
Created RemoveReplica transformation: JJJL
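If several removal transformations are needed, the settings from createRemoval.py could be gathered in one function. The sketch below is one possible arrangement, not an official DIRAC helper. configure_removal is a hypothetical name, and it only calls setter methods that appear in the script above. RecordingTransformation is a stand-in that records the calls so the sketch can be tried without a DIRAC installation; in a real script you would pass a Transformation() instance instead.

```python
def configure_removal(trans, name, target_se, meta_query=None, group_size=1):
    """Apply the removal settings from createRemoval.py to a transformation object."""
    description = "Remove replicas from %s" % target_se
    trans.setTransformationName(name)
    trans.setDescription(description)
    trans.setLongDescription(description)
    trans.setType('Removal')
    trans.setTransformationGroup("myRemovals")
    trans.setGroupSize(group_size)
    trans.setPlugin('Broadcast')
    trans.setBody([("RemoveReplica", {"TargetSE": target_se})])
    if meta_query is not None:
        # only transformations fed by metadata need an input query
        trans.setInputMetaQuery(meta_query)
    return trans.setTargetSE(target_se)


class RecordingTransformation(object):
    """Stand-in (not a DIRAC class) that records every set<Something> call."""

    def __init__(self):
        self.calls = {}

    def __getattr__(self, name):
        if not name.startswith("set"):
            raise AttributeError(name)

        def setter(value):
            self.calls[name[3:]] = value
            return {'OK': True, 'Value': value}
        return setter
```

The return value of configure_removal is the result of setTargetSE, so the caller can check res['OK'] as in the original script before calling addTransformation.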
7.7. Conclusion
You now have all the knowledge to perform DataManagement in DIRAC with the TransformationSystem.
To learn how to extend the system by creating new transformation plugins, please see how to Create a TransformationAgent Plugin and Create a Body Plugin.