Computing
Slurm computing platform
The description of the computing platform and the corresponding documentation are available here.
Launch a job
For more information on job submission, see the complete documentation here.
To submit a job on the computing platform, the sbatch command must be used with the following syntax:
sbatch -A euclid -t 0-00:30 -n 1 --mem 2G job.sh
where -t <d-hh:mm> is the time limit, -n <number> is the number of cores to be used, --mem <size> is the amount of memory, and job.sh is the job script.
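For illustration, a minimal job.sh could look like the following (the payload command and file name are hypothetical; adapt them to your own workload):
#!/bin/sh
# Hypothetical payload: report the worker node, then run your executable
echo "Running on $(hostname)"
./my_program --input input.fits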
Job profiling
It is possible to profile your job on the computing platform. An HTML file is created that can be opened in a browser to display profiling information and graphs, along with an XML file containing the raw profiling values.
To use this option, activate the profiling agent by adding --profile=task to your submission line:
sbatch -A euclid -t 0-01:00 -n 3 --mem 7G --profile=task [--acctg-freq=task=10] job.sh
You can retrieve the complete documentation of this option here.
Once the job has finished, generate the profiling report by submitting a dependent job:
sbatch -A euclid -t 0-01:00 -n 1 --mem 1G -d <jobid> slurm_profiling <jobid>
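For example, if the job to be profiled was submitted as job 123456 (a hypothetical job ID), the follow-up submission would be:
sbatch -A euclid -t 0-01:00 -n 1 --mem 1G -d 123456 slurm_profiling 123456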
Pipeline Runner
This documentation provides some help for running pipelines using the Euclid Pipeline Runner in standalone mode (or console mode) at CC-IN2P3.
By standalone we mean:
- With no use of the metascheduler, hence no interface with EAS-DPS and EAS-DSS (e.g. no PPO).
- With no use of the Pipeline Run Server. The Pipeline Runner is simply launched as an executable (pipeline_runner.py) that terminates once the pipeline ends (no web server with workflow plot for instance).
The general workflow that is recommended is the following:
- A Pipeline Runner process is launched within a job on the batch farm. This job will keep running as long as the pipeline is executing.
- This Pipeline Runner process will submit pilot jobs to the batch system.
- Pilot jobs will start, request payload jobs (i.e. PF tasks) from the Pipeline Runner, and execute them.
Setup and configuration
The first thing to prepare in order to launch a pipeline is the pipeline directory (typically in your /sps/euclid/ user space).
In our example setup it is called $PIPELINEDIR, and the directory tree looks like the following:
$PIPELINEDIR/
    workdir/
The PIPELINEDIR directory must also contain a Pipeline Runner configuration file, sdc-fr-local.properties (see the attached example). For more details on this configuration file, see the Pilot jobs configuration section below.
The workdir directory usually contains a .dat configuration file and a data directory with all your input files.
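Putting everything together, an example layout (with the file names used in the PR_script.sh example below) would be:
$PIPELINEDIR/
    sdc-fr-local.properties
    workdir/
        params.dat
        data/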
You now need to create a script that will set up your environment and launch the Pipeline Runner. This script, PR_script.sh, should have the following content:
#!/bin/sh
# Setup the Pipeline Runner environment variables
## DON'T CHANGE THIS SECTION ###############################################################################################
export PR=/cvmfs/euclid-dev.in2p3.fr/COMPONENTS/INFRA/ST_PipelineRunner/3.3.2
export PATH=/cvmfs/euclid-dev.in2p3.fr/COMPONENTS/INFRA/ST_PipelineRunner/3.3.2/bin:$PATH
############################################################################################################################
#### UPDATE THE PATH TO PIPELINE DIR #######################################################################################
export PIPELINEDIR=/sps/euclid/Users/foo/pipeline/
############################################################################################################################
# Launch the Pipeline Runner
########### UPDATE THE PIPELINE NAME AND, IF NEEDED, THE MyConfigDat.dat NAME ##############################################
$PR/bin/python $PR/bin/pipeline_runner.py localrun --pipeline="MyPipelineScript.py" --config="$PIPELINEDIR/sdc-fr-local.properties" --data="$PIPELINEDIR/workdir/params.dat" --shortid --edenVersion=eden-3.1
You need to adapt the PIPELINEDIR variable, as well as the MyPipelineScript.py and params.dat file names, in the pipeline_runner.py command.
Make this script executable (chmod +x PR_script.sh). You can now submit your Pipeline Runner job; don't forget to adjust the requested memory, number of cores, and time limit:
sbatch -A euclid -t 1-00:00:00 -n 1 --mem 3G PR_script.sh
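Once submitted, you can monitor the job with the usual Slurm commands, for example:
squeue -u <youruser>
tail -f slurm-<jobId>.out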
Pilot jobs configuration
The different types of pilot jobs can be configured through parameters of the form pipelinerunner.pilots.c<cores>m<rss>.<config>=<value>, where c<cores>m<rss> is an arbitrary string describing the pilot type and <config> is one of the following six configuration items:
pipelinerunner.pilots.c<cores>m<rss>.cores : number of cores requested by the pilot job
pipelinerunner.pilots.c<cores>m<rss>.ramInMB : quantity of RAM in MB requested by the pilot job
pipelinerunner.pilots.c<cores>m<rss>.walltimeInMin : walltime in minutes requested by the pilot job
pipelinerunner.pilots.c<cores>m<rss>.maxInstances : maximum number of pilot jobs of this type
pipelinerunner.pilots.c<cores>m<rss>.diskspaceInGB : disk space in GB requested by the pilot job
pipelinerunner.pilots.c<cores>m<rss>.tmpPath : temporary directory path for this pilot type
For example, one can define a pilot with 8 cores and 25GB of memory with the following configuration:
pipelinerunner.pilots.c8m25.cores=8
pipelinerunner.pilots.c8m25.ramInMB=25000
pipelinerunner.pilots.c8m25.walltimeInMin=4320
pipelinerunner.pilots.c8m25.maxInstances=1000
pipelinerunner.pilots.c8m25.diskspaceInGB=200
pipelinerunner.pilots.c8m25.tmpPath=$TMPDIR
In the default configuration we provide, several pilot types are defined, which correspond to the most common job queues at CC-IN2P3. You can:
- Add any other pilot type.
- Disable one of the pilot types, either by setting pipelinerunner.pilots.c<cores>m<rss>.maxInstances=0 or simply by removing the pilot type's configuration items.
- Modify the pipelinerunner.pilots.c<cores>m<rss>.maxInstances parameter for the existing pilot types depending on your needs.
Don't forget to adjust the maxInstances values in the configuration file: by default, each pilot type is limited to 10 instances.
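For example, to disable the c8m25 pilot type defined above and raise the limit on the smaller c1m10 type (the values here are purely illustrative):
pipelinerunner.pilots.c8m25.maxInstances=0
pipelinerunner.pilots.c1m10.maxInstances=50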
Test with Dummy Pipeline
The setup can be quickly tested by launching a short Dummy Pipeline. To do so, follow the section above with the following changes:
- Use the available Dummy Pipeline working directory:
cp -r /sps/euclid/Users/ecprod/dummy_PR_workdir $PIPELINEDIR
- Set the PIPELINEDIR variable in PR_script.sh accordingly
- Use the following command to launch the Pipeline Runner :
sbatch -A euclid -t 1:00:00 -p htc -n 1 --mem 3G PR_script.sh
To check whether your job is running, you can use squeue -u <youruser>. This short dummy pipeline completes in about 4 minutes, and you can follow the Slurm log with tail -f slurm-<jobId>.out during the execution. This log corresponds to the Pipeline Runner's log.
Explanation
First, the Pipeline Runner checks the available pilots to launch the pipeline's tasks (5 in this pipeline). After this first step, it starts running tasks on pilots according to the resources required by each task.
Here, only pilots with 1 CPU and 10 GB of memory are available (see the sdc-fr-local.properties file), which is why we only see Pilot__c1m10.... All pilot steps are described in the PR's log.
At the end of the log, there is a summary of all tasks, showing which pilot was used and its status.
TICK PILOT STATUS PAYLOADJOBID DURATION OUTDIR LOGDIR MESSAGE
createListFile_1 Pilot__c1m10__250204_154813.463030 COMPLETED 1.018214 createListFile log/createListFile
parallel_consume_resources_2_2_1_1 Pilot__c1m10__250204_154813.463030 COMPLETED 37.023206 consume_resources_branch.iterations.1.parallel_consume_resources log/consume_resources_branch.iterations.1.parallel_consume_resources
sequential_consume_resources_3 Pilot__c1m10__250204_154853.604304 COMPLETED 39.017862 sequential_consume_resources log/sequential_consume_resources
createOutputProducts_4 Pilot__c1m10__250204_154938.734487 COMPLETED 4.020102 createOutputProducts log/createOutputProducts
createOutputProducts_4.md5.output_data_products.0_4 Pilot__c1m10__250204_154938.734487 COMPLETED 2.01786 createOutputProducts_4.md5.output_data_products.0 log/createOutputProducts_4.md5.output_data_products.0
The Pipeline Runner's logs are always written to the job's slurm-<jobId>.out file. If you have a problem with your results, the answer is probably in these logs.
CloneToLocal
Description
CloneToLocal is a tool that creates a local workdir based on a PPO. The complete documentation is available here. CloneToLocal allows you to re-run a PPO locally:
- It can retrieve a PPO by PPO Id, or it can use a local PPO XML
- It will parse all the entries in the PPO to get ports (inputs) and retrieve associated data (it can take longer if the data are not already available locally)
- It will create the local workdir
- It will create the configuration for the Pipeline Runner for an SDC (SDC-FR or SDC-ES) or for LODEEN.
- It will create the bash script to run the Pipeline Runner in local mode.
- Only for SIM: it can take a local SimRequest, allowing you to run different configurations.
First, don't forget to load the EDEN environment:
source /cvmfs/euclid-dev.in2p3.fr/EDEN-3.1/bin/activate
usage: CloneToLocal [-h] --ppo PPO --output OUTPUT [--easProject EASPROJECT] [--easEnvironment EASENVIRONMENT] --easUser EASUSER [--sdc SDC] [--dss DSS] [--sim SIM] [--pipeline PIPELINE] [--input INPUT] [--easPwd EASPWD] [--script-only] [--PRversion PRVERSION]
[--config-file CONFIG_FILE] [--log-file LOG_FILE] [--log-level LOG_LEVEL] [--version]
optional arguments:
-h, --help show this help message and exit
--ppo PPO Specify a PPO ID to clone, or an xml file.
--output OUTPUT Specify the path where to create the $PIPELINEDIR
--easProject EASPROJECT
Specify the eas project tag ('EUCLID', 'TEST'), by default 'EUCLID'
--easEnvironment EASENVIRONMENT
Specify the eas environment tag ('OPS', 'TEST'), by default 'OPS'
--easUser EASUSER Specify the user name to access data in EAS (cosmos or DB)
--sdc SDC Specify a SDC (SDC-FR or SDC-ES) or LODEEN. Default is SDC-FR.
--dss DSS Specify a DSS to use for data retrieval [Possible values: all DSS supported by ST_ArchiveUtils]. SDC_NL is the default and is the only one public and the one to use from LODEEN. To use SDC-FR DSS you have to be connected from the CC-IN2P3 network
--sim SIM Specify the sim request file, by default taken from the PPO
--pipeline PIPELINE Specify the path to the pipeline, default taken from the PPO
--input INPUT Specify the path where input data could be found
--easPwd EASPWD Specify the password to use to access EAS. If not specified it will be asked during the runtime
--script-only Generate only script
--PRversion PRVERSION
Specify the pipeline runner version to use. Default is 3.3.2.
How to use it
Generic command:
E-Run SDC_FR_TOOLS CloneToLocal --ppo <PPO_ID> --easUser <user_cosmos> --output <path_to_the_workdir>
Real example:
E-Run SDC_FR_TOOLS CloneToLocal --ppo VIS_D4_NOMINAL_20210621T105631-YB26VKOZ-20210621-105812-004 --easUser xyz --output /sps/euclid/Users/mainetti/VIS_D4_NOMINAL_20210621T105631-YB26VKOZ-20210621-105812-004
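After a successful clone, the output directory should contain the items listed in the Description section; a layout like the following can be expected (the file names here are illustrative):
<path_to_the_workdir>/
    sdc-fr-local.properties
    workdir/
        data/
    PR_script.sh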
You can launch the run on the batch system at CC-IN2P3 (SDC-FR) using the following command:
sbatch -A euclid -t 1-00:00:00 -n 1 --mem 3G PR_script.sh
The default batch system used at CC-IN2P3 is Slurm. You can retrieve all the documentation for the Pipeline Runner configuration here.