Computing
Slurm computing platform
The description of the computing platform and the corresponding documentation are available here.
Launch a job
For more information on job submission, see the complete documentation here.
To submit a job on the computing platform, the sbatch command must be used with the following syntax:
sbatch -A euclid -t 0-00:30 -n 1 --mem 2G job.sh
where -t <d-hh:mm> is the time limit, -n <number> is the number of cores to be used, --mem <size> is the amount of memory, and job.sh is the job script.
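For illustration, a minimal job.sh could look like the following (the payload command and file name are hypothetical; adapt them to your own workload):
#!/bin/sh
# Hypothetical payload: report the worker node, then run your executable
echo "Running on $(hostname)"
./my_program --input input.fits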
Job profiling
It is possible to profile your job on the computing platform. An HTML file is created that can be opened in a browser to display profiling information and graphs, along with an XML file containing the raw profiling values.
To use this option, activate the profiling agent by adding --profile=task to your submission line:
sbatch -A euclid -t 0-01:00 -n 3 --mem 7G --profile=task [--acctg-freq=task=10] job.sh
You can retrieve the complete documentation of this option here.
Once the job has finished, generate the profiling report by submitting a dependent job:
sbatch -A euclid -t 0-01:00 -n 1 --mem 1G -d <jobid> slurm_profiling <jobid>
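For example, if the job to be profiled was submitted as job 123456 (a hypothetical job ID), the follow-up submission would be:
sbatch -A euclid -t 0-01:00 -n 1 --mem 1G -d 123456 slurm_profiling 123456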
Pipeline Runner
This documentation provides some help for running pipelines using the Euclid Pipeline Runner in standalone mode (or console mode) at CC-IN2P3.
By standalone we mean:
- With no use of the metascheduler, hence no interface with EAS-DPS and EAS-DSS (e.g. no PPO).
- With no use of the Pipeline Run Server. The Pipeline Runner is simply launched as an executable (pipeline_runner.py) that terminates once the pipeline ends (no web server with workflow plot for instance).
The general workflow that is recommended is the following:
- A Pipeline Runner process is launched within a job on the batch farm. This job will keep running as long as the pipeline is executing.
- This Pipeline Runner process will submit pilot jobs to the batch system.
- Pilot jobs will start, request payload jobs (i.e. PF tasks) from the Pipeline Runner, and execute them.
Setup and configuration
The first thing to prepare in order to launch a pipeline is the pipeline directory (typically in your /sps/euclid/ user space).
In our example setup it is called $PIPELINEDIR, and the directory tree looks like the following:
$PIPELINEDIR/
    workdir/
The PIPELINEDIR directory must also contain a Pipeline Runner configuration file, sdc-fr-local.properties (see the attached example). For more details on this configuration file, see the Pilot jobs configuration section below.
The workdir directory usually contains a .dat configuration file and a data directory with all your input files.
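Putting everything together, an example layout (with the file names used in the PR_script.sh example below) would be:
$PIPELINEDIR/
    sdc-fr-local.properties
    workdir/
        params.dat
        data/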
You now need to create a script that will set up your environment and launch the Pipeline Runner. This script, PR_script.sh, should have the following content:
#!/bin/sh
# Setup the Pipeline Runner environment variables
## DON'T CHANGE THIS SECTION ###############################################################################################
export PR=/cvmfs/euclid-dev.in2p3.fr/COMPONENTS/INFRA/ST_PipelineRunner/3.3.2
export PATH=/cvmfs/euclid-dev.in2p3.fr/COMPONENTS/INFRA/ST_PipelineRunner/3.3.2/bin:$PATH
############################################################################################################################
#### UPDATE THE PATH TO PIPELINE DIR #######################################################################################
export PIPELINEDIR=/sps/euclid/Users/foo/pipeline/
############################################################################################################################
# Launch the Pipeline Runner
########### UPDATE THE PIPELINE NAME AND, IF NEEDED, THE MyConfigDat.dat NAME ##############################################
$PR/bin/python $PR/bin/pipeline_runner.py localrun --pipeline="MyPipelineScript.py" --config="$PIPELINEDIR/sdc-fr-local.properties" --data="$PIPELINEDIR/workdir/params.dat" --shortid --edenVersion=eden-3.1
You need to adapt the PIPELINEDIR variable, as well as the MyPipelineScript.py and params.dat file names, in the pipeline_runner.py command.
Make this script executable (chmod +x PR_script.sh). You can now submit your Pipeline Runner job; don't forget to adjust the requested memory, number of cores, and time limit:
sbatch -A euclid -t 1-00:00:00 -n 1 --mem 3G PR_script.sh
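Once submitted, you can monitor the job with the usual Slurm commands, for example:
squeue -u <youruser>
tail -f slurm-<jobId>.out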
Pilot jobs configuration
The different types of pilot jobs can be configured through parameters of the form pipelinerunner.pilots.c<cores>m<rss>.<config>=<value>, where c<cores>m<rss> is an arbitrary string describing the pilot type and <config> is one of the following six configuration items:
pipelinerunner.pilots.c<cores>m<rss>.cores : number of cores requested by the pilot job
pipelinerunner.pilots.c<cores>m<rss>.ramInMB : quantity of RAM in MB requested by the pilot job
pipelinerunner.pilots.c<cores>m<rss>.walltimeInMin : walltime in minutes requested by the pilot job
pipelinerunner.pilots.c<cores>m<rss>.maxInstances : maximum number of pilot jobs of this type
pipelinerunner.pilots.c<cores>m<rss>.diskspaceInGB : disk space in GB requested by the pilot job
pipelinerunner.pilots.c<cores>m<rss>.tmpPath : temporary directory path for this pilot type
For example, one can define a pilot with 8 cores and 25GB of memory with the following configuration:
pipelinerunner.pilots.c8m25.cores=8
pipelinerunner.pilots.c8m25.ramInMB=25000
pipelinerunner.pilots.c8m25.walltimeInMin=4320
pipelinerunner.pilots.c8m25.maxInstances=1000
pipelinerunner.pilots.c8m25.diskspaceInGB=200
pipelinerunner.pilots.c8m25.tmpPath=$TMPDIR
In the default configuration we provide, several pilot types are defined, which correspond to the most common job queues at CC-IN2P3. You can:
- Add any other pilot type.
- Disable one of the pilot types, either by setting pipelinerunner.pilots.c<cores>m<rss>.maxInstances=0 or simply by removing the pilot type's configuration items.
- Modify the pipelinerunner.pilots.c<cores>m<rss>.maxInstances parameter for the existing pilot types depending on your needs.
Don't forget to adjust the maxInstances values in the configuration file: by default, each pilot type is limited to 10 instances.
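For example, to disable the c8m25 pilot type defined above and raise the limit on the smaller c1m10 type (the values here are purely illustrative):
pipelinerunner.pilots.c8m25.maxInstances=0
pipelinerunner.pilots.c1m10.maxInstances=50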
Test with Dummy Pipeline
The setup can be quickly tested by launching a short Dummy Pipeline. To do so, follow the section above with the following changes:
- Use the available Dummy Pipeline working directory:
cp -r /sps/euclid/Users/ecprod/dummy_PR_workdir $PIPELINEDIR
- Set the PIPELINEDIR variable in PR_script.sh accordingly
- Use the following command to launch the Pipeline Runner :
sbatch -A euclid -t 1:00:00 -p htc -n 1 --mem 3G PR_script.sh
To check whether your job is running, you can use squeue -u <youruser>. This short dummy pipeline completes in about 4 minutes, and you can follow the Slurm log with tail -f slurm-<jobId>.out during the execution. This log corresponds to the Pipeline Runner's log.
Explanation
First, the Pipeline Runner checks the available pilots to launch the pipeline's tasks (5 in this pipeline). After this first step, it starts running tasks on pilots according to the resources required by each task.
Here, only pilots with 1 CPU and 10 GB of memory are available (see the sdc-fr-local.properties file), which is why we only see Pilot__c1m10.... All pilot steps are described in the PR's log.
At the end of the log, there is a summary of all tasks, showing which pilot was used and its status.
TICK PILOT STATUS PAYLOADJOBID DURATION OUTDIR LOGDIR MESSAGE
createListFile_1 Pilot__c1m10__250204_154813.463030 COMPLETED 1.018214 createListFile log/createListFile
parallel_consume_resources_2_2_1_1 Pilot__c1m10__250204_154813.463030 COMPLETED 37.023206 consume_resources_branch.iterations.1.parallel_consume_resources log/consume_resources_branch.iterations.1.parallel_consume_resources
sequential_consume_resources_3 Pilot__c1m10__250204_154853.604304 COMPLETED 39.017862 sequential_consume_resources log/sequential_consume_resources
createOutputProducts_4 Pilot__c1m10__250204_154938.734487 COMPLETED 4.020102 createOutputProducts log/createOutputProducts
createOutputProducts_4.md5.output_data_products.0_4 Pilot__c1m10__250204_154938.734487 COMPLETED 2.01786 createOutputProducts_4.md5.output_data_products.0 log/createOutputProducts_4.md5.output_data_products.0
The Pipeline Runner's logs are always written to the job's slurm-<jobId>.out file. If you have a problem with your results, the answer is probably in these logs.
CloneToLocal
Description
CloneToLocal is a tool that creates a local workdir based on a PPO. The complete documentation is available here. CloneToLocal allows you to re-run a PPO locally:
- It can retrieve a PPO by PPO Id, or it can use a local PPO XML
- It will parse all the entries in the PPO to get ports (inputs) and retrieve associated data (it can take longer if the data are not already available locally)
- It will create the local workdir
- It will create the configuration for the Pipeline Runner for an SDC (SDC-FR or SDC-ES) or for LODEEN.
- It will create the bash script to run the Pipeline Runner in local mode.
- Only for SIM: it can take a local SimRequest, allowing you to run different configurations.
First, don't forget to load the EDEN environment:
source /cvmfs/euclid-dev.in2p3.fr/EDEN-3.1/bin/activate
usage: CloneToLocal [-h] --ppo PPO --output OUTPUT [--easProject EASPROJECT] [--easEnvironment EASENVIRONMENT] --easUser EASUSER [--sdc SDC] [--dss DSS] [--sim SIM] [--pipeline PIPELINE] [--input INPUT] [--easPwd EASPWD] [--script-only] [--PRversion PRVERSION]
[--config-file CONFIG_FILE] [--log-file LOG_FILE] [--log-level LOG_LEVEL] [--version]
optional arguments:
-h, --help show this help message and exit
--ppo PPO Specify a PPO ID to clone, or an xml file.
--output OUTPUT Specify the path where to create the $PIPELINEDIR
--easProject EASPROJECT
Specify the eas project tag ('EUCLID', 'TEST'), by default 'EUCLID'
--easEnvironment EASENVIRONMENT
Specify the eas environment tag ('OPS', 'TEST'), by default 'OPS'
--easUser EASUSER Specify the user name to access data in EAS (cosmos or DB)
--sdc SDC Specify a SDC (SDC-FR or SDC-ES) or LODEEN. Default is SDC-FR.
--dss DSS Specify a DSS to use for data retrieval [Possible values: all DSS supported by ST_ArchiveUtils]. SDC_NL is the default and is the only one public and the one to use from LODEEN. To use SDC-FR DSS you have to be connected from the CC-IN2P3 network
--sim SIM Specify the sim request file, by default taken from the PPO
--pipeline PIPELINE Specify the path to the pipeline, default taken from the PPO
--input INPUT Specify the path where input data could be found
--easPwd EASPWD Specify the password to use to access EAS. If not specified it will be asked during the runtime
--script-only Generate only script
--PRversion PRVERSION
Specify the pipeline runner version to use. Default is 3.3.2.
How to use it
Generic command:
E-Run SDC_FR_TOOLS CloneToLocal --ppo <PPO_ID> --easUser <user_cosmos> --output <path_to_the_workdir>
Real example:
E-Run SDC_FR_TOOLS CloneToLocal --ppo VIS_D4_NOMINAL_20210621T105631-YB26VKOZ-20210621-105812-004 --easUser xyz --output /sps/euclid/Users/mainetti/VIS_D4_NOMINAL_20210621T105631-YB26VKOZ-20210621-105812-004
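After a successful clone, the output directory should contain the items listed in the Description section; a layout like the following can be expected (the file names here are illustrative):
<path_to_the_workdir>/
    sdc-fr-local.properties
    workdir/
        data/
    PR_script.sh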
You can launch the run on the batch system at CC-IN2P3 (SDC-FR) using the following command:
sbatch -A euclid -t 1-00:00:00 -n 1 --mem 3G PR_script.sh
The default batch system used at CC-IN2P3 is Slurm. You can retrieve all the documentation for the Pipeline Runner configuration here.