Prototype status Feb 2015

Matthias Richter edited this page Feb 12, 2015 · 3 revisions

As of Feb 11, 2015, the feature branches for the FLP-EPN distribution setup (Alexey) and the TPC data processing (Matthias) have been fully merged into AliceO2/dev. The scripts are in a feature branch on GitHub: matthiasrichter/AliceO2/dev-scripts.

Environment on the dev cluster

An example of the setup can be found on the hltdev cluster in /home/richter/workdir/epn-schedule-rundir

The following files and directories are required:

cluster-relay-to-tracker.sh # start script for TPC devices
launch-flp-epn.sh           # start script for FLP-EPN setup
data                        # (link to) data directory
nodelist.sh                 # definition of nodes
setup.sh                    # environment setup

The data directory contains the configuration files for all publishers and the binary files. A copy of the OCDB is also needed here. More data directories can be found in /home/richterm/workdir/alfa-rundir
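Before starting anything, it can help to confirm that the run directory is complete. The following is a minimal sketch, not part of the scripts; the function name check_rundir is hypothetical, and the list of entries is taken from the listing above.

```shell
# Hypothetical helper: report entries from the list above that are missing
# in a run directory. The body runs in a subshell, so its variables stay local.
check_rundir() (
    dir="${1:-.}"
    missing=0
    for entry in cluster-relay-to-tracker.sh launch-flp-epn.sh data nodelist.sh setup.sh; do
        # -e follows symbolic links, so a dangling "data" link counts as missing
        if [ ! -e "$dir/$entry" ]; then
            echo "missing: $entry"
            missing=1
        fi
    done
    # exit status 0 only if nothing is missing
    [ "$missing" -eq 0 ]
)
```

Called as check_rundir /path/to/rundir, it prints one line per missing entry and returns a non-zero status if anything is missing.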

Testing the environment:

# log on to a node
ssh node
# change to rundir
cd rundir
# setup the environment
source setup.sh
# start a test device
aliceHLTWrapper GlobalMerger_00 1  --input type=pull,size=1000,method=bind,address=tcp://*:48450  --library libAliHLTTPC.so --component TPCCAGlobalMerger --run 167808

Configuration of the processing hierarchy

The first section of each script can be edited to define the number of FLP and EPN devices and the number of cluster publishers on the FLP nodes. The number of FLP devices has to match the number of FLP nodes needed for the cluster publishing. For example, publishing the full TPC (slices 0 to 35) requires 36 FLP devices with 1 publisher per FLP node, or 9 FLP devices with 4 publishers per FLP node.
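The relation between the slice range, the publishers per node and the number of FLP devices can be written as a small calculation. This is a sketch only: the function name flps_needed is hypothetical, and rounding up for slice counts that do not divide evenly is an assumption, since the page only gives evenly dividing cases.

```shell
# Hypothetical helper: number of FLP devices needed for a slice range.
# The body runs in a subshell, so its variables stay local.
flps_needed() (
    # $1 = first slice, $2 = last slice (inclusive), $3 = publishers per FLP node
    nslices=$(( $2 - $1 + 1 ))
    # Assumption: round up so that a partially filled node still gets an FLP device
    echo $(( (nslices + $3 - 1) / $3 ))
)

flps_needed 0 35 1   # prints 36
flps_needed 0 35 4   # prints 9
```

The result would be the value to use for number_of_flps in launch-flp-epn.sh, given firstslice, lastslice and slices_per_node in cluster-relay-to-tracker.sh.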

launch-flp-epn.sh

###################################################################
# global settings
number_of_flps=36
flp_command_socket=48490
flp_heartbeat_socket=48491
baseport_on_flpgroup=48420
baseport_on_epngroup=48480
number_of_epns=28
number_of_epns_per_node=1
rundir=`pwd`

cluster-relay-to-tracker.sh

###################################################################
# global settings
runno=167808
firstslice=0
lastslice=35
slices_per_node=1
pollingtimeout=100
rundir=`pwd`

Running:

The start scripts have to be executed on one central node from the run directory. Each of the two start scripts creates a list of all devices to be started and opens a screen session for every device. The launch-flp-epn.sh script has to be executed first because it creates configuration output for the second script. The two scripts should therefore be connected via a pipe:

./launch-flp-epn.sh | ./cluster-relay-to-tracker.sh

For every process, the script creates a screen session, connects to the node via ssh, changes to the rundir, sources the setup, and then starts the process.
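The sequence described above could look roughly like the command printed by the sketch below. This is an illustration under assumptions, not the code of the start scripts: the function name print_screen_command, the node name node01 and the run directory path are hypothetical. It only prints the command line and does not start anything.

```shell
# Hypothetical helper: print a command line that opens a detached screen
# session, connects to a node, prepares the environment and starts a device.
print_screen_command() {
    # $1 = screen session name, $2 = node, $3 = run directory, $4 = device command
    printf 'screen -dmS %s ssh %s "cd %s && source setup.sh && %s"\n' "$1" "$2" "$3" "$4"
}

# Example with placeholder values
print_screen_command GlobalMerger_00 node01 /path/to/rundir '<device command>'
```

With screen -dmS the session is created detached under the given name. When the remote command exits, the session ends as well, which fits the observation that a missing screen indicates a terminated process.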

# List of screens
screen -ls
# connecting to a screen
screen -r GlobalMerger

Troubleshooting:

  • make sure that the ssh host keys of all nodes are in ~/.ssh/known_hosts; you have to log on to every node at least once
  • if (some of) the screens are missing from the list, the processes have terminated. Check that the rundir is correctly set up with setup.sh and the data link
  • if a particular screen is missing, you can run the script in 'print' mode (./cluster-relay-to-tracker.sh --print-commands), search for the command, log on to the particular node and execute the command there to see the output and what is going on
  • if the component complains about missing OCDB objects, check the location of the OCDB; it is determined by the variable ALIHLT_HCDBDIR in setup.sh
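The OCDB location can be checked quickly after sourcing setup.sh. The following is a minimal sketch: the function name check_ocdb is hypothetical, and it assumes that ALIHLT_HCDBDIR holds either a plain directory path or a local:// URI.

```shell
# Hypothetical helper: check that ALIHLT_HCDBDIR is set and points to a directory.
check_ocdb() {
    if [ -z "${ALIHLT_HCDBDIR:-}" ]; then
        echo "ALIHLT_HCDBDIR is not set; was setup.sh sourced?"
        return 1
    fi
    # Assumption: the value is a plain path or a local:// URI; strip the prefix if present
    ocdb_path=${ALIHLT_HCDBDIR#local://}
    if [ ! -d "$ocdb_path" ]; then
        echo "OCDB directory not found: $ocdb_path"
        return 1
    fi
    echo "OCDB found: $ocdb_path"
}
```

Run it on the node where the component complained, in the same shell in which setup.sh was sourced.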