HPCs: Which, and how?
Opportunistic resources
Some centers have computing resources and are willing to contribute them to LHCb, but are not part of WLCG. LHCb would like to exploit all of them in an OPPORTUNISTIC way
Opportunistic means...
Opportunistic, in THEORY, means:
● from time to time: “grab it while you can” (but sometimes we are granted a quota)
● for “free”
● with no guaranteed support
Not only volunteer computing (BOINC):
● local clusters?
● HPCs?
HPCs for LHCb?
● There are many HPCs out there
○ many differ quite a lot from one another
○ quite a different paradigm from the HTC we do on “the Grid”
■ ...and we use HPCs “like HTCs”
○ not all of them are heavily used
■ utilization may not be super-high
■ and, especially, not continuous
○ and to some, we can have access
Not as many as ATLAS…
… anyway, we already run jobs, in production, on 2 HPCs
● and we’d like to add at least one more
DIRAC.OSC.us
osc.edu
4 GB/core guaranteed
~easy setup!
LCG.CSCS-HPC.ch
cscs.ch — the Swiss National Supercomputing Centre (3rd in the TOP500 list)
an LCG site, for us (ARC CE + slurm)
… so this was “transparent” for us
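Behind an ARC CE, pilot submission ends up as an ordinary batch job on slurm. As a rough illustration only — the partition, account, and pilot URL below are invented placeholders, not the actual CSCS configuration — a pilot could land on the batch system as a script like the one rendered here:

```python
# Sketch: render a minimal slurm batch script that would launch a DIRAC
# pilot on a worker node. All values are hypothetical placeholders,
# NOT the real CSCS or LHCbDIRAC settings.
def render_pilot_script(partition="normal", account="lhcb",
                        pilot_url="https://example.org/pilot/dirac-pilot.py"):
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --account={account}",
        "#SBATCH --ntasks=1",
        "#SBATCH --mem-per-cpu=2G",  # matches the 2 GB/core requirement
        "",
        "# fetch and start the pilot; it then pulls jobs from DIRAC",
        f"curl -sSL {pilot_url} -o dirac-pilot.py",
        "python dirac-pilot.py",
    ])

print(render_pilot_script())
```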
Schematic view of “the Grid”
Job running on “the Grid”
~easy integration when:
● WNs have inbound/outbound connectivity
● LHCb CVMFS mounted on the WNs
● SLC6 “compatible”
● at least 2 GB/core
● x86
This is the case for OSC and CSCS.
When some of the requirements above are not met, we can try to work around them, but this requires dedicated work (and it may not be possible anyway, case by case)
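The checklist above can be mechanized. A minimal sketch — the dict layout, function name, and thresholds are illustrative, not an actual DIRAC interface:

```python
# Sketch: check a worker node against the integration requirements listed
# above. The data structure and function are invented for illustration.
REQUIRED_MEM_PER_CORE_GB = 2

def unmet_requirements(wn):
    """Return the list of integration requirements a worker node fails."""
    problems = []
    if not wn.get("outbound_connectivity"):
        problems.append("no inbound/outbound connectivity")
    if not wn.get("cvmfs_mounted"):
        problems.append("LHCb CVMFS not mounted")
    if not wn.get("slc6_compatible"):
        problems.append("not SLC6 compatible")
    if wn.get("mem_gb", 0) / wn.get("cores", 1) < REQUIRED_MEM_PER_CORE_GB:
        problems.append("less than 2 GB/core")
    if wn.get("arch") != "x86_64":
        problems.append("not x86")
    return problems

# An OSC/CSCS-like node passes all checks:
node = {"outbound_connectivity": True, "cvmfs_mounted": True,
        "slc6_compatible": True, "cores": 24, "mem_gb": 64, "arch": "x86_64"}
print(unmet_requirements(node))  # → []
```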
Santos Dumont (LNCC) HPC at LNCC
sdumont nicely documented in Portuguese only :)
504 B710 computing nodes (thin nodes), where each node has:
● 2 x Intel Xeon E5-2695v2 Ivy Bridge CPU, 2.4 GHz
● 24 cores (12 per CPU), totaling 12,096 cores
● 64 GB DDR3 RAM
198 B715 computing nodes (thin nodes) with K40 GPUs, where each node has:
● 2 x Intel Xeon E5-2695v2 Ivy Bridge CPU, 2.4 GHz
● 24 cores (12 per CPU), totaling 4,752 cores
● 64 GB DDR3 RAM
● 2 x Nvidia K40 (GPU device)
...
Santos Dumont (LNCC) /2
... 54 B715 computing nodes (thin nodes) with Xeon Phi co-processors, where each node has:
● 2 x Intel Xeon E5-2695v2 Ivy Bridge CPU, 2.4 GHz
● 24 cores (12 per CPU), totaling 1,296 cores
● 64 GB DDR3 RAM
● 2 x Xeon Phi 7120 (MIC device)
1 MESCA2 computing “fat” node:
● 16 x Intel Ivy Bridge CPU, 2.4 GHz
● 240 cores (15 per CPU)
● 6 TB of RAM
The 756 nodes of SDumont are interconnected by an Infiniband FDR network, with the following technical configuration:
● 1,944 ports
● 58 Gb/s and 0.7 µs latency per port
● total throughput = 112,752 Gb/s
● flow per port = 137 million messages per second
...
Santos Dumont (LNCC) /3
… Finally, SDumont has a Lustre parallel file system, integrated with the Infiniband network, with a gross storage capacity of 1.7 PB, as well as a secondary file system with a gross capacity of 640 TB.
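The per-partition numbers quoted above are mutually consistent; a quick cross-check of the node and core counts:

```python
# Cross-check the SDumont node and core counts quoted on the previous slides.
partitions = {
    "B710 thin":       {"nodes": 504, "cores_per_node": 24},  # 12,096 cores
    "B715 + K40":      {"nodes": 198, "cores_per_node": 24},  #  4,752 cores
    "B715 + Xeon Phi": {"nodes": 54,  "cores_per_node": 24},  #  1,296 cores
}
for name, p in partitions.items():
    print(name, p["nodes"] * p["cores_per_node"])

# The interconnected node count quoted above:
thin_nodes = sum(p["nodes"] for p in partitions.values())
print(thin_nodes)  # → 756

# Adding the single 240-core MESCA2 fat node gives the grand total:
total_cores = sum(p["nodes"] * p["cores_per_node"]
                  for p in partitions.values()) + 240
print(total_cores)  # → 18384
```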
And the worker nodes have no external connectivity. We have a friendly agreement to grab some of its resources, and they installed CVMFS for us.
WNs within “closed doors”
And we have been allowed to do it.
Zooming in... LHCbDIRAC v9r0
[Architecture diagram: a DIRAC server installation on the login node hosts the DIRAC services and a Site Director; the Site Director submits pilot wrappers through a gateway; pilots on the worker nodes pull and run the jobs; a RequestExecutingAgent, the ReqManager, and a DIRAC SE proxy (?) move data between the computing site SEs and a DIRAC SE.]
The rest looks quite similar to our BOINC server setup (without security complications)
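Conceptually, a pilot on a worker node keeps asking the DIRAC services for work; with the WNs behind closed doors, every such call has to be relayed through the gateway. A toy sketch of that loop — every class and method name here is invented for illustration, not a real DIRAC or LHCbDIRAC API:

```python
# Toy sketch of a pilot's matching loop behind a gateway. The Gateway class
# stands in for whatever relays requests from the closed worker nodes to
# the DIRAC services; it is NOT a real DIRAC component.
class Gateway:
    """Relays worker-node requests to the DIRAC services."""
    def __init__(self, queued_jobs):
        self.queued_jobs = list(queued_jobs)

    def request_job(self, capabilities):
        # A real matcher would compare job requirements (platform, memory,
        # ...) with the worker-node capabilities before handing a job out.
        if self.queued_jobs:
            return self.queued_jobs.pop(0)
        return None

def run_pilot(gateway, capabilities):
    """Fetch and 'run' jobs until no more work comes through the gateway."""
    done = []
    while (job := gateway.request_job(capabilities)) is not None:
        done.append(job)  # a real pilot would fork/exec the payload here
    return done

gw = Gateway(["job-1", "job-2"])
print(run_pilot(gw, {"platform": "x86_64"}))  # → ['job-1', 'job-2']
```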
Status and To-Do
Status:
● we have a “personal” login, and I used it to simply match a “hello world” job
To-Do:
● first “installation”
● testing, “certification”
● needs administration
○ yet another installation
○ not only babysitting
○ experience shows we’ll probably see scaling issues
Questions?