Simulating the Universe from quarks to galaxy clusters with DiRAC HPC

40 powers of 10
Simulating the Universe from
quarks to galaxy clusters with the
DiRAC HPC Facility
Dr Mark Wilkinson
University of Leicester
Director, DiRAC HPC Facility
DiRAC

DiRAC
Distributed HPC Facility for theoretical astrophysics,
particle physics, cosmology and nuclear physics
Combined:
• ~90,000 cores
• ~5 Pﬂop/s
• >10 PB storage
Extreme Scaling (Edinburgh)
Data Intensive (Cambridge)
Data Intensive
(Leicester)
Memory Intensive (Durham)

The DiRAC Facility, a brief history
2009: DiRAC-1: systems installed at 13 host sites
Nov 2011: DiRAC-2 awarded £15M capital by BIS
- 5 systems at 4 sites - Cambridge, Durham, Edinburgh, Leicester
- Procurement completed in 100 days
Dec 2012: DiRAC-2 operations begin
- Systems full within 1 week - usage >90%
- Access via international peer-review process
- Free to use for STFC-funded researchers
April 2017: DiRAC-2.5 operations begin
- 3 services: Extreme Scaling, Memory Intensive, Data Intensive
April 2018: DiRAC-2.5x
- interim funding to replace 2012 hardware and support 2018/19
science programme
Dec 2018: DiRAC-2.5y
- BEIS-funded stop-gap 2x upgrade of all DiRAC services

• Direct numerical simulation and modelling is core theory research activity
• HPC systems are main scientific instruments for theory research
• Computational requirements of models increasing due to
• Increased resolution: running models with existing physics at finer
scales.
• Increased complexity: introducing new physics into models to reflect
progress in theoretical understanding; often needed to match resolution
• Coupling of models: multi-physics, multi-scale modelling;
• Quantification of modelling uncertainty using large ensembles of
simulations to provide robust statistics
• Constant process of refining and re-defining our tools
• Growing requirement for simulations and modelling concurrent with
observations so that models evolve in line with data acquired
• Observational facilities need access to significant local computing
capabilities as well as option to burst out to the larger, national facilities.
Computing in theory research

The DiRAC approach to service design
Science
case
Workflow
assessment
Technical
case
Peer-reviewed, scientific justification for resources
Peer-reviewed, high-level technical specifications
Technical
design
Co-design with
industry partners
Individual service specifications

Diverse science cases require
heterogenous architectures
Extreme Scaling
“Tesseract”
(Edinburgh)
Memory Intensive
“COSMA”
(Durham)
Data Intensive
“DIaL” and “CSD3”
(Leicester & Cambridge)
230 TB RAM to
support largest
cosmological
simulations
Heterogeneous
architecture to support
complex simulation and
modelling workﬂows
2 Pﬂop/s to
support largest
lattice-QCD
simulations
DiRAC

scientific impact is highly significant, sustained and of world leading quality.
Figure 3-7: Facility Citations and Paper Record 2015-2018
3.5 WP5: Project Management
DiRAC Science Impact Analysis
• Refereed publications since 2012: > 950 papers; > 49000 citations
• Second-most cited paper in all astronomy in 2015
• 199 refereed papers in 2017

DiRAC@Edinburgh – Extreme Scaling (ES)
• HPE SGI 8600
• >1.8 PF
• 1468 standard compute nodes
• Two 2.1GHz 12-core Intel Xeon Skylake 4116 processors
• 96GB RAM
• 8 GPU compute nodes
• Two 2.1GHz 12-core Intel Xeon Skylake 4116 processors
• 96 GB RAM
• 4 NVidia V100 (Volta) GPU accelerators connected over PCIe
• Hypercube Intel Omnipath interconnect
• 3PB Lustre DDN storage
“Tesseract”

Inside the proton
Image: Arpad Horvath - Own work, CC BY-SA 2.5,
https://commons.wikimedia.org/w/index.php?curid=637353
quark
Strong interactions
carried by gluons

Understanding the strong force is key to testing the
Standard Modal of Particle Physics - binds quarks into
hadrons that we see in experiment
ATLAS@Large Hadron
Collider (LHC)
Connecting observed hadron
properties to those of quarks
requires full non-perturbative
treatment of Quantum
Chromodynamics - lattice QCD
DALI Run=16449 Evt=4055
PH
1cm|
IP
B
D
ππ
KK
K
π
e−
−
−
+
+
+s
s
Calculations of
hadron masses and
rates for simple weak
decays allow tests of
Standard Model

• L4
local volume (space + time)
• ﬁnite di↵erence operator 8 point stencil
• ⇠ 1
L of data references come from o↵ node
Scaling QCD sparse matrix requires interconnect bandwidth for halo exchange
Bnetwork ⇠
Bmemory
L
⇥R
where R is the reuse factor obtained for the stencil in caches
• Aim: Distribute 1004
datapoints over 104
nodes
Scalable QCD: sparse matrix PDE solver communications (Boyle et al.)

Immediate Big HPC roadmap
0.1
1
10
100
1000
2010 2012 2014 2016 2018 2020 2022
PF/system
IBM-BGQ/Mira
Intel-Broadwell/Cori1
Intel-KNL/Cori2
Intel-KNL/Oakforest PACS
IBM/Nvidia/Summit/Sierra
Intel-KNH/Aurora
Fujitsu-ARM/post-K
Frontier/ORNL
0.1
1
10
100
1000
2010 2012 2014 2016 2018 2020 2022
SPTF/node
IBM-BGQ/Mira
Intel-KNL/Cori2
IBM/Nvidia/Summit
0.1
1
10
100
1000
2010 2012 2014 2016 2018 2020 2022
commsGB/s/node
IBM-BGQ/Mira
Intel-KNL/Cori2
IBM/Nvidia/Summit/Sierra
• 400x+ increase in SP node performance accompanied by NO increase in interconnect
Immediate “Big HPC” roadmap

Grid QCD code
Design considerations
• Performance portable across multi and many core CPU’s
SIMD⌦OpenMP⌦MPI
• Performance portable to GPU’s
SIMT⌦ofﬂoad⌦MPI
• N-dimensional cartesian arrays
• Multiple grids
• Data parallel C++ layer : Connection Machine inspired
Grid QCD code
(Boyle et al.)

• Relatively cheap node: high node count and scalability
• Improve price per core, bandwidth per core, reduce power
Tesseract performance per node vs nodes, volume
GF/spernode
0.0
150.0
300.0
450.0
600.0
Nodes
1 16 32 64 128 256
12^4 16^4 24^4
nodes (single switch) delivers bidirectional 25GB/s to every node (wirespeed)
2 nodes topology aware bidirectional 19GB/s
• 76% wirespeed using every link in system concurrently
DiRAC HPE ICE-XA hypercube network
• Edinburgh HPE 8600 system (Installed March 2018, 2 days ahead of schedule)
• Low end Skylake Silver 4116, 12 core parts
• Single rail Omnipath interconnect
GF/spernode
0.0
150.0
300.0
450.0
600.0
Nodes
1 16 32 64 128 256
12^4 16^4 24^4
• 16 nodes (single switch) delivers bidirectional 25GB/s to every node (wirespeed)
• 512 nodes topology aware bidirectional 19GB/s
GF/spernode
0.0
150.0
300.0
450.0
600.0
Nodes
1 16 32 64 128 256
12^4 16^4 24^4
• 16 nodes (single switch) delivers bidirectional 25GB/s to every node (wirespeed)
• 512 nodes topology aware bidirectional 19GB/s
GF/spernode
0.0
150.0
300.0
450.0
600.0
Nodes
1 16 32 64 128 256
12^4 16^4 24^4
16 nodes (single switch) delivers bidirectional 25GB/s to every node (wirespeed)
512 nodes topology aware bidirectional 19GB/s
• Small project with SGI/HPE on Mellanox EDR networks (James Southern)
• Embed 2n
QCD torus inside hypercube so that nearest neigbour comms travels single hop
4x speed up over default MPI Cartesian communicators on large systems
) Customise HPE 8600 (SGI ICE-XA) to use 16 = 24
nodes per leaf switch
Boyle et al.

Other Turing work
• Intel’s US Patent Application 20190042544
(FP16-S7E8 MIXED PRECISION FOR DEEP LEARNING AND OTHER ALGORITHMS)
• Authors: Boyle (ATI, Edinburgh), Sid Kashyap, Angus Lepper (Intel, former DiRAC RSE’s)
• Systematic study using Gnu Multi Precision library.
• BFP16 displays greater numerical stability for machine learning training.
• Understanding: histogramme of multiply results during SGD gradient calculations
• Patent application full text searchable on uspto.gov
DiRAC / Alan Turing Institute / Intel Collaboration

Software Innovation - AI on HPC
cessors.
��
��
��
��
�
��
��
��
��
��
G. 5: Wall clock time per reduction call vs. vector length before and after our optimisation. The large vector reducti
formance is ten times better after our optimisations on large vector lengths. The gain includes both computation accelerati
d communication acceleration.
the interface to memory management is not ideal.
ther, it is likely that method dealloc and alloc routine provided by collectives.h should have been used consistent
th use of any other “free” operation declared illegal. Further, this would have enabled simpler modification of t
ocation and deallocation implementation. It would also have been better to separate the vector allocation from t
duction operation; so that in hot loops the programmer could choose to reuse the same allocation.
ur first optimisations were high level: i) remove the expectation that the caller deallocates the returned vector;
Boyle et al.
2017
• Demonstration of factor 10 speed-up in the Baidu Research
optimised reduction code

- a publicly available code designed to optimise the performance
limiting steps in distributed machine learning (ML)
• potentially disruptive implications for design of cloud systems

- shows that ML workflows can achieve 10x performance
improvement when fully optimised and implemented on traditional
HPC architectures

gure 1: Left panel: A ‘fan’ plot for the light quark ud and us decay constants (normalised with the
erage ud, us decay constant) for one lattice spacing against the pion mass; Right panel: The heavy b
uark ub, sb decay constants, again normalised, for several lattice spacings also plotted against the
on mass.
ow further extended this programme to the heavier quark masses and in particular the b, [2].
the right panel of the figure we give the equivalent plot for the B and Bs mesons, together
ith the present (FLAG16) values.
Bornyakov et al. 2017 Hollitt et al. 2018
Testing the standard model of particle physics
• Mesons are composed of quark-antiquark pairs

• Their decay rates into other particles provide a strong test of the standard model

• Anomalies may point to new physics

• DiRAC researchers have developed new, computationally eﬃcient technique for
estimating decay rates

DiRAC@Durham - Memory Intensive (MI)
• COSMA6 (2016)
• IBM/Lenovo/DDN
• 8192 Intel SandyBridge, 2.6 GHz cores
• Combined 65 Tbyte of RAM;
• Mellanox FDR10 interconnect in 2:1 blocking;
• 2.5 Pbyte of Lustre data storage
• COSMA7 (2018)
• Dell
• 23,600 Intel Skylake 4116 cores
• Interconnect: Mellanox EDR in a 2:1 blocking conﬁguration with
islands of 24 nodes;
• Combined 230 Tbyte of RAM;
• a fast checkpointing i/o system (343 Tbyte) with peak performance
of 185 Gbyte/second write and read;
• >4 PB storage
Durham – MI
m was delivered by Dell in March 2018;
y Alces
service started on 1 May 2018
engagement:
closely with the Industrial Strategy of
partment for Business, Energy and
ial Strategy (BEIS).
g leads to industrial engagement; this
in innovation, both leading to
ts academia and the wider industry.

The Universe, a brief history
Credit: NASA / WMAP Science Team
DiRAC supports
calculations on all
physical scales

The components of the Universe
Credit: NASA / WMAP Science Team & Planck Satellite Team
26.8%
4.9%
68.3%

ICC Durham; Virgo Consortium
The EAGLE Simulation
DiRAC
Dark matter (N-body)

The EAGLE Simulation
ICC Durham; Virgo ConsortiumDiRAC
Gas and stars (hydrodynamics)

Research Software Engineers
“With great power comes great responsibility”
• Science requirements for DiRAC-3 demand 10-40x increases in
computing power to stay competitive
- hardware alone cannot deliver this
• We can no longer rely on “free lunch” from the Xeon era
• Vectorisation and code efficiency now critical
• Next generation hardware is more difficult to program efficiently
• RSEs are increasingly important
- RSEs can help with code profiling, optimisation, porting, etc
• DiRAC now has 3 RSEs: effort allocated via peer review process
algorithmic
computational
David Keyes

FirstGalaxies
SolarSystemForms
Reionization
Milky Way
13121110987654321
Age of the Universe (Gyr)
Present DayBig Bang
The Evolution of the Universe
First simulation on DiRAC-2.5y Memory
Intensive system at Durham carried out
with SWIFT, a revolutionary new
cosmological hydrodynamics code
developed at Durham which is 20x faster
than the state-of-the-art.
Key Problems
Origin of galaxies
Indentity of the dark matter
Nature of the dark energy

• A fast checkpointing i/o system (343 Tbyte) with peak performance of
185 Gbyte/second write and read
• 15 Lustre Object Storage Servers on Dell 640 nodes:
• 2 x Intel Skylake 5120 processors
• 192 Gbyte of RAM
• 8 x 3.2 TB NVMe SFF drives.
• 1 Mellanox EDR card
• A user code benchmark produced 180 Gbyte/second write and read –
this is almost wire-speed!
• At time of installation this was the fastest ﬁlesystem in production in
Europe
Memory Intensive:
burst buffer for checkpointing

0
500,000
1,000,000
1,500,000
2,000,000
2,500,000
3,000,000
24-hour 12-hour 6-hour 4-hour 2-hour 1-hour
KiloWatthoursover5years
Axis Title
Power usage for snapshots with different performance solutions
30GB/sec 120GB/sec 140GB/sec
Memory Intensive:
Power saving
Heck (2018)

Core hours lost in checkpointing over 5 years
Snapshot
period 24-hr 12-hr 6-hr 4-hr 2-hr 1-hr
Hours/
snapshot
30GB/sec 7.1M 14.2M 28.4M 42.5M 85.1M 170.1M 0.95
120GB/sec 1.8M 3.5M 7.1M 10.6M 21.3M 42.5M 0.24
140GB/sec 1.5M 3.0M 6.1M 9.1M 18.2M 36.54M 0.20
• Total number of available cpu hours per year: 36M (4116 cores)
• ~13% gain in available core hours due to faster checkpointing
Memory Intensive:
Heck (2018)

Uranus collision?
8 J. A. Kegerreis et al.
10 5 0
x Position (R )
10
5
0
yPosition(R)
10 5 0
x Position (R )
106
107
108
SpeciﬁcInternalEnergyJkg1
Figure 6. A mid-collision snapshots of a grazing impact with 108 SPH particles – compared with the more head-on collision in Fig. 5 –
coloured by their material and internal energy, showing some of the detailed evolution and mixing that can now be resolved. In the left
panel, light and dark grey show the target’s ice and rock material, respectively, and purple and brown show the same for the impactor.
Light blue is the target’s H-He atmosphere.
mon aim in cosmology, to simulate a larger patch of the uni-
verse), or we can study a small system with higher resolution
to model smaller details.
not) question that we ask a computer to solve, so this is an
important ﬁrst step.
Kegerreis et al.
(2019), subm.
Planet collisions
Giant impact explains Uranus spin axis and cold temperature
• 108-particle simulations carried out on DiRAC Memory Intensive service
(Durham) using new hydro+Nbody code, Swift
• Hardware and software for this calculation developed with DiRAC support.

DiRAC & the discovery of Gravitational waves
On September 14, 2015 at 09:50:45 UTC, the LIGO
Hanford, WA, and Livingston, LA, observatories detected
pact binary waveforms [44] recovered GW
most significant event from each detector f
tions reported here. Occurring within the
• Simulations of binary
black hole mergers
performed on DiRAC
DataCentric system
(COSMA5)
• Crucial for interpretation
of LIGO gravitational
wave detection
Abbott et al., (2016)
DiRAC

DiRAC@Leicester – Data Intensive (DIaL)
• HPE system
• 400 compute nodes
• 2x Intel Xeon Skylake 6140, 2.3GHz, 18-core processors
• Dual FMA AVX512
• 192 GB RAM
• 1 x 6TB SuperDome Flex with 144 X6154 cores (3.0GHz)
• 3 x 1.5TB fat nodes with 36 X6140 cores
• Mellanox EDR interconnect in a 2:1 blocking setup
• 3PB Lustre storage.
• 150 TB flash storage for data intensive workflows
HPE/Arm/Suse Catalyst UK Arm system
• 4000-core Thunder X2-based cluster installed in January 2019
• Infiniband all-to-all interconnect

Shared platform with local Cambridge and EPSRC Tier 2 clusters
• Dell multi-architecture system (Skylake, KNL, Nvidia GPU)
• 484 Skylake nodes
• 2 x Intel Xeon Skylake 6142 2.6GHz 16-core processors
• Mix of 192 GB/node and 384 GB/node
• 44 Intel KNL nodes
• Intel Xeon Phi CPU 7210 @ 1.30GHz
• 96GB of RAM per node
• 12 NVIDIA GPU nodes
• four NVIDIA Tesla P100 GPUs
• 96GB memory per node
• connected by Mellanox EDR Infiniband
• Intel OmniPath interconnect in 2:1 blocking configuration
• 1.5PB of Lustre disk storage;
• 0.5PB flash storage-based “Data Accelerator”
DiRAC@Cambridge – Data Intensive (DIaC)

Star Cluster Forma.on in Clouds with Diﬀerent Metallici.es (Bate 2019): Gas Temperatures

DiRAC HPC Training
• DiRAC provides access to training from wide pool of providers
• Currently offering:
- DiRAC Essentials Test: now available online (and
compulsory!)
- Workshops and Hackathons
• Coming soon:
- Domain-specific workshops
- Online individual training portal
Why do we do this?
- maximise DiRAC science output
- flexibility to adopt most cost-effective technologies
- future-proofing our software and skills
- contributes to increasing skills of wider economy

(Better systems) + (Better software) = Better science
This requires:
• More engagement in hardware and software co-design
• Enhanced training and knowledge transfer
DiRAC

Simulating the Universe from quarks to galaxy clusters with DiRAC HPC

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Simulating the Universe from quarks to galaxy clusters with DiRAC HPC

Similar to Simulating the Universe from quarks to galaxy clusters with DiRAC HPC (20)

More from inside-BigData.com

More from inside-BigData.com (20)

Recently uploaded

Recently uploaded (20)

Simulating the Universe from quarks to galaxy clusters with DiRAC HPC