Octotiger Expansion Coefficients David Pfander
David Pfander: Octotiger Expansion Coefficients 1
Agenda
What those two functions do: • compute interactions • set basis
Implementation approach: • Preliminary performance model • Vectorization
David Pfander: Octotiger Expansion Coefficients 1
(Very) HighLevel Perspective
What do compute interactions and set basis do? • Part of the gravitational potential computation, effect of cell B on cell A • Need function to approximate influence of B in different locations in A (mass centers
of child nodes) • Approximation done via multipole expansion of gravity • compute interactions and set basis compute the multipole expansion
coefficients • Additionally, angular momentum correction is calculated
But: • Actual potential not calculated or applied here!
David Pfander: Octotiger Expansion Coefficients 2
Parameters and Return Values
Input: • Center of masses of A and B • Multipole moments of A and B • (All particles with their masses?)
Output: • Array with coefficients of expansion • 2 4th order multipole expansions • 2 3d angular correction vectors
David Pfander: Octotiger Expansion Coefficients 3
Towards Gravitational Potential Evaluation Potential without correction (in Einstein’s sum convention): 1 1 ΦB→A (X) ≈ MB + MBij Dij + MB Di + MBjk Dijk xi + 2 2 1 1 MB Dij xi xj + MB Dijk xi xj xk 2 6 • MC monopole (sum of masses), MCij quadrupole moments (weighted sum of
masses), constants for given cells • All D constants; gradients of gravitational potential • Everything except for X is constant
David Pfander: Octotiger Expansion Coefficients 4
With the Sums Written Explicitly Without Einstein’s sum convention: d d X d d X d X X X 1 1 MB Di + MBij Dij + MBjk Dijk xi + ΦB→A (X) ≈ MB + 2 2 i=1 j=1 k=1 i=1 j=1   {z } {z } 1 coefficient
3 coefficients
d X d X 1
d X d X d X 1
MB Dij xi xj + MB Dijk xi xj xk 2 6 i=1 j=1  i=1 j=1 k=1  {z } {z } 9(6) coefficients
27(10) coefficients
• Some coefficients are the same, e.g. i = 0, j = 1 same as i = 1, j = 0 (symmetry) • Leads to additional array with multiplicities of terms
David Pfander: Octotiger Expansion Coefficients 5
Computing the Gradients of the Gravitational Potential Gradients of gravitational potential, X, Y center of masses, R := X − Y and d := R2 −∇n
1 := D(n) . d
It follows 1 Ri 3Ri Rj − δij d2 (1) (2) D(0) := − , Di := 3 , Dij := − , d d d5 15Ri Rj Rk − 3(δij Rk + δjk Ri + δki Rj )d2 (3) Dijk := . d7 • Omitted D (4) , very long
David Pfander: Octotiger Expansion Coefficients 6
Calculating D(0) , . . . , D(4)
1 Ri 3Ri Rj − δij d2 (1) (2) D(0) := − , Di := 3 , Dij := − , d d d5 15Ri Rj Rk − 3(δij Rk + δjk Ri + δki Rj )d2 (3) Dijk := d7 • Again using symmetry to store unique D entries, implemented via coefficient maps:
MapD(2)
0 = 1 2
1 3 4
2 4 5
• Kronecker δij complicates computation • Branchfree, vectorization by replacing the types
David Pfander: Octotiger Expansion Coefficients 7
Analysis of the Calculation of D(0) , . . . , D(4) Floatingpoint operations based on Dominic’s noduplicate approach: • 6 (for
1 d2 ,one
DIV)
• 2 + 2 + 2 + 2 = 8 (precomputation, one SQRT) • D (0) is part of precomputation • D (1) → 3 • D (2) → 3 + 3 {z}
=6
Delta loop
• D (3) → 3 + 6 + 10 + 2 · 3 + 3 · 6 = 19 + 26 = 45

{z
Delta loop
}
• D (4) → 4 · 3 + 5 · 6 + 5 · 10 = 12 + 30 + 50 = 92
Overall 14 + 92 + 45 + 6 = 157 Flops
David Pfander: Octotiger Expansion Coefficients 8
Potential Coefficient Calculation Floatingpoint operations based on Dominic’s noduplicate approach: • Includes angular momentum correction; in the order of the loops • 3 (distance) • 2 • 6 · 6 = 36 (1 DIV) • 10 · 6 = 60 (1 DIV) • 3 · 3 + 3 · 6 · 6 = 9 = 108 • 3 · 10 · 6 = 180 • 6 · 2 + 6 · 3 · 4 = 84 • 10 · 3 = 30 • 10 · 4 = 40 (n0/n1, angular momentum, not vectorized)
Overall 538 Flops
David Pfander: Octotiger Expansion Coefficients 9
Required Memory
• Some stack variables, e.g. 5th order taylor (40 doubles) for gradients • 2· 4th order taylor (20 doubles each) for moments, memoryread • 2· 4th order taylor term (10 doubles each) for angular momentum, memoryread • 2 · 3d vector for X, Y, memoryread • 2· 4th order taylor (20 doubles each) for potential of both interacting cells (return
value), memorywrite • 2 · 3d vectors (6 doubles) for force correction, memorywrite
Stack 40 · 8B = 320B Read (2 · 20 + 2 · 10 + 2 · 3) · 8B = 528B ReadWrite (2 · 20 + 2 · 3) · 8B = 368B
David Pfander: Octotiger Expansion Coefficients 10
Some Assumptions and Some Facts
• Stack variables are likely cached in L1, no shortlived heap variables • Input arrays are read once per interaction • Output arrays have one readwrite access per interaction • 512 multipole interact (seems to be constant) • 60780 multipole interactions (seems to be constant, very few exceptions)
→ 512 · (528B + 368B) = 458752B = 0.44M B (stack variables are per thread, irrelevant) → All data for one compute interactions call can be cached in L2
David Pfander: Octotiger Expansion Coefficients 11
Are we Main Memory Bound? Machine balance of relevant hardware: • A Cori KNL node requires
only:
3T F 90GB/s
3T F 490GB/s
F → 6.3 B (MCDRAM only:
3T F 400GB/s
F → 7.7 B , DRAM
F → 34.1 B )
• A Cori HSW node requires
1.2T F 136.5GB/s
F → 8.8 B
For whole call to compute interactions : 695F · 60780 = 42242100F (528B + 368B) · 512 = 458752B 42242100F F = 92.1 → 458752B B → compute bound, if enough L2 bandwidth → DRAM or HBM2 should not make a difference
David Pfander: Octotiger Expansion Coefficients 12
Is a Single Iteration L2 Bound? Can load up to 64B per core per cycle on HSW and KNL: • L2 balance on 68core KNL node is
3T F (68/2)·1.4GHz·64B
=
3T F 3T B/s
F = 1.0 B
(L2 shared between 2 cores, 64B read per cycle) • L2 balance on 32core HSW node is
1.2T F 32·2.3GHz·64B
=
3T F 4.7T B/s
F = 0.64 B
Naively for single interaction: F 695F = 0.78 528B + 368B B More realistically: • Due to the vectorization approach, most of the time same multipole interacts with
different other multipoles • Therefore, one multipole is cached in L1, and consequentially
8 · 695F 5560F F = = 1.38 264B + 8 · 264B + 184B + 8 · 184B 4032 B David Pfander: Octotiger Expansion Coefficients 13
Some conclusions
• Can stream iteration from L2 on KNL
→ Compute bound on KNL! • Working set (≈ 0.44M B) fits into KNL L2 (1MB for 2 cores), but not into HSW L2
(256KB), L3 might help on HSW • Current scalar to vector instruction ratio ≈ 60 : 40 (measured on HSW with Intel
Advisor) → Leads to naive peak performance bound of 40% → 1.2T F (Without memory and pipeline latencies, . . . )
David Pfander: Octotiger Expansion Coefficients 14
Implementation Properties
• Vectorization is straightforward, no branches (replace type by vector type) • Vectorization improves cacheline utilization (vector size == cache line size) • Vector elements will most often compute interactions between the first multipole and
some other multipole (they some input and output vectors) → Many L1 hits, good cache efficiency • Relatively high ILP (potential calculated for both interacting multipoles) • Lots of scalar operations and MOVs mixed in (bad for KNL) • Example: Expensive indirect loads D (4) map for single element access
A(a, a, b, b) → A[map4[a][a][b][b]]
David Pfander: Octotiger Expansion Coefficients 15
Guideline for Improvements on KNL
Instruction mix requirements: • KNL can only sustain two instructions per cycle (HSW can do 4) • KNL has two FP vector pipes • Every scalar instruction removes one vectorized FP instruction and thereby directly
reduces peak performance • Long chains of AVX512 instructions are needed • (But AVX512 MOVs are bad as well)
Other: • Break up dependency chains for better pipeline utilization • Hyperthreads should decrease performance, due to increased L2 pressure
David Pfander: Octotiger Expansion Coefficients 16
General Questions and Further Challenges Can we somehow get rid of storing the expansion as arrays? • Probably not (according to Dominic) • Needed to calculate the child expansions • Actual evaluation at the leaf level
Could we integrate the moments calculation as part of the interactions? • Likely propagated as part of the upward tree traversal • If true, for O(n) FMM impossible
Further Challenges: • Everything tightly coupled, changing the data structures requires major refactoring • Refactoring requires a better understanding of other parts of octotiger
David Pfander: Octotiger Expansion Coefficients 17
Flops and Bytes of Boundary Multipole Multipole Flops for single interaction (ordered as in the function): • 10 · 2 + 1 = 21 • 3·1=3 • 157 (set basis call) • 6 · 4 = 24 • 10 · 4 = 40 • 3 · (6 · 4 + 1) = 75 • 3 · 10 · 4 = 120 • 6 · (3 · 2 + 1) = 42 • 10 · 1 = 10 • 20 + 3 · 1 = 4
Total: 501 Flop
David Pfander: Octotiger Expansion Coefficients 18
Cont.
Bytes for single interaction : • Read: (20 + 3) · 8B = 184B (load multipole, per loop, min 8 reuses) • Read: (10 + 3) · 8B = 104B (second multipole, per iteration, reuse unclear) • ReadWrite: 2 · (20 + 3) · 8B = 368B (potential and correction, per loop)
(Empirically) at least 8 reusages for first multipole and result: 184B 368B 501F + 104B + = 173B → = 2.9F/B 8 8 173B
David Pfander: Octotiger Expansion Coefficients 19
Flops and Bytes of Boundary Multipole Monopole
Flops for single interaction (ordered as in the function): • 3·1 • 157 (set basis call) • 1 + 6 · 4 + 10 · 4 • 3 ∗ (6 ∗ 4 + 1) • 3 ∗ 10 ∗ 4 • 4∗1+3∗1
Total: 427 Flop
David Pfander: Octotiger Expansion Coefficients 20
Cont.
Bytes for single interaction : • Read: (20 + 3) · 8B = 184B (load multipole, per loop, min 8 reuses) • Read: 3 · 8B = 24B (monopole, per iteration, reuse unclear) • ReadWrite: 2 · (3 + 3) · 8B = 96B (potential and correction)
(Empirically) at least 8 reusages for first multipole and result: 184B 427F + 24B + 96B = 143B → = 3.0F/B 8 143B
David Pfander: Octotiger Expansion Coefficients 21
Flops and Bytes of Boundary Monopole Multipole
Flops for single interaction (ordered as in the function): • 1 + 10 · 2 + 3 · 1 • 157 (set basis call) • 1 + 10 · 1 + 10 · 1 • 3 · 10 · 3 • 20 + 3
Total: 315 Flop
David Pfander: Octotiger Expansion Coefficients 22
Cont.
Bytes for single interaction : • Read: (20 + 3) · 8B = 184B (load multipole, per loop, min 8 reuses) • Read: (3 + 1 + 10) · 8B = 112B (monopole, per iteration, reuse unclear) • ReadWrite: 2 · (10 + 3) · 8B = 208B (potential and correction, per loop)
(Empirically) at least 8 reusages for first multipole and result: 184B 315F + 112B + 208B = 343B → = 0.9F/B 8 343B
David Pfander: Octotiger Expansion Coefficients 23
Flops and Bytes of Boundary Monopole Monopole
Flops for single interaction (ordered as in the function): • 4+4
Total: 8 Flop
David Pfander: Octotiger Expansion Coefficients 24
Cont.
Bytes for single interaction : • Read: 4 · 8B = 24B (load multipole, per loop, min 8 reuses) • Read: 4 · 8B = 24B (monopole, per iteration, reuse unclear) • ReadWrite: 2 · 4 · 8B = 48B (potential and correction, per loop)
(Empirically) at least 8 reusages for first multipole and result: 24B 8F + 24B + 48B = 75B → = 0.1F/B 8 75B
David Pfander: Octotiger Expansion Coefficients 25