Octotiger Expansion Coefficients - GitHub

Viewer
Transcript

Octotiger Expansion Coefficients David Pfander

David Pfander: Octotiger Expansion Coefficients 1

Agenda

What those two functions do: • compute interactions • set basis

Implementation approach: • Preliminary performance model • Vectorization

David Pfander: Octotiger Expansion Coefficients 1

(Very) High-Level Perspective

What do compute interactions and set basis do? • Part of the gravitational potential computation, effect of cell B on cell A • Need function to approximate influence of B in different locations in A (mass centers

of child nodes) • Approximation done via multipole expansion of gravity • compute interactions and set basis compute the multipole expansion

coefficients • Additionally, angular momentum correction is calculated

But: • Actual potential not calculated or applied here!

David Pfander: Octotiger Expansion Coefficients 2

Parameters and Return Values

Input: • Center of masses of A and B • Multipole moments of A and B • (All particles with their masses?)

Output: • Array with coefficients of expansion • 2 4th order multipole expansions • 2 3d angular correction vectors

David Pfander: Octotiger Expansion Coefficients 3

Towards Gravitational Potential Evaluation Potential without correction (in Einstein’s sum convention): 1 1 ΦB→A (X) ≈ MB + MBij Dij + MB Di + MBjk Dijk xi + 2 2 1 1 MB Dij xi xj + MB Dijk xi xj xk 2 6 • MC monopole (sum of masses), MCij quadrupole moments (weighted sum of

masses), constants for given cells • All D constants; gradients of gravitational potential • Everything except for X is constant

David Pfander: Octotiger Expansion Coefficients 4

With the Sums Written Explicitly Without Einstein’s sum convention:     d d X d d X d X X X 1 1 MB Di + MBij Dij  + MBjk Dijk  xi + ΦB→A (X) ≈ MB + 2 2 i=1 j=1 k=1 i=1 j=1 | | {z } {z } 1 coefficient

3 coefficients

d X d X 1

d X d X d X 1

MB Dij xi xj + MB Dijk xi xj xk 2 6 i=1 j=1 | i=1 j=1 k=1 | {z } {z } 9(6) coefficients

27(10) coefficients

• Some coefficients are the same, e.g. i = 0, j = 1 same as i = 1, j = 0 (symmetry) • Leads to additional array with multiplicities of terms

David Pfander: Octotiger Expansion Coefficients 5

Computing the Gradients of the Gravitational Potential Gradients of gravitational potential, X, Y center of masses, R := X − Y and d := ||R||2 −∇n

1 := D(n) . d

It follows 1 Ri 3Ri Rj − δij d2 (1) (2) D(0) := − , Di := 3 , Dij := − , d d d5 15Ri Rj Rk − 3(δij Rk + δjk Ri + δki Rj )d2 (3) Dijk := . d7 • Omitted D (4) , very long

David Pfander: Octotiger Expansion Coefficients 6

Calculating D(0) , . . . , D(4)

1 Ri 3Ri Rj − δij d2 (1) (2) D(0) := − , Di := 3 , Dij := − , d d d5 15Ri Rj Rk − 3(δij Rk + δjk Ri + δki Rj )d2 (3) Dijk := d7 • Again using symmetry to store unique D entries, implemented via coefficient maps:



MapD(2)

0 = 1 2

1 3 4

 2 4 5

• Kronecker δij complicates computation • Branch-free, vectorization by replacing the types

David Pfander: Octotiger Expansion Coefficients 7

Analysis of the Calculation of D(0) , . . . , D(4) Floating-point operations based on Dominic’s no-duplicate approach: • 6 (for

1 d2 ,one

DIV)

• 2 + 2 + 2 + 2 = 8 (precomputation, one SQRT) • D (0) is part of precomputation • D (1) → 3 • D (2) → 3 + 3 |{z}

=6

Delta loop

• D (3) → 3 + 6 + 10 + 2 · 3 + 3 · 6 = 19 + 26 = 45

|

{z

Delta loop

}

• D (4) → 4 · 3 + 5 · 6 + 5 · 10 = 12 + 30 + 50 = 92

Overall 14 + 92 + 45 + 6 = 157 Flops

David Pfander: Octotiger Expansion Coefficients 8

Potential Coefficient Calculation Floating-point operations based on Dominic’s no-duplicate approach: • Includes angular momentum correction; in the order of the loops • 3 (distance) • 2 • 6 · 6 = 36 (1 DIV) • 10 · 6 = 60 (1 DIV) • 3 · 3 + 3 · 6 · 6 = 9 = 108 • 3 · 10 · 6 = 180 • 6 · 2 + 6 · 3 · 4 = 84 • 10 · 3 = 30 • 10 · 4 = 40 (n0/n1, angular momentum, not vectorized)

Overall 538 Flops

David Pfander: Octotiger Expansion Coefficients 9

Required Memory

• Some stack variables, e.g. 5th order taylor (40 doubles) for gradients • 2· 4th order taylor (20 doubles each) for moments, memory-read • 2· 4th order taylor term (10 doubles each) for angular momentum, memory-read • 2 · 3d vector for X, Y, memory-read • 2· 4th order taylor (20 doubles each) for potential of both interacting cells (return

value), memory-write • 2 · 3d vectors (6 doubles) for force correction, memory-write

Stack 40 · 8B = 320B Read (2 · 20 + 2 · 10 + 2 · 3) · 8B = 528B Read-Write (2 · 20 + 2 · 3) · 8B = 368B

David Pfander: Octotiger Expansion Coefficients 10

Some Assumptions and Some Facts

• Stack variables are likely cached in L1, no short-lived heap variables • Input arrays are read once per interaction • Output arrays have one read-write access per interaction • 512 multipole interact (seems to be constant) • 60780 multipole interactions (seems to be constant, very few exceptions)

→ 512 · (528B + 368B) = 458752B = 0.44M B (stack variables are per thread, irrelevant) → All data for one compute interactions call can be cached in L2

David Pfander: Octotiger Expansion Coefficients 11

Are we Main Memory Bound? Machine balance of relevant hardware: • A Cori KNL node requires

only:

3T F 90GB/s

3T F 490GB/s

F → 6.3 B (MCDRAM only:

3T F 400GB/s

F → 7.7 B , DRAM

F → 34.1 B )

• A Cori HSW node requires

1.2T F 136.5GB/s

F → 8.8 B

For whole call to compute interactions : 695F · 60780 = 42242100F (528B + 368B) · 512 = 458752B 42242100F F = 92.1 → 458752B B → compute bound, if enough L2 bandwidth → DRAM or HBM2 should not make a difference

David Pfander: Octotiger Expansion Coefficients 12

Is a Single Iteration L2 Bound? Can load up to 64B per core per cycle on HSW and KNL: • L2 balance on 68-core KNL node is

3T F (68/2)·1.4GHz·64B

=

3T F 3T B/s

F = 1.0 B

(L2 shared between 2 cores, 64B read per cycle) • L2 balance on 32-core HSW node is

1.2T F 32·2.3GHz·64B

=

3T F 4.7T B/s

F = 0.64 B

Naively for single interaction: F 695F = 0.78 528B + 368B B More realistically: • Due to the vectorization approach, most of the time same multipole interacts with

different other multipoles • Therefore, one multipole is cached in L1, and consequentially

8 · 695F 5560F F = = 1.38 264B + 8 · 264B + 184B + 8 · 184B 4032 B David Pfander: Octotiger Expansion Coefficients 13

Some conclusions

• Can stream iteration from L2 on KNL

→ Compute bound on KNL! • Working set (≈ 0.44M B) fits into KNL L2 (1MB for 2 cores), but not into HSW L2

(256KB), L3 might help on HSW • Current scalar to vector instruction ratio ≈ 60 : 40 (measured on HSW with Intel

Advisor) → Leads to naive peak performance bound of 40% → 1.2T F (Without memory and pipeline latencies, . . . )

David Pfander: Octotiger Expansion Coefficients 14

Implementation Properties

• Vectorization is straightforward, no branches (replace type by vector type) • Vectorization improves cache-line utilization (vector size == cache line size) • Vector elements will most often compute interactions between the first multipole and

some other multipole (they some input and output vectors) → Many L1 hits, good cache efficiency • Relatively high ILP (potential calculated for both interacting multipoles) • Lots of scalar operations and MOVs mixed in (bad for KNL) • Example: Expensive indirect loads D (4) map for single element access

A(a, a, b, b) → A[map4[a][a][b][b]]

David Pfander: Octotiger Expansion Coefficients 15

Guideline for Improvements on KNL

Instruction mix requirements: • KNL can only sustain two instructions per cycle (HSW can do 4) • KNL has two FP vector pipes • Every scalar instruction removes one vectorized FP instruction and thereby directly

reduces peak performance • Long chains of AVX512 instructions are needed • (But AVX512 MOVs are bad as well)

Other: • Break up dependency chains for better pipeline utilization • Hyperthreads should decrease performance, due to increased L2 pressure

David Pfander: Octotiger Expansion Coefficients 16

General Questions and Further Challenges Can we somehow get rid of storing the expansion as arrays? • Probably not (according to Dominic) • Needed to calculate the child expansions • Actual evaluation at the leaf level

Could we integrate the moments calculation as part of the interactions? • Likely propagated as part of the upward tree traversal • If true, for O(n) FMM impossible

Further Challenges: • Everything tightly coupled, changing the data structures requires major refactoring • Refactoring requires a better understanding of other parts of octotiger

David Pfander: Octotiger Expansion Coefficients 17

Flops and Bytes of Boundary Multipole Multipole Flops for single interaction (ordered as in the function): • 10 · 2 + 1 = 21 • 3·1=3 • 157 (set basis call) • 6 · 4 = 24 • 10 · 4 = 40 • 3 · (6 · 4 + 1) = 75 • 3 · 10 · 4 = 120 • 6 · (3 · 2 + 1) = 42 • 10 · 1 = 10 • 20 + 3 · 1 = 4

Total: 501 Flop

David Pfander: Octotiger Expansion Coefficients 18

Cont.

Bytes for single interaction : • Read: (20 + 3) · 8B = 184B (load multipole, per loop, min 8 reuses) • Read: (10 + 3) · 8B = 104B (second multipole, per iteration, reuse unclear) • Read-Write: 2 · (20 + 3) · 8B = 368B (potential and correction, per loop)

(Empirically) at least 8 reusages for first multipole and result: 184B 368B 501F + 104B + = 173B → = 2.9F/B 8 8 173B

David Pfander: Octotiger Expansion Coefficients 19

Flops and Bytes of Boundary Multipole Monopole

Flops for single interaction (ordered as in the function): • 3·1 • 157 (set basis call) • 1 + 6 · 4 + 10 · 4 • 3 ∗ (6 ∗ 4 + 1) • 3 ∗ 10 ∗ 4 • 4∗1+3∗1

Total: 427 Flop

David Pfander: Octotiger Expansion Coefficients 20

Cont.

Bytes for single interaction : • Read: (20 + 3) · 8B = 184B (load multipole, per loop, min 8 reuses) • Read: 3 · 8B = 24B (monopole, per iteration, reuse unclear) • Read-Write: 2 · (3 + 3) · 8B = 96B (potential and correction)

(Empirically) at least 8 reusages for first multipole and result: 184B 427F + 24B + 96B = 143B → = 3.0F/B 8 143B

David Pfander: Octotiger Expansion Coefficients 21

Flops and Bytes of Boundary Monopole Multipole

Flops for single interaction (ordered as in the function): • 1 + 10 · 2 + 3 · 1 • 157 (set basis call) • 1 + 10 · 1 + 10 · 1 • 3 · 10 · 3 • 20 + 3

Total: 315 Flop

David Pfander: Octotiger Expansion Coefficients 22

Cont.

Bytes for single interaction : • Read: (20 + 3) · 8B = 184B (load multipole, per loop, min 8 reuses) • Read: (3 + 1 + 10) · 8B = 112B (monopole, per iteration, reuse unclear) • Read-Write: 2 · (10 + 3) · 8B = 208B (potential and correction, per loop)

(Empirically) at least 8 reusages for first multipole and result: 184B 315F + 112B + 208B = 343B → = 0.9F/B 8 343B

David Pfander: Octotiger Expansion Coefficients 23

Flops and Bytes of Boundary Monopole Monopole

Flops for single interaction (ordered as in the function): • 4+4

Total: 8 Flop

David Pfander: Octotiger Expansion Coefficients 24

Cont.

Bytes for single interaction : • Read: 4 · 8B = 24B (load multipole, per loop, min 8 reuses) • Read: 4 · 8B = 24B (monopole, per iteration, reuse unclear) • Read-Write: 2 · 4 · 8B = 48B (potential and correction, per loop)

(Empirically) at least 8 reusages for first multipole and result: 24B 8F + 24B + 48B = 75B → = 0.1F/B 8 75B

David Pfander: Octotiger Expansion Coefficients 25

Octotiger Expansion Coefficients - GitHub

Gradients of gravitational potential, X, Y center of masses, R := X â Y and d := ||R||2 ... Delta loop ... All data for one compute interactions call can be cached in L2.

Download PDF

569KB Sizes 9 Downloads 389 Views

Report

Octotiger Expansion Coefficients - GitHub

Recommend Documents