Octotiger Expansion Coefficients David Pfander

David Pfander: Octotiger Expansion Coefficients 1

Agenda

What those two functions do:
• compute_interactions
• set_basis

Implementation approach:
• Preliminary performance model
• Vectorization


(Very) High-Level Perspective

What do compute_interactions and set_basis do?
• They are part of the gravitational potential computation: the effect of cell B on cell A
• A function is needed that approximates the influence of B at different locations in A (the mass centers of the child nodes)
• The approximation is done via a multipole expansion of gravity
• compute_interactions and set_basis compute the multipole expansion coefficients
• Additionally, the angular momentum correction is calculated

But:
• The actual potential is neither calculated nor applied here!


Parameters and Return Values

Input:
• Centers of mass of A and B
• Multipole moments of A and B
• (All particles with their masses?)

Output:
• Array with the coefficients of the expansion
• Two 4th-order multipole expansions
• Two 3D angular correction vectors


Towards Gravitational Potential Evaluation

Potential without correction (in Einstein's sum convention):

\[
\Phi_{B \to A}(X) \approx
  \left( M_B + \tfrac{1}{2} M_{B,ij} D_{ij} \right)
  + \left( M_B D_i + \tfrac{1}{2} M_{B,jk} D_{ijk} \right) x_i
  + \left( \tfrac{1}{2} M_B D_{ij} \right) x_i x_j
  + \left( \tfrac{1}{6} M_B D_{ijk} \right) x_i x_j x_k
\]

• $M_C$ monopole (sum of masses), $M_{C,ij}$ quadrupole moments (weighted sum of masses), constants for given cells
• All $D$ are constants; gradients of the gravitational potential
• Everything except for $X$ is constant


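With the Sums Written Explicitly

Without Einstein's sum convention:

\[
\begin{aligned}
\Phi_{B \to A}(X) \approx{}
  & \underbrace{\left( M_B + \frac{1}{2} \sum_{i=1}^{d} \sum_{j=1}^{d} M_{B,ij} D_{ij} \right)}_{\text{1 coefficient}}
    + \sum_{i=1}^{d} \underbrace{\left( M_B D_i + \frac{1}{2} \sum_{j=1}^{d} \sum_{k=1}^{d} M_{B,jk} D_{ijk} \right)}_{\text{3 coefficients}} x_i \\
  & + \sum_{i=1}^{d} \sum_{j=1}^{d} \underbrace{\left( \frac{1}{2} M_B D_{ij} \right)}_{\text{9 (6) coefficients}} x_i x_j
    + \sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{k=1}^{d} \underbrace{\left( \frac{1}{6} M_B D_{ijk} \right)}_{\text{27 (10) coefficients}} x_i x_j x_k
\end{aligned}
\]

• Some coefficients are the same, e.g. i = 0, j = 1 is the same as i = 1, j = 0 (symmetry)
• This leads to an additional array with the multiplicities of the terms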
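The symmetry argument can be checked with a short Python sketch. It enumerates the unique index combinations of a fully symmetric tensor in three dimensions, counts how often each one occurs in the full sum, and compares a contraction over all entries with one over the unique entries weighted by their multiplicities. All function names and the example tensor are illustrative assumptions, not taken from Octotiger.

```python
from itertools import combinations_with_replacement, permutations, product

DIM = 3  # spatial dimension d


def unique_indices(rank):
    """Sorted index tuples, one per unique entry of a symmetric rank-`rank` tensor."""
    return list(combinations_with_replacement(range(DIM), rank))


def multiplicity(index):
    """How many index orderings of the full sum map to this unique entry."""
    return len(set(permutations(index)))


def contract_full(tensor, x, rank):
    """Reference: sum over all DIM**rank index combinations."""
    total = 0.0
    for index in product(range(DIM), repeat=rank):
        term = tensor(index)
        for i in index:
            term *= x[i]
        total += term
    return total


def contract_unique(tensor, x, rank):
    """Same sum, visiting each unique entry once and scaling by its multiplicity."""
    total = 0.0
    for index in unique_indices(rank):
        term = multiplicity(index) * tensor(index)
        for i in index:
            term *= x[i]
        total += term
    return total


def example_tensor(index):
    """Hypothetical symmetric tensor: the value ignores the order of the indices."""
    return 1.0 + sum(index) + 0.25 * max(index)


counts = {rank: len(unique_indices(rank)) for rank in range(4)}
print(counts)  # {0: 1, 1: 3, 2: 6, 3: 10}

x = (0.1, -0.2, 0.3)
print(contract_full(example_tensor, x, 3), contract_unique(example_tensor, x, 3))
```

The counts 6 and 10 are the numbers given in parentheses on the slide, and the multiplicities of each rank add up to 9 and 27 respectively.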


Computing the Gradients of the Gravitational Potential

Gradients of the gravitational potential, with $X$, $Y$ the centers of mass, $R := X - Y$ and $d := \|R\|_2$:

\[
D^{(n)} := -\nabla^n \frac{1}{d}
\]

It follows

\[
D^{(0)} := -\frac{1}{d}, \quad
D^{(1)}_i := \frac{R_i}{d^3}, \quad
D^{(2)}_{ij} := -\frac{3 R_i R_j - \delta_{ij} d^2}{d^5}, \quad
D^{(3)}_{ijk} := \frac{15 R_i R_j R_k - 3 (\delta_{ij} R_k + \delta_{jk} R_i + \delta_{ki} R_j) d^2}{d^7}
\]

• Omitted $D^{(4)}$, very long
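As a sanity check of the closed forms, the following Python sketch evaluates $D^{(0)}$ to $D^{(3)}$ for one displacement and compares each order against a central finite difference of the previous order. Each order should be the gradient of the previous one with respect to $R$. The function names, the test vector and the step size are assumptions chosen for illustration; this is not the Octotiger implementation.

```python
import math


def gradient_tensors(R):
    """Return (D0, D1, D2, D3) for the displacement R = X - Y, using the closed forms."""
    n = range(3)
    d2 = sum(r * r for r in R)  # d^2
    d = math.sqrt(d2)
    # Kronecker delta stored as data, so the formulas contain no branches.
    delta = [[1.0 if i == j else 0.0 for j in n] for i in n]

    D0 = -1.0 / d
    D1 = [R[i] / d**3 for i in n]
    D2 = [[-(3.0 * R[i] * R[j] - delta[i][j] * d2) / d**5 for j in n] for i in n]
    D3 = [[[(15.0 * R[i] * R[j] * R[k]
             - 3.0 * (delta[i][j] * R[k] + delta[j][k] * R[i] + delta[k][i] * R[j]) * d2)
            / d**7
            for k in n] for j in n] for i in n]
    return D0, D1, D2, D3


def derivative(f, R, k, h=1e-5):
    """Central finite difference of f with respect to component k of R."""
    plus, minus = list(R), list(R)
    plus[k] += h
    minus[k] -= h
    return (f(plus) - f(minus)) / (2.0 * h)


R = [1.0, 2.0, -0.5]  # arbitrary test displacement
D0, D1, D2, D3 = gradient_tensors(R)
idx = range(3)

err1 = max(abs(derivative(lambda r: gradient_tensors(r)[0], R, i) - D1[i]) for i in idx)
err2 = max(abs(derivative(lambda r, i=i: gradient_tensors(r)[1][i], R, j) - D2[i][j])
           for i in idx for j in idx)
err3 = max(abs(derivative(lambda r, i=i, j=j: gradient_tensors(r)[2][i][j], R, k) - D3[i][j][k])
           for i in idx for j in idx for k in idx)
trace = D2[0][0] + D2[1][1] + D2[2][2]  # vanishes because 1/d is harmonic

print(err1, err2, err3, trace)
```

All three errors should be at the level of the finite-difference accuracy, and the trace of $D^{(2)}$ should vanish up to rounding.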


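Calculating $D^{(0)}, \ldots, D^{(4)}$

\[
D^{(0)} := -\frac{1}{d}, \quad
D^{(1)}_i := \frac{R_i}{d^3}, \quad
D^{(2)}_{ij} := -\frac{3 R_i R_j - \delta_{ij} d^2}{d^5}, \quad
D^{(3)}_{ijk} := \frac{15 R_i R_j R_k - 3 (\delta_{ij} R_k + \delta_{jk} R_i + \delta_{ki} R_j) d^2}{d^7}
\]

• Again using symmetry to store only the unique $D$ entries, implemented via coefficient maps:

\[
\mathrm{Map}_{D^{(2)}} =
\begin{pmatrix}
0 & 1 & 2 \\
1 & 3 & 4 \\
2 & 4 & 5
\end{pmatrix}
\]

• The Kronecker $\delta_{ij}$ complicates the computation
• Branch-free; vectorization by replacing the types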
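A coefficient map of this kind can be generated mechanically. The sketch below is one possible construction in Python, assuming the unique entries are numbered in lexicographic order of their sorted index tuples. That ordering reproduces the $D^{(2)}$ map shown on the slide, but whether Octotiger builds its maps this way is an assumption, and `build_map` is a hypothetical helper.

```python
from itertools import combinations_with_replacement


def build_map(rank, dim=3):
    """Nested lists m with m[i][j]... = flat position of the entry among the unique ones."""
    position = {
        index: n
        for n, index in enumerate(combinations_with_replacement(range(dim), rank))
    }

    def fill(prefix):
        if len(prefix) == rank:
            # Any permutation of the indices maps to the same sorted tuple.
            return position[tuple(sorted(prefix))]
        return [fill(prefix + (i,)) for i in range(dim)]

    return fill(())


map2 = build_map(2)
for row in map2:
    print(row)
# [0, 1, 2]
# [1, 3, 4]
# [2, 4, 5]

# Storing only the unique entries of a symmetric 3x3 tensor:
unique_values = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]  # made-up numbers
print(unique_values[map2[2][1]], unique_values[map2[1][2]])  # 14.0 14.0
```

The same helper yields 10 unique positions for rank 3 and 15 for rank 4.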


Analysis of the Calculation of $D^{(0)}, \ldots, D^{(4)}$

Floating-point operations based on Dominic's no-duplicate approach:
• 6 (for $1/d^2$, one DIV)
• 2 + 2 + 2 + 2 = 8 (precomputation, one SQRT)
• $D^{(0)}$ is part of the precomputation
• $D^{(1)}$ → 3
• $D^{(2)}$ → 3 + 3 = 6 (the second 3 is the Delta loop)
• $D^{(3)}$ → 3 + 6 + 10 + 2 · 3 + 3 · 6 = 19 + 26 = 45 (the second part is the Delta loop)
• $D^{(4)}$ → 4 · 3 + 5 · 6 + 5 · 10 = 12 + 30 + 50 = 92

Overall 14 + 92 + 45 + 6 = 157 Flops


Potential Coefficient Calculation

Floating-point operations based on Dominic's no-duplicate approach:
• Includes the angular momentum correction; listed in the order of the loops
• 3 (distance)
• 2
• 6 · 6 = 36 (1 DIV)
• 10 · 6 = 60 (1 DIV)
• 3 · 3 + 3 · 6 · 6 = 9 + 108
• 3 · 10 · 6 = 180
• 6 · 2 + 6 · 3 · 4 = 84
• 10 · 3 = 30
• 10 · 4 = 40 (n0/n1, angular momentum, not vectorized)

Overall 538 Flops


Required Memory

• Some stack variables, e.g. a 5th-order taylor (40 doubles) for the gradients
• 2 · 4th-order taylor (20 doubles each) for the moments, memory-read
• 2 · 4th-order taylor term (10 doubles each) for the angular momentum, memory-read
• 2 · 3D vector for X, Y, memory-read
• 2 · 4th-order taylor (20 doubles each) for the potential of both interacting cells (return value), memory-write
• 2 · 3D vector (6 doubles) for the force correction, memory-write

Stack: 40 · 8B = 320B
Read: (2 · 20 + 2 · 10 + 2 · 3) · 8B = 528B
Read-Write: (2 · 20 + 2 · 3) · 8B = 368B


Some Assumptions and Some Facts

• Stack variables are likely cached in L1, no short-lived heap variables
• Input arrays are read once per interaction
• Output arrays have one read-write access per interaction
• 512 multipoles interact (seems to be constant)
• 60780 multipole interactions (seems to be constant, very few exceptions)

→ 512 · (528B + 368B) = 458752B = 0.44MB (stack variables are per thread, irrelevant)
→ All data for one compute_interactions call can be cached in L2
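The byte counts behind this estimate follow directly from the numbers of doubles listed under Required Memory. A small Python sketch of the arithmetic is given below; the variable names are made up for illustration, and the L2 sizes (1MB shared by two cores on KNL, 256KB on HSW) are the ones quoted later in the conclusions.

```python
DOUBLE = 8  # bytes per double

# Doubles per multipole pair, as listed under "Required Memory".
read_doubles = 2 * 20 + 2 * 10 + 2 * 3   # moments, angular momentum terms, X and Y
rw_doubles = 2 * 20 + 2 * 3              # potential expansions, force corrections
stack_doubles = 40                       # 5th-order taylor for the gradients

read_bytes = read_doubles * DOUBLE       # 528
rw_bytes = rw_doubles * DOUBLE           # 368
stack_bytes = stack_doubles * DOUBLE     # 320

MULTIPOLES = 512
working_set = MULTIPOLES * (read_bytes + rw_bytes)   # 458752 bytes
working_set_mib = working_set / 2**20                # 0.4375

KNL_L2 = 1 * 2**20    # 1MB, shared between 2 cores
HSW_L2 = 256 * 2**10  # 256KB

print(read_bytes, rw_bytes, stack_bytes)
print(working_set, working_set_mib)
print("fits KNL L2:", working_set <= KNL_L2, "| fits HSW L2:", working_set <= HSW_L2)
```

The stack variables are left out of the working set because they exist once per thread, not once per multipole.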


Are we Main Memory Bound?

Machine balance of relevant hardware:
• A Cori KNL node requires 3TF / 490GB/s → 6.3 F/B (MCDRAM only: 3TF / 400GB/s → 7.7 F/B, DRAM only: 3TF / 90GB/s → 34.1 F/B)
• A Cori HSW node requires 1.2TF / 136.5GB/s → 8.8 F/B

For the whole call to compute_interactions:
• Flops: 695F · 60780 = 42242100F
• Bytes: (528B + 368B) · 512 = 458752B
• Intensity: 42242100F / 458752B = 92.1 F/B

→ Compute bound, if there is enough L2 bandwidth
→ DRAM or HBM2 should not make a difference
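The comparison can be written down as a few lines of Python. The sketch assumes that the 695 Flops per interaction are the sum of the two counts derived earlier (157 for set_basis plus 538 for the coefficients) and takes the machine balances as quoted on this slide; the dictionary keys are just labels.

```python
FLOPS_PER_INTERACTION = 157 + 538   # set_basis + coefficient calculation = 695
INTERACTIONS = 60780
BYTES_PER_MULTIPOLE = 528 + 368     # read + read-write
MULTIPOLES = 512

total_flops = FLOPS_PER_INTERACTION * INTERACTIONS
total_bytes = BYTES_PER_MULTIPOLE * MULTIPOLES
intensity = total_flops / total_bytes  # Flops per byte loaded from main memory

# Machine balances in F/B, copied from the slide.
balance = {
    "KNL": 6.3,
    "KNL, MCDRAM only": 7.7,
    "KNL, DRAM only": 34.1,
    "HSW": 8.8,
}

print(total_flops, total_bytes, round(intensity, 1))
for name, required in balance.items():
    verdict = "compute bound" if intensity > required else "memory bound"
    print(f"{name}: needs {required} F/B -> {verdict}")
```

Since the intensity exceeds even the DRAM-only balance, the kind of main memory does not decide the outcome under these assumptions.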


Is a Single Iteration L2 Bound?

Can load up to 64B per core per cycle on HSW and KNL:
• L2 balance on a 68-core KNL node is 3TF / ((68/2) · 1.4GHz · 64B) = 3TF / 3TB/s = 1.0 F/B (L2 shared between 2 cores, 64B read per cycle)
• L2 balance on a 32-core HSW node is 1.2TF / (32 · 2.3GHz · 64B) = 1.2TF / 4.7TB/s ≈ 0.25 F/B

Naively, for a single interaction:

\[
\frac{695\,\mathrm{F}}{528\,\mathrm{B} + 368\,\mathrm{B}} = 0.78\,\mathrm{F/B}
\]

More realistically:
• Due to the vectorization approach, most of the time the same multipole interacts with different other multipoles
• Therefore, one multipole is cached in L1, and consequently

\[
\frac{8 \cdot 695\,\mathrm{F}}{264\,\mathrm{B} + 8 \cdot 264\,\mathrm{B} + 184\,\mathrm{B} + 8 \cdot 184\,\mathrm{B}}
= \frac{5560\,\mathrm{F}}{4032\,\mathrm{B}} = 1.38\,\mathrm{F/B}
\]
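Both intensity estimates are instances of one formula in which the first multipole is loaded once and then reused for several partners. The Python sketch below makes the reuse factor a parameter. It assumes that 264B and 184B are the per-multipole halves of the 528B read and 368B read-write totals; `l2_intensity` is a hypothetical helper name.

```python
FLOPS = 695                     # Flops per interaction
READ_PER_MULTIPOLE = 528 // 2   # 264B
RW_PER_MULTIPOLE = 368 // 2     # 184B


def l2_intensity(reuse):
    """Flops per byte streamed from L2 when the first multipole stays in L1 for `reuse` partners."""
    per_multipole = READ_PER_MULTIPOLE + RW_PER_MULTIPOLE
    bytes_moved = (1 + reuse) * per_multipole   # first multipole once, each partner once
    return reuse * FLOPS / bytes_moved


KNL_L2_BALANCE = 1.0  # F/B, from the slide

for reuse in (1, 2, 4, 8, 16):
    value = l2_intensity(reuse)
    print(reuse, round(value, 2), "above KNL L2 balance" if value > KNL_L2_BALANCE else "below")
```

With a reuse factor of 1 the formula gives the naive 0.78 F/B, with 8 it gives 1.38 F/B, and for large reuse it approaches 695 / 448 ≈ 1.55 F/B.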

Some conclusions

• Can stream an iteration from L2 on KNL
→ Compute bound on KNL!
• The working set (≈ 0.44MB) fits into the KNL L2 (1MB for 2 cores), but not into the HSW L2 (256KB); L3 might help on HSW
• Current scalar to vector instruction ratio ≈ 60 : 40 (measured on HSW with Intel Advisor)
→ Leads to a naive peak performance bound of 40% → 1.2TF (without memory and pipeline latencies, ...)


Implementation Properties

• Vectorization is straightforward, no branches (replace type by vector type)
• Vectorization improves cache-line utilization (vector size == cache line size)
• Vector elements will most often compute interactions between the first multipole and some other multipole (they share input and output vectors)
→ Many L1 hits, good cache efficiency
• Relatively high ILP (potential calculated for both interacting multipoles)
• Lots of scalar operations and MOVs mixed in (bad for KNL)
• Example: expensive indirect loads through the $D^{(4)}$ map for single element access

A(a, a, b, b) → A[map4[a][a][b][b]]
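The idea of vectorizing by replacing the scalar type with a vector type can be illustrated with a toy example. In the Python sketch below, `Vec` is a made-up stand-in for a SIMD type with elementwise arithmetic, and `d2_entry` is a hypothetical kernel for one $D^{(2)}$ entry. Because the Kronecker delta enters as a 0.0 or 1.0 factor and not as a branch, the same function body works for plain floats and for `Vec` arguments.

```python
class Vec:
    """Toy stand-in for a SIMD vector type: several lanes with elementwise arithmetic."""

    def __init__(self, lanes):
        self.lanes = list(lanes)

    def _other(self, other):
        # Broadcast plain numbers to all lanes.
        return other.lanes if isinstance(other, Vec) else [other] * len(self.lanes)

    def __mul__(self, other):
        return Vec(a * b for a, b in zip(self.lanes, self._other(other)))

    __rmul__ = __mul__

    def __sub__(self, other):
        return Vec(a - b for a, b in zip(self.lanes, self._other(other)))

    def __neg__(self):
        return Vec(-a for a in self.lanes)


def d2_entry(Ri, Rj, d2, inv_d5, delta_ij):
    """D2_ij = -(3 Ri Rj - delta_ij d^2) / d^5, with delta_ij passed as 0.0 or 1.0."""
    return -(3.0 * Ri * Rj - delta_ij * d2) * inv_d5


# Two independent interactions, one per vector lane (made-up displacements).
displacements = [(1.0, 2.0, -0.5), (0.5, -1.5, 2.0)]
d2s = [sum(r * r for r in R) for R in displacements]
inv_d5s = [d2 ** -2.5 for d2 in d2s]

# Scalar version: one call per interaction.
scalar = [d2_entry(R[0], R[1], d2, inv, 0.0)
          for R, d2, inv in zip(displacements, d2s, inv_d5s)]

# "Vectorized" version: identical code, only the argument types change.
vector = d2_entry(Vec(R[0] for R in displacements),
                  Vec(R[1] for R in displacements),
                  Vec(d2s), Vec(inv_d5s), 0.0)

print(scalar)
print(vector.lanes)
```

In a compiled implementation the element type would be a real SIMD type, so each arithmetic operation would map to one vector instruction.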


Guideline for Improvements on KNL

Instruction mix requirements:
• KNL can only sustain two instructions per cycle (HSW can do 4)
• KNL has two FP vector pipes
• Every scalar instruction removes one vectorized FP instruction and thereby directly reduces peak performance
• Long chains of AVX512 instructions are needed
• (But AVX512 MOVs are bad as well)

Other:
• Break up dependency chains for better pipeline utilization
• Hyperthreads should decrease performance, due to increased L2 pressure


General Questions and Further Challenges

Can we somehow get rid of storing the expansion as arrays?
• Probably not (according to Dominic)
• Needed to calculate the child expansions
• Actual evaluation at the leaf level

Could we integrate the moments calculation as part of the interactions?
• Likely propagated as part of the upward tree traversal
• If true, impossible for an O(n) FMM

Further Challenges:
• Everything is tightly coupled, changing the data structures requires major refactoring
• Refactoring requires a better understanding of other parts of Octotiger


Flops and Bytes of Boundary Multipole Multipole

Flops for single interaction (ordered as in the function):
• 10 · 2 + 1 = 21
• 3 · 1 = 3
• 157 (set_basis call)
• 6 · 4 = 24
• 10 · 4 = 40
• 3 · (6 · 4 + 1) = 75
• 3 · 10 · 4 = 120
• 6 · (3 · 2 + 1) = 42
• 10 · 1 = 10
• 20 + 3 · 1 = 4

Total: 501 Flop


Cont.

Bytes for single interaction:
• Read: (20 + 3) · 8B = 184B (load multipole, per loop, min 8 reuses)
• Read: (10 + 3) · 8B = 104B (second multipole, per iteration, reuse unclear)
• Read-Write: 2 · (20 + 3) · 8B = 368B (potential and correction, per loop)

(Empirically) at least 8 reuses for the first multipole and the result:

\[
\frac{184\,\mathrm{B}}{8} + 104\,\mathrm{B} + \frac{368\,\mathrm{B}}{8} = 173\,\mathrm{B}
\;\to\; \frac{501\,\mathrm{F}}{173\,\mathrm{B}} = 2.9\,\mathrm{F/B}
\]


Flops and Bytes of Boundary Multipole Monopole

Flops for single interaction (ordered as in the function):
• 3 · 1
• 157 (set_basis call)
• 1 + 6 · 4 + 10 · 4
• 3 · (6 · 4 + 1)
• 3 · 10 · 4
• 4 · 1 + 3 · 1

Total: 427 Flop


Cont.

Bytes for single interaction:
• Read: (20 + 3) · 8B = 184B (load multipole, per loop, min 8 reuses)
• Read: 3 · 8B = 24B (monopole, per iteration, reuse unclear)
• Read-Write: 2 · (3 + 3) · 8B = 96B (potential and correction)

(Empirically) at least 8 reuses for the first multipole and the result:

\[
\frac{184\,\mathrm{B}}{8} + 24\,\mathrm{B} + 96\,\mathrm{B} = 143\,\mathrm{B}
\;\to\; \frac{427\,\mathrm{F}}{143\,\mathrm{B}} = 3.0\,\mathrm{F/B}
\]


Flops and Bytes of Boundary Monopole Multipole

Flops for single interaction (ordered as in the function):
• 1 + 10 · 2 + 3 · 1
• 157 (set_basis call)
• 1 + 10 · 1 + 10 · 1
• 3 · 10 · 3
• 20 + 3

Total: 315 Flop


Cont.

Bytes for single interaction:
• Read: (20 + 3) · 8B = 184B (load multipole, per loop, min 8 reuses)
• Read: (3 + 1 + 10) · 8B = 112B (monopole, per iteration, reuse unclear)
• Read-Write: 2 · (10 + 3) · 8B = 208B (potential and correction, per loop)

(Empirically) at least 8 reuses for the first multipole and the result:

\[
\frac{184\,\mathrm{B}}{8} + 112\,\mathrm{B} + 208\,\mathrm{B} = 343\,\mathrm{B}
\;\to\; \frac{315\,\mathrm{F}}{343\,\mathrm{B}} = 0.9\,\mathrm{F/B}
\]


Flops and Bytes of Boundary Monopole Monopole

Flops for single interaction (ordered as in the function):
• 4 + 4

Total: 8 Flop


Cont.

Bytes for single interaction:
• Read: 3 · 8B = 24B (load multipole, per loop, min 8 reuses)
• Read: 3 · 8B = 24B (monopole, per iteration, reuse unclear)
• Read-Write: 2 · 3 · 8B = 48B (potential and correction, per loop)

(Empirically) at least 8 reuses for the first multipole and the result:

\[
\frac{24\,\mathrm{B}}{8} + 24\,\mathrm{B} + 48\,\mathrm{B} = 75\,\mathrm{B}
\;\to\; \frac{8\,\mathrm{F}}{75\,\mathrm{B}} = 0.1\,\mathrm{F/B}
\]
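The four boundary estimates can be reproduced side by side with a short Python sketch. It uses the Flop totals and byte counts from the slides and follows the slides in dividing the read-write traffic by the reuse factor only in the multipole-multipole case. The helper name and the table layout are illustrative assumptions.

```python
REUSE = 8  # empirical minimum number of reuses of the first multipole


def bytes_per_interaction(first_read, second_read, read_write, rw_reused):
    """Effective bytes per interaction when the first operand is reused REUSE times."""
    rw = read_write / REUSE if rw_reused else read_write
    return first_read / REUSE + second_read + rw


# name: (flops, first read [B], second read [B], read-write [B], read-write reused?)
kernels = {
    "multipole-multipole": (501, 184, 104, 368, True),
    "multipole-monopole": (427, 184, 24, 96, False),
    "monopole-multipole": (315, 184, 112, 208, False),
    "monopole-monopole": (8, 24, 24, 48, False),
}

intensities = {}
for name, (flops, first, second, rw, rw_reused) in kernels.items():
    moved = bytes_per_interaction(first, second, rw, rw_reused)
    intensities[name] = flops / moved
    print(f"{name}: {moved:.0f} B, {intensities[name]:.1f} F/B")
```

Under these numbers only the two multipole-first kernels come close to 3 F/B, while the monopole-monopole kernel performs very few Flops per byte moved.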
