Language Constructs for Data Locality: Moving Policy Decisions from Language Definition to User Space

Brad Chamberlain, Chapel Team, Cray Inc.
PADAL Workshop, Lugano, Switzerland
April 28th, 2014

C O M P U T E           |           S T O R E           |           A N A L Y Z E

Three Language Concepts for Taming Data Locality

Safe Harbor Statement

This presentation may contain forward-looking statements that are based on our current expectations. Forward-looking statements may include statements about our financial guidance and expected operating results, our opportunities and future potential, our product development and new product introduction plans, our ability to expand and penetrate our addressable markets and other statements that are not historical facts. These statements are only predictions and actual results may materially vary from those projected. Please refer to Cray's documents filed with the SEC from time to time concerning factors that could affect the Company and these forward-looking statements.

Copyright 2014 Cray Inc.

Prototypical Next-Gen Processor Technologies

Intel MIC

AMD APU

Nvidia Echelon

Tilera Tile-Gx

Sources: http://download.intel.com/pressroom/images/Aubrey_Isle_die.jpg, http://www.zdnet.com/amds-trinity-processors-take-on-intels-ivy-bridge-3040155225/, http://tilera.com/sites/default/files/productbriefs/Tile-Gx%203036%20SB012-01.pdf, http://insidehpc.com/2010/11/26/nvidia-reveals-details-of-echelon-gpu-designs-for-exascale/

Why do we need data locality control?

Emerging processor designs…
…are increasingly locality-sensitive
…potentially have multiple processor/memory types

Data Locality Control in Current HPC Models

Q: Why are current HPC models lacking w.r.t. data locality?
A: Because they…
…lock key data locality policies into the language
   ●  e.g., array layouts, parallel scheduling
…lack support for users to create new policy abstractions
…expose too much about their target architectures

In Chapel, we’re striving to improve upon this status quo:
“How can we define a language that supports high-level abstractions and enables users to plug in their own implementations?”

What is Chapel?

●  An emerging parallel programming language
●  Design and development led by Cray Inc.
   ●  in collaboration with academia, labs, industry
●  A work-in-progress
●  Goal: Improve productivity of parallel programming

What does “Productivity” mean to you?

Recent Graduate:
“something similar to what I used in school: Python, Matlab, Java, …”

Seasoned HPC Programmer:
“that sugary stuff that I can’t use because I require full control to ensure good performance”

Computational Scientist:
“something that lets me express my parallel computations without having to wrestle with architecture-specific details”

Chapel Team:
“something that lets the computational scientist express what they want, without taking away the control the HPC programmer wants, implemented in a language as attractive as recent graduates want.”

Chapel's Implementation

●  Being developed as open source at SourceForge
●  Licensed as BSD software
●  Portable design and implementation, targeting:
   ●  multicore desktops and laptops
   ●  commodity clusters and the cloud
   ●  HPC systems from Cray and other vendors
   ●  in-progress: exascale-era architectures

Multiresolution Design

Multiresolution Design: Support multiple tiers of features
●  higher levels for programmability, productivity
●  lower levels for greater degrees of control
●  build the higher-level concepts in terms of the lower
●  permit the user to intermix layers arbitrarily

[Figure: stack of Chapel language concepts: Domain Maps, Data Parallelism, Task Parallelism, Base Language, Locality Control, Target Machine]

LULESH in Chapel

1288 lines of source code, plus 266 lines of comments and 487 blank lines
(the corresponding C+MPI+OpenMP version is nearly 4x bigger)

This can be found in Chapel v1.9 in examples/benchmarks/lulesh/*.chpl

LULESH in Chapel

This is the only representation-dependent code. It specifies:
•  data structure choices
   •  structured vs. unstructured mesh
   •  local vs. distributed data
   •  sparse vs. dense materials arrays
•  a few supporting iterators

Data Parallelism in LULESH (Structured)

const Elems = {0..#elemsPerEdge, 0..#elemsPerEdge},
      Nodes = {0..#nodesPerEdge, 0..#nodesPerEdge};

var determ: [Elems] real;

forall k in Elems { …determ[k]… }

[Figure: Elems and Nodes]
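
The structured declarations can be completed into a small runnable program. This is a minimal sketch only: the default of 4 elements per edge, the relation between `nodesPerEdge` and `elemsPerEdge`, and the placeholder value stored in `determ` are assumptions for illustration and are not taken from LULESH.

```chapel
// Assumed sizes for illustration; LULESH derives these from its own inputs.
config const elemsPerEdge = 4;
const nodesPerEdge = elemsPerEdge + 1;

// 2D index sets (domains), declared as on the slide.
const Elems = {0..#elemsPerEdge, 0..#elemsPerEdge},
      Nodes = {0..#nodesPerEdge, 0..#nodesPerEdge};

// One real value per element.
var determ: [Elems] real;

// Each iteration receives one 2D index; iterations may run in parallel.
forall (i, j) in Elems do
  determ[i, j] = i * elemsPerEdge + j;   // placeholder, not a real determinant

writeln("Nodes = ", Nodes);
writeln(determ);
```

Because the loop is written against `Elems` rather than explicit bounds, changing the domain's shape or size does not require touching the loop body.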

Data Parallelism in LULESH (Unstructured)

const Elems = {0..#numElems},
      Nodes = {0..#numNodes};

var determ: [Elems] real;
var elemToNode: [Elems] nodesPerElem*index(Nodes);

forall k in Elems { …determ[k]… }

[Figure: Elems and Nodes]
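
To make the role of `elemToNode` concrete, here is a sketch of the unstructured style on a deliberately tiny mesh. The 1D chain of elements, the value `nodesPerElem = 2`, and the arrays `x` and `length` are assumptions made up for this illustration; LULESH itself uses a 3D mesh with more nodes per element.

```chapel
// Hypothetical 1D "mesh": element e connects node e to node e+1.
config const numElems = 4;
param nodesPerElem = 2;             // must be a param: it sizes a tuple type
const numNodes = numElems + 1;

const Elems = {0..#numElems},
      Nodes = {0..#numNodes};

var x: [Nodes] real;                               // node coordinates
var elemToNode: [Elems] nodesPerElem*index(Nodes); // connectivity table
var length: [Elems] real;                          // one value per element

forall n in Nodes do x[n] = n * 0.5;
forall e in Elems do elemToNode[e] = (e, e + 1);

// Indirect access through the connectivity table, in parallel over elements.
forall e in Elems {
  const (a, b) = elemToNode[e];
  length[e] = x[b] - x[a];          // every length is 0.5 for this mesh
}

writeln(length);
```

The `forall` over `Elems` has the same form as in the structured case; only the declarations and the indirection through `elemToNode` differ.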

Implementing Domains and Arrays

Q: How are domains and arrays implemented?
(distributed or local? distributed how? stored in memory how?)

const Elems = {0..#numElems},
      Nodes = {0..#numNodes};

var determ: [Elems] real;

A: Via Feature #1 (domain maps)…

Domain Maps: Concept

Domain maps are “recipes” that instruct the compiler how to map the global view of a computation…

A = B + alpha * C;

…to the target locales’ memory and processors:

[Figure: the global computation A = B + α·C split into per-locale pieces on Locale 0, Locale 1, and Locale 2]
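
A global-view statement such as `A = B + alpha * C;` can be tried directly. In this sketch the problem size, the value of `alpha`, and the initial contents of `B` and `C` are assumptions chosen for illustration. No domain map is named, so the arrays use the default (local) layout.

```chapel
// Assumed problem size and scalar, overridable on the command line.
config const n = 8,
             alpha = 3.0;

const D = {1..n};          // index set shared by all three arrays
var A, B, C: [D] real;

B = 1.0;                   // whole-array assignment of a scalar
C = 2.0;

// Global-view computation: no explicit loop, indices, or communication.
A = B + alpha * C;         // with the defaults, each element is 1.0 + 3.0 * 2.0

writeln(A);
```

The statement itself says nothing about where the elements live or which tasks compute them; that is the information a domain map supplies.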

LULESH Data Structures (local)

const Elems = {0..#numElems},
      Nodes = {0..#numNodes};

var determ: [Elems] real;

forall k in Elems { … }

No domain map specified ⇒ use default layout
•  current locale owns all indices and values
•  computation will execute using local processors only

[Figure: Elems and Nodes]

LULESH Data Structures (distributed, block)

const Elems = {0..#numElems} dmapped Block(…),
      Nodes = {0..#numNodes} dmapped Block(…);

var determ: [Elems] real;

forall k in Elems { … }

[Figure: Elems and Nodes]

LULESH Data Structures (distributed, cyclic)

const Elems = {0..#numElems} dmapped Cyclic(…),
      Nodes = {0..#numNodes} dmapped Cyclic(…);

var determ: [Elems] real;

forall k in Elems { … }

[Figure: Elems and Nodes]
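
The slides elide the arguments to `Block(…)` and `Cyclic(…)`, so the following is only a sketch under assumptions. It supplies `boundingBox=` and `startIdx=` arguments chosen for illustration, and it uses the `Block` and `Cyclic` spellings from the slides (Chapel 1.x); recent Chapel releases rename these to `blockDist` and `cyclicDist`, so the code may need adjusting there. The arrays `ownerB` and `ownerC` are hypothetical names that record which locale ran each iteration.

```chapel
use BlockDist, CyclicDist;

config const numElems = 8;           // assumed size for illustration
const Space = {0..#numElems};

// Same index set, two different domain maps (arguments are assumptions).
const BlockElems  = Space dmapped Block(boundingBox=Space);
const CyclicElems = Space dmapped Cyclic(startIdx=Space.low);

var ownerB: [BlockElems] int;
var ownerC: [CyclicElems] int;

// The loops are identical; only the declarations above differ.
// Each iteration runs on the locale that owns index k.
forall k in BlockElems  do ownerB[k] = here.id;
forall k in CyclicElems do ownerC[k] = here.id;

writeln("block : ", ownerB);
writeln("cyclic: ", ownerC);
```

When run on a single locale both arrays contain only zeros; on several locales the block version shows contiguous runs of locale IDs while the cyclic version deals indices out round-robin.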

Chapel’s Domain Map Philosophy

1.  Chapel provides a library of standard domain maps
    ●  to support common array implementations effortlessly
2.  Expert users can write their own domain maps in Chapel
    ●  to cope with any shortcomings in our standard library
3.  Chapel’s standard domain maps are written using the same end-user framework
    ●  to ensure that the framework works and works well

[Figure: stack of Chapel language concepts: Domain Maps, Data Parallelism, Task Parallelism, Base Language, Locality Control]

Domain Map Descriptors

Domain Map
   Represents: a domain map value
   Generic w.r.t.: index type
   State: the domain map’s representation
   Typical Size: Θ(1)
   Required Interface:
   ●  create new domains

Domain
   Represents: a domain
   Generic w.r.t.: index type
   State: representation of index set
   Typical Size: Θ(1) → Θ(numIndices)
   Required Interface:
   •  create new arrays
   •  queries: size, members
   •  iterators: serial, parallel
   •  domain assignment
   •  index set operations

Array
   Represents: an array
   Generic w.r.t.: index type, element type
   State: array elements
   Typical Size: Θ(numIndices)
   Required Interface:
   •  (re-)allocation of elements
   •  random access
   •  iterators: serial, parallel
   •  slicing, reindexing, aliases
   •  get/set of sparse “zero” values

Domain Maps Summary

●  Data locality requires mapping arrays to memory well
   ●  distributions between distinct memories
   ●  layouts within a single memory
●  Most languages define a single data layout & distribution
   ●  where the distribution is often the degenerate “everything’s local”
●  Domain maps…
   …move such policies into user-space
   …expose them to the end-user through high-level declarations

   const Elems = {0..#numElems} dmapped Block(…)

Implementing Data Parallel Loops

Q: How are parallel loops implemented?
(how many tasks? executing where? how are iterations divided up?)

forall k in Elems { … }

Q2: What about zippered data parallel operations?
(how to reconcile potentially conflicting parallel implementations?)

forall (k,d) in zip(Elems, determ) { … }
x += xd * dt;

A: Via Feature #2 (leader-follower iterators)…

Leader-Follower Iterators: Definition

●  Chapel defines all forall loops in terms of leader-follower iterators:
   ●  leader iterators: create parallelism, assign iterations to tasks
   ●  follower iterators: serially execute work generated by leader
●  Given…
      forall (a,b,c) in zip(A,B,C) do
        a = b + alpha * c;
   …A is defined to be the leader
   …A, B, and C are all defined to be followers
●  Domain maps support default leader-follower iterators
   ●  specify parallel traversal of a domain’s indices/array’s elements
   ●  typically written to leverage affinity

Writing Leaders and Followers

Leader iterators are defined using task/locality features:

iter BlockArr.lead() {
  coforall loc in Locales do
    on loc do
      coforall tid in 0..#here.numCores do
        yield computeMyChunk(loc.id, tid);
}

Follower iterators simply use serial features:

iter BlockArr.follow(work) {
  for i in work do
    yield accessElement(i);
}

[Figure: stack of Chapel language concepts: Domain Maps, Data Parallelism, Task Parallelism, Base Language, Locality Control, Target Machine]
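
The leader and follower on the slide are schematic. As a sketch of what a complete user-defined pair might look like, the example below defines a standalone iterator with serial, leader, and follower overloads selected by an `iterKind` tag. The names `count` and `chunk` are hypothetical, the single-locale leader is a simplification of the multi-locale one on the slide, and `here.maxTaskPar` is used in place of the slide's `here.numCores` on the assumption that the Chapel release in use provides it.

```chapel
config const n = 10;                 // assumed size for illustration

// Hypothetical helper: the tid-th of numTasks contiguous chunks of 0..n-1.
proc chunk(n: int, tid: int, numTasks: int) {
  const lo = (n * tid) / numTasks,
        hi = (n * (tid + 1)) / numTasks - 1;
  return lo..hi;                     // may be empty when n < numTasks
}

// Serial overload: used by ordinary for loops.
iter count(n: int) {
  for i in 0..#n do yield i;
}

// Leader overload: creates the tasks and hands each one a chunk of work.
iter count(param tag: iterKind, n: int) where tag == iterKind.leader {
  const numTasks = here.maxTaskPar;
  coforall tid in 0..#numTasks do
    yield (chunk(n, tid, numTasks),);   // 1-tuple of zero-based indices
}

// Follower overload: serially executes one chunk produced by a leader.
iter count(param tag: iterKind, n: int, followThis)
  where tag == iterKind.follower {
  const (myIters,) = followThis;
  for i in myIters do yield i;
}

var A: [0..#n] real;

// count() comes first in the zip, so it leads; count() and A both follow.
forall (i, a) in zip(count(n), A) do
  a = i * 2.0;

writeln(A);
```

Swapping the order to `zip(A, count(n))` would make the array's own leader decide the number and placement of tasks instead, without changing the loop body.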

Leader-Follower Summary

●  Data locality requires parallel loops to execute intelligently
   ●  appropriate number and placement of tasks
   ●  good data-task affinity
●  Most languages define fixed parallel loop styles
   ●  where “no parallel loops” is a common choice
●  Leader-follower iterators…
   …move such policies into user-space
   …expose them to the end-user through data parallel abstractions

   forall k in Elems { … }
   x += xd * dt;

OK, but what about those future architectures?

Feature #3 (hierarchical locales)
●  extends multiresolution philosophy to architectural modeling

Traditional Locales

Concept:
●  Traditionally, Chapel has supported a 1D array of locales

[Figure: a row of four locales]

●  Supports inter-node locality well, but not intra-node
   ●  (which, of course, is becoming increasingly important)
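
The 1D array of locales is visible to programs as `Locales`, and on-clauses move execution between its elements. The sketch below uses only these traditional, top-level features; the array `A` and the printed messages are assumptions for illustration, and `here.maxTaskPar` is assumed to be available in the Chapel release in use.

```chapel
// Visit every top-level locale and run one task on each.
coforall loc in Locales do
  on loc do
    writeln("locale ", here.id, " of ", numLocales,
            " can run up to ", here.maxTaskPar, " tasks in parallel");

// An on-clause can also follow data: run wherever A[0] is stored.
var A: [0..#4] real;
on A[0] do
  writeln("A[0] is stored on locale ", here.id);
```

Nothing here can name a particular memory or processor inside a node, which is the gap the intra-node bullet above points to.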

Recent Work: Hierarchical Locales

Concept:
●  Support locales within locales to describe architectural sub-structures within a node (e.g., memories, processors)

[Figure: four locales, each containing sub-locales A and B, with inner components labeled C, C, D, and E]

●  As with top-level locales, on-clauses and domain maps map tasks and variables to sub-locales
●  Locale models are defined using Chapel code

Defining Hierarchical Locales

1)  Define the processor’s abstract block structure

[Figure: one locale containing sub-locales A and B, with inner components labeled C, C, D, and E]

2)  Define how to run a task on any sub-locale
3)  Define how to allocate/access memory on any sub-locale

Hierarchical Locale Summary

●  Data locality requires flexibility w.r.t. future architectures
   ●  due to uncertainty in processor design
   ●  to support portability between approaches
●  Most programming models assume certain features in the target architecture
   ●  this is why MPI/OpenMP/UPC/CUDA/… have restricted applicability
●  Hierarchical Locales
   …move the definition of new architectural models to user-space
   …are exposed to the end-user via Chapel’s traditional locality features

   on loc do
     coforall tid in 0..#here.numCores do

Summary

Chapel’s multiresolution philosophy allows users to write…
…custom array implementations via domain maps
…custom parallel iterators via leader-follower iterators
…custom architectural models via hierarchical locales

The result is a language that moves crucial policies for managing data locality out of the language’s definition and into expert users’ hands…
…while making them available to end-users through high-level abstractions

Why a new language?

Q: Why develop a new language rather than a library or language extension?
A: Because…
…having custom syntax presents policies to the end-user more cleanly
…it exposes optimization opportunities to the compiler
…it helps with rank-independent indexing, array-of-structs vs. struct-of-arrays, …
…these concepts are more difficult to write in a traditional HPC language
   (due to lack of support for features like type inference, iterators, generics, …)

For More Information on…

…domain maps
●  User-Defined Distributions and Layouts in Chapel: Philosophy and Framework [slides], Chamberlain, Deitz, Iten, Choi; HotPar’10, June 2010.
●  Authoring User-Defined Domain Maps in Chapel [slides], Chamberlain, Choi, Deitz, Iten, Litvinov; CUG 2011, May 2011.

…leader-follower iterators
●  User-Defined Parallel Zippered Iterators in Chapel [slides], Chamberlain, Choi, Deitz, Navarro; PGAS 2011, October 2011.

…hierarchical locales
●  Hierarchical Locales: Exposing Node-Level Locality in Chapel, Choi; 2nd KIISE-KOCSEA SIG HPC Workshop talk, November 2013.

Status: all of these concepts are in use in every Chapel program today
(pointers to code/docs in the release available by request)

The Cray Chapel Team (Summer 2013)

[Figure: team photo]

For More Information: Online Resources

Chapel project page: http://chapel.cray.com
●  overview, papers, presentations, language spec, …

Chapel SourceForge page: https://sourceforge.net/projects/chapel/
●  release downloads, public mailing lists, code repository, …

Mailing Aliases:
●  [email protected]: contact the team at Cray
●  [email protected]: announcement list
●  [email protected]: user-oriented discussion list
●  [email protected]: developer discussion
●  [email protected]: educator discussion
●  [email protected]: public bug forum

For More Information: Suggested Reading

Overview Papers:
●  The State of the Chapel Union [slides], Chamberlain, Choi, Dumler, Hildebrandt, Iten, Litvinov, Titus. CUG 2013, May 2013.
   ●  a high-level overview of the project summarizing the HPCS period
●  A Brief Overview of Chapel, Chamberlain (pre-print of a chapter for A Brief Overview of Parallel Programming Models, edited by Pavan Balaji, to be published by MIT Press in 2014).
   ●  a more detailed overview of Chapel’s history, motivating themes, features

Blog Articles:
●  [Ten] Myths About Scalable Programming Languages, Chamberlain. IEEE Technical Committee on Scalable Computing (TCSC) Blog (https://www.ieeetcsc.org/activities/blog/), April-November 2012.
   ●  a series of technical opinion pieces designed to rebut standard arguments against the development of high-level parallel languages

Chapel: the next five years

●  Harden prototype to production-grade
   ●  add/improve lacking features
   ●  optimize performance
●  Target more complex/modern compute node types
   ●  e.g., Intel MIC, CPU+GPU, AMD APU, …
●  Continue to grow the user and developer communities
   ●  including nontraditional circles: desktop parallelism, “big data”
   ●  transition Chapel from Cray-managed to community-governed

Chapel… …is a collaborative effort — join us!

Legal Disclaimer

Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document. Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Cray uses codenames internally to identify products that are in development and not yet publicly announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user. Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, URIKA, and YARCDATA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners. Copyright 2014 Cray Inc.

http://chapel.cray.com

[email protected]
http://sourceforge.net/projects/chapel/
