COTSon: Infrastructure for system-level simulation Ayose Falcón, Paolo Faraboschi, Daniel Ortega HP Labs ― Exascale Computing Lab MICRO-41 tutorial November 9, 2008

http://sites.google.com/site/hplabscotson © 2008 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice

Lake Como, Italy

Core Concepts

• Functional Simulator (SimNow)
  − Sequences the behavioral simulation of CPUs and devices
• Timers
  − Using functional events, they compute the target metrics (time, power)
• Sampler
  − Decides when to turn the Timers on or off, and for how long
• Interleaver
  − Decides how to buffer and reorder functional events (SMP)
• Time Predictor
  − Based on the evolution of timer metrics over time, decides how to feed the information back to the functional simulator


9 November 2008

COTSon: Infrastructure for system-level simulation -- MICRO-41 tutorial

Decoupling Simulation

• Functional Simulation (fast)
  − Emulates the behavior of all the components of our system
    • Disks, video, network cards, etc.
  − Necessary to verify correctness, run software
• Timing Simulation (slow)
  − Models the timing of all the components
  − Used to measure performance (or power)
• COTSon approach: “Functional Directed with sampling and time feedback”

[Figure: the Functional Simulator (device function and software) sends events (instructions, …) to the Timing Simulator (metrics, time and power), which returns time feedback (predicted IPC)]

COTSon Components

[Figure: a COTSon node. SimNow (functional) contains Core 0, Core 1, Northbridge, Southbridge, Memory, NIC, HD 0 and HD 1. Asynchronous events pass through the Interleaver to the CPU and Memory Timer (C0 and C1, each with I$ and D$, plus L2$, Bus and Memory), to the NIC Timer and to the Disk Timers, each with Sampling and with timing feedback to SimNow. COTSon nodes are connected through a Network Mediator with a Network Switch (functional) and a Network Timer.]

Timers (a.k.a. CPU/device models)

• Accept instructions, process them and update metrics
• All timers share the memory hierarchy
• Some “must have” metrics: cycles and instructions
• Pluggable architecture
• Not only CPU models, but also:
  − Profiling
  − Trace generation
  − “Simpoint”-like analysis
• Current models
  − Timer0: simple “linear” model + cache hierarchy
  − Timer1: Timer0 + in-order pipeline
  − Bandwidth: Only limited by memory bandwidth
  − PTLSim (open source): linked to COTSon, full x86 OoO superscalar
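To make the idea of a simple “linear” timer concrete, here is a minimal sketch in Python. It is not the COTSon SDK or the actual Timer0 code: the class name, the base CPI and the miss penalty are all assumptions chosen for illustration. It only shows the contract described above, where a timer accepts instructions and updates the two “must have” metrics, cycles and instructions.

```python
from dataclasses import dataclass


@dataclass
class LinearTimer:
    """Toy linear CPU timer: fixed cost per instruction plus a penalty per cache miss."""
    base_cpi: float = 1.0    # assumed cycles per instruction without stalls
    miss_penalty: int = 100  # assumed cycles lost per cache miss
    cycles: float = 0.0
    instructions: int = 0

    def process(self, n_instructions: int, cache_misses: int) -> None:
        # Update the two mandatory metrics: instructions and cycles.
        self.instructions += n_instructions
        self.cycles += n_instructions * self.base_cpi + cache_misses * self.miss_penalty

    @property
    def ipc(self) -> float:
        return self.instructions / self.cycles if self.cycles else 0.0


timer = LinearTimer()
timer.process(n_instructions=1000, cache_misses=10)
print(timer.cycles, timer.ipc)
```

A real timer would obtain the miss count from the shared memory hierarchy rather than receive it as an argument.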

Samplers

• Decide when and how much to simulate and when to move from one simulation state to another
  − Functional: fast forward to the next state as quickly as possible
  − Warming (simple/detailed): get data in stateful structures (e.g., caches), but do not account for time
  − Simulation: account for time
• Pluggable architecture
• Many implementations
  − SMARTS [1], SimPoint [2], Dynamic Sampling [3], Random, Interval-based, …
• Samplers are what provide the major acceleration component
  − Even for very accurate (hence slow) timing models, a good sampler only needs to invoke the timing model < 1% of the time.

[1] Wunderlich et al. SMARTS: Accelerating Microarchitecture Simulation Via Rigorous Statistical Sampling, ISCA'03
[2] B. Calder. SimPoint (www.cse.ucsd.edu/~calder/simpoint)
[3] A. Falcón et al. Combining Simulation and Virtualization through Dynamic Sampling, ISPASS'07
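The sampler states can be pictured as a small state machine. The sketch below is an interval-based sampler written in Python under stated assumptions: the state names follow the slide, but the schedule lengths and the function names are hypothetical and do not come from COTSon. It shows how little of the instruction stream reaches the timing model when most of each period is spent in the functional state.

```python
from enum import Enum
from itertools import cycle, islice


class State(Enum):
    FUNCTIONAL = "functional"              # fast forward, no timing
    SIMPLE_WARMING = "simple_warming"      # warm stateful structures, no timing
    DETAILED_WARMING = "detailed_warming"
    SIMULATION = "simulation"              # account for time


# Hypothetical schedule: instructions spent in each state per sampling period.
SCHEDULE = [
    (State.FUNCTIONAL, 990_000),
    (State.SIMPLE_WARMING, 5_000),
    (State.DETAILED_WARMING, 3_000),
    (State.SIMULATION, 2_000),
]


def interval_sampler(schedule):
    """Yield (state, length) pairs forever, one phase at a time."""
    return cycle(schedule)


def timed_fraction(schedule):
    """Fraction of instructions that are simulated with timing."""
    total = sum(length for _, length in schedule)
    timed = sum(length for state, length in schedule if state is State.SIMULATION)
    return timed / total


first_five = [state.value for state, _ in islice(interval_sampler(SCHEDULE), 5)]
print(first_five, timed_fraction(SCHEDULE))
```

With this schedule only 0.2% of the instructions are timed, which is in line with the “< 1% of the time” figure quoted above.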

Single CPU simulation

• Fast and accurate single node simulation using Dynamic Sampling
  − Detect program phase changes dynamically
  − The challenge is to avoid disturbing the VM execution in the code cache during fast functional emulation
  − Phase changes are correlated with VM statistics (exceptions, I/O events, code cache invalidations, …), which are easy to get and don’t impact performance

[Figure: IPC and exceptions vs. instructions, with markers 1 to 6]

Dynamic Sampling
A. Falcón, P. Faraboschi, and D. Ortega, “Combining Simulation and Virtualization through Dynamic Sampling”, in Proceedings of ISPASS’07

• Allows users to favor accuracy or speed, depending on their requirements
  − High accuracy: 0.4% error with 8.5x speedup
  − High speed: 309x speedup with 1.9% error
• Fully dynamic
  − Does not require any a priori analysis
  − Automatically detects code phases
• Allows for providing timing feedback to the functional simulator
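One way to picture dynamic phase detection is to watch a cheap VM statistic and flag the intervals where it changes sharply. The Python sketch below does this with a relative-change threshold. It is an illustration only: the function name, the `sensitivity` parameter and the sample data are assumptions, and the published technique may use a different detection rule.

```python
def detect_phase_changes(stat_per_interval, sensitivity=0.5):
    """Return indices of intervals whose statistic differs from the previous
    interval by more than `sensitivity` (relative change)."""
    changes = []
    for i in range(1, len(stat_per_interval)):
        prev, cur = stat_per_interval[i - 1], stat_per_interval[i]
        baseline = max(prev, 1)  # avoid division by zero on quiet intervals
        if abs(cur - prev) / baseline > sensitivity:
            changes.append(i)
    return changes


# Hypothetical exception counts per interval, as reported by the VM.
exceptions = [100, 105, 98, 300, 310, 305, 90, 95]

sensitive = detect_phase_changes(exceptions, sensitivity=0.5)  # favors accuracy
relaxed = detect_phase_changes(exceptions, sensitivity=1.0)    # favors speed
print(sensitive, relaxed)
```

A lower threshold triggers the timing model more often (more accuracy, less speed), and a higher threshold does the opposite.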

Multi-core simulation

• SimNow performs functional simulation of multi-cores
  − It simulates MP as “sequential interleaved” at coarse granularity
  − This misses fine grain memory interactions
• COTSon buffers events and delivers them interleaved to the CPU timing models
• Problem: Hard to scale up → OS? BIOS?

[Figure: SimNow (Core 1 to Core 4) as the simulator front-end feeds the Interleaver, which delivers events 1 2 3 4 to Model CPU 1 to Model CPU 4 and the Interconnect/Memory Model in the simulator back-end]

Interleaving Fundamentals

• MP functional simulation runs “sequentially interleaved” at coarse granularity.
• This may miss fine-grain memory interactions
• We buffer events at every MP quantum and deliver them interleaved to the timers

[Figure: events from CPUs 0, 1 and 2 are buffered and coalesced for each MP quantum, then interleaved based on the CPUs’ IPC and sent to the timing model]
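The buffering and interleaving step can be sketched as a merge of per-CPU event buffers, where each event gets an estimated cycle derived from that CPU's IPC. The Python code below is a sketch under assumptions: the function name, the data layout and the timestamp rule (event index divided by IPC) are hypothetical and are not taken from the COTSon interleaver.

```python
import heapq


def interleave(buffers, ipc):
    """Merge per-CPU event buffers into one stream ordered by estimated cycle.

    buffers: {cpu_id: [event, ...]} collected during one MP quantum
    ipc:     {cpu_id: instructions per cycle} used to space events in time
    """
    def stamped(cpu):
        for i, event in enumerate(buffers[cpu]):
            # Event i of a CPU is assumed to issue at cycle i / IPC.
            yield (i / ipc[cpu], cpu, event)

    streams = [stamped(cpu) for cpu in sorted(buffers)]
    # heapq.merge keeps each stream's order and breaks ties by CPU id.
    return [(cpu, event) for _, cpu, event in heapq.merge(*streams)]


buffers = {0: ["a0", "a1", "a2", "a3"], 1: ["b0", "b1"]}
ipc = {0: 2.0, 1: 1.0}
merged = interleave(buffers, ipc)
print(merged)
```

In this example CPU 0 runs at twice the IPC of CPU 1, so two of its events are delivered for every event of CPU 1.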

Timing Feedback

• Problem: feed back timing information to the functional emulator
  − Give the simulated application an illusion of approximate time (functional time corresponding to simulated time)
• Define the IPC of a quantum based on previous history
  − “Classic” time-series prediction problem, with unknown model
• Current model: simple predictor
  − The IPC is fed back to the functional simulator
  − The application being simulated acts as if execution is faster or slower

[Figure: Emulate (functional) at CPI=1.0, then Simulate (timing) with current CPI=2.0, then Predict CPI from previous observed and predicted CPIs, then Emulate (functional) at CPI=1.8]
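The slides do not say which predictor is used, so the following Python sketch picks one plausible option, an exponentially weighted average of the previous prediction and the newly observed CPI. The class name and the weight `alpha=0.8` are assumptions. The weight was chosen only because it reproduces the numbers in the figure (previous CPI 1.0, observed CPI 2.0, next functional CPI 1.8).

```python
class CpiPredictor:
    """Toy predictor: blend the last prediction with the newly observed CPI."""

    def __init__(self, initial_cpi=1.0, alpha=0.8):
        self.predicted = initial_cpi
        self.alpha = alpha  # assumed weight given to the latest observation

    def update(self, observed_cpi):
        # Move the prediction toward the observation by a fraction alpha.
        self.predicted += self.alpha * (observed_cpi - self.predicted)
        return self.predicted


predictor = CpiPredictor(initial_cpi=1.0)
next_cpi = predictor.update(observed_cpi=2.0)  # value fed back to the functional simulator
print(round(next_cpi, 3))
```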

Many-core simulation
M. Monchiero, J.-H. Ahn, A. Falcón, D. Ortega, and P. Faraboschi, “How to simulate 1000 cores”, dasCMP’08

• Translate SW thread-level into simulated core-level parallelism
  − Identify and separate the instruction streams of the different threads at the OS level (context switches)
  − Dynamically map each instruction flow to the corresponding core of the target multicore architecture, taking into account application-level thread synchronization

[Figure: SimNow (1 core) as the simulator front-end; the thread ID from the guest OS separates Thread 1, Thread 2 and Thread 3 at OS context switches and maps them to Model CPU 1 … Model CPU n and the Interconnect/Memory Model in the simulator back-end]
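The separation of one instruction stream into per-core streams can be sketched as a demultiplexer keyed on the thread ID. The Python code below is a simplified illustration: the function name, the tuple layout and the first-come core assignment are assumptions, and it deliberately ignores the application-level thread synchronization that the technique above takes into account.

```python
from collections import defaultdict


def demux_by_thread(stream, num_cores):
    """Split a single-core instruction stream into per-core streams.

    stream: iterable of (thread_id, instruction); the thread id is assumed to
            be reported by the guest OS at context switches.
    Threads are mapped to model cores in order of first appearance
    (a hypothetical policy chosen for this sketch).
    """
    core_of = {}
    per_core = defaultdict(list)
    for thread_id, instruction in stream:
        if thread_id not in core_of:
            core_of[thread_id] = len(core_of) % num_cores
        per_core[core_of[thread_id]].append(instruction)
    return dict(per_core)


stream = [(101, "i0"), (101, "i1"), (202, "j0"), (101, "i2"), (303, "k0"), (202, "j1")]
cores = demux_by_thread(stream, num_cores=3)
print(cores)
```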

Multi-node simulation

• Simulate a computer cluster as a cluster of full-system simulators
  − Each node of the cluster is simulated with a full-system simulator
  − Network simulator used to simulate network topology
• Problems:
  − Time skew between nodes needs to be controlled with quanta
  − Quantum size must be chosen carefully
    • Small quanta → Bad simulation speed
    • Large quanta → Bad simulation accuracy

Adaptive Synchronization
A. Falcón, P. Faraboschi, and D. Ortega, “An Adaptive Synchronization Technique for Parallel Simulation of Networked Clusters”, in Procs. of ISPASS’08

• Basic idea: dynamically adjust the quantum for maximum speed at a controlled accuracy loss
  − Quantum increases/decreases depending on packet traffic
  − Slow acceleration, fast deceleration (“driving over speed bumps”)

[Figure: packets (axis 0 to 45) and quantum (axis 0 to 1000) over time]
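The “slow acceleration, fast deceleration” rule can be sketched as a small update function. The Python code below is an illustration under assumptions: the function name, the growth factor and the choice to drop straight to the minimum on any traffic are hypothetical, and the quantum bounds simply reuse the 10:1000 range quoted for the network simulation lab, with unspecified units.

```python
def next_quantum(quantum, packets, q_min=10, q_max=1000, growth=1.05):
    """Return the synchronization quantum to use for the next interval."""
    if packets > 0:
        # Fast deceleration: traffic was seen, so fall back to the smallest quantum.
        return q_min
    # Slow acceleration: no traffic, so grow the quantum by a small factor.
    return min(q_max, quantum * growth)


traffic = [0, 0, 0, 5, 0, 0]  # hypothetical packets seen in each quantum
quantum = 10
trace = []
for packets in traffic:
    quantum = next_quantum(quantum, packets)
    trace.append(quantum)
print(trace)
```

The quantum grows gradually while the network is idle and collapses as soon as packets appear, which bounds the time skew between communicating nodes.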

Speed vs. Accuracy Tradeoffs

• We can play the speed vs. accuracy game at several control points
  − Within a node: dynamic sampling sensitivity
  − At cluster level: adaptive quantum range
• By choosing the appropriate values we can reach
  − Single node accuracy on the order of 11%–15% error (simple CPU model)
  − Networking accuracy (microbenchmark) up to 15 Gb/s
  − All of the above with self-relative slowdown (vs. native) of ~15x-30x
• Improvement Areas
  − SMP and cluster validation on larger applications
  − Better CPU models (if needed), especially in the SMP coherency area
  − Distributed simulation sometimes “unstable” for large clusters (> 50 nodes)
  − “Canned recipes” for non-expert users for accuracy/speed requirements

Success stories

• Fault isolation for commodity architectures study
  − “Configurable isolation: building high-availability systems with commodity multi-core processors” (ISCA’07)
  − “Isolation in Commodity Multicore Processors” (IEEE MICRO’07)
• Nanophotonics architecture investigation
  − “Corona: System implications of emerging nanophotonic technology” (ISCA’08)
• Last level cache technologies study (CACTI-D)
  − “A comprehensive memory modeling tool and its application to the design and analysis of future memory hierarchies” (ISCA’08)
• Web 2.0 workload analysis
  − “Microblades and megaservers: system architectures for emerging Web 2.0 / internet workloads” (ISCA’08)
• …and some other internal projects at HP Labs

Putting it all together

[Figure: IPC and Acc. IPC over time of 800 nodes running NAMD, with network traffic]

COTSon Labs

COTSon Labs – Experiments

1. Functional simulation
2. Simple timers
   − dump_to
   − in_order
3. Memory tracer
4. Timing feedback
5. Samplers
   − Random sampling
   − Dynamic sampling
6. Selective tracing
7. Network simulation
8. Disk simulation

Functional simulation (I)

[Figure: cotson-node driven by Lua files and Lua commands]

Functional simulation (II)

• How to start a (deterministic) simulation
  − Send keystrokes to SimNow
  − xtools → using SimNow hacks
  − Network access
  − Pre-started application

Simple timer: “dump_to”

• Use COTSon SDK to create your own timing or sampling module
• Experiment:
  − Instructions from SimNow are disassembled and dumped to a file
  − No time feedback
  − Output fields (disasm)

    pid  tid  cr3  PC  (length)  Opcodes  disasm
    [load|store]  virtual @  physical @  (length)
    [load|store]  virtual @  physical @  (length)

Simple timer: “in-order”

• 3-stage in-order pipeline + cache stalls
• Memory hierarchy in Lua

[Figure: CPU 0 and CPU 1, each with I$, D$ and L2$, connected through a MOESI bus to Memory]

Memory tracer

• Transparent memory
  − Dump to file/display

[Figure: CPU 0 and CPU 1, each with I$, D$ and L2$, connected through the memory tracer to Memory]

Timing feedback

[Figure: with timing feedback, IPC (axis 0 to 2) of CPU 1 and CPU 2 over time (0 to 2000)]

[Figure: without timing feedback, IPC (axis 0 to 1) of CPU 1 and CPU 2 over time (0 to 2000)]

Random sampling

• Sampling states
  − Functional: pre-program IPC
  − Simple Warming: warm caches and branch predictor
  − Detailed Warming: simple warming + warm reorder buffer
  − Simulation: sample, full timing

Dynamic sampling (I)

Dynamic sampling (II)

[Figure: IPC (axis 0 to 2) over time (0 to 2000), series “full” and “dynamic”]

Selective Tracing

• Lets the user determine which application(s) or part(s) of an application running inside SimNow are simulated with timing
• Combined with CR3 tracing, allows the user to skip instructions from the OS or other applications
  − Change in CR3 register = context switch
• Uses SimNow tagging of instructions to communicate data between guest OS and COTSon
  − Via a reserved CPUID instruction

Ex: application instrumentation

    #include "cotson-tracer.h"
    int main(void) {
        COTSON_BEGIN_TRACE (1)
        [benchmark code]
        COTSON_END_TRACE (1)
    }

Ex: OS instrumentation

    $> cotson_tracer.sh begin 1
    $> benchmark1
    $> cotson_tracer.sh end 1
    $> …
    $> cotson_tracer.sh begin 2
    $> benchmark2
    $> cotson_tracer.sh end 2
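On the simulator side, selective tracing amounts to filtering the event stream by trace zone and by CR3. The Python sketch below illustrates that filter only. The event tuples, the function name and the sample addresses are hypothetical; in the real system the begin/end markers reach COTSon through the reserved CPUID instruction rather than as Python tuples.

```python
def select_traced(events, target_cr3):
    """Keep the PCs of instructions that fall inside an open trace zone
    and belong to the address space identified by `target_cr3`.

    events: iterable of ("begin", zone), ("end", zone) or ("insn", cr3, pc)
    """
    active_zones = set()
    selected = []
    for event in events:
        kind = event[0]
        if kind == "begin":
            active_zones.add(event[1])
        elif kind == "end":
            active_zones.discard(event[1])
        elif kind == "insn" and active_zones and event[1] == target_cr3:
            selected.append(event[2])
    return selected


events = [
    ("insn", 0xA000, 0x10),  # before the trace zone: skipped
    ("begin", 1),
    ("insn", 0xA000, 0x14),
    ("insn", 0xB000, 0x90),  # other CR3 (OS or another application): skipped
    ("insn", 0xA000, 0x18),
    ("end", 1),
    ("insn", 0xA000, 0x1C),  # after the trace zone: skipped
]
traced = select_traced(events, target_cr3=0xA000)
print([hex(pc) for pc in traced])
```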

Network simulation

• 4-node cluster, 1 CPU per node
  − NAS benchmarks with mpich2 MPI library
  − Node discovery, MPI boot and five NAS benchmarks (cg, ep, is, lu, mg) with 8 threads
• Simple crossbar switch, 2 Gb/s bandwidth
• 1 Gb/s NICs
• Adaptive quantum synchronization 10:1000

Disk simulation

• Disksim integrated into COTSon
  http://www.pdl.cmu.edu/DiskSim
• Experiment
  − No CPU timing → IPC=1
  − Disk model
    • Seagate Cheetah 4LP → 4.5 GB, 10,033 rpm
