COTSon: Infrastructure for system-level simulation Ayose Falcón, Paolo Faraboschi, Daniel Ortega HP Labs ― Exascale Computing Lab MICRO-41 tutorial November 9, 2008
http://sites.google.com/site/hplabscotson © 2008 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice
Lake Como, Italy
Core Concepts
• Functional Simulator (SimNow)
  − Sequences the behavioral simulation of CPUs and devices
• Timers
  − Using functional events, they compute the target metrics (time, power)
• Sampler
  − Decides when to turn the Timers on or off, and for how long
• Interleaver
  − Decides how to buffer and reorder functional events (SMP)
• Time Predictor
  − Based on how the timer metrics evolve over time, decides how to feed the information back to the functional simulator
Decoupling Simulation
• Functional Simulation (fast)
  − Emulates the behavior of all the components of our system
    • Disks, video, network cards, etc.
  − Necessary to verify correctness, run software
• Timing Simulation (slow)
  − Models the timing of all the components
  − Used to measure performance (or power)
• COTSon approach: “Functional Directed with sampling and time feedback”
[Figure: device function and software drive the Functional Simulator, which sends events (instructions, …) to the Timing Simulator; the Timing Simulator produces metrics (time and power) and returns time feedback (predicted IPC) to the Functional Simulator.]
COTSon Components
[Figure: a COTSon node. SimNow (functional) models Core 0, Core 1, Northbridge, Southbridge, memory, NIC and disks (HD 0, HD 1). Asynchronous events pass through the Interleaver and Sampling to the CPU and Memory Timer (C0 and C1, each with D$ and I$, plus L2$, bus and memory), the NIC Timer and the Disk Timers, and timing feedback flows back to SimNow. COTSon nodes are connected through the Network Mediator, a functional network switch with a Network Timer.]
Timers (a.k.a. CPU/device models)
• Accept instructions, process them and update metrics
• All timers share the memory hierarchy
• Some “must have” metrics: cycles and instructions
• Pluggable architecture
• Not only CPU models, but also:
  − Profiling
  − Trace generation
  − “Simpoint”-like analysis
• Current models
  − Timer0: simple “linear” model + cache hierarchy
  − Timer1: Timer0 + in-order pipeline
  − Bandwidth: only limited by memory bandwidth
  − PTLSim (open source): linked to COTSon, full x86 OoO superscalar
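The timer contract described on this slide (accept instructions, update metrics, always expose cycles and instructions) can be sketched as follows. This is an illustration in Python, not the COTSon SDK: the names `Timer` and `LinearTimer`, the parameters `base_cpi`, `miss_penalty` and `line_bytes`, and the unbounded set used as a "cache" are all assumptions made for the example.

```python
from abc import ABC, abstractmethod


class Timer(ABC):
    """Hypothetical timer interface: consumes instructions, updates metrics."""

    def __init__(self):
        # The two "must have" metrics named on the slide.
        self.cycles = 0.0
        self.instructions = 0

    @abstractmethod
    def process(self, pc, mem_addr=None):
        """Account for one instruction; mem_addr is None for non-memory ops."""


class LinearTimer(Timer):
    """Timer0-like sketch: fixed cost per instruction plus a miss penalty."""

    def __init__(self, base_cpi=1.0, miss_penalty=100.0, line_bytes=64):
        super().__init__()
        self.base_cpi = base_cpi          # assumed cycles per instruction
        self.miss_penalty = miss_penalty  # assumed cycles per cache miss
        self.line_bytes = line_bytes
        self.cached_lines = set()         # toy cache with unlimited capacity

    def process(self, pc, mem_addr=None):
        self.instructions += 1
        self.cycles += self.base_cpi
        if mem_addr is not None:
            line = mem_addr // self.line_bytes
            if line not in self.cached_lines:
                self.cached_lines.add(line)
                self.cycles += self.miss_penalty


timer = LinearTimer()
timer.process(pc=0x1000)                   # no memory access
timer.process(pc=0x1004, mem_addr=0x2000)  # first touch of the line: miss
timer.process(pc=0x1008, mem_addr=0x2008)  # same 64-byte line: hit
print(timer.instructions, timer.cycles)
```

Keeping the interface to a single `process` call plus two counters is what would let a profiler or trace generator plug in at the same point as a CPU model.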
Samplers
• Decide when and how much to simulate, and when to move from one simulation state to another
  − Functional: fast forward to the next state as quickly as possible
  − Warming (simple/detailed): get data into stateful structures (e.g., caches), but do not account for time
  − Simulation: account for time
• Pluggable architecture
• Many implementations
  − SMARTS [1], SimPoint [2], Dynamic Sampling [3], Random, Interval-based, …
• Samplers provide the major acceleration component
  − Even for very accurate (hence slow) timing models, a good sampler only needs to invoke the timing model < 1% of the time
[1] Wunderlich et al., “SMARTS: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling”, ISCA'03
[2] B. Calder, SimPoint (www.cse.ucsd.edu/~calder/simpoint)
[3] A. Falcón et al., “Combining Simulation and Virtualization through Dynamic Sampling”, ISPASS'07
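To see why the sampled fraction dominates overall speed, the average slowdown can be estimated as a weighted mean of the two modes. The sketch below assumes that the "fraction" is the share of instructions sent to the timing model; the slowdown figures are illustrative placeholders, not measurements from COTSon, and the function name is hypothetical.

```python
def effective_slowdown(timed_fraction, timing_slowdown, functional_slowdown):
    """Average slowdown vs. native when only part of the run is timed.

    timed_fraction is the share of instructions sent to the timing model.
    """
    if not 0.0 <= timed_fraction <= 1.0:
        raise ValueError("timed_fraction must be in [0, 1]")
    return (timed_fraction * timing_slowdown
            + (1.0 - timed_fraction) * functional_slowdown)


# Illustrative slowdowns only (not measurements).
TIMING, FUNCTIONAL = 10_000.0, 10.0
for fraction in (1.0, 0.1, 0.01):
    print(fraction, effective_slowdown(fraction, TIMING, FUNCTIONAL))
```

With these placeholder numbers, timing 1% of the instructions brings the average slowdown from 10,000x down to roughly 110x, so the cost of the detailed model matters far less than how rarely it is invoked.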
Single CPU simulation
• Fast and accurate single-node simulation using Dynamic Sampling
  − Dynamically detect program phase changes
  − The challenge is to avoid disturbing the VM execution in the code cache during fast functional emulation
  − Phase changes are correlated with VM statistics (exceptions, I/O events, code cache invalidations, …), which are easy to get and don’t impact performance
[Figure: IPC and exceptions plotted against instructions, with annotated points 1 to 6.]
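The idea of inferring phase changes from cheap VM statistics can be illustrated with a simple relative-change detector. This is a sketch under stated assumptions, not the published Dynamic Sampling algorithm: the function name, the per-interval list of exception counts, and the `sensitivity` threshold are all hypothetical.

```python
def detect_phase_changes(stat_per_interval, sensitivity=0.5):
    """Return indices of intervals whose statistic jumps by more than
    `sensitivity` (relative change) with respect to the previous interval."""
    changes = []
    for i in range(1, len(stat_per_interval)):
        prev, cur = stat_per_interval[i - 1], stat_per_interval[i]
        baseline = max(abs(prev), 1)  # avoid dividing by zero
        if abs(cur - prev) / baseline > sensitivity:
            changes.append(i)
    return changes


# Hypothetical exception counts per interval, with two abrupt shifts.
exceptions = [100, 104, 98, 300, 310, 305, 90, 95]
print(detect_phase_changes(exceptions))  # [3, 6]
```

A sampler built on such a detector would switch the timers on at the flagged intervals and stay in functional mode elsewhere; a lower `sensitivity` favors accuracy, a higher one favors speed.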
Dynamic Sampling
A. Falcón, P. Faraboschi, and D. Ortega, “Combining Simulation and Virtualization through Dynamic Sampling”, in Proceedings of ISPASS’07
• Allows users to favor accuracy or speed, depending on their requirements
  − High accuracy: 0.4% error with 8.5x speedup
  − High speed: 309x speedup with 1.9% error
• Fully dynamic
  − Does not require any a priori analysis
  − Automatically detects code phases
• Allows timing feedback to be provided to the functional simulator
Multi-core simulation
• SimNow performs functional simulation of multi-cores
  − It simulates MP as “sequential interleaved” at coarse granularity
  − This misses fine-grain memory interactions
• COTSon buffers events and delivers them interleaved to the CPU timing models
• Problem: hard to scale up (OS? BIOS?)
[Figure: in the simulator front-end, SimNow (Core 1 to Core 4) feeds the Interleaver; in the simulator back-end, the Interleaver delivers event streams 1 to 4 to Model CPU 1 to Model CPU 4, which share an Interconnect/Memory Model.]
Interleaving Fundamentals
• MP functional simulation runs “sequentially interleaved” at coarse granularity
• This may miss fine-grain memory interactions
• We buffer events at every MP quantum and deliver them interleaved to the timers
[Figure: the event streams of CPUs 0, 1 and 2 are buffered and coalesced at every MP quantum, then interleaved based on the CPUs' IPC and sent to the timing model.]
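Buffering per-CPU events for one quantum and then interleaving them by IPC can be sketched as a timestamped merge. The code below is an assumption-laden illustration, not COTSon's interleaver: the function name, the dictionary layout, and the rule that event k of a CPU completes at cycle (k + 1) / IPC are all hypothetical.

```python
import heapq


def interleave(buffers, ipc):
    """Merge per-CPU event buffers from one MP quantum into a single stream.

    buffers: {cpu_id: [event, ...]} in program order for each CPU.
    ipc:     {cpu_id: instructions per cycle} used to space events in time.
    Returns a list of (cpu_id, event) ordered by estimated cycle.
    """
    def stamped(cpu):
        # Event k of a CPU is assumed to complete at cycle (k + 1) / IPC.
        for k, event in enumerate(buffers[cpu]):
            yield ((k + 1) / ipc[cpu], cpu, event)

    streams = [stamped(cpu) for cpu in sorted(buffers)]
    return [(cpu, event) for _, cpu, event in heapq.merge(*streams)]


buffers = {0: ["a0", "a1", "a2", "a3"], 1: ["b0", "b1"]}
ipc = {0: 2.0, 1: 1.0}
merged = interleave(buffers, ipc)
print(merged)
```

In this example CPU 0 runs at twice the IPC of CPU 1, so two of its events are delivered for every event of CPU 1, while each CPU's own program order is preserved.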
Timing Feedback
• Problem: feed back timing information to the functional emulator
  − Give the simulated application an illusion of approximate time (functional time corresponding to simulated time)
• Define the IPC of a quantum based on previous history
  − “Classic” time-series prediction problem, with unknown model
• Current model: simple predictor
  − The IPC is fed back to the functional simulator
  − The application being simulated acts as if execution is faster or slower
[Figure: emulate (functional) at CPI=1.0; simulate (timing) and observe current CPI=2.0; predict CPI from the previous observed and predicted CPIs; emulate (functional) at CPI=1.8.]
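The slide only says "simple predictor", so the exact rule is not given. One common choice for this kind of time-series problem is an exponentially weighted update, sketched below as an assumption. The function name and the weight of 0.8 are hypothetical; the weight was picked only so that the example reproduces the CPI values shown on the slide (1.0 previous, 2.0 observed, 1.8 next).

```python
def predict_cpi(previous_prediction, observed, weight=0.8):
    """Blend the latest observed CPI with the previous prediction.

    weight is the share given to the new observation (assumed value).
    """
    return previous_prediction + weight * (observed - previous_prediction)


predicted = predict_cpi(previous_prediction=1.0, observed=2.0)
print(round(predicted, 2))
```

A weight close to 1 tracks the timing model quickly but can oscillate between samples; a smaller weight smooths the feedback seen by the functional simulator.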
Many-core simulation
M. Monchiero, J.-H. Ahn, A. Falcón, D. Ortega, and P. Faraboschi, “How to simulate 1000 cores”, dasCMP’08
• Translate SW thread-level into simulated core-level parallelism
  − Identify and separate the instruction streams of the different threads at the OS level (context switches)
  − Dynamically map each instruction flow to the corresponding core of the target multicore architecture, taking into account application-level thread synchronization
[Figure: in the simulator front-end, SimNow (1 core) runs Thread 1, Thread 2 and Thread 3; using the thread ID (from the guest OS) at OS context switches, each thread's instructions are routed to Model CPU 1 … Model CPU n in the simulator back-end, which share an Interconnect/Memory Model.]
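Separating one functional core's instruction stream into per-thread streams at context switches can be sketched as below. This is a simplified illustration, not the method of the cited paper: the class and method names are hypothetical, threads are assigned to model cores in first-seen order, and application-level thread synchronization is ignored.

```python
class ThreadToCoreMapper:
    """Routes the instructions of one functional core to many model cores."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.core_of_thread = {}   # guest thread ID -> model core index
        self.current_thread = None
        self.streams = [[] for _ in range(num_cores)]

    def context_switch(self, thread_id):
        """Called when the guest OS schedules a different thread."""
        if thread_id not in self.core_of_thread:
            if len(self.core_of_thread) >= self.num_cores:
                raise RuntimeError("more threads than model cores")
            self.core_of_thread[thread_id] = len(self.core_of_thread)
        self.current_thread = thread_id

    def instruction(self, instr):
        """Append an instruction to the stream of the running thread's core."""
        if self.current_thread is None:
            raise RuntimeError("no thread scheduled yet")
        core = self.core_of_thread[self.current_thread]
        self.streams[core].append(instr)


mapper = ThreadToCoreMapper(num_cores=2)
mapper.context_switch(101)
mapper.instruction("i0")
mapper.context_switch(202)
mapper.instruction("j0")
mapper.context_switch(101)
mapper.instruction("i1")
print(mapper.streams)
```

Although the two threads ran one after the other on the single functional core, each model core receives a contiguous stream, which the back-end can then simulate as if the threads had run in parallel.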
Multi-node simulation
• Simulate a computer cluster as a cluster of full-system simulators
  − Each node of the cluster is simulated with a full-system simulator
  − Network simulator used to simulate network topology
• Problems:
  − Time skew between nodes needs to be controlled with quanta
  − Quantum size must be chosen carefully
    • Small quanta → bad simulation speed
    • Large quanta → bad simulation accuracy
Adaptive Synchronization
A. Falcón, P. Faraboschi, and D. Ortega, “An Adaptive Synchronization Technique for Parallel Simulation of Networked Clusters”, in Procs. of ISPASS’08
• Basic idea: dynamically adjust the quantum for maximum speed at a controlled accuracy loss
  − Quantum increases/decreases depending on packet traffic
  − Slow acceleration, fast deceleration (“driving over speed bumps”)
[Figure: packets (axis 0 to 45) and quantum (axis 0 to 1000) plotted over time.]
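The "slow acceleration, fast deceleration" rule can be sketched as a multiplicative update clamped to a range. This is an illustration, not the algorithm of the cited paper: the function name and the `grow` and `shrink` factors are assumptions, and the 10 to 1000 range is borrowed from the "10:1000" setting of the network simulation lab in this tutorial (units unspecified there).

```python
def next_quantum(quantum, packets, q_min=10, q_max=1000,
                 grow=1.05, shrink=0.5):
    """Return the quantum for the next synchronization interval.

    No traffic: grow slowly. Any traffic: shrink quickly.
    """
    if packets == 0:
        quantum *= grow
    else:
        quantum *= shrink
    return max(q_min, min(q_max, quantum))


quantum = 10.0
history = []
# Hypothetical traffic: a quiet period, a short burst, then quiet again.
for packets in [0] * 20 + [12, 30, 8] + [0] * 5:
    quantum = next_quantum(quantum, packets)
    history.append(round(quantum, 1))
print(history)
```

The asymmetry is the point of the design: a long quantum is only safe while nodes are not communicating, so it is grown cautiously and abandoned as soon as packets appear.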
Speed vs. Accuracy Tradeoffs
• We can play the speed vs. accuracy game at several control points
  − Within a node: dynamic sampling sensitivity
  − At cluster level: adaptive quantum range
• By choosing the appropriate values we can reach
  − Single node accuracy in the order of 11%–15% error (simple CPU model)
  − Networking accuracy (microbenchmark) up to 15 Gb/s
  − All of the above with self-relative slowdown (vs. native) of ~15x-30x
• Improvement Areas
  − SMP and cluster validation on larger applications
  − Better CPU models (if needed), especially in the SMP coherency area
  − Distributed simulation sometimes “unstable” for large clusters (> 50 nodes)
  − “Canned recipes” for non-expert users for accuracy/speed requirements
Success stories
• Fault isolation for commodity architectures study
  − “Configurable isolation: building high-availability systems with commodity multi-core processors” (ISCA’07)
  − “Isolation in Commodity Multicore Processors” (IEEE MICRO’07)
• Nanophotonics architecture investigation
  − “Corona: System implications of emerging nanophotonic technology” (ISCA’08)
• Last level cache technologies study (CACTI-D)
  − “A comprehensive memory modeling tool and its application to the design and analysis of future memory hierarchies” (ISCA’08)
• Web 2.0 workload analysis
  − “Microblades and megaservers: system architectures for emerging Web 2.0 / internet workloads” (ISCA’08)
• …and some other internal projects at HP Labs
Putting it all together
[Figure: IPC and Acc. IPC over time, and network traffic, of 800 nodes running NAMD.]
COTSon Labs
COTSon Labs – Experiments
1. Functional simulation
2. Simple timers
   − dump_to
   − in_order
3. Memory tracer
4. Timing feedback
5. Samplers
   − Random sampling
   − Dynamic sampling
6. Selective tracing
7. Network simulation
8. Disk simulation
Functional simulation (I)
[Figure: cotson-node instances, each driven by Lua files and Lua commands.]
7 November 2008
Functional simulation (II)
• How to start a (deterministic) simulation
  − Send keystrokes to SimNow
  − xtools using SimNow hacks
  − Network access
  − Pre-started application
Simple timer: “dump_to”
• Use COTSon SDK to create your own timing or sampling module
• Experiment:
  − Instructions from SimNow are disassembled and dumped to a file
  − No time feedback
  − Output fields (disasm):
    pid, tid, cr3, PC, (length), Opcodes, disasm, [load|store], virtual @, physical @, (length), [load|store], virtual @, physical @, (length)
Simple timer: “in-order”
• 3-stage in-order pipeline + cache stalls
• Memory hierarchy in Lua
[Figure: CPU 0 and CPU 1, each with I$, D$ and L2$, connected through a MOESI bus to memory.]
Memory tracer
• Transparent memory
  − Dump to file/display
[Figure: CPU 0 and CPU 1, each with I$, D$ and L2$, connected through the memory tracer to memory.]
Timing feedback
With timing feedback
[Figure: IPC (0 to 2) of CPU 1 and CPU 2 over time (0 to 2000).]
Timing feedback
Without timing feedback
[Figure: IPC (0 to 1) of CPU 1 and CPU 2 over time (0 to 2000).]
Random sampling
• Sampling states
  − Functional: pre-program IPC
  − Simple Warming: warm caches and branch predictor
  − Detailed Warming: simple warming + warm reorder buffer
  − Simulation: sample, full timing
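The four sampling states can be sketched as a schedule generator in which only the length of the functional phase is random. This is a hypothetical illustration, not the COTSon sampler: the names `State` and `random_sampler`, the parameters, and the phase lengths in the example are all assumptions.

```python
import random
from enum import Enum


class State(Enum):
    FUNCTIONAL = "functional"
    SIMPLE_WARMING = "simple_warming"
    DETAILED_WARMING = "detailed_warming"
    SIMULATION = "simulation"


def random_sampler(total, max_functional, warm_simple, warm_detailed,
                   sample, seed=0):
    """Yield (state, instruction_count) phases covering `total` instructions.

    Functional phases get a random length; the other phases are fixed.
    """
    rng = random.Random(seed)
    remaining = total
    while remaining > 0:
        phases = [
            (State.FUNCTIONAL, rng.randint(1, max_functional)),
            (State.SIMPLE_WARMING, warm_simple),
            (State.DETAILED_WARMING, warm_detailed),
            (State.SIMULATION, sample),
        ]
        for state, length in phases:
            length = min(length, remaining)  # truncate at the end of the run
            if length == 0:
                continue
            yield state, length
            remaining -= length


schedule = list(random_sampler(total=1_000_000, max_functional=200_000,
                               warm_simple=10_000, warm_detailed=2_000,
                               sample=1_000))
timed = sum(n for state, n in schedule if state is State.SIMULATION)
print(len(schedule), timed)
```

Only the instructions in the `SIMULATION` state contribute to the reported timing; the two warming states exist so that caches, the branch predictor and the reorder buffer hold realistic contents when each sample starts.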
Dynamic sampling (I)
Dynamic sampling (II)
[Figure: IPC (0 to 2) over time (0 to 2000); series: full, dynamic.]
Selective Tracing
• Lets the user determine which application(s) or part(s) of an application running inside SimNow are simulated with timing
• Combined with CR3 tracing, allows the user to skip instructions from the OS or other applications
  − Change in CR3 register = context switch
• Uses SimNow tagging of instructions to communicate data between guest OS and COTSon
  − Via a reserved CPUID instruction

Ex: application instrumentation
    #include "cotson-tracer.h"
    int main(void) {
      COTSON_BEGIN_TRACE (1)
      [benchmark code]
      COTSON_END_TRACE (1)
    }

Ex: OS instrumentation
    $> cotson_tracer.sh begin 1
    $> benchmark1
    $> cotson_tracer.sh end 1
    $> …
    $> cotson_tracer.sh begin 2
    $> benchmark2
    $> cotson_tracer.sh end 2
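On the simulator side, the combination of begin/end markers and CR3 tracing amounts to a filter over the event stream. The sketch below is a hypothetical Python illustration of that filter, not COTSon code: the tuple-based event format, the function name, and the CR3 values are assumptions.

```python
def select_traced(events, traced_cr3):
    """Keep instructions of one address space between begin/end markers.

    events: iterable of tuples
        ("begin", zone) / ("end", zone)   trace markers from the guest
        ("instr", cr3, pc)                one executed instruction
    """
    open_zones = set()
    selected = []
    for event in events:
        kind = event[0]
        if kind == "begin":
            open_zones.add(event[1])
        elif kind == "end":
            open_zones.discard(event[1])
        elif kind == "instr":
            _, cr3, pc = event
            # A different CR3 value means another address space is running.
            if open_zones and cr3 == traced_cr3:
                selected.append(pc)
    return selected


events = [
    ("instr", 0xA, 0x10),   # before the trace zone: skipped
    ("begin", 1),
    ("instr", 0xA, 0x14),   # traced application: kept
    ("instr", 0xB, 0x90),   # other address space: skipped
    ("instr", 0xA, 0x18),   # traced application again: kept
    ("end", 1),
    ("instr", 0xA, 0x1C),   # after the trace zone: skipped
]
print(select_traced(events, traced_cr3=0xA))  # [20, 24]
```

Only the selected instructions would be handed to the timers; everything else runs in fast functional mode.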
Network simulation
• 4-node cluster, 1 CPU per node
  − NAS benchmarks with mpich2 MPI library
  − Node discovery, MPI boot and five NAS benchmarks (cg, ep, is, lu, mg) with 8 threads
• Simple crossbar switch, 2 Gb/s bandwidth
• 1 Gb/s NICs
• Adaptive quantum synchronization 10:1000
Disk simulation
• DiskSim integrated into COTSon (http://www.pdl.cmu.edu/DiskSim)
• Experiment
  − No CPU timing (IPC=1)
  − Disk model
    • Seagate Cheetah 4LP, 4.5 GB, 10,033 rpm