NVM Heaps for Accelerating Browser-based Applications
Sudarsun Kannan, Ada Gavrilovska, Karsten Schwan (Center for Experimental Research in Computer Systems, Georgia Tech)
Sanjay Kumar (Intel Labs)
Motivation
● Browsers have become an indispensable computing platform
● Used in devices ranging from mobile, to tablets, to PCs
● Rich browser-based client apps with growing processing capabilities
  ● e.g., Google Native Client (NaCl), Intel Parallel JavaScript
● Increasing web application support to access system resources
  ● WebGL support to access GPUs
  ● HTML5 I/O, Native Client Pepper I/O for storage access
● Increasing local data access/storage needs
  ● Simple key-value store
  ● JavaScript (JS) based SQLite interface
  ● Synchronous and asynchronous POSIX I/O for large blobs
Motivation
● Recent studies on poor end-client storage performance blame
  ● Flash storage performance variation across devices
  ● Poor random-write performance of Flash (Kim et al., FAST '12)
  ● So replacing Flash with 100x faster NVM (PCM) should speed up apps?
● Question: Is the problem solved by replacing Flash with NVM?
● Answer: No!
● Reason: Multiple levels of indirection impact storage performance
  ● Specifically: sandboxing overheads for browser I/O
● Contributions:
  ● Use NVM as a persistent heap to reduce sandboxing cost in the browser
  ● Develop appropriate OS and user-level library data structures
Sandboxing
● Isolates applications from the code and data of other applications; needed for untested and untrusted code
● Different methods of sandboxing
  ● Rule-based, virtual machine emulation (Android), static profiling
● Native Client (NaCl)
  ● Sandboxing technology
  ● Allows running native code from a web browser
● Sandboxing methods in Native Client (NaCl)
  ● Inner sandbox - binary validation with static analysis - restricts unsafe instructions
  ● Outer sandbox - system calls intercepted by a trusted region
Sandboxing in NaCl – Multiple Levels of Indirection

[Diagram: I/O path through the NaCl sandbox]
● Untrusted components: HTML & JavaScript → WriteBuff(bytes) → SRPC/IMC → NaCl (.nexe) app → utility libs → fwrite(fd, bytes)
● Stack/context switch into the trusted service runtime: write(secure_fd, bytes)
● User-kernel switch into the OS
● Expensive system call: stack switch + user-kernel switch time
=> Frequent system resource access will affect performance
NaCl Sandboxing I/O Impact

[Chart: Browser I/O vs. Native I/O. Y-axis: time (microsec), up to 14000; X-axis: bytes written; write chunk size 512 bytes.]
Proposed Solution
● Key goal: reduce the multiple levels of indirection
● Expose NVM as a persistent heap rather than block storage
● Applications access the heap through a byte-addressable interface, avoiding frequent user-kernel and stack switches
● Rely on NVM hardware page protection to enforce what untrusted browser applications can access
NVM as a Heap – Reducing Multiple Levels of Indirection

[Diagram: untrusted components (HTML & JavaScript → WriteBuff(bytes) → SRPC/IMC → NaCl (.nexe) app → utility libs) call nvmalloc(bytes, id); the stack/context switch into the trusted service runtime and the user-kernel switch into the OS are off the common data path]
Programming Model

/* NVM persistent allocation */
Image **imgdb = nvmalloc("img_root", size);
for each new image:
    imgdb[cnt] = nvmalloc(NULL, size);
    cnt++;
...
/* persistent read, implicit load of all child ptrs */
img = nvread("img_root", &size);
NVM as a Heap – High-Level Design

[Diagram: Chrome browser (Native Client) with an NVM user lib issuing sys_nvmmap() to a kernel-layer memory manager; DRAM node and NVM node share the LLC and memory bus; the NVM node is divided into persistent and non-persistent regions]
Design - OS Support for NVM Heap
● NVM as a special `node' in a heterogeneous memory system
● Custom Linux-based NVM manager to control page allocation
● Maintains a per-process persistent page tree (metadata)
  ● Page tree loaded during application start/restart
  ● Persistent pages accessed when the application faults on them
● Exports the nvmmap system call to higher layers
● Every nvmmap call results in the creation of a compartment
  ● Compartments are similar to VMA structures and provide isolation among threads
Design - OS Support for NVM Heap
● Every compartment contains an RB page tree
● Application hints whether compartments of threads can be merged
● Provides isolation for browser threads (e.g., main browser and ad threads; inspired by Firefox user allocation)

[Diagram: processes 1-3, each owning one or more compartments; each compartment holds an RB tree of pages. The process id, compartment id, and fault address identify a page; each NVM page carries a 1-bit page flag and a 1-bit flush flag]
Design - User level Support for NVM Heap
● Transitions between NaCl trusted and untrusted components are expensive
● Solution: NVM allocator split across the two components
● Untrusted allocator component:
  ● Provides byte-addressable heap interfaces to applications
    ● nvmalloc(), nvfree(), nvcommit()
  ● Manages the untrusted application's persistent memory state
  ● NVM heap reference obtained from the trusted component
  ● Restricted from making OS system calls directly
    ● e.g., the allocator cannot call nvmmap() directly
Design - User level Support for NVM Heap

[Diagram: the untrusted NaCl app calls nvmalloc(); the user-level NVM allocator crosses an untrusted-trusted context switch into the trusted component, which issues nacl_mmap()/sys_mmap (a user-kernel switch) to the NVM kernel manager and the default DRAM manager. Mapped NVM ranges, separated by guard pages, back thread/app-specific compartments (memory VMAs) and are recorded in a per-app access table:]

Accessible NVM address range | Permission
0x10000 - 0x20000            | Read/Write
0x20000 - 0x25000            | Read
Design - User level Support for NVM Heap
● Trusted allocator component
  ● Provides indirect access to system-level NVM interfaces
  ● Maintains a per-application NVM access region table
    ● Table contains address ranges with different protection levels
  ● Access tables are persistent and identified using unique keys
    ● The same unique keys are supplied by the application across restarts
  ● Handles 'out-of-bound' memory access protection faults
  ● After every map/unmap operation, the address regions in the access region table are updated
Experimental Goals
● Is the storage device primarily responsible for slow browser I/O?
● What is the impact of storage interfaces in a sandboxed environment?
● What are the benefits of treating NVM as a non-volatile heap as opposed to a block storage device?

Methodology:
● Dual-core 1.66 GHz D510 Atom-based development kit
● 2GB DDR2 DRAM, Intel 520 120GB SSD, 1 MB L2 cache
● Pin-based binary instrumentation for NVM load/store analysis
● Hardware counters for NVM access misses (in the paper)
● Currently, we use MACSim-based simulation
Experimental Workloads
1. WebShootbench – open-source NaCl benchmark from Google
  ● Derived from the Computer Language Benchmarks Game
  ● For storage analysis, we use:
    ● Fasta (FS) – generates random DNA sequences
    ● Revcomp (RC) – reverse-complement of a DNA sequence
    ● kNucleotide (KN) – generates a hashtable from a DNA sequence
    ● Spell Check (SC) – WordNet dictionary (16 MB x 4 dictionaries)
2. Snappy Compression
  ● High-performance compression/decompression library
  ● Favors speed over compression ratio
  ● Ported to NaCl in ~2 hours; uses 500 MB of browser cache
Experimental Workloads
3. User Personalization: Email Classifier
  ● Bayesian email classifier with learning data
  ● CMU text learning group dataset for user personalization
    ● Contains 10 newsgroup email categories such as sports, economics, movies, etc.
    ● We randomly choose 100 emails as input
    ● Learning data generated from prior classifications
  ● Extracts feature points from new emails, loads training data, and compares the input feature points against the training data set

Evaluation abbreviations: NV – NVM, RD – RamDisk
Benchmark Analysis – Storage Device Impact

Benchmark    | I/O time (%)
Fasta        | 41.2
Revcomp      | 49.33
kNucleotide  | 12.32
SpellCheck   | 19.89

● Reducing I/O calls in applications can reduce sandboxing overheads substantially
● Benefits due to fast storage alone are relatively small (RD vs. SSD)
Application – Snappy Access Interface Evaluation
● ~2.5x reduction compared to RamDisk
Evaluation – Snappy User-Kernel Transitions
● User-kernel transitions for mmap vs. block I/O
  ● With mmap, every file to be compressed must be mapped, compressed, and unmapped
  ● mmap is a system call; every map/munmap call results in a user-kernel switch
  ● POSIX block I/O calls are library calls; not all of them cause a user-kernel transition
Evaluation – Snappy Stack Switching Overhead
● Why is RD Block slower than RD mmap?
  ● RD Block has fewer user-kernel transitions but higher stack-switching overhead
  ● Stack switching is an expensive operation in sandboxed code
Email Classifier – Impact on Web Page Load Time

[Chart: page load time (ms) vs. number of email categories (2, 4, 8, 12, 16) for RD and NVM]
Summary
● In sandboxed environments like end-client browsers, the impact of software I/O overheads far exceeds the hardware storage cost
● Using NVM as a heap shows:
  ● Up to 2.5x improvement in browser storage performance
  ● Reduced sandboxing impact without compromising security
  ● Gains consistent across most browser workloads
Future Work
● Studying additional applications
  ● e.g., games accessing graphical as well as user data
● Using NVM for browser components like the database and cache
● Addressing sandboxing overheads in Android
Questions / Comments?
Thank You