NVM Heaps for Accelerating Browser-based Applications
Sudarsun Kannan, Ada Gavrilovska, Karsten Schwan – Center for Experimental Research in Computer Systems, Georgia Tech
Sanjay Kumar – Intel Labs

Motivation
● Browsers have become an indispensable computing platform
  ● Used in devices ranging from mobile phones, to tablets, to PCs
● Rich browser-based client applications with growing processing capabilities
  ● e.g., Google Native Client (NaCl), Intel Parallel JavaScript
● Increasing web application support for accessing system resources
  ● WebGL support to access GPUs
  ● HTML5 I/O, Native Client Pepper I/O for storage access
● Increasing local data access/storage needs
  ● Simple key-value stores
  ● JavaScript (JS) based SQLite interface
  ● Synchronous and asynchronous POSIX I/O for large blobs

Motivation
● Recent studies blame poor end-client storage performance on
  ● Flash storage performance variation across devices
  ● Poor random-write performance of Flash (Kim et al., FAST '12)
  ● So replacing Flash with ~100x faster NVM (e.g., PCM) should speed up apps?
● Question: Is the problem solved by replacing Flash with NVM?
● Answer: No!
● Reason: Multiple levels of indirection impact storage performance
  ● Specifically: sandboxing overheads for browser I/O
● Contributions:
  ● Use NVM as a persistent heap to reduce sandboxing cost in the browser
  ● Develop appropriate OS and user-level library data structures

Sandboxing
● Isolates applications from the code and data of other applications; needed for untested and untrusted code
● Different methods of sandboxing
  ● Rule-based, virtual machine emulation (Android), static profiling
● Native Client (NaCl)
  ● Sandboxing technology
  ● Allows running native code from a web browser
● Sandboxing methods in Native Client (NaCl)
  ● Inner sandbox – binary validation with static analysis; restricts unsafe instructions
  ● Outer sandbox – system calls intercepted by the trusted service runtime

Sandboxing in NaCl – Multiple Levels of Indirection
[Figure: browser I/O path through the NaCl sandbox]
● Untrusted components: HTML & JavaScript issue WriteBuff(bytes), which reaches the NaCl (.nexe) app and its utility libs over SRPC/IMC
● The app's fwrite(fd, bytes) causes a stack/context switch into the trusted service runtime, which issues write(secure_fd, bytes) to the OS via a user-kernel switch
● Expensive system call path: stack switch + user-kernel switch time
● => Frequent system resource access will affect performance

NaCl Sandboxing I/O Impact
[Figure: Browser I/O vs. native I/O – write time (microseconds) vs. bytes written, 512-byte write chunks; browser writes take substantially longer than native writes]

Proposed Solution
● Key goal: reduce the multiple levels of indirection
● Expose NVM as a persistent heap rather than block storage
  ● Applications access the heap through a byte-addressable interface, avoiding frequent user-kernel and stack switches
● Rely on NVM hardware page protection to enforce what untrusted browser applications can access

NVM as a Heap – Reducing Multiple Levels of Indirection
[Figure: browser I/O path with the NVM heap]
● Untrusted components: HTML & JavaScript issue WriteBuff(bytes) over SRPC/IMC to the NaCl (.nexe) app and its utility libs, which call nvmalloc(bytes, id) on the NVM heap
● The stack/context switch into the trusted service runtime and the user-kernel switch into the OS are needed only when regions are mapped, not on every data access

Programming Model

/* NVM persistent allocation */
Image **imgdb = nvmalloc(size, "img_root");
for each new image:
    imgdb[cnt] = nvmalloc(size, NULL);
    cnt++;
...
/* persistent read, implicit load of all child ptrs */
img = nvread("img_root", &size);
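For reference, a minimal sketch of the allocator interface implied by the example above; the exact prototypes are assumptions inferred from the slides, not the presented API:

void *nvmalloc(size_t size, const char *id); /* id names a persistent root; NULL for anonymous child objects */
void *nvread(const char *id, size_t *size);  /* reload a named root (and, implicitly, its children) after restart */
void  nvcommit(void *addr, size_t size);     /* make a written range durable */
void  nvfree(void *addr);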

NVM as a Heap – High-Level Design
[Figure: Chrome browser (Native Client) with an NVM user library calling sys_nvmmap() into the kernel layer; the kernel memory manager handles both a DRAM node and an NVM node on the memory bus behind a shared LLC, with the NVM node divided into persistent and non-persistent regions]

Design - OS Support for NVM Heap
● NVM is a special 'node' in a heterogeneous memory system
● Custom Linux-based NVM manager controls page allocation
● Maintains a per-process persistent page tree (metadata)
  ● The page tree is loaded during application restart
  ● Persistent pages are accessed when the application faults on them
● Exports the nvmmap() system call to higher layers
  ● Every nvmmap() call results in the creation of a compartment
  ● Compartments are similar to VMA structures and provide isolation among threads

Design - OS Support for NVM Heap
● Every compartment contains an RB page tree (see the sketch below)
  ● The process id, compartment id, and fault address identify a page in the RB tree
  ● Each NVM page carries 1 page flag bit and 1 flush flag bit
● The application can hint that the compartments of its threads may be merged
● Provides isolation between browser threads (e.g., main browser and ad threads; inspired by Firefox per-thread user allocation)
[Figure: processes 1-3 mapped to compartments 1 and 2, each compartment owning a set of pages]
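A minimal sketch of what a compartment might look like, assuming the Linux red-black tree helpers and the two per-page bits named above; the structure and field names are illustrative, not the actual kernel code:

#include <linux/types.h>
#include <linux/rbtree.h>

/* Illustrative sketch only; names are assumptions, not the real implementation. */
struct nvm_page {
    struct rb_node node;            /* keyed by the faulting virtual address */
    unsigned long  vaddr;
    unsigned int   page_flag  : 1;  /* 1-bit NVM page flag */
    unsigned int   flush_flag : 1;  /* 1-bit flush flag (dirty, needs write-back) */
};

struct nvm_compartment {
    pid_t          owner_pid;       /* process id */
    unsigned long  compartment_id;  /* with pid + fault address, identifies a page */
    struct rb_root page_tree;       /* per-compartment RB tree of persistent pages */
};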

Design - User level Support for NVM Heap
● Transitions between NaCl trusted and untrusted components are expensive
● Solution: the NVM allocator is split across the two components
● Untrusted allocator component:
  ● Provides byte-addressable heap interfaces to applications
    ● nvmalloc(), nvfree(), nvcommit()
  ● Manages the untrusted application's persistent memory state
  ● Obtains its NVM heap reference from the trusted component
  ● Is restricted from making OS system calls directly
    ● e.g., the allocator cannot call nvmmap() directly (see the sketch below)
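A rough sketch of the untrusted side of this split, assuming a hypothetical stub into the trusted runtime (trusted_nvm_map_region() and the constants are illustrative, not the actual interface): the untrusted allocator never issues nvmmap() itself; it asks the trusted component for a mapped region once and then serves byte-addressable allocations out of it locally.

#include <stddef.h>

#define NVM_HEAP_SIZE (64u * 1024 * 1024)   /* illustrative heap size */

/* Hypothetical stub into the trusted service runtime; it performs the
 * actual nvmmap() on the untrusted code's behalf. */
extern void *trusted_nvm_map_region(const char *id, size_t size);

static char  *heap_base;   /* NVM region granted by the trusted component */
static size_t heap_used;

void *nvmalloc(size_t size, const char *id) {
    if (heap_base == NULL) {
        /* one expensive trusted-runtime transition, instead of one per I/O call */
        heap_base = trusted_nvm_map_region(id, NVM_HEAP_SIZE);
        heap_used = 0;
    }
    if (heap_base == NULL || heap_used + size > NVM_HEAP_SIZE)
        return NULL;       /* a real allocator would grow the mapping or fail */
    void *p = heap_base + heap_used;   /* simple bump allocation for illustration */
    heap_used += size;
    return p;
}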

Design - User level Support for NVM Heap
[Figure: the untrusted browser/NaCl app calls nvmalloc() in the user-level NVM allocator; an untrusted-to-trusted context switch reaches the trusted component, which calls nacl_mmap()/sys_mmap via a user-kernel switch into the NVM kernel manager (alongside the default DRAM manager); the mapped NVM range, surrounded by guard regions, is recorded in the per-application access table, and browser thread/app specific compartments (memory VMAs) are created]

Accessible NVM address range    Permission
0x10000 - 0x20000               Read/Write
0x20000 - 0x25000               Read

Design - User level Support for NVM Heap
● Trusted allocator component
  ● Provides indirect access to the system-level NVM interfaces
  ● Maintains a per-application NVM access region table (see the sketch below)
    ● The table contains address ranges with different protection levels
    ● Access tables are persistent and identified by unique keys
    ● The application supplies the same unique key across restarts
  ● Handles 'out-of-bound' memory access protection faults
  ● After every map/unmap operation, the address ranges in the access region table are updated
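A minimal sketch of the per-application access region table described above, matching the example ranges shown earlier; the structure and field names are assumptions, not the actual implementation:

#include <stdint.h>
#include <stddef.h>

/* Illustrative only: one entry per mapped NVM range and its protection level. */
enum nvm_prot { NVM_PROT_READ, NVM_PROT_READ_WRITE };

struct nvm_region_entry {
    uint64_t      start;    /* e.g., 0x10000 */
    uint64_t      end;      /* e.g., 0x20000 */
    enum nvm_prot prot;     /* Read or Read/Write */
};

struct nvm_access_table {
    uint64_t                app_key;       /* unique key the app supplies across restarts */
    size_t                  nentries;
    struct nvm_region_entry entries[64];   /* updated after every map/unmap */
};

/* On an out-of-bound fault, the trusted component can scan the table to
 * decide whether the access falls inside any permitted range. */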

Experimental Goals
● Is the storage device primarily responsible for slow browser I/O?
● What is the impact of storage interfaces in a sandboxed environment?
● What are the benefits of treating NVM as a non-volatile heap as opposed to a block storage device?

Methodology:
● Dual-core 1.66 GHz D510 Atom-based development kit
● 2 GB DDR2 DRAM, Intel 520 120 GB SSD, 1 MB L2 cache
● Pin-based binary instrumentation for NVM load/store analysis
● Hardware counters for NVM access misses (in the paper)
● Currently, we use MACSim-based simulation

Experimental Workloads
1. WebShootbench – open-source NaCl benchmark from Google
  ● Derived from the Computer Language Benchmarks Game
  ● For the storage analysis, we use
    ● Fasta (FS) – generates random DNA sequences
    ● Revcomp (RC) – reverse-complement of a DNA sequence
    ● kNucleotide (KN) – builds a hash table from a DNA sequence
    ● SpellCheck (SC) – WordNet dictionary (16 MB x 4 dictionaries)
2. Snappy compression
  ● High-performance compression/decompression library
  ● Prefers speed over compression size
  ● Ported to NaCl in ~2 hours; uses 500 MB of browser cache

Experimental Workloads
3. User personalization: email classifier
  ● Bayesian email classifier with learning data
  ● CMU text learning group dataset for user personalization
    ● Contains 10 newsgroup email categories such as sports, economics, movies, etc.
    ● We randomly choose 100 emails as input
    ● Learning data generated from prior classifications
  ● Extracts feature points from new emails, loads the training data, and compares the input feature points against the training data set
● Evaluation abbreviations: NV – NVM, RD – RamDisk

Benchmark Analysis – Storage Device Impact

Benchmark       I/O time (%)
Fasta           41.2
Revcomp         49.33
kNucleotide     12.32
SpellCheck      19.89

● Reducing I/O calls in applications can reduce sandboxing overheads substantially
● The benefit from fast storage alone is relatively small (RD vs. SSD)

Application – Snappy Access Interface Evaluation
[Figure: Snappy I/O time by access interface]
● ~2.5x reduction compared to RamDisk

Evaluation – Snappy User-Kernel Transitions
● User-kernel transitions for mmap vs. block I/O (see the sketch below)
  ● With mmap, every file to be compressed needs to be mapped, compressed, and unmapped
  ● mmap is a system call: every map/munmap call results in a user-kernel switch
  ● POSIX block I/O operations are library calls; not every call causes a user-kernel transition
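To make the contrast concrete, a simplified sketch of the two access paths (illustrative, not the benchmark's actual code): the mmap path pays at least one mmap and one munmap system call per file, while buffered POSIX I/O goes through the C library and only crosses into the kernel when the library's buffer needs refilling.

#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* mmap path: mmap() and munmap() are system calls, so every file
 * compressed costs at least two user-kernel transitions. */
static void compress_via_mmap(const char *path, size_t len) {
    int fd = open(path, O_RDONLY);
    void *src = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);  /* syscall */
    /* ... run compression over src ... */
    munmap(src, len);                                            /* syscall */
    close(fd);
}

/* Buffered POSIX path: fread() is a library call; the C library batches
 * reads, so many calls are served from user-space buffers. */
static void compress_via_stdio(const char *path, char *buf, size_t chunk) {
    FILE *f = fopen(path, "rb");
    size_t n;
    while ((n = fread(buf, 1, chunk, f)) > 0) {
        /* ... compress the n bytes just read ... */
    }
    fclose(f);
}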

Evaluation – Snappy Stack Switching Overhead
● Why is RD Block slower than RD mmap?
  ● RD Block incurs fewer user-kernel transitions but higher stack-switching overhead
  ● Stack switching is an expensive operation in sandboxed code

Email Classifier – Impact on Web Page Load Time
[Figure: page load time (ms) vs. number of email categories (2, 4, 8, 12, 16), comparing RD and NVM]

Summary
● In sandboxed environments like end-client browsers,
  ● the impact of software I/O overheads >> the hardware storage cost
● Using NVM as a heap shows
  ● up to 2.5x improvement in browser storage performance
  ● reduced sandboxing impact without compromising security
  ● gains that are consistent across most browser workloads

Future Work
● Studying additional applications
  ● e.g., games accessing graphical as well as user data
● Using NVM for browser components such as the database and cache
● Addressing sandboxing overheads in Android

Questions / Comments?

Thank You
