Parallel fread() and data.table update
Bay Area R User Group Meetup, 11 April 2017, Google
Matt Dowle
H2O.ai Machine Intelligence

Overview
data.table::fread() is now parallel and available in dev; please test and report problems.
Highlights from recent versions

Try dev 1.10.5
install.packages("data.table", type = "source", repos = "http://Rdatatable.github.io/data.table")
Windows binary on AppVeyor. See Installation.

Not new. Prior art.

h2o.importFile: 3 years
spark-csv: 2 years
Python’s Paratext: 1 year

verbose=TRUE
44GB, 872k rows x 12,875 columns

10,000-row sample at 100 jump points
Mean & sd of the sample => nrow estimate

Reads straight from file to final result via tiny buffers in a single pass.
If any out-of-sample type exceptions occur, just those columns are automatically reread.
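
The row-count estimate can be illustrated with a rough base R sketch. This is not fread's actual C implementation: the helper name estimate_nrow is hypothetical, it samples only the first rows of the file rather than 100 jump points, and it assumes a one-line header and "\n" line endings.

```r
# Hypothetical helper: estimate the number of data rows in a CSV from the
# mean byte length of a sample of lines. Simplified: samples only the head
# of the file, whereas the slide describes sampling at 100 jump points.
estimate_nrow <- function(path, n_sample = 1000L) {
  file_bytes <- file.size(path)
  lines <- readLines(path, n = n_sample + 1L)       # header + sample rows
  header_bytes <- nchar(lines[1L], type = "bytes") + 1L
  line_bytes <- nchar(lines[-1L], type = "bytes") + 1L  # +1 for the "\n"
  m <- mean(line_bytes)
  s <- sd(line_bytes)
  list(mean = m, sd = s,
       nrow_est = ceiling((file_bytes - header_bytes) / m))
}

# Small demonstration on a temporary file with 100 identical data rows.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")                       # binary mode keeps "\n" as is
writeLines(c("a,b", rep("1,2", 100L)), con, sep = "\n")
close(con)
est <- estimate_nrow(tmp, n_sample = 10L)
print(est$nrow_est)
```

A larger sd relative to the mean makes the estimate less certain, which is presumably why both statistics are reported.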

WIP. Not done yet. Still dev.

data.table::fwrite so fast that it was deemed on par with binary and included.
Should now be faster.

Beware of cache when benchmarking
First timing longest (OS reads from device)
free -g (OS cached file in RAM)
sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'
Aside: HD has cache too (burst vs sustained)

sudo lshw -class disk
sudo hdparm -t /dev/sda
HD 150MB/s ; SSD 800MB/s ; NVMe 3GB/s

R CMD INSTALL ~/data.table_1.10.4.tar.gz   # CRAN
perfbar & htop
system.time(fread("~/X3e8_2c.csv",verbose=TRUE))
# 5.6GB file
# 27s first time, 23s second time.
# Awful! CPU not IO bound.

R CMD INSTALL ~/data.table_1.10.5.tar.gz   # dev
system.time(fread("~/X3e8_2c.csv",verbose=TRUE))
# 7s first time (5.6GB file size / 800MB/s SSD speed == 7s)
# 3.5s second time
free -g
system("sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'")
free -g
system.time(fread("~/X3e8_2c.csv",verbose=TRUE))
# 7s first time, 3.5s second time
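
The "5.6GB / 800MB/s == 7s" arithmetic generalises to a quick lower bound on cold-cache read time for any device. A minimal sketch, assuming decimal units (1GB = 1000MB) and the sustained speeds quoted on the previous slide; the function name io_bound_seconds is hypothetical:

```r
# Hypothetical helper: best-case seconds to read a file from a cold cache,
# i.e. the time the device needs at its sustained transfer rate.
io_bound_seconds <- function(file_gb, device_mb_per_s) {
  (file_gb * 1000) / device_mb_per_s
}

# Sustained speeds as quoted in the talk (HD, SSD, NVMe), in MB/s.
speeds <- c(HD = 150, SSD = 800, NVMe = 3000)
bounds <- io_bound_seconds(5.6, speeds)
print(round(bounds, 1))
```

A first-run timing close to this bound suggests the read is IO bound; a timing well above it, like the 27s measured with 1.10.4 on the SSD, suggests it is CPU bound.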

Arun’s update, Jan 2017 Amsterdam

Highlights
Already on CRAN:
No longer need with=FALSE
setkey() partially parallel
keyby= much faster than by=
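
A short illustration of two of these highlights, assuming a recent data.table from CRAN is installed. The table DT and its columns are made up for the example, and it shows usage only, not the speed difference:

```r
library(data.table)

# Made-up example data.
DT <- data.table(g = c("b", "a", "b", "a"), x = 1:4, y = 5:8)

# A character vector of column names in j selects those columns;
# with=FALSE is no longer needed for this.
sel <- DT[, c("x", "y")]

# by= returns groups in order of first appearance (b, then a).
by_res <- DT[, .(total = sum(x)), by = g]

# keyby= returns groups sorted by g and sets g as the key of the result.
keyby_res <- DT[, .(total = sum(x)), keyby = g]

print(sel)
print(by_res)
print(keyby_res)
```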
