Beyond Hive – Pig and Python

What is Pig?

• Pig performs a series of transformations to data relations based on Pig Latin statements

• Relations are loaded using schema-on-read semantics to project table structure at runtime

• You can run Pig Latin statements interactively in the Grunt shell, or save a script file and run them as a batch

• A relation is an outer bag
  – A bag is a collection of tuples
  – A tuple is an ordered set of fields
  – A field is a data item

• A field can contain an inner bag

• A bag can contain tuples with nonmatching schema

{ (a, 1) (b, 2) (c, 3) }

{ (d, {(4, 5), (6, 7)}) }

{ (e) (f, 8, 9) }
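The nesting described above can be mimicked with ordinary Python containers. This is only an illustrative sketch: lists stand in for bags and Python tuples for Pig tuples, and the field_counts helper is hypothetical, not part of Pig.

```python
# Illustrative only: Python lists stand in for Pig bags, Python tuples for Pig tuples.

# A relation (outer bag) whose tuples all share one schema.
relation_a = [("a", 1), ("b", 2), ("c", 3)]

# A tuple whose second field is an inner bag.
relation_b = [("d", [(4, 5), (6, 7)])]

# A bag whose tuples do not share a schema.
relation_c = [("e",), ("f", 8, 9)]


def field_counts(bag):
    """Return the number of fields in each tuple of a bag (hypothetical helper)."""
    return [len(t) for t in bag]


print(field_counts(relation_a))  # [2, 2, 2]
print(field_counts(relation_c))  # [1, 3]
```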

What kinds of things can I do with Pig?

2013-06-01,12
2013-06-01,14
2013-06-01,16
2013-06-02,9
2013-06-02,12
2013-06-02,9
...

-- Load comma-delimited source data
Readings = LOAD '/weather/data.txt' USING PigStorage(',') AS (date:chararray, temp:long);

-- Group the tuples by date
GroupedReadings = GROUP Readings BY date;

-- Get the average temp value for each date grouping
GroupedAvgs = FOREACH GroupedReadings GENERATE group, AVG(Readings.temp) AS avgtemp;

-- Ungroup the dates with the average temp
AvgWeather = FOREACH GroupedAvgs GENERATE FLATTEN(group) as date, avgtemp;

-- Sort the results by date
SortedResults = ORDER AvgWeather BY date ASC;

-- Save the results in the /weather/summary folder
STORE SortedResults INTO '/weather/summary';

2013-06-01    14.00
2013-06-02    10.00
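For readers who find it easier to follow the pipeline in a general-purpose language, here is a rough in-memory Python equivalent of the load, group, average and sort steps, using the six sample rows above. It is a sketch for illustration only; the function name average_by_date is an assumption, and Pig itself would run these steps as distributed jobs.

```python
from collections import defaultdict

# Sample rows in the same comma-delimited layout as the source data above.
raw = """2013-06-01,12
2013-06-01,14
2013-06-01,16
2013-06-02,9
2013-06-02,12
2013-06-02,9"""


def average_by_date(text):
    """Mirror LOAD -> GROUP -> AVG -> ORDER for small in-memory data."""
    # LOAD ... USING PigStorage(',') AS (date:chararray, temp:long)
    readings = []
    for line in text.splitlines():
        date, temp = line.split(",")
        readings.append((date, int(temp)))

    # GROUP Readings BY date
    grouped = defaultdict(list)
    for date, temp in readings:
        grouped[date].append(temp)

    # FOREACH ... GENERATE group, AVG(Readings.temp), then ORDER ... BY date ASC
    return sorted(
        (date, sum(temps) / float(len(temps))) for date, temps in grouped.items()
    )


summary = average_by_date(raw)
for date, avgtemp in summary:
    print("%s\t%.2f" % (date, avgtemp))
```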

Common Pig Latin Operations

• LOAD

• GROUP

• FILTER

• FLATTEN

• FOREACH … GENERATE

• LIMIT

• ORDER

• DUMP

• JOIN

• STORE

• Pig generates Map and Reduce operations from Pig Latin

• Jobs are generated on:
  – DUMP
  – STORE

Readings = LOAD '/weather/data.txt' USING PigStorage(',') AS (date, temp:long);
GroupedReadings = GROUP Readings BY date;
GroupedAvgs = FOREACH GroupedReadings GENERATE group, AVG(Readings.temp) AS avgtemp;
AvgWeather = FOREACH GroupedAvgs GENERATE FLATTEN(group) as date, avgtemp;
SortedResults = ORDER AvgWeather BY date ASC;
STORE SortedResults INTO '/weather/summary';  -- Job generated here
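The idea that statements only build up a plan, and that work starts at DUMP or STORE, can be illustrated with a toy Python class. This is not how Pig is implemented; the Relation class and its transform and store methods are hypothetical names chosen for the sketch.

```python
class Relation(object):
    """Toy stand-in for a Pig relation: records steps, runs nothing until stored."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def transform(self, func):
        # Like a Pig Latin statement: extend the plan, do no work yet.
        return Relation(self.source, self.steps + (func,))

    def store(self):
        # Like STORE or DUMP: only now is the whole plan executed.
        rows = list(self.source)
        for step in self.steps:
            rows = step(rows)
        return rows


calls = []  # records which steps have actually run


def keep_warm(rows):
    calls.append("filter")
    return [r for r in rows if r[1] >= 12]


def sort_by_date(rows):
    calls.append("order")
    return sorted(rows)


readings = Relation([("2013-06-02", 9), ("2013-06-01", 12), ("2013-06-01", 16)])
plan = readings.transform(keep_warm).transform(sort_by_date)
ran_before_store = list(calls)  # still empty: nothing has executed yet
result = plan.store()
print(ran_before_store, calls, result)
```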

How do I run a Pig script?

1. Save a Pig Latin script file

2. Run the script using Pig

pig wasb:///scripts/myscript.pig

3. Consume the results using any Azure storage client
  – For example, Excel or Power BI
  – Default output does not include schema, just data
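Because the stored files carry data but no column names, a consumer has to supply the schema itself. The sketch below assumes the results were stored with Pig's default PigStorage format, which is tab-delimited, and that a part file has already been downloaded; the column names and the read_pig_output helper are hypothetical.

```python
import csv
import io

# Hypothetical column names: the stored files have no header row,
# so the consumer supplies the names.
COLUMNS = ("date", "avgtemp")


def read_pig_output(stream, columns=COLUMNS):
    """Parse tab-delimited Pig output lines into dictionaries keyed by column name."""
    reader = csv.reader(stream, delimiter="\t")
    return [dict(zip(columns, row)) for row in reader if row]


# In-memory stand-in for a downloaded part file from /weather/summary.
sample = io.StringIO(u"2013-06-01\t14.0\n2013-06-02\t10.0\n")
rows = read_pig_output(sample)
print(rows[0]["date"], float(rows[0]["avgtemp"]))
```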

What are UDFs?

• User-Defined Functions (UDFs) extend the capabilities of Hive and Pig

• Simpler than writing custom MapReduce components

• Can be implemented using many languages, for example:
  – Java
  – C#
  – Python

(Diagram: a Hive SELECT… query and a Pig LOAD… script each call a UDF)

Python is a (relatively) simple scripting language, ideal for UDFs
  – Intuitive syntax
  – Dynamic typing
  – Interpreted execution

x = 1
while x < 11:
    print(x)
    x = x + 1

Python is pre-installed on HDInsight clusters
  – Python 2.7 supports streaming from Hive
  – Jython (a Java implementation of Python) has native support in Pig

How do I use a Python UDF in Pig?

Pig natively supports Jython
  – Define the output schema as a Pig bag
  – Declare a Python function that receives an input parameter from Pig
  – Return results as fields based on the output schema

@outputSchema("result: {(a:chararray, b:int)}")
def myfunction(i):
    ...
    return a, b

Use the Pig FOREACH…GENERATE statement to invoke a UDF

REGISTER 'wasb:///scripts/myscript.py' using jython as myscript;
src = LOAD '/data/source' AS (row:chararray);
res = FOREACH src GENERATE myscript.myfunction(row);
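A filled-in UDF might look like the sketch below. The function name parse_reading and the "date,temp" input layout are assumptions for illustration, and the schema string follows the bag form used in the skeleton above. Pig supplies the outputSchema decorator when a script is registered using jython; the fallback definition exists only so that the sketch can also be run and tested under plain Python.

```python
# Pig injects an outputSchema decorator when a script is registered "using jython".
# This fallback exists only so the sketch also runs under plain Python for testing.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("result: {(date:chararray, temp:int)}")
def parse_reading(row):
    """Split a 'date,temp' string into two typed fields (hypothetical UDF)."""
    date, temp = row.split(",")
    return date, int(temp)


print(parse_reading("2013-06-01,12"))
```

With the script registered as myscript, it would be invoked from Pig Latin as myscript.parse_reading(row) inside a FOREACH…GENERATE statement.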

How do I use a Python UDF in Hive?

Hive exchanges data with Python using a streaming technique
  – Rows from Hive are passed to Python through STDIN
  – Processed rows from Python are passed to Hive through STDOUT

import sys

line = sys.stdin.readline()
...
print(processed_row)

Use the Hive TRANSFORM statement to invoke a UDF

add file wasb:///scripts/myscript.py;

SELECT TRANSFORM (col1, col2, col3)
  USING 'python myscript.py'
  AS (col1 string, col2 int, col3 string)
FROM mytable
ORDER BY col1;
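A streaming script for the three-column query above could be structured as follows. This is a sketch under stated assumptions: Hive's default TRANSFORM behaviour of sending columns as tab-separated text, and a made-up transformation (upper-casing col1 and adding 1 to col2). The names process_line and run are hypothetical. An in-memory stream stands in for STDIN so the example can be run anywhere.

```python
import io
import sys


def process_line(line):
    """Turn one tab-delimited input row into one tab-delimited output row."""
    col1, col2, col3 = line.rstrip("\n").split("\t")
    # Hypothetical transformation: upper-case col1, add 1 to col2, keep col3.
    return "\t".join([col1.upper(), str(int(col2) + 1), col3])


def run(stdin, stdout):
    # Hive writes each row to the script's STDIN and reads results from STDOUT.
    for line in stdin:
        if line.strip():
            stdout.write(process_line(line) + "\n")


# On the cluster the script would end with: run(sys.stdin, sys.stdout)
# Here an in-memory stream stands in for the rows Hive would send.
fake_in = io.StringIO(u"abc\t1\tx\ndef\t41\ty\n")
fake_out = io.StringIO()
run(fake_in, fake_out)
print(fake_out.getvalue())
```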

©2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Office, Azure, System Center, Dynamics and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
