An Introduction to BigQuery (in less than 10 minutes) brought to you by

The ISB Cancer Genomics Cloud

This is what you should see the first time you go to the BigQuery Web UI at bigquery.cloud.google.com At the top of the left panel are three buttons: • Compose Query • Query History • Job History

Beneath these buttons is your project space. Since it’s your first visit, there are no datasets. Finally you’ll see public datasets that you may have access to. Initially you will see a few datasets that Google has made public. Next, we’ll show you how to make the ISB-CGC datasets appear here for easy access.

In this screen-shot, this particular project has no datasets of its own. Your project might look different. In order to see datasets that are owned by another project but might be publicly-accessible, you need to “add” that project to your BigQuery view. (You’ll only need to do this once – next time you come back to BigQuery they will already be there.) Click the down arrow icon next to your project name, select “Switch to project”, and then “Display project…”

Enter isb-cgc into the pop-up window, and click OK

Now you can see six ISB-CGC datasets (arranged alphabetically):

• • • • • •

ccle_201602_alpha genome_reference platform_reference tcga_201510_alpha tcga_cohorts tcga_seq_metadata

You can expand any of the datasets by clicking on the right arrow icon next to the dataset name, to see a list of tables in that dataset. For example, the ISB-CGC “genome_reference” dataset currently contains the following tables:

• • • • • •

GENCODE_r19 GO_Annotations GO_Ontology Kaviar_160113_Public_hg19 miRBase_v20 miRTarBase

Let’s take a closer look at the GENCODE_r19 table. Select it from the left panel and click.

In the main “workspace” portion of the BigQuery Web UI you will see the “Table Details” for the table you just selected. The table Schema shows the name of each field (column) in the table, the data type (STRING, INTEGER, etc), mode (REQUIRED or NULLABLE), and the field description.

The table Details shows you the table Description and additional information including the table ID (this is how you will refer to it in a SQL query), the table size, number of rows, creation- and lastmodified-times, and data location.

Finally, the Preview allows you see to see and scroll through the table contents without having to explicitly do a query.

This is equivalent to the following SQL query: SELECT * FROM [isb-cgc:genome_reference.GENCODE_r19]

Now let’s try a query. You can click on the “Query Table” button in the main panel or in the “Compose Query” button in the upper left corner.

If you’re following on in your own browser, cut and paste this SQL into the New Query text area: SELECT feature, gene_type, COUNT(*) AS n FROM [isb-cgc:genome_reference.GENCODE_r19] GROUP BY feature, gene_type ORDER BY n DESC

3 2

6

4

1

Before we continue, we’d like to highlight some of the features in the BigQuery Web UI: 1. As you type your query into the Query Editor, the “query validator” is automatically running, and will show you either a green check mark or a red exclamation point. You can click on either of these to see more information about your query. 2. Format Query will “pretty print” your SQL. 3. To go beyond SQL, power users can toggle between the Query Editor and the UDF Editor and write custom user-defined functions in JavaScript.

5

4. The panes are resizable, so if want to be able to see more of a long query you can drag the sash handle down.

5. You can toggle between a Table-view or JSON when viewing results. 6. Once you have the green light from the query validator, click the red Run Query button.

When you click the Run Query button, your query is submitted to a massively parallel engine (and the Run Query button becomes a Cancel Query button.) A timer will indicate how long the query has been running, until it completes (or until it encounters an error that the query validator was not able to catch).

Once the query completes successfully, the results are immediately shown in the lower pane. 55.2 MB of data were processed in 4 seconds, and we can see that the most common type of feature in GENCODE is “exon”, followed by “CDS” etc

Here is another example query, which asks for information about genes on chr17 between positions 7000000 and 8000000. This query processed 176 MB in just 2.2 seconds, returning 89 genes. A word about BigQuery costs. The owner of a table is charged for the cost of the storage, and this GENCODE table costs about 7 cents per year to store. The person who runs a query gets charged the cost of the query. For most queries, this charge is based on how much data is “scanned” to respond to the query. This means only columns that are directly referenced in the query count towards the cost. This particular query, which processed 176 MB of data would cost less than one cent (if you’ve already used up your free $5 worth of queries this month).

BigQuery is a massively parallel engine which distributes your query across hundreds or thousands of “workers” and can scan terabytes of data in seconds. The Explanation feature shows you how your query was broken down into a series of stages, the relative amount of time spent waiting / reading / computing / writing by the “workers”, and the number of input and output rows at each stage. This information can help you optimize your query.

What Next? The ISB-CGC BigQuery datasets include TCGA data from six different platforms, and other genome- and platform-reference tables. We’re continuously adding to these resources and welcome your feedback. You can also easily upload your own data to BigQuery and analyze it side-by-side with the TCGA data.

The ISB-CGC platform includes an interactive Web App, over a Petabyte of TCGA data in Google Genomics and Cloud Storage, and tutorials and code examples on GitHub to get you started. Documentation for the ISB-CGC platform and Google Genomics can be found on readthedocs.

An Introduction to BigQuery - GitHub

The ISB-CGC platform includes an interactive Web App, over a Petabyte of TCGA data in Google Genomics and Cloud Storage, and tutorials and code ...

4MB Sizes 6 Downloads 473 Views

Recommend Documents

Introduction to Algorithms - GitHub
Each cut is free. The management of Serling ..... scalar multiplications to compute the 100 50 matrix product A2A3, plus another. 10 100 50 D 50,000 scalar ..... Optimal substructure varies across problem domains in two ways: 1. how many ...

Introduction to R - GitHub
Nov 30, 2015 - 6 Next steps ... equals, ==, for equality comparison. .... invoked with some number of positional arguments, which are always given, plus some ...

Introduction To DCA - GitHub
Maximum-Entropy Probability Model. Joint & Conditional Entropy. Joint & Conditional Entropy. • Joint Entropy: H(X,Y ). • Conditional Entropy: H(Y |X). H(X,Y ) ...

An introduction to pplex and the Simplex Method - GitHub
Nov 16, 2012 - include: simple command line interface, visualization (two variables), file input in ... program is brought into the following form, called a dictionary in [2]: ζ. = x + ..... [7] http://campuscgi.princeton.edu/~rvdb/JAVA/pivot/simple

OWL 2 Profiles: An Introduction to Lightweight Ontology ... - GitHub
The three ontology language standards are sublanguages of OWL DL that are restricted in ways ... expert knowledge in a formal way, and as a logical language, it can be used to draw conclusions from ..... We call such features syntactic sugar.

Introduction to phylogenetics using - GitHub
Oct 6, 2016 - 2.2 Building trees . ... Limitations: no model comparison (can't test for the 'best' tree, or the 'best' model of evolution); may be .... more efficient data reduction can be achieved using the bit-level coding of polymorphic sites ....

Introduction to Fluid Simulation - GitHub
upon the notes for a Siggraph course on Fluid Simulation[Bridson. 2007]. I also used .... “At each time step all the fluid properties are moved by the flow field u.

122COM: Introduction to C++ - GitHub
All students are expected to learn some C++. .... Going to be learning C++ (approved. ). ..... Computer Science - C++ provides direct memory access, allowing.

Introduction to NumPy arrays - GitHub
www.scipy-lectures.org. Python. Matplotlib. SciKits. Numpy. SciPy. IPython. IP[y]:. Cython. 2015 ..... numbers and determine the fraction of pairs which has ... origin as a function of time. 3. Plot the variance of the trajectories as a function of t

Introduction to NumPy arrays - GitHub
we want our code to run fast. ▷ we want support for linear algebra ... 7. 8 a[0:5] a[5:8]. ▷ if step=1. ▷ slice contains the elements start to stop-1 .... Indexing and slicing in higher dimensions. 0. 8. 16. 24. 32. 1. 9. 17. 25. 33. 2. 10. 18.

Introduction to Framework One - GitHub
Introduction to Framework One [email protected] ... Event Management, Logging, Caching, . ... Extend framework.cfc in your Application.cfc. 3. Done. (or in the ... All controllers are passed the argument rc containing the request.context, and all v

introduction - GitHub
warehouse to assemble himself. Pain-staking and time-consuming... almost like building your own base container images. This piggy purchased high- quality ...

Introduction - GitHub
software to automate routine labor, understand speech or images, make diagnoses ..... Shaded boxes indicate components that are able to learn from data. 10 ...... is now used by many top technology companies including Google, Microsoft,.

Introduction - GitHub
data. There are many ways to learn functions, but one particularly elegant way is ... data helps to guard against over-fitting. .... Gaussian processes for big data.

Introduction - GitHub
For the case that your PDF viewer does not support this, there is a list of all the descriptions on ...... 10. Other Formats. 10.1. AMS-TEX. AMS-TEX2.0. A macro package provided by the American .... A TeX Live port for Android OS. Based on ...

Introduction - GitHub
them each year. In an aggregate travel demand model, this would be represented as 100/365.25 = 0.2737851 trucks per day. In the simulation by contrast, this is represented as ... based on the distance traveled (Table 3.3). 2FAF3 Freight Traffic Analy

An Inside Look at Google BigQuery Cloud Platform
an interactive response time of 10 seconds in most cases. 223,163,387. Here ... Tracking install data for applications in the Android Market. • Crash reporting for ...

Course: Introduction to Intelligent Transportation Systems - GitHub
... Introduction to Intelligent Transportation Systems. University of Tartu, Institute of Computer Science. Project: Automatic Plate Number. Recognition (APNR).

Introduction to REST and RestHUB - GitHub
2. RestHUBанаRESTful API for Oracle DB querying. 2.1. Overview. RestHub was designed .... For example we want to create a simple HTML + Javascript page.

A Beginner's Introduction to CoffeeKup - GitHub
the buffer, then calls the title function which adds it s own HTML to the buffer, and ... Now it is starting to look like real HTML you d find on an ugly web page. 2 ...

Introduction to RestKit Blake Watters - GitHub
Sep 14, 2011 - Multi-part params via RKParams. RKParams* params = [RKParams paramsWithDictionary:paramsDictionary];. NSData* imageData .... This is typically configured as a secondary target on your project. // Dump your seed data out of your backend

Introduction to Scientific Computing in Python - GitHub
Apr 16, 2016 - 1 Introduction to scientific computing with Python ...... Support for multiple parallel back-end processes, that can run on computing clusters or cloud services .... system, file I/O, string management, network communication, and ...

Emscripten: An LLVM-to-JavaScript Compiler - GitHub
May 14, 2013 - Emscripten, or (2) Compile a language's entire runtime into ...... html.) • Poppler and FreeType: Poppler12 is an open source. PDF rendering ...