Reading in data

Programming in R for Data Science Anders Stockmarr, Kasper Kristensen, Anders Nielsen

Data Import

R can import data in many ways. Packages exist that handle import from other software systems and file formats, such as

- Excel;
- plain text files;
- SAS;
- SPSS;
- Stata;
- etc.

The issues you must attend to are in most cases similar, although Excel may present specific problems. We shall look at import from plain text files.

Package installation

For your specific data type, find the relevant package and install it:

- Open the R GUI;
- Click on the ’packages’ tab;
- Choose the package to install;
- Load the package into R with the library() function.

The package Hmisc contains functions that handle import from SPSS. Once installed, the package contents can be loaded into R (made available to the R system) with the function call

    > library(Hmisc)
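Packages can also be installed from the R console instead of through the GUI. The following is a minimal sketch, assuming internet access and a configured CRAN mirror; the helper name use_package is hypothetical and exists only for this illustration.

```r
# Hypothetical helper (not part of R): install a package if it is missing,
# then attach it so that its functions become available.
use_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from CRAN; needs internet access
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# 'stats' ships with R, so this call never triggers a download.
# For SPSS import you would call use_package("Hmisc") instead.
use_package("stats")
```

Checking with requireNamespace() first avoids reinstalling a package every time the script runs.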

Reading data from a text file

Frequently data is collected in white space separated columns, where the first line indicates the variable names:

    x1 x2 x3
    2 0.3 0.01
    2 1.0 0.11
    3 2.1 0.04
    3 2.2 0.02
    1 0.1 0.10
    1 0.2 0.06

- The function read.table() is designed to read this format:

    > mydat <- read.table("c:/datadir/filename.dat", header = TRUE)

- The data frame mydat now contains

    > mydat
      x1  x2   x3
    1  2 0.3 0.01
    2  2 1.0 0.11
    3  3 2.1 0.04
    4  3 2.2 0.02
    5  1 0.1 0.10
    6  1 0.2 0.06
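To try this without a file of your own, the example data can be written to a temporary file first. This is a sketch under that assumption; the temporary file stands in for c:/datadir/filename.dat.

```r
# Write the example data to a temporary file so the sketch is self-contained.
datafile <- tempfile(fileext = ".dat")
writeLines(c("x1 x2 x3",
             "2 0.3 0.01",
             "2 1.0 0.11",
             "3 2.1 0.04",
             "3 2.2 0.02",
             "1 0.1 0.10",
             "1 0.2 0.06"), datafile)

# header = TRUE tells read.table() that the first line holds variable names.
mydat <- read.table(datafile, header = TRUE)

str(mydat)         # shows the column types chosen by read.table()
summary(mydat$x2)  # columns are accessed by the names from the header line

unlink(datafile)   # remove the temporary file
```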

The R working directory

R has a working directory: the default location where it stores its workspace and looks for files. You can locate the working directory with the ’get working directory’ command,

    > getwd()
    [1] "C:/datadir"

The working directory can be changed with the ’set working directory’ command:

    > setwd("c:/otherdatadir")
    > getwd()
    [1] "C:/otherdatadir"

For files stored in the working directory or its subfolders, you can just specify the path relative to the working directory when reading them. Example:

- If the data is located in the ’Data’ folder in your working directory, write

    mydat <- read.table("Data/filename.mydat", header = TRUE)
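A small sketch of relative paths in practice. It uses a temporary directory as a stand-in for a project folder, and the file name small.dat is made up for the illustration.

```r
# Build a throwaway project folder with a 'Data' subfolder (assumed layout).
project <- file.path(tempdir(), "project")
dir.create(file.path(project, "Data"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("x1 x2", "1 0.5", "2 1.5"),
           file.path(project, "Data", "small.dat"))

old <- setwd(project)                  # setwd() returns the previous directory
rel <- file.path("Data", "small.dat")  # path relative to the working directory
file.exists(rel)                       # check before reading
mydat <- read.table(rel, header = TRUE)
setwd(old)                             # restore the original working directory
```

Saving the value returned by setwd() makes it easy to restore the previous working directory afterwards.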

The read.table() function

- The read.table() function has a lot of optional arguments:

    > args(read.table)
    function (file, header = FALSE, sep = "", quote = "\"'", dec = ".",
        numerals = c("allow.loss", "warn.loss", "no.loss"), row.names,
        col.names, as.is = !stringsAsFactors, na.strings = "NA",
        colClasses = NA, nrows = -1, skip = 0, check.names = TRUE,
        fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE,
        comment.char = "#", allowEscapes = FALSE, flush = FALSE,
        stringsAsFactors = default.stringsAsFactors(), fileEncoding = "",
        encoding = "unknown", text, skipNul = FALSE)
    NULL

  The defaults shown depend on the R version; from R 4.0.0 the default is stringsAsFactors = FALSE.

- Some of the important ones are:
  - header: Is the first line variable names or not?
  - sep: What character is used to separate the columns?
  - dec: What character is used as decimal separator?
  - nrows: How many rows do we want to read?
  - na.strings: What string represents a missing value?
  - skip: How many lines to skip before starting to read?
  - comment.char: What character marks the start of a comment, so that the rest of the line is ignored?
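The arguments listed above can be combined in a single call. The sketch below uses made-up data passed through the text argument, so no file is needed; the column names and values are purely illustrative.

```r
# Hypothetical raw data: two lines of notes, semicolon separated columns,
# decimal comma, "-" for missing values and '%' starting a comment line.
raw <- c("exported 2020",
         "units: cm and kg",
         "id;height;weight",
         "1;170,5;65,2",
         "% measurement repeated",
         "2;-;80,1",
         "3;182,0;77,4")

dat <- read.table(text = raw,
                  skip = 2,            # ignore the two lines of notes
                  header = TRUE,       # next line holds the variable names
                  sep = ";",           # columns are separated by semicolons
                  dec = ",",           # decimal comma
                  na.strings = "-",    # "-" marks a missing value
                  comment.char = "%",  # lines starting with % are comments
                  nrows = 2)           # read only the first two data rows
dat
```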

read.table() example 1

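Consider the data file Data/testdat1.dat. The introductory text and the empty line together occupy the first five lines of the file, so the header a b c is on line six:

    This file has a bit of text and an empty line before the data
    a b c
    1 2 3
    4 5 6
    and then some more text at the end

    > dat <- read.table("Data/testdat1.dat", header = TRUE, skip = 5, nrows = 2)
    > dat
      a b c
    1 1 2 3
    2 4 5 6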
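A self-contained sketch of this example follows. The exact line breaks of the introductory text are an assumption: any layout with five lines before the header behaves the same way.

```r
# Assumed layout: four lines of text plus one empty line before the header.
testfile <- tempfile(fileext = ".dat")
writeLines(c("This file has",
             "a bit of text",
             "and an empty line",
             "before the data",
             "",
             "a b c",
             "1 2 3",
             "4 5 6",
             "and then some more text at the end"), testfile)

# skip = 5 jumps over the introduction, nrows = 2 stops before the final text.
dat <- read.table(testfile, header = TRUE, skip = 5, nrows = 2)
dat

# Without nrows the trailing text line is read as data and the call fails.
res <- try(read.table(testfile, header = TRUE, skip = 5), silent = TRUE)
inherits(res, "try-error")

unlink(testfile)
```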

read.table() example 2

Now, look at the data file

    A B C
    1 2 3
    4 3,2 2
    1 5 .
    ; below this line is the extended data
    5 4 6

    > dat <- read.table("Data/testdat2.dat", header = TRUE, na.strings = ".",
    +                   comment.char = ";", dec = ",")
    > dat
      A   B  C
    1 1 2.0  3
    2 4 3.2  2
    3 1 5.0 NA
    4 5 4.0  6
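The same call can be tried without a file by passing the lines through the text argument. This is a sketch; the variable name raw is only for illustration.

```r
# The contents of the data file, one element per line.
raw <- c("A B C",
         "1 2 3",
         "4 3,2 2",
         "1 5 .",
         "; below this line is the extended data",
         "5 4 6")

dat <- read.table(text = raw, header = TRUE, na.strings = ".",
                  comment.char = ";", dec = ",")

sapply(dat, class)           # column types after conversion
colMeans(dat, na.rm = TRUE)  # the NA in column C has to be handled explicitly
```

The decimal comma in 3,2 makes the whole column B numeric, and the single "." in column C becomes NA rather than turning the column into text.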

Variants of read.table()

- Other functions which are useful for reading data frames from files are:
  - read.csv(): comma separated, dot as decimal point
  - read.csv2(): sep=";" and dec=","
  - read.fwf(): fixed width format
- Additional arguments are similar to those of read.table().

read.csv() and read.csv2() are adapted to Excel tables saved as csv files. Which one you need depends on your system’s regional settings; the machine used here has Western European regional settings, for which the Excel output matches read.csv2().
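A short sketch comparing the three variants on made-up data. The table contents and column names are hypothetical.

```r
# The same hypothetical table in the two common csv dialects.
csv_dot   <- c("name,value", "a,1.5", "b,2.25")
csv_comma <- c("name;value", "a;1,5", "b;2,25")

d1 <- read.csv(text = csv_dot)     # sep = ",", dec = "."
d2 <- read.csv2(text = csv_comma)  # sep = ";", dec = ","
all.equal(d1, d2)                  # compare the two results

# Fixed width format: columns are defined by character positions.
fwf_file <- tempfile(fileext = ".txt")
writeLines(c("AB123", "CD456"), fwf_file)
d3 <- read.fwf(fwf_file, widths = c(2, 3), col.names = c("code", "number"))
unlink(fwf_file)
d3
```

If a csv file read with the wrong variant ends up as a single column, or with numbers read as text, switching between read.csv() and read.csv2() is usually the first thing to try.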

Reading text files from Excel

How to read in a table from Excel in text format:

- Access the sheet in your Excel file where your table is;
- Save the active sheet in csv (MS-DOS) format;
- Read in the table with read.csv2().

Saving in other text formats works as well, just use the appropriate reader function.

Reading from more complicated files

- scan() can be a little tricky to use, but is very flexible.
- Its simplest use is to read a file of numbers into a vector. With a file scantest.txt containing

    4.141593 5.141593 6.141593 7.141593 8.141593

    > vec <- scan("scantest.txt")
    > vec
    [1] 4.141593 5.141593 6.141593 7.141593 8.141593
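The flexibility of scan() comes mainly from its what argument, which sets the type of the values to read. A minimal sketch, using a temporary file in place of scantest.txt and made-up text for the other calls:

```r
scanfile <- tempfile(fileext = ".txt")
writeLines("4.141593 5.141593 6.141593 7.141593 8.141593", scanfile)

# Default: what = double(), so the file is read as a numeric vector.
vec <- scan(scanfile, quiet = TRUE)

# what = "" reads character strings instead of numbers.
words <- scan(text = "A B C", what = "", quiet = TRUE)

# A list as 'what' reads records with fields of different types.
rec <- scan(text = "a 1 b 2", what = list(name = "", value = 0), quiet = TRUE)

unlink(scanfile)
```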

- readLines() reads entire lines. With a file readlinestest.txt containing

    A B C
    1.324654 2.324654 3.324654 4.324654 5.324654
    How many roads

    > vec <- readLines("readlinestest.txt")
    > vec
    [1] "A B C"
    [2] "1.324654 2.324654 3.324654 4.324654 5.324654"
    [3] "How many roads"
    > strsplit(vec[2], " ")
    [[1]]
    [1] "1.324654" "2.324654" "3.324654" "4.324654" "5.324654"

    > as.numeric(strsplit(vec[2], " ")[[1]])
    [1] 1.324654 2.324654 3.324654 4.324654 5.324654
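Splitting on a single space works for this file but breaks if the values are separated by several spaces or tabs. The sketch below shows a more tolerant split; the extra spaces in the data are added here on purpose and are not part of the original file.

```r
txtfile <- tempfile(fileext = ".txt")
writeLines(c("A B C",
             "1.324654  2.324654 3.324654   4.324654 5.324654",
             "How many roads"), txtfile)

txt <- readLines(txtfile)
unlink(txtfile)

# Split on one or more whitespace characters instead of exactly one space.
fields <- strsplit(trimws(txt[2]), "\\s+")[[1]]
nums <- as.numeric(fields)

# scan() on the same line gives the numbers in one step.
nums2 <- scan(text = txt[2], quiet = TRUE)
```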

File connections

- File connections can open a file for reading different sections in different ways. Consider:

    > f1 <- file("readlinestest.txt", open = "r")
    > scan(f1, what = "", nlines = 1)
    [1] "A" "B" "C"
    > scan(f1, what = double(), nlines = 1)
    [1] 1.324654 2.324654 3.324654 4.324654 5.324654
    > readLines(f1)
    [1] "How many roads"
    > close(f1)
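The same steps can be wrapped in a function that always closes the connection, even if reading fails. This is a sketch; the helper name read_sections is hypothetical and a temporary file stands in for readlinestest.txt.

```r
confile <- tempfile(fileext = ".txt")
writeLines(c("A B C",
             "1.324654 2.324654 3.324654 4.324654 5.324654",
             "How many roads"), confile)

# Hypothetical helper: read a line of names, a line of numbers and the rest.
read_sections <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con))  # close the connection even if an error occurs
  header <- scan(con, what = "", nlines = 1, quiet = TRUE)
  values <- scan(con, what = double(), nlines = 1, quiet = TRUE)
  rest   <- readLines(con)  # everything after the first two lines
  list(header = header, values = values, rest = rest)
}

parts <- read_sections(confile)
unlink(confile)
parts
```

Because the connection stays open between calls, each read continues where the previous one stopped.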
