Basics of R Programming

Yanchang Zhao
http://www.RDataMining.com

Tutorial on Machine Learning with R
The Melbourne Data Science Week 2017

1 June 2017

Quiz
- Have you ever used R before?
- Are you familiar with data mining and machine learning algorithms?
- Have you used R for data mining and analytics in your work?

Outline
- Introduction to R
- RStudio
- Data Objects
- Control Flow
- Parallel Computing
- Functions
- Pipe Operations
- Data Import and Export
- Online Resources

What is R?
- R [1] is a free software environment for statistical computing and graphics.
- R can be easily extended with 10,000+ packages available on CRAN [2] (as of May 2017).
- Many other packages are provided on Bioconductor [3], R-Forge [4], GitHub [5], etc.
- R manuals on CRAN [6]
  - An Introduction to R
  - The R Language Definition
  - R Data Import/Export
  - ...

[1] http://www.r-project.org/
[2] http://cran.r-project.org/
[3] http://www.bioconductor.org/
[4] http://r-forge.r-project.org/
[5] https://github.com/
[6] http://cran.r-project.org/manuals.html

Why R?
- R is widely used in both academia and industry.
- R was ranked #1 in the KDnuggets 2016 poll on Top Analytics, Data Science software [7] (in fact, R was ranked #1 every year from 2011 to 2016).
- The CRAN Task Views [8] provide collections of packages for different tasks.
  - Machine learning & statistical learning
  - Cluster analysis & finite mixture models
  - Time series analysis
  - Multivariate statistics
  - Analysis of spatial data
  - ...

[7] http://kdnuggets.com/2016/06/r-python-top-analytics-data-mining-data-science-software.html
[8] http://cran.r-project.org/web/views/

RStudio [9]
- An integrated development environment (IDE) for R
- Runs on various operating systems such as Windows, Mac OS X and Linux
- RStudio project, with suggested folders
  - code: source code
  - data: raw data, cleaned data
  - figures: charts and graphs
  - docs: documents and reports
  - models: analytics models

[9] https://www.rstudio.com/products/rstudio/
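The suggested folder layout can be set up from R itself. The following is a minimal sketch, assuming the project lives under a temporary directory; `project_dir` is a placeholder name that does not appear in the slides.

```r
# Hypothetical project root (an assumption); replace with your own project directory.
project_dir <- file.path(tempdir(), "my_project")

# Folder names follow the suggested layout above.
folders <- c("code", "data", "figures", "docs", "models")

for (f in folders) {
  # recursive = TRUE also creates project_dir itself if it does not exist yet;
  # showWarnings = FALSE keeps reruns quiet when a folder is already there.
  dir.create(file.path(project_dir, f), recursive = TRUE, showWarnings = FALSE)
}

# Check that every folder now exists.
folders_ok <- dir.exists(file.path(project_dir, folders))
print(folders_ok)
```

Running the loop a second time is harmless, because existing folders are left untouched.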

RStudio Keyboard Shortcuts
- Run current line or selection: Ctrl + Enter
- Comment / uncomment selection: Ctrl + Shift + C
- Clear console: Ctrl + L
- Reindent selection: Ctrl + I

Data Types and Structures
- Data types
  - Integer
  - Numeric
  - Character
  - Factor
  - Logical
- Data structures
  - Vector
  - Matrix
  - Data frame
  - List
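As a quick illustration of the types and structures listed above, the sketch below builds one small value of each kind and inspects it with `class()` and the `is.*()` checks. The object names (`v`, `m`, `d`, `l`) are chosen here for illustration only.

```r
# One example value per data type
class(5L)                    # "integer"
class(3.14)                  # "numeric"
class("abc")                 # "character"
class(factor(c("a", "b")))   # "factor"
class(TRUE)                  # "logical"

# One example object per data structure
v <- c(1.5, 2.5, 3.5)                           # vector: all elements share one type
m <- matrix(1:6, nrow = 2)                      # matrix: two-dimensional, one type
d <- data.frame(id = 1:2, name = c("a", "b"))   # data frame: columns may differ in type
l <- list(v, m, d)                              # list: elements can be any objects

is.vector(v)       # TRUE
is.matrix(m)       # TRUE
is.data.frame(d)   # TRUE
is.list(l)         # TRUE
```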

Vector

## integer vector
x <- 1:10
print(x)
##  [1]  1  2  3  4  5  6  7  8  9 10

## numeric vector, generated randomly from a uniform distribution
y <- runif(5)
y
## [1] 0.5923091 0.6782441 0.5266127 0.1358263 0.7433572

## character vector
(z <- c("abc", "d", "ef", "g"))
## [1] "abc" "d"   "ef"  "g"

Matrix

## create a matrix with 4 rows, from a vector of 1:20
m <- matrix(1:20, nrow = 4, byrow = TRUE)
m
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
## [3,]   11   12   13   14   15
## [4,]   16   17   18   19   20

## matrix subtraction
m - diag(nrow = 4, ncol = 5)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    2    3    4    5
## [2,]    6    6    8    9   10
## [3,]   11   12   12   14   15
## [4,]   16   17   18   18   20
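Beyond creation and subtraction, a matrix is typically indexed by row and column. The sketch below reuses the same 4 x 5 matrix to show indexing, transposition and matrix multiplication; the name `p` is introduced here for illustration.

```r
m <- matrix(1:20, nrow = 4, byrow = TRUE)

dim(m)       # 4 5 (rows, columns)
m[2, 3]      # 8: the element in row 2, column 3
m[2, ]       # 6 7 8 9 10: the whole second row
m[, 5]       # 5 10 15 20: the whole fifth column

t(m)         # transpose: a 5 x 4 matrix
m * 2        # element-wise multiplication

# Matrix product: (4 x 5) %*% (5 x 4) gives a 4 x 4 matrix
p <- m %*% t(m)
dim(p)       # 4 4
```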

Data Frame

age <- c(45, 22, 61, 14, 37)
gender <- c("Female", "Male", "Male", "Female", "Male")
height <- c(1.68, 1.85, 1.8, 1.66, 1.72)
married <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
## stringsAsFactors = TRUE reproduces the Factor column shown below on R >= 4.0.0,
## where character columns are no longer converted to factors by default
(df <- data.frame(age, gender, height, married, stringsAsFactors = TRUE))
##   age gender height married
## 1  45 Female   1.68    TRUE
## 2  22   Male   1.85   FALSE
## 3  61   Male   1.80    TRUE
## 4  14 Female   1.66   FALSE
## 5  37   Male   1.72   FALSE

str(df)
## 'data.frame': 5 obs. of 4 variables:
##  $ age    : num 45 22 61 14 37
##  $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2
##  $ height : num 1.68 1.85 1.8 1.66 1.72
##  $ married: logi TRUE FALSE TRUE FALSE FALSE
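Once a data frame exists, its columns and rows can be picked out by name, position or condition. The sketch below rebuilds the same example data so that it runs on its own; `adults` is an illustrative name that is not in the slides.

```r
df <- data.frame(
  age     = c(45, 22, 61, 14, 37),
  gender  = c("Female", "Male", "Male", "Female", "Male"),
  height  = c(1.68, 1.85, 1.80, 1.66, 1.72),
  married = c(TRUE, FALSE, TRUE, FALSE, FALSE)
)

df$age                     # one column, returned as a vector
df[2, ]                    # the second row
df[df$age > 30, ]          # rows where age exceeds 30
df[, c("age", "height")]   # two columns selected by name

# Keep only people aged 18 or over
adults <- df[df$age >= 18, ]
nrow(adults)
## [1] 4

mean(df$height)
## [1] 1.742
```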

List

x <- 1:10
y <- c("abc", "d", "ef", "g")
(ls <- list(x, y))
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
##
## [[2]]
## [1] "abc" "d"   "ef"  "g"

## retrieve an element in a list
ls[[2]]
## [1] "abc" "d"   "ef"  "g"
ls[[2]][1]
## [1] "abc"

Conditional control

- if ... else ...

score <- 4
if (score >= 3) {
  print("pass")
} else {
  print("fail")
}
## [1] "pass"

- ifelse()

score <- 1:5
ifelse(score >= 3, "pass", "fail")
## [1] "fail" "fail" "pass" "pass" "pass"

Loop control
- for, while, repeat
- break, next

for (i in 1:5) {
  print(i^2)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
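The slide demonstrates only `for`. As a complement, here is a small sketch of `while`, `repeat`, `break` and `next`; the variable names `total` and `odd_squares` are made up for this example.

```r
## while: keep looping as long as the condition holds
i <- 1
while (i <= 3) {
  print(i)
  i <- i + 1
}
## [1] 1
## [1] 2
## [1] 3

## repeat: loop until break is called
total <- 0
repeat {
  total <- total + 10
  if (total >= 30) break
}
total
## [1] 30

## next: skip the rest of the current iteration
odd_squares <- c()
for (i in 1:5) {
  if (i %% 2 == 0) next   # skip even numbers
  odd_squares <- c(odd_squares, i^2)
}
odd_squares
## [1]  1  9 25
```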

Apply Functions
- apply(): apply a function to margins of an array or matrix
- lapply(): apply a function to every item in a list or vector and return a list
- sapply(): similar to lapply(), but returns a vector or matrix
- vapply(): similar to sapply(), but with a pre-specified type of return value
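The four functions above can be compared side by side on small inputs. This is an illustrative sketch; the objects `m` and `lst` and the result names are assumptions made for the example.

```r
m <- matrix(1:6, nrow = 2)         # 2 rows, 3 columns, filled column by column

row_sums <- apply(m, 1, sum)       # margin 1 = rows
col_sums <- apply(m, 2, sum)       # margin 2 = columns
row_sums
## [1]  9 12
col_sums
## [1]  3  7 11

lst <- list(a = 1:3, b = 4:6)
means_list <- lapply(lst, mean)    # returns a list with elements a and b
means_vec  <- sapply(lst, mean)    # simplified to a named numeric vector
means_vec
## a b 
## 2 5

# vapply() additionally checks that each result is a single numeric value
means_chk <- vapply(lst, mean, FUN.VALUE = numeric(1))
```

`vapply()` stops with an error if any result does not match `FUN.VALUE`, which makes it the safer choice inside functions and scripts.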

Loop vs lapply

## for loop
x <- 1:10
y <- rep(NA, 10)
for (i in seq_along(x)) {
  y[i] <- log(x[i])
}
y
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.79...
## [7] 1.9459101 2.0794415 2.1972246 2.3025851

## apply a function (log) to every element of x
tmp <- lapply(x, log)
(y <- do.call("c", tmp))
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.79...
## [7] 1.9459101 2.0794415 2.1972246 2.3025851

Parallel Computing

## on Linux or Mac machines
library(parallel)
(n.cores <- detectCores() - 1)
tmp <- mclapply(x, log, mc.cores=n.cores)
y <- do.call("c", tmp)

## on Windows machines
library(parallel)
n.cores <- detectCores() - 1
## set up cluster
cluster <- makeCluster(n.cores)
## run jobs in parallel
tmp <- parLapply(cluster, x, log)
## stop cluster
stopCluster(cluster)
# collect results
y <- do.call("c", tmp)

Parallel Computing (cont.)

On Windows machines, libraries and global variables used by a function that is run in parallel have to be explicitly exported to all nodes.

## on Windows machines
library(parallel)
## set up cluster
cluster <- makeCluster(n.cores)
## load required libraries, if any, on all nodes
tmp <- clusterEvalQ(cluster, library(igraph))
## export required variables, if any, to all nodes
clusterExport(cluster, "myvar")
## run jobs in parallel
tmp <- parLapply(cluster, x, myfunc)
## stop cluster
stopCluster(cluster)
# collect results
y <- do.call("c", tmp)
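The export step can be tried on any platform, because `makeCluster()` and `parLapply()` also work on Linux and Mac. Below is a self-contained sketch; `offset` and `add_offset` are hypothetical names standing in for `myvar` and `myfunc`, and the worker count of 2 is an arbitrary assumption.

```r
library(parallel)

# Hypothetical global variable and function (stand-ins for myvar and myfunc).
offset <- 100
add_offset <- function(v) v + offset   # depends on the variable `offset`

# Two worker processes; adjust to your machine.
cluster <- makeCluster(2)

# Workers start with empty workspaces, so copy `offset` to each of them.
# envir = environment() exports from wherever this code is running.
clusterExport(cluster, "offset", envir = environment())

# Run the jobs in parallel, then shut the workers down.
tmp <- parLapply(cluster, 1:5, add_offset)
stopCluster(cluster)

# Collect the results into one vector.
y <- do.call("c", tmp)
y
## [1] 101 102 103 104 105
```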

Functions

Define your own function: calculate the arithmetic average of a numeric vector

average <- function(x) {
  y <- sum(x)
  n <- length(x)
  z <- y/n
  return(z)
}
## calculate the average of 1:10
average(1:10)
## [1] 5.5
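Functions can also take arguments with default values and validate their input. The variant below is a sketch only; the name `average2` and the `na.rm` argument are additions for illustration and are not part of the slides.

```r
# `average2` is an illustrative name, not from the slides.
average2 <- function(x, na.rm = FALSE) {
  if (!is.numeric(x)) {
    stop("x must be a numeric vector")
  }
  if (na.rm) {
    x <- x[!is.na(x)]    # drop missing values before averaging
  }
  sum(x) / length(x)     # the last evaluated expression is the return value
}

average2(1:10)
## [1] 5.5
average2(c(1, NA, 3))
## [1] NA
average2(c(1, NA, 3), na.rm = TRUE)
## [1] 2
```

Because `na.rm` has a default value, callers only need to supply it when they want missing values removed.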

Pipe Operations
- Load library magrittr for pipe operations
- Avoid nested function calls
- Make code easy to understand
- Supported by dplyr and ggplot2

library(magrittr) ## for pipe operations
## traditional way
b <- func3(func2(func1(a), p2))
## the above can be rewritten to
b <- a %>% func1() %>% func2(p2) %>% func3()
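To make the pattern concrete, the sketch below replaces the placeholder functions with base R functions (`sqrt()`, `mean()` and `round()`, chosen here as stand-ins) and confirms that the nested and piped forms give the same result. It assumes the magrittr package is installed.

```r
library(magrittr)   # provides %>%

x <- c(1, 4, 9, 16)

## traditional way: read from the inside out
nested <- round(mean(sqrt(x)), 1)

## piped way: read from left to right
piped <- x %>% sqrt() %>% mean() %>% round(1)

nested
## [1] 2.5
piped
## [1] 2.5
identical(nested, piped)
## [1] TRUE
```

The pipe passes the left-hand value as the first argument of the next call, so `round(1)` here means `round(<previous result>, 1)`.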

Data Import and Export [10]

Read data from and write data to
- R native formats (incl. Rdata and RDS)
- CSV files
- EXCEL files
- ODBC databases
- SAS databases

R Data Import/Export:
- http://cran.r-project.org/doc/manuals/R-data.pdf

[10] Chapter 2: Data Import and Export, in book R and Data Mining: Examples and Case Studies. http://www.rdatamining.com/docs/RDataMining.pdf

Save and Load R Objects
- save(): save R objects into a .Rdata file
- load(): read R objects from a .Rdata file
- rm(): remove objects from R

a <- 1:10
save(a, file = "./data/dumData.Rdata")
rm(a)
a
## Error in eval(expr, envir, enclos): object 'a' not found
load("./data/dumData.Rdata")
a
##  [1]  1  2  3  4  5  6  7  8  9 10

Save and Load R Objects - More Functions
- save.image(): save current workspace to a file. It saves everything!
- readRDS(): read a single R object from a .rds file
- saveRDS(): save a single R object to a file
- Advantage of readRDS() and saveRDS(): You can restore the data under a different object name.
- Advantage of load() and save(): You can save multiple R objects to one file.
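The two advantages listed above can be seen in a short sketch. It writes to temporary files so that it runs anywhere; the file and object names (`rds_file`, `rdata_file`, `b`, `restored`) are assumptions made for this example.

```r
## saveRDS() / readRDS(): one object, restored under any name
rds_file <- tempfile(fileext = ".rds")
a <- 1:10
saveRDS(a, rds_file)
rm(a)
b <- readRDS(rds_file)   # the data comes back under a new name
b
##  [1]  1  2  3  4  5  6  7  8  9 10

## save() / load(): several objects in one file, restored under their original names
rdata_file <- tempfile(fileext = ".Rdata")
x <- 1:3
y <- c("abc", "d")
save(x, y, file = rdata_file)
rm(x, y)
restored <- load(rdata_file)   # load() returns the names of the restored objects
restored
## [1] "x" "y"
```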

Import from and Export to .CSV Files
- write.csv(): write an R object to a .CSV file
- read.csv(): read an R object from a .CSV file

# create a data frame
var1 <- 1:5
var2 <- (1:5)/10
var3 <- c("R", "and", "Data Mining", "Examples", "Case Studies")
df1 <- data.frame(var1, var2, var3)
names(df1) <- c("VarInt", "VarReal", "VarChar")
# save to a csv file
write.csv(df1, "./data/dummmyData.csv", row.names = FALSE)
# read from a csv file
df2 <- read.csv("./data/dummmyData.csv")
print(df2)
##   VarInt VarReal      VarChar
## 1      1     0.1            R
## 2      2     0.2          and
## 3      3     0.3  Data Mining
## 4      4     0.4     Examples
## 5      5     0.5 Case Studies

Import from and Export to EXCEL Files

Package xlsx: read, write, format Excel 2007 and Excel 97/2000/XP/2003 files

library(xlsx)
xlsx.file <- "./data/dummmyData.xlsx"
write.xlsx(df2, xlsx.file, sheetName = "sheet1", row.names = FALSE)
df3 <- read.xlsx(xlsx.file, sheetName = "sheet1")
df3
##   VarInt VarReal      VarChar
## 1      1     0.1            R
## 2      2     0.2          and
## 3      3     0.3  Data Mining
## 4      4     0.4     Examples
## 5      5     0.5 Case Studies

Read from Databases
- Package RODBC: provides connection to ODBC databases.
- odbcConnect(): sets up a connection to a database
- sqlQuery(): sends an SQL query to the database
- odbcClose(): closes the connection
- sqlFetch(), sqlSave() and sqlUpdate(): read, write or update a table in an ODBC database

library(RODBC)
db <- odbcConnect(dsn = "servername", uid = "userid", pwd = "******")
sql <- "SELECT * FROM lib.table WHERE ..."
# or read query from file
sql <- readChar("myQuery.sql", nchars=99999)
myData <- sqlQuery(db, sql, errors=TRUE)
odbcClose(db)

Import Data from SAS

Package foreign provides function read.ssd() for importing SAS datasets (.sas7bdat files) into R.

library(foreign) # for importing SAS data
# the path of SAS on your computer
sashome <- "C:/Program Files/SAS/SASFoundation/9.2"
filepath <- "./data"
# filename should be no more than 8 characters, without extension
fileName <- "dumData"
# read data from a SAS dataset
a <- read.ssd(file.path(filepath), fileName,
              sascmd=file.path(sashome, "sas.exe"))

Another way: using function read.xport() to read a file in SAS Transport (XPORT) format

Online Resources
- Chapter 2: Data Import/Export, in book R and Data Mining: Examples and Case Studies
  http://www.rdatamining.com/docs/RDataMining-book.pdf
- RDataMining Reference Card
  http://www.rdatamining.com/docs/RDataMining-reference-card.pdf
- Free online courses and documents
  http://www.rdatamining.com/resources/
- RDataMining Group on LinkedIn (24,000+ members)
  http://group.rdatamining.com
- Twitter (3,000+ followers)
  @RDataMining

The End

Thanks!
Email: yanchang(at)RDataMining.com
Twitter: @RDataMining
