Basics of R Programming
Yanchang Zhao
http://www.RDataMining.com
Tutorial on Machine Learning with R
The Melbourne Data Science Week 2017
1 June 2017
Quiz
- Have you ever used R before?
- Are you familiar with data mining and machine learning algorithms?
- Have you used R for data mining and analytics in your work?
Outline
- Introduction to R
- RStudio
- Data Objects
- Control Flow
- Parallel Computing
- Functions
- Pipe Operations
- Data Import and Export
- Online Resources
What is R?
- R [1] is a free software environment for statistical computing and graphics.
- R can be easily extended with 10,000+ packages available on CRAN [2] (as of May 2017).
- Many other packages are provided on Bioconductor [3], R-Forge [4], GitHub [5], etc.
- R manuals on CRAN [6]
  - An Introduction to R
  - The R Language Definition
  - R Data Import/Export
  - ...

[1] http://www.r-project.org/
[2] http://cran.r-project.org/
[3] http://www.bioconductor.org/
[4] http://r-forge.r-project.org/
[5] https://github.com/
[6] http://cran.r-project.org/manuals.html
Why R?
- R is widely used in both academia and industry.
- R was ranked #1 in the KDnuggets 2016 poll on Top Analytics, Data Science software [7] (in fact, R was #1 every year from 2011 to 2016).
- The CRAN Task Views [8] provide collections of packages for different tasks.
  - Machine learning & statistical learning
  - Cluster analysis & finite mixture models
  - Time series analysis
  - Multivariate statistics
  - Analysis of spatial data
  - ...

[7] http://kdnuggets.com/2016/06/r-python-top-analytics-data-mining-data-science-software.html
[8] http://cran.r-project.org/web/views/
RStudio [9]
- An integrated development environment (IDE) for R
- Runs on various operating systems like Windows, Mac OS X and Linux
- RStudio project, with suggested folders
  - code: source code
  - data: raw data, cleaned data
  - figures: charts and graphs
  - docs: documents and reports
  - models: analytics models

[9] https://www.rstudio.com/products/rstudio/
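The suggested folder layout can be created from R itself. The sketch below is only one way to do it; the project root (a folder named my-project inside a temporary directory) is a hypothetical choice made for illustration.

```r
# Hypothetical project root inside a temporary directory (an assumption of this sketch)
project_root <- file.path(tempdir(), "my-project")

# Folder names as suggested on the slide
folders <- c("code", "data", "figures", "docs", "models")

# Create each folder; recursive = TRUE also creates the project root if needed
for (f in folders) {
  dir.create(file.path(project_root, f), recursive = TRUE, showWarnings = FALSE)
}

# List what was created
list.files(project_root)
```

In a real project the root would be the RStudio project directory rather than a temporary one.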
RStudio Keyboard Shortcuts
- Run current line or selection: Ctrl + Enter
- Comment / uncomment selection: Ctrl + Shift + C
- Clear console: Ctrl + L
- Reindent selection: Ctrl + I
Data Types and Structures
- Data types
  - Integer
  - Numeric
  - Character
  - Factor
  - Logical
- Data structures
  - Vector
  - Matrix
  - Data frame
  - List
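As a quick illustration of the types and structures listed above, the following sketch asks R to report the class of one value of each type and then checks one object of each structure. All variable names and values are made up for this example.

```r
# One example value per data type; class() reports the type of each
types <- sapply(list(5L, 5.2, "abc", factor("Male"), TRUE), class)
types
## [1] "integer"   "numeric"   "character" "factor"    "logical"

# One example object per data structure (names are illustrative only)
v <- c(1.5, 2.5)                                   # vector
m <- matrix(1:6, nrow = 2)                         # matrix with 2 rows, 3 columns
people <- data.frame(id = 1:2, name = c("a", "b")) # data frame
l <- list(v, "text", TRUE)                         # list with mixed element types

# Each check returns TRUE
is.vector(v)
is.matrix(m)
is.data.frame(people)
is.list(l)
```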
Vector
    ## integer vector
    x <- 1:10
    print(x)
    ##  [1]  1  2  3  4  5  6  7  8  9 10

    ## numeric vector, generated randomly from a uniform distribution
    y <- runif(5)
    y
    ## [1] 0.5923091 0.6782441 0.5266127 0.1358263 0.7433572

    ## character vector
    (z <- c("abc", "d", "ef", "g"))
    ## [1] "abc" "d"   "ef"  "g"
Matrix
    ## create a matrix with 4 rows, from a vector of 1:20
    m <- matrix(1:20, nrow = 4, byrow = TRUE)
    m
    ##      [,1] [,2] [,3] [,4] [,5]
    ## [1,]    1    2    3    4    5
    ## [2,]    6    7    8    9   10
    ## [3,]   11   12   13   14   15
    ## [4,]   16   17   18   19   20

    ## matrix subtraction
    m - diag(nrow = 4, ncol = 5)
    ##      [,1] [,2] [,3] [,4] [,5]
    ## [1,]    0    2    3    4    5
    ## [2,]    6    6    8    9   10
    ## [3,]   11   12   12   14   15
    ## [4,]   16   17   18   18   20
Data Frame
    age <- c(45, 22, 61, 14, 37)
    gender <- c("Female", "Male", "Male", "Female", "Male")
    height <- c(1.68, 1.85, 1.8, 1.66, 1.72)
    married <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
    ## stringsAsFactors = TRUE stores gender as a factor
    ## (this was the default before R 4.0.0)
    (df <- data.frame(age, gender, height, married, stringsAsFactors = TRUE))
    ##   age gender height married
    ## 1  45 Female   1.68    TRUE
    ## 2  22   Male   1.85   FALSE
    ## 3  61   Male   1.80    TRUE
    ## 4  14 Female   1.66   FALSE
    ## 5  37   Male   1.72   FALSE

    str(df)
    ## 'data.frame': 5 obs. of 4 variables:
    ##  $ age    : num 45 22 61 14 37
    ##  $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2
    ##  $ height : num 1.68 1.85 1.8 1.66 1.72
    ##  $ married: logi TRUE FALSE TRUE FALSE FALSE
List
    x <- 1:10
    y <- c("abc", "d", "ef", "g")
    (ls <- list(x, y))
    ## [[1]]
    ##  [1]  1  2  3  4  5  6  7  8  9 10
    ##
    ## [[2]]
    ## [1] "abc" "d"   "ef"  "g"

    ## retrieve an element in a list
    ls[[2]]
    ## [1] "abc" "d"   "ef"  "g"
    ls[[2]][1]
    ## [1] "abc"
Conditional control
- if ... else ...
    score <- 4
    if (score >= 3) {
      print("pass")
    } else {
      print("fail")
    }
    ## [1] "pass"
- ifelse()
    score <- 1:5
    ifelse(score >= 3, "pass", "fail")
    ## [1] "fail" "fail" "pass" "pass" "pass"
Loop control
- for, while, repeat
- break, next
    for (i in 1:5) {
      print(i^2)
    }
    ## [1] 1
    ## [1] 4
    ## [1] 9
    ## [1] 16
    ## [1] 25
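The slide demonstrates only for. As a complement, here is a small sketch of while, repeat, break and next; the variable names and the threshold of 100 are arbitrary choices for illustration.

```r
# while: keep looping as long as the condition holds
i <- 1
squares <- c()
while (i <= 3) {
  squares <- c(squares, i^2)
  i <- i + 1
}
squares
## [1] 1 4 9

# repeat + break: loop until explicitly stopped
n <- 0
repeat {
  n <- n + 1
  if (2^n > 100) break   # stop at the first power of 2 above 100
}
n
## [1] 7

# next: skip the rest of the current iteration (here, skip even numbers)
odds <- c()
for (k in 1:6) {
  if (k %% 2 == 0) next
  odds <- c(odds, k)
}
odds
## [1] 1 3 5
```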
Apply Functions
- apply(): apply a function to margins of an array or matrix
- lapply(): apply a function to every item in a list or vector and return a list
- sapply(): similar to lapply(), but returns a vector or matrix
- vapply(): similar to sapply(), but with a pre-specified type of return value
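The four functions above can be compared on small inputs. The sketch below uses a made-up matrix and a made-up list of words; it is meant only to show the shape of each result.

```r
# A 2 x 3 matrix: columns are (1,2), (3,4), (5,6)
m <- matrix(1:6, nrow = 2)

# apply(): margin 1 = rows, margin 2 = columns
row_sums <- apply(m, 1, sum)
row_sums
## [1]  9 12
col_sums <- apply(m, 2, sum)
col_sums
## [1]  3  7 11

words <- list("abc", "d", "ef")

# lapply(): always returns a list
len_list <- lapply(words, nchar)

# sapply(): simplifies the result to a vector where possible
len_vec <- sapply(words, nchar)
len_vec
## [1] 3 1 2

# vapply(): like sapply(), but the return type is declared up front
# and an error is raised if a result does not match it
len_safe <- vapply(words, nchar, FUN.VALUE = integer(1))
len_safe
## [1] 3 1 2
```

vapply() is often preferred inside functions and packages because the declared return type guards against unexpected input.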
Loop vs lapply
    ## for loop
    x <- 1:10
    y <- rep(NA, 10)
    for (i in seq_along(x)) {
      y[i] <- log(x[i])
    }
    y
    ## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.79...
    ## [7] 1.9459101 2.0794415 2.1972246 2.3025851

    ## apply a function (log) to every element of x
    tmp <- lapply(x, log)
    (y <- do.call("c", tmp))
    ## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.79...
    ## [7] 1.9459101 2.0794415 2.1972246 2.3025851
Parallel Computing
    ## on Linux or Mac machines
    library(parallel)
    (n.cores <- detectCores() - 1)
    tmp <- mclapply(x, log, mc.cores = n.cores)
    y <- do.call("c", tmp)

    ## on Windows machines
    library(parallel)
    ## set up cluster
    cluster <- makeCluster(n.cores)
    ## run jobs in parallel
    tmp <- parLapply(cluster, x, log)
    ## stop cluster
    stopCluster(cluster)
    # collect results
    y <- do.call("c", tmp)
Parallel Computing (cont.)
On Windows machines, libraries and global variables used by a function that runs in parallel have to be explicitly exported to all nodes.
    ## on Windows machines
    library(parallel)
    ## set up cluster
    cluster <- makeCluster(n.cores)
    ## load required libraries, if any, on all nodes
    tmp <- clusterEvalQ(cluster, library(igraph))
    ## export required variables, if any, to all nodes
    clusterExport(cluster, "myvar")
    ## run jobs in parallel
    tmp <- parLapply(cluster, x, myfunc)
    ## stop cluster
    stopCluster(cluster)
    # collect results
    y <- do.call("c", tmp)
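The slide leaves myvar and myfunc undefined. The following self-contained sketch fills them in with hypothetical definitions (a multiplier and a function that uses it) to show why the export step matters. The worker count of 2 is an arbitrary assumption, and no extra library is loaded on the nodes.

```r
library(parallel)

# Hypothetical global variable and worker function (not from the slides)
myvar <- 10
myfunc <- function(v) v * myvar   # depends on the global variable myvar

x <- 1:5

# set up a small cluster; 2 workers is an arbitrary choice for this sketch
cluster <- makeCluster(2)

# copy myvar to every worker; without this step the workers
# would not find myvar when myfunc is defined in the global environment
clusterExport(cluster, "myvar", envir = environment())

# run jobs in parallel, then release the workers
tmp <- parLapply(cluster, x, myfunc)
stopCluster(cluster)

# collect results
y <- do.call("c", tmp)
y
## [1] 10 20 30 40 50
```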
Functions
Define your own function: calculate the arithmetic average of a numeric vector
    average <- function(x) {
      y <- sum(x)
      n <- length(x)
      z <- y/n
      return(z)
    }
    ## calculate the average of 1:10
    average(1:10)
    ## [1] 5.5
Pipe Operations
- Load library magrittr for pipe operations
- Avoid nested function calls
- Make code easy to understand
- Supported by dplyr and ggplot2
    library(magrittr) ## for pipe operations
    ## traditional way
    b <- func3(func2(func1(a), p2))
    ## the above can be rewritten to
    b <- a %>% func1() %>% func2(p2) %>% func3()
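Since func1, func2 and func3 on the slide are placeholders, a concrete version may help. The sketch below substitutes sqrt(), mean() and round() for them (an arbitrary choice) and assumes the magrittr package is installed.

```r
library(magrittr)  # provides the %>% operator

a <- c(1, 4, 9, 16)

# traditional way: read from the inside out
nested <- round(mean(sqrt(a)), 1)

# piped way: read from left to right;
# each result becomes the first argument of the next call
piped <- a %>% sqrt() %>% mean() %>% round(1)

piped
## [1] 2.5
identical(nested, piped)
## [1] TRUE
```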
Data Import and Export [10]
Read data from and write data to
- R native formats (incl. Rdata and RDS)
- CSV files
- EXCEL files
- ODBC databases
- SAS databases
R Data Import/Export:
- http://cran.r-project.org/doc/manuals/R-data.pdf

[10] Chapter 2: Data Import and Export, in book R and Data Mining: Examples and Case Studies. http://www.rdatamining.com/docs/RDataMining.pdf
Save and Load R Objects
- save(): save R objects into a .Rdata file
- load(): read R objects from a .Rdata file
- rm(): remove objects from R
    a <- 1:10
    save(a, file = "./data/dumData.Rdata")
    rm(a)
    a
    ## Error in eval(expr, envir, enclos): object 'a' not found
    load("./data/dumData.Rdata")
    a
    ##  [1]  1  2  3  4  5  6  7  8  9 10
Save and Load R Objects - More Functions
- save.image(): save current workspace to a file. It saves everything!
- readRDS(): read a single R object from a .rds file
- saveRDS(): save a single R object to a file
- Advantage of readRDS() and saveRDS(): You can restore the data under a different object name.
- Advantage of load() and save(): You can save multiple R objects to one file.
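The two advantages listed above can be seen side by side in a short sketch. The object names (scores, tags) and the use of temporary files are assumptions made for this example.

```r
# Illustrative objects (names and values are made up)
scores <- c(90, 85, 77)
tags <- c("a", "b", "c")

# saveRDS()/readRDS(): one object per file, restored under any name
rds_file <- tempfile(fileext = ".rds")
saveRDS(scores, rds_file)
restored <- readRDS(rds_file)   # a different name than the original
restored
## [1] 90 85 77

# save()/load(): several objects in one file, restored under their original names
rdata_file <- tempfile(fileext = ".Rdata")
save(scores, tags, file = rdata_file)
rm(scores, tags)
loaded_names <- load(rdata_file)  # load() returns the names of the restored objects
loaded_names
## [1] "scores" "tags"
```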
Import from and Export to .CSV Files
- write.csv(): write an R object to a .CSV file
- read.csv(): read an R object from a .CSV file
    # create a data frame
    var1 <- 1:5
    var2 <- (1:5)/10
    var3 <- c("R", "and", "Data Mining", "Examples", "Case Studies")
    df1 <- data.frame(var1, var2, var3)
    names(df1) <- c("VarInt", "VarReal", "VarChar")
    # save to a csv file
    write.csv(df1, "./data/dummmyData.csv", row.names = FALSE)
    # read from a csv file
    df2 <- read.csv("./data/dummmyData.csv")
    print(df2)
    ##   VarInt VarReal      VarChar
    ## 1      1     0.1            R
    ## 2      2     0.2          and
    ## 3      3     0.3  Data Mining
    ## 4      4     0.4     Examples
    ## 5      5     0.5 Case Studies
Import from and Export to EXCEL Files
Package xlsx: read, write, format Excel 2007 and Excel 97/2000/XP/2003 files
    library(xlsx)
    xlsx.file <- "./data/dummmyData.xlsx"
    write.xlsx(df2, xlsx.file, sheetName = "sheet1", row.names = FALSE)
    df3 <- read.xlsx(xlsx.file, sheetName = "sheet1")
    df3
    ##   VarInt VarReal      VarChar
    ## 1      1     0.1            R
    ## 2      2     0.2          and
    ## 3      3     0.3  Data Mining
    ## 4      4     0.4     Examples
    ## 5      5     0.5 Case Studies
Read from Databases
- Package RODBC: provides connection to ODBC databases.
- Function odbcConnect(): sets up a connection to database
- sqlQuery(): sends an SQL query to the database
- odbcClose() closes the connection.
    library(RODBC)
    db <- odbcConnect(dsn = "servername", uid = "userid", pwd = "******")
    sql <- "SELECT * FROM lib.table WHERE ..."
    # or read query from file
    sql <- readChar("myQuery.sql", nchars = 99999)
    myData <- sqlQuery(db, sql, errors = TRUE)
    odbcClose(db)
- Functions sqlFetch(), sqlSave() and sqlUpdate(): read, write or update a table in an ODBC database
Import Data from SAS
Package foreign provides function read.ssd() for importing SAS datasets (.sas7bdat files) into R.
    library(foreign) # for importing SAS data
    # the path of SAS on your computer
    sashome <- "C:/Program Files/SAS/SASFoundation/9.2"
    filepath <- "./data"
    # filename should be no more than 8 characters, without extension
    fileName <- "dumData"
    # read data from a SAS dataset
    a <- read.ssd(file.path(filepath), fileName,
                  sascmd = file.path(sashome, "sas.exe"))
Another way: using function read.xport() to read a file in SAS Transport (XPORT) format
Online Resources
- Chapter 2: Data Import/Export, in book R and Data Mining: Examples and Case Studies
  http://www.rdatamining.com/docs/RDataMining-book.pdf
- RDataMining Reference Card
  http://www.rdatamining.com/docs/RDataMining-reference-card.pdf
- Free online courses and documents
  http://www.rdatamining.com/resources/
- RDataMining Group on LinkedIn (24,000+ members)
  http://group.rdatamining.com
- Twitter (3,000+ followers)
  @RDataMining
The End
Thanks!
Email: yanchang(at)RDataMining.com
Twitter: @RDataMining