Data Processing with PC-‐SAS PubH 6325
J. Michael Oakes, PhD
Associate Professor Division of Epidemiology University of Minnesota
[email protected]
Lecture 3/4
Remember! Unless you want to DON’T WRITE OVER your “master” dataset!
data temp2; set temp1; rename x = y; run;
Lecture 3
DATA SET MANIPULATION FUNCTIONS FORMATS DATES
PC SAS
Data
SAS Program
SAS Reads Data as per instruc1ons
Output SAS Writes Data and/or Text as per instruc1ons
Manipula\ng Data Sets
Append
A
append
B
A B
Both A and B should have (approx.) same variables.
Append SAS Append (some\mes called concatenate or adding observa\ons)
Use the SET statement
data temp_both; set temp1 temp2; run;
SAS Append Complica\ons: Datasets do not have same variables!
Dataset 1 Dataset 2
X Y Z X Y P
Missing Values are inserted
Append SAS Append (some1mes called concatenate or adding observa1ons)
Use the SET statement
data temp_both; set temp1 temp2; run;
Manipula\ng Data Sets Merge
A
merge
B
A B By some linking Variable (e.g., ID)
SAS Merge SAS Merge
(some\mes called adding variables)
data temp_both; merge demog employee; run;
Unless you’re perfectly, totally, absolutely, undoubteldy, quite, uferly, and wholly 100% CERTAIN of what you’re doing, NEVER EVER SHEVER KEVER BLEVER merge without a merge “key” -‐-‐ some\mes called a one-‐to-‐one merge.
SAS Merge Use a merge key to link/match/connect observa\ons (e.g., subjects) from one data set to another.
Merge keys (eg, ID) should be unique and in string format; they may be several variables long (eg, state, county, tract).
SAS Merge Use by statement with merge key to link/match/ connect observa\ons (e.g., subjects) from one data set to another.
data temp_both; merge demog employee; by ID; run;
SAS Merge You must sort each dataset by merge key before afemping to merge. Think about why this is wrt the pdv! proc sort data=demog; by ID; proc sort data=employee; by ID; Data combined; merge demog employee by ID; run;
SAS Merge As with Append, missing values fill-‐in the PDV and hence the output file if merge-‐keys in each dataset don’t perfectly conform. Advanced topics: • Mul\ple variable merge keys • Iden\fying “matches” with (IN= ) • Renaming merge keys • Fuzzy merges with soundex func\on • Proc Sort op\ons
Manipula\ng Data Sets Collapse (for hierarchical data)
A Collapsed and sta\s\cs (e.g., sums, means) are generated.
A’
Collapse ID CLINIC AGE SPB
Sta1s1cs
SEX
1
1
46
116
M
2
1
53
150
F
3
1
38
112
F
4
1
63
.
M
5
2
18
112
M
6
2
24
121
M
7
2
19
145
M
CLINIC
n
age
sbp
male
1
4
50
126
50
2
3
20.3
126
100
Collapsed to clinic level
Collapse SAS Collapse? Use Proc Means! PROC MEANS DATA=clnc_dat; CLASS clinic; VAR age; OUTPUT OUT=fam2 MEAN(age)=age; RUN;
Proc Means Procedure Name/Iden\fy statement Original dataset
PROC MEANS DATA=clnc_dat; CLASS clinic;
Variable(s) to collapse on
VAR age;
Variable(s) to get generate summary stat(s) for
OUTPUT OUT=fam2 MEAN(age)= age;
Name of output sta\s\c in output dataset Output sta\s\c And its target variable
Name/loca\on of Output Dataset Request for output (collapsed) dataset
Collapse PROC MEANS DATA=clnc_dat; CLASS clinic; VAR age; OUTPUT OUT=fam2 MEAN(age)=mean_age MEDIAN(age)=med_age; RUN;
Keywords for Common Output Sta\s\cs N (Number of non-‐missing values) NMISS (Number of missing values) MEAN (Mean/average) MIN (Minimum value) MAX (Maximum value) SUM (Sum of values) STDDEV|STD (Standard devia\on of values) VAR (Variance of values) STDERR (Standard error of Mean) CLM (Two-‐sided confidence limit for the mean) LCLM (Lower confidence limit for the mean) UCLM (Upper confidence limit for the mean) MEDIAN|P50 (Median) Q1|P25 (First quar\le) Q3|P75 (Third quar\le) QRANGE (Interquar\le range)
SAS Func\on Junc\on Useful func\ons: y = f (x)
Numeric String
SAS Func\on Junc\on A Few of my favorite func\ons: Log ( ) Upper( ) Exp( ) Lower( ) Round( ) Substr( ) Ceil( ) Length( ) Mean( ) Year( ) ranuni( )
Put( )
Random subseqng/sampling Many possibli\es, but you can: • Generate random number for each obs; sort on the random number; keep if _n < N, where N is desired sample size. • Proc surveyselect
SAS Formats A format is an instruc\on that SAS uses to write data values. You use formats to control the wrifen appearance of data values, or, in some cases, to group data values together for analysis. For example, the WORDS22. format, which converts numeric values to their equivalent in words, writes the numeric value 692 as six hundred ninety-‐two.
SAS Formats Syntax: <$>format
.
where “$” indicates a character format; its absence indicates a numeric format; “w” specifies the format width, which for most formats is the number of columns in the output data; and “d” specifies an op\onal decimal scaling factor (ie, many decimal places) in the numeric formats. Formats always contain a period (.) as a part of the name and never change or truncate the internal stored data values.
SAS Formats Examples Syntax: <$>format.
X = 1257000 Format x best12.
1257000
Format x comma12.0
1,257,000
format x best6.
1.26E6
format x best3.
1E6
format x 10.2
1257000.00
format x 10.3
1257000.000
SAS Formats Examples Syntax: <$>format.
y = “Grateful Dead” Format y $Char13.
Grateful Dead
Format y $13.
Grateful Dead
Format y $Char3.
Gra
Format y $3.
Gra
SAS Formats Common Formats
BESTw. COMMAw.d w.d $CHARw. $w. PERCENTw.d PVALUEw.d
(default) (numeric comma separated) (basic numeric) (string of length w) (same as $CHARw.) (percent, %) (p-‐value looking)
SAS Formats Make your own formats and use them to categorize con\nuous variables.
SAS Dates Dates are stored as number of days since January 1, 1960. Dates since then are posi\ve (+) Dates prior to then are nega\ve (-‐)
SAS Dates 7/4/1776
-‐67019
1/1/1959
-‐365
1/1/1960
0
1/1/1961
366
1/17/2002
15357
SAS Dates The “trick” is to covert a given raw string date into a SAS data and then format the SAS date for your easy reading/ understanding. • Do calcula\ons and such with SAS dates. • Use formafed SAS dates for condi\onal statements.
Example Program See ‘day 3 programs.sas’
Lab 3 (First Hour) Directed Learning
• Appending, Merging, Collapsing • Func\ons • Formats • Dates • Procs (Second Hour) Lab Assignment
• Write and execute program for appending, merging, collapsing, and for using func\ons and formats; build freq tables.