ST 307 : Statistical Programming
"Dataskin" / "Stat"
"Dataskin" changes the look of the bars in a graph. "Stat" sets the statistic for the y-axis. (examples are Freq, Mean, Median, Percent, Sum). Ex: Vbar Type / Dataskin = matte Stat = percent; Run;
"Limits" / "Limitstat"
"Limits" are where you want error bars to be drawn. (relates to alpha and confidence interval for data). "Limitstat" changes the limit type (Stderr, Stdev, CLM...)
"Markerattrs" (Markerfillattrs, Markeroutlineattrs...)
"Marker Attributes." -Affects the color, size, symbol, outline etc. of the marker, or points on the graph. Ex: PROC SGplot Data = something; Scatter X = xvariable Y = yvariable; Markerattrs=(Color = blue Symbol=diamondfilled); Run;
IF/Then/Else Statement in SAS Example
... IF (Energy > 5) AND (Start = 1) THEN Quality= "Good"; ELSE IF (Start = 1) THEN Quality = "Ok"; ELSE Quality = "Other"; Run; -the if/then/else statement helps create NEW variables
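A complete version of the IF/THEN/ELSE fragment above might look like the sketch below. The data set name quality_demo and the three data rows are made up for illustration. The Length statement is an addition: without it, SAS sets the length of Quality from the first value assigned ("Good", 4 characters), so "Other" would be cut off.

```sas
data quality_demo;                 /* hypothetical data set name */
    length Quality $ 5;            /* long enough to hold "Other" */
    input Energy Start;
    if (Energy > 5) and (Start = 1) then Quality = "Good";
    else if (Start = 1) then Quality = "Ok";
    else Quality = "Other";
    datalines;
7 1
3 1
8 0
;
run;

proc print data=quality_demo;
run;
```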
"SET"
A statement in a data step that reads an existing SAS data set so you can create a new data set from it. Ex: Data mydata.cars; SET sashelp.cars; Run;
"Datalines"
A datalines statement allows you to input the actual data yourself as you would want it. Ex: Datalines; Fedora Blue 10 TopHat Red 15 ; Run;
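The Datalines example above needs a Data statement and an Input statement in front of it to run. A minimal sketch, where the data set name hats and the variable names Style, Color, and Price are assumptions:

```sas
data hats;                         /* hypothetical data set name */
    input Style $ Color $ Price;   /* $ marks character variables */
    datalines;
Fedora Blue 10
TopHat Red 15
;
run;
```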
Density Plot
A density plot is basically a smooth histogram. -default normal distribution -if you want to overlay graphs, you put them in the statement in order you want them to be Ex. Proc SGplot data = sashelp.cars; Histogram msrp / Dataskin = sheen; Density msrp; Density msrp / type = kernel; Run;
Correlation
A unitless measure of the strength and direction of the linear relationship between 2 variables. -between -1 (exact negative linear relationship) and 1 (exact positive linear relationship) -0 means no linear relationship
"Datalabel"
Adds labels showing the plotted value to the bars or points. -for a bar chart the default statistic is FREQ
Missing Y Trick
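Allows us to get the confidence / prediction interval for a value of a variable NOT in the data set. -add a row with that x value and a missing (.) response, then append it to the data; the row is ignored when fitting the model, but SAS still reports its predicted value and interval -2 steps Ex. Data temp; Input syrup rep $ l a b; Datalines; 49 1 . . . ; Run; Proc Datasets; Append Base=mydata.cheese data=temp; Run; Quit;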
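The whole trick, including the model fit that follows the append, could be sketched as below. It assumes the mydata library is already assigned and that mydata.cheese has the variables syrup, rep, l, a, and b with matching types. The Proc GLM step is an illustrative choice of model.

```sas
/* Step 1: one new row with the x value of interest and missing responses */
data temp;
    input syrup rep $ l a b;
    datalines;
49 1 . . .
;
run;

/* Step 2: add that row to the end of the existing data set */
proc datasets nolist;
    append base=mydata.cheese data=work.temp;
run;
quit;

/* The new row is not used in the fit, but its interval is printed */
proc glm data=mydata.cheese;
    model l = syrup / cli;         /* use clm instead for the mean response */
run;
quit;
```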
"Proc Datasets"
Allows us to view the Descriptor portion of a data set. -copy, rename, delete sas files -list all files in a library -edit some variable attributes (name, labels...) -"library =" takes only the library name; the data set goes in the Contents statement. Ex: Proc Datasets library = sashelp; Contents Data = heart <options>; QUIT;
Hypothesis Tests
Answers the question of whether or not a particular value is reasonable for "μ" (the population mean) or if the data contradict that theory.
"DSD"
Automatically changes the default delimiter to a comma. -treats 2 commas in a row as a missing value -strips quotation marks, so a comma inside a quoted value is not read as a delimiter
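A small sketch of DSD in action, reading the data in-stream so no external file is needed. The data set name dsd_demo, the variable names, and the three rows are made up for illustration.

```sas
data dsd_demo;                     /* hypothetical data set name */
    infile datalines dsd;          /* comma-delimited, DSD rules apply */
    input id name $ score;
    datalines;
1,Ann,90
2,,85
3,"Lee, Jr.",70
;
run;

proc print data=dsd_demo;
run;
```

Row 2 has two commas in a row, so name is read as missing. Row 3 has a comma inside quotes, so it stays part of the name.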
"CLPARM" / "CLPARM" / "CLI"
CLPARM = gives confidence intervals for Beta's CLM = give confidence intervals for MEAN RESPONSE at each set of predictor values in data set CLI = gives prediction intervals for a new response at each set of predictor values in the data set -future observation (lots of variation) -CLM and CLI CAN'T BE in the same statement
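These options go after the slash in the Model statement. A minimal sketch, assuming the sashelp.class sample data is available; the choice of weight as the response and height as the predictor is only for illustration.

```sas
proc glm data=sashelp.class;
    /* clparm: intervals for the betas; clm: intervals for the mean response */
    model weight = height / clparm clm;
run;
quit;

proc glm data=sashelp.class;
    /* cli: prediction intervals for a new observation */
    model weight = height / cli;
run;
quit;
```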
"Proc Freq"
Calculates summary statistics for categorical variables. PROC FREQ: -A single variable gives a 1-way contingency (frequency) table -Two variables crossed with * give a 2-way contingency table
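A short sketch of both table types, assuming the sashelp.cars sample data is available. The variables type and origin are an illustrative choice.

```sas
proc freq data=sashelp.cars;
    tables type;                   /* 1-way frequency table */
    tables type*origin;            /* 2-way contingency table */
run;
```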
Delimiter
Character that separates data values. -"3,Hello,ST" (comma) -"3 Hello ST" (space, the default) -a tab is written as "09"x
"CLPARM"
Adds confidence intervals for the parameter estimates to your output tables.
"Trim"
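Cuts out the most extreme data values in the tails before a statistic (such as a trimmed mean) is calculated. -the value determines how much to cut off Ex. Trim = 0.05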
Confidence Interval Options in SAS
Data = input data set; Alpha = (1 - Confidence Level); H0 = "H naught," the value of the mean under the null hypothesis; Sides = 1- or 2-sided test (U=upper 1-sided, L=lower 1-sided, 2=2-sided); CI = confidence interval for st. deviation. -The defaults are alpha = 0.05 (95% confidence), 2-sided, and a null hypothesis value of 0
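The options above are the ones on the Proc Ttest statement. A minimal sketch with every option written out, assuming the sashelp.cars sample data is available. The null value of 30000 is a made-up number for illustration.

```sas
proc ttest data=sashelp.cars
           alpha=0.05              /* 95% confidence */
           h0=30000                /* hypothetical null value for the mean */
           sides=2                 /* 2-sided test */
           ci=equal;               /* equal-tailed interval for st. deviation */
    var msrp;
run;
```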
Keeping / Dropping Data from a Set Example
Data mydata.chis; SET mydata.chis; Where BMI > 20; Run; -this keeps observations ONLY where BMI > 20 -Can also add a "DROP" statement to remove variables. Data mydata.chis; SET mydata.chis; Where Asian EQ "0"; Run; -this only keeps observations (rows) in the data set where the Asian variable is EQUAL to "0"
Example of a FULL Infile Step
Data mydata.student; Infile "C:\Users\student.txt" Firstobs=2 DLM=","; Length Name $ 12; Input s_perc : Percent6. gpa stat : Comma10.; Format s_perc Percent8.2 Graddate ddmmyy10.; Label s_perc="Percent Stat Completed"; Run;
How to Store Data in SAS
Data sets can be temporary or permanent. -stored in a "library," which is a collection of SAS files that are stored in the same directory
Population / Sample
Population = the entire group of units you are studying. Sample = a subset of the population.
Subset Variables
Ex. DROP/KEEP statement, drop/keep options
Subset Observations
Ex. WHERE statement, IF statement
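Both kinds of subsetting can be sketched in a couple of data steps, assuming the sashelp.cars sample data is available. The output data set names and the cutoffs are made up for illustration.

```sas
/* KEEP statement (variables) + WHERE statement (observations) */
data suv_only;                     /* hypothetical data set name */
    set sashelp.cars;
    where type = "SUV";
    keep make model type msrp;
run;

/* drop= data set option (variables) + subsetting IF (observations) */
data efficient (drop=invoice);     /* hypothetical data set name */
    set sashelp.cars;
    if mpg_city > 25;
run;
```

WHERE filters rows as they are read from the input data set; a subsetting IF filters them after they are read into the data step.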
Multiple Linear Regression
Fits a best "plane" through 3D data. -model with 2 predictors -fits more flexible surface through data
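A model with 2 predictors could be fit as in the sketch below, assuming the sashelp.cars sample data is available. The response and predictors are an illustrative choice.

```sas
proc glm data=sashelp.cars;
    /* one response, two predictors: fits a plane */
    model msrp = horsepower weight;
run;
quit;
```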
"Proc Univariate"
Produces descriptive statistics, histograms, and tables for numeric variables. -can also create output data sets
P-Value
Helps you determine the significance of your results. It is between 0 and 1. -it is the probability of seeing data at least as extreme as yours if the null hypothesis were true -If the p-value is less than alpha, you reject the null (original) hypothesis. If it is greater, you "fail to reject" the null hypothesis (this does not prove the null is true).
Colon
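Goes between the variable name and its informat in an Input statement. Indicates that the value is to be read starting from the next nonblank column and stopping at the next delimiter or the end of the dataline.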
Simple Linear Regression
Asks whether there is a linear relationship between x and y (the same question as correlation). -Between the RESPONSE (L) and the COVARIATE (syrup) Ex. Proc GLM data=mydata.cheese PLOTS = All; Model L = Syrup; Run; QUIT;
Contingency Tables
Main way to numerically summarize categorical data. -the matching graphs are the bar plot and the comparative bar plot (multiple variables), both made with Proc SGplot
One-level vs. Two-level names
One-Level Names assume the data set is in the work library. Ex: Data = housedata Two-Level Names specify the library and the data set names. Ex: Data = sashelp.cars -in sashelp library and we're using the "cars" data set within it
Operators
Operators specify arithmetic operations or when to keep or drop data. Ex: NE (not equal to), GT (greater than), LT (less than), GE (greater than or equal to), IN (in a list)...
Confidence Intervals
Provide a range of values that we are "confident" contains the true mean.
Reading External Data
Read in external files with data steps or proc steps. -different methods for TXT/DAT/CSV (comma separated values), XLS... -use an INFILE statement for EXTERNAL DATA ONLY
Inference
Relating a sample to a population.
Fisher
Requests confidence intervals for correlation and p-values under a specified null hypothesis. -includes Pearson / Spearman correlations
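Fisher is an option on the Proc Corr statement. A minimal sketch, assuming the sashelp.cars sample data is available. The two variables and the null value of 0 are an illustrative choice.

```sas
proc corr data=sashelp.cars pearson spearman
          fisher(rho0=0 alpha=0.05);   /* null correlation and 1 - confidence */
    var horsepower mpg_city;
run;
```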
"RespAsc" / "RespDesc"
RespAsc places the bars in ascending order of the response (bar height). RespDesc does descending order. -used with the "Categoryorder" option
Content & Descriptor Portion of Data Sets
SAS data sets are composed of 2 pieces. Content Portion (also called the data portion) = a collection of variables on each record. -variables are stored as columns, records (observations) are stored as the rows. Descriptor Portion = information about the data set. -number of observations -type of each variable -name / length of variables -format, informat, label...
"Nendpoints"
Specifies how many bin endpoints you want on a histogram, which determines the number of bins.
Permanent Data Set
Stored in a SAS library created and named by you. -usable in current / future sas sessions -saved with the .sas7bdat extension -the library is assigned with a "Libname" statement (no Run needed) Ex: Libname Mydata "C:\Users\Desktop";
Temporary Data Set
Stored in the SAS folder "Work Library." -usable only if current sas session is open -it is lost when the sas session closes
Formats
Tells SAS how to DISPLAY a variable in the new chart or graph. -applied with a "Format" statement -a "Label" statement instead gives the variable a descriptive name
Class Statement
Tells SAS which variable in the data set specifies the 2 different populations. -the variable can only take on 2 values
F-Value
Tells if a GROUP of variables are jointly significant. -T-test is for ONE variable only.
"Firstobs"
Tells SAS which row the "First Observation" or record is on. -commonly firstobs=2 (to skip the header row)
Proc Ttest / Proc GLM
Proc Ttest tests a mean from a normal population. -Proc Ttest statements = Class, Paired, By, Var, Freq, Weight... -Proc GLM lets you specify any degree of interaction and nested effects
Categorical Data
The values represent a category (M/F, Yes/No...) -attributes or labels -CHAR = character data (no numbers) -mathematical operations are not meaningful here
Informats
These tell SAS how to READ IN a variable. -character informats start with a $ sign ($13.) Ex: 123,456 = "comma7." (to say there is a comma and 7 characters, including the comma as one) 123,456.00 = "comma10.2" (10 characters total, with 2 numbers after the decimal) 01/04/98 = "ddmmyy8." (8 characters, can also have 10)
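The informats above can be tried on in-stream data as in this sketch. The data set name informat_demo and the variable names are made up. The colon before each informat tells SAS to stop reading at the next delimiter.

```sas
data informat_demo;                /* hypothetical data set name */
    input amount : comma10.2 saledate : ddmmyy8.;
    format amount comma12.2 saledate date9.;   /* formats control the display */
    datalines;
123,456.00 01/04/98
;
run;

proc print data=informat_demo;
run;
```

With ddmmyy, 01/04/98 is read as day 1 of month 4, not January 4.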
Quantitative Data
Uses numbers. Measures of Center: Mean, Median. Measures of Spread: Variance, St. Deviation, Inter-Quartile Range... Proc Univariate is for a single variable. Proc Corr (correlation) is for multiple variables. -a By statement needs the data sorted by that variable first. Ex: Proc Sort data = sashelp.heart out = heart; By sex; Run; Proc Univariate data = heart; By sex; Var height weight; Histogram height; Run;
Spearman's Correlation Coefficient
Uses ranks of data points instead of the actual values. -the sample mean (and so Pearson's correlation) is NOT robust to outliers -uses the ranks to determine correlation -values are LESS AFFECTED by outliers -Ex. VAR rankx ranky (running the correlation on rank variables gives Spearman's without specifying it elsewhere in the statement)
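The idea that a correlation on ranks is Spearman's can be checked with the sketch below, assuming the sashelp.class sample data is available. The names ranked, rankx, and ranky are illustrative. Both Proc Corr steps should report the same coefficient.

```sas
/* Replace each value with its rank (ties get the average rank) */
proc rank data=sashelp.class out=ranked;
    var height weight;
    ranks rankx ranky;
run;

/* Pearson correlation computed on the ranks */
proc corr data=ranked pearson;
    var rankx ranky;
run;

/* Spearman correlation requested directly on the raw values */
proc corr data=sashelp.class spearman;
    var height weight;
run;
```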