SAS Base Exam Concepts
Types of SAS Output
-SAS dataset (inc. values + descriptive info) -SAS log -Report or simple listing (proc print, tables, etc.) -Other SAS files (e.g. catalogs, settings) -External files or entries directly into other databases (e.g. proc export, direct entry into Oracle, etc.)
SAS Language Elements
-Statements -Expressions -Functions & CALL routines -Options -Formats (for formatting SAS data in output) -Informats (for reading formatted raw data into usable SAS data)
SAS Error Types
-Syntax errors -Semantic errors -Execution-time errors -Data errors -Macro-related errors (newer in 9.4 documentation)
SAS QC Checklist for Preventing Errors
1. Check Syntax (esp. semicolons, quotes, keywords & spelling, ending every DO or SELECT with END;) 2. Check order 3. Check INPUT & data
2 Phases of reading in a datset
1. Compilation Phase 2. Execution Phase
Methods to Combine Datasets
1. Concatenate 2. Interleaving 3. Merging (1-to-1, match merge) 4. Update 5. Modify
Base SAS Components
1. Data management facility (ability to organize data in tabular form) 2. Programming language 3. Data analysis & reporting utilities (PROCS)
What does the placement of DROP= in the code below do? data new (drop=ID); merge data1 data2 (drop = var2) data3; by ID; run;
1. Drop= in the first DATA line (drop=ID) causes ID to be dropped in the final output dataset, but KEEPS it for the initial merge 2. Drop= in the MERGE line drops it from the initial merge so it never gets merged in the first place Option 1 might be good e.g. if you need to merge then deidentify. Option 2 is more efficient if there are variables you just don't need at all.
Duplicate BY Values in MODIFY - What happens?
1. If duplicates are in MASTER dataset, the transaction DS modifies FIRST occurrence in master (leaves others unchanged) 2. If duplicates are in TRANSACTIONAL dataset, then observations overwrite each other (final value = last obs with that BY value) 3. If duplicates are in BOTH master & transactional, SAS still applies each transactional value to the FIRST occurrence in the master dataset (first occurrence will overwrite itself a bunch of times, subsequent occurrences nothing will happen).
The 6 SAS Variable Attributes
1. Name 2. Type 3. Length 4. Informat 5. Format 6. Label
Standard Numeric Formats
1234 (integer) 123.456 (decimal) -123E4 (scientific notation)
LIBNAME
Assigns new permanent library (folder) for storing data, output, etc. Statement: LIBNAME libref 'filepath'; libref = short name for library Must use 2-level name to call this in future (libref.data)
_IORC_
Automatic variable generated during a MODIFY statement. Values: _SOK = observation from transaction dataset was found in master. _DSENMR and _DSEMTR: Observations from transactional dataset were not found in master. If consecutive obs in the same BY group were not found, the first occurrence will say _IORC_ = _DSENMR and subsequent occurrences will be _DSEMTR USES: E.g. select if _iorc_ = _sok
like operator, _ and %
LIKE searches for a matching pattern and is case sensitive. _ substitutes for one character (insert anything), and % is a wildcard (anything can go there). Example: proc print data = emp; where Name like '_R%'; run; Searches Name variable for the pattern '<1 character>capital R<anything>'
LOWCASE / UPCASE/PROPCASE
Converts case of a character variable to lower case (LOWCASE), upper case (UPCASE), or proper case (PROPCASE) e.g. if x="my Pinneapple", UPCASE(x) = PINNEAPPLE, LOWCASE(x) = my pinneapple, PROPCASE(x) = My Pinneapple
Macro Variables (Static/Simple)
Create a macro variable with a static value that can be called again & again in a program. When you change the assigned value, it will change throughout the program. Assign it - %LET: %LET mymacvar = OK; Call it - &: Title "Is &mymacvar"; *Note: Don't use quotes when ASSIGNING macvar (even if character) - use DOUBLE QUOTES when calling it.
Basic Syntax to Create & Call a [Local] Macro Routine
Create a macro: %MACRO macname; ... <insert regular SAS code for macro>; %MEND macname; Call a macro - &macname: DATA mydata; SET myset; &macname; run;
Automatic Variables
Created in the PDV during a DATA step: _N_: counts the # of times a DATA step iterates _ERROR_: signals an error caused by execution (if _ERROR_ = 1, there's an error such as a character in a numeric variable)
proc sgpanel
Creates a panel/grid of charts based on one "panelby" variable (e.g. instead of multiple boxplots in one graph, boxplot for each group is in a different panel/box). Syntax: proc sgpanel data=mydata; panelby groupvar / layout=lattice; hbox numvar; run;
IN= option
Creates a temporary variable indicating whether an observation came from the specified dataset. Used mainly when concatenating datasets, e.g.: data combo; set data1 (IN=yep) data2; by ID; if yep then source = 'first dataset'; else source = 'second dataset'; run; *First creates binary, temporary 'yep' variable where yep=1 if observation is in data1 or 0 if not. (Could also create a different 1/0 variable for data2 called something different, but would be redundant for just 2 datasets). Since 'yep' is only in PDV/temp, also creates a variable called source which shows the output.
SAS Data Types
Numeric or character
Default & Max Storage Space for Vars
Numeric: 8 bytes Character: Length of first occurrence/data value (e.g. "blue" -> 4 chars), max 32,767 characters
Missing Data Notation for SAS Datasets (numeric vs character)
Numeric: Missing data stored as period (".") Character: Missing stored as blank, EXCEPT when entering list data (DATALINES) - must use period to represent missing data.
proc print obs= vs (obs=)
OBS= Used to label the auto-generated observation number column, e.g. proc print data=mydata obs='My Obs'; *noobs suppresses auto-obs column (OBS=) in parentheses tells SAS the maximum obs # to print e.g. proc print data=mydata (obs=10); *firstobs specifies 1st obs to print
END= option
Creates a temporary variable which flags the last observation in a dataset (1=last obs, 0=not "). Written to PDV only, so have to write new var or print to see. data combo; set data1 data2 END=flag; by ID; if FLAG then output; run; (*use proc print to see*)
Invalue
Creates an informat for character values, which allows you to tell SAS how to read in special character values. Used in PROC FORMAT. E.g. if have a numeric var where missing was marked as "M": proc format; invalue varname 'M' = . other = _SAME_; run;
options DATESTYLE=
DATESTYLE=mdy, dmy, ymd Used to clarify date formats when could be ambiguous (e.g. European vs American dates)
PROC DATASETS - Delete Files
DELETE SELECTED FILES: proc datasets library=mylib; delete mydata; run; DELETE ALL BUT SELECTED: proc datasets lib=mylib; save mydata; run; DELETE ALL FILES IN SAS LIB: proc datasets library=mylib kill; run; quit;
DO WHILE & DO UNTIL
DO WHILE: Processes a do loop as long as a condition is met (doesn't initiate if the condition is never met) DO UNTIL: Starts do loop & continues until a condition is met
SAS Dataset Components
Data Values & Descriptor Portion (metadata)
Add formats & styles in PROC TABULATE
Formats: In TABLE statement, add asterisk + "f=<format.>" E.g. TABLE groupvar, numvar*mean*f=comma7.; Style: Add in class, var, or table statements. E.g. class var1 / style=[font_style=italic]; table var1*[style=[background=cyan font_weight=bold];
COMPRESS to join variables
Join variables & remove spaces (e.g. to create a unique ID) newvar=compress(oldvar1||oldvar2||oldvar3)
Merge: What happens if multiple obs with the same BY value?
Keeps all obs from both datasets
Using the : Colon
-Denote a series of integers (1:10 = 1,2,3...10) -Variable wildcard (e.g. "drop pre:;" = drop all variables with prefix 'pre') -In combo with comparison operators (e.g. =: means "starts with", e.g. if lastname =: "Ha", then order = 1; else order=2;) -Denote array dimensions (e.g. array(1:10) var1-var10;)
Pointer Controls
var n1-n2: Column, specifies col widths @n var: Absolute, specifies start of var is @n var1 +n var2: Relative, tells SAS to move pointer n places to right of the preceding variable var1 / var2: Jump to next line for var2 #n var1: Go to nth line of input buffer to find var1
Sum Statement
variable + expression (e.g. used to create running totals - variable = cumulative total var you are calculating, expression = var you want to create a running total OF) e.g. cumulativecases + cases
COMPRESS modifiers
Specific chars let you remove groups of variables. SYNTAX: compress(string,char,mod) compress(string, ,'a'); -> removes all letters compress(string, ,'kd'); -> keeps digits only compress(string,'czy', 'i'); -> removes czy, ignores case compress(string, , 'pd'); -> removes punctuation & digits)
Infile
Specifies an external file to read with an INPUT statement, valid in a DATA step
firstobs & obs options
Specifies line # of first (FIRSTOBS=) and/or last (OBS=) observation to read in. Example: data new; set old (firstobs=5 obs=10); run; /*reads in observations 5-10*/
proc import - GUESSINGROWS option
Specifies the number of rows to scan to determine the appropriate data type and length for a variable. EXAMPLE:
PUT statement
Specifies to write something to the log. E.g. data concat; set data1 data2; by ID; if type <=: 'm' then put id; run; (Will output ID for any obs where type starts with a-m in the log)
SAS DateTime Value
Stored as number of SECONDS between Jan 1, 1960 and a specified date time (to the nearest second)
SAS Time Value
Stored as numeric number of seconds since midnight (00:00:00am) the current day. Time values range from 0 to 86400.
Options dms/nodms
Turns on/off windows (switches to line mode?)
Assign Data Variable to a Macro Variable
Use CALL SYMPUT: E.g. this code assigns the variable "species" to a macro variable 'fishtype': DATA fish; SET sashelp.fish; CALL SYMPUT('fishtype',species); RUN; *NOTE: Must use SINGLE quotes to assign macro var in call symput.
proc chart - report mean or sum instead of default (count)
Use SUMVAR & TYPE= options. E.g. proc chart data=mydata; vbar charvar / sumvar=numvar type=means; run; *Will show bar chart of mean numvar by charvar groups. **Can also do TYPE=SUM.
Conditional formatting for cell colors based on value
Use proc format + call in proc freq, tabulate, etc. EXAMPLE: proc format; value flag low-49 = 'red' 50-99 = 'yellow'; run; proc tabulate data=mydata; class female; table (female ALL)*(n*f=5. pctn*f=pctfmt.*[style=[background=flag.]]); run;
Digit selectors
Used in proc format/PICTURE. In the desired numeric "picture" (e.g. '999-99-9999'), the 9s are called DIGIT SELECTORS. Can be either 0 (will not print leading zeros) or nonzero 1-9 (will print leading zerios; most coders use 9).
Create Running Total (Whole Dataset)
data newdata; set olddata; totalvar + origvar; run; Totalvar = A NEW running total variable (will retain its value for each new obs then add the next one) Origvar = Variable you want to create a running total of (e.g. daily cases -> cumulative cases)
INDEX
index(charvar, 'target') Searches & returns the position (index) of a specified character or substring. Can also be used to check if a character or substring is present in a string by testing if index > 0. e.g. if index(charvar, 'ok') > 0, then flag=1; else flag=0 -> if 'ok' is found in a string then flag=1
Basic ODS Syntax
ods <outputtype> <file>; proc... (whatever you want to run); ods <outputtype> close; E.g. ods html "my_file.htm"; proc means data=mydata; ... run; ... ods html close; ***Output only shows AFTER YOU CLOSE.
PROC DATASETS - Change Dataset Names
proc datasets library=mylib; CHANGE olddata1=newdata1 olddata2=new2...; run;
PROC OPTIONS;
proc options; run; *Returns current values for all system options in SAS
Sorting Datasets (1 var vs Multiple)
proc sort data=mydata; BY sortvar1 sortvar2 sortvar3; run; *Sorts by var3 w/in var2 w/in var1 **Default is ascending order ***To sort all descending, must place "descending" before EACH sortvar, i.e. "BY descending sortvar1 descending sortvar2 descending sortvar3"
Rounding
round(var,n) where n = the place you want to round to. e.g. newvar = round(oldvar,10) creates newvar that rounds oldvar to the nearest 10. int(var) rounds to nearest whole integer
ODS Trace
"Trace" all output objects from an ODS statement, including Name, Label, Template, and Path. ODS trace ON; (turn on trace) ODS trace OFF; (turn off trace)
A Few Common SAS Informats
$ = Character variable comma5. = Read 5 in characters w/ commas & convert to SAS numeric (e.g. 1,234 -> 1234) comma10. = Read up to 10 chars w/ commas & convert to SAS numeric (1,234,567 -> 1234567) percent3. = Reads in % as numeric (98% -> 0.98) dollar4. = Reads in 4 characters w/ dollar sign ($125 -> 125) mmddyy10. = Converts mm/dd/yyyy to SAS date mmddyy8. = Converts mm/dd/yy to SAS date
Comparing CharVars/ If variable starts with...
'=:' -> 'starts with' (e.g. if var1 =: 'L', then...) NOTE1: Letters have numeric value "under the hood" where " "<a<b<...z<A<B<C....". NOTE2: Case sensitive! Upper case > lower Letters have >Therefore, "if x >=: 'm'" ~ "if x starts with m-z"
Create a Date Constant
(Create variable with a constant date, e.g. local index case date for COVID) Use 'INFORMAT'd; E.g. data covid; set orig; input cases date mmddyy10.; indexdate = '01MAR2020'd; run;
Calculate Duration/Date Difference
***PREVIOUS: Use date1-date2 (formats in days). BETTER: DATDIF = Calculate date diff in days: datdif(startdatevar, enddatevar, 'actual') YRDIF = Calculate date diff in years yrdif(startdate, enddate, 'actual') *'actual' is the basis - could also set to assume 360 day years/30 day months, etc. **to set a constant start/end date, use "'01Jan2017'd" or similar ***To call current date, use TODAY(), e.g. yrdif(startdate, TODAY(), 'actual')
Positional Macro
*ORDER MATTERS* when listing macro parameters (macro vars) Syntax: %macro mymacro (macvar1, macvar2, macvar3); ***sas code blabla***; %mend mymacro; Then when you call it, you list what each of the macvars are, IN SAME ORDER AS WHEN CREATING MACRO. I.e. if you want macvar1=4 macvar2='OK' & macvar3 = 46: %mymacro (4, 'OK', 46) *NOTE: Macro variables written this way are LOCAL macvars - they work only for this macro.
Running Total by Group
*SORT FIRST BY GROUP* proc sort data=data out=sorted; BY group; run; data total; set sorted; by group; IF first.group then runningtotal=0; runnintotal + origvar; run; HOW IT WORKS: Resets the running total to 0 at the beginning of each group
Numeric Operators
+, -, *, /, ** : add, subtract, multiply, divide, exponent >=/ge or <=/le : Greater/less than or equal to ~=, -=, ^= or ne: Not equal to Combine conditions using AND (&) or OR (|). If combining many conditions, SAS by default evaluates "and"s first, then "ors". Use parentheses to group as needed.
Arrays
-A temporary variable grouping (exists only for that data step) for acting on multiple vars at once -Each variable in the array is assigned a # position based on the order you list them ARRAY myarray[n] var3 var1 var4 var2... myarray = name of array n = # of elements (variables) in the array Positions are 1 (var3), 2 (var1), 3 (var4) and 4 (var2)
Common Time & DateTime Informats/Formats
ANYDTDTEw. = Extracts many diff date/time formats DATETIMEw. = Reads ddmmmyyyy hh:mm:ss $N8601Bw.d = Reads ISO 8601 date formats (e.g. 1994-11-05T08:15:30-05:00) HHMMSSw. = Reads hh:mm:ss form B8601TMw.d = Read ISO 8601 times (hhmmss<ffffff>)
Create Separate Histograms by Group
Add CLASS statement. E.g.: proc univariate data=mydata; histogram numvar; CLASS charvar; run;
Permanently assign an output format in the DATA step
Add FORMAT line into data/set step: data new; set old; format date mmddyy10.; run;
Label variables in PROC TABULATE
Add labels in TABLE statement using var1='Label'. E.g. TABLE groupvar1="My Group", numvar1="Age"*mean numvar2="Height"*median;
Order Rows in PROC REPORT
Add to DEFINE statement: proc report data=... ; column var1 var2... ; DEFINE var1 /order order=internal; run; ORDER OPTIONS: Data = Same order as data Freq = Descending by frequency Internal = Like proc sort (???) raw values Formatted = Sort by formatted values
Order Class Vars in PROC TABULATE
Add to PROC TABULATE statement. Options: ORDER=DATA -> Follows dataset order ORDER=FREQ -> Sorts by descending freq ORDER=FORMATTED -> by formatted values ORDER=UNFORMATTED -> by raw values (ascending) e.g. proc tabulate data=mydata order=freq;
Add Descriptive Stats in PROC TABULATE
Add to each analysis variable using asterisks in TABLE statement. Examples: TABLE groupvar, numvar*mean; TABLE groupvar, numvar1*min numvar1*max; TABLE groupvar, numvar1(min max mean std)*f=dollar9.2; Common options: mean,median,std,min,max
Formatted Input
Allows SAS user to read in nonstandard data formats using informats (e.g. dates, numbers with commas or dollar signs) & pointer controls (to specify variable positions). SAS reads until it reaches the last column specified, not the last space (like col input). LIMITATIONS: -Must specify where the pointer should look for a variable using column widths/pointer controls (e.g. var1 1-6 var2 $ 8-12), absolute pointer controls (e.g. @1 var1 @8 var2), or relative pointer controls (e.g. var1 +2 var2)
Graph Template Language (GTL)
An extension of ODS for creating "sophisticated graphics". Graphics are generated using templates created in PROC TEMPLATE, then graphs are rendered using SGRENDER. proc template; define statgraph histogram; begingraph; layout overlay; histogram weight; endlayout; endgraph; end; run; ods graphics / width=450px; proc sgrender data=sashelp.class template=histogram; run;
FIRST.var and LAST.var
Automatically-created variables during PROC SORT; BY;. FIRST.var = 1 if it's the first observation with a given value of the BY variable; LAST.var = 1 if it's the last observation with the BY variable, else both=0. Use to unduplicate in PROC SORT: proc sort data=mydata out=clean NODUPRECS; BY sortvar; run; Write to Dataset (after proc sort out=sorted): data newdata; set sorted; by sortvar; first = first.sortvar; last = last.sortvar; run;
Group Observations
BY <groupvar>; *Only for SAS datasets - obs must be sorted
BY variable vs BY value vs BY group
BY variable: The variable data is being sorted/ grouped by (e.g. ID) BY value: For one obs, the value of its by var (e.g. ID=1001) BY group: The group of all obs with the same BY value (e.g. if there are 5 obs with ID=1001)
ODS HTML - Files produced
Body File: Main table or graph Contents: Contains links to body files for each HTML table/graphic, with anchor tags based on NAME attribute (anchor: <A href='myfile.htm#TABLEID>) Page: Links to body file for each pg Frame: Provides simultaneous view of body + contents and/or page.
%eval
Built-in macro to evaluate arithmetic & logical operations (use SAS like calculator). Ex. %let d=%eval(10+20); %let k = 1; %let tot = %eval(&k + 1); %put &tot; (writes answer to log)
Temporary vs Permanent Datasets
By default, all data stored as temporary in the WORK library. Must specify 2-level name (library.dataset) to store permanently
UPDATEMODE=NOMISSINGCHECK (MODIFY statement)
By default, in MODIFY statements if a value is missing in the transactional dataset, SAS does not replace the master value. You can override this with UPDATEMODE = NOMISSINGCHECK so transactional values override master, even if missing.
Informats & Input Style
COLUMN input: Add formats in input statement (e.g. var1 mmddyy8. var2 $ 10-13 var3...) LIST input: Add separate INFORMAT statement OR colon. 1. INFORMAT stmt data... informat datevar mmddyy8.; input datevar; 2. COLON input datevar : mmddyy8.;
COMPBL & COMPRESS
COMPBL collapses 2+ spaces to 1. COMPRESS removes spaces (default) or other character. e.g. if x = "can't / touch this": COMPBL(x) = "can't/touch this" COMPRESS(x) = "can't/touchthis" COMPRESS(x,'/') = "can't touch this"
PROC DATASETS - Copy/Move Datasets
COPY ALL DATASETS: proc datasets; copy out=outlib in=inlib; run; MOVE ALL DATASETS (deletes from inlib): copy out=outlib in=inlib move; MOVE/COPY ONLY SELECTED DATASETS: copy in=inlib out=outlib; select data1; MOVE/COPY ALL BUT EXCLUDED DS: copy in=inlib out=outlib; exclude data1;
WEEKDAY(date)
Calculates day of week for a given date (though SAS recommends testing)
Notepad window
Can also be used to edit/develop code (honestly don't see the point on this and internet is not helpful)
RETAIN
Causes an observation to retain its value from one data step iteration to the next (instead of being reset to missing each time). (don't fully understand use cases though... possibly means subsequent manipulations in the dataset will not apply because the original values are retained?)
VERIFY function
Checks for invalid characters, returns index position of first invalid character if present, otherwise 0. SYNTAX: verify(charvar, 'validchars'); e.g. if x values are ABCCE, BEAXI, EBBB, verify(x,'ABCDE') -> 0, 4, 0
Interleaving Datasets
Combines SORTED datasets vertically, & sorts them in the new dataset on a common variable. Syntax: [1st proc sort data1 and data2] data combined; set data1 data2; BY sortvar; run;
Merging
Combines datasets horizontally. ***DATASETS MUST BE SORTED ON IDVAR FIRST.** 1-to-1 Merge: When datasets have the exact same observations in the same order (exact match) - you can basically just glue datasets together side-by-side. (data combo; merge data1 data2; run; - no BY statement). Mainly use if order of merge doesn't matter Match Merge: Merging on a common identifier/BY variable (data combo; merge data1 data2; BY byvar; run;)
SPEDIS
Computes likelihood of 2 words matching, where 0=perfect match, higher score = worse match. SYNTAX: spedis(charvar, "match") E.g. Data match; Input word $12.; Match = spedis(word, 'Epidemiology'); Datalines; Epidemiology Encyclopedia Epidemic Epistemology ; Run; Yields Match=0,87,37,20
Concatenate Datasets (2 Methods)
Concatenate datasets = Stack vertically. Method 1: DATA/SET (any # of datasets) data new; set old1 old2; run; Method 2: PROC APPEND (2 datasets max) proc append base=basedata data=newdata; run; (optional FORCE option)
TODAY()
Sets date program is run as today's date. Can also assign variable using this statement. E.g. data new; set old; today = TODAY(); days = deptdate - today; run;
proc print advance options
Default: All vars & obs print ID <var>; Emphasizes 1 variable & makes it first column for all "pages"/tables WHERE: Used to subset data FORMAT: Add format st (proc print... ; var... ; format var1 comma7. var2 mmddyy10. etc. SUM: Specify vars to sum (will add that column to the output if not in VARS) BY: Group output using BY <groupvar> - must be sorted 1st! -> can also break out by page using PAGEBY
ODS Table Definitions
Define table formatting - not required. Can run ODS without them. May include column order, data formats, headers, fonts, etc.
Semantic Errors
Definition: Correct language elements but incorrect usage, e.g. wrong # of arguments, using a numeric variable where only character is vaild. When Detected: Usually compilation, sometimes execution Consequences: SAS enters syntax check mode (only ltd processing but SAS runs through the whole program to validate it.)
Execution-Time Errors
Definition: SAS attempts to execute program but fails. E.g. invalid arguments to functions, invalid math operations, obs in wrong order for BY group, out of storage space. When Detected: Execution Consequences: Usually runs & produces warnings or notes, but stops processing for serious errors.
Data Errors
Definition: Some data values are not appropriate (e.g. some character values in a numeric var) When Detected: Execution phase Consequences: SAS generates _ERROR_=1 for that obs but keeps running, writes invalid values to log.
Syntax Errors
Definition: Statements don't conform to SAS rules (e.g. missing semicolon/ quote, misspelled or incorrect keyword). When detected: Compilation phase Consequence: SAS first attempts to try and autocorrect/interpret. If can't prints error message & stops running.
Macro-related errors
Definition: When a macro facility is used incorrectly When detected: Macro compile time or execution Consequences: Depends
Reading in Data Files: Direct vs Indirect Method
Direct: Include filepath in DATA step. E.g.: data temp; infile '/folders/myfolders/temp.dat'; input... Indirect: Set up fileref at the beginning for easy reference later. E.g.: filename temp '/folders/myfolders/temp.dat; ... data temp; infile temp; input..."
Double-click & Right-click if you have no mouse
Double-Click: Type x or x next to the item, then enter Right-Click: Type ? next to the item, then enter
ENDSAS; or BYE;
Ends SAS session (I would imagine best for batch processing/large jobs running in the background)
More SAS Number Formats
Ew. = Scientific format (2.35E-3) Fract. = Fraction (1/4) Percentw.d = Percent (75.25%)
Combining DO Loops + Array
Executes multiple actions on groups of variables - use i= statement to specify which positions in the array to act on. EXAMPLE: data fish; set sasuser.fish; ARRAY fisharray[3] height weight length; DO i = 1 to 3; IF fisharray[i] = 0 then fisharray[i] = .; IF fisharray[i] > 1000 then fisharray[i] = .; END; RUN; *Recall must add END statement to complete DO
How to Handle Missing or Varying Data (e.g. supposed to be 5 characters, it's really 3)
FLOWOVER (default) - looks to next record STOPOVER (only for supposedly perfect data) MISSOVER (converts all to missing) TRUNCOVER (allows shorter data values) *All used after INFILE e.g. infile filename 'filepath' truncover; infile datalines truncover;
Ways to Run SAS
FOREGROUND (mostly interactive, can't do anything else in SAS while running) & BACKGROUND processing. *All are foreground except batch. 1. Windowing enviro (usual, most flexible) 2. Noninteractive (prep a file, run it then it goes - can't use your work station or modify while running) 3. Batch mode (runs larger operation in "batches" in the background of your computer; you can do other tasks at work station simultaneously) 4. Interactive line mode: Enter one line of a program at a time, SAS executes automatically when end of line is reached
Compilation Phase (3 things produced)
First stage in reading in a dataset. Checks for errors & compiles statements. The compilation stage produces: - Input buffer - Program Data Vector (PDV) - Descriptor Information
File Paths for Data or Libs (Windows)
For SAS myfolders: 'folders/myfolders/libref'; 'folders/myfolders/data.dat'; For other location on DPH computer (example): 'C:\Users\employeeid\Documents\test.csv'
proc sgplot
For creating more advanced graphics. Basic syntax: proc sgplot data=mydata; <charttype> var1 <options> / <response= var2>; run; Chart types include histogram, vbar, hbar, dot, series (lines), density, hbox, vbox
Z formats
Formats standard numeric variables **with leading zeros** (otherwise leading zeros will be removed). e.g. if x values are 00552, 1234, 0873 input x Z.; (will keep leading zeros)
proc import vs infile/input
From SAS communities: -INFILE + INPUT is more flexible, allows more control in how data is read in -IMPORT proc tries to "guess" data type, width, format, etc. https://communities.sas.com/t5/General-SAS-Programming/Difference-between-INFILE-AND-IMPORT/td-p/102115#:~:text=INFILE%20together%20with%20INPUT%20in,column%20type%2C%20width%20and%20informats.
proc univariate/histogram options
GRID: Add reference grid to chart (cgrid=grid color, lgrid=line type. E.G. histogram numvar / grid cgrid=blue;) VSCALE: Sets units for y axis (options are COUNT, PERCENT, or PROPORTION). VAXIS: Sets increments for y axis (e.g. histogram numvar / vscale=count vaxis=0 to 10 by 2) MIDPOINTS: Set bin width (e.g. histogram var1 / midpoints = 0 to 500 by 100;) VAXISLABEL: Label vertical axis (e.g. histogram var1 / vaxislabel='My Label' INSET: Adds mini tabel within graph with key stats. (e.g. proc univariate data=mydata; histogram var1; inset n mean min max; run;)
NOSOURCE, NONOTES and ERRORS=
Global options statements which suppress log output. NOSOURCE = Suppress SAS statements in log NONOTES = Suppress notes in log ERRORS = #: Tells SS the max # of observations for which to print error messages Enter as standalone (global) options line: e.g. options nosource nonotes errors=5;
OPTIONS missing=
Global statement, sets missing value for output e.g. OPTIONS missing='missing';
PRINTTO=
Global statement, specifies where to print log or output: OUTPUT: proc printto print='output-file' new;run; (reset to default: proc printto print=print; run;) LOG: proc printto log='alt-log-file'; run; (reset to default: proc printto log=log; run)
Missing Data in PROC TABULATE
If class var has missing data, use MISSING option. If analysis var ", will exclude from analysis. Can label missing values using MISSTEXT= option. Ex. proc tabulate data=mydata missing; class x y z; table x all y all / misstext = 'No Data'; run;
WHERE/IN vs IF/THEN
In a DATA step, WHERE/IN statements are applied before the input buffer, vs IF/THEN conditions are applied between the PDV & final output dataset. As a consequence: -IF is better when accessing raw data, for using new vars or automatic vars (_N_, _ERROR_), for using dataset options (e.g. OBS, FIRSTOBS), or to conditionally execute a statement. -WHERE is better for special operators (LIKE, CONTAINS), subsetting as a dataset option, w/ PROC SQL, to improve efficiency. NOTE: When merging datasets, WHERE/IN is better applied BEFORE merge, vs IF/THEN is better AFTER.
LIST statement
In a DATA/INPUT statement, tells SAS to list observations in the log as it's reading them in. E.g.: data vehicles; input id type $ wheels; if wheels > 4 then list; datalines; 1001 car 4 ... etc.
Rules for SAS Names
Includes datasets, varnames, librefs, etc.: -Can contain 1-32 characters -First character must be a letter or underscore -Subsequent chars may be letter, number or underscore -No blanks or other special characters -[Varnames only] Not case sensitive when calling/ using, but SAS will always output using the original varname case entered.
Input (function) vs Input (keyword)
Input (function): Used to convert character values to numeric OR other character e.g. phone = input(phonenumber,5.) Input (keyword: Instructs SAS how to write the values of SAS variables e.g. data new; input var1 var2 $ var3...
Informat
Instructs SAS how to read data, like date values, into SAS variables
Format
Instructs SAS how to write/format the values of SAS variables (in output)
PROC DATASETS basic functions & syntax
Interactive proc used to bulk-manage, modify, & view SAS libraries & datasets. Runs until QUIT; used. Basic Syntax: proc datasets library=mylib; <many diff sub-commands>; Run; Quit; ***If no library specified, assumes default/work KEY FUNCTIONS include copying/moving files across libraries, listing library contents or dataset attributes, renaming files, repairing files, deleting files, appending datasets, modifying datasets (e.g. rename vars, change/add/remove formats & labels), manage passwors
proc plot options
Label Axes: Add LABEL statement (proc plot...; plot y*x... ; LABEL yvar = 'Y Label' xvar = 'X Label'; run;) Set Axis Values: Add HAXIS or VAXIS statement (plot yvar*xvar / haxis = 0 to 200 by 5 vaxis = 0 to 3000 by 100;) Change Dots: Add ='plotchar', e.g. plot yvar*xvar='+'; (will plot dots as + signs) Suppress legend: NOLEGEND option (proc plot data=... nolegend;)
KEYLABEL statement (e.g. proc tabulate)
Lets you assign labels to keywords/auto-generated titles, e.g. proc tabulate data=mydata; class female ses; table female ALL, ses ALL; keylabel ALL = "Total" N = " "; /*2 spaces between... not sure why this works but it does*/ mean = "Mean" lclm = "95% Lower CL" uclm = "95% Upper CL"; run; ^Labels ALL total, removes N row from proc tabulate output.
Variable Assignment
Lets you create new variables or replace/modify old during the DATA step - e.g.: data new; set old; newvar1 = oldvar1*10; if oldvar2 = '1' then newvar2 = 'yellow'; if oldvar2 = '2' then newvar2 = 'blue'; OR (replace) if oldvar2='y' then oldvar2 = 'yes';
PROC TEMPLATE
Lets you create template, e.g. for ODS output. Basic syntax: PROC TEMPLATE; DEFINE <table defs>; HEADER... ; COLUMN ... ; END; RUN; *See pg 586-7 for example code
ODS Select/ODS Exclude
Lets you select specific objects to output (must know name first htough). E.g. ods select BasicMeasures Quantiles; proc univariate... ^Selects only 2 tables to output from proc univariate.
Modified List Input
Lets you use informats for list-format (delimited) data. RULES: -Must use colon between varname & informat (e.g. data test; input var1 : $10. var2 : mmddyy10.) -Can allow embedded nulls IF there are at least 2 spaces or other dlm between values using "&" (e.g. data test; input var1 & $18. var2 $;)
Data Input Types
List input Column input Formatted input (Also, modified list & mixed)
PROC DATASETS - Get library listing
Lists files ("members") in a SAS data library: proc datasets library=mylib; run; OPTIONS: MEMTYPE= Specifies file type (e.g. data, catalog) e.g. proc datasets library=mylib memtype=data.
Log Definition & Components
Log provides record of everything you do in a SAS session. Key components: -SAS statements -Error messages -Notes & warnings -Notes w/ # of obs and variables for each dataset created
Program Data Vector (PDV)
Logical memory area where SAS builds a dataset, one observation at a time. The PDV reads or creates data values (e.g. if missing #, -> ".") _N_ and _ERROR_ are also created in the PDV
Input Buffer
Logical memory area where SAS reads each record from a raw data file when the program executes (if reading another SAS data file, skip this step)
PROC TRANSPOSE: Long to Wide
Long to Wide - Basic Syntax: proc transpose data=flippy out=wide prefix=weight; BY id; ID visitnum; VAR weight; run; Here start with a long dataset with 3 columns: ID, visit number, and weight -> Transform to a dataset with 1 row per ID & 3 columns for weight at visits 1-3. *BY is the ID variable you want to be the unique row identifier in the wide DS (must sort data on BY var first). **ID tells SAS which variables or values you want transposed to new columns. Here, it's visit #. ***VAR is the VALUE you want to keep & transpose by ID (i.e. here, we want weight by visit number). Usually only specify one var for wide to long.
proc chart options
MIDPOINTS: Set axis intervals for continuous numeric variables (vbar var1 / midpoints=0 to 100 by 5); LEVELS: Sets # of cut points for bar charts of numeric variables (vbar var1 / levels=5); DISCRETE: Override SAS auto-grouping numeric variables & plot all discrete values. (e.g. vbar numvar1/ discrete) SUBGROUP: Set subgroups within main groups (e.g. vbar var1 / sugroup = var2;)
PROC REPORT missing data
MISSING option tells SAS to include missing values as a distinct value. No MISSING option will lead SAS to ignore or drop these values. E.g. proc report data=grocmiss missing; column sector manager N sales; define sector / group format=$sctrfmt.; define manager / group format=$mgrfmt.; define sales / format=dollar9.2; rbreak after / dol summarize; run;
Common Date Informats/Formats
MMDDYY8. -> mm/dd/yy MMDDYY10. -> mm/dd/yyyy DATE7. -> ddMMMyy (e.g. 25JUL07) DATE9. -> ddMMMyyyy WORDDATE18. -> January 25, 2020 WEEKDATE29. -> Saturday, March 8, 2020 ANYDTDTE10. -> Used to read in mixed/any date format (though not always perfect) *Note: Best practice to use 4-digit year
Descriptor Information
Metadata info stored in the descriptor portion of a SAS dataset. Includes variable and dataset attributes. Can be viewed using PROC CONTENTS
Missing Data (Numeric)
Missing numeric entered as period (".") -"." sorts low in PROC SORT or comparison (e.g. n>.) -Some procs auto-exclude missing, others do not. Be careful!
Proc Plot - Multiple Scatter Plots
Multiple Plots, Different Pages: Multiple y*x's e.g. proc plot data=mydata; plot yvar1*xvar1 yvar2*xvar2... ; run; Multiple Plots, Same Page: Use VPCT/HPCT, where VPCT = % of vertical space given to each graph, HPCT = % of horizontal space. e.g. proc plot data=mydata hpct=33; ...plot... run; Multiple Variables Plotted on Same Axis: Use OVERLAY option e.g. proc plot... ; plot yvar1*xvar1 yvar2*xvar1 / overlay; run;
Message characters
Non-numeric characters used in proc format/PICTURE that will be printed in numeric output. (E.g. hyphens for phone numbers or SSN)
Call routines
Similar to functions, but different in that you cannot use them in assignment statements or expressions
ODS
Output Delivery System. How SAS produces output in different formats, such as HTML, SAS, RTF (word), a non-SAS dataset. How it Works: 1. Data and Tabular definitions (with formatting instructions for data) are combined into the OUTPUT OBJECT. 2. Output object has a name, label, path, and usually formats. 3. Output object is directed to an ODS destination, which specifies the type of output & location 4. SAS creates final formatted output.
proc import basic syntax & options
PROC IMPORT datafile='C:\filepath" dbms=dlm out=mydata replace; delimiter='&'; getnames=yes; run; *DBMS specifies data type. Options include csv, xlsx, dlm (space- or other delimited file), tab (tab-delimited), access (MS access database), dbf (dBASE file) **OUT= specifies name (& lib) of output SAS dataset ***REPLACE: Overwrites existing SAS dataset ****GETNAMES= Pull variable names from 1st row of data
proc import vs infile/input - reading in csv
PROC IMPORT datafile='filepath' dbms=csv out=mydata replace; getnames=yes; run; DATA mydata2; infile 'filepath' dlm=',' firstobs=2; length var2 $12; input var1 mmddyy10. holiday $; run;
PROC REPORT basic structure
PROC REPORT data=mydata <options>; COLUMN var1 var2...; DEFINE... (basically used to define everything from formats to output statistics to which vars to group by. Options include GROUP, ANALYSIS, and more); RUN; Simplest PROC REPORT looks like PROC PRINT (default = all vars, all obs). Can also do summary stats like PROC TABULATE. Supposedly the most flexible.
PROC TABULATE structure
PROC TABULATE data=mydata <options>; CLASS var1 var2...; /*usually charvars*/ VAR numvar; /*analysis vars - numeric*/; TABLE var1*var2, numvar*mean*f=comma7.; run; TABLE statement tells SAS how to structure the output. Must specify column var at minimum - if 1 comma, goes TABLE row, col. If 2 commas, goes page,row,col. EVERY VARIABLE YOU WANT TO USE MUST BE SPECIFIED IN A CLASS OR VAR STATEMENT
PROC DATASETS - List dataset content
Proc datasets library=mylib memtype=data; CONTENTS data=mydata; run; Outputs header, name, engine, date created & last modified, var names & attributes, more. *NOTE: Can only specify 1 dataset per contents stmt MODIFIERS: VARNUM: Lists vars in order of their logical position in dataset (e.g. contents data=mydata varnum;) SHORT: Lists short 1-line list of vars (" short;) NODS: Provides listing only (" nods;)
Concatenation using SET vs PROC APPEND
Processing time: Proc append is faster (doesn't process 1st dataset) # of datasets it concatenates: 2 only for proc append, any number for set Handling Different Variables (e.g. dataset 1 has a variable dataset 2 doesn't): SET keeps all vars from all datasets, sets to missing as needed. APPEND uses all vars from BASE, sets to missing if not included in new data. Drops any new data vars not in BASE dataset. Handling Different Varnames or Types: SET doesn't run. BASE will treat diff names like diff variables, only run for type if you use FORCE option Handling Diff. Lengths: - SET: Uses length from 1st dataset in SET statement - PROC APPEND: If BASE length >= DATA, keeps BASE length. If DATA > BASE, doesn't run unless using FORCE option, in which case it truncates DATA. Handling Different Formats, Informats, & Labels: - SET: Explicitly defined overrides default, 1st dataset overrides if all are defined - PROC APPEND: BASE overrides
proc univariate histograms - purpose + basic syntax
Produces a histogram showing continuous NUMERIC variable distribution. Basic syntax: proc univariate data=mydata noprint; histogram numvar; run; *NOPRINT suppresses a bunch of tables auto-generated by proc univariate. **Auto-scales vertical axis & horizontal bins
proc plot purpose & basic syntax
Produces simple scatterplot of two (numeric) variables. Basic syntax: PROC PLOT data=mydata; PLOT yvar*xvar; Run; Default output includes x/y labels, legend, dots are alphabet where A=1 obs, B=2 obs, ... Z=26+ obs
proc chart function & basic syntax
Produces simple vertical & horizontal bar charts, block charts, pie charts, and star charts. Defaults to showing FREQUENCY (n) of each group. Basic syntax: PROC CHART data=mydata; <charttype> var1; run; *<charttype> can be VBAR, HBAR, BLOCK, PIE, or STAR **For numeric variables, SAS will auto-set cutpoints.
3 SAS Programming Windows
Program Editor, Log, and Output (Results Viewer is now default output destination for 9.3+) *5 most common windows also include Results & Explorer.
proc contents
Provides descriptor portion of dataset, or lets you view contents of an entire library. Simplest use: proc contents data=mydata; run;
Put (function) vs Put (keyword)
Put (function): . Returns a value using a spec. format, used to convert numeric to character e.g. id2=put(id, 8.) Put (keyword) Writes lines to the SAS log, output window, or to an external location specified in the most recent FILE stmt, valid in a DATA step e.g. data concat; set data1 data2; if type = 2 then put id; run;
Macro variable lists
Quick way to reference a list of variables. %LET varlist = var1 var2 var3 var4; proc freq data=lib.mydata; table &varlist; run;
PROC DATASETS Modify Variable Attributes
RENAME VARIABLES: proc datasets library=mylib; MODIFY mydata; RENAME oldvar1=newvar1 old2=new2... ; run; CHANGE/ASSIGN FORMATS: proc datasets... ; modify mydata; FORMAT datevar1-datevar3 mmddyy9. var4 $20...; run; CHANGE/ASSIGN LABELS: proc datasets...; modify mydata; LABEL var1='Label 1' var2='Label 2'... ; run; REMOVE FORMATS/LABELS: proc datasets...; modify mydata; ATTRIB _all_ label=''; ATTRIB _all_ format=; run;
ODS Output Keywords/Types
RTF = MS Word Output = SAS dataset Listing = SAS text output (classic) HTML = HTML, web (9.4 default) Printer = High-resolution output, specify printer path PDF = PDF
Column Input
Reads in data structured in fixed columns. ADVANTAGES: -Character variables CAN contain spaces or embedded blanks -Can create variables >8 bytes without using LENGTH statement -Lets you read columns in any order or skip some -Char vars can be up to 32,767 bytes LIMITATIONS: -Only standard numeric or character data allowed -Must specify column widths (e.g. var1 1-6 var2 $ 8-12).
List Input
Reads in delimited data where 1 row = 1 obs. Default delimiter is a space, for other delims specify "dlm =" (e.g. for csv do dlm = ','). RULES: -Only for data in standard numeric or char formats -No embedded blanks allowed -Must specify missing values using a non-space character such as '.' or '-' -Default length of char vars is 8 bytes
Conditional DELETE (Data step)
Remove variables that meet a condition. E.g.: data subset; set fulldata; if project = 'done' then delete; run;
UPDATE datasets
Replaces and/or appends data in a master dataset based on a new, transactional dataset. (If some idvars are missing from new dataset, does not delete them - only adds/modifies). data master; update master transaction; by ID; run; UPDATEMODE= option -> specifies what to do with missing data in new transactional data. UPDATEMODE=MISSINGCHECK (default) leaves master data values in place if they're missing in new dataset. UDPATEMODE=NOMISSINGCHECK replaces master data with missing if missing in new dataset. ***Also can create a new dataset
MODIFY
Replaces, deletes or appends existing dataset (similar to update, but cannot create NEW datasets, only replace). Can modify a dataset by itself, or modify a master using a transactional. data mydata; modify mydata; var1 = var1*.25; run; data master; modify master transaction; by id; run; *NOTE: Data do not need to be pre-sorted
INTCK
Returns # of specified intervals between any 2 dates. Format: intck(interval, startdate, enddate, <method(optional)>) Interval options include 'DAYS', 'MONTH', 'qtr', 'YEAR', 'WEEK', custom
INTNX
Returns date given an interval, start date, and number of intervals elapsed. Basic format: intnx(interval, startdate, increments) Intervals similar to INTCK - 'DAY', 'week', 'MONTH','YEAR','HOUR' (not case sens.)
COMPRESS
Returns string variable with specified characters removed (defaults to remove blanks if no character specified). E.g. compress(var,'1234567890') -> removes digits compress(var, "+-", modifier) Modifiers add a group to the list of modified characters
data _null_;
Runs a DATA step without generating a dataset. Often used with PUT (i.e. write to log instead). E.g. data _null_; input var1 var2 var3 var4; put var1 var2 var3 var4; /*write to log*/ datalines...
PICTURE + prefix option
SAS can't put character as the first digit automatically. Add prefix: proc format; picture ph low-high = '0999) 999-9999' (prefix = '(' ) ; run;
How SAS Codes Dates
SAS codes datases as the # of days before or after January 1, 1960. (1/1/1960 = 0, 1/2/1960 = 1, etc.) For data in mmddyy8. format, e.g. 01/07/70, SAS uses YEARCUTOFF to infer the first 2 digits of year. For v9.4, YEARCUTOFF defaults to 1940: if YY<40, SAS assumes it's in the 2000s (2000-2040). If YY>40, assumes it's in the 1900s (1941-1999).
SCAN function
Scans for the nth "word" separated by a specified delimiter: scan(source,n,delim) E.g. if x="my dog, she's the best, I swear!" scan(x,1,',') = "my dog" ("my dog" is the first "word" or chunk of variable before the first delimiter. scan(x,-1,',')=" I swear!" (n=-1 returns LAST segment)
INDEXC
Searches character for 1+ characters. Syntax: INDEXC (character-value/var, 'char1', 'char2', 'char3', 'charn') -> returns position (index #) of the FIRST occurrence of char1, char2, char3, charn; if none found, returns 0. Can also search for multiple chars using INDEXC(string, 'char1char2char3char4'); e.g. if string="ABCDEFG" indexc(string,'F','C','G') -> 3 (position of C) indexc(string,'FCG') -> 3 (position of C) indexc('x','y','z') -> 0 (not found) **Can use this to read in mixed date formats (see Fun w/ Functions exercise)
FIND
Searches for substring within stringvar & returns position # of first letter in substring. (Like index but for substrings?) *note: 'i' modifier = ignore case. e.g. find_silver=FIND(rich, 'silver','i');
Execution Phase
Second phase of a SAS DATA step, where SAS executes the data step statements ONE OBSERVATION AT A TIME, looping until it reaches the end of the dataset. Steps: 1. Begin data statements 2. Sets variable values to missing in PDV 3. Checks: is there a new record (obs) to read? If yes, continue to 4, if no, end DATA step & proceed to rest of program 4. Read 1 input record (obs) 5. Execute any additional data steps (e.g. formats, labels) for this 1 record 6. Write this 1 record to the SAS dataset 7. Loop back to step 1.
SAS 9.4 Universal Printer
Set printerpath, then just use ods printer. E.g.: options printerpath=pdf; ods printer; proc means data=mydata; ... ; run; ods printer close;
Subset Datasets
Subset OBSERVATIONS: 1. Delete: IF <condition> THEN DELETE; 2. Accept/Keep: IF <cond.>; *How it works: SAS does not write obs to output dataset; jumps to next obs. Generally use whichever requires least arguments (SAS prefs #2) **Can also use WHERE/IN - best for PROCs Subset VARIABLES: Use drop/keep
SUBSTR(left) and SUBSTR(right) vs SUBSTRN
Substr(left) and substr(right) refer to their position in an assignment statement. Right (usually to EXTRACT): newvar = substr(oldvar,1,5) Left (usually to REPLACE): substr(oldvar,1,3) = '123' (replaces first 3 characters of oldvar w/ 123) substrn like substr but a few extra features See https://support.sas.com/resources/papers/proceedings/proceedings/sugi25/25/cc/25p088.pdf
SUM(x)
Sums values, excluding missing E.g. totalcost = sum(cost,tax);
ALL option for PROC TABULATE
System option, adds total columns/rows for specified variables. (KEYLABEL lets you label sas keywords). E.g. proc tabulate data=mydata; class female ses; table female ALL, ses ALL; keylabel ALL = "Total"; run; ^Produces table with total row for gender, total col for SES
&SYSDATE
System-generated macro variable that calls the date the program is run, with default format 06NOV20. E.g. data _null_; CALL SYMPUT ('mydate', trim(put("&sysdate"d,mmddyy10.))); run; %put &mydate; Other automatic macro vars include &SYSDATE9, &SYSDAY, and more. See https://documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.5&docsetId=mcrolref&docsetTarget=n0t7cm070dig9hn1kmz3892ebtmw.htm&locale=en.
Cross Elements in PROC TABULATE
TABLE var1*var2 (groups rows/columns first by var1, then var2 within var1)
DO group
Tells SAS to DO several actions based on a common condition, until an END statement ends that condition. E.g. data new; set old; IF city = 'amsterdam' then do; country = 'Netherlands'; museums = 15 walkscore = 82; end; ELSE IF city = 'roma' then do; ...
Formatting Proc Print Output
Title statement (TITLE 'My Title';) Labels: (LABEL var1 = 'mylabel' var2 = 'label2';) Double spacing: Add "double" option (proc print data=mydata double;) Footnotes: Like title statement (FOOTNOTE 'here is my footnote';) Uniform Column Widths: Add width=uniform (proc print data=mydata width=uniform;)
TRANSLATE & TRANWRD
Translate: Searches for & replaces a specific character in a character variable -> translate(variable, tochar, fromchar) Tranwrd: Searches & replaces a substring -> tranwrd(variable, target, replacement) E.g. if x = "My cat, she is fluffy, orange, and fat" translate(x,'!',',') = "My cat! she is fluffy! orange! and fat" tranwrd(x,'cat','enchilada') = "My enchilada, she is fluffy, orange, and fat"
LAG, LAG2 or LAGn
Used to compare/compute differences between subsequent OBSERVATIONS (vertically). LAG(x) by default returns the value for the observation immediately prior. LAG2(x) returns value for 2 obs prior. LAGn(x) returns value for n obs prior. E.g. Data lastime; Input day covtest; Prev_day=lag(covtest); Two_Days_Ago=lag2(covtest); Four_Days=lag4(covtest); Datalines; 1 40 2 165 3 256 4 145 ; Run; LAG values: - , 40, 165, 256 LAG2 values: -, -, 40, 165... etc. **Could use this to compute 3-day averages, etc.
INPUT + "trailing @" (input... @;)
Used to read in only observations that meet a specified condition: data fun; input conditionvar 3-5 @; IF conditionvar = 3; input var1 var2 var3 conditionvar var5; datalines; ...
keep vs drop
Used to select variables to keep or drop when reading in a dataset. Can apply either to the original set (KEEP= / DROP=) or new dataset (KEEP/DROP). Example 1 (only reads in kept variables to PDV): data new; set old (keep = var1 var2 var3); run; Example 2 (reads in all vars to PDV, only writes kept vars to new dataset): data new; set old; keep var1 var2 var3; *no equals sign* run;
LENGTH statement
Used to specify length of variables in the DATA step. MUST COME BEFORE input statements/new variable creation. E.g.: data new; set old; length newvar $10; if oldvar = 'purple' then newvar = 'cool'; else if oldvar = 'yellow' then newvar = 'totallyrad'; run; (Prevents SAS setting default length=4 and cutting off 'totallyrad' at just 'tota') NOTE1: Moves variables to the left of dataset NOTE2: Can be used to save storage for numeric vars (mainly integers), but more common for character
OUTPUT
Used to specify one or more output datasets. Note1: This command suppresses auto-output, so if using must write output statement for all obs. Note2: Order matters! Always put at the END of a data step (after variable assignments) E.g. (writes 4 output datasets): data italy other summer restofyear; set olddata; if country = 'Italy' then output italy; else output other; if season = 'Summer' then output summer; else output restofyear; run; **MAKE SURE TO PUT ALL OUTPUT DATASETS in first DATA line!!!!
INPUT + double trailing @ (input... @@;)
Used to split one record into multiple observations by telling SAS to loop the same variables until it finishes reading the whole record: data test; input gender $ age @@; datalines; m 13 f 22 m 21 f 65 m 18 f 34 ; run; *will run gender & age twice for each line to split it into one long dataset w/ age & gender.
PICTURE
Valid in proc format, lets user create custom numeric formats by showing SAS a 'picture' of what the output should look like. EXAMPLE: proc format; picture ssn low-high = '999-99-9999"; run; [call using format] NOTE1: Picture cannot start with symbol - use PREFIX option to start w/ symbol NOTE2: *ssn = name of format **low-high = tells SAS all values should be printed with that format
Read Multiple Records into 1 Obs
When data for one record is split on multiple lines: e.g. 1023 David blue 189 1033 Angela ... Can use multiple input statements, "/", or "#n": Option 1 (multiple inputs): data ppl; input id 1-4 name $ 6-23; input eyes $ 1-6; input weight 1-3; datalines... Option 2: "/" data ppl; input id 1-4 name $ 6-23 / eyes $ 1-6 / weight 1-3; datalines... Option 3 "#n" data ppl; input #1 id 1-4 name $6-23 #2 eyes 1-6 #3 weight 1-3 (or whatever order you want)
PROC TRANPOSE: Wide to Long
Wide to Long - basic syntax: proc transpose data=wide out=long; VAR weight1-weight3; BY id; run; Creates table with an id column, _NAME_ (weight1, weight2, or weight3), and values column (actual values for weight1-3). *BY is the unique identifier you want to repeat across rows *VAR specifies variables that contain similar data over different columns, e.g. for repeat measurements over time.
PROC TRANSPOSE - basic, no options
With no options specified, proc transpose will exactly transpose the data (flip rows & columns, putting all varnames in a column called _NAME_, and naming other cols COL1, COL2...COLN. E.g.: proc transpose data=flippy out=testy; run;
FILE PRINT <options>;
Writes PUT output directly to output window instead of log. E.g. data _null_; infile...; input var1 var2 var3... ; FILE PRINT notitles; put @1 var1 @6 var2 @15 var2... ; run;
%PUT
Writes specified text to the log. Examples: Code: %put My first macrovar is: &varlist9; Output: My first macrovar is: read9 math9 write9 Code: %put _all_; Output: Writes all macro vars to log & states if they're automatic, global (user-created)
ZIPNAME, ZIPCITY, ZIPSTATE
ZIPNAME(5digitzip) returns uppercase state names in which zip code is located ZIPCITY(zip) returns city ZIPSTATE(zip) returns state abbreviation ZIPFIPS(zip) returns state FIPS code
Concatenate Character Vars
x||y or cat(x,y): Concatenates exactly as they are catt(x,y): Also removes trailing spaces cats(x,y): Removes trailing & leading spaces catx(sep,x,y): Concatenates, removes spaces, and adds a separator in between. e.g. x="the dog ", y="& cat " x||y = cat(x,y) = "the dog & cat " catt(x,y) = "the dog & cat" cats(x,y) = "the dog& cat" catx(' ',x,y) = "the dog & cat" *NOTE: Watch for truncation - may need LENGTH