Combining Data Sets
How many columns with the following PROC SQL code produce in the resulting output? PROC SQL; SELECT COALESCE(d.ID, t.ID) AS ID, Gender, d.Data as DOB FORMAT=MMDDYY8., TestName, t.Date AS TestDt FORMAT=MMDDYY8. FROM sasdata.demo as d FULL JOIN (SELECT * FROM sasdata.test WHERE Date>=MDY(01,01,2005)) as t ON d.ID=t.ID ORDER BY ID; quit;
5
SQL Set Operations Keywords
ALL: does not remove duplicate rows and does not sort rows; not used with OUTER UNION CORR: overlays columns by name instead of position; with EXCEPT, INTERSECT, and UNION, removes columns that are not in both data sets; with OUTER UNION, all columns remain
Why is PROC APPEND more efficient than using a SET statement in the DATA step to combine two SAS data sets vertically?
APPEND only reads the data in the DATA= data set
Multidimensional Array Syntax
ARRAY <arrayname>(#, #);
Hash Object Syntax
DATA <combined dataset>; IF 0 THEN SET <hash data set>; IF _N_= 1 THEN DO; DECLARE HASH <hashref> (DATASET: <hash data set>); hashref.DEFINEKEY ("<key variable>"); hashref.DEFINEDATA ("<data>", "<data>"); hashref.DEFINEDONE ( ); END; SET <base data set>; hashref.FIND (KEY:<key variable>); <statements to assign hash to base data set, usually IF/THEN or something similar> RUN;
FILENAME syntax
FILENAME <file name ref> ('<location\filename.ext>' <'location\filename.ext' .... >);
Which hash object method allows you to retrieve data values?
Find( )
FORCE option
Force BASE= dataset to allow appending
Types of Vertical Set Operators in SQL
INTERSECT OUTER UNION UNION EXCEPT
Horizontal Joins in SQL
Inner Joins and Outer Joins Used for combining data sets that have some (or all) of the same observations - combine using key(s)
For which join can two different styles of PROC SQL syntax be used?
Inner join
PROC APPEND syntax
PROC APPEND BASE = <dataset> DATA = <data to add> <FORCE>; RUN;
SQL Set Operations Syntax
PROC SQL; SELECT <first query> FROM INTERSECT | OUTER UNION | UNION | EXCEPT SELECT <second query> FROM ; QUIT;
SQL Outer Join Syntax
PROC SQL; SELECT FROM LEFT JOIN | RIGHT JOIN | FULL JOIN ON; QUIT; Need to specify type of join ON clause provides expressions for the columns that join tables (similar to a BY statement in a DATA step)
SQL Inner Join Syntax
PROC SQL; SELECT FROM WHERE; QUIT; Data sets to be joined are listed in FROM clause separated by commas WHERE clause identifies key ids (WHERE dataset1.keyid = dataset2.keyid) and other subsetting conditions
PROC TRANSPOSE Syntax
PROC TRANSPOSE DATA=<input-data-set> OUT=<output-data-set> <NAME=prefix text> <PREFIX=variable-name>; BY <DESCENDING> variable-1 <...<DESCENDING> variable-n> <NOTSORTED>; VAR <variable(s)>; RUN; PREFIX= : assigns a column name rather than default of "Col" NAME= : assigns a column name to the columns that identify the transposed variable(s)
Suppose that you would like to combine a grand total across all observations (the result of PROC MEANS in an output data set) back with the original data set so that it repeats for each observation. Which type of statement would you use in a DATA step to complete this task? MEANS data set Total 17.50 Original data plus appended total ID Cost Total NSF-028 2.25 17.50 GBH-199 3.75 17.50 SFF-821 5.50 17.50 FFP-545 6.00 17.50
SET
INTERSECT Joins in SQL
SQL evaluates first combine queries each starting with a SELECT clause deletes duplicate rows combines columns by overlaying them based on position (not column name), columns must be the same type, column name comes from first column (unless no column name) Ven diagram of two circles stacked vertically with overlap, selects only where circles overlap
EXCEPT Set Operator in SQL
SQL evaluates last combine queries each starting with SELECT clause deletes duplicate rows combines columns by overlaying them based on position (not column name), columns must be the same type, column name comes from first column (unless no column name) Ven diagram of two circles stacked vertically with overlap, selects all of first circle minus what overlaps with second circle
OUTER UNION Joins in SQL
SQL evaluates second combine queries each starting with a SELECT clause Two circles stacked vertically with no overlap, selects all of both circles
UNION Set Operator in SQL
SQL evaluates third combine queries each starting with a SELECT clause deletes duplicate rows combines columns by overlaying them based on position (not column name), columns must be the same type, column name comes from first column (unless no column name) Ven diagram of two circles stacked vertically with overlap, selects all of both circles
PROC APPEND
Stack/Interweave two data sets Reads only DATA= dataset use FORCE keyword if DATA= dataset has more variables (or different variables) than BASE= dataset
Cartesian Product
every possible combination of two data sets merging can be very computer intense
Advantages of Hash Objects
faster because table stored in memory rather than in stored table similar to arrays but more flexible data doesn't need to be sorted object oriented approach
SQL Inner Joins
Think ven diagram of two circles. Inner joins combine the parts of the data sets that overlap combines tables on rows that match in both tables WHERE clause restricts rows of Cartesian product Maximum joining of 32 tables in one single inner join
SQL Outer Joins
Think ven diagram of two circles. Outer joins combine one circle completely with the overlapping part of the other circle or the entire other circle. combine tables on rows that match in both tables plus some rows that do not match from one or both tables Maximum of two tables to join
T/F: By default, PROC SQL does not overlay two columns of the same name.
True
IF _N_=1 THEN SET <data set>;
Use to prevent SAS from reaching end of file after the first iteration, values are retained
Which clause is added to a SQL join to restrict the rows of the Cartesian product that are displayed in the output?
WHERE
COALESCE
function to overlay common columns when combining data sets in PROC SQL
Hash Objects
a data structure that contains a list of items that can be used to map keys to their associated values Data step only can be hard coded in DATA step or loaded from a data set a component object (data element) that has -attributes: specify information associated with the object -methods: define the operations that the object can perform Key(s): must be unique, can be composite Data: can be multiple data points per key
What happens when the BASE = data set contains more variables than the DATA= data set when using PROC APPEND? a. A warning message is generated in the log b. An error message is generated in the log c. Appending is not done d. A loss of data occurs
a. A warning message is generated in the log
A set operation must have which of the following? a. At least two queries each beginning with a SELECT clause b. A WHERE clause c. The keyword ALL or CORR d. All of the above
a. At least two queries each beginning with a SELECT clause
Which of the following is an advantage of a SQL join? a. Data sets do not have to be sorted or indexed b. There is no limit to the number of input data sets c. There must be a common key variable in all the data sets d. Complex business logic can be hidden from the user
a. Data sets do not have to be sorted or indexed b is incorrect because SQL joins can only join 2 data sets together
Which set operator selects unique rows from the first table that are not found in the second table? a. EXCEPT b. INTERSECT c. UNION d. OUTER UNION
a. EXCEPT
Given the following SAS data sets One and Two: One X Y 1 A 5 B 7 C Two X Y Z 4 D 7 3 E 9 The following SAS program is submitted: proc append data=two base=one <insert option here>; run; Which of the following options enables the APPEND procedure to execute without errors? a. FORCE b. COMBINE c. MERGE d. no option is necessary
a. FORCE The FORCE option is required in this program because the DATA= data set, Two, has more variables than the BASE= data set, One. COMBINE and MERGE are not valid options in the APPEND procedure. If base= and data= datasets swapped, no option would be necessary since PROC APPEND does not read base data set.
For which join can two different styles of PROC SQL syntax be used? a. Inner join b. Left join c. Right join d. Full join
a. Inner join
Given the following SAS data sets One and Two: One Obs Common X 1 A 10 2 A 13 3 A 14 4 B 9 Two Obs Common Y 1 A 1 2 A 3 3 B 4 4 B 2 The following SAS program is submitted: data combine; merge one two; by common; run; What data values are stored in the data set Combine? a. Obs Common X Y 1 A 10 1 2 A 13 3 3 A 14 3 4 B 9 4 5 B 9 2 b. Obs Common X Y 1 A 10 1 2 A 13 3 3 B 9 4 c. Obs Common X Y 1 A 10 1 2 A 13 3 3 B 14 4 4 B 9 2 d. Obs Common X Y 1 A 10 1 2 A 13 1 3 A 14 1 4 A 10 3 5 A 13 3 6 A 14 3 7 B 9 4 8 B 9 2
a. Obs Common X Y 1 A 10 1 2 A 13 3 3 A 14 3 4 B 9 4 5 B 9 2 The MERGE statement in the DATA step processes data sequentially. The first observation of data set One is combined with the first observation of data set Two because the variable Common has a value of A in both data sets. The second observation of the data set One is combined with the second observation of the data set Two because the variable Common has a value of A in both data sets. Because there is an additional observation in the data set Two where the variable Common has a value of A, the program data vector is not reinitialized; the value of X is retained from observation two, and the values of Common and Y are read from the data set Two. Thus, there are three observations in the data set Combine where the variable Common has a value of A. The third observation of the data set One is combined with the fourth observation of the data set Two because the variable Common has a value of B in both data sets. Finally, the fourth observation of the data set One is combined with the last observation of the data set Two because the variable Common has a value of B in both data sets. Thus, there are two observations in the data set Combine where the variable Common has a value of B.
Table One has five million observations. Table Two has one thousand observations. These tables have identical column attributes. Concatenating tables One and Two should result in 5,001,000 observations. Which one of the following SAS techniques uses the least CPU time and fewest I/O operations to process? a. the APPEND procedure b. the SET statement in the DATA step c. the INSERT INTO statement in the SQL procedure d. the OUTER UNION CORR operator in the SQL procedure
a. the APPEND procedure does not read base data set, only reads data set that is being added on
Multidimensional Arrays
array like a table columns by rows SAS stores as single row though can be used for table lookups
Which of the following is a feature of using a hash object? a. It uses consecutive integers to address elements of the hash object b. It does not require the data to be sorted or indexed c. It exists outside the DATA step and can be used with any PROC step d. None of the above
b. It does not require the data to be sorted or indexed
Which of the following is an advantage of a DATA step join? a. Data sets do not have to be sorted or indexed b. There is no limit to the number of input data sets c. There must be a common key variable in all the data sets d. Complex business logic can be hidden from the user
b. There is no limit to the number of input data sets.
Which set operator is equivalent to stacking (aka concatenating) data sets with a SET statement in a DATA step? a. OUTER UNION b. OUTER UNION ALL c. OUTER UNION CORR d. None of the above
c. OUTER UNION CORR
FILENAME
can be used to associate a single raw data file with a fileref or concatenate multiple raw data files into one fileref FILENAME <fileref> ('rawfile1' 'rawfile2' ...);
PROC TRANSPOSE
creates an output data set by restructuring the values in a SAS data set. When the data set is restructured, selected variables are transposed into observations. can be used to alter one data set to combine with another
Which SQL join will return all rows generated by an inner join plus additional rows? a. Left join b. Right join c. Full join d. All of the above
d. All of the above
Which join will return all rows generated by an inner join plus additional rows? a. Left join b. Right join c. Full join d. All of the above
d. All of the above
Which set operator does not overlay columns by default? a. EXCEPT b. INTERSECT c. UNION d. OUTER UNION
d. OUTER UNION
Which one of the following is a characteristic of a hash object in a SAS DATA step? a. The hash object requires the data to be sorted. b. The hash object requires the data to be indexed. c. The data contained in the hash object can only be loaded from a SAS data set. d. The hash object can contain character or numeric data or it can contain a combination of both character and numeric data
d. The hash object can contain character or numeric data or it can contain a combination of both character and numeric data The DATA step hash object does not require that the data be sorted or indexed. The hash object can be loaded with constants by using the ADD method; therefore, it is not a requirement that the data contained in the hash object be loaded from a SAS data set. The hash object data items and key item can be character, numeric, or a combination of both.
Given the following SAS data sets Class1 and Class2: Class1 Name Course Lauren MATH1 Patel MATH1 Chang MATH1 Class2 Name Course Smith MATH2 Farmer MATH2 Patel MATH2 Hillier MATH2 The following SAS program is submitted: PROC SQL; SELECT name FROM class1 <insert procedure operator here> SELECT name from class2; QUIT; The following output is desired: Name Patel What is the SQL procedure operator that completes the program and produces the desired output?
intersect
FILENAME statement
load raw data can specify more than one dataset
Table Lookup Relationships
one-to-one match one-to-many match many-to-many match (might want to use multiple set statements, might need do loop) nonmatching data: dense match (almost all match), sparse match (many unmatched observations)
INFILE
read in raw file(s) INFILE <file-specification> <FILEVAR= <variable>>; FILEVAR= option: names a variable whose change in value causes the INFILE statement to close the current input file and open a new input file.
Table Lookups
same as horizontal join base table is primary source, look up tables are input sources, look up values are the data from the look up table may require lots of sorting - could be easier in SQL
Stacking data sets in a DATA step
specify multiple SAS datasets in SET statement use FILENAME with INFILE for raw data -not as efficient as PROC APPEND for stacking 2 data sets -more flexible than PROC APPEND -FILEVAR= : specify a variable that tells SAS to move to the next file -END= : help SAS recognize the last record in a file
Vertical Set Operations in PROC SQL
stacking data sets typically tables have similar content SQL evaluates in order of: INTERSECT, OUTER UNION, UNION, EXCEPT Exception is if using parentheses (like order of operations)