SAS Statistics, SAS Visual Analytics, SAS DataFlux
row-based business rule called Monitor for Nulls
A Data Quality Steward creates these items for the Supplier repository: • A row-based business rule called Monitor for Nulls • A set-based business rule called Percent of Verified Addresses • A group-based rule called Low Product Count • A task based on the row-based, set-based, and group-based rules called Monitor Supplier Data Which one of these can the Data Quality Steward apply in an Execute Business Rule node in a data job? A. set-based business rule called Percent of Verified Addresses B. row-based business rule called Monitor for Nulls C. group-based rule called Low Product Count D. task based on the row-based, set-based, and group-based rules called Monitor Supplier Data
proc logistic data = MYDIR.EMPLOYMENT descending; class Education (param=ref ref='3'); model Hired = Salary Education; run;
A Human Resource manager fits a logistic regression model with the following characteristics: • binary target Hired • continuous predictor Salary • categorical predictor Education (levels=1,2,3) The default odds ratio compares each level against the last class level for the variable Education. Which SAS program gives parameter estimates for Education that are consistent with the default odds ratios? A. proc logistic data = MYDIR.EMPLOYMENT descending; class Education (param=ref ref='3'); model Hired = Salary Education; run; B. proc logistic data = MYDIR.EMPLOYMENT descending; class Education; model Hired = Salary Education; run; C. proc logistic data = MYDIR.EMPLOYMENT descending; class Education (ref='3'); model Hired = Salary Education; run; D. proc logistic data = MYDIR.EMPLOYMENT descending; class Education Salary (param=ref ref='3'); model Hired = Salary Education; run;
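The link between reference-cell coding and the default odds ratios can be sketched numerically. The following Python sketch uses made-up parameter estimates (not output from the EMPLOYMENT data): under param=ref ref='3', each estimate is the log-odds difference between that Education level and level 3, so exponentiating an estimate yields the default odds ratio directly.

```python
import math

# Hypothetical parameter estimates under reference-cell coding with ref='3':
# each estimate is the log-odds difference between that level and level 3.
estimates = {"Education 1": -0.85, "Education 2": 0.40}

# The odds ratio versus the reference level is exp(estimate), which is why
# reference coding is "consistent with" the default odds ratios.
odds_ratios = {level: math.exp(b) for level, b in estimates.items()}

for level, orat in odds_ratios.items():
    print(level, round(orat, 3))
```

Under effect coding (the PROC LOGISTIC default), the estimates would instead measure differences from the average logit, so they would not exponentiate to the reference-level odds ratios.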
Token
A __________ is an "atomically semantic" component of a data value. In other words, _____________ represent the smallest pieces of a data value that have some distinct meaning. A. Token B. Data Value C. Data Object
Path
A ____________ in a Sankey diagram represents a distinct sequence of events. Each _________ in the diagram consists of one or more transactions. A. Path B. direction indicator C. result
Administrator
A ______________ has the Publish this collection for all users option in the Collections window. A. Publisher B. Administrator C. Report Developer
Residual
A ______________ is the difference between the observed value of the response and the predicted value of the response variable. A. ANOVA B. Mean C. Residual
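The residual definition above can be illustrated with a tiny Python sketch; the observations and the fitted line (y_hat = 2 + 0.5x) are hypothetical.

```python
# Residual = observed response minus predicted response.
# Hypothetical (x, y) observations and a hypothetical fitted line y_hat = 2 + 0.5*x.
observed = [(1, 2.7), (2, 2.9), (3, 3.6)]

def predict(x):
    return 2 + 0.5 * x

# One residual per observation: observed y minus predicted y.
residuals = [y - predict(x) for x, y in observed]
print(residuals)
```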
Tree Map The size of each tile represents either the summarization of a measure or the frequency that is displayed as a count or percent.
A ______________ visualization enables you to display a category or hierarchy as a set of rectangular tiles. A. Bar Chart B. Tree Map C. Heat Map
Collection
A _______________ is a set of fields that are selected from tables that are accessed from different data connections. A _______________ provides a convenient way for users to build a dataset using those fields. A __________________ can be used as an input source for a profile in Data Management Studio. A. Collection B. Data Connection C. Master Data Foundation
Hierarchy In many cases, the levels of a hierarchy are arranged with more general information at the top and more specific information at the bottom.
A ________________ is a defined arrangement of categorical data items based on parent-child relationships. A. Lineage B. Ordinal Process C. Hierarchy
Scatter Plot
A ___________________ visualization enables you to examine the relationship between numeric data items. A. Bar B. Scatter Plot C. Histogram
Stop List
A ____________________ is a table of words that you want to ignore in your text analysis. A. Document Collection B. Stop List
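The stop-list idea is easy to sketch in Python: a lookup of words to drop before analysis. The word list and sentence below are hypothetical.

```python
# A stop list is a lookup of words to ignore in text analysis (hypothetical entries).
stop_list = {"the", "a", "of", "and"}

text = "the quality of the data and the analysis"
# Keep only the words that are not in the stop list.
kept = [w for w in text.split() if w not in stop_list]
print(kept)
```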
Standardization A standardization definition has the following attributes: • is more complex than a standardization scheme • involves one or more standardization schemes • can also parse data and apply regular expression libraries and casing
A ________________________ scheme is a simple find-and-replace table that specifies how data values will be standardized. A. Data Search B. Standardization
Visualization
A ___________________ displays data values using one of several ___________________ types. ___________________ types include tables, charts, plots, geographic maps, and more. A _________________ can contain filters and other display properties. A. Data Source B. Visualization C. Exploration
Standardization A standardization scheme can be built from the profile report. When a scheme is applied, if the input data is equal to the value in the Data column, then the data is changed to the value in the Standard column. The standard value DataFlux was selected by the Scheme Builder because it was the permutation with the most occurrences in the profile report.
A _________________ scheme takes various spellings or representations of a data value and lists a standard way to consistently write this value. A. Build B. Standardization
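A standardization scheme behaves like a find-and-replace table, which can be sketched in Python with a dictionary; the Data/Standard pairs below are hypothetical examples, not an actual QKB scheme.

```python
# A standardization scheme as a simple find-and-replace table (hypothetical pairs):
# the key plays the role of the Data column, the value the Standard column.
scheme = {
    "dataflux": "DataFlux",
    "data flux": "DataFlux",
    "dataflux corp": "DataFlux",
}

def standardize(value: str) -> str:
    # If the input matches a Data entry (ignoring case and surrounding blanks),
    # return the Standard value; otherwise pass the input through unchanged.
    return scheme.get(value.strip().lower(), value)

print(standardize("DataFLUX"))
print(standardize("Acme"))
```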
Preview Previewing does not create the output. The output is physically created only when the job is executed.
A _______________ of a Data Output node does not show field name changes or deletions. This provides the flexibility to continue your data flow after a Data Output node. In addition, previewing a Data Output node does not create the output. You must run the data job to create the output. A. Export B. Import C. Preview
Transaction
A ______________ is a sequence of events that are associated with a specific ____________ identifier value. A. Process B. Transaction C. explanatory observation
Reference Reference source locations are registered on the Administration riser bar in DataFlux Data Management Studio. One reference source location of each type should be designated as the default.
A ______________ object is typically a database used by DataFlux Data Management Studio to compare user data to a reference source (for example, USPS Address Data). You cannot directly access or modify references. A. Data Source B. Reference
Diffogram The downward-sloping diagonal lines show the confidence intervals for the differences. The upward-sloping line is a reference line showing where the group means would be equal.
A _____________ can be used to quickly tell whether the difference between two group means is statistically significant. The point estimates for the differences between pairs of group means can be found at the intersections of the vertical and horizontal lines drawn at group mean values. A. Histogram B. Diffogram
Chop Tables Purpose: Extract individual words from a text string Editor: Chop Table Editor
A collection of character-level rules used to create an ordered word list from a string. For each character represented in the table, you can specify the classification and the operation performed by the algorithm. A. Schemes B. Chop Tables C. Phonetics Libraries D. Regex Libraries E. Vocabulary Libraries F. Grammars
Schemes Purpose: Standardize phrases, words, and tokens Editor: Scheme Builder
A collection of lookup tables used to transform data values to a standard representation. A. Schemes B. Chop Tables C. Phonetics Libraries D. Regex Libraries E. Vocabulary Libraries F. Grammars
Regex Libraries Purpose: Standardization, categorization, casing, and pattern identification Editor: Regex Library Editor
A collection of patterns that are matched against a text string (from left to right) for character-level cleansing and operations. A. Schemes B. Chop Tables C. Phonetics Libraries D. Regex Libraries E. Vocabulary Libraries F. Grammars
Phonetics Libraries Purpose: Phonetic (sound-alike) reduction of words Editor: Phonetics Editor
A collection of patterns that produce the same output string for input strings that have similar pronunciations or spellings. A. Schemes B. Chop Tables C. Phonetics Libraries D. Regex Libraries E. Vocabulary Libraries F. Grammars
Grammars Purpose: Identify patterns in word categories Editor: Grammar Editor
A collection of rules that represent extracted patterns of words in a given context. A. Schemes B. Chop Tables C. Phonetics Libraries D. Regex Libraries E. Vocabulary Libraries F. Grammars
Vocabulary Libraries Purpose: Categorize words Editor: Vocabulary Editor
A collection of words, each associated with one or more categories and likelihoods. A. Schemes B. Chop Tables C. Phonetics Libraries D. Regex Libraries E. Vocabulary Libraries F. Grammars
A Geography hierarchy data item may be assigned to the Geography role to provide drill down capability on the coordinate points.
A content developer builds a Geo Map visualization in SAS Visual Analytics Explorer and sets the Map style property to Coordinates. Which statement about this Geo Map visualization is true? A. A measure data item can be assigned to the Color role to control the color of the coordinate points. B. A Custom Geography data item cannot be assigned to the Geography role when Coordinates is set as the Map style property. C. A category data item can be assigned to the Group role to group the coordinates into regions. D. A Geography hierarchy data item may be assigned to the Geography role to provide drill down capability on the coordinate points.
The line chart was created using the automatic chart functionality.
A content developer created the visualization with a forecast shown above. Additional measures for scenario analysis cannot be added from the Roles tab. Why? A. Underlying factors are not available in the line chart visualization. B. The forecast option in a line chart does not allow scenario analysis. C. The visualization does not allow additional measures. D. The line chart was created using the automatic chart functionality.
Data explorations can be used for the following: • to identify data redundancies • to extract and organize metadata from multiple sources • to identify relationships between metadata • to catalog data by specified business data types and processes
A data exploration reads data from databases and assigns the fields in the selected tables to categories that are predefined in the Quality Knowledge Base (QKB). Data explorations perform this categorization by matching column names. You also have the option of sampling the data in the table to determine whether the data belongs to one of the specific categories in the QKB. A. Repository B. Data Collection C. Data Exploration
The portfolios differ significantly with respect to risk.
A financial analyst wants to know whether assets in portfolio A are more risky (have higher variance) than those in portfolio B. The analyst computes the annual returns (or percent changes) for assets within each of the two groups and obtains the following output from the GLM procedure: Which conclusion is supported by the output? A. Assets in portfolio A are significantly more risky than assets in portfolio B. B. Assets in portfolio B are significantly more risky than assets in portfolio A. C. The portfolios differ significantly with respect to risk. D. The portfolios do not differ significantly with respect to risk.
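The logic behind answer C can be sketched in Python. The folded F test compares the larger sample variance to the smaller one, so a significant result says only that the variances differ, not which group is riskier. The return values below are hypothetical, not the data behind the GLM output.

```python
# Folded F test idea: F = larger sample variance / smaller sample variance.
# Hypothetical annual returns for the two portfolios.
portfolio_a = [0.12, -0.05, 0.30, 0.08, -0.15]
portfolio_b = [0.06, 0.04, 0.07, 0.05, 0.06]

def variance(xs):
    # Sample variance with n - 1 in the denominator.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

va, vb = variance(portfolio_a), variance(portfolio_b)
# Folding the ratio makes the test two-sided: it detects a difference in
# either direction, which is why a significant result cannot say which
# portfolio is the riskier one.
f = max(va, vb) / min(va, vb)
print(round(f, 2))
```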
Business Rule Business rules are defined within a repository using the Business Rules Manager.
A formula, validation, or comparison that can be applied to a given set of data.Data must either pass or fail the business rule. A. Exception B. Business rule
proc glm data=SASUSER.MLR; class c1; model y = c1 x1 x1*x1 c1*x1 /solution; run;
A linear model has the following characteristics: • a dependent variable (y) • one continuous predictor variable (x1) including a quadratic term (x1*x1) • one categorical predictor variable (c1 with 3 levels) • one interaction term (c1 by x1) Which SAS program fits this model? A. proc glm data=SASUSER.MLR; class c1; model y = c1 x1 x1sq c1byx1 /solution; run; B. proc reg data=SASUSER.MLR; model y = c1 x1 x1sq c1byx1 /solution; run; C. proc glm data=SASUSER.MLR; class c1; model y = c1 x1 x1*x1 c1*x1 /solution; run; D. proc reg data=SASUSER.MLR; model y = c1 x1 x1*x1 c1*x1; run;
Plan: Discover
A quick inspection of your corporate data would probably find that it resides in many different databases, managed by many different systems, with many different formats and representations of the same data. This step of the methodology enables you to explore metadata to verify that the right data sources are included in the data management program. You can also create detailed data profiles of identified data sources so that you can understand their strengths and weaknesses. A. Plan: Define B. Plan: Discover C. Act: Design D. Act: Execute E. Monitor: Evaluate F. Monitor: Control
d. Odds Ratio e. Spearman Correlation
A researcher wants to measure the strength of an association between two binary variables. Which statistic(s) can he use? a. Hansel and Gretel Correlation b. Mantel-Haenszel Chi-Square c. Pearson Chi-Square d. Odds Ratio e. Spearman Correlation
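The odds ratio answer can be made concrete with a small Python sketch over a hypothetical 2x2 table (the counts are made up): the odds ratio is the cross-product ratio, and values far from 1 indicate a strong association between the two binary variables.

```python
import math

# Hypothetical 2x2 table of counts:
#             outcome=1  outcome=0
# group=1        40         10
# group=0        20         30
a, b, c, d = 40, 10, 20, 30

# Odds ratio = cross-product ratio; 1 means no association.
odds_ratio = (a * d) / (b * c)
# The log odds ratio is symmetric around 0, which makes it easier to compare
# associations in opposite directions.
log_or = math.log(odds_ratio)
print(odds_ratio, round(log_or, 3))
```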
Record rules
A sample of data has been clustered and found to contain many multi-row clusters. For each of these clusters, you want to choose a single record to represent the information in the cluster. Which type of rule do you use to determine a surviving record? Response: Record rules Business rules Clustering rules Field rules
Field rules
A sample of data has been clustered and found to contain many multi-row clusters. To construct a "best" record for each multi-row cluster, you need to select information from other records within a cluster. Which type of rule allows you to perform this task? A. Clustering rules B. Record rules C. Business rules D. Field rules
Data Collection A data collection has the following features: provides a convenient way to build a data source using desired fields can be used as an input source for profiles
A set of data fields from different tables in different data connections. A. Repository B. Data Collection
The difference in the logit between level 1 and the average of all levels
A variable coded 1, 2, 3, and 4 is parameterized with effect coding, with 2 as the reference level. The parameter estimate for level 1 tells you which of the following? a. The difference in the logit between level 1 and level 2 b. The odds ratio between level 1 and level 2 c. The difference in the logit between level 1 and the average of all levels d. The odds ratio between level 1 and the average of all levels e. Both a and b f. Both c and d
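Answer c can be verified with simple arithmetic. In this Python sketch the per-level logits are hypothetical numbers: under effect coding, each non-reference level's parameter equals that level's logit minus the average logit across all levels.

```python
# Hypothetical logits for the four levels of the variable.
logits = {1: -0.2, 2: 0.1, 3: 0.4, 4: -0.3}

# Effect coding measures each level against the average of all levels,
# not against the reference level.
grand_mean = sum(logits.values()) / len(logits)
effect_for_level_1 = logits[1] - grand_mean
print(effect_for_level_1)
```

Note that the choice of reference level (here, 2) only determines which level gets no parameter of its own; it does not change what the remaining parameters measure.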
Act: Execute
After business users establish how the data and rules should be defined, the IT staff can install them within the IT infrastructure and determine the integration method (real time, batch, or virtual). These business rules can be reused and redeployed across applications, which helps increase data consistency in the enterprise. A. Plan: Define B. Plan: Discover C. Act: Design D. Act: Execute E. Monitor: Evaluate F. Monitor: Control
Act: Design
After you complete the first two steps, this phase enables you to take the different structures, formats, data sources, and data feeds and create an environment that accommodates the needs of your business. At this step, business and IT users build workflows to enforce business rules for data quality and data integration. They also create data models to house data in consolidated or master data sources. A. Plan: Define B. Plan: Discover C. Act: Design D. Act: Execute E. Monitor: Evaluate F. Monitor: Control
False Role-based permissions provide the ability to view and add comments.
All users have the ability to add and view comments. True False
Exploration
An __________________ is a metadata object that accesses one or more data sources and contains one or more visualizations of the data. The visualizations, data sources, and property settings are saved as part of an ________________. A. Data Source B. Visualization C. Exploration
Adj R-Sq
An analyst has selected this model as a champion because it shows better model fit than a competing model with more predictors. Which statistic justifies this rationale? A. Adj R-Sq B. R-Square C. Error DF D. Coeff Var
ANOVA
Analysis of variance (ANOVA) is a statistical technique used to compare the means of two or more groups of observations or treatments. For this type of problem, you have the following: a continuous dependent variable, or response variable a discrete independent variable, also called a predictor or explanatory variable. A. CONOVA B. ANOVA
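The between-versus-within comparison at the heart of ANOVA can be computed by hand. The Python sketch below uses hypothetical treatment groups and builds the one-way F statistic from the two sums of squares.

```python
# One-way ANOVA by hand (hypothetical data for three treatment groups).
groups = {
    "treatment_1": [5.0, 6.0, 7.0],
    "treatment_2": [8.0, 9.0, 10.0],
    "treatment_3": [5.5, 6.5, 7.5],
}

all_values = [x for xs in groups.values() for x in xs]
grand_mean = sum(all_values) / len(all_values)

# Between-group sum of squares (df = k - 1): variation of group means
# around the grand mean.
ss_between = sum(len(xs) * ((sum(xs) / len(xs)) - grand_mean) ** 2
                 for xs in groups.values())
# Within-group sum of squares (df = N - k): variation of observations
# around their own group mean.
ss_within = sum((x - sum(xs) / len(xs)) ** 2
                for xs in groups.values() for x in xs)

k, n = len(groups), len(all_values)
# F = mean square between / mean square within; large F favors rejecting
# the hypothesis that all group means are equal.
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(round(f_stat, 3))
```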
Act
Analyzing and exploring the data sources can lead to the discovery of data quality issues. This phase is designed to create data jobs that cleanse, or correct, the data. This phase involves the following: • standardizing, parsing, and/or casing the data • correctly identifying types of data (identification analysis) • performing methods to remove duplicates from data sources or to join tables with no common key A. Plan B. Act C. Monitor
ANOVA
Assessing __________ Assumptions In many cases, good data collection designs can help ensure the independence assumption. Diagnostic plots from PROC GLM can be used to verify the assumption that the error is approximately normally distributed. PROC GLM produces a test of equal variances with the HOVTEST option in the MEANS statement. H0 for this hypothesis test is that the variances are equal for all populations. A. Equality B. ANOVA C. Variability
row-based business rule called Monitor for Nulls
Assume the following items are created for the Supplier repository: • A row-based business rule called Monitor for Nulls • A set-based business rule called Percent of Verified Addresses • A group-based rule called Low Product Count • A task based on the row-based, set-based, and group-based rules called Monitor Supplier Data Which one can you apply in a profile? A. row-based business rule called Monitor for Nulls B. group-based business rule called Low Product Count C. set-based business rule called Percent of Verified Addresses D. task based on the row-based, set-based, and group-based rules called Monitor Supplier Data
ANOVA
Assumptions for ______________ Observations are independent. Errors are normally distributed. All groups have equal error variances. A. Equality B. ANOVA C. Variability
ANOVA
Assumptions for _____________________ Observations are independent. Errors are normally distributed. All groups have equal error variances. A. Means B. ANOVA C. Medians
One Category • Multiple categories and measures • Geography and three or more measures • One or more categories and any number of measures or geographies
Bar Chart A. One Measure B. One Category C. One datetime category and any number of other categories or measures. D. Two measures E. Three or more measures F. Multiple categories and measures G. Geography and zero to two measures H. One or more categories and any number of geographies
Row-based rule
Business Rule that evaluates every row in a table? A. Row-based rule B. Set-based rule C. Group-based rule
Group-based rule
Business Rule that evaluates groups of data (for example, if data is grouped by product code, then the rules are evaluated for each product code)? A. Row-based rule B. Set-based rule C. Group-based rule
Set-based rule
Business Rule that evaluates the table as a whole? A. Row-based rule B. Set-based rule C. Group-based rule
SAS QKB for Product Data (PD)
Contains extraction, parsing, standardization, and pattern analysis definitions to handle the following attributes in generic product data: • brands/manufacturers • colors • dimensions • sizes • part numbers • materials • packaging terms and units of measurement A. SAS QKB for Contact Information (CI) B. SAS QKB for Product Data (PD)
Extraction
Extracts parts of the text string and assigns them to corresponding tokens for the specified data type. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization
The Totals Placement property has been set to After.
For this Cross tab visualization created in SAS Visual Analytics Explorer, which statement is true? A. The Indented property has been selected. B. The Show row totals property has been selected. C. The Totals Placement property has been set to After. D. Product line is the lowest level in the hierarchy.
A category data item is assigned to the Lattice Rows role and three data items are assigned to the Measures role.
For this Line chart visualization created in SAS Visual Analytics Explorer, how are data items assigned to roles? A. A hierarchy data item is assigned to the Category role and a datetime data item is assigned to the X-axis role. B. Three data items are assigned to the Group role and a datetime data item is assigned to the Measures role. C. A datetime data item is assigned to the Category role and a category data item is assigned to the Group role. D. A category data item is assigned to the Lattice Rows role and three data items are assigned to the Measures role.
Match
Generates match codes for text strings where the match codes denote a fuzzy representation of the character content of the tokens in the text string. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization
Geography and zero to two measures
Geo Map A. One Measure B. One Category C. One datetime category and any number of other categories or measures. D. Two measures E. Three or more measures F. Multiple categories and measures G. Geography and zero to two measures H. One or more categories and any number of geographies
Large wrist size is significantly different than small wrist size.
Given alpha=0.02, which conclusion is justified regarding percentage of body fat, comparing small (S), medium (M), and large (L) wrist sizes? A. Medium wrist size is significantly different than small wrist size. B. Large wrist size is significantly different than small wrist size. C. There is no significant difference due to wrist size. D. Large wrist size is significantly different than medium wrist size.
-2 and 2
Given the properties of the standard normal distribution, you would expect about 95% of the studentized residuals to be between which two values? a. -3 and 3 b. -2 and 2 c. -1 and 1 d. 0 and 1 e. 0 and 2 f. 0 and 3
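The -2 to 2 rule of thumb comes straight from the standard normal distribution, and can be checked in Python with the error function: the probability of falling within k standard deviations of the mean is erf(k / sqrt(2)).

```python
import math

# Fraction of a standard normal within +/- 2 standard deviations:
# Phi(2) - Phi(-2) = erf(2 / sqrt(2)).
within_2 = math.erf(2 / math.sqrt(2))
print(round(within_2, 4))   # about 0.95, hence the -2 to 2 rule of thumb
```

So if the model assumptions hold, roughly 5% of studentized residuals land outside that range by chance; many more than that suggests a problem with the model rather than with the data points.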
Gender Analysis
Guesses the gender of the individual in the text string. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization
Language Guess
Guesses the language of a text string. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization
Locale Guess
Guesses the locale of a text string. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization
High
High cardinality refers to columns with a large number of unique values. A. High B. Low C. Median
One Measure
Histogram A. One Measure B. One Category C. One datetime category and any number of other categories or measures. D. Two measures E. Three or more measures F. Multiple categories and measures G. Geography and zero to two measures H. One or more categories and any number of geographies
They both utilize an identification analysis definition from the Quality Knowledge Base.
How are the Field name analysis and Sample data analysis methods similar? They both require the same identification analysis definition from the Quality Knowledge Base. They both utilize an identification analysis definition from the Quality Knowledge Base. They both utilize a match definition from the Quality Knowledge Base. They both require the same match definition from the Quality Knowledge Base.
from the Tools menu
How do you access the Data Management Studio Options window? A. from the Tools menu B. from the Administration riser bar C. in the app.cfg file in the DataFlux Data Management Studio installation folder D. from the Information riser bar
Use a CLASS statement.
How do you get PROC TTEST to display the test for equal variance? A. Use the option EV. B. Request a plot of the residuals. C. Use a CLASS statement. D. Use the MEANS statement with a HOVTEST option.
10
How many observations did you find that might substantially influence parameter estimates as a group? a. 0 b. 1 c. 4 d. 5 e. 7 f. 10
Identification Analysis
Identifies the text string as referring to a particular predefined category. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization
D. proc reg data=SASUSER.MLR; model y = x1-x4; run;
Identify the correct SAS program for fitting a multiple linear regression model with dependent variable (y) and four predictor variables (x1-x4). A. proc reg data=SASUSER.MLR; model y = x1 x2 x3 x4 /solution; run; B. proc reg data=SASUSER.MLR; model y = x1; model y = x2; model y = x3; model y = x4; run; C. proc reg data=SASUSER.MLR; var y x1 x2 x3 x4; model y = x1-x4; run; D. proc reg data=SASUSER.MLR; model y = x1-x4; run;
Row-Based
In DataFlux, which rule type evaluates every row in a table? A. Row-Based B. Set-Based C. Group-Based
Group-Based
In DataFlux, which rule type evaluates groups of data (for example, if data is grouped by product code, then the rules are evaluated for each product code)? A. Row-Based B. Set-Based C. Group-Based
Set-Based
In DataFlux, which rule type evaluates the table as a whole (for example, evaluates 1000 rows as a set)? A. Row-Based B. Set-Based C. Group-Based
A bar chart is created if the Model property of the data item is set to Discrete, and a line chart is created if the Model property is set to Continuous.
In SAS Visual Analytics Explorer, when a date data item is dragged onto an Automatic Chart visualization either a bar chart or a line chart will be created. What determines the type of chart created? A. The format applied to the date data item determines the type of chart displayed. B. A bar chart is created if the Model property of the data item is set to Discrete, and a line chart is created if the Model property is set to Continuous. C. The properties associated with the automatic chart determines the type of chart displayed. D. A line chart is created if the Model property of the data item is set to Discrete, a bar chart is created if the Model property is set to Continuous.
Goal Seeking
In SAS Visual Analytics Explorer, which feature explores underlying factors by specifying target values for the forecast measure? A. Forecast Targeting B. Goal Analysis C. Scenario Analysis D. Goal Seeking
In the Data Properties tab, change the format of the data item to Year.
In SAS Visual Analytics, the data item Date displays the month, day, and year (MMDDYYYY). How does a content developer display only the year in visualizations or report objects? A. In the Data Properties tab, change the format of the data item to Year. B. Format the data item using the Roles tab in the right pane. C. Select Measure Details and change the format of the data item to Year. D. Right-click on the data item and select New Aggregated Measure.
Data Validation Node
In a data job, to filter rows of data for a specific field from a database table, which node would you select for optimal performance? A. Data Validation Node B. SQL Query Node C. External Data Provider Node D. Data Source Node
-.7563
In the Analysis of Maximum Likelihood table, using effect coding, what is the estimated logit for someone at IncLevel=2? a. -.5363 b. -.6717 c. -.6659 d. -.7563 e. Cannot tell from the information provided
Data Type
In the context of the QKB, a _______________ is an object that represents the semantic nature of some data value. A _____________ serves as a placeholder (or grouping) for metadata used to define data cleansing and data integration algorithms (called definitions). DataFlux provides many data types in the QKB, but you can also create your own. A. Data Object B. Data Type
Data Job Node The referenced data job (the one that is embedded using the Data Job (reference) node) must have an External Data Provider node as the input. Data is passed from the parent job to the referenced data job, processed, and returned to the flow in the parent job. The Data Job (reference) node is found in the Data Job grouping of nodes.
Is used to embed a data job within a data job. A. Data Node B. Data Job Node
One datetime category and any number of other categories or measures.
Line Chart A. One Measure B. One Category C. One datetime category and any number of other categories or measures. D. Two measures E. Three or more measures F. Multiple categories and measures G. Geography and zero to two measures H. One or more categories and any number of geographies
Specify different sensitivities for some or all the fields.
Match codes fields were generated based on these fields: NAME ADDRESS CITY STATE The Clustering node is "over matching". It is finding matches where there should NOT be matches. What can be done in the Match Codes node to prevent this "over matching"? Select the option "Lower sensitivity levels". Specify different sensitivities for some or all the fields. Nothing can be done from within the Match Codes node. Select the option "Remove over-matched values".
Parse
Parses a text string by attempting to understand which words or phrases should be associated with each token of the specified data type. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization
HOVTEST
Performs a test of homogeneity (equality) of variances. The null hypothesis for this test is that the variances are equal. Levene's test is the default. A. T-TEST B. HOVTEST C. EQUALTEST
Combine
Possible values of Diff type include the following: A record belongs to a set of records from one or more clusters in the left table that are combined into a larger cluster in the right table. A. Combine B. Divide C. Network
Divide
Possible values of Diff type include the following: A record belongs to a set of records in a cluster in the left table that is divided into two or more clusters in the right table. A. Combine B. Divide C. Network
Network
Possible values of Diff type include the following: A record belongs to a set of records that are involved in one or more different multirecord clusters in the left and right tables. A. Combine B. Divide C. Network
False
Predictor variables are assumed to be normally distributed in linear regression models. True False
Diagnostics
Produces a panel display of diagnostic plots for linear models? A. Diagnostics B. Hovtest
Metadata
Profiles are not stored as files, but as ____________. To run a profile via the command line, the Batch Run ID for the profile must be specified. A. Metadata B. Tokens
External Data Provider Node The External Data Provider node has the following characteristics: accepts source data from another job or from user input that is specified at run time can be used as the first node in a data job that is called from another job can be used as the first node in a data job that is deployed as a real-time service
Provides a landing point for source data that is external to the current job. A. External Data Provider Node B. External Data Job
Report Viewing Role
Provides commenting and personalization features, in addition to basic functionality. A. Basic Role B. Report Viewing Role C. Data Building Role D. Administration Role E. Theme Designer Admin Role F. Comments Admin Role G. Analysis Role
Basic Role
Provides functionality for guest access (if applicable) and entry-level users. Enables users to view reports in the Visual Analytics Viewer, but does not provide commenting or personalization features. A. Basic Role B. Report Viewing Role C. Data Building Role D. Administration Role E. Theme Designer Admin Role F. Comments Admin Role G. Analysis Role
Theme Designer Admin Role
Provides the ability to create custom themes using Theme Designer for Flex. A. Basic Role B. Report Viewing Role C. Data Building Role D. Administration Role E. Theme Designer Admin Role F. Comments Admin Role G. Analysis Role
Analysis Role
Provides the ability to create reports and explorations, in addition to report viewing functionality. If SAS Visual Statistics is licensed, provides the Build Analytical Model capability. A. Basic Role B. Report Viewing Role C. Data Building Role D. Administration Role E. Theme Designer Admin Role F. Comments Admin Role G. Analysis Role
Comments Admin Role
Provides the ability to delete and edit other users' comments. A. Basic Role B. Report Viewing Role C. Data Building Role D. Administration Role E. Theme Designer Admin Role F. Comments Admin Role G. Analysis Role
Data profiles provide the following benefits: improve understanding of existing databases aid in identifying issues early in the data management process, when they are easier and less expensive to manage help determine which steps need to be taken to address data problems enable you to make better business decisions about your data
Provides the ability to inspect data for errors, inconsistencies, redundancies, and incomplete information. A. Data Profile B. Data Collection
Administration Role
Provides the ability to perform tasks in the administrator, in addition to most other capabilities. A. Basic Role B. Report Viewing Role C. Data Building Role D. Administration Role E. Theme Designer Admin Role F. Comments Admin Role G. Analysis Role
Data Building Role
Provides the ability to prepare data, in addition to the analysis functionality. A. Basic Role B. Report Viewing Role C. Data Building Role D. Administration Role E. Theme Designer Admin Role F. Comments Admin Role G. Analysis Role
Extensible
Rules are no longer limited to well-known contact data. With the customization feature in Data Management Studio, you can create data-cleansing rules for any type of data. A. Fully Customizable B. Extensible C. Modifiable D. Efficient E. Flexible
Modifiable
Rules can be modified to appropriately address the needs of the enterprise and can be implemented across Data Management Studio modules. A. Fully Customizable B. Extensible C. Modifiable D. Efficient E. Flexible
Right-click on the category data item from the Data pane, and select Colors.
SAS Visual Analytics Explorer assigns colors dynamically to category values for grouped visualizations. How would a content developer specify a specific color for a category value? Response: Change the grouping style on the properties tab in the right pane. Right-click on the category data item from the Data pane, and select Colors. Right-click on the category data item from the Data pane, and select New Custom Category. Define a color-mapped value display rule for the category data item.
The odds of the event are 1.142 times greater for each one-thousand-dollar increase in salary.
Salary data are stored in 1000's of dollars. What is a correct interpretation of the estimate? A. The odds of the event are 1.142 times greater for each one-thousand-dollar increase in salary. B. The probability of the event is 1.142 times greater for each one-thousand-dollar increase in salary. C. The probability of the event is 1.142 times greater for each one-dollar increase in salary. D. The odds of the event are 1.142 times greater for each one-dollar increase in salary.
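To see what a "per one-unit" odds ratio means numerically, here is a small Python sketch (not SAS; the base odds of 0.50 are a made-up reference value for illustration):

```python
# Illustration only: an odds ratio of 1.142 per one-unit increase in
# salary, where salary is recorded in thousands of dollars.
odds_ratio = 1.142

base_odds = 0.50  # hypothetical odds of the event at a reference salary
odds_plus_1k = base_odds * odds_ratio        # salary + $1,000
odds_plus_5k = base_odds * odds_ratio ** 5   # salary + $5,000: odds scale multiplicatively

print(round(odds_plus_1k, 3))  # 0.571
```

Note that the effect is multiplicative on the odds, not additive on the probability, which is why options B and C are wrong.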
Three or more measures
Scatter Plot Matrix or Correlation Matrix A. One Measure B. One Category C. One datetime category and any number of other categories or measures. D. Two measures E. Three or more measures F. Multiple categories and measures G. Geography and zero to two measures H. One or more categories and any number of geographies
Two measures
Scatter Plot or Heat Map A. One Measure B. One Category C. One datetime category and any number of other categories or measures. D. Two measures E. Three or more measures F. Multiple categories and measures G. Geography and zero to two measures H. One or more categories and any number of geographies
age, body temperature, gas mileage, income The continuous variables are age, body temperature, gas mileage, and income.
Select the choice that lists only continuous variables. a. body temperature, number of children, gender, beverage size b. age, body temperature, gas mileage, income c. number of children, gender, gas mileage, income d. gender, gas mileage, beverage size, income
KERNEL
Superimposes kernel density estimates on the histogram. A. NORMAL B. EXTENDED C. KERNEL
SAS QKB for Contact Information (CI)
Supports management of commonly used contact information for individuals and organizations, such as names, addresses, company names, and phone numbers. A. SAS QKB for Contact Information (CI) B. SAS QKB for Product Data (PD)
Multiple
The Allow generation of _____________ matchcodes per definition option requires the creation of a special match definition in the QKB. A. Single B. Multiple
Validation The Data Validation node is in the Utilities grouping of nodes.
The Data _________________node is used to filter or flag rows according to the specified condition(s). A. Import B. Validation C. Output
Field Match
The Field Match report displays a list of the fields in metadata that match a selected field's name. A. Field Name B. Field Relationship C. Field Match
NULL
The Generate null match codes for blank field values option generates a ____________match code if the field is blank. If this option is not selected, then a match code of all $ symbols is generated for the field. When you match records, a field with NULL does not equal another field with NULL, but a field with all $ symbols equals another field with all $ symbols. A. Preview B. Numeric C. NULL
the predicted value of the response when all predictors = 0.
The Intercept estimate is interpreted as: the predicted value of the response when all predictors are at their means. the predicted value of the response when all predictors are at their minimum values. the predicted value of the response when all the predictors are at their current values. the predicted value of the response when all predictors = 0.
Collection
The SAS Quality Knowledge Base (QKB) is a _______________ of files that store data and logic that define data management operations. A. Collection B. Repository
LASR
The SAS ______________ Analytic Server is an analytic platform that provides a secure, multiuser environment for concurrent access to data that is loaded into memory. The SAS _______________Analytic Server enables the following: persistence of data in memory for a distributed environment superfast analytic operations on data reduced start-up times for distributed computing multiple users to access the same in-memory data in a secure manner A. MBR B. LASR C. DDAD
False
The STEPWISE, BACKWARD, and FORWARD strategies result in the same final model if the same significance levels are used in all three. A. True B. False
Surviving Record Identification
The Surviving Record Identification (SRI) node examines clustered data and determines a surviving record for each cluster. A. Entity Resolution B. Surviving Record Identification
Match
The ____________ Report node produces a report listing the duplicate records identified by the match criteria. ______________ reports are displayed with a special report viewer. A. Match B. Clustering
Table
The ______________Match report displays a list of database tables that contain matching fields for a selected table or field. A. Field B. Identification C. Table
Correlation
The _______________ matrix visualization enables you to use a matrix of rectangular cells to view the degree of statistical correlation between multiple measures. A. Lateral B. Relationship C. Correlation
Bubble
The _______________ plot visualization enables you to explore the relationship between three measures. Two measures determine the ______________ placement and the third measure determines the _________ size. A. Scatter B. Bubble C. Segmentation
Clustering
The ________________ node enables the specification of an output ______________ ID field and specifications of _____________ conditions. A. Match B. Clustering
Heat Map
The ________________ visualization enables you to display the distribution of values for two data items using colored cells. A. Box B. Box and Whisker C. Heat Map
Quality Knowledge Base (QKB)
The _________________ is a collection of files and configuration settings that contain all the DataFlux Data Management algorithms. A. Collections Repository B. Quality Knowledge Base (QKB)
Entity Resolution
The __________________ File enables you to manually review the merged records and make adjustments as necessary. This can involve the following tasks: examining clusters reviewing the Cluster Analysis section reviewing related clusters processing cluster records editing fields for surviving records A. Entity Resolution B. Surviving Record Identification
Master Data Foundation
The __________________ feature in Data Management Studio uses master data projects and entity definitions to develop the best possible record for a specific resource, such as a customer or a product, from all of the source systems that might contain a reference to that resource. A. Collection B. Data Connection C. Master Data Foundation
Cluster Diff
The __________________ node is used to compare two sets of clustered records by reading in data from a left and a right table. From each table, the ______________________ node takes two inputs: a numeric record ID field and a cluster number field. A. Cluster Group B. Cluster Diff
Data Source
The ____________________ Details window displays information about the number of rows and columns in the data source and the number used for the exploration. A. Data Source B. Explorer
Identification
The _____________________Analysis report displays a list of fields in metadata that match categories in the identification analysis definitions specified for field name and sample data analysis. A. Field B. Identification C. Table
Field Relationship
The ______________________ map provides a visual presentation of the field relationships between all of the databases, tables, and fields that are included in the data exploration. A. Field Name B. Field Relationship C. Field Match
Execute Business Rule The Execute Business Rule Properties window allows for the specification of a Return status field, which flags records as either passing (True) or failing (False) the business rule. Not selecting the Return status field will pass only records that pass the business rule to the next node.
The _________________________ node applies an existing, row-based business rule to the rows of data as they flow through a data job. Records either pass or fail the selected rule. A. Execute Business Rule B. Business Rules
Histogram Chart
The __________________visualization enables you to view the distribution of values for a single measure. A. Bar Chart B. Line Chart C. Histogram Chart
Line
The _______________chart visualization enables you to view data trends over time. A. Bar B. Line C. Histogram
Sankey
The _______________diagram visualization enables you to perform path analytics to display flows of data from one event (value) to another as a series of paths. A. Linked B. Network C. Sankey
Network
The ______________diagram visualization enables you to view the relationships between category values as a series of linked nodes. A. Linked B. Neural C. Network
Clustering
The ______________node provides the ability to match records based on multiple conditions. Create conditions that support your business needs. A. Match B. Clustering
Ranks
The ______________tab enables you to view, create, and edit ranks to subset the data in the visualization. A _____________selects either the top (greatest) or the bottom (least) aggregated value for a category. A. Ranks B. Explorer
Outliers
The ______________tab lists the X minimum and maximum value outliers. The number of listed minimum and maximum values is specified when the data profiling metrics are set. A. Frequency Distribution B. Frequency Pattern C. Outliers
Bar
The _____________chart visualization enables you to compare data that is aggregated by the distinct values of a category. A. Bar B. Line C. Histogram
Box
The _____________plot visualization enables you to view information about the variability of data and the extreme data values. The size and location of the _______________ indicate the range of values that are between the 25th and 75th percentile. A. Box B. Segmentation C. Sankey
Roles
The ____________tab enables you to view the roles and data item assignments for the selected visualization. A. Data Source B. Explanatory C. Roles
Cardinality
The actual chart depends on the ____________ of the data. A. Type B. Source C. Cardinality
Element, Phrase
The analysis of an individual field can be counted as a whole (phrase) or based on each one of the field's elements. For example, the field value DataFlux Corporation is treated as two permutations if the analysis is set as Element, but is treated only as one permutation if the analysis is set as Phrase. A. Element, Phrase B. Phrase, Element
Constant variance, because the interquartile ranges are different in different ad campaigns.
The box plot was used to analyze daily sales data following three different ad campaigns. The business analyst concludes that one of the assumptions of ANOVA was violated. Which assumption has been violated and why? A. Constant variance, because Prob > F < .0001. B. Normality, because Prob > F < .0001. C. Constant variance, because the interquartile ranges are different in different ad campaigns. D. Normality, because the interquartile ranges are different in different ad campaigns.
Linear
The defining feature of _____________ models is the __________ function of the explanatory variables. A. Linear B. Logistic
Text Analytics
The definition below describes which type of data analytics? Analyzes each value in a document collection as a text document that can contain multiple words. Words that often appear together in the document collection are identified as topics. A. Correlation B. Fit Line C. Forecasting D. Text Analytics
Correlation The strength of a correlation is described as a number between -1 and 1.
The definition below describes which type of data analytics? Identifies the degree of statistical relationship between measures. A. Correlation B. Fit Line C. Forecasting D. Text Analytics
Fit Line A fit line plots a model of the relationship between measures. You can add a fit line to a scatter plot or heat map by using the pop-up menu or the Fit Line option on the Properties tab in the Right pane.
The definition below describes which type of data analytics? Plots a model of the relationship between measures. A. Correlation B. Fit Line C. Forecasting D. Text Analytics
Forecasting
The definition below describes which type of data analytics? Predicts future values based on the statistical trends in your data. A. Correlation B. Fit Line C. Forecasting D. Text Analytics
Measure
The definition describes which type of data classification? Numeric items whose values are used in computations. Measures can be calculated or aggregated. A. Category B. Geography C. Measure D. Hierarchy
Geography
The definition describes which type of data classification? Special role to identify types of geographical information for mapping. A. Category B. Geography C. Measure D. Hierarchy
Category
The definition describes which type of data classification? Used to group and aggregate measures. Categories contain alphanumeric or datetime values. New category data items can be calculated. A. Category B. Geography C. Measure D. Hierarchy
Hierarchy
The definition describes which type of data classification? Used to navigate through the data. Hierarchies are based on category or geography values. A. Category B. Geography C. Measure D. Hierarchy
-j
The dmpexec command can be used to execute profiles and data jobs from the command line. Which command executes the job in the specified file? A. -j <file> B. -l <file> C. -c <file> D. -i <file> E. -b <file> F. -o <file>
-o
The dmpexec command can be used to execute profiles and data jobs from the command line. Which command overrides settings in configuration files? A. -j <file> B. -l <file> C. -c <file> D. -i <file> E. -b <file> F. -o <file>
-c
The dmpexec command can be used to execute profiles and data jobs from the command line. Which command reads the configuration from the specified file? A. -j <file> B. -l <file> C. -c <file> D. -i <file> E. -b <file> F. -o <file>
-i
The dmpexec command can be used to execute profiles and data jobs from the command line. Which command specifies job input variables? A. -j <file> B. -l <file> C. -c <file> D. -i <file> E. -b <file> F. -o <file>
-b
The dmpexec command can be used to execute profiles and data jobs from the command line. Which command specifies job options for the job being run? A. -j <file> B. -l <file> C. -c <file> D. -i <file> E. -b <file> F. -o <file>
-l
The dmpexec command can be used to execute profiles and data jobs from the command line. Which command writes the log to the specified file? A. -j <file> B. -l <file> C. -c <file> D. -i <file> E. -b <file> F. -o <file>
Monitor: Control
The final stage in a data management project involves examining any trends to validate the extended use and retention of the data. Data that is no longer useful is retired. The project's success can then be shared throughout the organization. The next steps are communicated to the data management team to lay the groundwork for future data management efforts. A. Plan: Define B. Plan: Discover C. Act: Design D. Act: Execute E. Monitor: Evaluate F. Monitor: Control
7
The following SAS Code is submitted: proc reg data=sashelp.fish; model weight=length1 height width / selection=adjrsq; run; How many possible subset models will be assessed by SAS? A. 6 B. 8 C. 5 D. 7
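The count follows because all-subsets selection with k candidate effects evaluates every non-empty subset, 2^k - 1 models. A quick Python check of the combinatorics (illustration only, not SAS):

```python
from itertools import combinations

predictors = ["length1", "height", "width"]

# All-subsets selection (e.g. SELECTION=ADJRSQ) scores every non-empty
# subset of the candidate predictors.
subsets = [c for r in range(1, len(predictors) + 1)
           for c in combinations(predictors, r)]

print(len(subsets))  # 2**3 - 1 = 7
```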
Administration
The locations of the Quality Knowledge Base files are registered on the _________________ riser bar in DataFlux Data Management Studio. There can only be one active QKB at a time. A. Collections B. Folders C. Administration
Data Job
The main way to process data in DataFlux Data Management Studio. Each ____________ specifies a set of data-processing operations that flow from source to target. A. Command B. Routine C. Data Job
60
The maximum number of measures that can be displayed in a correlation matrix is ____? A. 12 B. 50 C. 60
DFFITS and CooksD only The variable Summary_i compresses the indicator variables RStud_i, DFits_i, and CookD_i into a single variable, with values in the order shown in the assignment statement that defines Summary_i. Therefore, the Summary_i value 011 means that the RStudent value did not exceed the cutoff, but the values for DFFITS and CooksD did. Review: Looking for Influential Observations, Part 1, Looking for Influential Observations, Part 2
The observation below is from the data set InfluentialBF. Obs Summary_i Case PredictedValue RStudent DFFITS CutDFFits CooksD CutCooksD 1 011 39 44.8580 -2.6312 -1.5941 0.80322 0.496 0.12903 Assume that these assignment statements were used in creating the data set: CutDFFits=2*(sqrt(&numparms/&numobs)); CutCooksD=4/&numobs; RStud_i=(abs(RStudent)>3); DFits_i=(abs(DFFits)>CutDFFits); CookD_i=(CooksD>CutCooksD); Summary_i=compress(RStud_i||DFits_i||CookD_i); For which statistics did this observation exceed the cutoff criteria? a. RStudent, DFFITS, and CooksD b. RStudent and DFFITS only c. RStudent and CooksD only d. DFFITS and CooksD only
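The flag logic from the DATA step can be replayed in a few lines of Python (illustration only; the statistic values and cutoffs are copied from the printed observation rather than recomputed):

```python
# Values copied from the printed observation (Obs 1).
rstudent, dffits, cooksd = -2.6312, -1.5941, 0.496
cut_dffits, cut_cooksd = 0.80322, 0.12903  # CutDFFits, CutCooksD from the table

# Translate the DATA step indicator variables: 1 means the cutoff was exceeded.
rstud_i = int(abs(rstudent) > 3)           # RStud_i
dfits_i = int(abs(dffits) > cut_dffits)    # DFits_i
cookd_i = int(cooksd > cut_cooksd)         # CookD_i

summary_i = f"{rstud_i}{dfits_i}{cookd_i}"
print(summary_i)  # "011": only DFFITS and CooksD exceed their cutoffs
```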
Physical The command line to execute the data job could be similar to the following: call dmpexec -j "D:\Workshop\dqdmp1\Demos\files\batch_jobs\Ch4D2_Products_Misc.ddf" -l "C:\Temp\log1.txt"
The physical path and filename of data jobs must be specified with the -j switch. A. Logical B. Physical
Plan: Define
The planning stage of any data management project starts with this essential first step. This is where the people, processes, technologies, and data sources are defined. Roadmaps that include articulating the acceptable outcomes are built. Finally, the cross-functional teams across business units and between business and IT communities are created to define the data management business rules. A. Plan: Define B. Plan: Discover C. Act: Design D. Act: Execute E. Monitor: Evaluate F. Monitor: Control
Mean
The predicted value in ANOVA is the group _________. A. Mean B. Median
Mean
The predicted value in ANOVA is the group _________. A. Mean B. Median C. Mode
P Value
The probability calculated from the data is called the ____________. A. P Value B. Expected Confidence
Predicted
The regression coefficients are just numbers and they are multiplied by the explanatory variable values. These products are then summed to get the individual's ______________ value. A. Expected B. Predicted
Monitor: Evaluate
This step of the methodology enables users to define and enforce business rules to measure the consistency, accuracy, and reliability of new data as it enters the enterprise. Reports and dashboards on critical data metrics are created for business and IT staff members. The information that is gained from data monitoring reports is used to refine and adjust the business rules. A. Plan: Define B. Plan: Discover C. Act: Design D. Act: Execute E. Monitor: Evaluate F. Monitor: Control
Case
Transforms a text string by changing the case of its characters to uppercase, lowercase, or proper case. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization
Pattern Analysis
Transforms a text string into a particular pattern. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization
Standardization
Transforms a text string into a standard format. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization
False Although every visualization has a common property, Name, most visualizations have additional unique properties.
True or False: All visualizations have exactly the same properties.
True
True or False: If a single value in a group of items needs to be changed, then select Edit > Modify Standards Manually > Single Instance. A single value can then be modified manually. To toggle back to the ability to change all instances in a group, select Edit > Modify Standards Manually > All Instances.
True
True or False: The following types of visualizations are not available to be included in your report: decision trees network diagrams Sankey diagrams treemaps that display additional levels word clouds visualizations that do not contain data geo maps that use a custom geographic data item
True
True or False: There are several ways that SAS Visual Analytics can be deployed, including the following: non-distributed deployment (single server) distributed deployment using a co-located data provider
True
True or False: When using values with high cardinality... Each visualization has a visualization data threshold that controls the amount of high-cardinality data that can be used. Filtering and grouping can be used to limit high-cardinality data. An error message might be displayed and the visualization not produced when the visualization data threshold is exceeded. Visualization data thresholds can be specified in the Preferences window and by an administrator.
True
True or False? Data standardization does not perform a validation of the data (for example, Address Verification). Address verification is a separate component of the DataFlux Data Management Studio application and is discussed in another section.
True
True or False? If you standardize a data value using both a definition and a scheme, the definition is applied first and then the scheme is applied.
True
True or False? Monitoring tasks are created by pairing a defined business rule with one or more events. Some available events include the following: call a realtime service execute a program launch a data flow job on a Management server log error to repository log error to text file raise an event on the process job (if hosted) run a local job run a local profile send email message set a data flow key or value write a row to a table
False A user can have only *ONE* instance of a QKB open at a time. Only one user can have an instance of a QKB open for editing. If another user tries to open the same instance, the user receives a message that he or she can open a Read-only copy. When Data Management Studio is closed, the QKB is also closed.
True or False? QKB Editing Rules: A user can have only two instances of a QKB open at a time. Only one user can have an instance of a QKB open for editing. If another user tries to open the same instance, the user receives a message that he or she can open a Read-only copy. When Data Management Studio is closed, the QKB is also closed.
True
True or False? Record-level rules select which record from a cluster should survive. If there is ambiguity about which record is the survivor, the first remaining record in the cluster is selected.
True Jobs and profiles developed with Data Management Studio can be uploaded to the Data Management Server. Jobs and profiles can be executed on this server, which is intended to be a more powerful processing system. Data Management Server needs access to a copy of the QKB and data packs that are used in the data jobs and profiles.
True or False? The DataFlux Data Management Server is an application server that supports web service requests through a service-oriented architecture (SOA) and executes profiles, data jobs, process jobs, and services on Windows, UNIX, or Linux servers.
True
True or False? The match code generation process consists of the following steps: 1. Data is parsed into tokens (for example, Given Name and Family Name). 2. Ambiguities and noise words are removed (for example, the). 3. Transformations are made (for example, Jonathon > Jon). 4. Phonetics are applied (for example, PH > F). 5. Based on the sensitivity selection, the following occurs: Relevant components are determined. A certain number of characters of the transformed, relevant components are used.
True
True or False? Tukey's HSD Test HSD=Honest Significant Difference This method is appropriate when you consider pairwise comparisons. The experimentwise error rate is equal to alpha when all pairwise comparisons are considered, and less than alpha when fewer than all pairwise comparisons are considered. Also known as the Tukey-Kramer Test
True
True or False? The following data items can be created in the Data pane: custom categories calculated items (unaggregated) aggregated measures derived items duplicate items geography data items document collection (text analytics) unique row identifier (text analytics)
Heat Map
Using SAS Visual Analytics Explorer, a content developer would like to examine the relationship between two measures with high cardinality. Which visualization should the developer use? A. Scatter Plot B. Heat Map C. Scatter Plot Matrix D. Treemap
within-group sample means
What are the "predicted values" that result from fitting a one-way analysis of variance (ANOVA) model? within-group sample variances between-group sample variances within-group sample means between-group mean differences
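A minimal Python sketch of this idea, using made-up group data (not SAS): a one-way ANOVA model predicts every observation with its own group's sample mean.

```python
# Hypothetical data: three groups of measurements.
data = {
    "A": [10.0, 12.0, 14.0],
    "B": [20.0, 22.0],
    "C": [5.0, 7.0, 9.0, 11.0],
}

# The fitted (predicted) value for each group is its within-group sample mean.
predicted = {g: sum(v) / len(v) for g, v in data.items()}

print(predicted)  # {'A': 12.0, 'B': 21.0, 'C': 8.0}
```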
No lower bound, No upper bound
What are the upper and lower bounds for a logit? a. Lower=0, Upper=1 b. Lower=0, No upper bound c. No lower bound, No upper bound d. No lower bound, Upper=1
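A quick numeric check in Python (illustration only): the logit, log(p/(1-p)), is 0 at p = 0.5 and diverges toward positive or negative infinity as p approaches 1 or 0, so it has no upper or lower bound.

```python
from math import log

def logit(p):
    """Log-odds of probability p, defined for 0 < p < 1."""
    return log(p / (1 - p))

print(logit(0.5))       # 0.0
print(logit(0.999999))  # roughly +13.8: grows without bound as p -> 1
print(logit(0.000001))  # roughly -13.8: falls without bound as p -> 0
```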
Automobile was removed from the CARS metadata.
What causes this window to display? Response: Report object Bar Chart 1 was built with multiple data sources. Automobile was removed from the CARS metadata. The List Table 1 report object was edited, but not saved. The CARS data source no longer exists.
The data job does NOT create an interactive report.
What is the biggest difference between creating a data profile versus creating a data job that incorporates data profile nodes? Response: The data profile does NOT allow you to turn metrics on or off. The data job does NOT allow you to turn metrics on or off. The data job does NOT create an interactive report. The data profile does NOT allow you to apply custom metrics.
Right Fielding node
What type of node would you add in a data job to achieve the results shown in the exhibit? Identification Analysis node Data Validation node Right Fielding node Field Layout node
Data Brushing
When a bar is selected in the bar chart, the markers in the scatter plot that correspond to the selected value in the bar are highlighted. This feature of SAS Visual Analytics Explorer is called: Response: Data Brushing File interaction Report level display rules Conditional Highlighting
Target
When choosing Output Field settings, which of the options sends all fields available to target nodes to the target? A. Target B. Source and Target C. All
All
When choosing Output Field settings, which of the options specifies All available fields are passed through source nodes, target nodes, and all intermediate nodes. A. Target B. Source and Target C. All
Source and Target
When choosing Output Field settings, which of the options specifies All fields available to a source node are passed to the next node and all fields available to target nodes are passed to the target. A. Target B. Source and Target C. All
dmserver.cfg
When configuring options for the Data Management Server...which config file describes the settings below? DMSERVER/SOAP/LISTEN_PORT= PORT specifies the TCP port number where the server will listen for SOAP connections. DMSERVER/LOGCONFIG_PATH= PATH specifies the path to the logging configuration file. A. app.cfg B. dmserver.cfg
app.cfg
When configuring options for the Data Management Server...which config file describes the settings below? QKB/PATH = PATH specifies the location of the active Quality Knowledge Base. VERIFY/USPS = PATH specifies the location of USPS reference source. VERIFY/GEO = PATH specifies the location of Geo/Phone reference source. A. app.cfg B. dmserver.cfg
Lower
When creating folders, it is best practice to set folder names in _____________ with no spaces. A. Lower B. Upper
Correlation Matrix
When creating scatter plots...Three or more measures with high cardinality generate which type of graph? A. Scatter Plot B. Scatter Plot Matrix C. Heat Map D. Correlation Matrix
Scatter Plot Matrix
When creating scatter plots...Three or more measures with low cardinality generate which type of graph? A. Scatter Plot B. Scatter Plot Matrix C. Heat Map D. Correlation Matrix
Heat Map
When creating scatter plots...Two measures with high cardinality generate which type of graph? A. Scatter Plot B. Scatter Plot Matrix C. Heat Map D. Correlation Matrix
Scatter Plot
When creating scatter plots...two measures with low cardinality generate which type of graph? A. Scatter Plot B. Scatter Plot Matrix C. Heat Map D. Correlation Matrix
Root Mean Square Error
When forecasting...which criterion is used to determine the best model? A. Default B. Average Square Error C. Root Mean Square Error
4 GB
When importing files from your local machine, you are limited to a file size of __ GB or less. This limitation is introduced by web browsers. A. 2 GB B. 4 GB C. 8 GB
Batch Jobs
When importing to a Data Management Server, Each defined Data Management Server has a series of predefined folders. Selecting ____________ (for example) enables the Import tool in the navigation area, as well as in the main information area. A. Data Jobs B. Batch Jobs
Link
When modeling a categorical variable, which function is used? A. Link B. Logit
Link
When modeling an interval variable, which function is used? A. Link B. Logit
ABANDONED
When parsing, which term best describes the description below? A resource limit was reached. Increase your resource limit and try again. A. OK B. NO SOLUTION C. NULL D. ABANDONED
NULL
When parsing, which term best describes the description below? The parse operation was not attempted. This result occurs only when a null value was in the field to be parsed and the Preserve null values option was enabled. A. OK B. NO SOLUTION C. NULL D. ABANDONED
OK
When parsing, which term best describes the description below? The parse operation was successful. A. OK B. NO SOLUTION C. NULL D. ABANDONED
NO SOLUTION
When parsing, which term best describes the description below? The parse operation was unsuccessful; no solution was found. A. OK B. NO SOLUTION C. NULL D. ABANDONED
Linear
When reviewing the Fit Line results...which of the terms describes the definition below? Creates a linear fit line from a linear regression algorithm. A linear fit line produces the straight line that best represents the relationship between two measures. For a linear fit, correlation information is automatically added to the visualization. A. Best Fit B. Linear C. Quadratic D. Cubic E. PSpline
PSpline
When reviewing the Fit Line results...which of the terms describes the definition below? Creates a penalized B-spline fit. A penalized B-spline is a smoothing spline that fits the data closely. A penalized B-spline can display a complex line with many changes in its curvature. A. Best Fit B. Linear C. Quadratic D. Cubic E. PSpline
Quadratic
When reviewing the Fit Line results...which of the terms describes the definition below? Produces a line with a single curve. A quadratic fit line produces a line with the shape of a parabola. A. Best Fit B. Linear C. Quadratic D. Cubic E. PSpline
Cubic
When reviewing the Fit Line results...which of the terms describes the definition below? Produces a line with two curves. A cubic fit line often produces a line with an "S" shape. A. Best Fit B. Linear C. Quadratic D. Cubic E. PSpline
Best Fit
When reviewing the Fit Line results...which of the terms describes the definition below? Tests the cubic, quadratic, and linear fit methods against your data and selects the fit method that produces the best result. The Best Fit method uses backward selection to select the highest-order model that is significant. To see which fit method was used, select the information icon from the visualization legend. A. Best Fit B. Linear C. Quadratic D. Cubic E. PSpline
- 2 Log L increased.
When selecting variables or effects using SELECTION=BACKWARD in the LOGISTIC procedure, the business analyst's model selection terminated at Step 3. What happened between Step 1 and Step 2? A. DF increased. B. AIC increased. C. Pr > Chisq increased. D. - 2 Log L increased.
Preserve
When standardizing, selecting _________________ null values ensures that if a field is null when it enters the node, then the field is null after being output from the node. It is recommended that this option be selected if the output is written to a database table. A. Import B. Preserve C. Archive
Low
When the p-value is ____________, it provides doubt about the truth of the null hypothesis. A. High B. Low
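To make the p-value intuition concrete, here is a small Python sketch (hypothetical z statistics against a standard-normal reference) showing that a larger test statistic yields a smaller p-value, and therefore more doubt about the null hypothesis:

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic z:
    P(|Z| > |z|) computed via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2))

# A large |z| (strong evidence against H0) gives a small p-value ...
print(two_sided_p(3.0))   # well below 0.05 -> doubt about H0
# ... while a small |z| gives a large p-value (no evidence against H0).
print(two_sided_p(0.5))
```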
The simplest model with the best performance on the validation data
When using honest assessment, which of the following would be considered the best model? a. The simplest model with the best performance on the training data b. The simplest model with the best performance on the validation data c. The most complex model with the best performance on the training data d. The most complex model with the best performance on the validation data
Metadata
When you select Create a New Collection, you need to specify the _______________ location where the collection should be stored. A. Report B. Metadata
Parse Definition
Which Quality Knowledge Base (QKB) definition type is used by almost every other definition type? Response: Case Definition Parse Definition Standardization Definition Match Definition
Explorer
Which SAS Visual Analytics feature provides the below: an enhanced decision tree visualization, which includes interactive training and model assessment information a linear regression visualization, which creates predictive models for measure variables a logistic regression visualization, which creates predictive models for category variables a generalized linear model visualization, which creates predictive models for measure variables a cluster visualization, which segments the input data into clusters model comparison, which compares two or more predictive models A. Report Viewer B. Explorer C. Summary
C. proc GLMSELECT data=SASUSER.MLR; model y = x1-x10 /selection=backward select=bic; run;
Which SAS program will correctly use backward elimination with BIC selection criterion within the GLMSELECT procedure? A. proc GLMSELECT data=SASUSER.MLR; model y = x1-x10 /select=backward choose=bic; run; B. proc GLMSELECT data=SASUSER.MLR; model y = x1-x10 /select=backward selection=bic; run; C. proc GLMSELECT data=SASUSER.MLR; model y = x1-x10 /selection=backward select=bic; run; D. proc GLMSELECT data=SASUSER.MLR; model y = x1-x10 /selection=backward choose=bic; run;
Inner join and Address=Address
Which join type and expression would you use to join records from two different sources to find only matching records of the same address? Response: Left join and Address=Address Full join and ID=ID Right join and Address=Address Inner join and Address=Address
Expression Engine Language
Which language do you use to create a business rule using the Expression tab? Expression Builder Language Expression Engine Language Expression Monitor Language Expression Rule Language
Export the exploration as an image.
Which method is NOT used to share Explorations from SAS Visual Analytics Explorer? Response: Export the exploration as a report. Export the exploration as a PDF. Email a link to the exploration. Export the exploration as an image.
None of the above
Which of the following assumptions does collinearity violate? a. Independent errors b. Constant variance c. Normally distributed errors d. None of the above
SQL
Which of the statements below describes this querying method? The data generated for both the __________query and the filter have the same results. The filter pulled all records. The filter was processed on the machine where the profile was run. The database does the filtering for the ________ query. A. Filtering B. SQL
Condition matched field prefix
Which option in the properties of a Clustering node allows you to identify which clustering condition was satisfied? A. Condition matched field prefix B. Cluster condition field matched C. Cluster condition field count D. Cluster condition met field
records that have null or missing company fields
Which records will pass to the next node in the data job flow? records that have null or missing company fields no records records that do not have null or missing company fields all records
Data types are comprised of one or more tokens.
Which statement describes the relationship between data types and tokens? Data types are comprised of one or more tokens. Data types and tokens are interchangeable. Tokens are comprised of one or more data types. There is no relationship between these two items.
Gender should not be removed due to its involvement in the significant interaction.
Which statement is correct at an alpha level of 0.05? School should be removed because it is significant. Gender should not be removed due to its involvement in the significant interaction. School*Gender should be removed because it is non-significant. Gender should be removed because it is non-significant.
Facility Opening Date (Day) can be used outside of the hierarchy after selecting it in the
Which statement is true about Facility Opening Date (Day)? Response: Facility Opening Date (Day) values can only be used as the lowest level of the hierarchy. Facility Opening Date (Day) can be used outside of the hierarchy after selecting it in the Because it is part of a hierarchy, the format for Facility Opening Date (Day) cannot be changed. As a member of a hierarchy, Facility Opening Date (Day) is a virtual data item without properties.
Files and definitions
Which two types of items comprise the Quality Knowledge Base (QKB)? Files and repository Definitions and reference data sources Files and reference data sources Files and definitions
Group
Which type of business rule do you create to check for Countries that produce less than three items? Group Column Row Set
R Square
Which value tends to increase (can never decrease) as you add predictor variables to your regression model? A. R square B. Adjusted R square C. Mallows' Cp D. Both a and b E. F statistic F. All of the above
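The monotone behavior of R-square can be demonstrated with a small pure-Python least-squares fit (toy data invented for illustration; the second predictor is deliberately unrelated to y):

```python
def ols_r2(X, y):
    """R-square from an ordinary least squares fit of y on the columns
    of X (an intercept column of 1s is added automatically)."""
    n = len(y)
    A = [[1.0] + list(row) for row in X]
    k = len(A[0])
    # Normal equations (A'A) b = A'y, solved by Gauss-Jordan elimination.
    M = [[sum(A[i][r] * A[i][c] for i in range(n)) for c in range(k)]
         + [sum(A[i][r] * y[i] for i in range(n))] for r in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    b = [M[r][k] / M[r][r] for r in range(k)]
    yhat = [sum(c * v for c, v in zip(b, row)) for row in A]
    ybar = sum(y) / n
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1 - sse / sst

# Toy data: y depends on x1; x2 is pure noise.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [3, 1, 4, 1, 5, 9]   # unrelated "predictor"
y  = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

r2_one = ols_r2([[a] for a in x1], y)
r2_two = ols_r2(list(zip(x1, x2)), y)
print(r2_one, r2_two)
# Adding even a junk predictor leaves R-square equal or higher, never lower.
```

This is why adjusted R-square, Mallows' Cp, or information criteria are preferred when comparing models of different sizes.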
Crosstab and Table
Which visualization enables you to sort the data in columns? A. Crosstab B. Table C. None
File
Within the Data Mgmt Studio Repository, the ___________storage of a repository can contain the following: data jobs process jobs match reports entity resolution files queries entity definitions other files A. Data B. File
Data
Within the Data Mgmt Studio Repository, the ___________storage of a repository can contain the following: explorations and reports profiles and reports business rules monitoring results custom metrics business data information master data information A. Data B. File
%
Within the Standardization Scheme, which of these commands provides an indicator specifying that the matched word or phrase is not updated? A. //Remove B. %
//Remove
Within the Standardization Scheme, which of these commands removes the matched word or phrase from the input string? A. //Remove B. %
Define
Within the _______________ methodology, there are four main functions which can be used: Connect to Data Explore Data Define Business Rules Build Schemes A. Define B. Discover
The errors are independent, normally distributed with zero mean and constant variance.
Y = B0 + B1X + E Which statement best summarizes the assumptions placed on the errors? A. The errors are correlated, normally distributed with constant mean and zero variance. B. The errors are correlated, normally distributed with zero mean and constant variance. C. The errors are independent, normally distributed with constant mean and zero variance. D. The errors are independent, normally distributed with zero mean and constant variance.
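A simulation sketch of those assumptions in Python (arbitrary parameter values; seeded so the run is reproducible): the errors are drawn independently of X, with zero mean and constant variance.

```python
import random

random.seed(1)  # reproducible illustration
B0, B1, SIGMA = 10.0, 2.0, 1.5

# Simulate Y = B0 + B1*X + E with E ~ N(0, SIGMA), independent of X.
xs = [i / 10 for i in range(1000)]
es = [random.gauss(0, SIGMA) for _ in xs]
ys = [B0 + B1 * x + e for x, e in zip(xs, es)]

mean_e = sum(es) / len(es)
var_e = sum((e - mean_e) ** 2 for e in es) / (len(es) - 1)
print(mean_e, var_e)  # close to 0 and SIGMA**2 = 2.25
```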
Parsing node
You are creating a data job to apply a data cleansing process to an input data field containing city, state and postal code data. You would like to create individual fields from the components of the data values, with the resulting data being written into individual fields for City, State/Province and Postal Code. Which node would you use to accomplish this result? Right Fielding node Parsing node Identification Analysis node Standardization node
Standardization Scheme Regular Expression library
You are working with the data in the exhibit below that represents peoples' last name (or surname). You would like to ensure the proper casing is applied to the data using a Case definition. MacAlister MacDonald McCarthy McDonald McNeill Which two Quality Knowledge Base (QKB) file components can you use within the Case definition to accomplish this task? (Choose two.) Vocabulary Standardization Scheme Regular Expression library Phonetics library
Flexible
You can customize rules to conform to the ever-changing business environment regardless of your data needs. A. Fully Customizable B. Extensible C. Modifiable D. Efficient E. Flexible
Efficient
You can dramatically reduce manual data manipulation time by simply updating cleansing rules. It is much easier to manipulate reusable data-cleansing rules than to manually manipulate the data itself. A. Fully Customizable B. Extensible C. Modifiable D. Efficient E. Flexible
Data Collection
You can use _____________ to group data fields from different tables, database connections, or both. These collections can be used as input data sources for profiles. A. Repository B. Data Collection C. Data Exploration
Select the Collections riser on the Report tab, then select the collection name, right-click and select Profile Field.
You create an exploration that results in a collection of five similar fields across five disparate tables. Afterwards, what do you do to check the collection for null values and frequency distributions from the exploration? Response: Select the Report tab, then select Actions on the menu bar, and select Profile Collection Fields. Select the Collections riser on the Report tab, then select the collection name, right-click and select Profile Field. Select the Properties tab and check the Profile Collection Fields check box. Select the Report tab, then select Tools on the menu bar, and select the Profile Collection Fields.
Fully Customizable
You have full control of data-cleansing rules across the enterprise and through time. A. Fully Customizable B. Extensible C. Modifiable D. Efficient E. Flexible
Create a Data Source Name (DSN) in the operating system. Create a data connection in DataFlux Data Management Studio.
You need to run a profile analysis against a data table. Which two methods can you use to access the data table in DataFlux Data Management Studio? (Choose two.) Select Open Table from the File menu in DataFlux Data Management Studio. Create a Data Source Name (DSN) in the operating system. Create a data connection in DataFlux Data Management Studio. Create a library definition in DataFlux Data Management Studio.
From the Tools menu, select Other QKB Editors and select the appropriate editor. From the Administration riser bar, open the QKB that contains the standardization scheme.
You need to update a standardization scheme. Which two ways can you access the appropriate editor for the standardization scheme in the Quality Knowledge Base (QKB)? (Choose two.) From the Tools menu, select Other QKB Editors and select the appropriate editor. From the Folders riser bar, access the repository that contains the standardization scheme. From the Administration riser bar, open the QKB that contains the standardization scheme. From the File menu, select Edit and select the appropriate editor. From the Data riser bar, select the Data Connection that contains the standardization scheme.
Parse
_____________ definitions define rules to place the words from a text string into the appropriate tokens. A. Parse B. Text C. Case
Field Name
______________ analysis analyzes the names of each field from the selected data sources to determine which identity to assign to the field. A. Identification B. Field Name C. Sample Data
Case
______________ definitions are algorithms that can be used to convert a text string to uppercase, lowercase, or proper case. A. Parse B. Text C. Case
Roles SAS Visual Analytics is shipped with five predefined roles. Visual Analytics: Administration Visual Analytics: Analysis Visual Analytics: Basic Visual Analytics: Data Building Visual Analytics: Report Viewing
________________ are mapped to capabilities. A capability, also known as an application action, defines the operations that a user can perform. A. Roles B. Levels C. Measurements
Address Verification
________________ identifies, corrects, and enhances address information. A. Address Validation B. Address Verification
Dunnett
________________ method is recommended when there is a true control group. When appropriate (when a natural control category exists, against which all other categories are compared) it is more powerful than methods that control for all possible comparisons. A. Levene B. Tukey C. Dunnett
Forecasting A forecast adds a line with a predicted value and a colored band that represents the confidence interval. Forecasting is available only for line charts that include a datetime data item. No forecasting is available if data items are assigned to the Group, Lattice columns, or Lattice rows roles. The forecasting duration (in intervals) can be selected on the Properties tab in the Right pane. The default duration is six intervals.
________________ predicts future values based on the statistical trends in your data. A. Forecasting B. Predictive Modeling
Sample Data
_________________ analysis analyzes a sample of data in each field to determine which identity to assign to the field. A. Identification B. Field Name C. Sample Data
Data Connection
__________________ are used to access data in jobs, profiles, data explorations and data collections. A. Collection B. Data Connection C. Master Data Foundation
Goal Seeking
__________________ enables you to specify a target value for your forecast measure to determine the values of the underlying factors that are required to achieve that value. A. Scenario B. Goal Seeking
Entity Resolution
__________________ is the process of merging duplicate records in a single file or multiple files so that records referring to the same physical object are treated as a single record. Records are matched based on the information that they have in common. The records that are merged might appear to be different, but can actually refer to the same person or item. A. Entity Match B. Entity Resolution C. Match Entity
Automatic
The ____________________ chart is the default visualization type. A. Manual B. Default C. Automatic
Data Exploration
______________________ have the following types of analysis methods: field name matching field name analysis sample data analysis A. Repository B. Data Collection C. Data Exploration
Document Collection When using word clouds with text analytics, you can choose to analyze the document sentiment.
____________________________ is a category data item that contains the words that you want to analyze. A. Document Collection B. Metadata
Visualizations
__________________________ that have no data items assigned to required roles are not available to include in your PDF output. A. Images B. Text C. Visualizations
Scenario
_______________analysis enables you to forecast hypothetical scenarios by specifying the future values for one or more underlying factors that contribute to the forecast. A. Scenario B. Goal Seeking
Geocoding Geocoding latitude and longitude information can be used to map locations and plan efficient delivery routes. Geocoding can be licensed to return this information for the centroid of the postal code or at the roof-top level. Currently, there are only geocoding data files for the United States and Canada. Also, roof-top level geocoding is currently available only for the United States.
_______________enhances address information with latitude and longitude values. A. Geo Validation B. Geocoding
Identification, Right
______________ analysis and ___________ fielding use the same definitions from the QKB, but in different ways. ______________ analysis identifies the type of data in a field, and __________ fielding moves the data into separate fields based on its identification. Both the ___________ analysis and _________ fielding examples above use the Contact Info identification analysis definition. A. Identification, Right B. Right, Identification
GLM
ods graphics; proc _________ data=STAT1.ameshousing3 plots=diagnostics; class Heating_QC; model SalePrice=Heating_QC; means Heating_QC / hovtest=levene; format Heating_QC $Heating_QC.; title "One-Way ANOVA with Heating Quality as Predictor"; run; quit; A. SGPLOT B. SGSCATTER C. GLM
PLOTS= FREQPLOT
requests a frequency plot. Frequency plots are available for frequency and crosstabulation tables. For multiway crosstabulation tables, PROC FREQ provides a two-way frequency plot for each stratum (two-way table). A. PLOTS= FREQPLOT B. PLOTS= FREQUENCY
95% You want to be as confident as possible, but if you increase the confidence level too much, you risk confidence bounds of negative and positive infinity.
A 95% confidence interval represents a range of values within which you are _______certain that the true population mean exists. A. 5% B. 95%
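A minimal Python sketch of that interval (made-up data; uses the large-sample z value 1.96 rather than a t critical value, which for a small sample would widen the interval slightly):

```python
import math
import statistics

data = [98.2, 99.1, 97.8, 98.6, 99.4, 98.9, 98.1, 99.0, 98.4, 98.7]

mean = statistics.mean(data)
se = statistics.stdev(data) / math.sqrt(len(data))

# Large-sample 95% interval: mean +/- 1.96 * standard error.
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"95% CI for the mean: ({lower:.3f}, {upper:.3f})")
```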
GLM
PROC __________DATA=SAS-data-set PLOTS=options; CLASS variables; MODEL dependents=independents </ options>; MEANS effects </ options>; LSMEANS effects </ options>; OUTPUT OUT=SAS-data-set keyword=variable...; RUN; QUIT; A. SGPLOT B. SGSCATTER C. GLM
Gaussian
A Normal Distribution bell curve is also known as a ___________ distribution? A. Gaussian B. Expected
One-Sided
A _____________ t-test compares the mean calculated from a sample to a hypothesized mean. The null hypothesis of the test is generally that the difference between the two means is zero. A. One-Sided B. Two-Sided
a. a PROC GLMSELECT step that contains the SCORE statement b. a PROC PLM step that contains the SCORE statement and references an item store that was created in PROC GLMSELECT c. a PROC PLM step with the CODE statement that writes the score code based on item store created in PROC GLMSELECT, and a DATA step that scores the data d. any of the above Any of these approaches can be used to score data based on the model built by PROC GLMSELECT. Review: Methods of Scoring
A department store is deploying a chosen model to make predictions for an upcoming sales period. They have the necessary data and are ready to proceed. Which of the following methods can be used for scoring? a. a PROC GLMSELECT step that contains the SCORE statement b. a PROC PLM step that contains the SCORE statement and references an item store that was created in PROC GLMSELECT c. a PROC PLM step with the CODE statement that writes the score code based on item store created in PROC GLMSELECT, and a DATA step that scores the data d. any of the above
Straight
A linear association between two continuous variables can be inferred when the general shape of a scatter plot of the two variables is a __________ line. A. Straight B. Curved
Standard Error
A statistic that measures the variability of your estimate is the ___________ of the mean. A. Variability B. Standard Error
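In code, the standard error of the mean is just the sample standard deviation divided by the square root of n (toy numbers for illustration):

```python
import math
import statistics

sample = [12, 15, 11, 14, 13, 16, 12, 15]
s = statistics.stdev(sample)        # sample standard deviation
se = s / math.sqrt(len(sample))     # standard error of the mean

# se shrinks as 1/sqrt(n): larger samples give more precise estimates.
print(s, se)
```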
Model 1 Models 1 and 3 are better than Model 2 because they have lower values of AIC and SC. Of those two, Model 1 has the higher value of the c statistic, so it is the best of the three models. Review: Comparing the Binary and Multiple Logistic Regression Models, Fitting a Binary Logistic Regression Model
According to the goodness-of-fit statistics shown below, which multiple logistic regression model would be the best to use? Statistic Model 1 Model 2 Model 3 AIC 501.5 520.4 501.5 SC 501.5 520.4 501.5 c 0.675 0.675 0.655 a. Model 1 b. Model 2 c. Model 3
yes, Hip and Abdomen Hip and Abdomen both have p-values lower than .05, so they are statistically significant in predicting or explaining the variability of the percentage of body fat. Review: Performing Simple Linear Regression, Analysis versus Prediction in Multiple Regression, Fitting a Multiple Linear Regression Model
According to these parameter estimates, are any of the variables in the model statistically significant in predicting or explaining the percentage of body fat? Parameter Estimates Variable DF Parameter Estimate Standard Error t Value Pr > |t| Intercept 1 -20.98714 5.55433 -3.78 0.0002 Age 1 0.01226 0.02836 0.43 0.6658 Hip 1 -0.40163 0.09994 -4.02 <.0001 Abdomen 1 0.86123 0.06814 12.64 <.0001 a. no b. yes, Age c. yes, Hip and Abdomen d. yes, Age, Hip, and Abdomen
a fairly strong, negative linear relationship The correlation coefficient for the relationship between Performance and RunTime is -0.82049, which is negative. It is also close to 1, making it a relatively strong relationship. Review: Using Correlation to Measure Relationships between Continuous Variables
Based on this correlation matrix, what type of relationship do Performance and RunTime have? Pearson Correlation Coefficients, N = 31 Prob > |r| under H0: Rho=0 Performance RunTime Age Performance 1.00000 -0.82049 <.0001 -0.71257 <.0001 RunTime -0.82049 <.0001 1.00000 0.19523 0.2926 Age -0.71257 <.0001 0.19523 0.2926 1.00000 a. a fairly strong, positive linear relationship b. a fairly strong, negative linear relationship c. a fairly weak, positive linear relationship d. a fairly weak, negative linear relationship
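The correlation coefficient itself is straightforward to compute; the following Python sketch uses invented run-time/performance pairs that mimic the negative relationship above:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy data: as run time goes up, performance goes down.
run_time = [8.6, 9.1, 10.0, 10.8, 11.5, 12.9]
performance = [55, 52, 48, 45, 44, 38]
print(pearson_r(run_time, performance))  # strongly negative, near -1
```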
NORMAL
Creates a normal probability plot. Options (MU= SIGMA=) determine the mean and std deviation of the normal distribution used to create reference lines(normal curve overlay in HISTOGRAM and diagonal reference line in PROBPLOT). A. NORMAL B. EXTENDED
POSITION=NE
Determines the position of the inset. The position is a compass point keyword, a margin keyword, or a pair of coordinates. You can specify coordinates in axis percent units or axis data units. The default value is NW. A. POSITIONPLOT B. POSITION=NE
median The median is not affected by outliers and is less affected by the skewness. The mean, on the other hand, averages in any outliers that might be in your data.
For an asymmetric (or skewed) distribution, which of the following statistics is a good measure for the middle of the data? a. mean b. median c. either mean or median
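A short Python demonstration of why: an outlier drags the mean but leaves the median untouched (numbers invented for illustration):

```python
import statistics

symmetric = [10, 12, 14, 16, 18]
skewed = [10, 12, 14, 16, 180]  # one extreme outlier

# Mean and median agree on the symmetric data ...
print(statistics.mean(symmetric), statistics.median(symmetric))
# ... but the outlier pulls the mean far from the middle of the data,
# while the median stays at 14.
print(statistics.mean(skewed), statistics.median(skewed))
```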
STEPWISE The summary table contains both Variable Entered and Variable Removed columns. Of the three types of stepwise selection (forward, backward, and stepwise), only stepwise selection can both enter and remove variables. Therefore, STEPWISE must have been specified in the PROC REG step. Review: The Stepwise Selection Approach to Model Building, The GLMSELECT Procedure, The GLMSELECT Procedure: Performing Stepwise Regression
Given the information in this summary of variable selection, which stepwise selection method was specified in the PROC REG step? Step Variable Entered Variable Removed Number Vars In Partial R-Square Model R-Square C(p) F Value Pr > F 1 RunTime 1 0.7434 0.7434 3.3432 84.00 <.0001 2 Age 2 0.0213 0.7647 2.8192 2.54 0.1222 a. FORWARD b. BACKWARD c. STEPWISE d. can't tell from the information given
no The p-value of 0.2942 is greater than 0.05, so you fail to reject the null hypothesis and conclude that the variances are equal. Review: The GLM Procedure
Given this SAS output, is there sufficient evidence to reject the assumption of equal variances? a. yes b. no
yes The p-value of <.001 is less than 0.05, so you would reject the null hypothesis and conclude that the means between the two groups are significantly different. Review: Examining the Equal Variance t-Test and p-Values
Given this SAS output, is there sufficient evidence to reject the hypothesis of equal means? a. yes b. no
36.1680 and 52.3021 The CLI option, which displays the 95% CL Predict column in the Output Statistics table, produces confidence limits for an individual predicted value. In this table, the third observation, for Kate, contains the value 55 for Performance. Therefore, the values in her 95% CL Predict column are the lower and upper confidence limits for a new individual value at the same value of Performance. In contrast, the CLM option displays the values in the 95% CL Mean column, which are the lower and upper confidence limits for a mean predicted value for each observation. Review: Specifying Confidence and Prediction Intervals in SAS, Viewing and Printing Confidence Intervals and Prediction Intervals, The REG Procedure: Producing Predicted Values
Here is a table of output statistics from PROC REG. If you sample a new value of the dependent variable when Performance equals 55, what are the lower and upper prediction limits for this newly sampled individual value? Output Statistics Obs Name Performance Dependent Variable Predicted Value Std Error Mean Predict 95% CL Mean 95% CL Predict Residual 1 Jack 48 40.8400 44.9026 1.0190 42.0732 47.7319 37.4190 52.3861 -4.0626 2 Annie 43 45.1200 45.3793 1.3081 41.7475 49.0112 37.5570 53.2016 -0.2593 3 Kate 55 44.7500 44.2351 1.4885 40.1023 48.3678 36.1680 52.3021 0.5149 4 Carl 40 46.0800 45.6654 1.6493 41.0862 50.2446 37.3608 53.9700 0.4146 5 Don 58 44.6100 43.9490 1.8646 38.7719 49.1261 35.3003 52.5977 0.6610 6 Effie 45 47.9200 45.1886 1.1361 42.0343 48.3429 37.5763 52.8009 2.7314 a. 44.7500 and 44.2351 b. 40.1023 and 48.3678 c. 36.1680 and 52.3021 d. can't tell from the information given
the most parsimonious model The most parsimonious model is selected. The most parsimonious model is the simplest, least complex of the candidate models. Review: Building a Predictive Model
Honest assessment might generate multiple candidate models that have the same (or nearly the same) validation assessment values. In this situation, which model is selected? a. the model that has the highest variance when it is applied to the population b. the model that has the most terms c. the most parsimonious model d. the most biased model
the measure of the ability of the statistical hypothesis test to reject the null hypothesis when it is actually false Power is the ability of the statistical test to detect a true difference, or the ability to successfully reject a false null hypothesis. The probability of committing a Type I error is α. The probability of failing to reject the null hypothesis when it is actually false is a Type II error.
How do you define the term power? a. the measure of the ability of the statistical hypothesis test to reject the null hypothesis when it is actually false b. the probability of committing a Type I error c. the probability of failing to reject the null hypothesis when it is actually false
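To make the definition concrete, power can be computed directly for a simple case: an upper-tailed one-sample z-test with known sigma. This is a stdlib-Python sketch outside the SAS material, and the effect size, sigma, n, and alpha below are invented for illustration:

```python
from statistics import NormalDist
from math import sqrt

def ztest_power(delta, sigma, n, alpha=0.05):
    """Power of an upper-tailed one-sample z-test: the probability of
    rejecting H0 when the true mean exceeds the null value by `delta`."""
    z_crit = NormalDist().inv_cdf(1 - alpha)     # rejection cutoff under H0
    shift = delta * sqrt(n) / sigma              # true effect in SE units
    return 1 - NormalDist().cdf(z_crit - shift)  # P(test statistic > cutoff | H1)

print(round(ztest_power(delta=0.5, sigma=1.0, n=25), 3))  # ≈ 0.804
```

Note how power grows with the sample size: the same effect is easier to detect with more data, which is exactly the "ability to successfully reject a false null hypothesis."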
CLASS statement The CLASS statement names the grouping variable; PROC TTEST then compares the means of the two groups defined by that variable's levels.
How do you tell PROC TTEST that you want to do a two-sample t-test? a. SAMPLE=2 option b. CLASS statement c. GROUPS=2 option d. PAIRED statement
5 In Mallows' Cp criterion, p equals the number of variables in the model plus 1 for the intercept. Therefore, for these models, p equals 8, 9, or 10, depending on the number of terms in the model. All the C(p) values are less than their respective p values, so all five models meet Mallows' Cp criterion. Review: Evaluating Models Using Mallows' Cp Statistic, Viewing Mallows' Cp Statistic in PROC REG, The REG Procedure: Using the All-Possible Regressions Technique, The REG Procedure: Using Automatic Model Selection
How many of the following models meet Mallows' Cp criterion for model selection?
Model Index, Number in Model, C(p), R-Square, Variables in Model:
1, 7, 5.8653, 0.7445, Age Weight Neck Abdomen Thigh Forearm Wrist
2, 8, 5.8986, 0.7466, Age Weight Neck Abdomen Hip Thigh Forearm Wrist
3, 8, 6.4929, 0.7459, Age Weight Neck Abdomen Thigh Biceps Forearm Wrist
4, 9, 6.7834, 0.7477, Age Weight Neck Abdomen Hip Thigh Biceps Forearm Wrist
5, 7, 6.9017, 0.7434, Age Weight Neck Abdomen Biceps Forearm Wrist
a. 0 b. 1 c. 3 d. 5
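The criterion check is simple arithmetic: p is the number of terms in the model plus one for the intercept, and a model qualifies when C(p) is at most p. A quick Python cross-check using the values from the table above:

```python
# (model_index, number_in_model, cp) copied from the PROC REG output above
models = [(1, 7, 5.8653), (2, 8, 5.8986), (3, 8, 6.4929),
          (4, 9, 6.7834), (5, 7, 6.9017)]

# Mallows' criterion: C(p) <= p, where p = terms in model + 1 (intercept)
meets = [idx for idx, k, cp in models if cp <= k + 1]
print(meets)  # → [1, 2, 3, 4, 5]
```

All five C(p) values fall below their respective p of 8, 9, or 10, so all five models meet the criterion.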
T-Test
If you analyze the difference between two means using ANOVA, you reach the same conclusions as you would using a pooled, two-group _________. A. T-Test B. Analysis
Report the F value and possibly remove the blocking factor from future studies. Your only choice is to report the F value, and if you plan future studies, do not include the blocking variable. The blocking factor must be included in all ANOVA models that you calculate with the sample that you've already collected. Review: Performing ANOVA with Blocking
If your blocking variable has a very small F value in the ANOVA report, what would be a valid next step? a. Remove it from the MODEL statement and re-run the analysis. b. Test an interaction term. c. Report the F value and possibly remove the blocking factor from future studies.
tables Country Size Country*Size; You use the TABLES statement in PROC FREQ to create frequency and crosstabulation tables. In the TABLES statement, you separate table requests with a space. In a table request for a crosstabulation table, you specify an asterisk between the variable names. Review: Crosstabulation Tables
In a PROC FREQ step, which statement or set of statements creates a frequency table for Country, a frequency table for Size, and a crosstabulation table for Country by Size? a. tables Country, Size, Country*Size; b. tables Country*Size; c. tables Country | Size; d. tables Country Size Country*Size;
Populations
In inferential statistics, the focus is on learning about ______________. Examples of ___________ are all people with a certain disease, all drivers with a certain level of insurance, or all customers, both current and potential, at a bank. A. Populations B. Volumes
no The most complex model is not always the best choice. An overly complex model might be too flexible, which can lead to overfitting. Review: Model Complexity
In predictive modeling, is the most complex model the best choice? a. yes b. no
that the errors are normally distributed The Residuals versus Quantile plot is a normal quantile plot of the residuals. Using this plot, you can verify that the errors are normally distributed, which is one of our assumptions. Here the residuals follow the normal reference line pretty closely, so we can conclude that the errors are normally distributed. Review: The REG Procedure: Producing Default Diagnostic Plots
In the diagnostic plots below, what does the Residual versus Quantile plot indicate about the model? a. that the errors are normally distributed b. that the data set contains many influential observations c. that the model is inadequate because the spread of the residuals is less than the spread of the centered fit d. that the model is inadequate because patterns occur in the spread around the reference line
the SCORE= option The SCORE= option specifies the data set that contains the parameter estimates. PROC SCORE reads the parameter estimates from this data set, scores the observations in the data set that the DATA= option specifies, and writes the scored observations to the data set that the OUT= option specifies. Review: The SCORE Procedure: Scoring Predicted Values Using Parameter Estimates
In this PROC SCORE step, which option specifies the data set containing the parameter estimates that are used to score observations? proc score data=dataset1 score=dataset2 out=dataset3 type=parms; var Performance; run; a. the DATA= option b. the SCORE= option c. the OUT= option
UNIVARIATE
PROC _____________ DATA=SAS-data-set <options>; VAR variables; HISTOGRAM variables </ options>; INSET keywords </ options>; RUN; A. FREQ B. UNIVARIATE
FREQ PROC FREQ can generate large volumes of output as the number of variables or the number of variable levels (or both) increases.
PROC _____________ DATA=SAS-data-set; TABLES table-requests </ options>; RUN; A. FREQ B. UNIVARIATE
both parametric and non-parametric models Predictive models can be based on both parametric and non-parametric models. Review: What Is Predictive Modeling?
Predictive models can be based on which of the following? a. parametric models only b. non-parametric models only c. both parametric and non-parametric models
proc univariate data=statdata.sleep mu0=8; var hours; run; You specify the MU0= option as part of the PROC UNIVARIATE statement to indicate the test value of the null hypothesis. The alternative hypothesis is that μ is not equal to 8 hours, but this does not need to be specified in the PROC UNIVARIATE code.
Psychologists at a college want to know if students are sleeping more or less than the recommended average of 8 hours a day. Which of the following code choices correctly tests the null hypothesis? a. proc univariate data=statdata.sleep mu0<>8; var hours; run; b. proc univariate data=statdata.sleep; var hours / mu0=8; run; c. proc univariate data=statdata.sleep; var hours / mu0<>8; run; d. proc univariate data=statdata.sleep mu0=8; var hours; run;
The probability is .95 that the true average weight is between 15.02 and 15.04 ounces. A 95% confidence interval means that you are 95% confident that the interval contains the true population mean. If you sample repeatedly and calculate a confidence interval for each sample mean, 95% of the time your confidence interval will contain the true population mean. A confidence interval is not a probability. When a confidence interval is calculated, the true mean is in the interval or it is not. There is no probability associated with it.
Select the statement below that incorrectly interprets a 95% confidence interval (15.02, 15.04) for the population mean, if the sample mean is 15.03 ounces of cereal. a. You are 95% confident that the true average weight for a box of cereal is between 15.02 and 15.04 ounces. b. The probability is .95 that the true average weight is between 15.02 and 15.04 ounces. c. In the long run, approximately 95% of the intervals calculated with this procedure will capture the true average weight.
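The interval itself comes from the familiar xbar ± z*·SEM formula. A stdlib-Python sketch that reproduces an interval like the one in this question; the SEM value is invented so that the limits land near (15.02, 15.04):

```python
from statistics import NormalDist

def ci_mean(xbar, sem, level=0.95):
    """Large-sample confidence interval for the mean: xbar ± z* · SEM."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # z* ≈ 1.96 for 95%
    return xbar - z * sem, xbar + z * sem

# Hypothetical sample mean and standard error of the mean
lo, hi = ci_mean(xbar=15.03, sem=0.0051)
print(round(lo, 2), round(hi, 2))   # → 15.02 15.04
```

Remember the interpretation pitfall the question targets: the procedure captures the true mean in 95% of repeated samples; for any one computed interval, the true mean either is or is not inside it.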
A Cramer's V statistic that is close to 1 Cramer's V statistic is the only appropriate statistic to use in this example. When Cramer's V is close to 1, there is a relatively strong general association between two categorical variables. You cannot use an odds ratio because the predictor Type is not binary. You cannot use the Spearman correlation statistic because the predictor Type is not ordinal. Review: Cramer's V Statistic, Odds Ratios, The Spearman Correlation Statistic
Suppose you are analyzing the relationship between hot dog ingredients and taste. Which of the following statistics provides evidence of a relatively strong association between the variables Type (which has the values Beef, Meat, and Poultry) and Taste (which has the values Bad and Good)? a. A Cramer's V statistic that is close to 1 b. An odds ratio that is greater than 1 c. A Spearman correlation statistic that is close to 1
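For intuition about what "close to 1" means, Cramer's V can be computed by hand from a contingency table: V = sqrt(chi-square / (n · min(r-1, c-1))). The counts below are invented for a 3x2 Type-by-Taste table (in SAS you would get these statistics from PROC FREQ with the CHISQ option):

```python
from math import sqrt

def cramers_v(table):
    """Cramer's V from an r x c contingency table of counts."""
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    # Pearson chi-square: sum of (observed - expected)^2 / expected
    chi2 = sum((table[i][j] - rows[i] * cols[j] / n) ** 2
               / (rows[i] * cols[j] / n)
               for i in range(len(rows)) for j in range(len(cols)))
    k = min(len(rows), len(cols)) - 1
    return sqrt(chi2 / (n * k))

# Hypothetical counts: rows = Beef, Meat, Poultry; columns = Bad, Good
v = cramers_v([[20, 5], [6, 18], [4, 22]])
print(round(v, 2))  # ≈ 0.58, a fairly strong general association
```

V ranges from 0 (no association) toward 1 (strong association), and unlike the odds ratio it does not require a binary predictor.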
tables Rating*Grade / chisq measures; Both variables are ordinal and have logically-ordered values, so the Mantel-Haenszel test (for ordinal association) is a stronger test than the Pearson chi-square test (for general association) in this situation. The CHISQ option produces both the Pearson and Mantel-Haenszel statistics. The MEASURES option produces the Spearman correlation statistic, which measures the strength of an ordinal association. MHCHISQ is not a valid option, and the CLODDS= option is not a valid option in PROC FREQ. Review: The Mantel-Haenszel Chi-Square Test, The Spearman Correlation Statistic, Performing a Mantel-Haenszel Chi-Square Test of Ordinal Association
Suppose you are testing for an association between student ratings of teachers and student grades. The Rating variable has the values 1 (for poor), 2 (for fair), 3 (for good) and 4 (for excellent). The Grade variable has the values A, B, C, D, and F. Which of the following TABLES statements in PROC FREQ produces the appropriate chi-square statistics and measure of strength for these variables? a. tables Rating*Grade / chisq measures; b. tables Rating*Grade / chisq; c. tables Rating*Grade / mhchisq; d. tables Rating*Grade / mhchisq clodds=pl;
the equal variance assumption When a residuals plot displays a funnel shape, it indicates that the variance of the residuals is not constant. That is, the variance increases toward the wide end of the "funnel." This shows you that your model violates the equal variance assumption. Review: Verifying Assumptions Using Residual Plots
Suppose you have a residuals plot that shows a funnel shape for the residuals, such as in the plot below. Which assumption of linear regression is being violated? a. the linearity assumption b. the independence assumption c. both the linearity assumption and the independence assumption d. the equal variance assumption e. both the linearity assumption and the equal variance assumption
proc plm restore=homestore; score data=new out=new_out; run; In PROC PLM, the RESTORE= option specifies the name of the item store. In the SCORE statement, the DATA= option specifies New as the data set that contains the observations to be scored. The OUT= option specifies that the scored results are saved in a data set named New_Out. Review: Scoring Data
Suppose you ran a PROC GLMSELECT step that saved the context and results of the statistical analysis in an item store named Homestore. Which of the following programs scores new observations in a data set named New and saves the predictions in a data set named New_Out? a. proc plm restore=homestore; score data=new out=new_out; run; b. proc plm restore=new; score data=homestore out=new_out; run; c. proc plm data=homestore; score data=new out=new_out; run; d. proc plm restore=homestore; model data=new out=new_out; run;
oddsratio Amount (units=20); oddsratio Frequency / diff=ref at (Meal=all); oddsratio Meal / diff=ref at (Frequency=all); You must specify the intervals of Amount in the UNITS statement, not in the ODDSRATIO statement. To calculate odds ratios for the two categorical variables as described, each of the two ODDSRATIO statements must set DIFF= to REF against all levels of the interacting variable. Review: The ODDSRATIO Statement, The UNITS Statement
Suppose you want to fit a multiple logistic regression model to determine how the method of administering a drug affects patients' response to the drug. The binary variable Response has the values 0 and 1. There are three predictors: Amount identifies the dosage amount in mg, Frequency has the values Daily and Weekly, and Meal has the values Yes and No. You want to calculate three odds ratios: an odds ratio for Amount at 20 mg intervals an odds ratio for Frequency against the reference level (Daily) as compared to all levels of Meal an odds ratio for Meal against the reference level (Yes) as compared to all levels of Frequency Which of the following blocks of code below correctly completes the following PROC LOGISTIC program? proc logistic data=newdrug; class Frequency (param=ref ref='Daily') Meal (param=ref ref='Yes'); model Response (event='1') = Frequency | Meal | Amount @2; _____________________________________________ run; a. oddsratio Amount (units=20); oddsratio Frequency / diff=ref at (Meal=all); oddsratio Meal / diff=ref at (Frequency=all); b. units Amount=20; oddsratio Amount; oddsratio Frequency / diff=all at (Meal='Yes'); oddsratio Meal / diff=all at (Frequency='Daily'); c. units Amount=20; oddsratio Amount; oddsratio Frequency / diff=ref at (Meal=all); oddsratio Meal / diff=ref at (Frequency=all); d. oddsratio Amount (units=20); oddsratio Frequency / diff=all at (Meal='Yes'); oddsratio Meal / diff=all at (Frequency='Daily');
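The arithmetic behind the UNITS statement is worth knowing for the exam: for a continuous predictor with log-odds slope beta, the odds ratio for a c-unit increase is exp(c · beta). The slope value below is hypothetical, purely to show the calculation that UNITS Amount=20 would report:

```python
from math import exp

# Hypothetical per-mg log-odds slope for Amount from such a logistic fit
beta_amount = 0.042

# Odds ratio for a 20 mg increase = exp(20 * beta)
print(round(exp(20 * beta_amount), 2))  # → 2.32
```

So a per-unit odds ratio near 1 can still correspond to a substantial odds ratio over a meaningful dosage interval, which is exactly why the 20 mg unit is specified.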
class Program(param=ref ref='2') Gender(param=ref ref='Male'); The CLASS statement lists all the categorical predictor variables. For each categorical predictor, you use the PARAM= option to specify reference cell coding (REF or REFERENCE) instead of the default parameterization method, effect coding. The default reference level is the level with the highest ranked value when the levels are sorted in ascending alphanumeric order. Review: Specifying a Parameterization Method in the CLASS Statement, Reference Cell Coding
Suppose you want to fit a multiple logistic regression model to determine which of two rehabilitation programs is more effective. The categorical response variable Relapsed (Yes or No) indicates whether study participants stayed clean after one year. The categorical predictor variables are Program (1 or 2) and Gender (Male or Female). Age is a continuous predictor variable. Assume that you want to use reference cell coding with the default reference levels. Which of the following CLASS statements correctly completes the PROC LOGISTIC step for this analysis? proc logistic data=programs.rehabilitation; _____________________________________ model Relapsed (event='Yes') = Program | Gender | Age @2; run; a. class Program(param=ref ref='2') Gender(param=ref ref='Male'); b. class Program(param=ref ref='2') Gender (param=ref ref='Male') Age (param=ref units=1); c. class Program(param=ref ref='1') Gender(param=ref ref='Female');
model Focus(event='Sports')=Gender; In the MODEL statement, the response variable name is followed by the EVENT= option in parentheses (which specifies the event category—the level of the response variable that you're interested in), an equal sign, and the predictor variable name. Review: The LOGISTIC Procedure
Suppose you want to investigate the relationship between the gender of elementary school students and their focus in school. The variable Gender indicates the gender of each student as Boy or Girl. The variable Focus identifies each student's main focus in school as Grades or Sports. Which of the following MODEL statements correctly completes this PROC LOGISTIC step for your analysis? proc logistic data=school.students; class Gender; _____________________________________ run; a. model Focus(event='Sports*Grades')=Gender; b. model Focus(event='Sports')=Gender; c. model Focus(ref='Sports')=Gender; d. model Focus*Gender(ref='Sports');
false The Tukey method and the pairwise t-tests are two methods you learned about that compare all possible pairs of means, so they can be used only when you make pairwise comparisons. The Dunnett method compares all categories to a control group. Review: Dunnett's Multiple Comparison Method, Tukey's Multiple Comparison Method
The Dunnett method compares all possible pairs of means, so it can be used only when you make pairwise comparisons. a. true b. false
Error
The ___________sum of squares, SSE, measures the random variability within groups; it is the sum of the squared deviations between observations in each group and that group's mean. This is often referred to as the unexplained variation or within-group variation. A. Total B. Error
Total
The _________sum of squares, SST, is a measure of the total variability in a response variable. It is calculated by summing the squared distances from each point to the overall mean. Because it is correcting for the mean, this sum is sometimes called the corrected total sum of squares. A. Total B. Error
means, normal, larger The central limit theorem states that the distribution of sample means is approximately normal, regardless of the distribution of the population data, and this approximation improves as the sample size gets larger.
The central limit theorem states that the distribution of sample __(1)__ is approximately __(2)__, regardless of the distribution of the population data, and this approximation improves as the sample size gets __(3)__. a. means, skewed, larger b. variance, equal, smaller c. means, normal, larger d. proportions, equal, smaller
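A small simulation makes the theorem tangible. This stdlib-Python sketch (not part of the SAS material) draws repeated samples from a strongly right-skewed population and shows that the sample means still behave normally:

```python
import random
from statistics import mean, stdev

random.seed(1)

# Exponential(1) is heavily right-skewed, with mean 1 and std dev 1
def sample_means(n, reps=2000):
    """Means of `reps` independent samples of size n from Exponential(1)."""
    return [mean(random.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

means = sample_means(n=50)

# Per the CLT, the means center near 1 with spread close to
# 1/sqrt(50) ≈ 0.14 and are roughly bell-shaped, despite the skewed population.
print(round(mean(means), 2), round(stdev(means), 2))
```

Increasing n tightens the spread (by 1/sqrt(n)) and improves the normal approximation, which is the "gets larger" part of the theorem.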
the mean (µ) and the standard deviation (σ) The location and spread of a normal distribution depend on the value of two parameters, the mean (µ) and the standard deviation (σ).
The location and spread of a normal distribution depend on the value of which two parameters? a. the mean (x̄) and the standard deviation (s) b. the standard deviation (σ) and the variance (σ²) c. the mean (µ) and the standard deviation (σ) d. none of the above
a two-sided t-test Because the cereal manufacturer is interested in determining whether the two processes produce a different mean cereal weight, he needs to perform a two-sided t-test. Review: Scenario: Comparing Group Means, Scenario: Testing for Differences on One Side
The manufacturer for a cereal company uses two different processes to package boxes of cereal. He wants to be sure the two processes are putting the same amount of cereal in each box. He plans to perform a two-sample t-test to determine whether the mean weight of cereal is significantly different between the two processes. What type of test should he run? a. an upper-tailed t-test b. a two-sided t-test c. a lower-tailed t-test
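The pooled (equal-variance) statistic that PROC TTEST reports for this two-sided test can be reproduced by hand. The box weights below are invented for illustration:

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(x, y):
    """Equal-variance two-sample t statistic (the 'Pooled' line in PROC TTEST)."""
    nx, ny = len(x), len(y)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))

# Hypothetical box weights (ounces) from the two packaging processes
a = [15.01, 15.03, 15.02, 15.05, 15.02]
b = [15.04, 15.06, 15.05, 15.03, 15.07]
print(round(pooled_t(a, b), 2))  # ≈ -2.45
```

Because the test is two-sided, the p-value uses both tails of the t distribution: a large statistic in either direction is evidence that the two processes differ.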
used to calculate confidence intervals of the mean. The standard error of the mean is part of the equation used to calculate a confidence interval of the mean. It is not normally distributed, and it is never less than 0.
The standard error of the mean is a. used to calculate confidence intervals of the mean. b. always normally distributed. c. sometimes less than 0. d. none of the above
The row percentages indicate that the distribution of size changes when the value of country changes. To see a possible association, you look at the row percentages. A higher percentage of American-made cars are large as opposed to small. The opposite is true for European cars and especially for Japanese cars. Review: Association between Categorical Variables, Crosstabulation Tables
This table shows frequency statistics for the variables country and size in a data set that contains data about people and the cars they drive. What evidence in the table indicates a possible association?
Table of country by size (each cell lists Frequency, Percent, Row Pct, Col Pct):
American: Large 36, 11.88, 31.30, 85.71 | Medium 53, 17.49, 46.09, 42.74 | Small 26, 8.58, 22.61, 18.98 | Total 115, 37.95
European: Large 4, 1.32, 10.00, 9.52 | Medium 17, 5.61, 42.50, 13.71 | Small 19, 6.27, 47.50, 13.87 | Total 40, 13.20
Japanese: Large 2, 0.66, 1.35, 4.76 | Medium 54, 17.82, 36.49, 43.55 | Small 92, 30.36, 62.16, 67.15 | Total 148, 48.84
Total: Large 42, 13.86 | Medium 124, 40.92 | Small 137, 45.21 | Total 303, 100.00
a. The frequency statistics indicate that the values of each variable are equally distributed across levels. b. The row percentages indicate that the distribution of size changes when the value of country changes. c. The column percentages indicate that most of the cars of each size are manufactured in Japan.
The drug effect is not significant when used in patients with disease Z. The p-value for disease Z is 0.7815. Because this p-value is greater than your alpha of 0.05, you fail to reject the null hypothesis and conclude that there is no significant effect of Drug on Health for patients with disease Z. Review: Performing a Post Hoc Pairwise Comparison
This table shows output from a post hoc pairwise comparison in which you tested the significance of a drug on patients' health for three different diseases. What conclusion can you make based on this output? a. The drug effect is significant when used in patients with disease Z. b. The drug effect is significant when used in patients with diseases Y and Z. c. The drug effect is not significant when used in patients with disease Z.
True
True or False? Assessing ANOVA Assumptions In many cases, good data collection designs can help ensure the independence assumption. Diagnostic plots from PROC GLM can be used to verify the assumption that the error is approximately normally distributed. PROC GLM produces a test of equal variances with the HOVTEST option in the MEANS statement. H0 for this hypothesis test is that the variances are equal for all populations.
True The CLASS statement creates a set of "design variables" (sometimes referred to as "dummy variables") representing the information contained in any categorical variables. Linear regression is then performed on the design variables. ANOVA can be thought of as linear regression on dummy variables. It is only in the interpretation of the model that a distinction is made.
True or False? What Does a CLASS Statement Actually Do? The CLASS statement creates a set of "design variables" representing the information in the categorical variables. PROC GLM performs linear regression on the design variables, but reports the output in a manner interpretable as group mean differences. There is only one "parameterization" available in PROC GLM.
CONNECT= proc sgplot data=STAT1.ameshousing3; vbox SalePrice / category=Central_Air connect=mean; title "Sale Price Differences across Central Air"; run;
VBOX options which Specifies that a connect line joins a statistic from box to box. This option applies only when the CATEGORY option is used to generate multiple boxes. A. CATEGORY= B. CONNECT=
CATEGORY= proc sgplot data=STAT1.ameshousing3; vbox SalePrice / category=Central_Air connect=mean; title "Sale Price Differences across Central Air"; run;
VBOX options which Specifies the category variable for the plot. A box plot is created for each distinct value of the category variable. A. CATEGORY= B. CONNECT=
The variance inflation factors indicate that collinearity is present in the model. Several variance inflation factors are above 10 (Abdomen, Weight, Height, Chest, Hip, Density, Adiposity, and FatFreeWt). This indicates that collinearity among the predictor variables is present in the model. Review: The REG Procedure: Detecting Collinearity
View this PROC REG output. What does the output indicate about the model? a. The p-value for the overall model is not significant. b. The model does not fit the data well. c. The p-values for the parameter estimates indicate that collinearity is present in the model. d. The variance inflation factors indicate that collinearity is present in the model. e. none of the above
Several observations exceed the cutoff values, so these observations might be influential. The gray horizontal lines mark the +2 and -2 cutoff values of the RSTUDENT residuals. Several observations fall outside these lines, so these observations might be influential. Review: Looking for Influential Observations, Part 1, Looking for Influential Observations, Part 2
View this plot of RSTUDENT residuals versus predicted values of PctBodyFat2. What does it indicate? a. The model does not fit the data well. b. The residuals have a cyclical shape, so the independence assumption is being violated. c. Several observations exceed the cutoff values, so these observations might be influential. d. none of the above
both of the above An influential observation is an observation that strongly affects the linear model's fit to the data. If the influential observation weren't there, the best fitting line to the rest of the data would most likely be very different. Review: Introduction, Using Diagnostic Statistics to Identify Influential Observations, Looking for Influential Observations, Part 1, Looking for Influential Observations, Part 2, Handling Influential Observations
What is an influential observation? a. unusual observation that can sometimes have a large residual compared to the rest of the points b. an observation so far away from the rest of the data that it influences the slope of the regression line c. both of the above d. neither of the above
H0: μ = μ0 and H0: μ - μ0 = 0 (answers A and C are equivalent statements of the same null hypothesis)
What is the null hypothesis for a one-sample t-test? A. H0: μ = μ0 B. H0: μ0 = 0 C. H0: μ - μ0 = 0 D. H0: μ0 - 0 = 0
a table of correlations and a scatter plot matrix with histograms along its diagonal By default, PROC CORR produces a table of correlations (which can be a correlation matrix, depending on your program). The NOSIMPLE option suppresses printing of the simple descriptive statistics for each variable, and PLOTS=MATRIX requests a scatter plot matrix instead of individual scatter plots. The HISTOGRAM option displays histograms of the variables in the VAR statement along the diagonal of the scatter plot matrix. Review: Using Correlation to Measure Relationships between Continuous Variables
What output does this program produce? proc corr data=statdata.bodyfat2 nosimple plots=matrix(nvar=all histogram); var Age Weight Height; run; a. individual correlation plots and simple descriptive statistics b. a scatter plot matrix only, with histograms along its diagonal c. a table of correlations and a scatter plot matrix with histograms along its diagonal d. can't tell from the information given
For each year older (holding the other predictors at a fixed value), the predicted value of oxygen consumption is 2.78 lower. The parameter estimate for Age is the average change in Oxygen_Consumption for a 1-unit change in Age. In this case, the parameter estimate is negative. So, for each year older (a 1-unit change in Age), oxygen consumption decreases by 2.78 units. Review: The Simple Linear Regression Model
When Oxygen_Consumption is regressed on RunTime, Age, Run_Pulse, and Maximum_Pulse, the parameter estimate for Age is -2.78. What does this mean? a. For each year older (holding the other predictors at a fixed value), the predicted value of oxygen consumption is 2.78 greater. b. For each year older (holding the other predictors at a fixed value), the predicted value of oxygen consumption is 2.78 lower. c. For every 2.78 years older (holding the other predictors at a fixed value), oxygen consumption doubles. d. For every 2.78 years younger (holding the other predictors at a fixed value), oxygen consumption doubles.
model Health=Drug Disease Drug*Disease; In the MODEL statement, you first specify the main effect variables as they exist in the two-way ANOVA model. You then define the interaction term by separating the two main effect variables with an asterisk in the MODEL statement. Review: Performing Two-Way ANOVA with Interactions, Applying the Two-Way ANOVA Model
When you perform a two-way ANOVA in SAS, which of the following statements correctly defines the model that includes the interaction between the two main effect variables? a. class Drug*Disease; b. class Drug=Disease; c. model Drug*Disease; d. model Health=Drug Disease Drug*Disease;
proc glmselect data=housing; class fireplace lot_shape; model Sale_price = fireplace lot_shape; partition fraction(test=0 validate=.20); run; The PARTITION statement specifies that the original data set, Housing, be split. The FRACTION option specifies the fraction of the original data set (as a decimal value) to be placed in the holdout data set. The training data set contains the remaining observations, those that were not allocated to the validation (or, if specified, test) data sets. Review: Using PROC GLMSELECT to Build a Predictive Model, Building a Predictive Model
Which of the following PROC GLMSELECT steps splits the original data set into a training data set that contains 80% of the original data and a validation data set that contains 20% of the original data? a. proc glmselect data=housing; class fireplace lot_shape; model Sale_price = fireplace lot_shape / fraction(test=0 validate=.20); run; b. proc glmselect data=housing; class fireplace lot_shape; model Sale_price = fireplace lot_shape / partition(test=0 validate=.20); run; c. proc glmselect data=housing; class fireplace lot_shape; model Sale_price = fireplace lot_shape; fraction(test=0 validate=.20); run; d. proc glmselect data=housing; class fireplace lot_shape; model Sale_price = fireplace lot_shape; partition fraction(test=0 validate=.20); run;
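The idea behind the PARTITION statement, randomly holding out a fraction of the rows for validation, can be sketched outside SAS in a few lines of stdlib Python. This is only an illustration of the concept, not PROC GLMSELECT's exact assignment mechanism:

```python
import random

random.seed(42)

def partition(rows, validate=0.20):
    """Randomly hold out roughly `validate` of the rows for validation,
    keeping the rest for training (cf. PARTITION FRACTION(VALIDATE=.20))."""
    train, valid = [], []
    for row in rows:
        (valid if random.random() < validate else train).append(row)
    return train, valid

train, valid = partition(list(range(1000)))
print(len(train), len(valid))   # roughly 800 and 200
```

Because assignment is random, the split is approximately, not exactly, 80/20, which matches how FRACTION allocates observations by probability rather than by exact count.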
None of the above Alpha is the significance level that the analyst chooses before running the test; it is not determined by the p-value, the sample size, or the number of Type I errors that occur.
Which of the following affects alpha? a. The p-value of the test b. The sample size c. The number of Type I errors d. All of the above e. Answers a and b only f. None of the above
STUDENT residuals You can use STUDENT residuals to detect outliers. To detect influential observations, you can use RSTUDENT residuals and the DFFITS and Cook's D statistics. Review: Looking for Influential Observations, Part 1, Looking for Influential Observations, Part 2
Which of the following can you use to detect outliers? a. DFFITS statistics b. Cook's D statistics c. STUDENT residuals d. RSTUDENT residuals
proc univariate data=statdata.speedtest; histogram Speed / normal(mu=est sigma=est); inset skewness kurtosis / position=ne; run; In the HISTOGRAM statement, you specify the Speed variable and the NORMAL option using estimates of the population mean and the population standard deviation. In the INSET statement, you specify the keywords SKEWNESS and KURTOSIS, as well as the POSITION=NE option.
Which of the following code choices creates a histogram for the variable Speed from the data set SpeedTest with a normal curve overlay and a box with the skewness and kurtosis statistics printed in the northeast corner? a. proc univariate data=statdata.speedtest; histogram Speed / normal(mu=est sigma=est); inset skewness kurtosis; run; b. proc univariate data=statdata.speedtest; histogram Speed / normal (mean std); inset skewness kurtosis / position=ne; run; c. proc univariate data=statdata.speedtest; histogram Speed / normal(mu=est sigma=est); inset skewness kurtosis / position=ne; run; d. proc univariate data=statdata.speedtest; histogram Speed / normal(skewness kurtosis); run;
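The SKEWNESS and KURTOSIS values that the INSET statement displays come from standardized third and fourth moments. A stdlib-Python sketch of the moment formulas; note this uses the population (biased) form, whereas PROC UNIVARIATE reports sample-adjusted versions, so values differ slightly for small n. The data are made up:

```python
from statistics import mean, pstdev

def skewness(x):
    """Population skewness: average standardized cubed deviation."""
    m, s = mean(x), pstdev(x)
    return sum(((v - m) / s) ** 3 for v in x) / len(x)

def kurtosis(x):
    """Population excess kurtosis (0 for a normal distribution)."""
    m, s = mean(x), pstdev(x)
    return sum(((v - m) / s) ** 4 for v in x) / len(x) - 3

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]   # hypothetical right-skewed speeds
print(round(skewness(data), 2))  # → 1.43 (positive: long right tail)
```

A clearly positive skewness like this is the numeric counterpart of the histogram's long right tail relative to the overlaid normal curve.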
proc means data=statdata.popcorn maxdec=2 fw=10 printalltypes n mean median std var range qrange; class Type; var Yield; run; The PROC MEANS statement must include the option PRINTALLTYPES in order for SAS to display statistics for all requested combinations of class variables - that is, for each level or occurrence of the variable and for all occurrences combined. The statistics specified on the second line must include the keywords N MEAN MEDIAN STD VAR RANGE QRANGE. The code must specify Type as the class variable and Yield as the analysis variable.
Which of the following code examples correctly calculates descriptive statistics of popcorn yield (Yield) for each level of the class variable (Type) in the data set Statdata.Popcorn, as well as statistics for all levels combined? The output should include the following statistics: sample size, mean, median, standard deviation, variance, range, and interquartile range. a. proc means data=statdata.popcorn maxdec=2 fw=10 n mean median std var range qrange; class Type; var Yield; run; b. proc means data=statdata.popcorn maxdec=2 fw=10 printalltypes n mean median std var range qrange; class Yield; var Class; run; c. proc means data=statdata.popcorn maxdec=2 fw=10 printalltypes n mean median std var range qrange; class Type; var Yield; run; d. proc means data=statdata.popcorn maxdec=2 fw=10 printalltypes n mean median std range IQR; class Type; var Yield; run;
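The statistics that the PROC MEANS keywords request (N MEAN MEDIAN STD VAR RANGE QRANGE) map directly onto Python's statistics module. A cross-check sketch with invented yields for one level of Type; note that quantiles() uses its default exclusive method, which can differ from SAS's percentile definition:

```python
from statistics import mean, median, stdev, variance, quantiles

yield_vals = [12.1, 13.4, 11.8, 14.2, 12.9, 13.7, 12.5]   # hypothetical yields

q1, q2, q3 = quantiles(yield_vals, n=4)   # quartile cut points
stats = {
    "n": len(yield_vals),
    "mean": round(mean(yield_vals), 2),
    "median": round(median(yield_vals), 2),
    "std": round(stdev(yield_vals), 2),
    "var": round(variance(yield_vals), 2),
    "range": round(max(yield_vals) - min(yield_vals), 2),
    "qrange": round(q3 - q1, 2),           # interquartile range
}
print(stats)
```

In the SAS step, the CLASS statement repeats this set of statistics for each value of Type, and PRINTALLTYPES adds the all-levels-combined row.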
the smallest overall validation average squared error
When a validation data set has been provided, PROC GLMSELECT selects the candidate model with the smallest validation average squared error. Review: Building a Predictive Model
Which of the following does PROC GLMSELECT use to select a model from the candidate models when a validation data set has been provided?
a. the smallest number of predictors
b. the largest adjusted R-Square value
c. the smallest overall validation average squared error
d. none of the above
all of the above All of these statements are available for use within PROC PLM for postprocessing. Recall that this postprocessing will be performed using the item store. Review: Performing Postprocessing Tasks with the PLM Procedure
Which of the following is available for use in postprocessing within PROC PLM?
a. LSMEANS
b. LSMESTIMATE
c. SLICE
d. all of the above
The observations are dependent. In an ANOVA model, you assume that the errors are normally distributed for each treatment, the errors have equal variances across treatments, and the observations are independent. When you add a blocking factor to your ANOVA model, you also assume that the treatments are randomly assigned within each block and that the effects of the treatment are the same within each block. Review: More ANOVA Assumptions
Which of the following is not an assumption you make when including a blocking factor in an ANOVA randomized block design?
a. The treatments are randomly assigned within each block.
b. The errors are normally distributed.
c. The effects of the treatment factor are constant across the levels of the blocking variable.
d. The observations are dependent.
all of the above All six steps are important for developing good regression models. You might need to perform some steps iteratively to produce the best possible model. Review: Using an Effective Modeling Cycle
Which of the following is suggested for developing good regression models?
a. getting to know your data by performing preliminary analyses
b. identifying good candidate models
c. checking and validating your assumptions using residual plots and other statistical tests
d. identifying any influential observations or collinearity
e. revising the model if needed
f. validating the model with data not used to build the model
g. all of the above
h. a, c, and d only
When you score data, you apply the score code (the equations obtained from the final model) to the scoring data. When you score data, you apply the score code to the scoring data. It is not necessary to rerun the algorithm that was used to build the model. If you made any modifications to the training or validation data, you must make the same modifications to the scoring data before you can score it. The size of the scoring data set is not affected by the size of the training and validation data sets. Review: Preparing for Scoring
Which of the following statements about scoring is true?
a. When you score data, you must rerun the algorithm that was used to build the model.
b. When you score data, you apply the score code (the equations obtained from the final model) to the scoring data.
c. If you made any modifications to the training or validation data, it is not necessary to make the same modifications to the scoring data.
d. The scoring data set cannot be larger than either the training data set or the validation data set.
2 only In statement 2, the amount of salty snacks eaten and thirst have a positive linear relationship. As the values of one variable (amount of salty snacks eaten) increase, the values of the other variable (thirst) increase as well. Review: Using Scatter Plots to Describe Relationships between Continuous Variables, Using Correlation to Measure Relationships between Continuous Variables
Which of the following statements describes a positive linear relationship between two variables?
1. The more I eat, the less I want to exercise.
2. The more salty snacks I eat, the more water I want to drink.
3. No matter how much I exercise, I still weigh the same.
a. 1 only
b. 1 and 2
c. 2 only
d. 2 and 3
e. 3 only
all of the above
All of the statements are true concerning information criteria. All of the formulas begin with the same calculation but differ in the penalty term assessing the complexity of the model. This penalty makes it possible to compare models that contain different numbers of parameters, with the smaller information criterion value considered better. Review: Information Criteria
Which of the following statements is true about information criteria such as AIC, AICC, BIC, and SBC?
a. Formulas for all information criteria begin with the same calculation.
b. The penalty term to assess the complexity of the model allows information criteria to be a useful means of comparing models with different numbers of parameters.
c. The best model is the one with the smallest information criteria value.
d. all of the above
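For reference, a sketch of the usual definitions (with L the maximized likelihood, k the number of estimated parameters, and n the sample size) shows the shared -2 log L calculation and the differing penalty terms; note that SAS distinguishes SBC (the Schwarz criterion, often called BIC elsewhere) from Sawa's BIC, which has a different penalty:

```latex
\mathrm{AIC}  = -2\log L + 2k
\qquad
\mathrm{AICC} = -2\log L + \frac{2kn}{n-k-1}
\qquad
\mathrm{SBC}  = -2\log L + k\log n
```

In each case the penalty grows with k, so a more complex model must improve the fit enough to offset its penalty before its criterion value becomes smaller.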
You can reproduce your results if you specify an integer that is greater than zero in the SEED= option and then rerun the code using the same SEED= value. By specifying an integer that is greater than zero in the SEED= option, you can reproduce your results by rerunning the code using the same SEED= value. The SEED= option has nothing to do with the allocation of observations to the validation data set. If you do not specify a valid value in the SEED= option, the seed is automatically generated from reading the time of day from the computer's clock. The SEED= option is used when you start with a data set that is not yet partitioned. Review: Using PROC GLMSELECT to Build a Predictive Model
Which of the following statements is true about the SEED= option in PROC GLMSELECT?
PROC GLMSELECT DATA=training-data-set <SEED=number>;
   MODEL target(s)=input(s) </ options>;
   PARTITION FRACTION(<TEST=fraction> <VALIDATE=fraction>);
RUN;
a. You can reproduce your results if you specify an integer that is greater than zero in the SEED= option and then rerun the code using the same SEED= value.
b. The SEED= option offers an alternative way to specify the proportion of observations to allocate to the validation data set.
c. If a valid value is not specified for the SEED= option, the code will not run.
d. You can use the SEED= option only when you have already partitioned the data prior to model building.
proc reg data=statdata.bodyfat2
      plots(only)=(RSTUDENTBYPREDICTED(LABEL) COOKSD(LABEL)
                   DFFITS(LABEL) DFBETAS(LABEL));
   PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm / r influence;
   id Case;
run;
quit;
Program a specifies the R and INFLUENCE options, which request diagnostic statistics. Review: Looking for Influential Observations, Part 1, Looking for Influential Observations, Part 2
Which of these programs requests diagnostic statistics as well as diagnostic plots?
a. proc reg data=statdata.bodyfat2
         plots(only)=(RSTUDENTBYPREDICTED(LABEL) COOKSD(LABEL)
                      DFFITS(LABEL) DFBETAS(LABEL));
      PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm / r influence;
      id Case;
   run;
   quit;
b. proc reg data=statdata.bodyfat2
         plots(only)=(QQ RESIDUALBYPREDICTED RESIDUALS);
      PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm;
      id Case;
   run;
   quit;
c. both of the above
ods output RSTUDENTBYPREDICTED=Rstud COOKSDPLOT=Cook
           DFFITSPLOT=Dffits DFBETASPANEL=Dfbs;
proc reg data=statdata.bodyfat2
      plots(only)=(RSTUDENTBYPREDICTED(label) COOKSD(label)
                   DFFITS(label) DFBETAS(label));
   PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm;
   id Case PctBodyFat2 Abdomen Weight Wrist Forearm;
   title;
run;
quit;
Program b is almost correct, but the plots must actually be created for the data sets to be saved. Program c tells SAS both to create the plots and to save them into their own data sets. Review: Looking for Influential Observations, Part 1, Looking for Influential Observations, Part 2
Which program correctly saves information from influential plots into individual output data sets? Assume that ODS GRAPHICS is on.
a. proc reg data=statdata.bodyfat2;
      PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm / r influence;
      id Case PctBodyFat2 Abdomen Weight Wrist Forearm;
   run;
   quit;
b. ods output RSTUDENTBYPREDICTED=Rstud COOKSDPLOT=Cook
              DFFITSPLOT=Dffits DFBETASPANEL=Dfbs;
   proc reg data=statdata.bodyfat2 plots=none;
      PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm;
      id Case PctBodyFat2 Abdomen Weight Wrist Forearm;
      title;
   run;
   quit;
c. ods output RSTUDENTBYPREDICTED=Rstud COOKSDPLOT=Cook
              DFFITSPLOT=Dffits DFBETASPANEL=Dfbs;
   proc reg data=statdata.bodyfat2
         plots(only)=(RSTUDENTBYPREDICTED(label) COOKSD(label)
                      DFFITS(label) DFBETAS(label));
      PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm;
      id Case PctBodyFat2 Abdomen Weight Wrist Forearm;
      title;
   run;
   quit;
d. ods output outputstatistics;
   proc reg data=statdata.bodyfat2
         plots(only)=(RSTUDENTBYPREDICTED(LABEL) COOKSD(LABEL)
                      DFFITS(LABEL) DFBETAS(LABEL));
      PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm;
      id Case PctBodyFat2 Abdomen Weight Wrist Forearm;
   run;
   quit;
The response variable can have more than two levels as long as one of the levels is coded as 0. In binary logistic regression, the response variable can only have two levels. Review: Modeling a Binary Response
Which statement about binary logistic regression is false?
a. Binary logistic regression uses predictor variables to estimate the probability of a specific outcome.
b. To model the relationship between a predictor variable and the probability of an outcome, you must use a nonlinear function.
c. The mean of the response in binary logistic regression is a probability, which is between 0 and 1.
d. The response variable can have more than two levels as long as one of the levels is coded as 0.
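To make the nonlinear function in option b concrete: binary logistic regression models the probability through the logistic function, whose logit transform is linear in the predictors and whose output always lies between 0 and 1:

```latex
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
\quad\Longleftrightarrow\quad
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k
```

This is why the mean response in option c is a probability, and why a straight-line model of p itself would not work.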
All main effects and interactions that remain in the final model must be significant. Backward elimination results in a final model that can contain one or more main effects and (if specified) interactions. Any interactions in the final model must be significant. Main effects that are involved in interactions must appear in the final model, whether or not they are significant. Review: The Backward Elimination Method of Variable Selection
Which statement about the backward elimination method is false?
a. Backward elimination is a method of selecting variables for a logistic regression model.
b. Backward elimination removes effects and interactions one at a time.
c. All main effects and interactions that remain in the final model must be significant.
d. To obtain a more parsimonious model, you specify a smaller significance level.
reference Typically, the original data set is split into two subset data sets called the training and validation data sets. However, in some situations, the data is split into three subsets, and the third of these is called the test data set. Review: Using PROC GLMSELECT to Build a Predictive Model
With a large enough data set, observations can be divided into three subset data sets for use in honest assessment. Which of the following is not the name of one of these three subset data sets?
a. training
b. validation
c. reference
d. test
the assumption of equal variances You use Levene's Test for Homogeneity in PROC GLM to verify the assumption of equal variances in a one-way ANOVA model. Review: The GLM Procedure
You can examine Levene's Test for Homogeneity to more formally test which of the following assumptions?
a. the assumption of errors being normally distributed
b. the assumption of independent observations
c. the assumption of equal variances
d. the assumption of treatments being randomly assigned
Parameters
____________ are numerical characteristics of populations. They are generally unknown and must be estimated through the use of samples. A sample is a group of measurements from a population. In order for inferences to be valid, the sample should be representative of the population.
A. Metrics
B. Parameters
Scatter
Scatter plots are useful to accomplish the following:
- explore the relationships between two variables
- locate outlying or unusual values
- identify possible trends
- identify a basic range of Y and X values
- communicate data analysis results
____________ plots are two-dimensional graphs produced by plotting one variable against another within a set of coordinate axes. The coordinates of each point correspond to the values of the two variables.
A. Box
B. Histogram
C. Scatter
Total Variation
the overall variability in the response variable. It is calculated as the sum of the squared differences between each observed value and the overall mean. This measure is also referred to as the Total Sum of Squares (SST).
A. Total Variation
B. Between Group Variation
C. Within Group Variation
Between Group Variation
the variability explained by the independent variable and therefore represented by the between-treatment sum of squares. It is calculated as the weighted (by group size) sum of the squared differences between the mean for each group and the overall mean. This measure is also referred to as the Model Sum of Squares (SSM).
A. Total Variation
B. Between Group Variation
C. Within Group Variation
Within Group Variation
the variability not explained by the model. It is also referred to as within-treatment variability or residual sum of squares. It is calculated as the sum of the squared differences between each observed value and the mean for its group. This measure is also referred to as the Error Sum of Squares (SSE).
A. Total Variation
B. Between Group Variation
C. Within Group Variation
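The three definitions above can be written compactly. With groups i = 1, ..., k of size n_i, observations y_ij, group means ȳ_i, and overall mean ȳ:

```latex
\mathrm{SST} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2
\qquad
\mathrm{SSM} = \sum_{i=1}^{k} n_i\,(\bar{y}_i - \bar{y})^2
\qquad
\mathrm{SSE} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2
```

These satisfy the ANOVA decomposition SST = SSM + SSE, which is what lets the F test compare between-group to within-group variation.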