SAS Statistics, SAS Visual Analytics, SAS DataFlux


row-based business rule called Monitor for Nulls

A Data Quality Steward creates these items for the Supplier repository: - A row-based business rule called Monitor for Nulls - A set-based business rule called Percent of Verified Addresses - A group-based rule called Low Product Count - A task based on the row-based, set-based, and group-based rules called Monitor Supplier Data Which one of these can the Data Quality Steward apply in an Execute Business Rule node in a data job? Response: set-based business rule called Percent of Verified Addresses row-based business rule called Monitor for Nulls group-based rule called Low Product Count task based on the row-based, set-based, and group-based rules called Monitor Supplier Data

row-based business rule called Monitor for Nulls

A Data Quality Steward creates these items for the Supplier repository: • A row-based business rule called Monitor for Nulls • A set-based business rule called Percent of Verified Addresses • A group-based rule called Low Product Count • A task based on the row-based, set-based, and group-based rules called Monitor Supplier Data Which one of these can the Data Quality Steward apply in an Execute Business Rule node in a data job? A. set-based business rule called Percent of Verified Addresses B. row-based business rule called Monitor for Nulls C. group-based rule called Low Product Count D. task based on the row-based, set-based, and group-based rules called Monitor Supplier Data

proc logistic data = MYDIR.EMPLOYMENT descending; class Education (param=ref ref='3'); model Hired = Salary Education; run;

A Human Resource manager fits a logistic regression model with the following characteristics: - binary target Hired - continuous predictor Salary - categorical predictor Education (levels=1,2,3) The default odds ratio compares each level against the last class level for the variable Education. Which SAS program gives parameter estimates for Education that are consistent with the default odds ratios? proc logistic data = MYDIR.EMPLOYMENT descending; class Education (param=ref ref='3'); model Hired = Salary Education; run; proc logistic data = MYDIR.EMPLOYMENT descending; class Education; model Hired = Salary Education; run; proc logistic data = MYDIR.EMPLOYMENT descending; class Education (ref='3'); model Hired = Salary Education; run; proc logistic data = MYDIR.EMPLOYMENT descending; class Education Salary (param=ref ref='3'); model Hired = Salary Education; run;

Token

A __________ is an "atomically semantic" component of a data value. In other words, _____________ represent the smallest pieces of a data value that have some distinct meaning. A. Token B. Data Value C. Data Object

Path

A ____________ in a Sankey diagram represents a distinct sequence of events. Each _________ in the diagram consists of one or more transactions. A. Path B. direction indicator C. result

Administrator

A ______________ has the Publish this collection for all users option in the Collections window. A. Publisher B. Administrator C. Report Developer

Residual

A ______________ is the difference between the observed value of the response and the predicted value of the response variable. A. ANOVA B. Mean C. Residual

Tree Map The size of each tile represents either the summarization of a measure or the frequency that is displayed as a count or percent.

A ______________ visualization enables you to display a category or hierarchy as a set of rectangular tiles. A. Bar Chart B. Tree Map C. Heat Map

Collection

A _______________ is a set of fields that are selected from tables that are accessed from different data connections. A _______________ provides a convenient way for users to build a dataset using those fields. A __________________ can be used as an input source for a profile in Data Management Studio. A. Collection B. Data Connection C. Master Data Foundation

Hierarchy In many cases, the levels of a hierarchy are arranged with more general information at the top and more specific information at the bottom.

A ________________ is a defined arrangement of categorical data items based on parent-child relationships. A. Lineage B. Ordinal Process C. Hierarchy

Scatter Plot

A ___________________ visualization enables you to examine the relationship between numeric data items. A. Bar B. Scatter Plot C. Histogram

Stop List

A ____________________ is a table of words that you want to ignore in your text analysis. A. Document Collection B. Stop List

Standardization A standardization definition has the following attributes: it is more complex than a standardization scheme; it involves one or more standardization schemes; it can also parse data and apply regular expression libraries and casing

A ________________________ scheme is a simple find-and-replace table that specifies how data values will be standardized. A. Data Search B. Standardization

Visualization

A ___________________ displays data values using one of several ___________________ types. ___________________ types include tables, charts, plots, geographic maps, and more. A ___________________ can contain filters and other display properties. A. Data Source B. Visualization C. Exploration

Standardization A standardization scheme can be built from the profile report. When a scheme is applied, if the input data is equal to the value in the Data column, then the data is changed to the value in the Standard column. The standard value DataFlux was selected by the Scheme Builder because it was the permutation with the most occurrences in the profile report.

A _________________ scheme takes various spellings or representations of a data value and lists a standard way to consistently write this value. A. Build B. Standardization

Preview Previewing does not create the output. The output is physically created only when the job is executed.

A _______________of a Data Output node does not show field name changes or deletions. This provides the flexibility to continue your data flow after a Data Output node. In addition, previewing a Data Output node does not create the output. You must run the data job to create the output. A. Export B. Import C. Preview

Transaction

A ______________ is a sequence of events that are associated with a specific ____________ identifier value. A. Process B. Transaction C. explanatory observation

Reference Reference source locations are registered on the Administration riser bar in DataFlux Data Management Studio. One reference source location of each type should be designated as the default.

A ______________ object is typically a database used by DataFlux Data Management Studio to compare user data to a reference source (for example, USPS Address Data). You cannot directly access or modify references. A. Data Source B. Reference

Diffogram The downward-sloping diagonal lines show the confidence intervals for the differences. The upward-sloping line is a reference line showing where the group means would be equal.

A _____________ can be used to quickly tell whether the difference between two group means is statistically significant. The point estimates for the differences between pairs of group means can be found at the intersections of the vertical and horizontal lines drawn at group mean values. A. Histogram B. Diffogram

Chop Tables Purpose: Extract individual words from a text string Editor: Chop Table Editor

A collection of character-level rules used to create an ordered word list from a string. For each character represented in the table, you can specify the classification and the operation performed by the algorithm. A. Schemes B. Chop Tables C. Phonetics Libraries D. Regex Libraries E. Vocabulary Libraries F. Grammars

Schemes Purpose: Standardize phrases, words, and tokens Editor: Scheme Builder

A collection of lookup tables used to transform data values to a standard representation. A. Schemes B. Chop Tables C. Phonetics Libraries D. Regex Libraries E. Vocabulary Libraries F. Grammars

Regex Libraries Purpose: Standardization, categorization, casing, and pattern identification Editor: Regex Library Editor

A collection of patterns that are matched against a text string (from left to right) for character-level cleansing and operations. A. Schemes B. Chop Tables C. Phonetics Libraries D. Regex Libraries E. Vocabulary Libraries F. Grammars

Phonetics Libraries Purpose: Phonetic (sound-alike) reduction of words Editor: Phonetics Editor

A collection of patterns that produce the same output string for input strings that have similar pronunciations or spellings. A. Schemes B. Chop Tables C. Phonetics Libraries D. Regex Libraries E. Vocabulary Libraries F. Grammars

Grammars Purpose: Identify patterns in word categories Editor: Grammar Editor

A collection of rules that represent extracted patterns of words in a given context. A. Schemes B. Chop Tables C. Phonetics Libraries D. Regex Libraries E. Vocabulary Libraries F. Grammars

Vocabulary Libraries Purpose: Categorize words Editor: Vocabulary Editor

A collection of words, each associated with one or more categories and likelihoods. A. Schemes B. Chop Tables C. Phonetics Libraries D. Regex Libraries E. Vocabulary Libraries F. Grammars

A Geography hierarchy data item may be assigned to the Geography role to provide drill down capability on the coordinate points.

A content developer builds a Geo Map visualization in SAS Visual Analytics Explorer and sets the Map style property to Coordinates. Which statement about this Geo Map visualization is true? Response: A measure data item can be assigned to the Color role to control the color of the coordinate points. A Custom Geography data item cannot be assigned to the Geography role when Coordinates is set as the Map style property. A category data item can be assigned to the Group role to group the coordinates into regions. A Geography hierarchy data item may be assigned to the Geography role to provide drill down capability on the coordinate points.

The line chart was created using the automatic chart functionality.

A content developer created the visualization with a forecast shown above. Additional measures for scenario analysis cannot be added from the Roles Tab. Why? Response: Underlying factors are not available in the line chart visualization. The forecast option in a line chart does not allow scenario analysis. The visualization does not allow additional measures. The line chart was created using the automatic chart functionality.

Data explorations can be used for the following: to identify data redundancies; to extract and organize metadata from multiple sources; to identify relationships between metadata; to catalog data by specified business data types and processes

A data exploration reads data from databases and categorizes the fields in the selected tables into categories that are predefined in the Quality Knowledge Base (QKB). Data explorations perform this categorization by matching column names. You also have the option of sampling the data in the table to determine whether the data is one of the specific types of categories in the QKB. A. Repository B. Data Collection C. Data Exploration

The portfolios differ significantly with respect to risk.

A financial analyst wants to know whether assets in portfolio A are more risky (have higher variance) than those in portfolio B. The analyst computes the annual returns (or percent changes) for assets within each of the two groups and obtains the following output from the GLM procedure: Which conclusion is supported by the output? A. Assets in portfolio A are significantly more risky than assets in portfolio B. B. Assets in portfolio B are significantly more risky than assets in portfolio A. C. The portfolios differ significantly with respect to risk. D. The portfolios do not differ significantly with respect to risk.

C. The portfolios differ significantly with respect to risk.

A financial analyst wants to know whether assets in portfolio A are more risky (have higher variance) than those in portfolio B. The analyst computes the annual returns (or percent changes) for assets within each of the two groups and obtains the following output from the GLM procedure: Which conclusion is supported by the output? A. Assets in portfolio A are significantly more risky than assets in portfolio B. B. Assets in portfolio B are significantly more risky than assets in portfolio A. C. The portfolios differ significantly with respect to risk. D. The portfolios do not differ significantly with respect to risk.

Business Rule Business rules are defined within a repository using the Business Rules Manager.

A formula, validation, or comparison that can be applied to a given set of data. Data must either pass or fail the business rule. A. Exception B. Business rule

proc glm data=SASUSER.MLR; class c1; model y = c1 x1 x1*x1 c1*x1 /solution; run;

A linear model has the following characteristics: - a dependent variable (y) - one continuous predictor variable (x1) including a quadratic term (x1²) - one categorical predictor variable (c1 with 3 levels) - one interaction term (c1 by x1) Which SAS program fits this model? proc glm data=SASUSER.MLR; class c1; model y = c1 x1 x1sq c1byx1 /solution; run; proc reg data=SASUSER.MLR; model y = c1 x1 x1sq c1byx1 /solution; run; proc glm data=SASUSER.MLR; class c1; model y = c1 x1 x1*x1 c1*x1 /solution; run; proc reg data=SASUSER.MLR; model y = c1 x1 x1*x1 c1*x1; run;

proc glm data=SASUSER.MLR; class c1; model y = c1 x1 x1*x1 c1*x1 /solution; run;

A linear model has the following characteristics: • a dependent variable (y) • one continuous predictor variable (x1) including a quadratic term (x1²) • one categorical predictor variable (c1 with 3 levels) • one interaction term (c1 by x1) Which SAS program fits this model? A. proc glm data=SASUSER.MLR; class c1; model y = c1 x1 x1sq c1byx1 /solution; run; B. proc reg data=SASUSER.MLR; model y = c1 x1 x1sq c1byx1 /solution; run; C. proc glm data=SASUSER.MLR; class c1; model y = c1 x1 x1*x1 c1*x1 /solution; run; D. proc reg data=SASUSER.MLR; model y = c1 x1 x1*x1 c1*x1; run;

Plan: Discover

A quick inspection of your corporate data would probably find that it resides in many different databases, managed by many different systems, with many different formats and representations of the same data. This step of the methodology enables you to explore metadata to verify that the right data sources are included in the data management program. You can also create detailed data profiles of identified data sources so that you can understand their strengths and weaknesses. A. Plan: Define B. Plan: Discover C. Act: Design D. Act: Execute E. Monitor: Evaluate F. Monitor: Control

d. Odds Ratio e. Spearman Correlation

A researcher wants to measure the strength of an association between two binary variables. Which statistic(s) can he use? a. Hansel and Gretel Correlation b. Mantel-Haenszel Chi-Square c. Pearson Chi-Square d. Odds Ratio e. Spearman Correlation

Record rules

A sample of data has been clustered and found to contain many multi-row clusters. For each of these clusters, you want to choose a single record to represent the information in the cluster. Which type of rule do you use to determine a surviving record? Response: Record rules Business rules Clustering rules Field rules

Field rules

A sample of data has been clustered and found to contain many multi-row clusters. To construct a "best" record for each multi-row cluster, you need to select information from other records within a cluster. Which type of rule allows you to perform this task? A. Clustering rules B. Record rules C. Business rules D. Field rules

Data Collection A data collection has the following features: provides a convenient way to build a data source using desired fields can be used as an input source for profiles

A set of data fields from different tables in different data connections. A. Repository B. Data Collection

The difference in the logit between level 1 and the average of all levels

A variable coded 1, 2, 3, and 4 is parameterized with effect coding, with 2 as the reference level. The parameter estimate for level 1 tells you which of the following? a. The difference in the logit between level 1 and level 2 b. The odds ratio between level 1 and level 2 c. The difference in the logit between level 1 and the average of all levels d. The odds ratio between level 1 and the average of all levels e. Both a and b f. Both c and d
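A minimal sketch of how effect coding with level 2 as the reference might be requested in PROC LOGISTIC; the data set and variable names (WORK.CUSTOMERS, Purchase, IncLevel) are hypothetical:

proc logistic data=work.customers;
   /* PARAM=EFFECT requests effect (deviation-from-mean) coding; REF='2' makes level 2 the reference */
   class IncLevel (param=effect ref='2');
   model Purchase(event='1') = IncLevel;
run;

With this coding, each non-reference parameter estimate is the difference between that level's logit and the average logit across all levels.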

Act: Execute

After business users establish how the data and rules should be defined, the IT staff can install them within the IT infrastructure and determine the integration method (real time, batch, or virtual). These business rules can be reused and redeployed across applications, which helps increase data consistency in the enterprise. A. Plan: Define B. Plan: Discover C. Act: Design D. Act: Execute E. Monitor: Evaluate F. Monitor: Control

Act: Design

After you complete the first two steps, this phase enables you to take the different structures, formats, data sources, and data feeds and create an environment that accommodates the needs of your business. At this step, business and IT users build workflows to enforce business rules for data quality and data integration. They also create data models to house data in consolidated or master data sources. A. Plan: Define B. Plan: Discover C. Act: Design D. Act: Execute E. Monitor: Evaluate F. Monitor: Control

False Role-based permissions provide the ability to view and add comments.

All users have the ability to add and view comments. True False

Exploration

An __________________ is a metadata object that accesses one or more data sources and contains one or more visualizations of the data. The visualizations, data sources, and property settings are saved as part of an ________________. A. Data Source B. Visualization C. Exploration

Adj R-Sq

An analyst has selected this model as a champion because it shows better model fit than a competing model with more predictors. Which statistic justifies this rationale? Adj R-Sq R-Square Error DF Coeff Var

ANOVA

Analysis of variance (ANOVA) is a statistical technique used to compare the means of two or more groups of observations or treatments. For this type of problem, you have the following: a continuous dependent variable, or response variable a discrete independent variable, also called a predictor or explanatory variable. A. CONOVA B. ANOVA

Act

Analyzing and exploring the data sources can lead to the discovery of data quality issues. This phase is designed to create data jobs that cleanse, or correct, the data. It involves the following: standardizing, parsing, and/or casing the data; correctly identifying types of data (identification analysis); performing methods to remove duplicates from data sources or to join tables with no common key A. Plan B. Act C. Monitor

ANOVA

Assessing __________ Assumptions In many cases, good data collection designs can help ensure the independence assumption. Diagnostic plots from PROC GLM can be used to verify the assumption that the error is approximately normally distributed. PROC GLM produces a test of equal variances with the HOVTEST option in the MEANS statement. H0 for this hypothesis test is that the variances are equal for all populations. A. Equality B. ANOVA C. Variability
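A minimal sketch of requesting Levene's homogeneity-of-variance test in PROC GLM; the data set and variable names (WORK.SALES, Ad, Sales) are hypothetical:

proc glm data=work.sales plots=diagnostics;
   class Ad;
   model Sales = Ad;
   /* HOVTEST requests Levene's test of equal variances; WELCH adds an ANOVA that does not assume them */
   means Ad / hovtest welch;
run;
quit;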

row-based business rule called Monitor for Nulls

Assume the following items are created for the Supplier repository: - A row-based business rule called Monitor for Nulls - A set-based business rule called Percent of Verified Addresses - A group-based rule called Low Product Count - A task based on the row-based, set-based, and group-based rules called Monitor Supplier Data Which one can you apply in a profile? Response: row-based business rule called Monitor for Nulls group-based business rule called Low Product Count set-based business rule called Percent of Verified Addresses task based on the row-based, set-based, and group-based rules called Monitor Supplier Data

ANOVA

Assumptions for ______________ Observations are independent. Errors are normally distributed. All groups have equal error variances. A. Equality B. ANOVA C. Variability

ANOVA

Assumptions for _____________________ Observations are independent. Errors are normally distributed. All groups have equal error variances. A. Means B. ANOVA C. Medians

One Category; Multiple categories and measures; Geography and three or more measures; One or more categories and any number of measures or geographies

Bar Chart A. One Measure B. One Category C. One datetime category and any number of other categories or measures. D. Two measures E. Three or more measures F. Multiple categories and measures G. Geography and zero to two measures H. One or more categories and any number of geographies

Row-based rule

Business Rule that evaluates every row in a table? A. Row-based rule B. Set-based rule C. Group-based rule

Group-based rule

Business Rule that evaluates groups of data (for example, if data is grouped by product code, then the rules are evaluated for each product code)? A. Row-based rule B. Set-based rule C. Group-based rule

Set-based rule

Business Rule that evaluates the table as a whole? A. Row-based rule B. Set-based rule C. Group-based rule

SAS QKB for Product Data (PD)

Contains extraction, parsing, standardization, and pattern analysis definitions to handle the following attributes in generic product data: • brands/manufacturers • colors • dimensions • sizes • part numbers • materials • packaging terms and units of measurement A. SAS QKB for Contact Information (CI) B. SAS QKB for Product Data (PD)

Extraction

Extracts parts of the text string and assigns them to corresponding tokens for the specified data type. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization

The Totals Placement property has been set to After.

For this Cross tab visualization created in SAS Visual Analytics Explorer, which statement is true? Response: The Indented property has been selected. The Show row totals property has been selected. The Totals Placement property has been set to After. Product line is the lowest level in the hierarchy.

A category data item is assigned to the Lattice Rows role and three data items are assigned to the Measures role.

For this Line chart visualization created in SAS Visual Analytics Explorer, how are data items assigned to roles? Response: A hierarchy data item is assigned to the Category role and a datetime data item is assigned to the X-axis role. Three data items are assigned to Group role and a datetime data item is assigned to the Measures role. A datetime data item is assigned to the Category role and a category data item is assigned to the Group role. A category data item is assigned to the Lattice Rows role and three data items are assigned to the Measures role.

Match

Generates match codes for text strings where the match codes denote a fuzzy representation of the character content of the tokens in the text string. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization

Geography and zero to two measures

Geo Map A. One Measure B. One Category C. One datetime category and any number of other categories or measures. D. Two measures E. Three or more measures F. Multiple categories and measures G. Geography and zero to two measures H. One or more categories and any number of geographies

Large wrist size is significantly different than small wrist size.

Given alpha=0.02, which conclusion is justified regarding percentage of body fat, comparing small(S), medium(M), and large(L) wrist sizes? Medium wrist size is significantly different than small wrist size. Large wrist size is significantly different than small wrist size. There is no significant difference due to wrist size. Large wrist size is significantly different than medium wrist size.

-2 and 2 About 95% of values from a standard normal distribution fall within ±1.96, so roughly 95% of the studentized residuals should fall between -2 and 2.

Given the properties of the standard normal distribution, you would expect about 95% of the studentized residuals to be between which two values? a. -3 and 3 b. -2 and 2 c. -1 and 1 d. 0 and 1 e. 0 and 2 f. 0 and 3
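A minimal sketch of writing studentized residuals to an output data set with PROC REG so that values outside roughly ±2 can be inspected; the data set and variable names (WORK.FITNESS, Oxygen, RunTime, Age) are hypothetical:

proc reg data=work.fitness;
   model Oxygen = RunTime Age;
   /* RSTUDENT= stores the studentized residuals in the output data set */
   output out=work.check rstudent=rstud;
run;
quit;

data work.suspect;
   set work.check;
   /* keep observations whose studentized residuals fall outside +/- 2 */
   if abs(rstud) > 2;
run;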

Gender Analysis

Guesses the gender of the individual in the text string. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization

Language Guess

Guesses the language of a text string. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization

Locale Guess

Guesses the locale of a text string. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization

High

______________ cardinality refers to columns with a large number of unique values. A. High B. Low C. Median

One Measure

Histogram A. One Measure B. One Category C. One datetime category and any number of other categories or measures. D. Two measures E. Three or more measures F. Multiple categories and measures G. Geography and zero to two measures H. One or more categories and any number of geographies

They both utilize an identification analysis definition from the Quality Knowledge Base.

How are the Field name analysis and Sample data analysis methods similar? They both require the same identification analysis definition from the Quality Knowledge Base. They both utilize an identification analysis definition from the Quality Knowledge Base. They both utilize a match definition from the Quality Knowledge Base. They both require the same match definition from the Quality Knowledge Base.

They both utilize an identification analysis definition from the Quality Knowledge Base.

How are the Field name analysis and Sample data analysis methods similar? A. They both utilize a match definition from the Quality Knowledge Base. B. They both require the same identification analysis definition from the Quality Knowledge Base. C. They both utilize an identification analysis definition from the Quality Knowledge Base. D. They both require the same match definition from the Quality Knowledge Base.

from the Tools menu

How do you access the Data Management Studio Options window? from the Tools menu from the Administration riser bar in the app.cfg file in the DataFlux Data Management Studio installation folder from the Information riser bar

from the Tools menu

How do you access the Data Management Studio Options window? A. from the Tools menu B. from the Administration riser bar C. from the Information riser bar D. in the app.cfg file in the DataFlux Data Management Studio installation folder

Use a CLASS statement.

How do you get PROC TTEST to display the test for equal variance? Use the option EV. Request a plot of the residuals. Use a CLASS statement. Use the MEANS statement with a HOVTEST option.
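A minimal sketch, assuming a hypothetical data set WORK.SCORES with a two-level grouping variable Group and response Score; adding the CLASS statement makes PROC TTEST print the Equality of Variances (folded F) table along with the pooled and Satterthwaite tests:

proc ttest data=work.scores;
   class Group;   /* two-level grouping variable */
   var Score;
run;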

10

How many observations did you find that might substantially influence parameter estimates as a group? a. 0 b. 1 c. 4 d. 5 e. 7 f. 10

Identification Analysis

Identifies the text string as referring to a particular predefined category. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization

D. proc reg data=SASUSER.MLR; model y = x1-x4; run;

Identify the correct SAS program for fitting a multiple linear regression model with dependent variable (y) and four predictor variables (x1-x4). A. proc reg data=SASUSER.MLR; model y = x1 x2 x3 x4 /solution; run; B. proc reg data=SASUSER.MLR; model y = x1; model y = x2; model y = x3; model y = x4; run; C. proc reg data=SASUSER.MLR; var y x1 x2 x3 x4; model y = x1-x4; run; D. proc reg data=SASUSER.MLR; model y = x1-x4; run;

Row-Based

In DataFlux, which rule type evaluates every row in a table? A. Row-Based B. Set-Based C. Group-Based

Group-Based

In DataFlux, which rule type evaluates groups of data (for example, if data is grouped by product code, then the rules are evaluated for each product code)? A. Row-Based B. Set-Based C. Group-Based

Set-Based

In DataFlux, which rule type evaluates the table as a whole (for example, evaluates 1000 rows as a set)? A. Row-Based B. Set-Based C. Group-Based

A bar chart is created if the Model property of the data item is set to Discrete, and a line chart is created if the Model property is set to Continuous.

In SAS Visual Analytics Explorer, when a date data item is dragged onto an Automatic Chart visualization either a bar chart or a line chart will be created. What determines the type of chart created? A. The format applied to the date data item determines the type of chart displayed. B. A bar chart is created if the Model property of the data item is set to Discrete, and a line chart is created if the Model property is set to Continuous. C. The properties associated with the automatic chart determines the type of chart displayed. D. A line chart is created if the Model property of the data item is set to Discrete, a bar chart is created if the Model property is set to Continuous.

A bar chart is created if the Model property of the data item is set to Discrete, and a line chart is created if the Model property is set to Continuous.

In SAS Visual Analytics Explorer, when a date data item is dragged onto an Automatic Chart visualization either a bar chart or a line chart will be created. What determines the type of chart created? Response: A line chart is created if the Model property of the data item is set to Discrete, a bar chart is created if the Model property is set to Continuous. A bar chart is created if the Model property of the data item is set to Discrete, and a line chart is created if the Model property is set to Continuous. The properties associated with the automatic chart determines the type of chart displayed. The format applied to the date data item determines the type of chart displayed.

Goal Seeking

In SAS Visual Analytics Explorer, which feature explores underlying factors by specifying target values for the forecast measure? Response: Forecast Targeting Goal Analysis Scenario Analysis Goal Seeking

In the Data properties tab, change the format of the data item to Year.

In SAS Visual Analytics, the data item Date displays the month, day, and year (MMDDYYYY). How does a content developer display only the year in visualizations or report objects? Response: In the Data properties tab, change the format of the data item to Year. Format the data item using the Roles tab in the right pane. Select Measure Details and change the format of the data item to Year. Right-click on the data item and select New Aggregated Measure.

Data Validation Node

In a data job, to filter rows of data for a specific field from a database table, which node would you select for optimal performance? Data Validation Node SQL Query Node External Data Provider Node Data Source Node

-.7563

In the Analysis of Maximum Likelihood table, using effect coding, what is the estimated logit for someone at IncLevel=2? a. -.5363 b. -.6717 c. -.6659 d. -.7563 e. Cannot tell from the information provided

Data Type

In the context of the QKB, a _______________ is an object that represents the semantic nature of some data value. A _____________ serves as a placeholder (or grouping) for metadata used to define data cleansing and data integration algorithms (called definitions). DataFlux provides many data types in the QKB, but you can also create your own. A. Data Object B. Data Type

Data Job Node The referenced data job (the one that is embedded using the Data Job (reference) node) must have an External Data Provider node as the input. Data is passed from the parent job to the referenced data job, processed, and returned to the flow in the parent job. The Data Job (reference) node is found in the Data Job grouping of nodes.

Is used to embed a data job within a data job. A. Data Node B. Data Job Node

One datetime category and any number of other categories or measures.

Line Chart A. One Measure B. One Category C. One datetime category and any number of other categories or measures. D. Two measures E. Three or more measures F. Multiple categories and measures G. Geography and zero to two measures H. One or more categories and any number of geographies

Specify different sensitivities for some or all the fields.

Match codes fields were generated based on these fields: NAME ADDRESS CITY STATE The Clustering node is "over matching". It is finding matches where there should NOT be matches. What can be done in the Match Codes node to prevent this "over matching"? Select the option "Lower sensitivity levels". Specify different sensitivities for some or all the fields. Nothing can be done from within the Match Codes node. Select the option "Remove over-matched values".

Parse

Parses a text string by attempting to understand which words or phrases should be associated with each token of the specified data type. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization

HOVTEST

Performs a test of homogeneity (equality) of variances. The null hypothesis for this test is that the variances are equal. Levene's test is the default. A. T-TEST B. HOVTEST C. EQUALTEST

Combine

Possible values of Diff type include the following: A record belongs to a set of records from one or more clusters in the left table that are combined into a larger cluster in the right table. A. Combine B. Divide C. Network

Divide

Possible values of Diff type include the following: A record belongs to a set of records in a cluster in the left table that is divided into two or more clusters in the right table. A. Combine B. Divide C. Network

Network

Possible values of Diff type include the following: A record belongs to a set of records that are involved in one or more different multirecord clusters in the left and right tables. A. Combine B. Divide C. Network

False

Predictor variables are assumed to be normally distributed in linear regression models. True False

Diagnostics

Produces a panel display of diagnostic plots for linear models? A. Diagnostics B. Hovtest

Metadata

Profiles are not stored as files, but as ____________. To run a profile via the command line, the Batch Run ID for the profile must be specified. A. Metadata B. Tokens

External Data Provider Node The External Data Provider node has the following characteristics: accepts source data from another job or from user input that is specified at run time; can be used as the first node in a data job that is called from another job; can be used as the first node in a data job that is deployed as a real-time service

Provides a landing point for source data that is external to the current job. A. External Data Provider Node B. External Data Job

Report Viewing Role

Provides commenting and personalization features, in addition to basic functionality. A. Basic Role B. Report Viewing Role C. Data Building Role D. Administration Role E. Theme Designer Admin Role F. Comments Admin Role G. Analysis Role

Basic Role

Provides functionality for guest access (if applicable) and entry-level users. Enables users to view reports in the Visual Analytics Viewer, but does not provide commenting or personalization features. A. Basic Role B. Report Viewing Role C. Data Building Role D. Administration Role E. Theme Designer Admin Role F. Comments Admin Role G. Analysis Role

Theme Designer Admin Role

Provides the ability to create custom themes using Theme Designer for Flex. A. Basic Role B. Report Viewing Role C. Data Building Role D. Administration Role E. Theme Designer Admin Role F. Comments Admin Role G. Analysis Role

Analysis Role

Provides the ability to create reports and explorations, in addition to report viewing functionality. If SAS Visual Statistics is licensed, provides the Build Analytical Model capability. A. Basic Role B. Report Viewing Role C. Data Building Role D. Administration Role E. Theme Designer Admin Role F. Comments Admin Role G. Analysis Role

Comments Admin Role

Provides the ability to delete and edit other users' comments. A. Basic Role B. Report Viewing Role C. Data Building Role D. Administration Role E. Theme Designer Admin Role F. Comments Admin Role G. Analysis Role

Data profiles provide the following benefits: improve understanding of existing databases; aid in identifying issues early in the data management process, when they are easier and less expensive to manage; help determine which steps need to be taken to address data problems; enable you to make better business decisions about your data

Provides the ability to inspect data for errors, inconsistencies, redundancies, and incomplete information. A. Data Profile B. Data Collection

Administration Role

Provides the ability to perform tasks in the administrator, in addition to most other capabilities. A. Basic Role B. Report Viewing Role C. Data Building Role D. Administration Role E. Theme Designer Admin Role F. Comments Admin Role G. Analysis Role

Data Building Role

Provides the ability to prepare data, in addition to the analysis functionality. A. Basic Role B. Report Viewing Role C. Data Building Role D. Administration Role E. Theme Designer Admin Role F. Comments Admin Role G. Analysis Role

Extensible

Rules are no longer limited to well-known contact data. With the customization feature in Data Management Studio, you can create data-cleansing rules for any type of data. A. Fully Customizable B. Extensible C. Modifiable D. Efficient E. Flexible

Modifiable

Rules can be modified to appropriately address the needs of the enterprise and can be implemented across Data Management Studio modules. A. Fully Customizable B. Extensible C. Modifiable D. Efficient E. Flexible

Right-click on the category data item from the Data pane, and select Colors.

SAS Visual Analytics Explorer assigns colors dynamically to category values for grouped visualizations. How would a content developer specify a specific color for a category value? Response: Change the grouping style on the properties tab in the right pane. Right-click on the category data item from the Data pane, and select Colors. Right-click on the category data item from the Data pane, and select New Custom Category. Define a color-mapped value display rule for the category data item.

The odds of the event are 1.142 greater for each one thousand dollar increase in salary.

Salary data are stored in 1000's of dollars. What is a correct interpretation of the estimate? A. The odds of the event are 1.142 greater for each one thousand dollar increase in salary. B. The probability of the event is 1.142 greater for each one thousand dollar increase in salary. C. The probability of the event is 1.142 greater for each one dollar increase in salary. D. The odds of the event are 1.142 greater for each one dollar increase in salary.
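A minimal sketch of how such an estimate might arise; the data set and variable names are hypothetical. Because Salary is stored in thousands, the default odds ratio of 1.142 already corresponds to a one-thousand-dollar increase; a UNITS statement can request odds ratios for other increments:

proc logistic data=work.employment;
   model Hired(event='1') = Salary;
   /* odds ratio for a 10-unit (that is, $10,000) increase in Salary */
   units Salary=10;
run;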

Three or more measures

Scatter Plot Matrix or Correlation Matrix A. One Measure B. One Category C. One datetime category and any number of other categories or measures. D. Two measures E. Three or more measures F. Multiple categories and measures G. Geography and zero to two measures H. One or more categories and any number of geographies

Two measures

Scatter Plot or Heat Map A. One Measure B. One Category C. One datetime category and any number of other categories or measures. D. Two measures E. Three or more measures F. Multiple categories and measures G. Geography and zero to two measures H. One or more categories and any number of geographies

age, body temperature, gas mileage, income The continuous variables are age, body temperature, gas mileage, and income.

Select the choice that lists only continuous variables. a. body temperature, number of children, gender, beverage size b. age, body temperature, gas mileage, income c. number of children, gender, gas mileage, income d. gender, gas mileage, beverage size, income

KERNEL

Superimposes kernel density estimates on the histogram. A. NORMAL B. EXTENDED C. KERNEL
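A minimal sketch of superimposing normal and kernel density curves on a histogram with PROC UNIVARIATE; the data set and variable names (WORK.BODYFAT, PctBodyFat) are hypothetical:

proc univariate data=work.bodyfat;
   var PctBodyFat;
   /* NORMAL overlays a fitted normal curve; KERNEL overlays a kernel density estimate */
   histogram PctBodyFat / normal kernel;
run;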

SAS QKB for Contact Information (CI)

Supports management of commonly used contact information for individuals and organizations, such as names, addresses, company names, and phone numbers. A. SAS QKB for Contact Information (CI) B. SAS QKB for Product Data (PD)

Multiple

The Allow generation of _____________ matchcodes per definition option requires the creation of a special match definition in the QKB. A. Single B. Multiple

Validation The Data Validation node is in the Utilities grouping of nodes.

The Data _________________ node is used to filter or flag rows according to the specified condition(s). A. Import B. Validation C. Output

Field Match

The Field Match report displays a list of the fields in metadata that match a selected field's name. A. Field Name B. Field Relationship C. Field Match

NULL

The Generate null match codes for blank field values option generates a ____________match code if the field is blank. If this option is not selected, then a match code of all $ symbols is generated for the field. When you match records, a field with NULL does not equal another field with NULL, but a field with all $ symbols equals another field with all $ symbols. A. Preview B. Numeric C. NULL

the predicted value of the response when all predictors = 0.

The Intercept estimate is interpreted as: the predicted value of the response when all predictors are at their means. the predicted value of the response when all predictors are at their minimum values. the predicted value of the response when all the predictors are at their current values. the predicted value of the response when all predictors = 0.

Collection

The SAS Quality Knowledge Base (QKB) is a _______________ of files that store data and logic that define data management operations. A. Collection B. Repository

LASR

The SAS ______________ Analytic Server is an analytic platform that provides a secure, multiuser environment for concurrent access to data that is loaded into memory. The SAS ______________ Analytic Server enables the following: persistence of data in memory for a distributed environment; superfast analytic operations on data; reduced start-up times for distributed computing; multiple users to access the same in-memory data in a secure manner A. MBR B. LASR C. DDAD

False

The STEPWISE, BACKWARD, and FORWARD strategies result in the same final model if the same significance levels are used in all three. A. True B. False

Surviving Record Identification

The Surviving Record Identification (SRI) node examines clustered data and determines a surviving record for each cluster. A. Entity Resolution B. Surviving Record Identification

Match

The ____________ Report node produces a report listing the duplicate records identified by the match criteria. ______________ reports are displayed with a special report viewer. A. Match B. Clustering

Table

The ______________ Match report displays a list of database tables that contain matching fields for a selected table or field. A. Field B. Identification C. Table

Correlation

The _______________ matrix visualization enables you to use a matrix of rectangular cells to view the degree of statistical correlation between multiple measures. A. Lateral B. Relationship C. Correlation

Bubble

The _______________ plot visualization enables you to explore the relationship between three measures. Two measures determine the ______________ placement and the third measure determines the _________ size. A. Scatter B. Bubble C. Segmentation

Clustering

The ________________ node enables the specification of an output ______________ ID field and specifications of _____________ conditions. A. Match B. Clustering

Heat Map

The ________________ visualization enables you to display the distribution of values for two data items using colored cells. A. Box B. Box and Whisker C. Heat Map

Quality Knowledge Base (QKB)

The _________________ is a collection of files and configuration settings that contain all the DataFlux Data Management algorithms. A. Collections Repository B. Quality Knowledge Base (QKB)

Entity Resolution

The __________________ File enables you to manually review the merged records and make adjustments as necessary. This can involve the following tasks: examining clusters; reviewing the Cluster Analysis section; reviewing related clusters; processing cluster records; editing fields for surviving records A. Entity Resolution B. Surviving Record Identification

Master Data Foundation

The __________________ feature in Data Management Studio uses master data projects and entity definitions to develop the best possible record for a specific resource, such as a customer or a product, from all of the source systems that might contain a reference to that resource. A. Collection B. Data Connection C. Master Data Foundation

Cluster Diff

The __________________ node is used to compare two sets of clustered records by reading in data from a left and a right table. From each table, the ______________________ node takes two inputs: a numeric record ID field and a cluster number field. A. Cluster Group B. Cluster Diff

Data Source

The ____________________ Details window displays information about the number of rows and columns in the data source and the number used for the exploration. A. Data Source B. Explorer

Identification

The _____________________ Analysis report displays a list of fields in metadata that match categories in the identification analysis definitions specified for field name and sample data analysis. A. Field B. Identification C. Table

Field Relationship

The ______________________ map provides a visual presentation of the field relationships between all of the databases, tables, and fields that are included in the data exploration. A. Field Name B. Field Relationship C. Field Match

Execute Business Rule The Execute Business Rule Properties window allows for the specification of a Return status field, which flags records as either passing (True) or failing (False) the business rule. Not selecting the Return status field will pass only records that pass the business rule to the next node.

The _________________________ node applies an existing, row-based business rule to the rows of data as they flow through a data job. Records either pass or fail the selected rule. A. Execute Business Rule B. Business Rules

Histogram Chart

The __________________ visualization enables you to view the distribution of values for a single measure. A. Bar Chart B. Line Chart C. Histogram Chart

Line

The _______________ chart visualization enables you to view data trends over time. A. Bar B. Line C. Histogram

Sankey

The _______________ diagram visualization enables you to perform path analytics to display flows of data from one event (value) to another as a series of paths. A. Linked B. Network C. Sankey

Network

The ______________ diagram visualization enables you to view the relationships between category values as a series of linked nodes. A. Linked B. Neural C. Network

Clustering

The ______________ node provides the ability to match records based on multiple conditions. Create conditions that support your business needs. A. Match B. Clustering

Ranks

The ______________ tab enables you to view, create, and edit ranks to subset the data in the visualization. A _____________ selects either the top (greatest) or the bottom (least) aggregated value for a category. A. Ranks B. Explorer

Outliers

The ______________ tab lists the X minimum and maximum value outliers. The number of listed minimum and maximum values is specified when the data profiling metrics are set. A. Frequency Distribution B. Frequency Pattern C. Outliers

Bar

The _____________ chart visualization enables you to compare data that is aggregated by the distinct values of a category. A. Bar B. Line C. Histogram

Box

The _____________ plot visualization enables you to view information about the variability of data and the extreme data values. The size and location of the _______________ indicate the range of values that are between the 25th and 75th percentile. A. Box B. Segmentation C. Sankey

Roles

The ____________ tab enables you to view the roles and data item assignments for the selected visualization. A. Data Source B. Explanatory C. Roles

Cardinality

The actual chart depends on the ____________ of the data. A. Type B. Source C. Cardinality

Element, Phrase

The analysis of an individual field can be counted as a whole (phrase) or based on each one of the field's elements. For example, the field value DataFlux Corporation is treated as two permutations if the analysis is set as Element, but is treated only as one permutation if the analysis is set as Phrase. A. Element, Phrase B. Phrase, Element

Constant variance, because the interquartile ranges are different in different ad campaigns.

The box plot was used to analyze daily sales data following three different ad campaigns. The business analyst concludes that one of the assumptions of ANOVA was violated. Which assumption has been violated and why? A. Constant variance, because Prob > F < .0001. B. Normality, because Prob > F < .0001. C. Constant variance, because the interquartile ranges are different in different ad campaigns. D. Normality, because the interquartile ranges are different in different ad campaigns.

Linear

The defining feature of _____________ models is the __________ function of the explanatory variables. A. Linear B. Logistic

Text Analytics

The definition below describes which type of data analytics? Analyzes each value in a document collection as a text document that can contain multiple words. Words that often appear together in the document collection are identified as topics. A. Correlation B. Fit Line C. Forecasting D. Text Analytics

Correlation The strength of a correlation is described as a number between -1 and 1.

The definition below describes which type of data analytics? Identifies the degree of statistical relationship between measures. A. Correlation B. Fit Line C. Forecasting D. Text Analytics

Fit Line A fit line plots a model of the relationship between measures. You can add a fit line to a scatter plot or heat map by using the pop-up menu or the Fit Line option on the Properties tab in the Right pane.

The definition below describes which type of data analytics? Plots a model of the relationship between measures. A. Correlation B. Fit Line C. Forecasting D. Text Analytics

Forecasting

The definition below describes which type of data analytics? Predicts future values based on the statistical trends in your data. A. Correlation B. Fit Line C. Forecasting D. Text Analytics

Measure

The definition describes which type of data classification? Numeric items whose values are used in computations. Measures can be calculated or aggregated. A. Category B. Geography C. Measure D. Hierarchy

Geography

The definition describes which type of data classification? Special role to identify types of geographical information for mapping. A. Category B. Geography C. Measure D. Hierarchy

Category

The definition describes which type of data classification? Used to group and aggregate measures. Categories contain alphanumeric or datetime values. New category data items can be calculated. A. Category B. Geography C. Measure D. Hierarchy

Hierarchy

The definition describes which type of data classification? Used to navigate through the data. Hierarchies are based on category or geography values. A. Category B. Geography C. Measure D. Hierarchy

-j

The dmpexec command can be used to execute profiles and data jobs from the command line. Which of the commands, Executes the job in the specified file. A. -j <file> B. -l <file> C. -c <file> D. -i <file> E. -b <file> F. -o <file>

-o

The dmpexec command can be used to execute profiles and data jobs from the command line. Which of the commands, Overrides settings in configuration files. A. -j <file> B. -l <file> C. -c <file> D. -i <file> E. -b <file> F. -o <file>

-c

The dmpexec command can be used to execute profiles and data jobs from the command line. Which of the commands, Reads the configuration from the specified file. A. -j <file> B. -l <file> C. -c <file> D. -i <file> E. -b <file> F. -o <file>

-i

The dmpexec command can be used to execute profiles and data jobs from the command line. Which of the commands, Specifies job input variables. A. -j <file> B. -l <file> C. -c <file> D. -i <file> E. -b <file> F. -o <file>

-b

The dmpexec command can be used to execute profiles and data jobs from the command line. Which of the commands, Specifies job options for the job being run. A. -j <file> B. -l <file> C. -c <file> D. -i <file> E. -b <file> F. -o <file>

-l

The dmpexec command can be used to execute profiles and data jobs from the command line. Which of the commands, Writes the log to the specified file. A. -j <file> B. -l <file> C. -c <file> D. -i <file> E. -b <file> F. -o <file>
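A hedged example of combining several of these switches on one command line; the paths shown are hypothetical:

dmpexec -c "C:\DataFlux\etc\app.cfg" -j "D:\jobs\standardize_names.ddf" -l "C:\Temp\standardize_names.log"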

Monitor: Control

The final stage in a data management project involves examining any trends to validate the extended use and retention of the data. Data that is no longer useful is retired. The project's success can then be shared throughout the organization. The next steps are communicated to the data management team to lay the groundwork for future data management efforts. A. Plan: Define B. Plan: Discover C. Act: Design D. Act: Execute E. Monitor: Evaluate F. Monitor: Control

7 With three candidate predictors (length1, height, and width), SELECTION=ADJRSQ assesses every model that contains at least one predictor, so 2³ - 1 = 7 subset models are evaluated.

The following SAS Code is submitted: proc reg data=sashelp.fish; model weight=length1 height width / selection=adjrsq; run; How many possible subset models will be assessed by SAS? A. 6 B. 8 C. 5 D. 7

Administration

The locations of the Quality Knowledge Base files are registered on the _________________ riser bar in DataFlux Data Management Studio. There can only be one active QKB at a time. A. Collections B. Folders C. Administration

Data Job

The main way to process data in DataFlux Data Management Studio. Each ____________ specifies a set of data-processing operations that flow from source to target. A. Command B. Routine C. Data Job

60

The maximum number of measures that can be displayed in a correlation matrix is ____? A. 12 B. 50 C. 60

DFFITS and CooksD only The variable Summary_i compresses the indicator variables RStud_i, DFits_i, and CookD_i into a single variable, with values in the order shown in the assignment statement that defines Summary_i. Therefore, the Summary_i value 011 means that the RStudent value did not exceed the cutoff, but the values for DFFITS and CooksD did. Review: Looking for Influential Observations, Part 1, Looking for Influential Observations, Part 2

The observation below is from the data set InfluentialBF. Obs Summary_i Case PredictedValue RStudent DFFITS CutDFFits CooksD CutCooksD 1 011 39 44.8580 -2.6312 -1.5941 0.80322 0.496 0.12903 Assume that these assignment statements were used in creating the data set: CutDFFits=2*(sqrt(&numparms/&numobs)); CutCooksD=4/&numobs; RStud_i=(abs(RStudent)>3); DFits_i=(abs(DFFits)>CutDFFits); CookD_i=(CooksD>CutCooksD); Summary_i=compress(RStud_i||DFits_i||CookD_i); For which statistics did this observation exceed the cutoff criteria? a. RStudent, DFFITS, and CooksD b. RStudent and DFFITS only c. RStudent and CooksD only d. DFFITS and CooksD only

Physical The command line to execute the data job could be similar to the following: call dmpexec -j "D:\Workshop\dqdmp1\Demos\files\batch_jobs\Ch4D2_Products_Misc.ddf" -l "C:\Temp\log1.txt"

The physical path and filename of data jobs must be specified with the -j switch. A. Logical B. Physical

Plan: Define

The planning stage of any data management project starts with this essential first step. This is where the people, processes, technologies, and data sources are defined. Roadmaps that include articulating the acceptable outcomes are built. Finally, the cross-functional teams across business units and between business and IT communities are created to define the data management business rules. A. Plan: Define B. Plan: Discover C. Act: Design D. Act: Execute E. Monitor: Evaluate F. Monitor: Control

Mean

The predicted value in ANOVA is the group _________. A. Mean B. Median

Mean

The predicted value in ANOVA is the group _________. A. Mean B. Median C. Mode

P Value

The probability calculated from the data is called the ____________. A. P Value B. Expected Confidence

Predicted

The regression coefficients are just numbers and they are multiplied by the explanatory variable values. These products are then summed to get the individual's ______________ value. A. Expected B. Predicted

Monitor: Evaluate

This step of the methodology enables users to define and enforce business rules to measure the consistency, accuracy, and reliability of new data as it enters the enterprise. Reports and dashboards on critical data metrics are created for business and IT staff members. The information that is gained from data monitoring reports is used to refine and adjust the business rules. A. Plan: Define B. Plan: Discover C. Act: Design D. Act: Execute E. Monitor: Evaluate F. Monitor: Control

Case

Transforms a text string by changing the case of its characters to uppercase, lowercase, or proper case. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization

Pattern Analysis

Transforms a text string into a particular pattern. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization

Standardization

Transforms a text string into a standard format. A. Case B. Extraction C. Gender Analysis D. Identification Analysis E. Language Guess F. Locale Guess G. Match H. Parse I. Pattern Analysis J. Standardization

False Although every visualization has a common property, Name, most visualizations have additional unique properties.

True or False: All visualizations have exactly the same properties.

True

True or False: If a single value in a group of items needs to be changed, then select Edit > Modify Standards Manually > Single Instance. A single value can then be modified manually. To toggle back to the ability to change all instances in a group, select Edit > Modify Standards Manually > All Instances.

True

True or False: The following types of visualizations are not available to be included in your report: decision trees network diagrams Sankey diagrams treemaps that display additional levels word clouds visualizations that do not contain data geo maps that use a custom geographic data item

True

True or False: There are several ways that SAS Visual Analytics can be deployed, including the following: non-distributed deployment (single server) distributed deployment using a co-located data provider

True

True or False: When using values with high cardinality... Each visualization has a visualization data threshold that controls the amount of high-cardinality data that can be used. Filtering and grouping can be used to limit high-cardinality data. An error message might be displayed and the visualization not produced when the visualization data threshold is exceeded. Visualization data thresholds can be specified in the Preferences window and by an administrator.

True

True or False? Data standardization does not perform a validation of the data (for example, Address Verification). Address verification is a separate component of the DataFlux Data Management Studio application and is discussed in another section.

True

True or False? If you standardize a data value using both a definition and a scheme, the definition is applied first and then the scheme is applied.

True

True or False? Monitoring tasks are created by pairing a defined business rule with one or more events. Some available events include the following: call a realtime service execute a program launch a data flow job on a Management server log error to repository log error to text file raise an event on the process job (if hosted) run a local job run a local profile send email message set a data flow key or value write a row to a table

False A user can have only *ONE* instance of a QKB open at a time. Only one user can have an instance of a QKB open for editing. If another user tries to open the same instance, the user receives a message that he or she can open a Read-only copy. When Data Management Studio is closed, the QKB is also closed.

True or False? QKB Editing Rules: A user can have only two instances of a QKB open at a time. Only one user can have an instance of a QKB open for editing. If another user tries to open the same instance, the user receives a message that he or she can open a Read-only copy. When Data Management Studio is closed, the QKB is also closed.

True

True or False? Record-level rules select which record from a cluster should survive. If there is ambiguity about which record is the survivor, the first remaining record in the cluster is selected.

True Jobs and profiles developed with Data Management Studio can be uploaded to the Data Management Server. Jobs and profiles can be executed on this server, which is intended to be a more powerful processing system. Data Management Server needs access to a copy of the QKB and data packs that are used in the data jobs and profiles.

True or False? The DataFlux Data Management Server is an application server that supports web service requests through a service-oriented architecture (SOA) and executes profiles, data jobs, process jobs, and services on Windows, UNIX, or Linux servers.

True

True or False? The match code generation process consists of the following steps: 1. Data is parsed into tokens (for example, Given Name and Family Name). 2. Ambiguities and noise words are removed (for example, the). 3. Transformations are made (for example, Jonathon > Jon). 4. Phonetics are applied (for example, PH > F). 5. Based on the sensitivity selection, the following occurs: Relevant components are determined. A certain number of characters of the transformed, relevant components are used.

True

True or False? Tukey's HSD Test HSD=Honest Significant Difference This method is appropriate when you consider pairwise comparisons. The experimentwise error rate is equal to alpha when all pairwise comparisons are considered, and less than alpha when fewer than all pairwise comparisons are considered. Also known as the Tukey-Kramer Test

True

True or False? The following data items can be created in the Data pane: custom categories calculated items (unaggregated) aggregated measures derived items duplicate items geography data items document collection (text analytics) unique row identifier (text analytics)

Heat Map

Using SAS Visual Analytics Explorer, a content developer would like to examine the relationship between two measures with high cardinality. Which visualization should the developer use? A. Scatter Plot B. Heat Map C. Scatter Plot Matrix D. Treemap

within-group sample means

What are the "predicted values" that result from fitting a one-way analysis of variance (ANOVA) model? within-group sample variances between-group sample variances within-group sample means between-group mean differences

No lower bound, No upper bound

What are the upper and lower bounds for a logit? a. Lower=0, Upper=1 b. Lower=0, No upper bound c. No lower bound, No upper bound d. No lower bound, Upper=1

Automobile was removed from the CARS metadata.

What causes this window to display? Response: Report object Bar Chart 1 was built with multiple data sources. Automobile was removed from the CARS metadata. The List Table 1 report object was edited, but not saved. The CARS data source no longer exists.

The data job does NOT create an interactive report.

What is the biggest difference between creating a data profile versus creating a data job that incorporates data profile nodes? Response: The data profile does NOT allow you to turn metrics on or off. The data job does NOT allow you to turn metrics on or off. The data job does NOT create an interactive report. The data profile does NOT allow you to apply custom metrics.

Right Fielding node

What type of node would you add in a data job to achieve the results shown in the exhibit? Identification Analysis node Data Validation node Right Fielding node Field Layout node

Data Brushing

When a bar is selected in the bar chart, the markers in the scatter plot that correspond to the selected value in the bar area are highlighted. What is this feature in SAS Visual Analytics Explorer called? Response: Data Brushing File interaction Report level display rules Conditional Highlighting

Target

When choosing Output Field settings, which of the options sends all fields available to target nodes to the target? A. Target B. Source and Target C. All

All

When choosing Output Field settings, which of the options specifies All available fields are passed through source nodes, target nodes, and all intermediate nodes. A. Target B. Source and Target C. All

Source and Target

When choosing Output Field settings, which of the options specifies All fields available to a source node are passed to the next node and all fields available to target nodes are passed to the target. A. Target B. Source and Target C. All

dmserver.cfg

When configuring options for the Data Management Server...which config file describes the settings below? DMSERVER/SOAP/LISTEN_PORT= PORT specifies the TCP port number where the server will listen for SOAP connections. DMSERVER/LOGCONFIG_PATH= PATH specifies the path to the logging configuration file. A. app.cfg B. dmserver.cfg
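For illustration only, the corresponding entries in dmserver.cfg might look like the lines below; the port number and path are hypothetical values, not defaults:

DMSERVER/SOAP/LISTEN_PORT = 21036
DMSERVER/LOGCONFIG_PATH = C:\DataFlux\DMServer\etc\dmserver.log.xml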

app.cfg

When configuring options for the Data Management Server...which config file describes the settings below? QKB/PATH = PATH specifies the location of the active Quality Knowledge Base. VERIFY/USPS = PATH specifies the location of USPS reference source. VERIFY/GEO = PATH specifies the location of Geo/Phone reference source. A. app.cfg B. dmserver.cfg

Lower

When creating folders, it is best practice to set folder names in _____________ with no spaces. A. Lower B. Upper

Correlation Matrix

When creating scatter plots...Three or more measures with high cardinality generate which type of graph? A. Scatter Plot B. Scatter Plot Matrix C. Heat Map D. Correlation Matrix

Scatter Plot Matrix

When creating scatter plots...Three or more measures with low cardinality generate which type of graph? A. Scatter Plot B. Scatter Plot Matrix C. Heat Map D. Correlation Matrix

Heat Map

When creating scatter plots...Two measures with high cardinality generate which type of graph? A. Scatter Plot B. Scatter Plot Matrix C. Heat Map D. Correlation Matrix

Scatter Plot

When creating scatter plots...two measures with low cardinality generate which type of graph? A. Scatter Plot B. Scatter Plot Matrix C. Heat Map D. Correlation Matrix

Root Mean Square Error

When forecasting...which criterion is used to determine the best model? A. Default B. Average Square Error C. Root Mean Square Error

4 GB

When importing files from your local machine, you are limited to a file size of __ GB or less. This limitation is introduced by web browsers. A. 2 GB B. 4 GB C. 8 GB

Batch Jobs

When importing to a Data Management Server, Each defined Data Management Server has a series of predefined folders. Selecting ____________ (for example) enables the Import tool in the navigation area, as well as in the main information area. A. Data Jobs B. Batch Jobs

Logit

When modeling a categorical variable, which function is used? A. Link B. Logit

Link

When modeling an interval variable, which function is used? A. Link B. Logit

ABANDONED

When parsing, which term best describes the description below? A resource limit was reached. Increase your resource limit and try again. A. OK B. NO SOLUTION C. NULL D. ABANDONED

NULL

When parsing, which term best describes the description below? The parse operation was not attempted. This result occurs only when a null value was in the field to be parsed and the Preserve null values option was enabled. A. OK B. NO SOLUTION C. NULL D. ABANDONED

OK

When parsing, which term best describes the description below? The parse operation was successful. A. OK B. NO SOLUTION C. NULL D. ABANDONED

NO SOLUTION

When parsing, which term best describes the description below? The parse operation was unsuccessful; no solution was found. A. OK B. NO SOLUTION C. NULL D. ABANDONED

Linear

When reviewing the Fit Line results...which of the terms describes the definition below? Creates a linear fit line from a linear regression algorithm. A linear fit line produces the straight line that best represents the relationship between two measures. For a linear fit, correlation information is automatically added to the visualization. A. Best Fit B. Linear C. Quadratic D. Cubic E. PSpline

PSpline

When reviewing the Fit Line results...which of the terms describes the definition below? Creates a penalized B-spline fit. A penalized B-spline is a smoothing spline that fits the data closely. A penalized B-spline can display a complex line with many changes in its curvature. A. Best Fit B. Linear C. Quadratic D. Cubic E. PSpline

Quadratic

When reviewing the Fit Line results...which of the terms describes the definition below? Produces a line with a single curve. A quadratic fit line produces a line with the shape of a parabola. A. Best Fit B. Linear C. Quadratic D. Cubic E. PSpline

Cubic

When reviewing the Fit Line results...which of the terms describes the definition below? Produces a line with two curves. A cubic fit line often produces a line with an "S" shape. A. Best Fit B. Linear C. Quadratic D. Cubic E. PSpline

Best Fit

When reviewing the Fit Line results...which of the terms describes the definition below? Tests the cubic, quadratic, and linear fit methods against your data and selects the fit method that produces the best result. The Best Fit method uses backward selection to select the highest-order model that is significant. To see which fit method was used, select the information icon from the visualization legend. A. Best Fit B. Linear C. Quadratic D. Cubic E. PSpline

- 2 Log L increased.

When selecting variables or effects using SELECTION=BACKWARD in the LOGISTIC procedure, the business analyst's model selection terminated at Step 3. What happened between Step 1 and Step 2? A. DF increased. B. AIC increased. C. Pr > Chisq increased. D. - 2 Log L increased.

- 2 Log L increased.

When selecting variables or effects using SELECTION=BACKWARD in the LOGISTIC procedure, the business analyst's model selection terminated at Step 3. What happened between Step 1 and Step 2? A. Pr > Chisq increased. B. - 2 Log L increased. C. AIC increased. D. DF increased.

Preserve

When standardizing, selecting _________________ null values ensures that if a field is null when it enters the node, then the field is null after being output from the node. It is recommended that this option be selected if the output is written to a database table. A. Import B. Preserve C. Archive

Low

When the p-value is ____________, it provides doubt about the truth of the null hypothesis. A. High B. Low

The simplest model with the best performance on the validation data

When using honest assessment, which of the following would be considered the best model? a. The simplest model with the best performance on the training data b. The simplest model with the best performance on the validation data c. The most complex model with the best performance on the training data d. The most complex model with the best performance on the validation data

Metadata

When you select Create a New Collection, you need to specify the _______________ location where the collection should be stored. A. Report B. Metadata

Parse Definition

Which Quality Knowledge Base (QKB) definition type is used by almost every other definition type? Response: Case Definition Parse Definition Standardization Definition Match Definition

Explorer

Which SAS Visual Analytics feature provides the below: an enhanced decision tree visualization, which includes interactive training and model assessment information a linear regression visualization, which creates predictive models for measure variables a logistic regression visualization, which creates predictive models for category variables a generalized linear model visualization, which creates predictive models for measure variables a cluster visualization, which segments the input data into clusters model comparison, which compares two or more predictive models A. Report Viewer B. Explorer C. Summary

C. proc GLMSELECT data=SASUSER.MLR; model y = x1-x10 /selection=backward select=bic; run;

Which SAS program will correctly use backward elimination with BIC selection criterion within the GLMSELECT procedure? A. proc GLMSELECT data=SASUSER.MLR; model y = x1-x10 /select=backward choose=bic; run; B. proc GLMSELECT data=SASUSER.MLR; model y = x1-x10 /select=backward selection=bic; run; C. proc GLMSELECT data=SASUSER.MLR; model y = x1-x10 /selection=backward select=bic; run; D. proc GLMSELECT data=SASUSER.MLR; model y = x1-x10 /selection=backward choose=bic; run;
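A minimal sketch restating the correct answer with comments (data set and variable names are taken from the question). In general, SELECTION= names the selection method and SELECT= names the criterion used to judge effects at each step; a separate CHOOSE= option, not needed here, names the criterion for picking the final model.

proc glmselect data=SASUSER.MLR;
   /* SELECTION= : the elimination method (backward)          */
   /* SELECT=    : the criterion (BIC) evaluated at each step */
   model y = x1-x10 / selection=backward select=bic;
run;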

Inner join and Address=Address

Which join type and expression would you use to join records from two different sources to find only matching records of the same address? Response: Left join and Address=Address Full join and ID=ID Right join and Address=Address Inner join and Address=Address

Expression Engine Language

Which language do you use to create a business rule using the Expression tab? Expression Builder Language Expression Engine Language Expression Monitor Language Expression Rule Language

Export the exploration as an image.

Which method is NOT used to share Explorations from SAS Visual Analytics Explorer? Response: Export the exploration as a report. Export the exploration as a PDF. Email a link to the exploration. Export the exploration as an image.

None of the above

Which of the following assumptions does collinearity violate? a. Independent errors b. Constant variance c. Normally distributed errors d. None of the above

SQL

Which of the statements below describes this querying method? The data generated for both the __________ query and the filter have the same results. The filter pulled all records. The filter was processed on the machine where the profile was run. The database does the filtering for the ________ query. A. Filtering B. SQL

Condition matched field prefix

Which option in the properties of a Clustering node allows you to identify which clustering condition was satisfied? A. Condition matched field prefix B. Cluster condition field matched C. Cluster condition field count D. Cluster condition met field

Condition matched field prefix

Which option in the properties of a Clustering node allows you to identify which clustering condition was satisfied? Response: Cluster condition field matched Condition matched field prefix Cluster condition field count Cluster condition met field

records that have null or missing company fields

Which records will pass to the next node in the data job flow? records that have null or missing company fields no records records that do not have null or missing company fields all records

Data types are comprised of one or more tokens.

Which statement describes the relationship between data types and tokens? Data types are comprised of one or more tokens. Data types and tokens are interchangeable. Tokens are comprised of one or more data types. There is no relationship between these two items.

Gender should not be removed due to its involvement in the significant interaction.

Which statement is correct at an alpha level of 0.05? School should be removed because it is significant. Gender should not be removed due to its involvement in the significant interaction. School*Gender should be removed because it is non-significant. Gender should be removed because it is non-significant.

Facility Opening Date (Day) can be used outside of the hierarchy after selecting it in the

Which statement is true about Facility Opening Date (Day)? Response: Facility Opening Date (Day) values can only be used as the lowest level of the hierarchy. Facility Opening Date (Day) can be used outside of the hierarchy after selecting it in the Because it is part of a hierarchy, the format for Facility Opening Date (Day) cannot be changed. As a member of a hierarchy, Facility Opening Date (Day) is a virtual data item without properties.

Files and definitions

Which two types of items comprise the Quality Knowledge Base (QKB)? Files and repository Definitions and reference data sources Files and reference data sources Files and definitions

Group

Which type of business rule do you create to check for Countries that produce less than three items? Group Column Row Set

R Square

Which value tends to increase (can never decrease) as you add predictor variables to your regression model? A. R square B. Adjusted R square C. Mallows' Cp D. Both a and b E. F statistic F. All of the above

Crosstab and Table

Which visualization enables you to sort the data in columns? A. Crosstab B. Table C. None

File

Within the Data Mgmt Studio Repository, the ___________ storage of a repository can contain the following: data jobs process jobs match reports entity resolution files queries entity definitions other files A. Data B. File

Data

Within the Data Mgmt Studio Repository, the ___________ storage of a repository can contain the following: explorations and reports profiles and reports business rules monitoring results custom metrics business data information master data information A. Data B. File

%

Within the Standardization Scheme, which of these commands provides an indicator specifying that the matched word or phrase is not updated? A. //Remove B. %

//Remove

Within the Standardization Scheme, which of these commands removes the matched word or phrase from the input string? A. //Remove B. %

Define

Within the _______________ methodology, there are four main functions which can be used: Connect to Data Explore Data Define Business Rules Build Schemes A. Define B. Discover

The errors are independent, normally distributed with zero mean and constant variance.

Y = B0 + B1X + E Which statement best summarizes the assumptions placed on the errors? A. The errors are correlated, normally distributed with constant mean and zero variance. B. The errors are correlated, normally distributed with zero mean and constant variance. C. The errors are independent, normally distributed with constant mean and zero variance. D. The errors are independent, normally distributed with zero mean and constant variance.

Parsing node

You are creating a data job to apply a data cleansing process to an input data field containing city, state and postal code data. You would like to create individual fields from the components of the data values, with the resulting data being written into individual fields for City, State/Province and Postal Code. Which node would you use to accomplish this result? Right Fielding node Parsing node Identification Analysis node Standardization node

Standardization Scheme Regular Expression library

You are working with the data in the exhibit below that represents peoples' last name (or surname). You would like to ensure the proper casing is applied to the data using a Case definition. MacAlister MacDonald McCarthy McDonald McNeill Which two Quality Knowledge Base (QKB) file components can you use within the Case definition to accomplish this task? (Choose two.) Vocabulary Standardization Scheme Regular Expression library Phonetics library

Flexible

You can customize rules to conform to the ever-changing business environment regardless of your data needs. A. Fully Customizable B. Extensible C. Modifiable D. Efficient E. Flexible

Efficient

You can dramatically reduce manual data manipulation time by simply updating cleansing rules. It is much easier to manipulate reusable data-cleansing rules than to manually manipulate the data itself. A. Fully Customizable B. Extensible C. Modifiable D. Efficient E. Flexible

Data Collection

You can use _____________ to group data fields from different tables, database connections, or both. These collections can be used as input data sources for profiles. A. Repository B. Data Collection C. Data Exploration

Select the Collections riser on the Report tab, then select the collection name, right-click and select Profile Field.

You create an exploration that results in a collection of five similar fields across five disparate tables. Afterwards, what do you do to check the collection for null values and frequency distributions from the exploration? Response: Select the Report tab, then select Actions on the menu bar, and select Profile Collection Fields. Select the Collections riser on the Report tab, then select the collection name, right-click and select Profile Field. Select the Properties tab and check the Profile Collection Fields check box. Select the Report tab, then select Tools on the menu bar, and select the Profile Collection Fields.

Fully Customizable

You have full control of data-cleansing rules across the enterprise and through time. A. Fully Customizable B. Extensible C. Modifiable D. Efficient E. Flexible

Create a Data Source Name (DSN) in the operating system. Create a data connection in DataFlux Data Management Studio.

You need to run a profile analysis against a data table. Which two methods can you use to access the data table in DataFlux Data Management Studio? (Choose two.) Select Open Table from the File menu in DataFlux Data Management Studio. Create a Data Source Name (DSN) in the operating system. Create a data connection in DataFlux Data Management Studio. Create a library definition in DataFlux Data Management Studio.

From the Tools menu, select Other QKB Editors and select the appropriate editor. From the Administration riser bar, open the QKB that contains the standardization scheme.

You need to update a standardization scheme. Which two ways can you access the appropriate editor for the standardization scheme in the Quality Knowledge Base (QKB)? (Choose two.) From the Tools menu, select Other QKB Editors and select the appropriate editor. From the Folders riser bar, access the repository that contains the standardization scheme. From the Administration riser bar, open the QKB that contains the standardization scheme. From the File menu, select Edit and select the appropriate editor. From the Data riser bar, select the Data Connection that contains the standardization scheme.

Parse

_____________ definitions define rules to place the words from a text string into the appropriate tokens. A. Parse B. Text C. Case

Field Name

______________ analysis analyzes the names of each field from the selected data sources to determine which identity to assign to the field. A. Identification B. Field Name C. Sample Data

Case

______________ definitions are algorithms that can be used to convert a text string to uppercase, lowercase, or proper case. A. Parse B. Text C. Case

Roles SAS Visual Analytics is shipped with five predefined roles. Visual Analytics: Administration Visual Analytics: Analysis Visual Analytics: Basic Visual Analytics: Data Building Visual Analytics: Report Viewing

________________ are mapped to capabilities. A capability, also known as an application action, defines the operations that a user can perform. A. Roles B. Levels C. Measurements

Address Verification

________________ identifies, corrects, and enhances address information. A. Address Validation B. Address Verification

Dunnett

________________ method is recommended when there is a true control group. When appropriate (when a natural control category exists, against which all other categories are compared) it is more powerful than methods that control for all possible comparisons. A. Levene B. Tukey C. Dunnett

Forecasting A forecast adds a line with a predicted value and a colored band that represents the confidence interval. Forecasting is available only for line charts that include a datetime data item. No forecasting is available if data items are assigned to the Group, Lattice columns, or Lattice rows roles. The forecasting duration (in intervals) can be selected on the Properties tab in the Right pane. The default duration is six intervals.

________________ predicts future values based on the statistical trends in your data. A. Forecasting B. Predictive Modeling

Sample Data

_________________ analysis analyzes a sample of data in each field to determine which identity to assign to the field. A. Identification B. Field Name C. Sample Data

Data Connection

__________________ are used to access data in jobs, profiles, data explorations and data collections. A. Collection B. Data Connection C. Master Data Foundation

Goal Seeking

__________________ enables you to specify a target value for your forecast measure to determine the values of the underlying factors that are required to achieve that value. A. Scenario B. Goal Seeking

Entity Resolution

__________________ is the process of merging duplicate records in a single file or multiple files so that records referring to the same physical object are treated as a single record. Records are matched based on the information that they have in common. The records that are merged might appear to be different, but can actually refer to the same person or item. A. Entity Match B. Entity Resolution C. Match Entity

Automatic

The ____________________ chart is the default visualization type. A. Manual B. Default C. Automatic

Data Exploration

______________________ have the following types of analysis methods: field name matching field name analysis sample data analysis A. Repository B. Data Collection C. Data Exploration

Document Collection When using word clouds with text analytics, you can choose to analyze the document sentiment.

____________________________ is a category data item that contains the words that you want to analyze. A. Document Collection B. Metadata

Visualizations

__________________________ that have no data items assigned to required roles are not available to include in your PDF output. A. Images B. Text C. Visualizations

Scenario

_______________ analysis enables you to forecast hypothetical scenarios by specifying the future values for one or more underlying factors that contribute to the forecast. A. Scenario B. Goal Seeking

Geocoding Geocoding latitude and longitude information can be used to map locations and plan efficient delivery routes. Geocoding can be licensed to return this information for the centroid of the postal code or at the roof-top level. Currently, there are only geocoding data files for the United States and Canada. Also, roof-top level geocoding is currently available only for the United States.

_______________ enhances address information with latitude and longitude values. A. Geo Validation B. Geocoding

Identification, Right

______________ analysis and ___________ fielding use the same definitions from the QKB, but in different ways. ______________ analysis identifies the type of data in a field, and __________ fielding moves the data into separate fields based on its identification. Both the ___________ analysis and _________ fielding examples above use the Contact Info identification analysis definition. A. Identification, Right B. Right, Identification

GLM

ods graphics; proc _________ data=STAT1.ameshousing3 plots=diagnostics; class Heating_QC; model SalePrice=Heating_QC; means Heating_QC / hovtest=levene; format Heating_QC $Heating_QC.; title "One-Way ANOVA with Heating Quality as Predictor"; run; quit; A. SGPLOT B. SGSCATTER C. GLM

PLOTS= FREQPLOT

requests a frequency plot. Frequency plots are available for frequency and crosstabulation tables. For multiway crosstabulation tables, PROC FREQ provides a two-way frequency plot for each stratum (two-way table). A. PLOTS= FREQPLOT B. PLOTS= FREQUENCY

95% You want to be as confident as possible, but if you increase the confidence level too much, you risk confidence bounds that approach negative and positive infinity.

A 95% confidence interval represents a range of values within which you are _______ certain that the true population mean exists. A. 5% B. 95%

GLM

PROC __________DATA=SAS-data-set PLOTS=options; CLASS variables; MODEL dependents=independents </ options>; MEANS effects </ options>; LSMEANS effects </ options>; OUTPUT OUT=SAS-data-set keyword=variable...; RUN; QUIT; A. SGPLOT B. SGSCATTER C. GLM

Gaussian

A Normal Distribution bell curve is also known as a ___________ distribution? A. Gaussian B. Expected

One-Sided

A _____________ t-test compares the mean calculated from a sample to a hypothesized mean. The null hypothesis of the test is generally that the difference between the two means is zero. A. One-Sided B. Two-Sided

a. a PROC GLMSELECT step that contains the SCORE statement b. a PROC PLM step that contains the SCORE statement and references an item store that was created in PROC GLMSELECT c. a PROC PLM step with the CODE statement that writes the score code based on item store created in PROC GLMSELECT, and a DATA step that scores the data d. any of the above Any of these approaches can be used to score data based on the model built by PROC GLMSELECT. Review: Methods of Scoring

A department store is deploying a chosen model to make predictions for an upcoming sales period. They have the necessary data and are ready to proceed. Which of the following methods can be used for scoring? a. a PROC GLMSELECT step that contains the SCORE statement b. a PROC PLM step that contains the SCORE statement and references an item store that was created in PROC GLMSELECT c. a PROC PLM step with the CODE statement that writes the score code based on item store created in PROC GLMSELECT, and a DATA step that scores the data d. any of the above
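A hedged sketch of option (c); the item store, data set, and score-code file names are hypothetical:

proc plm restore=work.homestore;     /* item store saved earlier by PROC GLMSELECT */
   code file='score_code.sas';       /* write DATA step scoring code to a file */
run;

data work.new_scored;
   set work.new;                     /* observations to be scored */
   %include 'score_code.sas';        /* apply the generated score code */
run;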

Straight

A linear association between two continuous variables can be inferred when the general shape of a scatter plot of the two variables is a __________ line. A. Straight B. Curved

Standard Error

A statistic that measures the variability of your estimate is the ___________ of the mean. A. Variability B. Standard Error

Model 1 Models 1 and 3 are better than Model 2 because they have lower values of AIC and SC. Of those two, Model 1 has the higher c statistic, so it is the best of the three models. Review: Comparing the Binary and Multiple Logistic Regression Models, Fitting a Binary Logistic Regression Model

According to the goodness-of-fit statistics shown below, which multiple logistic regression model would be the best to use? Statistic Model 1 Model 2 Model 3 AIC 501.5 520.4 501.5 SC 501.5 520.4 501.5 c 0.675 0.675 0.655 a. Model 1 b. Model 2 c. Model 3

yes, Hip and Abdomen Hip and Abdomen both have p-values lower than .05, so they are statistically significant in predicting or explaining the variability of the percentage of body fat. Review: Performing Simple Linear Regression, Analysis versus Prediction in Multiple Regression, Fitting a Multiple Linear Regression Model

According to these parameter estimates, are any of the variables in the model statistically significant in predicting or explaining the percentage of body fat? Parameter Estimates Variable DF Parameter Estimate Standard Error t Value Pr > |t| Intercept 1 -20.98714 5.55433 -3.78 0.0002 Age 1 0.01226 0.02836 0.43 0.6658 Hip 1 -0.40163 0.09994 -4.02 <.0001 Abdomen 1 0.86123 0.06814 12.64 <.0001 a. no b. yes, Age c. yes, Hip and Abdomen d. yes, Age, Hip, and Abdomen

a fairly strong, negative linear relationship The correlation coefficient for the relationship between Performance and RunTime is -0.82049, which is negative, and its absolute value is close to 1, making it a relatively strong relationship. Review: Using Correlation to Measure Relationships between Continuous Variables

Based on this correlation matrix, what type of relationship do Performance and RunTime have? Pearson Correlation Coefficients, N = 31 Prob > |r| under H0: Rho=0 Performance RunTime Age Performance 1.00000 -0.82049 <.0001 -0.71257 <.0001 RunTime -0.82049 <.0001 1.00000 0.19523 0.2926 Age -0.71257 <.0001 0.19523 0.2926 1.00000 a. a fairly strong, positive linear relationship b. a fairly strong, negative linear relationship c. a fairly weak, positive linear relationship d. a fairly weak, negative linear relationship

NORMAL

Creates a normal probability plot. Options (MU= SIGMA=) determine the mean and standard deviation of the normal distribution used to create reference lines (normal curve overlay in HISTOGRAM and diagonal reference line in PROBPLOT). A. NORMAL B. EXTENDED

POSITION=NE

Determines the position of the inset. The position is a compass point keyword, a margin keyword, or a pair of coordinates. You can specify coordinates in axis percent units or axis data units. The default value is NW. A. POSITIONPLOT B. POSITION=NE

median The median is not affected by outliers and is less affected by the skewness. The mean, on the other hand, averages in any outliers that might be in your data.

For an asymmetric (or skewed) distribution, which of the following statistics is a good measure for the middle of the data? a. mean b. median c. either mean or median

STEPWISE The summary table contains both Variable Entered and Variable Removed columns. Of the three types of stepwise selection (forward, backward, and stepwise), only stepwise selection can both enter and remove variables. Therefore, STEPWISE must have been specified in the PROC REG step. Review: The Stepwise Selection Approach to Model Building, The GLMSELECT Procedure, The GLMSELECT Procedure: Performing Stepwise Regression

Given the information in this summary of variable selection, which stepwise selection method was specified in the PROC REG step? Step Variable Entered Variable Removed Number Vars In Partial R-Square Model R-Square C(p) F Value Pr > F 1 RunTime 1 0.7434 0.7434 3.3432 84.00 <.0001 2 Age 2 0.0213 0.7647 2.8192 2.54 0.1222 a. FORWARD b. BACKWARD c. STEPWISE d. can't tell from the information given

no The p-value of 0.2942 is greater than 0.05, so you fail to reject the null hypothesis and conclude that the variances are equal. Review: The GLM Procedure

Given this SAS output, is there sufficient evidence to reject the assumption of equal variances? a. yes b. no

yes The p-value of <.001 is less than 0.05, so you would reject the null hypothesis and conclude that the means between the two groups are significantly different. Review: Examining the Equal Variance t-Test and p-Values

Given this SAS output, is there sufficient evidence to reject the hypothesis of equal means? a. yes b. no

36.1680 and 52.3021 The CLI option, which displays the 95% CL Predict column in the Output Statistics table, produces confidence limits for an individual predicted value. In this table, the third observation, for Kate, contains the value 55 for Performance. Therefore, the values in her 95% CL Predict column are the lower and upper confidence limits for a new individual value at the same value of Performance. In contrast, the CLM option displays the values in the 95% CL Mean column, which are the lower and upper confidence limits for a mean predicted value for each observation. Review: Specifying Confidence and Prediction Intervals in SAS, Viewing and Printing Confidence Intervals and Prediction Intervals, The REG Procedure: Producing Predicted Values

Here is a table of output statistics from PROC REG. If you sample a new value of the dependent variable when Performance equals 55, what are the lower and upper prediction limits for this newly sampled individual value? Output Statistics Obs Name Performance Dependent Variable Predicted Value Std Error Mean Predict 95% CL Mean 95% CL Predict Residual 1 Jack 48 40.8400 44.9026 1.0190 42.0732 47.7319 37.4190 52.3861 -4.0626 2 Annie 43 45.1200 45.3793 1.3081 41.7475 49.0112 37.5570 53.2016 -0.2593 3 Kate 55 44.7500 44.2351 1.4885 40.1023 48.3678 36.1680 52.3021 0.5149 4 Carl 40 46.0800 45.6654 1.6493 41.0862 50.2446 37.3608 53.9700 0.4146 5 Don 58 44.6100 43.9490 1.8646 38.7719 49.1261 35.3003 52.5977 0.6610 6 Effie 45 47.9200 45.1886 1.1361 42.0343 48.3429 37.5763 52.8009 2.7314 a. 44.7500 and 44.2351 b. 40.1023 and 48.3678 c. 36.1680 and 52.3021 d. can't tell from the information given
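A minimal sketch of how such an Output Statistics table can be requested; the data set and response variable names are hypothetical, while CLM and CLI are the options that add the 95% CL Mean and 95% CL Predict columns, respectively:

proc reg data=work.fitness;
   model Y = Performance / p clm cli;   /* predicted values, CL for the mean, CL for an individual prediction */
run;
quit;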

the most parsimonious model The most parsimonious model is selected. The most parsimonious model is the simplest, least complex of the candidate models. Review: Building a Predictive Model

Honest assessment might generate multiple candidate models that have the same (or nearly the same) validation assessment values. In this situation, which model is selected? a. the model that has the highest variance when it is applied to the population b. the model that has the most terms c. the most parsimonious model d. the most biased model

the measure of the ability of the statistical hypothesis test to reject the null hypothesis when it is actually false Power is the ability of the statistical test to detect a true difference, or the ability to successfully reject a false null hypothesis. The probability of committing a Type I error is α. The probability of failing to reject the null hypothesis when it is actually false is a Type II error.

How do you define the term power? a. the measure of the ability of the statistical hypothesis test to reject the null hypothesis when it is actually false b. the probability of committing a Type I error c. the probability of failing to reject the null hypothesis when it is actually false

CLASS Statement

How do you tell PROC TTEST that you want to do a two-sample t-test? a. SAMPLE=2 option b. CLASS statement c. GROUPS=2 option d. PAIRED statement

5 In Mallows' Cp criterion, p equals the number of variables in the model plus 1 for the intercept. Therefore, for these models, p equals 8, 9, or 10, depending on the number of terms in the model. All the C(p) values are less than their respective p values, so all five models meet Mallows' Cp criterion. Review: Evaluating Models Using Mallows' Cp Statistic, Viewing Mallows' Cp Statistic in PROC REG, The REG Procedure: Using the All-Possible Regressions Technique, The REG Procedure: Using Automatic Model Selection

How many of the following models meet Mallows' Cp criterion for model selection? Model Index Number in Model C(p) R-Square Variables in Model 1 7 5.8653 0.7445 Age Weight Neck Abdomen Thigh Forearm Wrist 2 8 5.8986 0.7466 Age Weight Neck Abdomen Hip Thigh Forearm Wrist 3 8 6.4929 0.7459 Age Weight Neck Abdomen Thigh Biceps Forearm Wrist 4 9 6.7834 0.7477 Age Weight Neck Abdomen Hip Thigh Biceps Forearm Wrist 5 7 6.9017 0.7434 Age Weight Neck Abdomen Biceps Forearm Wrist a. 0 b. 1 c. 3 d. 5
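Working through the first row as an example: Model 1 has 7 variables, so p = 7 + 1 = 8 and C(p) = 5.8653 < 8; repeating the check for the other rows shows that every C(p) is below its p, so all five models meet the criterion. A hedged sketch of how such a table can be produced (data set and response variable names are hypothetical):

proc reg data=work.bodyfat;
   model PctBodyFat = Age Weight Neck Abdomen Hip Thigh Biceps Forearm Wrist
         / selection=cp best=5;   /* rank all possible subsets by Mallows' Cp */
run;
quit;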

T-Test

If you analyze the difference between two means using ANOVA, you reach the same conclusions as you reach using a pooled, two-group _________. A. T-Test B. Analysis

Report the F value and possibly remove the blocking factor from future studies. Your only choice is to report the F value, and if you plan future studies, do not include the blocking variable. The blocking factor must be included in all ANOVA models that you calculate with the sample that you've already collected. Review: Performing ANOVA with Blocking

If your blocking variable has a very small F value in the ANOVA report, what would be a valid next step? a. Remove it from the MODEL statement and re-run the analysis. b. Test an interaction term. c. Report the F value and possibly remove the blocking factor from future studies.

tables Country Size Country*Size; You use the TABLES statement in PROC FREQ to create frequency and crosstabulation tables. In the TABLES statement, you separate table requests with a space. In a table request for a crosstabulation table, you specify an asterisk between the variable names. Review: Crosstabulation Tables

In a PROC FREQ step, which statement or set of statements creates a frequency table for Country, a frequency table for Size, and a crosstabulation table for Country by Size? a. tables Country, Size, Country*Size; b. tables Country*Size; c. tables Country | Size; d. tables Country Size Country*Size;
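A minimal runnable sketch using the correct TABLES statement; the data set name is hypothetical:

proc freq data=work.cars;
   tables Country Size Country*Size;   /* two one-way tables plus one crosstabulation */
run;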

Populations

In inferential statistics, the focus is on learning about ______________. Examples of ___________ are all people with a certain disease, all drivers with a certain level of insurance, or all customers, both current and potential, at a bank. A. Populations B. Volumes

no The most complex model is not always the best choice. An overly complex model might be too flexible, which can lead to overfitting. Review: Model Complexity

In predictive modeling, is the most complex model the best choice? a. yes b. no

that the errors are normally distributed The Residuals versus Quantile plot is a normal quantile plot of the residuals. Using this plot, you can verify that the errors are normally distributed, which is one of our assumptions. Here the residuals follow the normal reference line pretty closely, so we can conclude that the errors are normally distributed. Review: The REG Procedure: Producing Default Diagnostic Plots

In the diagnostic plots below, what does the Residual versus Quantile plot indicate about the model? a. that the errors are normally distributed b. that the data set contains many influential observations c. that the model is inadequate because the spread of the residuals is less than the spread of the centered fit d. that the model is inadequate because patterns occur in the spread around the reference line

the SCORE= option The SCORE= option specifies the data set that contains the parameter estimates. PROC SCORE reads the parameter estimates from this data set, scores the observations in the data set that the DATA= option specifies, and writes the scored observations to the data set that the OUT= option specifies. Review: The SCORE Procedure: Scoring Predicted Values Using Parameter Estimates

In this PROC SCORE step, which option specifies the data set containing the parameter estimates that are used to score observations? proc score data=dataset1 score=dataset2 out=dataset3 type=parms; var Performance; run; a. the DATA= option b. the SCORE= option c. the OUT= option

UNIVARIATE

PROC _____________ DATA=SAS-data-set <options>; VAR variables; HISTOGRAM variables </ options>; INSET keywords </ options>; RUN; A. FREQ B. UNIVARIATE

FREQ PROC FREQ can generate large volumes of output as the number of variables or the number of variable levels (or both) increases.

PROC _____________ DATA=SAS-data-set; TABLES table-requests </ options>; RUN; A. FREQ B. UNIVARIATE

both parametric and non-parametric models Predictive models can be based on both parametric and non-parametric models. Review: What Is Predictive Modeling?

Predictive models can be based on which of the following? a. parametric models only b. non-parametric models only c. both parametric and non-parametric models

proc univariate data=statdata.sleep mu0=8; var hours; run; You specify the MU0= option as part of the PROC UNIVARIATE statement to indicate the test value of the null hypothesis. The alternative hypothesis is that μ is not equal to 8 hours, but this does not need to be specified in the PROC UNIVARIATE code.

Psychologists at a college want to know if students are sleeping more or less than the recommended average of 8 hours a day. Which of the following code choices correctly tests the null hypothesis? a. proc univariate data=statdata.sleep mu0<>8; var hours; run; b. proc univariate data=statdata.sleep; var hours / mu0=8; run; c. proc univariate data=statdata.sleep; var hours / mu0<>8; run; d. proc univariate data=statdata.sleep mu0=8; var hours; run;

The probability is .95 that the true average weight is between 15.02 and 15.04 ounces. A 95% confidence interval means that you are 95% confident that the interval contains the true population mean. If you sample repeatedly and calculate a confidence interval for each sample mean, 95% of the time your confidence interval will contain the true population mean. A confidence interval is not a probability. When a confidence interval is calculated, the true mean is in the interval or it is not. There is no probability associated with it.

Select the statement below that incorrectly interprets a 95% confidence interval (15.02, 15.04) for the population mean, if the sample mean is 15.03 ounces of cereal. a. You are 95% confident that the true average weight for a box of cereal is between 15.02 and 15.04 ounces. b. The probability is .95 that the true average weight is between 15.02 and 15.04 ounces. c. In the long run, approximately 95% of the intervals calculated with this procedure will capture the true average weight.

A Cramer's V statistic that is close to 1 Cramer's V statistic is the only appropriate statistic to use in this example. When Cramer's V is close to 1, there is a relatively strong general association between two categorical variables. You cannot use an odds ratio because the predictor Type is not binary. You cannot use the Spearman correlation statistic because the predictor Type is not ordinal. Review: Cramer's V Statistic, Odds Ratios, The Spearman Correlation Statistic

Suppose you are analyzing the relationship between hot dog ingredients and taste. Which of the following statistics provides evidence of a relatively strong association between the variables Type (which has the values Beef, Meat, and Poultry) and Taste (which has the values Bad and Good)? a. A Cramer's V statistic that is close to 1 b. An odds ratio that is greater than 1 c. A Spearman correlation statistic that is close to 1

tables Rating*Grade / chisq measures; Both variables are ordinal and have logically-ordered values, so the Mantel-Haenszel test (for ordinal association) is a stronger test than the Pearson chi-square test (for general association) in this situation. The CHISQ option produces both the Pearson and Mantel-Haenszel statistics. The MEASURES option produces the Spearman correlation statistic, which measures the strength of an ordinal association. MHCHISQ is not a valid option, and the CLODDS= option is not a valid option in PROC FREQ. Review: The Mantel-Haenszel Chi-Square Test, The Spearman Correlation Statistic, Performing a Mantel-Haenszel Chi-Square Test of Ordinal Association

Suppose you are testing for an association between student ratings of teachers and student grades. The Rating variable has the values 1 (for poor), 2 (for fair), 3 (for good) and 4 (for excellent). The Grade variable has the values A, B, C, D, and F. Which of the following TABLES statements in PROC FREQ produces the appropriate chi-square statistics and measure of strength for these variables? a. tables Rating*Grade / chisq measures; b. tables Rating*Grade / chisq; c. tables Rating*Grade / mhchisq; d. tables Rating*Grade / mhchisq clodds=pl;

the equal variance assumption When a residuals plot displays a funnel shape, it indicates that the variance of the residuals is not constant. That is, the variance increases toward the wide end of the "funnel." This shows you that your model violates the equal variance assumption. Review: Verifying Assumptions Using Residual Plots

Suppose you have a residuals plot that shows a funnel shape for the residuals, such as in the plot below. Which assumption of linear regression is being violated? a. the linearity assumption b. the independence assumption c. both the linearity assumption and the independence assumption d. the equal variance assumption e. both the linearity assumption and the equal variance assumption

proc plm restore=homestore; score data=new out=new_out; run; In PROC PLM, the RESTORE= option specifies the name of the item store. In the SCORE statement, the DATA= option specifies New as the data set that contains the observations to be scored. The OUT= option specifies that the scored results are saved in a data set named New_Out. Review: Scoring Data

Suppose you ran a PROC GLMSELECT step that saved the context and results of the statistical analysis in an item store named Homestore. Which of the following programs scores new observations in a data set named New and saves the predictions in a data set named New_Out? a. proc plm restore=homestore; score data=new out=new_out; run; b. proc plm restore=new; score data=homestore out=new_out; run; c. proc plm data=homestore; score data=new out=new_out; run; d. proc plm restore=homestore; model data=new out=new_out; run;

oddsratio Amount (units=20); oddsratio Frequency / diff=ref at (Meal=all); oddsratio Meal / diff=ref at (Frequency=all); You must specify the intervals of Amount in the UNITS statement, not in the ODDSRATIO statement. To calculate odds ratios for the two categorical variables as described, each of the two ODDSRATIO statements must set DIFF= to REF against all levels of the interacting variable. Review: The ODDSRATIO Statement, The UNITS Statement

Suppose you want to fit a multiple logistic regression model to determine how the method of administering a drug affects patients' response to the drug. The binary variable Response has the values 0 and 1. There are three predictors: Amount identifies the dosage amount in mg, Frequency has the values Daily and Weekly, and Meal has the values Yes and No. You want to calculate three odds ratios: an odds ratio for Amount at 20 mg intervals an odds ratio for Frequency against the reference level (Daily) as compared to all levels of Meal an odds ratio for Meal against the reference level (Yes) as compared to all levels of Frequency Which of the following blocks of code below correctly completes the following PROC LOGISTIC program? proc logistic data=newdrug; class Frequency (param=ref ref='Daily') Meal (param=ref ref='Yes'); model Response (event='1') = Frequency | Meal | Amount @2; _____________________________________________ run; a. oddsratio Amount (units=20); oddsratio Frequency / diff=ref at (Meal=all); oddsratio Meal / diff=ref at (Frequency=all); b. units Amount=20; oddsratio Amount; oddsratio Frequency / diff=all at (Meal='Yes'); oddsratio Meal / diff=all at (Frequency='Daily'); c. units Amount=20; oddsratio Amount; oddsratio Frequency / diff=ref at (Meal=all); oddsratio Meal / diff=ref at (Frequency=all); d. oddsratio Amount (units=20); oddsratio Frequency / diff=all at (Meal='Yes'); oddsratio Meal / diff=all at (Frequency='Daily');

class Program(param=ref ref='2') Gender(param=ref ref='Male'); The CLASS statement lists all the categorical predictor variables. For each categorical predictor, you use the PARAM= option to specify reference cell coding (REF or REFERENCE) instead of the default parameterization method, effect coding. The default reference level is the level with the highest ranked value when the levels are sorted in ascending alphanumeric order. Review: Specifying a Parameterization Method in the CLASS Statement, Reference Cell Coding

Suppose you want to fit a multiple logistic regression model to determine which of two rehabilitation programs is more effective. The categorical response variable Relapsed (Yes or No) indicates whether study participants stayed clean after one year. The categorical predictor variables are Program (1 or 2) and Gender (Male or Female). Age is a continuous predictor variable. Assume that you want to use reference cell coding with the default reference levels. Which of the following CLASS statements correctly completes the PROC LOGISTIC step for this analysis? proc logistic data=programs.rehabilitation; _____________________________________ model Relapsed (event='Yes') = Program | Gender | Age @2; run; a. class Program(param=ref ref='2') Gender(param=ref ref='Male'); b. class Program(param=ref ref='2') Gender (param=ref ref='Male') Age (param=ref units=1); c. class Program(param=ref ref='1') Gender(param=ref ref='Female');

model Focus(event='Sports')=Gender; In the MODEL statement, the response variable name is followed by the EVENT= option in parentheses (which specifies the event category—the level of the response variable that you're interested in), an equal sign, and the predictor variable name. Review: The LOGISTIC Procedure

Suppose you want to investigate the relationship between the gender of elementary school students and their focus in school. The variable Gender indicates the gender of each student as Boy or Girl. The variable Focus identifies each student's main focus in school as Grades or Sports. Which of the following MODEL statements correctly completes this PROC LOGISTIC step for your analysis? proc logistic data=school.students; class Gender; _____________________________________ run; a. model Focus(event='Sports*Grades')=Gender; b. model Focus(event='Sports')=Gender; c. model Focus(ref='Sports')=Gender; d. model Focus*Gender(ref='Sports');

false The Tukey method and the pairwise t-tests are two methods you learned about that compare all possible pairs of means, so they can be used only when you make pairwise comparisons. The Dunnett method compares all categories to a control group. Review: Dunnett's Multiple Comparison Method, Tukey's Multiple Comparison Method

The Dunnett method compares all possible pairs of means, so it can be used only when you make pairwise comparisons. a. true b. false

Error

The ___________ sum of squares, SSE, measures the random variability within groups; it is the sum of the squared deviations between observations in each group and that group's mean. This is often referred to as the unexplained variation or within-group variation. A. Total B. Error

Total

The _________sum of squares, SST, is a measure of the total variability in a response variable. It is calculated by summing the squared distances from each point to the overall mean. Because it is correcting for the mean, this sum is sometimes called the corrected total sum of squares. A. Total B. Error

means, normal, larger The central limit theorem states that the distribution of sample means is approximately normal, regardless of the distribution of the population data, and this approximation improves as the sample size gets larger.

The central limit theorem states that the distribution of sample __(1)__ is approximately __(2)__, regardless of the distribution of the population data, and this approximation improves as the sample size gets __(3)__. a. means, skewed, larger b. variance, equal, smaller c. means, normal, larger d. proportions, equal, smaller

the mean (µ) and the standard deviation (σ) The location and spread of a normal distribution depend on the value of two parameters, the mean (µ) and the standard deviation (σ).

The location and spread of a normal distribution depend on the value of which two parameters? a. the mean (x̄) and the standard deviation (s) b. the standard deviation (σ) and the variance (σ²) c. the mean (µ) and the standard deviation (σ) d. none of the above

a two-sided t-test Because the cereal manufacturer is interested in determining whether the two processes produce a different mean cereal weight, he needs to perform a two-sided t-test. Review: Scenario: Comparing Group Means, Scenario: Testing for Differences on One Side

The manufacturer for a cereal company uses two different processes to package boxes of cereal. He wants to be sure the two processes are putting the same amount of cereal in each box. He plans to perform a two-sample t-test to determine whether the mean weight of cereal is significantly different between the two processes. What type of test should he run? a. an upper-tailed t-test b. a two-sided t-test c. a lower-tailed t-test
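A minimal PROC TTEST sketch for this scenario (the data set name cereal and the variables Process and Weight are hypothetical; with a CLASS variable, PROC TTEST performs a two-sample t-test, and the default alternative is two-sided):
proc ttest data=cereal;
   class Process;   /* packaging process, two levels */
   var Weight;      /* cereal weight per box */
run;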

used to calculate confidence intervals of the mean. The standard error of the mean is part of the equation used to calculate a confidence interval of the mean. It is not normally distributed, and it is never less than 0.

The standard error of the mean is a. used to calculate confidence intervals of the mean. b. always normally distributed. c. sometimes less than 0. d. none of the above
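For reference, the standard error of the mean and the confidence interval built from it are
SE_{\bar{x}} = \frac{s}{\sqrt{n}}, \quad \text{CI: } \bar{x} \pm t_{1-\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}}
where s is the sample standard deviation and n is the sample size. Because s and n are both positive, the standard error is never negative.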

The row percentages indicate that the distribution of size changes when the value of country changes. To see a possible association, you look at the row percentages. A higher percentage of American-made cars are large as opposed to small. The opposite is true for European cars and especially for Japanese cars. Review: Association between Categorical Variables, Crosstabulation Tables

This table shows frequency statistics for the variables country and size in a data set that contains data about people and the cars they drive. What evidence in the table indicates a possible association?
Table of country by size (each cell shows Frequency / Percent / Row Pct / Col Pct):
country    | Large                      | Medium                      | Small                       | Total
American   | 36 / 11.88 / 31.30 / 85.71 | 53 / 17.49 / 46.09 / 42.74  | 26 / 8.58 / 22.61 / 18.98   | 115 / 37.95
European   | 4 / 1.32 / 10.00 / 9.52    | 17 / 5.61 / 42.50 / 13.71   | 19 / 6.27 / 47.50 / 13.87   | 40 / 13.20
Japanese   | 2 / 0.66 / 1.35 / 4.76     | 54 / 17.82 / 36.49 / 43.55  | 92 / 30.36 / 62.16 / 67.15  | 148 / 48.84
Total      | 42 / 13.86                 | 124 / 40.92                 | 137 / 45.21                 | 303 / 100.00
a. The frequency statistics indicate that the values of each variable are equally distributed across levels.
b. The row percentages indicate that the distribution of size changes when the value of country changes.
c. The column percentages indicate that most of the cars of each size are manufactured in Japan.
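A crosstabulation like this is produced by PROC FREQ; a sketch (the data set name statdata.cars is hypothetical, and the CHISQ option is added to request a formal test of association):
proc freq data=statdata.cars;
   tables country*size / chisq;   /* default cells show Frequency, Percent, Row Pct, Col Pct */
run;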

The drug effect is not significant when used in patients with disease Z. The p-value for disease Z is 0.7815. Because this p-value is greater than your alpha of 0.05, you fail to reject the null hypothesis and conclude that there is no significant effect of Drug on Health for patients with disease Z. Review: Performing a Post Hoc Pairwise Comparison

This table shows output from a post hoc pairwise comparison in which you tested the significance of a drug on patients' health for three different diseases. What conclusion can you make based on this output? a. The drug effect is significant when used in patients with disease Z. b. The drug effect is significant when used in patients with diseases Y and Z. c. The drug effect is not significant when used in patients with disease Z.

True

True or False? Assessing ANOVA Assumptions In many cases, good data collection designs can help ensure the independence assumption. Diagnostic plots from PROC GLM can be used to verify the assumption that the error is approximately normally distributed. PROC GLM produces a test of equal variances with the HOVTEST option in the MEANS statement. H0 for this hypothesis test is that the variances are equal for all populations.
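A sketch of the corresponding PROC GLM step (the data set and the variables Treatment and Response are hypothetical): the MEANS statement with HOVTEST= requests Levene's test of the equal-variances hypothesis, and the diagnostic plots help check that the residuals are approximately normal.
ods graphics on;
proc glm data=work.study plots=diagnostics;
   class Treatment;
   model Response = Treatment;
   means Treatment / hovtest=levene;   /* H0: variances are equal across treatments */
run;
quit;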

True The CLASS statement creates a set of "design variables" (sometimes referred to as "dummy variables") representing the information contained in any categorical variables. Linear regression is then performed on the design variables. ANOVA can be thought of as linear regression on dummy variables. It is only in the interpretation of the model that a distinction is made.

True or False? What Does a CLASS Statement Actually Do? The CLASS statement creates a set of "design variables" representing the information in the categorical variables. PROC GLM performs linear regression on the design variables, but reports the output in a manner interpretable as group mean differences. There is only one "parameterization" available in PROC GLM.

CONNECT= proc sgplot data=STAT1.ameshousing3; vbox SalePrice / category=Central_Air connect=mean; title "Sale Price Differences across Central Air"; run;

Which VBOX option specifies that a connect line joins a statistic from box to box? This option applies only when the CATEGORY= option is used to generate multiple boxes. A. CATEGORY= B. CONNECT=

CATEGORY= proc sgplot data=STAT1.ameshousing3; vbox SalePrice / category=Central_Air connect=mean; title "Sale Price Differences across Central Air"; run;

Which VBOX option specifies the category variable for the plot? A box plot is created for each distinct value of the category variable. A. CATEGORY= B. CONNECT=

The variance inflation factors indicate that collinearity is present in the model. Several variance inflation factors are above 10 (Abdomen, Weight, Height, Chest, Hip, Density, Adiposity, and FatFreeWt). This indicates that collinearity among the predictor variables is present in the model. Review: The REG Procedure: Detecting Collinearity

View this PROC REG output. What does the output indicate about the model? a. The p-value for the overall model is not significant. b. The model does not fit the data well. c. The p-values for the parameter estimates indicate that collinearity is present in the model. d. The variance inflation factors indicate that collinearity is present in the model. e. none of the above
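Variance inflation factors are requested with the VIF option in the MODEL statement of PROC REG; a sketch using the predictors named in the explanation (the full model in the original output may contain additional variables):
proc reg data=statdata.bodyfat2;
   model PctBodyFat2 = Abdomen Weight Height Chest Hip Density Adiposity FatFreeWt / vif;
run;
quit;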

Several observations exceed the cutoff values, so these observations might be influential. The gray horizontal lines mark the +2 and -2 cutoff values of the RSTUDENT residuals. Several observations fall outside these lines, so these observations might be influential. Review: Looking for Influential Observations, Part 1, Looking for Influential Observations, Part 2

View this plot of RSTUDENT residuals versus predicted values of PctBodyFat2. What does it indicate? a. The model does not fit the data well. b. The residuals have a cyclical shape, so the independence assumption is being violated. c. Several observations exceed the cutoff values, so these observations might be influential. d. none of the above

both of the above An influential observation is an observation that strongly affects the linear model's fit to the data. If the influential observation weren't there, the best fitting line to the rest of the data would most likely be very different. Review: Introduction, Using Diagnostic Statistics to Identify Influential Observations, Looking for Influential Observations, Part 1, Looking for Influential Observations, Part 2, Handling Influential Observations

What is an influential observation? a. an unusual observation that can sometimes have a large residual compared to the rest of the points b. an observation so far away from the rest of the data that it influences the slope of the regression line c. both of the above d. neither of the above

H0: µ = µ0 (equivalently, H0: µ - µ0 = 0)

What is the null hypothesis for a one-sample t-test? A. H0: µ = µ0 B. H0: µ0 = 0 C. H0: µ - µ0 = 0 D. H0: µ0 - 0 = 0
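In PROC TTEST, the hypothesized value µ0 is supplied with the H0= option (the default is 0). A minimal sketch, with a hypothetical data set, variable, and null value:
proc ttest data=work.sample h0=100;   /* tests H0: mu = 100; the value 100 is hypothetical */
   var Measurement;
run;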

a table of correlations and a scatter plot matrix with histograms along its diagonal By default, PROC CORR produces a table of correlations (which can be a correlation matrix, depending on your program). The NOSIMPLE option suppresses printing of the simple descriptive statistics for each variable, and PLOTS=MATRIX requests a scatter plot matrix instead of individual scatter plots. The HISTOGRAM option displays histograms of the variables in the VAR statement along the diagonal of the scatter plot matrix. Review: Using Correlation to Measure Relationships between Continuous Variables

What output does this program produce? proc corr data=statdata.bodyfat2 nosimple plots=matrix(nvar=all histogram); var Age Weight Height; run; a. individual correlation plots and simple descriptive statistics b. a scatter plot matrix only, with histograms along its diagonal c. a table of correlations and a scatter plot matrix with histograms along its diagonal d. can't tell from the information given

For each year older (holding the other predictors at a fixed value), the predicted value of oxygen consumption is 2.78 lower. The parameter estimate for Age is the average change in Oxygen_Consumption for a 1-unit change in Age. In this case, the parameter estimate is negative. So, for each year older (a 1-unit change in Age), oxygen consumption decreases by 2.78 units. Review: The Simple Linear Regression Model

When Oxygen_Consumption is regressed on RunTime, Age, Run_Pulse, and Maximum_Pulse, the parameter estimate for Age is -2.78. What does this mean? a. For each year older (holding the other predictors at a fixed value), the predicted value of oxygen consumption is 2.78 greater. b. For each year older (holding the other predictors at a fixed value), the predicted value of oxygen consumption is 2.78 lower. c. For every 2.78 years older (holding the other predictors at a fixed value), oxygen consumption doubles. d. For every 2.78 years younger (holding the other predictors at a fixed value), oxygen consumption doubles.

model Health=Drug Disease Drug*Disease; In the MODEL statement, you first specify the main effect variables as they exist in the two-way ANOVA model. You then define the interaction term by separating the two main effect variables with an asterisk in the MODEL statement. Review: Performing Two-Way ANOVA with Interactions, Applying the Two-Way ANOVA Model

When you perform a two-way ANOVA in SAS, which of the following statements correctly defines the model that includes the interaction between the two main effect variables? a. class Drug*Disease; b. class Drug=Disease; c. model Drug*Disease; d. model Health=Drug Disease Drug*Disease;
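A sketch of the full PROC GLM step for this model (the data set name work.drugtrial is hypothetical); the LSMEANS statement with SLICE= sets up the post hoc comparisons of the drug effect within each disease, as in the output discussed earlier:
proc glm data=work.drugtrial;
   class Drug Disease;
   model Health = Drug Disease Drug*Disease;
   lsmeans Drug*Disease / slice=Disease;   /* tests the Drug effect separately within each Disease */
run;
quit;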

proc glmselect data=housing; class fireplace lot_shape; model Sale_price = fireplace lot_shape; partition fraction(test=0 validate=.20); run; The PARTITION statement specifies that the original data set, Housing, be split. The FRACTION option specifies the fraction of the original data set (as a decimal value) to be placed in the holdout data set. The training data set contains the remaining observations, those that were not allocated to the validation (or, if specified, test) data sets. Review: Using PROC GLMSELECT to Build a Predictive Model, Building a Predictive Model

Which of the following PROC GLMSELECT steps splits the original data set into a training data set that contains 80% of the original data and a validation data set that contains 20% of the original data? a. proc glmselect data=housing; class fireplace lot_shape; model Sale_price = fireplace lot_shape / fraction(test=0 validate=.20); run; b. proc glmselect data=housing; class fireplace lot_shape; model Sale_price = fireplace lot_shape / partition(test=0 validate=.20); run; c. proc glmselect data=housing; class fireplace lot_shape; model Sale_price = fireplace lot_shape; fraction(test=0 validate=.20); run; d. proc glmselect data=housing; class fireplace lot_shape; model Sale_price = fireplace lot_shape; partition fraction(test=0 validate=.20); run;

None of the Above

Which of the following affects alpha? a. The p-value of the test b. The sample size c. The number of Type I errors d. All of the above e. Answers a and b only f. None of the above

STUDENT residuals You can use STUDENT residuals to detect outliers. To detect influential observations, you can use RSTUDENT residuals and the DFFITS and Cook's D statistics. Review: Looking for Influential Observations, Part 1, Looking for Influential Observations, Part 2

Which of the following can you use to detect outliers? a. DFFITS statistics b. Cook's D statistics c. STUDENT residuals d. RSTUDENT residuals

proc univariate data=statdata.speedtest; histogram Speed / normal(mu=est sigma=est); inset skewness kurtosis / position=ne; run; In the HISTOGRAM statement, you specify the Speed variable and the NORMAL option using estimates of the population mean and the population standard deviation. In the INSET statement, you specify the keywords SKEWNESS and KURTOSIS, as well as the POSITION=NE option.

Which of the following code choices creates a histogram for the variable Speed from the data set SpeedTest with a normal curve overlay and a box with the skewness and kurtosis statistics printed in the northeast corner? a. proc univariate data=statdata.speedtest; histogram Speed / normal(mu=est sigma=est); inset skewness kurtosis; run; b. proc univariate data=statdata.speedtest; histogram Speed / normal (mean std); inset skewness kurtosis / position=ne; run; c. proc univariate data=statdata.speedtest; histogram Speed / normal(mu=est sigma=est); inset skewness kurtosis / position=ne; run; d. proc univariate data=statdata.speedtest; histogram Speed / normal(skewness kurtosis); run;

proc means data=statdata.popcorn maxdec=2 fw=10 printalltypes n mean median std var range qrange; class Type; var Yield; run; The PROC MEANS statement must include the option PRINTALLTYPES in order for SAS to display statistics for all requested combinations of class variables - that is, for each level or occurrence of the variable and for all occurrences combined. The statistics specified on the second line must include the keywords N MEAN MEDIAN STD VAR RANGE QRANGE. The code must specify Type as the class variable and Yield as the analysis variable.

Which of the following code examples correctly calculates descriptive statistics of popcorn yield (Yield) for each level of the class variable (Type) in the data set Statdata.Popcorn, as well as statistics for all levels combined? The output should include the following statistics: sample size, mean, median, standard deviation, variance, range, and interquartile range. a. proc means data=statdata.popcorn maxdec=2 fw=10 n mean median std var range qrange; class Type; var Yield; run; b. proc means data=statdata.popcorn maxdec=2 fw=10 printalltypes n mean median std var range qrange; class Yield; var Class; run; c. proc means data=statdata.popcorn maxdec=2 fw=10 printalltypes n mean median std var range qrange; class Type; var Yield; run; d. proc means data=statdata.popcorn maxdec=2 fw=10 printalltypes n mean median std range IQR; class Type; var Yield; run;

the smallest overall validation average squared error PROC GLMSELECT selects the model that has the smallest overall validation error. Review: Building a Predictive Model

Which of the following does PROC GLMSELECT use to select a model from the candidate models when a validation data set has been provided? a. the smallest number of predictors b. the largest adjusted R-Square value c. the smallest overall validation average squared error d. none of the above

all of the above All of these statements are available for use within PROC PLM for postprocessing. Recall that this postprocessing will be performed using the item store. Review: Performing Postprocessing Tasks with the PLM Procedure

Which of the following is available for use in postprocessing within PROC PLM? a. LSMEANS b. LSMESTIMATE c. SLICE d. all of the above
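A sketch of the workflow: the STORE statement in the model-fitting procedure saves the fitted model to an item store, and PROC PLM then performs postprocessing from that item store without refitting the model. The item store name is hypothetical; the data set and variables follow the STAT1.ameshousing3 example used earlier in this set.
proc glm data=STAT1.ameshousing3;
   class Central_Air;
   model SalePrice = Central_Air;
   store out=work.anova_store;   /* save the fitted model as an item store */
run;
quit;

proc plm restore=work.anova_store;
   lsmeans Central_Air / diff;   /* postprocessing performed on the item store */
run;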

The observations are dependent. In an ANOVA model, you assume that the errors are normally distributed for each treatment, the errors have equal variances across treatments, and the observations are independent. When you add a blocking factor to your ANOVA model, you also assume that the treatments are randomly assigned within each block and that the effects of the treatment are the same within each block. Review: More ANOVA Assumptions

Which of the following is not an assumption you make when including a blocking factor in an ANOVA randomized block design? a. The treatments are randomly assigned within each block. b. The errors are normally distributed. c. The effects of the treatment factor are constant across the levels of the blocking variable. d. The observations are dependent.

all of the above All six steps are important for developing good regression models. You might need to perform some steps iteratively to produce the best possible model. Review: Using an Effective Modeling Cycle

Which of the following is suggested for developing good regression models? a. getting to know your data by performing preliminary analyses b. identifying good candidate models c. checking and validating your assumptions using residual plots and other statistical tests d. identifying any influential observations or collinearity e. revising the model if needed f. validating the model with data not used to build the model g. all of the above h. a, c, and d only

When you score data, you apply the score code (the equations obtained from the final model) to the scoring data. When you score data, you apply the score code to the scoring data. It is not necessary to rerun the algorithm that was used to build the model. If you made any modifications to the training or validation data, you must make the same modifications to the scoring data before you can score it. The size of the scoring data set is not affected by the size of the training and validation data sets. Review: Preparing for Scoring

Which of the following statements about scoring is true? a. When you score data, you must rerun the algorithm that was used to build the model. b. When you score data, you apply the score code (the equations obtained from the final model) to the scoring data. c. If you made any modifications to the training or validation data, it is not necessary to make the same modifications to the scoring data. d. The scoring data set cannot be larger than either the training data set or the validation data set.
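With an item store, scoring amounts to applying the stored score code to the new data, for example with the SCORE statement in PROC PLM (the item store and data set names here are hypothetical):
proc plm restore=work.anova_store;
   score data=work.new_data out=work.scored predicted;   /* adds predicted values; the model is not refit */
run;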

2 only In statement 2, the amount of salty snacks eaten and thirst have a positive linear relationship. As the values of one variable (amount of salty snacks eaten) increase, the values of the other variable (thirst) increase as well. Review: Using Scatter Plots to Describe Relationships between Continuous Variables, Using Correlation to Measure Relationships between Continuous Variables

Which of the following statements describes a positive linear relationship between two variables? 1. The more I eat, the less I want to exercise. 2. The more salty snacks I eat, the more water I want to drink. 3. No matter how much I exercise, I still weigh the same. a. 1 only b. 1 and 2 c. 2 only d. 2 and 3 e. 3 only

all of the above All of the statements are true concerning information criteria. All of the formulas begin with the same calculation but differ in the penalty term assessing the complexity of the model. With this penalty, models that contain different numbers of parameters can be compared, and the model with the smaller information criterion value is considered better. Review: Information Criteria

Which of the following statements is true about information criteria such as AIC, AICC, BIC, and SBC? a. Formulas for all information criteria begin with the same calculation. b. The penalty term that assesses the complexity of the model allows information criteria to be a useful means of comparing models with different numbers of parameters. c. The best model is the one with the smallest information criterion value. d. all of the above
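For reference, the common likelihood-based forms (p = number of model parameters, n = sample size, log L = maximized log likelihood) are
AIC = -2\log L + 2p, \quad AICC = AIC + \frac{2p(p+1)}{n - p - 1}, \quad SBC = -2\log L + p\,\ln(n)
Note that in PROC GLMSELECT output, BIC denotes Sawa's Bayesian information criterion, whose penalty term differs from SBC's.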

You can reproduce your results if you specify an integer that is greater than zero in the SEED= option and then rerun the code using the same SEED= value. By specifying an integer that is greater than zero in the SEED= option, you can reproduce your results by rerunning the code using the same SEED= value. The SEED= option has nothing to do with the allocation of observations to the validation data set. If you do not specify a valid value in the SEED= option, the seed is automatically generated from reading the time of day from the computer's clock. The SEED= option is used when you start with a data set that is not yet partitioned. Review: Using PROC GLMSELECT to Build a Predictive Model

Which of the following statements is true about the SEED= option in PROC GLMSELECT? PROC GLMSELECT DATA=training-data-set <SEED=number>; MODEL target(s)=input(s) </ options>; PARTITION FRACTION(<TEST=fraction><VALIDATE=fraction>); RUN; a. You can reproduce your results if you specify an integer that is greater than zero in the SEED= option and then rerun the code using the same SEED= value. b. The SEED= option offers an alternative way to specify the proportion of observations to allocate to the validation data set. c. If a valid value is not specified for the SEED= option, the code will not run. d. You can use the SEED= option only when you have already partitioned the data prior to model building.

proc reg data=statdata.bodyfat2 plots(only)= (RSTUDENTBYPREDICTED(LABEL) COOKSD(LABEL) DFFITS(LABEL) DFBETAS(LABEL)); PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm / r influence; id Case; run; quit; Program a specifies the R and INFLUENCE options, which request diagnostic statistics. Review: Looking for Influential Observations, Part 1, Looking for Influential Observations, Part 2

Which of these programs requests diagnostic statistics as well as diagnostic plots? a. proc reg data=statdata.bodyfat2 plots(only)= (RSTUDENTBYPREDICTED(LABEL) COOKSD(LABEL) DFFITS(LABEL) DFBETAS(LABEL)); PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm / r influence; id Case; run; quit; b. proc reg data=statdata.bodyfat2 plots(only)= (QQ RESIDUALBYPREDICTED RESIDUALS); PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm; id Case; run; quit; c. both of the above

ods output RSTUDENTBYPREDICTED=Rstud COOKSDPLOT=Cook DFFITSPLOT=Dffits DFBETASPANEL=Dfbs; proc reg data=statdata.bodyfat2 plots(only)= (RSTUDENTBYPREDICTED(label) COOKSD(label) DFFITS(label) DFBETAS(label)); PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm; id Case PctBodyFat2 Abdomen Weight Wrist Forearm; title; run; quit; Program b is almost correct, but the images must be created for the data sets to be saved. Program c tells SAS to create the images and save them into their own data sets. Review: Looking for Influential Observations, Part 1, Looking for Influential Observations, Part 2

Which program correctly saves information from influential plots into individual output data sets? Assume that ODS GRAPHICS is on. a. proc reg data=statdata.bodyfat2; PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm / r influence; id Case PctBodyFat2 Abdomen Weight Wrist Forearm; run; quit; b. ods output RSTUDENTBYPREDICTED=Rstud COOKSDPLOT=Cook DFFITSPLOT=Dffits DFBETASPANEL=Dfbs; proc reg data=statdata.bodyfat2 plots=none; PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm; id Case PctBodyFat2 Abdomen Weight Wrist Forearm; title; run; quit; c. ods output RSTUDENTBYPREDICTED=Rstud COOKSDPLOT=Cook DFFITSPLOT=Dffits DFBETASPANEL=Dfbs; proc reg data=statdata.bodyfat2 plots(only)= (RSTUDENTBYPREDICTED(label) COOKSD(label) DFFITS(label) DFBETAS(label)); PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm; id Case PctBodyFat2 Abdomen Weight Wrist Forearm; title; run; quit; d. ods output outputstatistics; proc reg data=statdata.bodyfat2 plots(only)= (RSTUDENTBYPREDICTED(LABEL) COOKSD(LABEL) DFFITS(LABEL) DFBETAS(LABEL)); PREDICT: model PctBodyFat2 = Abdomen Weight Wrist Forearm; id Case PctBodyFat2 Abdomen Weight Wrist Forearm; run; quit;

The response variable can have more than two levels as long as one of the levels is coded as 0. In binary logistic regression, the response variable can only have two levels. Review: Modeling a Binary Response

Which statement about binary logistic regression is false? a. Binary logistic regression uses predictor variables to estimate the probability of a specific outcome. b. To model the relationship between a predictor variable and the probability of an outcome, you must use a nonlinear function. c. The mean of the response in binary logistic regression is a probability, which is between 0 and 1. d. The response variable can have more than two levels as long as one of the levels is coded as 0.
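The nonlinear function referred to here is the logit link. Writing p for the probability of the event,
\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k, \quad p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
which keeps the predicted probability between 0 and 1.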

All main effects and interactions that remain in the final model must be significant. Backward elimination results in a final model that can contain one or more main effects and (if specified) interactions. Any interactions in the final model must be significant. Main effects that are involved in interactions must appear in the final model, whether or not they are significant. Review: The Backward Elimination Method of Variable Selection

Which statement about the backward elimination method is false? a. Backward elimination is a method of selecting variables for a logistic regression model. b. Backward elimination removes effects and interactions one at a time. c. All main effects and interactions that remain in the final model must be significant. d. To obtain a more parsimonious model, you specify a smaller significance level.

reference Typically, the original data set is split into two subset data sets called the training and validation data sets. However, in some situations, the data is split into three subsets, and the third of these is called the test data set. Review: Using PROC GLMSELECT to Build a Predictive Model

With a large enough data set, observations can be divided into three subset data sets for use in honest assessment. Which of the following is not the name of one of these three subset data sets? a. training b. validation c. reference d. test

the assumption of equal variances You use Levene's Test for Homogeneity in PROC GLM to verify the assumption of equal variances in a one-way ANOVA model. Review: The GLM Procedure

You can examine Levene's Test for Homogeneity to more formally test which of the following assumptions? a. the assumption of errors being normally distributed b. the assumption of independent observations c. the assumption of equal variances d. the assumption of treatments being randomly assigned

Parameters

____________ are numerical characteristics of populations. They are generally unknown and must be estimated through the use of samples. A sample is a group of measurements from a population. In order for inferences to be valid, the sample should be representative of the population. A. Metrics B. Parameters

Scatter Scatter plots are useful to accomplish the following: explore the relationships between two variables, locate outlying or unusual values, identify possible trends, identify a basic range of Y and X values, and communicate data analysis results.

____________plots are two-dimensional graphs produced by plotting one variable against another within a set of coordinate axes. The coordinates of each point correspond to the values of the two variables. A. Box B. Histogram C. Scatter
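A minimal PROC SGPLOT sketch (reusing the body-fat variables that appear elsewhere in this set; any two continuous variables would work):
proc sgplot data=statdata.bodyfat2;
   scatter x=Weight y=PctBodyFat2;
run;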

Total Variation

the overall variability in the response variable. It is calculated as the sum of the squared differences between each observed value and the overall mean. This measure is also referred to as the Total Sum of Squares (SST). A. Total Variation B. Between Group Variation C. Within Group Variation

Between Group Variation

the variability explained by the independent variable and therefore represented by the between-treatment sum of squares. It is calculated as the weighted (by group size) sum of the squared differences between the mean for each group and the overall mean. This measure is also referred to as the Model Sum of Squares (SSM). A. Total Variation B. Between Group Variation C. Within Group Variation

Within Group Variation

the variability not explained by the model. It is also referred to as within-treatment variability or residual sum of squares. It is calculated as the sum of the squared differences between each observed value and the mean for its group. This measure is also referred to as the Error Sum of Squares (SSE). A. Total Variation B. Between Group Variation C. Within Group Variation
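With groups i = 1, ..., k of sizes n_i, group means \bar{y}_i, and overall mean \bar{y}, the three sums of squares defined above are
SST = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2, \quad SSM = \sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2, \quad SSE = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2
and they satisfy SST = SSM + SSE.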

