Data Mining Exam 1
True/False: The first step in trying to reduce the number of predictors should always be to use domain knowledge
TRUE: This is the first step because it is very important to understand what the various predictors are measuring and why. By using domain knowledge, the user can ensure he or she has condensed the data to a manageable level. This will make finding the solution much easier.
Grocery stores can use such information after a customer's purchases have all been scanned to print discount coupons, where the items being discounted are determined by mapping the customer's purchases onto the association rules.
True
T/F: A frequent problem in data mining is that of using a regression equation to predict the value of a dependent variable when we have many variables available to choose as predictors in our model.
True
True or False: In supervised learning the goal is to predict a single "target" or "outcome" variable
True
True or False: The most popular model for making predictions is the multiple linear regression model, which is used to fit a linear relationship between a quantitative dependent variable and a set of predictors.
True
True/False: The equation that describes how y is related to x and the error term is called the regression model.
True: The simple linear regression model is y = β0 + β1x + ε. β0 and β1 are called the parameters of the model, and ε is a random variable called the error term.
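A minimal sketch of how the two parameters of the simple linear regression model (intercept and slope) can be estimated by least squares. The function name fit_simple_regression and the data are made up for illustration and are not from the course material.

```python
def fit_simple_regression(x, y):
    """Return least-squares estimates (b0, b1) for y = b0 + b1*x + e."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of cross-products and sum of squared deviations of x
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx              # estimated slope
    b0 = y_mean - b1 * x_mean   # estimated intercept
    return b0, b1

# Hypothetical data that lie exactly on y = 1 + 2x (no error term)
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
b0, b1 = fit_simple_regression(x, y)
print(b0, b1)
```

With real data the points would not fall exactly on a line, and the error term accounts for the gap between each observed y and the fitted line.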
True or False: An important step in data mining pre-processing is detecting outliers
True: An outlier is an extreme observation, being distant from the rest of the data. Once outliers are detected, domain knowledge is required to determine whether each one is an error or truly extreme.
True or False: An important step in data mining pre-processing is detecting outliers.
True: An outlier is an observation that is extreme, being distant from the rest of the data. Once outliers are detected, domain knowledge is required to determine whether each one is an error or truly extreme.
True or False: Outliers can have disproportionate influence on models.
True: An outlier is an observation that is said to be "extreme" which can alter the accuracy of results one may obtain using different models. It is very important to be able to detect outliers.
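One possible way to flag outliers is to measure how many standard deviations each value lies from the mean. The sketch below assumes a cutoff of 2.5 standard deviations, which is an arbitrary choice for illustration; the data and the name flag_outliers are hypothetical.

```python
import statistics

def flag_outliers(values, threshold=2.5):
    """Return values more than `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) / sd > threshold]

# Hypothetical purchase amounts with one extreme value
purchases = [10, 12, 11, 13, 12, 11, 10, 12, 11, 50]
flagged = flag_outliers(purchases)
kept = [v for v in purchases if v not in flagged]

print(flagged)
# Compare the mean with and without the flagged value
print(statistics.mean(purchases), statistics.mean(kept))
```

The single extreme value pulls the mean upward, which illustrates the disproportionate influence an outlier can have. Whether the flagged value is an error or a genuine observation still has to be decided with domain knowledge.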
T/F: R-squared (r^2) measures how much of the variation in the dependent variable can be explained by the independent variables.
True: If r^2 = .86, then 86% of the variation in the dependent variable can be explained by the independent variables.
Which partition is used to develop multiple models and is also usually the largest partition? (A.) Validation (B.) Test (C.) Normalizing (D.) Training
(D.) Training: The training partition is typically the largest partition. It contains the data we use to build the models being considered.
What are the steps in data mining? A. Develop an understanding of the purpose of the data mining process, obtain the data set to be used in the analysis, explore the data, reduce the data, determine the data mining task, choose the data mining techniques to be used, use algorithms to perform the task, interpret the results of the algorithms, deploy the model. B. Develop an understanding of the purpose of the data mining process, obtain the data set to be used in the analysis, explore the data, reduce the data, deploy the model, choose the data mining techniques to be used, use algorithms to perform the task, interpret the results of the algorithms, determine the data mining task. C. Obtain the data set to be used in the analysis, explore the data, reduce the data, deploy the model, choose the data mining techniques to be used, develop an understanding of the purpose of the data mining process, use algorithms to perform the task, interpret the results of the algorithms, determine the data mining task.
A
Which of the following is NOT an assumption about the error term ε? A) The error ε is a random variable with mean of zero. B) The variance of ε is always positive. C) The values of ε are independent. D) The error ε is a normally distributed random variable.
B
At what value must the p-value be in order for a result to be statistically significant? A) >.5 B) <1 C) <.05 D) >.1
C) <.05: By the conventional cutoff, a result is considered statistically significant only when the p-value is below .05.
What is the formula for the total sum of squares (SST)? a. SST = SSR + SSE b. SST = SSS + SSE c. SST = SSR + SSD d. SST = SSS + SSD
A) SST = SSR + SSE
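The identity SST = SSR + SSE can be checked numerically for a least-squares line, and the same quantities give r^2 = SSR / SST. This is a rough sketch on made-up data; the variable names are illustrative only.

```python
# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Least-squares slope and intercept
b1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
    (xi - x_mean) ** 2 for xi in x
)
b0 = y_mean - b1 * x_mean
y_hat = [b0 + b1 * xi for xi in x]  # fitted values

sst = sum((yi - y_mean) ** 2 for yi in y)                 # total sum of squares
ssr = sum((fi - y_mean) ** 2 for fi in y_hat)             # sum of squares due to regression
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))     # sum of squares due to error

r_squared = ssr / sst
print(sst, ssr, sse, r_squared)
```

For these numbers the fitted line explains 60% of the variation in y, and the remaining 40% is left in the error sum of squares.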
Identify the three types of partitions, (more than one answer may apply) A)Training Partition, B) Test Partition, C)Teaching Partition, D)Validation Partition.
A, B, and D
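A minimal sketch of randomly splitting records into training, validation, and test partitions. The 50/30/20 proportions, the seed, and the function name partition are assumptions chosen for illustration, not prescribed values.

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=1):
    """Shuffle a copy of the records and split it into three partitions."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever remains
    return train, valid, test

records = list(range(100))  # stand-in for 100 data records
train, valid, test = partition(records)
print(len(train), len(valid), len(test))
```

The training partition is kept the largest because it is the one used to build the models; the validation partition compares them, and the test partition is held back to assess the model finally chosen.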
An example of a continuous variable is: A. Temperature B. The population C. Housing zone D. Gender
A: Temperature
There are several ways of classifying variables. What are those ways? A) Success rate, repetition, categorical and text B) Numerical or text, continuous, integer and categorical C) Categorical, integer, nominal and phasing D) Integer, text, binary and nominal
B
What is the name of the type of regression that compares one independent variable with one dependent variable? A) Multiple Linear Regression B) Simple Linear Regression C) Logistic Regression D) Regression Trees
B
What type of modeling has the goal of fitting the data well and understanding the contribution of explanatory variables to the model? A. Predictive Modeling B. Explanatory Modeling C. Linear Regression Modeling D. Multiple Linear Regression Modeling
B
Identify the response NOT associated with Supervised Prediction A. Predicts numerical target (outcome) variables B. Target variable is often binary C. Each column is a variable D. When taken with classification, constitutes predictive analytics
B: Binary target variables are associated with classification, not prediction.
What is the term used to describe this definition: the more data we are dealing with, the greater the chance of encountering erroneous values resulting from measurement error, data entry error, or the like? a. Missing Values b. Outliers c. Over fitting d. Variable Selection
B: Outliers result from any of the errors mentioned above. An outlier can sit well outside the range and be harmful, but it can also sit inside the range of the rest of the data and can be harmless.
All of the following are similarities between test and validation partitioning except for... A. Used to assess performance of (a) model(s) B. Helps find out an optimal model C. Used on a single model to ensure it's a good model D. Part of the partitioning process
C: Only the test partition is used on a single model, to make sure the chosen model performs well on data that played no part in building or selecting it.
Which of the following is NOT a cause of overfitting? A. A model with too many parameters B. Too many predictors C. Disproportionate influence on models (a problem if it is spurious) D. Trying many different models
C: Keep in mind that the more variables are included, the greater the risk of overfitting the data. C is not a cause of overfitting because disproportionate influence on models is a concern associated with outliers.
Categorical is one type of variable. Which of the following is not a categorical variable? A. Hair color B. Gender C. Integer D. Political affiliation
C: There are two types of variables, categorical and numeric. Categorical would be ordered (low,medium,high) or unordered (male or female). Numeric variables are variables that are continuous or integers.
T/F: Training data refers to that portion of the data used to assess how well the model fits.
False: Training data refers to that portion of the data used to fit a model. Validation data refers to that portion of the data used to assess how well the model fits.
Out of the six core ideas in data mining, which are associated with unsupervised learning algorithms? a.) Association rules, classification, data reduction, data exploration b.) Data reduction, prediction, data visualization, association rules c.) Association rules, data visualization, data exploration, data reduction d.) Prediction, data reduction, data exploration, classification
C: Unsupervised learning algorithms are those used where there is no outcome variable to predict or classify.
The variable being predicted is called the ? , while variables being used to predict the value are called the ? . A: Independent variable, denoted by y / Dependent variables, denoted by x B: Dependent variable, denoted by x / Independent variables, denoted by y C: Dependent variable, denoted by y / Independent variables, denoted by x D: Independent variable, denoted x / Dependent variables, denoted by y
C: Y is dependent upon X. The relationship between these two or more variables helps in making managerial decisions. Regression analysis can be used to develop an equation showing how the variables are related.
Multiple regression involves which variables? a) One dependent and one independent b) Two or more dependent variables c) Two or more independent variables d) None of the above
C: Regression analysis involving two or more independent variables is called multiple regression.
Which of the following is not considered an unsupervised learning method? A) Association rules B) Dimension reduction methods C) Simple linear regression D) Clustering techniques
C: Simple linear regression, because it is used to predict an outcome, and unsupervised learning has no outcome to predict.
True/False: Unsupervised learning algorithms are those used in classification and prediction.
False: Unsupervised learning algorithms are those used where there is no outcome variable to predict or classify
What is the most important factor propelling the growth of data mining? A. growth of technology B. better resources C. expanding economies D. the growth of data
D
What helped in the growth of the data mining field? A. The Growth of data B. The Scarcity of data C. Technology D. Both A and C
D: Both technology, which allowed the mass collection of data, and the growth of data itself.
The following is an example of which type of variables: South America, Asia, Africa, North America. a) numerical b) categorical c) ordinal variables d) nominal variables
D: nominal variables
The rapid growth of data mining is attributed to ? A). Growth of data B). Growth of internet C). Data warehouse D). Computation power E). All of the above
E (All of the above): Data mining is a scientific approach to managerial decision making in which raw data are processed and manipulated to produce meaningful information. Its increased use is a result of the expanding volume of company transactions and information processed every day. These transactions and this information can be processed as a direct consequence of superior technology (computational power and the growth of the internet).
Which of the following is NOT a step to be taken in a typical data mining effort? A) Develop an understanding of the purpose of the data mining project B) Obtain the dataset to be used in analysis C) Use algorithms to perform the task D) Determine the data mining task E) Use regression model techniques to manipulate the data
E) Use regression model techniques to manipulate the data: Regression is only one of the possible techniques that can be used. Step 6 is "Choose the data mining techniques to be used." This includes regression, neural nets, hierarchical clustering, etc.
Where is Data Mining used? A.) Security Specialists B.) Intelligence Agencies C.) Medical Researchers D.) Military E.) All of the above
E: Data Mining is a scientific approach to managerial decision making which can be used in a variety of important settings. This question shows the importance of data mining and how it is not just used in one place.
The usefulness of a data mining method depends on _________. A. The size of dataset B. The types of patterns that exist in the data C. Noisiness of data D. The particular goal of the analysis E. All of the above
E: Every method of data mining has some advantages and disadvantages. The method that is most useful for the current goal should be used
True or False. Most serious errors in data analysis result from a poor understanding of which model to use to find data.
FALSE: Most serious errors in data analysis result from a poor understanding of the problem, an understanding that must be developed before we get into details of algorithms to be used.
True or False: Multiple Linear Regression is used when trying to discover the relationship between one or more independent variables and the dependent variable.
FALSE: Multiple Linear Regression describes how TWO OR MORE independent variables are related to the dependent variable
T/F: The term multicollinearity refers to the correlation among the dependent variables.
False. The term multicollinearity refers to the correlation among the independent variables.
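One rough way to spot multicollinearity is to compute the correlation between pairs of independent variables. The sketch below uses the Pearson correlation coefficient on two made-up predictors; the names x1, x2, and pearson_r are illustrative only.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

# Two hypothetical predictors that move almost in lockstep
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x1, x2)
print(r)
```

A correlation this close to 1 suggests the two predictors carry nearly the same information, so including both in a regression may make the individual coefficients hard to interpret.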
True or False: A good predictive model is one that fits the data closely.
False: A good predictive model predicts new cases accurately, whereas an explanatory model fits data closely.
True or False: When normalizing data, the mean is subtracted from each value and the result is divided by the total number of values.
False: Divide by the standard deviation of the resulting deviations from the mean. As chapter 2, page 24 states, "In effect, we are expressing each value as the 'number of standard deviations away from the mean,' also called a z-score."
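A minimal sketch of z-score normalization: subtract the mean from each value, then divide by the standard deviation. The sample standard deviation is used here as an assumption (the population version is also common), and the data and the name z_scores are hypothetical.

```python
import statistics

def z_scores(values):
    """Express each value as the number of standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumed choice)
    return [(v - mean) / sd for v in values]

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
normalized = z_scores(values)
print(normalized)
```

After normalization the values have a mean of 0 and a standard deviation of 1, which puts variables measured on different scales on a common footing.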
T/F: An explanatory model is one that predicts new cases accurately.
False: Predictive modeling predicts new cases accurately, whereas explanatory models fit the original data closely.
T/F : Regression analysis is a poor way to show the relationship between the dependent variable and independent variable(s)
False: Regression analysis is one of the best ways to show the relationship between the two types of variables
True/False: Simple linear regression involves one dependent variable and two or more independent variables.
False: Simple linear regression involves one independent variable and one dependent variable. Multiple regression involves two or more independent variables.
T/F Supervised learning refers to analysis in which one attempts to learn something about the data other than predicting an output value of interest.
False: the definition refers to unsupervised learning. Supervised learning is the process of providing an algorithm with records in which an output variable of interest is known.
T/F: Categorical variables can be used as is.
False: They must be broken down into binary variables.
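A short sketch of breaking a categorical variable into binary (dummy) variables, one per category. The function name to_dummies, the "is_" prefix, and the data are made up for illustration.

```python
def to_dummies(values):
    """Turn a list of category labels into one dict of 0/1 flags per record."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

colors = ["red", "blue", "red", "green"]  # hypothetical categorical variable
rows = to_dummies(colors)
for row in rows:
    print(row)
```

Each record gets a 1 in exactly one column. For regression, one of the dummy columns is often dropped, since it can be inferred from the others.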
True or False: Data Mining is a scientific approach to managerial decision making in which raw data are processed and manipulated to produce meaningful information?
True: In order to make those decisions you must extract data from large data sets. With data analysis you can detect meaningful patterns and rules, ultimately finding meaningful correlations, patterns, and trends.
True or False: Two solutions to handling missing data are omission and Imputation
True: Most algorithms will not process records with missing values. One can use omission to drop the records with missing values if there are only a few, or one can use imputation to replace the missing values with reasonable substitutes.
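The two approaches can be sketched on a small made-up data set. Using the median as the substitute value is only one possible choice for imputation; the field names and the helper names omit_missing and impute_median are hypothetical.

```python
import statistics

# Hypothetical records; None marks a missing value
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": 58000},
    {"age": 29, "income": None},
]

def omit_missing(rows):
    """Omission: keep only records with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_median(rows, field):
    """Imputation: replace missing values in `field` with the observed median."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = statistics.median(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

complete = omit_missing(records)
imputed = impute_median(records, "age")
print(len(complete))
print(imputed[1])
```

Omission here discards half of the records, which shows why it is usually reserved for cases where only a few records are affected.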
True or false: The goal of unsupervised learning is to segment data into meaningful segments; detect patterns.
True: With unsupervised learning there is no target variable to predict or classify.
True/False: The equation that describes how y is related to x and the error term is called the regression model.
True: The simple linear regression model is y = β0 + β1x + ε. β0 and β1 are called the parameters of the model, and ε is a random variable called the error term.
