MGMT 473
Equation: Salary = 44.0073 + 6.6227GPA + 6.6071MIS + 6.7309Statistics. What is the additional salary a graduate would earn with an MIS degree?
$6,607
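As a rough check of the answer above, the sketch below evaluates the fitted equation with and without the MIS dummy. It assumes that Salary is measured in thousands of dollars and that MIS and Statistics are 0/1 dummy variables; the card states neither explicitly, and the GPA value is hypothetical.

```python
def predicted_salary(gpa, mis, statistics):
    """Fitted equation from the card; units assumed to be $1,000s."""
    return 44.0073 + 6.6227 * gpa + 6.6071 * mis + 6.7309 * statistics

gpa = 3.5  # hypothetical GPA; the difference below does not depend on it

with_mis = predicted_salary(gpa, mis=1, statistics=0)
without_mis = predicted_salary(gpa, mis=0, statistics=0)

# The gap equals the MIS coefficient, converted here from $1,000s to dollars.
mis_premium_dollars = round((with_mis - without_mis) * 1000)
print(mis_premium_dollars)
```

Because the model has no interaction terms, the difference is the MIS coefficient regardless of GPA.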
What are the 6 phases of CRISP-DM?
1. Business understanding 2. Data understanding 3. Data preparation 4. Modeling 5. Evaluation 6. Deployment
What are the 5 steps of SEMMA?
1. Sample 2. Explore 3. Modify 4. Model 5. Assess
According to interviews and expert estimates, analytics professionals spend from ________________ of their time in the mundane task of collecting and preparing unruly data, before analytics can be applied (The New York Times, August 17, 2014)
50 - 80%
Calculate the Mean Absolute Deviation for the following data: We have observed the age of 3 individuals in a study, where the mean age is 40. The observed ages were 31, 40, and 49. What is the MAD?
6 Reason: (|31 - 40| + |40 - 40| + |49 - 40|) / 3 = (9 + 0 + 9) / 3 = 6
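A minimal sketch of the MAD calculation for these three ages. The absolute value has to be applied to each deviation before summing, because the raw deviations from the mean always sum to zero; the `naive` value is included only to show that pitfall.

```python
ages = [31, 40, 49]
mean_age = sum(ages) / len(ages)

# Mean absolute deviation: average of the absolute deviations from the mean.
mad = sum(abs(age - mean_age) for age in ages) / len(ages)

# Pitfall: taking the absolute value of the summed deviations gives zero.
naive = abs(sum(age - mean_age for age in ages)) / len(ages)

print(mad, naive)
```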
Which of the following is not a common approach for transforming categorical data? A. Mathematical transformation B. Dummy variables C. Category reduction D. Category scores
A. Mathematical transformation
Which of the following are reasons for data professionals to learn data wrangling skills? Analytics professionals can no longer rely on the IT department to provide data Analytics professionals are superior to all other IT professionals Organizations will be able to make decisions more rapidly Analytics professionals need broader skill sets than data mining techniques
Analytics professionals can no longer rely on the IT department to provide data Organizations will be able to make decisions more rapidly Analytics professionals need broader skill sets than data mining techniques
What is the term used to describe computer systems that demonstrate human-like intelligence and cognitive functions, such as deduction, pattern recognition, and the interpretation of complex data?
Artificial intelligence
When selecting the cutoff values for performance measures, in some applications, the analyst may choose to increase or decrease the cutoff value to classify fewer or more observations into the target class. What are some reasons for doing this? Select all that apply. Uneven class distributions Asymmetric misclassification costs Personal preference Even class distributions
Asymmetric misclassification costs Uneven class distributions
Recall the Organic Food Superstore from the introductory case. In that case, an Entity Relationship Diagram (ERD) for the store illustrates three entities: CUSTOMER, ORDER, and PRODUCT. The relationship between the CUSTOMER and ORDER entities is 1:M because: A. An order can contain ONLY one product B. Each order can only belong to one customer C. A customer places many orders over time
B. Each order can only belong to one customer
Recall that we use nominal and ordinal measurement scales to represent categorical variables. Which of the examples below represent a nominal scale representation of a categorical variable? A. Performance of a manager (excellent, good, fair, poor). B. The temperature of the resort location C. Marital status (single, married, widowed, divorced, separated) D. Profit and inventory level of a distribution center
C. Marital status (single, married, widowed, divorced, separated)
Examples of transforming numerical data include transforming: Calculating Percentages Individual's date of birth to age Combining height and weight to create body mass index There is no need to transform data
Calculating Percentages Individual's date of birth to age Combining height and weight to create body mass index
If we have a third variable in the data set that is categorical, we can plot the two numerical variables and then add the third categorical variable. This scatter plot is called a scatter plot with a '________________' variable.
Categorical
When conducting data mining analysis, practitioners generally adopt either the ____________ or ___________
Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology or the Sample, Explore, Modify, Model, and Assess (SEMMA) methodology
What is the name of the chart that shows the improvement that a predictive model provides over a random selection in capturing the target class cases?
Cumulative lift chart
What are the most popular performance charts?
Cumulative lift chart Decile-wise lift chart Receiver operating characteristic (ROC) curve
Unsupervised data mining techniques are especially effective for:
Data exploration Dimension reduction Pattern recognition
What is the name of the chart that shows the improvement that a prediction model provides over a random selection but presents the information in 10 equal-sized intervals (e.g., every 10% of the observations)?
Decile-wise lift chart
A relationship between CUSTOMER and ORDER entities would be 1:M because
Each order can only belong to one customer
Which measure, defined as the length of a straight line between two observations, is one of the most widely used for evaluating similarity with numerical variables?
Euclidean distance
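The definition above can be sketched in a few lines; the function name and the two sample points are illustrative only.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two observations of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical observations measured on two numerical variables.
print(euclidean_distance([0, 0], [3, 4]))
```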
In the semi-log regression model, not all variables are transformed into logs. The semi-log model that transforms only the response variable is often called:
Exponential regression model
True or false: R^2 can decrease as we add more predictor variables to the linear regression model
False
True or false: The greater the k, the lesser the reliability of the k-fold method and the greater will be its computational cost
False
In the case of a dummy variable categorizing a person's gender, we can define 1 for male and 0 for female. In this case, what would the reference category be?
Female
The formula for calculating the Matching coefficient is (the number of variables with matching outcomes)/(total number of variables). The _________ the value of the matching coefficient, the more similar the two observations are
Higher
Another commonly used transformation that captures nonlinearities is based on the natural logarithm. Which of the following variables are commonly log-transformed? Select all that apply. House prices Age Scores Income
House prices Income
Important risk factors for high blood pressure reported by the National Institute of Health include weight and ethnicity. High blood pressure is common in adults who are overweight and are African American. A public policy researcher in Atlanta surveyed 150 adult men about 5′10″ in height and in the 55-60 age group. Data were collected on their systolic pressure, weight (in pounds), and race (Black = 1 for African American, 0 otherwise). The resulting regression equation, which includes the interaction between weight and race, is: Systolic = 70.8312 + 0.4362Weight + 30.2482Black − 0.1118(Weight × Black). The interaction variable is negative and statistically significant at the 5% level. Interpret what a negative interaction implies in this example:
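Implies that Black men carry their weight better in terms of systolic pressure than their non-Black counterparts: each additional pound raises predicted systolic pressure by 0.4362 − 0.1118 = 0.3244 for Black men, versus 0.4362 for non-Black men.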
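One way to see the interaction is to compute the partial effect of weight separately for each group from the fitted equation. The sketch below uses only the coefficients given in the question; the function names are illustrative.

```python
INTERCEPT = 70.8312
B_WEIGHT = 0.4362
B_BLACK = 30.2482
B_INTERACTION = -0.1118

def predicted_systolic(weight, black):
    """Fitted equation from the card; black is 1 or 0."""
    return (INTERCEPT + B_WEIGHT * weight + B_BLACK * black
            + B_INTERACTION * weight * black)

def weight_slope(black):
    """Change in predicted systolic pressure per extra pound of weight."""
    return B_WEIGHT + B_INTERACTION * black

print(weight_slope(0))  # slope for non-Black men
print(weight_slope(1))  # slope for Black men (smaller, due to the interaction)
```

The negative interaction lowers the slope on weight for Black men; it does not by itself say which group has the higher predicted pressure at a given weight, since the Black dummy also shifts the intercept.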
What is the term in regression models when a predictor variable has a different partial effect on the outcome depending on the values of another predictor variable?
Interaction effect
Euclidean and Manhattan distances are suitable for numerical variables. What similarity measures should we use for categorical and binary data?
Matching coefficient Jaccard's coefficient
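A small sketch of both measures for binary data. The matching coefficient counts every agreement (1-1 and 0-0), while Jaccard's coefficient ignores the 0-0 agreements. The two observations are hypothetical, and returning 0.0 when every pair is 0-0 is a convention chosen here, not something the cards specify.

```python
def matching_coefficient(a, b):
    """Share of variables on which the two observations agree."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

def jaccard_coefficient(a, b):
    """Share of 1-1 matches among variables that are not 0-0 matches."""
    both_one = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    both_zero = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    denominator = len(a) - both_zero
    # Convention assumed here: no informative variables means similarity 0.0.
    return both_one / denominator if denominator else 0.0

obs_a = [1, 0, 0, 1, 0]
obs_b = [1, 0, 0, 0, 0]
print(matching_coefficient(obs_a, obs_b))
print(jaccard_coefficient(obs_a, obs_b))
```

With many shared zeros, the matching coefficient comes out much higher than Jaccard's, which is why Jaccard's is preferred when negative outcomes carry little information.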
The logistic regression model cannot be estimated with standard ordinary least squares (OLS) procedures. Instead, we rely on which method?
Maximum likelihood estimation (MLE)
There are numerous applications where the relationship between the predictor variable and the response variable cannot be represented by a straight line and, therefore, must be captured by an appropriate curve. What are some simple transformations of the variables for nonlinear relationships? Select all that apply. Natural logarithms Goodness of fit Squares Dummy variables
Natural logarithms Squares
Which of the following is true of a data warehouse? It is a small-scale data warehouse or a subset of the enterprise data warehouse that focuses on one particular subject or decision area. One of its primary purposes is to support decision making It can be designed to support the marketing department for analyzing customer behaviors, and it contains only the data relevant to such analyses Data in a data warehouse are usually organized around subjects such as sales, customers, or products that are relevant to business decision making
One of its primary purposes is to support decision making Data in a data warehouse are usually organized around subjects such as sales, customers, or products that are relevant to business decision making
An effective strategy for dealing with these issues is category reduction, where we collapse some of the categories to create fewer nonoverlapping categories. The first guideline states that categories with very few observations may be combined to create the ________________ category
Other
_____________ occurs when a predictive model is made overly complex to fit the quirks of given sample data. By making the model conform too closely to the sample data, its predictive power is compromised.
Overfitting
Examples of the use of prediction models include:
Predict the selling price of a house Predict the spending of a customer
Which nonlinear regression model is appropriate when the slope, capturing the influence of x on y, changes in magnitude as well as sign?
Quadratic regression model
Which tool shows the sensitivity and specificity measures across all cutoff values and how accurately the model is able to classify both target and non target class cases overall?
ROC curve
Which of the following are reasons for missing values in data? Select all that apply. Respondents always provide all the requested information Respondents decline to provide the information due to its sensitive nature There are never missing values in data Some of the questions do not apply to every respondent
Respondents decline to provide the information due to its sensitive nature Some of the questions do not apply to every respondent
The most popular query language used today is ___________________. This popular query language is used for manipulating data in a relational database using relatively simple and intuitive commands.
SQL
The basic structure of a SQL statement is relatively simple and usually consists of three keywords: Which of the following is a SQL keyword? Select all that apply! Where Choose Select From
Select Where From
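To illustrate the three keywords together, here is a sketch that runs one SELECT ... FROM ... WHERE statement against an in-memory SQLite database using Python's standard `sqlite3` module. The `customer` table, its columns, and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customer (name, city) VALUES (?, ?)",
    [("Ana", "Atlanta"), ("Ben", "Boston"), ("Cy", "Atlanta")],
)

# SELECT picks the columns, FROM names the table, WHERE filters the rows.
rows = conn.execute(
    "SELECT name FROM customer WHERE city = ? ORDER BY name",
    ("Atlanta",),
).fetchall()
conn.close()

print(rows)
```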
________________ measures gauge whether a group of observations are similar or dissimilar to one another
Similarity
_________________ data also allows us to review the range of values for each variable.
Sorting
This difference in scale distorts the true distance between observations and can lead to inaccurate results. It is common, therefore, to make the observations unit-free. How is this accomplished?
Standardizing Normalizing
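A sketch of the two approaches, assuming that "standardizing" means converting to z-scores with the sample standard deviation and that "normalizing" means min-max scaling to the range 0 to 1. The card does not spell out either definition, so treat both as assumptions.

```python
import statistics

def standardize(values):
    """z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def normalize(values):
    """Min-max scaling: (value - min) / (max - min), giving values in [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

incomes = [10, 20, 30]  # hypothetical values in arbitrary units
print(standardize(incomes))
print(normalize(incomes))
```

Either transformation removes the units, so a variable measured in dollars no longer dominates a distance calculation over one measured in years.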
In order to select the preferred model, we examine several goodness-of-fit measures. Which ones?
The coefficient of determination The standard error of the estimate The adjusted coefficient of determination
What is the linear regression model applied to a binary response variable called?
The linear probability regression model
What is true of the cross-validation method?
The sample is partitioned into a training set and a validation set to assess how well the estimated model predicts with unseen data The holdout method is a cross validation method The k-fold cross-validation method is a cross validation method
What is used to evaluate how well the sample regression equation fits the data?
The standard error of the estimate The coefficient of determination, R^2
We use analysis of variance (ANOVA) in the context of the linear regression model to derive R^2. We denote the total variation in y as Σ(y_i − ȳ)^2, which is the numerator in the formula for the variance of y. What is this total variation called?
Total sum of squares
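The sketch below computes the total sum of squares (SST) for a small made-up data set, fits a simple linear regression by ordinary least squares, and derives R^2 as 1 − SSE/SST. The data and variable names are hypothetical.

```python
x = [1, 2, 3, 4]
y = [2, 4, 5, 8]

n = len(y)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS estimates for the simple linear regression y = b0 + b1 * x.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation in y
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
r_squared = 1 - sse / sst

print(sst, sse, r_squared)
```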
As a common practice, in data partitioning in the oversampling technique, which data set is oversampled?
Training data set
True or false: A relational database consists of one or more logically related data files, where each data file is a two-dimensional grid that consists of rows and columns.
True
True or false: In a business setting, we might use a 1:1 relationship to describe a situation where each department can have only one manager and each manager can only manage one department.
True
True or false: In the holdout method, the sample data is partitioned into two independent and mutually exclusive data sets-the training set and the validation set
True
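A minimal sketch of a holdout partition. The 60/40 split, the seed, and the function name are assumptions made for illustration; the card only requires that the two sets be mutually exclusive.

```python
import random

def holdout_split(observations, train_fraction=0.6, seed=42):
    """Shuffle the observations, then cut them into two disjoint sets."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

training, validation = holdout_split(range(10))
print(len(training), len(validation))
```

The model would be estimated on `training` only and then scored on `validation`, which it has not seen.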
Target (success) class is Class 1 Non target class is Class 0 In the confusion matrix there are 4 possible outcomes. When a Class 1 observation is correctly classified by the model, what would it be called?
True positive (TP)
Which of the following are the very first tasks most data analysts perform to gain a better understanding and insights into the data? Sorting the data Counting the data Copying the data Visually reviewing
Visually reviewing Counting the data Sorting the data
When is the RMSE performance measure most desirable?
When large errors are particularly undesirable
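To see why RMSE suits cases where large errors are costly, the sketch below compares it with the mean absolute error on two hypothetical sets of prediction errors that share the same average absolute size. The comparison measure is introduced here for illustration and is not part of the card.

```python
import math

def mean_absolute_error(errors):
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error: squaring weights large errors more heavily."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

steady_errors = [2, 2, 2, 2]  # four moderate misses
spiky_errors = [0, 0, 0, 8]   # three perfect predictions, one large miss

print(mean_absolute_error(steady_errors), mean_absolute_error(spiky_errors))
print(rmse(steady_errors), rmse(spiky_errors))
```

Both sets have the same mean absolute error, but the set containing the single large miss has the higher RMSE.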
Oftentimes, a categorical variable is defined by more than two categories. For example, the mode of transportation used to commute may be described by three categories: Public Transportation, Driving Alone, and Car Pooling. Given k categories of a variable, the general rule is to create how many dummy variables?
k - 1
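The k − 1 rule can be sketched for the commuting example: with three categories, two dummy variables are enough, and the omitted category becomes the reference. Choosing Car Pooling as the reference is arbitrary here, and the function name is illustrative.

```python
modes = ["Public Transportation", "Driving Alone", "Car Pooling"]
reference = "Car Pooling"  # arbitrary choice of reference category

def dummy_encode(value, categories, reference_category):
    """Return k - 1 dummy variables; the reference category is all zeros."""
    return {c: int(value == c) for c in categories if c != reference_category}

for mode in modes:
    print(mode, dummy_encode(mode, modes, reference))
```

Including a dummy for every category alongside an intercept would make the dummies sum to one for each observation, which is why one category is left out.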
When comparing models with the same response variable, we prefer the model with a smaller Se. A smaller Se implies that there is _____________ dispersion of the observed values from the predicted values
less
Data ______________________ is a process that an organization uses to acquire, organize, store, manipulate, and distribute data
management
In addition to binning, another common approach is to create new variables through ________________ transformations of existing variables.
mathematical
Data ____________________ is the process of defining the structure of a database.
modeling
The z-score measures the distance of a given observation from the sample mean in terms of standard deviation. The z-score is an example of making observations unit free. This is an example of ________________ data
numerical
There are two common strategies for dealing with missing values: ___________________ & ___________________.
omission and imputation
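Both strategies can be sketched on a short list in which `None` marks a missing value. Mean imputation is used here as one common choice; the card does not say which imputation rule is intended, and the data are hypothetical.

```python
values = [10, None, 30, None, 20]

# Omission: drop the observations with missing values.
omitted = [v for v in values if v is not None]

# Imputation: replace each missing value, here with the mean of the observed ones.
observed_mean = sum(omitted) / len(omitted)
imputed = [observed_mean if v is None else v for v in values]

print(omitted)
print(imputed)
```

Omission shrinks the sample, while imputation keeps every observation at the cost of filling in values that were never observed.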
Data _________ is the process of dividing a data set into a training, validation, and in some situations, an optional test data set
partitioning
It is important to develop ____________ measures that evaluate how well an estimated model will perform in an unseen sample, rather than making the evaluation solely on the basis of the sample data used to build the model
performance
The formula for the variance differs depending on whether we have a sample or a ______________.
population
The accuracy rate is calculated as the number of correct predictions divided by the _____________ number of observations
total
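A sketch that tallies the four confusion-matrix outcomes and computes the accuracy rate as correct predictions over total observations. The actual and predicted class labels are made up, with Class 1 as the target class.

```python
actual =    [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(actual)
sensitivity = tp / (tp + fn)  # share of target class cases classified correctly
specificity = tn / (tn + fp)  # share of non-target cases classified correctly

print(tp, tn, fp, fn, accuracy)
```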
Data ____________________ is the data conversion process from one format or structure to another
transformation
True or false: Data in a data mart are organized using a multidimensional data model called a star schema, which includes dimension and fact tables.
True
A large lecture class has 280 students. Mean score of exam: 74. Standard deviation of exam: 8. Distribution of scores is bell-shaped. How many standard deviations above the mean would a score of 90 be?
z-score = (observed value - sample mean) / sample standard deviation = (90 - 74) / 8 = 2
R^2 measures the percentage of sample variations of the response variable explained by the model. What is true when comparing linear and log-transformed regression models?
We need to compute the percentage of explained variations of y We cannot compare the percentage of explained variations of y with that of ln(y)
Which of the following is NOT a correct statement about entity-relationship diagram (ERD) attributes? A. A primary key is an attribute that uniquely identifies each instance of an entity B. The relationships between entities can only be one-to-many C. An entity is a generalized category to represent persons, places, things, or events. D. A foreign key is the primary key of a related entity.
B. The relationships between entities can only be one-to-many
Finally, another common transformation of categorical variables is to create category ________________.
scores
If a linear regression model uses only one predictor variable, then the model is referred to as a ______________ linear regression model
simple
The _____________ coefficient measures the degree to which a distribution is not symmetric about its mean.
skewness
For 0 < β1 < 1, the log-log regression model implies a positive relationship between x and E(y); as x increases, E(y) increases at a _________ rate
slower
In situations where negative outcomes are not as important as positive outcomes, what is a more appropriate measure of similarity? a. Euclidean coefficient b. Manhattan coefficient c. Jaccard's coefficient d. Matching coefficient
c. Jaccard's coefficient
The key distinction between supervised and unsupervised data mining techniques is supervised data mining is: a. Effective for data exploration b. Effective for dimension reduction c. Effective for developing predictive models d. Effective for pattern recognition
c. Effective for developing predictive models
The momentum p of an object is the product of its mass m and its velocity v; that is, p = mv. This is an example of a _________ relationship
deterministic
A ______________ variable, also referred to as an indicator or a binary variable, is commonly used to describe two categories of a variable.
dummy