QMB Final

Which widely used measure of similarity for numerical variables is defined as the length of a straight line between two observations?

Euclidean distance
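
As a minimal sketch of this measure, the function below computes the straight-line distance between two observations. The function name and the sample observations are hypothetical, chosen only for illustration.

```python
import math

def euclidean_distance(a, b):
    """Length of the straight line between two observations."""
    if len(a) != len(b):
        raise ValueError("observations must have the same number of variables")
    # Square the difference on each variable, sum, and take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical observations measured on two numerical variables.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```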

In a semi-log regression model, not all variables are transformed into logs. A semi-log model that transforms only the predictor variable is often called the:

logarithmic regression model

If a linear regression model uses only one predictor variable, then the model is referred to as a ____ linear regression model

simple

Whereas the moving average technique weighs all recent observations equally, the method called simple exponential smoothing assigns exponentially decreasing weights as the observations get "_________"

older

____ measures gauge whether a group of observations are similar or dissimilar to one another

similarity

__________ is an attempt to imitate a real-world process that produces several business scenarios.

simulation

The probability of a randomly selected case belonging to Class 1 equals the ___ of the diagonal (baseline) line

slope

Principal components are uncorrelated variables whose values are the weighted linear combinations of the original variables. Which principal component accounts for most of the variability in the data?

First

The linear trend model is used for a time series that is expected to grow by a "______" amount each time period.

Fixed

With quarterly data, an exponential trend model with seasonal dummy variables is specified as ln(y) = β0 + β1d1 + β2d2 + β3d3 + β4t + ε. What does d2 represent?

The quarter 2 dummy variable

In a quadratic regression model y = β0 + β1x + β2x² + ε, the coefficient β2 determines the shape of the relationship between x and y. Which of the following is true? Select all that apply!

The relationship between x and y is inverted U-shaped when β2 < 0; the relationship between x and y is U-shaped when β2 > 0.

True or false: In the holdout method, the sample data are partitioned into two independent and mutually exclusive data sets: the training set and the validation set

True

The exponential trend model is attractive when the expected increase in the series gets

"larger" over time

Examples of the use of prediction models include

Predicting the spending of a customer; predicting the selling price of a house

Order of CRISP DM

1. Business understanding: understanding the data mining project and its objectives, including the situational context around the project.
2. Data understanding: collecting relevant data and conducting a preliminary analysis to understand the data; results from the initial analysis may lead to ideas and potential hypotheses for subsequent data mining phases.
3. Data preparation: specific tasks in this phase include record and variable selection, data wrangling, and cleansing for subsequent analyses.
4. Modeling: selection and execution of data mining techniques, including linear and logistic regression models; certain techniques require specific formats and types of variables in the data set.
5. Evaluation: evaluating the performance of the competing models based on specific criteria in order to select the best model that meets the business objectives of the project.
6. Deployment: developing a set of actionable recommendations based on the analysis results, along with a strategy for deployment, monitoring, and feedback.
The CRISP-DM model was conceived as a life cycle, implying the cyclical nature of data mining projects.

The linear trend model is used for a time series that is expected to grow by a fixed amount each time period. What are the steps in applying the linear trend model? Please choose the steps in the correct order.

1. Collect the time series data 2. Visually inspect the time series to confirm the existence of a trend 3. Estimate and interpret the linear trend model 4. Forecast the variable of interest

The k-means clustering method assigns each observation to a cluster, such that the observations assigned to the same cluster are as similar as possible. The number of clusters k is determined prior to estimation. Place the first 5 k-means steps in the proper order below!

1. Specify the k value 2. Randomly assign k observations as cluster centers 3. Assign each observation to its nearest cluster center 4. Calculate cluster centroids 5. Reassign each observation to a cluster with the nearest centroid
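
The steps above can be sketched in a few lines. This is an illustrative version under stated assumptions, not a reference implementation: the function name, the fixed random seed, the stopping rule (stop when assignments no longer change), and the sample points are all hypothetical.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Step 1: the caller specifies k."""
    rng = random.Random(seed)
    # Step 2: randomly pick k observations as the initial cluster centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign each observation to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assumption: stop once assignments no longer change
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centers

# Hypothetical data: two well-separated groups of observations.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
labels, centers = k_means(points, k=2)
print(labels, centers)
```

Because the starting centers are random, different seeds can give different cluster labels, although for clearly separated data like this the grouping itself should be the same.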

Place the steps of the holdout method in the proper order

1. We partition the sample data into two parts, labeled training set and validation set 2. We use the training set to estimate competing models 3. We use the estimates from the training set to predict the response variable in the validation set 4. We calculate the accuracy rate for each competing model. The preferred model will have the smallest RMSE or the largest accuracy rate
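
The four holdout steps can be illustrated with a small sketch. Everything here is an assumption made for the example: the simulated data, the 60/40 split, and the two competing models (a simple linear regression and a mean-only benchmark), which are compared on validation RMSE.

```python
import math
import random

def fit_linear(xs, ys):
    """Simple linear regression estimated by ordinary least squares."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return lambda x: b0 + b1 * x

def fit_mean_only(xs, ys):
    """Benchmark model that always predicts the training mean."""
    y_bar = sum(ys) / len(ys)
    return lambda x: y_bar

def rmse(model, xs, ys):
    return math.sqrt(sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys))

# Hypothetical sample: y = 2 + 3x plus random noise.
rng = random.Random(42)
data = [(x, 2 + 3 * x + rng.gauss(0, 1)) for x in range(50)]

# Step 1: partition into training (60%) and validation (40%) sets.
rng.shuffle(data)
cut = len(data) * 60 // 100
train, valid = data[:cut], data[cut:]
train_x, train_y = zip(*train)
valid_x, valid_y = zip(*valid)

# Step 2: estimate the competing models on the training set.
models = {
    "linear": fit_linear(train_x, train_y),
    "mean_only": fit_mean_only(train_x, train_y),
}

# Steps 3 and 4: predict the validation set and compare RMSE.
validation_rmse = {name: rmse(m, valid_x, valid_y) for name, m in models.items()}
preferred = min(validation_rmse, key=validation_rmse.get)
print(validation_rmse, preferred)
```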

Important risk factors for high blood pressure reported by the National Institutes of Health include weight and ethnicity. High blood pressure is common in adults who are overweight and are African American. A public policy researcher in Atlanta surveyed 150 adult men about 5′10″ in height and in the 55-60 age group. Data were collected on their systolic pressure, weight (in pounds), and race (Black = 1 for African American, 0 otherwise). The resulting regression equation is: Systolic = 80.2085 + 0.3901Weight + 6.9082Black. What is the expected systolic blood pressure for a 170-pound Black male?

153.43
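
As a quick check of the arithmetic, the estimated equation from the question can be evaluated directly. The function name is hypothetical; the coefficients are the ones given in the question, and the result comes out at about 153.43.

```python
def predict_systolic(weight, black):
    """Systolic = 80.2085 + 0.3901*Weight + 6.9082*Black (Black is 1 or 0)."""
    return 80.2085 + 0.3901 * weight + 6.9082 * black

# 80.2085 + 0.3901*170 + 6.9082*1 = 80.2085 + 66.317 + 6.9082
print(round(predict_systolic(170, 1), 2))
```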

The Manhattan distance between observations i and j is calculated as |x1i − x1j| + |x2i − x2j| + |x3i − x3j| + ⋯ + |xki − xkj|. Calculate the Manhattan distance between Observation 1: (3, 4) and Observation 2: (4, 5).

2
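
The same calculation as a short sketch (the function name is hypothetical): sum the absolute differences variable by variable.

```python
def manhattan_distance(a, b):
    """Sum of absolute differences across all variables."""
    return sum(abs(x - y) for x, y in zip(a, b))

# |3 - 4| + |4 - 5| = 1 + 1
print(manhattan_distance((3, 4), (4, 5)))  # 2
```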

An educational researcher is trying to analyze the determinants of the applicant pool for the specialized Master of Science in Accounting (MSA) program. Two important determinants are the marketing expense of the business school and the percentage of the MSA alumni who were employed within three months after graduation. Using the estimated equation predicted Applicants = −49.5490 + 0.3550Marketing + 1.0149Employed, predict the number of applications received for a marketing expense of $80,000 (Marketing = 80) and Employed = 50.

30

An educational researcher is trying to analyze the determinants of the applicant pool for the specialized Master of Science in Accounting (MSA) program. Two important determinants are the marketing expense of the business school and the percentage of the MSA alumni who were employed within three months after graduation. Using the equation predicted Applicants = −49.5490 + 0.3550Marketing + 2.0Employed, answer the following question. If Employed increased by 30, holding Marketing constant, how many more applicants would be predicted?

60

A common practice in data partitioning is to place some percentage of the data in the training data set and the remaining percentage in the validation data set. Which of the answers below is consistent with the percentage of data in the training data set and the percentage of data in the validation data set?

60%,40%

At a University of California campus, data were collected on the starting salary of business graduates (Salary in $1,000s) along with their cumulative GPA, whether they have an MIS concentration (MIS = 1 if yes, 0 otherwise), and whether they have a statistics minor (Statistics = 1 if yes, 0 otherwise). Use the estimated equation Salary = 44.0073 + 6.6227GPA + 6.6071MIS + 6.7309Statistics. What is the additional salary a graduate would be predicted to earn with an MIS concentration?

$6,607 (6.6071 × $1,000)

Using the estimated equation Salary = 44.0073 + 6.6227GPA + 6.6071MIS + 6.7309Statistics, compute the predicted salary for a business graduate with a GPA of 3.5 and neither an MIS concentration nor a Statistics minor.

67,187
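
A small sketch of how the dummy variables enter the prediction, using the coefficients from the question (the function name is hypothetical). Salary is measured in $1,000s, so the result is multiplied by 1,000 to express it in dollars.

```python
def predict_salary(gpa, mis, statistics_minor):
    """Predicted starting salary in $1,000s; mis and statistics_minor are 1 or 0."""
    return 44.0073 + 6.6227 * gpa + 6.6071 * mis + 6.7309 * statistics_minor

# Reference category: neither an MIS concentration nor a Statistics minor.
base = predict_salary(3.5, 0, 0)
print(round(base * 1000))  # 67187

# The MIS dummy shifts the prediction by its coefficient, whatever the GPA.
mis_premium = predict_salary(3.5, 1, 0) - base
print(round(mis_premium * 1000))  # 6607
```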

"what goes with what" study designed to identify events that tend to occur together

Association rule analysis (also referred to as market basket analysis)

What is the term used to describe computer systems that demonstrate human like intelligence and cognitive functions, such as deduction, pattern recognition, and the interpretation of complex data?

Artificial intelligence

For 0 < β1 < 1, the log-log regression model implies a positive relationship between x and E(y); as x increases, E(y) increases at a slower rate. This may be appropriate in the food expenditure example where we expect food expenditure to react positively to changes in income, with the impact diminishing at higher income levels. If β1 < 0, what is the relationship between x and E(y)?

As x increases, E(y) decreases at a slower rate

When selecting the cutoff values for performance measures, in some applications, the analyst may choose to increase or decrease the cutoff value to classify fewer or more observations into the target class. What are some reasons for doing this? Select all that apply!

Asymmetric misclassification costs Uneven class distributions

What is one of the key assumptions for the Naive Bayes Method?

Categorical predictor variables are independent

KNN is one of the simplest, yet most frequently used data mining techniques for classification when the response variable is '_______' and for prediction when the response variable is '_________'.

Categorical, numerical

'_____' analysis finds similarities among data and groups them into clusters of observations that share similar characteristics.

Cluster

Which data mining technique uses similarity measures to form clusters in such a way that objects are similar within a group but dissimilar across groups?

Cluster analysis

Name two popular unsupervised data mining techniques.

Cluster analysis Association rule analysis

Principal components analysis (PCA) transforms a large number of possibly correlated variables into a smaller number of uncorrelated variables called principal ___

Components

What is referred to as 'A family of computer algorithms used to model the risk or uncertainty of a real-world process or system'?

Computer simulation Relies on random sampling The Monte Carlo Method

In the field of data mining, there is a growing need for the establishment of standards. When conducting data mining analysis, practitioners generally adopt one of two standards. What are they?

Cross-Industry Standard Process for Data Mining (CRISP-DM); Sample, Explore, Modify, Model, and Assess (SEMMA)

Higher-order polynomial functions can also be estimated. The "_________" trend model allows for two changes in the direction of a series.

Cubic

Although a linear relationship can be adequate, there are many cases in which a nonlinear functional form is more suitable. Which of the following are trend models that use a nonlinear functional form? Select all that apply!

Cubic, quadratic, exponential

What are two methods used to detect overfitting and provide an objective assessment of the predictive performance of models? Select all that apply!

Data partitioning Cross validating

What is the name of the chart that shows the improvement that a predictive model provides over a random selection but presents the information in 10 equal-sized intervals (e.g., every 10% of the observations)?

Decile wise lift chart

It is sometimes more informative to have graphic representations to assess the predictive performance of data mining models. Which of the following are the most popular performance charts? Select all that apply!

Decile wise lift chart Cumulative lift chart ROC curve

The height of each branch (cluster) or sub-branch (sub-cluster) indicates the distance, or how '__________' it is from the other branches or sub-branches with which it is merged.

Dissimilar

Hierarchical clustering is a technique that uses an iterative process to group data into a hierarchy of clusters. What are two common strategies of hierarchical clustering?

Divisive clustering Agglomerative clustering

With seasonal data, we estimate a linear trend model that also includes "________" variables to capture the seasonal variations.

Dummy

If the linear regression model includes an Intercept, the number of dummy variables representing a categorical variable should be one less than the number of categories of the variable. This solution helps to avoid which problem?

Dummy Variable Trap

True or false: The exponential trend model is attractive when the expected increase in the series gets smaller over time.

False

True or false: For quarterly data, we need to define only two dummy variables, using the other two quarters as reference.

False (three dummy variables are needed, with the remaining quarter as the reference)

True or false: The greater the k, the lower the reliability of the k-fold method and the greater its computational cost

False: the greater the k, the greater the reliability (the computational cost also increases)

Mixed data are quite often of interest in business applications. Which of the following is an example of mixed data?

Gender and age Income and gender

What are two of the most relevant discrete probability distributions for Monte Carlo simulation?

Poisson, binomial

Important risk factors for high blood pressure reported by the National Institutes of Health include weight and ethnicity. High blood pressure is common in adults who are overweight and are African American. A public policy researcher in Atlanta surveyed 150 adult men about 5′10″ in height and in the 55-60 age group. Data were collected on their systolic pressure, weight (in pounds), and race (Black = 1 for African American, 0 otherwise). The resulting regression equation, which includes the interaction between weight and race, is: Systolic = 70.8312 + 0.4362Weight + 30.2482Black − 0.1118(Weight × Black). The coefficient of the interaction variable is negative and statistically significant at the 5% level. Interpret what a negative interaction implies in this example:

Implies that black men carry their weight better in terms of the systolic pressure than their non black counterparts.
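
To see where this interpretation comes from, the sketch below evaluates the partial effect of weight for each group using the coefficients in the question. The function names are hypothetical. With the interaction term, each additional pound adds 0.4362 to predicted systolic pressure when Black = 0, but only 0.4362 − 0.1118 = 0.3244 when Black = 1.

```python
def predict_systolic_interaction(weight, black):
    """Systolic = 70.8312 + 0.4362*Weight + 30.2482*Black - 0.1118*(Weight*Black)."""
    return 70.8312 + 0.4362 * weight + 30.2482 * black - 0.1118 * weight * black

def weight_effect(black):
    """Change in predicted systolic pressure per additional pound."""
    return 0.4362 - 0.1118 * black

print(round(weight_effect(0), 4))  # 0.4362
print(round(weight_effect(1), 4))  # 0.3244
```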

In situations where negative outcomes are not as important as positive outcomes, what is a more appropriate measure of similarity?

Jaccard's coefficient

What are two common clustering techniques? Select all that apply!

K means clustering Hierarchical clustering

Note that for the exponential model, we compute ŷt in regular units and not in natural logs. The resulting ŷt also enables us to compare the linear and the exponential models in terms of which of the following? (Select all that apply!)

MSE, MAD and MAPE

Euclidean and Manhattan distance measures are suitable for numerical variables. When dealing with categorical variables, we rely on other measures of similarity. What are two commonly used measures for categorical and binary data? Please select all that apply!

Matching coefficient; Jaccard's coefficient
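
A minimal sketch of both measures for binary data follows. It assumes the usual definitions: the matching coefficient is the share of variables with matching outcomes, while Jaccard's coefficient ignores 0-0 matches and counts only 1-1 matches. The function names and sample observations are hypothetical.

```python
def matching_coefficient(a, b):
    """Number of variables with matching outcomes / total number of variables."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def jaccard_coefficient(a, b):
    """1-1 matches / (total variables minus 0-0 matches)."""
    both_one = sum(x == 1 and y == 1 for x, y in zip(a, b))
    both_zero = sum(x == 0 and y == 0 for x, y in zip(a, b))
    denominator = len(a) - both_zero
    if denominator == 0:
        raise ValueError("Jaccard's coefficient is undefined when every pair is 0-0")
    return both_one / denominator

# Hypothetical binary observations on five variables.
obs_1 = [1, 0, 0, 1, 0]
obs_2 = [1, 0, 0, 0, 0]
print(matching_coefficient(obs_1, obs_2))  # 0.8
print(jaccard_coefficient(obs_1, obs_2))   # 0.5
```

The gap between the two values shows why Jaccard's coefficient is preferred when shared absences (0-0 matches) carry little information.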

The Monte Carlo Simulation technique relies on random sampling to mimic the odds of all possible outcomes. Which of the following is true?

Monte Carlo technique is stochastic and probabilistic

Sometimes a time series reverses direction, due to any number of circumstances. A common method for forecasting this type of series is: Select the correct answer!

Polynomial trend model

Each data mining technique has its advantages and limitations. Which data mining technique mimics the neural structure of the brain using learning, memory, and generalization?

Neural networks

The variance inflation factor (VIF) is another measure that can detect a high correlation between three or more predictor variables even if no pair of predictor variables has a particularly high correlation. What is the smallest possible value of VIF (absence of multicollinearity)

One

Which of the following is a disadvantage of logistic regression methods? Select all that apply!

Only for classification Can be affected by collinearity between predictor variables

It is important to develop '_________' measures that evaluate how well an estimated model will perform in an unseen sample, rather than making the evaluation solely on the basis of the sample data used to build the model.

Performance

We explore supervised data mining techniques, including classification models where the target variable is categorical and prediction models where the target variable is numerical. Examples of classification models include which of the following? Select all that apply!

Predicting whether or not a consumer will make a purchase Predicting whether a mortgage will be approved

_________ analytics uses simulation and optimization algorithms to quantify the effect of different possible actions of a decision maker to make a more informed decision.

Prescriptive

Predictions with the exponential regression model are made by ŷ = exp(b0 + b1x + se²/2). Which of the following is true?

se is the standard error of the estimate; b0 and b1 are the coefficient estimates
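
The prediction formula can be sketched as follows. The coefficient values used here are hypothetical and serve only to show the mechanics; the formula itself is the one stated in the question.

```python
import math

def predict_exponential(b0, b1, se, x):
    """y-hat = exp(b0 + b1*x + se^2 / 2), returned in regular units rather than logs."""
    return math.exp(b0 + b1 * x + se ** 2 / 2)

# Hypothetical estimates: b0 = 1.2, b1 = 0.05, se = 0.1.
y_10 = predict_exponential(1.2, 0.05, 0.1, 10)
y_11 = predict_exponential(1.2, 0.05, 0.1, 11)

# A one-unit increase in x multiplies the prediction by exp(b1),
# which is roughly a b1 * 100 percent change when b1 is small.
print(y_11 / y_10)
```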

Bayes' theorem uses new information to update a '________' probability to form a '_______' probability

Prior, posterior
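
A short sketch of the update for a two-outcome case. The function name, parameter names, and all probabilities below are hypothetical, picked only to show how a prior becomes a posterior once new information arrives.

```python
def posterior_probability(prior, p_info_given_event, p_info_given_no_event):
    """P(event | info) from Bayes' theorem for a two-outcome event."""
    # Total probability of observing the new information.
    evidence = p_info_given_event * prior + p_info_given_no_event * (1 - prior)
    return p_info_given_event * prior / evidence

# Hypothetical numbers: prior of 1%, and new information that appears
# 90% of the time when the event is true and 5% of the time otherwise.
posterior = posterior_probability(0.01, 0.90, 0.05)
print(round(posterior, 4))  # 0.1538
```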

Which time series component shows repetitions over a one-year period? For example, every year, sales of retail goods increase during the holiday season, and the number of vacation packages sold goes up during the summer.

Seasonal

"___________" forecasting methods are based on the judgment of the forecaster, who uses prior experience and expertise to make forecasts.

Qualitative

Which tool shows the sensitivity and specificity measures across all cutoff values and how accurately the model is able to classify both target and non-target class cases overall?

ROC curve

Smoothing techniques for forecasting are used when the time series exhibits "_________" fluctuations with no discernible trend or seasonal fluctuations.

Random

A dummy variable, also referred to as an indicator or a binary variable, takes on numerical values of 1 or 0 to describe two categories of a categorical variable. For a predictor variable that is a dummy variable, it is common to refer to the category that assumes a value of 0 as: Please select all that apply.

Reference category, benchmark category

Which of the following describes a prescriptive analytics method?

Results are analyzed to improve decision making

Smoothing techniques are employed to reduce the effect of random fluctuations. Which of the following are two distinct smoothing techniques discussed in the book

Simple exponential smoothing Moving average technique

___________ is a procedure for continually revising the forecast in the light of more recent observations.

Smoothing

There are numerous applications where the relationship between the predictor variable and the response variable cannot be represented by a straight line and, therefore, must be captured by an appropriate curve. What are some simple transformations of the variables for nonlinear relationships? Select all that apply!

Squares, natural logarithms

Data mining uses many kinds of computational algorithms to identify hidden patterns and relationships in data. For developing predictive models, one tends to employ '__________' data mining techniques.

Supervised

Unsupervised data mining, also referred to as unsupervised learning, requires no knowledge of the response variable. It is called unsupervised because: (select the correct response below)

The algorithms allow the computer to identify complex processes and patterns without any specific guidance from the analyst

The exponential regression model is specified as ln(y) = β0 + β1x + ε. What does β1 × 100 measure?

The approximate percentage change in E(y) when x increases by one unit

What is used to evaluate how well the sample regression equation fits the data?

The coefficient of determination, R²

In order to select the preferred model, we examine several goodness-of-fit measures:

The coefficient of determination, the adjusted coefficient of determination

Time series forecasting models consist of which components?

The trend component; the seasonal component; the cyclical component; the random component

In order to measure the distance between two observations with mixed data, it is common to use the distance measure, referred to as Gower's coefficient. Gower's coefficient computes which of the following?

The distance for each variable, converts it into a [0, 1] scale, and calculates a weighted average of the scaled distances

Which of the following is true of cross-validation methods?

The holdout method is a cross validation method The k-fold cross validation method is a cross validation method The sample is partitioned into a training set and a validation set to assess how well the estimated model predicts with unseen data

In regression models, we use both numerical and dummy (categorical) variables as predictor variables, what are the 3 interaction variables we discussed in the chapter?

The interaction of a dummy variable with a numerical variable The interaction of two numerical values The interaction of two dummy variables

Which of the following describes the k-fold cross validation method

The k-fold method is less sensitive to data partitioning than the holdout method The sample data are partitioned into k subsets, where one of the k subsets is used as the validation set
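
The partitioning described above can be sketched as a generator of training and validation index sets. This is a simplified illustration: the function name is hypothetical, and the folds are taken in order without shuffling, which a real analysis would usually add.

```python
def k_fold_splits(n, k):
    """Yield (training indices, validation indices) for each of the k folds."""
    indices = list(range(n))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size

# Each observation serves as validation data exactly once.
splits = list(k_fold_splits(10, 3))
for training, validation in splits:
    print(validation)
```

Averaging the performance measure over the k validation folds is what makes the result less sensitive to any single partition than the holdout method.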

Name 3 prediction model performance measures described in the chapter

The mean error (ME) The root mean square error (RMSE) The mean absolute percentage error (MAPE)

The AGNES algorithm uses one of several linkage methods. Which linkage method uses the nearest distance between a pair of observations that do not belong to the same cluster?

The single linkage method

With either CRISP-DM or SEMMA, which of the following is it important to fully understand prior to preparing the data and choosing analysis techniques? Please select all that apply!

The surrounding socioeconomic climate Business goals Underlying issues for the business

A ___ series is a sequence of observations that are ordered in time

Time

When creating a decile-wise lift chart, if the lift for the first 10% of the observations (first bar) is about 7.1, what does that mean?

Top 10% of observations selected by model contain 7.1x as many Class 1 cases as the 10% of the observations randomly selected
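
A minimal sketch of how decile-wise lift can be computed: rank the observations by predicted probability, split them into ten equal groups, and divide each group's Class 1 rate by the overall Class 1 rate. The function name and the data are hypothetical, and the sketch assumes at least ten observations and at least one Class 1 case.

```python
def decile_lifts(actual, predicted_prob):
    """Lift for each of the 10 deciles, highest predicted probabilities first."""
    ranked = [
        a for _, a in sorted(zip(predicted_prob, actual),
                             key=lambda pair: pair[0], reverse=True)
    ]
    n = len(ranked)
    overall_rate = sum(ranked) / n  # what random selection would capture
    lifts = []
    for d in range(10):
        chunk = ranked[d * n // 10:(d + 1) * n // 10]
        lifts.append((sum(chunk) / len(chunk)) / overall_rate)
    return lifts

# Hypothetical validation data: 20 observations, 4 of them Class 1.
actual = [1, 1, 1, 0, 1, 0] + [0] * 14
predicted_prob = [1.0 - 0.05 * i for i in range(20)]
lifts = decile_lifts(actual, predicted_prob)
print(lifts[0])  # first-decile lift
```

In this example the first decile contains only Class 1 cases while the overall rate is 20%, so the first bar of the chart would stand at a lift of 5.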

Recall the data partitioning is the process of dividing a data set into training, validation, and an optional test data set. As a common practice, in the oversampling technique, which data is oversampled?

Training data set

What type of analysis extracts long term upward or downward movement of a time series

Trend

True or false: The Holt exponential smoothing method incorporates long-term upward or downward movements of a time series

True

True or false: The terms artificial intelligence, machine learning, and data mining are often grouped together or used interchangeably because their definitions tend to overlap with no clear boundaries

True

True or false: A Monte Carlo simulation can be used to estimate the average profit or a range of best- and worst-case scenarios at different staffing levels

True
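
A rough sketch of such a simulation is shown below. Every number in it is a hypothetical assumption made for illustration: daily demand follows a Poisson distribution with mean 20, each staff member can serve up to 5 customers, each customer brings $10 in revenue, and each staff member costs $30 per day. The Poisson draw uses Knuth's multiplication method because the Python standard library does not provide one.

```python
import math
import random

def poisson_draw(rng, lam):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def average_profit(staff, trials=20000, seed=1):
    """Estimate mean daily profit at a given staffing level."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        demand = poisson_draw(rng, lam=20)   # customers arriving that day
        served = min(demand, 5 * staff)      # capacity limit of 5 per staff member
        total += 10 * served - 30 * staff    # revenue minus wages
    return total / trials

# Compare average profit across hypothetical staffing levels.
estimates = {staff: average_profit(staff) for staff in (2, 4, 6)}
print(estimates)
```

Collecting the individual trial profits instead of only their sum would also give the range of best- and worst-case outcomes at each staffing level.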

True or false: Compared to the hierarchical clustering methods, the k-means clustering method is more computationally efficient, especially when dealing with large data sets.

True

True or false: In any business setting, when the probability distributions are known for all relevant random variables, we can easily formulate and develop a Monte Carlo simulation. The next crucial step is to construct a quantitative model that represents the relationship among relevant variables.

True

True or false: In-sample criteria do not help us assess how well an estimated model will predict in an unseen sample

True

Assume that the target class, also called the success class (in general, the class of interest), is Class 1 and that the non-target class is Class 0. In the confusion matrix there are 4 possible outcomes. When a Class 1 observation is correctly classified by the model, what would it be called? (according to the textbook)

True Positive (TP)
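
The four outcomes can be tallied with a short sketch. The function name, the default cutoff of 0.5, and the sample data are hypothetical. Lowering the cutoff classifies more observations into the target class, which is how an analyst can trade sensitivity against specificity.

```python
def confusion_matrix(actual, predicted_prob, cutoff=0.5):
    """Count TP, TN, FP, and FN, treating Class 1 as the target class."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted_prob):
        predicted = 1 if p >= cutoff else 0
        if a == 1 and predicted == 1:
            counts["TP"] += 1  # Class 1 correctly classified
        elif a == 0 and predicted == 0:
            counts["TN"] += 1  # Class 0 correctly classified
        elif a == 0 and predicted == 1:
            counts["FP"] += 1  # Class 0 misclassified as Class 1
        else:
            counts["FN"] += 1  # Class 1 misclassified as Class 0
    return counts

# Hypothetical validation-set outcomes and predicted probabilities.
actual = [1, 1, 0, 0, 1]
predicted_prob = [0.9, 0.4, 0.6, 0.1, 0.7]

cm = confusion_matrix(actual, predicted_prob)
sensitivity = cm["TP"] / (cm["TP"] + cm["FN"])
specificity = cm["TN"] / (cm["TN"] + cm["FP"])
print(cm, sensitivity, specificity)

# A lower cutoff moves more observations into the target class.
cm_low = confusion_matrix(actual, predicted_prob, cutoff=0.3)
print(cm_low)
```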

When making forecasts with an exponential model, what type of values is it advisable to use?

Unrounded

Which of the following is true of qualitative forecast methods? Select all that apply!

Used when future results are suspected to depart markedly from results in prior periods Attractive when historical data are not available

What is the tree like structure that allows users to visually inspect the clustering result and determine the appropriate number of clusters in the data

dendrogram

R² measures the percentage of the sample variations of the response variable explained by the model. Which of the following are true when comparing linear and log-transformed regression models?

We cannot compare the percentage of explained variations of y with that of the ln(y) We need to compute the percentage of explained variations of y

Prescriptive analytics refers to using simulation and optimization algorithms to advise on which of the following?

What businesses should do

When is the RMSE performance measure most desirable?

When large errors are particularly undesirable

When is the use of the Holt exponential smoothing method appropriate?

When the time series has been deseasonalized; when the time series exhibits trend but no seasonality

In the case of simple exponential smoothing, forecasts are weighted averages of past observations, with the weights decaying exponentially as the observations get older. Recall the smoothing parameter α: which of the following is true of α? Select all that apply!

With smaller values of α, greater emphasis is on past observations With larger values of α, we pay attention mainly to the most recent observations α is a smoothing constant
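
The role of α can be seen in a small sketch of both smoothing techniques. It assumes the common recursive form, in which the new level equals α times the latest observation plus (1 − α) times the previous level, and it assumes the level is initialized with the first observation. The function names and the series are hypothetical.

```python
def simple_exponential_smoothing(series, alpha):
    """One-step-ahead forecast from simple exponential smoothing."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]  # assumption: start the level at the first observation
    for y in series[1:]:
        # Weight alpha on the newest observation, (1 - alpha) on the past.
        level = alpha * y + (1 - alpha) * level
    return level

def moving_average_forecast(series, m):
    """Forecast as the equally weighted average of the last m observations."""
    return sum(series[-m:]) / m

series = [10, 12, 14]  # hypothetical observations
print(simple_exponential_smoothing(series, alpha=0.5))  # 12.5
print(simple_exponential_smoothing(series, alpha=1.0))  # 14.0
print(moving_average_forecast(series, m=2))             # 13.0
```

With α = 1 the forecast is just the most recent observation, while a smaller α keeps more weight on older observations.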

Suppose the competing hypotheses in testing for individual significance are H0: βj = 0 versus HA: βj ≠ 0. What would rejecting the null hypothesis imply?

xj is significant in explaining y

What is the term for a table that summarizes classification outcomes obtained from the validation set

confusion matrix

The If portion of the association rule analysis is called the ___ and the Then portion is called the consequent

antecedent

A good predictive model would have a ROC curve that lies above the diagonal line. The greater the area between the ROC curve and the baseline, the ___ the model is

better

What is the name of the chart that shows the improvement that a predictive model provides over a random selection in capturing the target class cases?

cumulative lift chart

Specific tasks in this phase of the CRISP-DM methodology include record and variable selection, data wrangling, and cleansing data for subsequent analyses. For example, certain data mining techniques may require categorical variables to be transformed into binary variables. What is this step?

data preparation

If the value of the response variable is uniquely determined by the values of the predictor variables, we say that the relationship between the variables is

deterministic

The key distinction between supervised and unsupervised techniques is that, in supervised data mining

effective for developing predictive models

In-sample and out-of-sample criteria are based on the forecast ''__________"

error

The formula for calculating the matching coefficient is (number of variables with matching outcomes)/(total number of variables). The '______' the value of the matching coefficient, the more similar the two observations are.

higher

Another commonly used transformation that captures nonlinearities is based on the natural logarithm. Which of the following variables are commonly log-transformed? Select all that apply!

income, house prices

What is the term in regression models when a predictor variable has a different partial effect on the outcome depending on the values of another predictor variable?

interaction effect

Which of the following is true of Association rule analysis?

is an unsupervised data mining technique

The premise of memory-based reasoning is that we tend to guide our decisions using the memories of similar situations we experienced in the past. Which data mining technique belongs to the memory-based reasoning category of data mining?

k-nearest neighbors

Definition: an attempt to find an optimal way to achieve a business objective under constraints, such as limited capacities, financial resources, and competing priorities. What is this a definition of?

optimization

When an estimated model begins to describe the quirks of the data rather than the real relationships between variables this is called

overfitting

___ occurs when a predictive model is made overly complex to fit the quirks of given sample data. By making the model conform too closely to the sample data, its predictive power is compromised

overfitting

The ___ technique involves intentionally selecting more samples from one class than from the other class or classes in order to adjust the class distribution of a data set

oversampling

Data ___ is the process of dividing a data set into a training, validation and an optional test set

partitioning

The natural logarithm converts changes in a variable into ___ changes

percentage

Which distribution is particularly useful for modeling when we are interested in finding the number of occurrences (successes) of a certain event over a given interval of time or space?

poisson

In AGNES, each observation in the data initially forms its own cluster. The algorithm then successively merges these clusters into larger clusters based on their similarity until all observations are merged into one final cluster, referred to as a '______'

root

What is the process for classifying or predicting the value of the target variable of a new observation for given values of predictor variables called

scoring a record

In the log-log regression model, both the response variable and the predictor variable are transformed into natural logs. We can write this model as ln(y) = β0 + β1ln(x) + ε. For 0 < β1 < 1, the log-log regression model implies a positive relationship between x and E(y); as x increases, E(y) increases at a '_______' rate.

slower

The z-score measures the distance of a given observation from the sample mean in terms of standard deviations. The z-score is an example of making observations unit-free. This is an example of '________' data.

standardizing

A difference in scale between variables distorts the true distance between observations and can lead to inaccurate results. It is common, therefore, to make the observations unit-free. How is this accomplished?

standardizing, normalizing
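
Both approaches can be sketched briefly. The sketch assumes that standardizing means converting to z-scores with the sample standard deviation, and that normalizing means min-max rescaling to the [0, 1] range. The function names and the data are hypothetical.

```python
import statistics

def standardize(values):
    """z-score: distance from the sample mean in standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / sd for v in values]

def normalize(values):
    """Min-max rescaling so that every value falls in [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

# Hypothetical observations on one numerical variable.
data = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = standardize(data)
scaled = normalize(data)
print(z_scores)
print(scaled)
```

After either transformation, variables measured in different units contribute on a comparable scale to a distance such as the Euclidean or Manhattan measure.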

Common applications of supervised data mining include classification and prediction models. What is true of a classification model?

the objective is to predict the class memberships of new cases The target variable is categorical

