CIS 9660 Midterm

Types of Logistic Regression

1. Basic form: Binary. The dependent variable is a dummy (0 or 1). EX: die or not die, buy or not buy.
2. Extended forms: Multinomial, where the dependent variable has categorical output (EX: car, motorcycle, or bike), and Ordinal, where the dependent variable has ordered categorical output (EX: a student gets an A, B, or lower).

The data mining process

1. business understanding 2. data understanding 3. data preparation 4. model building 5. testing and evaluation 6. deployment

Which of the following is not correct about outliers: a. Outlier is an observation that lies outside the overall pattern b. Least squares regression is not resistant to the presence of outliers c. An outlier can pull the fit of the line toward it d. We should detect and remove outliers before estimating the parameters

d. We should detect and remove outliers before estimating the parameters

Linear regressions assume linearity of the relationship between dependent and independent variables. Which of the following is not true about this assumption: a. In most cases, a linear relationship requires the dependent variable to be continuous b. A simple method to check this assumption is to examine a scatter plot between the predictors and the target c. If the dependent variable is a dummy, using linear regressions will be mathematically wrong d. none

d. none

Which of the following is not true about the Lack of multicollinearity assumption of linear regressions: a. When this assumption is violated, the parameters will be non-identifiable b. This assumption will be violated when having two or more control variables perfectly correlated with your independent variable of interest c. This assumption will be violated if there is too little data available compared to the number of parameters to be estimated d. none

d. none

Model specification errors

mistakes or shortcomings in the specification of the functional form, variables, or parameters of a statistical model, which can lead to biased or inefficient estimates of the model parameters and inaccurate or unreliable predictions.

Multicollinearity

occurs when two or more predictor variables in a regression model are highly correlated, which can cause instability in the model coefficients and make it difficult to determine the individual effects of each variable on the outcome.
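
A minimal sketch of the extreme case, perfect multicollinearity, assuming two hypothetical predictors `x1` and `x2` where `x2` is exactly twice `x1`. The helper name `centered_sums` and the data are illustrative only. When the correlation is exactly 1, the determinant of the centered cross-product matrix is zero, so the least-squares equations have no unique solution and the individual coefficients cannot be separated.

```python
import math

def centered_sums(a, b):
    """Return the sum of products of deviations from the mean for a and b."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))

# Hypothetical predictors: x2 is exactly 2 * x1, so they are perfectly correlated.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 4.0, 6.0, 8.0]

s11 = centered_sums(x1, x1)  # variation in x1
s22 = centered_sums(x2, x2)  # variation in x2
s12 = centered_sums(x1, x2)  # co-variation between x1 and x2

# Pearson correlation between the two predictors.
correlation = s12 / math.sqrt(s11 * s22)

# Determinant of the 2x2 centered cross-product matrix.
# A value of zero means the matrix cannot be inverted.
determinant = s11 * s22 - s12 ** 2

print("correlation:", correlation)
print("determinant:", determinant)
```

With highly but not perfectly correlated predictors the determinant is small rather than zero, which is what makes the estimated coefficients unstable.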

reverse causality

the situation in which the apparent "cause" is actually the "effect". Y causes X

Linear regression model

y=b0+b1x+e

R squared

A measure of goodness of fit of the estimated regression: the share of the variation in the dependent variable that is explained by the regressors.

What is Data Mining?

A process that uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful patterns from large sets of data. These patterns can be in the form of business rules, affinities, correlations, trends, or prediction models

Which one of the following is correct: a. Linear regressions estimate coefficients with MLE, while logit models use OLS b. Logit models can fit data better than linear regression c. The dependent variable for logit models can only be a dummy. d. None of the above is correct

d. None of the above is correct

Why do we use Logistic Regression?

Although binary dependent variable models can be estimated by OLS, OLS is not the preferred method of estimation for such models because of two limitations: 1) The estimated probabilities from the linear probability model (LPM) do not necessarily lie within the bounds of 0 and 1. 2) Linear probability models assume that the probability of a positive response increases linearly with the level of the explanatory variable, which is counterintuitive.
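
The first limitation can be illustrated with a small sketch. The data below are made up for illustration, and the helper names `fit_line` and `predict` are hypothetical. A straight line is fitted by least squares to a 0/1 outcome, and the fitted values fall below 0 and above 1.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data: the outcome is a dummy that switches from 0 to 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

b0, b1 = fit_line(x, y)

def predict(value):
    """Predicted 'probability' from the linear probability model."""
    return b0 + b1 * value

# The fitted line is not restricted to the interval [0, 1].
for value in (0.0, 1.0, 6.0, 8.0):
    print(value, round(predict(value), 3))
```

In this example even points inside the sample range (x = 1 and x = 6) receive fitted values outside the unit interval.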

Modeling

Apply data mining techniques to the data. No universally best methods or algorithms for data mining tasks.

Evaluation

Assess the data mining results and evaluate if they are valid and reliable before moving on.

Supervised data mining

By actual outcome. Operates under supervision by being provided with the actual outcome for each of the training examples. You use the information provided to you by events in the past. EX: find groups of customers who are more likely to cancel their orders

Data Preparation

Collect and clean data. Manipulate and convert data into forms that yield better results

How do we select variables?

Decisions about the regressors must weigh issues of omitted variable bias, data availability, data quality, and, most importantly, economic theory and the nature of the substantive questions being addressed.

Logistic Regression

Designed for binary dependent variables

Association Rule Mining

Find out which events or items go together. EX: those who bought this, also bought this.
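
A minimal sketch of how "go together" can be quantified, assuming the two standard measures support and confidence, which this set does not define. The transactions and item names are hypothetical.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Among baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

# Rule "those who bought bread also bought milk".
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print("support:", rule_support)
print("confidence:", rule_confidence)
```

Here bread and milk appear together in 3 of 5 baskets, and 3 of the 4 baskets with bread also contain milk.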

Disadvantages of R squared

It is a non-decreasing function of the number of regressors: if you add a variable to the model, the R squared value never goes down, even when the variable is irrelevant. To avoid this, we use adjusted R squared, which penalizes additional regressors.
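
A short sketch of the two measures, assuming the usual textbook formulas: R squared = 1 - SS_res / SS_tot, and adjusted R squared = 1 - (1 - R squared)(n - 1)/(n - k - 1), where n is the number of observations and k the number of regressors. The function names and the numbers in the example are illustrative only.

```python
def r_squared(y, y_hat):
    """Share of the variation in y explained by the fitted values y_hat."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    """Penalize R squared for the number of regressors k, given n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Hypothetical comparison: the same R squared reached with 3 versus 10 regressors.
few = adjusted_r_squared(0.80, n=50, k=3)
many = adjusted_r_squared(0.80, n=50, k=10)
print(round(few, 4), round(many, 4))
```

For the same R squared, the model with more regressors gets the lower adjusted value.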

What is linear regression?

Linear regression is a statistical technique where the score of a variable Y is predicted from the score of a second variable X. X is referred to as the explanatory variable and Y as the dependent variable.

What is the key term for logistic regression when determining probability?

Log odds

Logistic Regression Model example

log(P(clickonad)/(1-P(clickonad))) = B0 + B1*Male
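
A minimal sketch of how the log odds in this model map back to a probability. The coefficient values `B0` and `B1` are hypothetical, chosen only for illustration, and `click_probability` is an illustrative helper name.

```python
import math

def logistic(z):
    """Inverse of the log-odds transform: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, not estimated from any data.
B0 = -2.0
B1 = 0.5

def click_probability(male):
    """P(click on ad) implied by log odds = B0 + B1 * Male, with Male in {0, 1}."""
    log_odds = B0 + B1 * male
    return logistic(log_odds)

for male in (0, 1):
    p = click_probability(male)
    # Recover the log odds from the probability as a consistency check.
    print(male, round(p, 4), round(math.log(p / (1.0 - p)), 4))
```

Under these assumed values, B1 is the difference in log odds between the Male = 1 and Male = 0 groups, not the difference in probability.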

Benefits of using logistic regression to estimate class probability

The logistic function is mathematically correct for modeling a probability, whereas linear regression can only give an approximation of the truth. The predicted value of y ranges from 0 to 1.

Unsupervised data mining

No actual outcome. Algorithms are left to their own devices to discover and present interesting patterns in the data. EX: Categorize customers into different groups based on similarity.

Correlation does not always imply causation because...

Omitted Variable Bias. Reverse Causality

How do we find the best fit line?

Ordinary Least Squares (OLS)
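
A minimal sketch of OLS for the one-regressor model y = b0 + b1x + e, using the closed-form solution that minimizes the sum of squared residuals. The function name `ols_fit` and the data are hypothetical; the data were generated exactly from y = 1 + 2x so the result is easy to check.

```python
def ols_fit(x, y):
    """Return (b0, b1) that minimize the sum of squared residuals of y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data lying exactly on the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

b0, b1 = ols_fit(x, y)
print("intercept:", b0, "slope:", b1)
```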

Major problems of using linear regression to estimate class probability

Predicted value of y ranges from negative infinity to infinity, but a probability should range from 0 to 1.

Why do we use data mining?

Raw data is essentially worthless. We need techniques to automatically extract information from it. Information refers to patterns in the data.

Problems of data mining

Real data is imperfect. Many patterns will be unimportant. Anything discovered will be inexact. Algorithms need to be robust enough to cope with imperfect data.

Regression

Statistical method that allows you to examine the relationship between two or more variables of interest.

Deployment

The results are put into real use in order to realize return on investment. Often return to business understanding phase to improve solution. Results are not always deployed.

Omitted Variable Bias

There is a variable that is not among the explanatory or dependent variables (X and Y) in a study, and yet may cause changes in both the dependent and independent variables.
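
A small constructed example of the idea, with made-up numbers. Here a hypothetical variable `z` drives both `x` and `y`, while `x` has no effect on `y` at all. Regressing `y` on `x` alone still produces a large slope, because `x` is standing in for the omitted `z`.

```python
def slope(x, y):
    """Least-squares slope from regressing y on x (with an intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical omitted variable.
z = [0.0, 1.0, 2.0, 3.0, 4.0]

# x is z plus some unrelated variation; y depends on z only (y = 3 * z).
x = [0.5, 0.5, 2.5, 2.5, 4.0]
y = [3.0 * zi for zi in z]

# The true effect of x on y is zero by construction,
# yet the regression of y on x alone reports a clearly nonzero slope.
naive_slope = slope(x, y)
print("slope of y on x, omitting z:", naive_slope)
```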

Business Understanding

Understand the business context. Cast a business problem as one or more data science problems.

Data Understanding

Understand the strengths and limitations of the data.

Clustering

Used to group similar items together. Based on similarities between objects, or alternatively distance between objects. EX: detect insurance or credit card fraud, optimize goods delivery by finding the optimal number of launch locations.

How do we know which variables are statistically significant?

We look at the p value. If the p value is less than 0.1, we say the variable is significant (at the 10% level).

Decision tree

a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility

Autocorrelation

a statistical concept that occurs when the values of a variable at different points in time or space are correlated with each other, which can violate the assumption of independence and cause problems with statistical analysis.

Heteroscedasticity

a term used in regression analysis to describe a situation where the variance of the errors is not constant across all values of the predictor variable, which violates one of the assumptions of regression analysis. The OLS coefficient estimates remain unbiased but become inefficient, and the usual standard errors, and therefore the significance tests, are biased.

If the odds of one event increase, it means that: a. This event has a higher probability to happen b. This event has a lower probability to happen c. The probability could be either higher or lower d. The change in probability depends on the value of the independent variables

a. This event has a higher probability to happen
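
The reason is that odds and probability are tied together by odds = p / (1 - p), which is equivalent to p = odds / (1 + odds), so one always rises with the other. A short sketch, with illustrative function names:

```python
def odds_from_probability(p):
    """Odds of an event with probability p (0 <= p < 1)."""
    return p / (1.0 - p)

def probability_from_odds(odds):
    """Probability of an event with the given odds (odds >= 0)."""
    return odds / (1.0 + odds)

# As the odds increase, the implied probability increases too.
for odds in (0.25, 1.0, 3.0, 9.0):
    print(odds, probability_from_odds(odds))
```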

Which of the following about data mining is correct? a. When making business decisions, we should replace intuition with data mining b. Data is a complement to past experience c. The first step of data mining is data preparation d. The best way to evaluate your model is to test it in deployment

b. Data is a complement to past experience

Which of the following is not true about regressions: a. Regression analysis is a statistical method that allows you to examine the relationship between variables b. Explanatory variables are termed as the dependent variables and the variables to be explained are termed as the independent variables c. The econometric problem is to estimate this slope—that is, to estimate the effect on Y of a unit change in X—using a sample of data on these two variables d. B and C

b. Explanatory variables are termed as the dependent variables and the variables to be explained are termed as the independent variables

Which of the following is the least likely to be a potential application of regressions: a. Discover distinct customer groups who are most likely to churn b. Finding groups of similar firms based on profitability, growth rate, market size, products, etc. c. Predict which investment product would be popular among customers in a specific year d. Detect whether a person fell in a video

b. Finding groups of similar firms based on profitability, growth rate, market size, products, etc.

Which of the following is true about how to interpret the R2 and the adjusted R2 in practice: a. A high R2 or an adjusted R2 means that the regressors are a true cause of the dependent variable b. A high R2 or an adjusted R2 means that you have the most appropriate set of regressors c. An R2 or an adjusted R2 near 1 means that the regressors are good at predicting the values of the dependent variable in the sample d. None

c. An R2 or an adjusted R2 near 1 means that the regressors are good at predicting the values of the dependent variable in the sample

For which problem would data mining analysis be least appropriate: a. Predicting whether a customer will subscribe to Netflix next year b. Predicting a patient's possible future symptoms given the patient's past history of symptoms c. Identifying the best performing salesperson d. Identifying groups of houses based on their house style, value, and location

c. Identifying the best performing salesperson

Which of the following is the last thing to be weighed when selecting variables: a. Omitted variable bias b. Data availability and data quality c. R2 or an adjusted R2 d. Personal experience

c. R2 or an adjusted R2

Why does correlation not always imply causation: a. There could be a variable that is not among the explanatory or response variables in a study, and yet may cause changes in both the dependent and independent variables b. The dependent and independent variables are confounded when their effects on a response variable cannot be distinguished from each other c. Models may not fit the data well d. All Except c

d. All Except c

What is the major problem with using a linear model to estimate class probability: a. It is only mathematically correct b. Violates at least one assumption of linear regressions c. The model will not give any results d. Both a and b

d. Both a and b

What could be the next step of model evaluation in the data mining process? a. Finalizing models b. Deployment c. Business understanding d. Business understanding or deployment

d. Business understanding or deployment

Which of the following is the least likely to be a supervised data mining task: a. Predict how Uber entry may influence city traffic b. Predicting whether a customer will return a product c. Find groups of customers who are more likely to cancel their orders d. Identifying groups of similar houses based on their house style, value, and location

d. Identifying groups of similar houses based on their house style, value, and location

