BADM Test 1, 2, 3


8. What is a prediction problem?

"Predicting future number" EX: Estimate a home's market value

Partition the data into training/validation sets for prediction; we want 30% of the data in the validation set. Set the random state to 5.

# Code for data partition (test_size = size of the validation sample, here 30%)
from sklearn.model_selection import train_test_split
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.3, random_state=5)

If we have p predictors in the data, in exhaustive search we need to evaluate __________ possible models.

(2^p) - 1. Example: 10 features gives (2^10) - 1 = 1023 possible models.

In explanatory modeling goodness of fit can be assessed using ________.

R^2 (a performance measure for goodness of fit). The higher R^2, the better the model explains the outcome variable.
- R^2 is the proportion of the variance in the outcome variable that is predictable from the predictors, i.e., how well the model/predictors explain the outcome variable.
- R^2 = 1 implies the model perfectly fits the data.
- 0 < R^2 < 1 means the model explains the data partially but not perfectly. Why? (1) Missing variables in the data (we may need to collect more features); (2) a non-linear relationship between the predictors and the outcome (by running MLR we implicitly assume a linear relationship).
- R^2 = 0 implies the model does not explain the data any better than a naïve predictor.

In a dataset, rows and columns correspond to ________ and _________ respectively.

Records and variables. Rows correspond to records (cases, samples); columns correspond to variables (features, predictors).

18. What are the two ways of normalizing data?

- "Standardizing": subtract the mean from each value and divide by standard deviation - rescale each variable, subtract min and divide by max-min for each variable Brings data to [0,1] scale

performance evaluation of a confusion matrix

           Predicted 1   Predicted 0
Actual 1   TP            FN
Actual 0   FP            TN

Two measures that can be used to assess predictive performance are ___________ .

1. Mean Error 2. Error Percentiles

Four assumptions in linear regression are _____________ .

1. The noise (error) follows a normal distribution with mean 0.
2. There is a linear relationship between the predictors and the outcome variable.
3. The records are independent of each other.
4. The variance of the error does not depend on the values of the predictors (homoskedasticity).

20. Gertrude from the Economics section has requested that you create predictive model to estimate which of four levels of income a potential customer will have. She has provided you with 11 variables, so you will need _________ samples to achieve acceptable accuracy.

110 "10 variables for every 1 predictor variable" Prediction problem

14. If you have 20 predictors and 2 classes, then you'll need a minimum of __________________ cases.

6 x m x p, where m = # of classes and p = # of predictors: 6 x 2 x 20 = 240. (Classification problem.)

Suppose we want to investigate whether advertising on TV, radio, or newspaper has any effect on sales. We run a linear regression model and obtain the results below. Null hypothesis: β_i = 0 (there is no association between variable i and the outcome). Alternative hypothesis: β_i ≠ 0 (there is a significant association between variable i and the outcome). What does this data tell us?

Variable    Coefficient   Std. Error   t-stat   p-value
Intercept   2.9           0.31         9.42     <.0001
TV          0.04          0.0014       32.81    <.0001
Newspaper   -0.01         0.005        -0.18    .8599

Because p < .05 implies rejection of the null hypothesis: there is a significant positive association between TV advertising and sales (and the intercept is significant); there is no significant association between newspaper advertising and sales.

Types of classification?

Binary: outcome variable has two distinct classes Multiclass: outcome variable has more than TWO distinct classes

19. The Finance group at Acme Corporation has gathered every possible economic indicator, both regionally, nationally, and globally, in order to predict where to execute financial engineering for maximum returns. There are 29,788 variables in the data, but there are only 2,649 records. Before you begin the process of model building, you must first engage in _____________.

Dimension Reduction (reduce variables)

Actual outcome (y) = 2 Predicted outcome (y^) = 3 What is Error?

Error for record i: e_i = y_i - ŷ_i = 2 - 3 = -1. Error = -1.

How to determine Decision Threshold?

A record whose propensity is at or above the threshold t is classified as positive; everything below it is classified as negative. Method 1: a popular choice is t = 0.5 as the decision threshold. Method 2: try all possible thresholds and choose the one maximizing accuracy.
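A minimal sketch of Method 2, assuming propensities p and actual 0/1 labels y as NumPy arrays (both arrays are hypothetical toy data):

import numpy as np

p = np.array([0.1, 0.4, 0.35, 0.8])  # estimated propensities (toy data)
y = np.array([0, 0, 1, 1])           # actual classes (toy data)

best_t, best_acc = 0.5, 0.0
for t in np.unique(p):  # each distinct propensity is a candidate threshold
    acc = np.mean((p >= t).astype(int) == y)  # accuracy at threshold t
    if acc > best_acc:
        best_t, best_acc = t, acc
print(best_t, best_acc)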

Among the methods of feature selection and feature extraction, which is supervised/unsupervised learning?

Feature extraction (unsupervised): e.g., PCA.
Feature selection (supervised):
- Exhaustive search: create different models using all possible subsets of features; evaluate the predictive performance of each model using RMSE.
- Forward selection: Step 1: start with 0 features. Step 2: add the most important feature (e.g., the one giving the largest decrease in RMSE); repeat. Step 3: stop when the decrease in error (RMSE) is marginal.
- Backward elimination: Step 1: start with all features. Step 2: eliminate the least important feature. Step 3: stop when the error (RMSE) increases significantly.
Using error measures RMSE and MAE, the lower the error, the better the performance.
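A hedged sketch of forward selection using scikit-learn's SequentialFeatureSelector (assuming train_X and train_y from the partition above; the stopping point n_features_to_select=5 is hypothetical, and RMSE enters via the negative-RMSE scorer):

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Forward selection: start with 0 features and greedily add the feature
# that most improves cross-validated (negative) RMSE
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=5,  # hypothetical stopping point
                                direction='forward',     # use 'backward' for elimination
                                scoring='neg_root_mean_squared_error')
sfs.fit(train_X, train_y)
print(train_X.columns[sfs.get_support()])  # the selected features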

13. What are the main types of categorical variables and how can we code them?

Nominal (unordered): coded as 0/1 dummy variables. Ordinal (ordered): in order and coded as integers preserving rank. Example: January = 1, September = 9.

Two types of categorical variables are ____________ and ____________.

Nominal and ordinal. Nominal (unordered), e.g., male/female: convert to dummy variables (0/1). Ordinal (ordered), e.g., low/high: convert to integers, preserving rank.

Odds of belonging to positive class, coin flip, 8H, 2T

Odds = p / (1 - p). Probability of heads = 8/10 = .8; probability of tails = 2/10 = .2. Odds = .8 / (1 - .8) = 4.

What is Unsupervised learning? and what types are there?

Segment data into meaningful segments and detect patterns WITHOUT a given outcome variable. Types: association rules (e.g., Amazon recommendations, not personalized), collaborative filtering (based on previous choices, e.g., Netflix), data exploration, visualization.

16. One way of dealing with missing values is by __________________ .

Simple: replace the missing value with the variable's mean across all records (mean imputation). Advanced: use more sophisticated imputation methods.

Area Under ROC Curve (AUC)

● Another (threshold-free) measure of model accuracy: AUC.
○ AUC tells us how much the model is capable of distinguishing between the two classes.
○ Better performance corresponds to curves closer to the top-left corner (the higher the AUC, the better the performance).
● AUC = 1 means perfect prediction.

Naive Benchmark

● The naïve rule for classification is to classify each record as a member of the majority class. ● Relies solely on the y information and excludes any additional predictor information.

A key shortcoming of using Mean Error to gauge the predictive accuracy of a model is ___________.

Mean Error only tells us whether the predictor is biased; it does not assess predictive accuracy, because positive and negative errors cancel out. Therefore, we do not use it to assess performance, only to check whether there is systematic bias in predictions. Mean Error = 0 means the model is unbiased, but not necessarily accurate.

Display validation set distribution

import matplotlib.pyplot as plt  # if not already imported

plt.tight_layout()
plt.show()

If for Variable i, Beta_i > 0, then what is the association between variable i and the outcome variable?

positive

Suppose you have no predictors and asked to give an estimate of the outcome variable. What would be your predictor?

Naïve predictor: the mean of all values of the outcome variable.

Calculate the residuals on the validation set

wine_lm_pred = wine_lm.predict(valid_X)
all_residuals = valid_y - wine_lm_pred
plt.hist(all_residuals, bins=25, color='blue')

Compare the performance of naive benchmark and model prediction on the validation set

from dmba import regressionSummary  # regressionSummary comes from the course's dmba package

y_naive_valid = [valid_y.mean()] * len(valid_y)
print('Performance of naive model')
regressionSummary(valid_y, y_naive_valid)
print('\nPerformance of MLR')
regressionSummary(valid_y, wine_lm.predict(valid_X))

10. What are main types of supervised and unsupervised learning?

Supervised: prediction (regression) and classification Unsupervised: Association rules (Amazon preferences, not personalized), Collaborative filtering (based on previous choices, Netflix)

15. Identification of outliers is followed by __________________ .

- If the number of outliers is small, treat them as missing data (imputation).
- If it is large, you might want to remove the variables containing many outliers.

5. A predictor may be best defined as a(n) __________________.

A variable, usually denoted by X, used as the input in a predictive model other names: feature, variable

Logistic Regression

Assume logit(p) is a linear function of the predictors. ● Similar to MLR, estimate beta coefficients that minimize misclassification error. ● We can use the estimated beta coefficients to predict the probability that a new sample belongs to the positive class using the formula p = 1 / (1 + e^-(β0 + β1x1 + ... + βpxp)).
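A minimal sketch with scikit-learn on toy data (the arrays are hypothetical); predict_proba applies the formula above to return class probabilities:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # one predictor (toy data)
y = np.array([0, 0, 1, 1])                  # binary outcome (toy data)

logit_model = LogisticRegression()
logit_model.fit(X, y)
print(logit_model.intercept_, logit_model.coef_)  # estimated betas
print(logit_model.predict_proba([[2.5]]))         # [P(class 0), P(class 1)]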

Recursive Partitioning with Gini Index

At each successive stage, calculate the Gini Index for all possible splits in all variables. • Choose the split with the smallest Gini Index (i.e., the one that reduces impurity the most). • Chosen split points become nodes on the tree.

Fill in missing values using the mean of the remaining values

Command: .fillna(wine_df.mean())
wine_df = wine_df.fillna(wine_df.mean())

Load the data located at ../resource/lib/publicdata/winequality-red.csv and name it wine_df

Command to load data: pd.read_csv("location of file")
wine_df = pd.read_csv('../resource/lib/publicdata/winequality-red.csv', sep=';')

The key difference between feature elimination and feature extraction algorithms is _________.

Instead of removing features (feature elimination), we use weighted average of features to create new features (feature extraction)

Logistic Regression- Interpreting the Model

Interpretation of coefficients in logistic regression is the same as in linear regression. ● If β_i > 0 for variable i, then it is positively associated with the outcome (upward slope): the higher the value of variable i, the more likely the record belongs to the positive class. ● If β_i < 0 for variable i, then it is negatively associated with the outcome (downward slope): the higher the value of variable i, the less likely the record belongs to the positive class.

equations for performance evaluation of a confusion matrix

N = # of observations = TP + FP + FN + TN
Accuracy = (TP + TN) / N
Error = (FP + FN) / N = 1 - Accuracy
Sensitivity = TPR = TP / AP = TP / (TP + FN)
Specificity = TNR = TN / AN = TN / (TN + FP)
False Negative Rate = FN / (TP + FN) = 1 - Sensitivity
False Positive Rate = FP / (TN + FP) = 1 - Specificity
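A small sketch computing these measures directly from the four cells of a confusion matrix (the counts are hypothetical):

TP, FN, FP, TN = 40, 10, 5, 45  # hypothetical counts

N = TP + FP + FN + TN
accuracy = (TP + TN) / N
error = (FP + FN) / N         # = 1 - accuracy
sensitivity = TP / (TP + FN)  # true positive rate
specificity = TN / (TN + FP)  # true negative rate
fnr = FN / (TP + FN)          # = 1 - sensitivity
fpr = FP / (TN + FP)          # = 1 - specificity
print(accuracy, error, sensitivity, specificity)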

21. What are the three types of basic plots?

Line graphs: can be used to plot time series; show how a variable changes over time.
Bar charts: can be used to compare a single statistic across groups.
Scatter plots: show the relationship between two numerical variables.

Odds vs Log odds

Log odds (logit) = log(p / (1 - p)). Odds: 0 to infinity. Logit: -∞ to +∞. There is a one-to-one relationship between p and the logit of p. ● Estimating logit(p) is equivalent to estimating p.
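A quick sketch of the one-to-one p ↔ logit correspondence (the probability value is hypothetical):

import numpy as np

p = 0.8
odds = p / (1 - p)                 # 4.0; odds range over (0, infinity)
logit = np.log(odds)               # ~1.386; logit ranges over (-inf, +inf)
p_back = 1 / (1 + np.exp(-logit))  # inverting the logit recovers p = 0.8
print(odds, logit, p_back)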

7. What is a classification problem?

A supervised problem where the outcome is a category and, in the available training data, the outcome value is known. Example: will a client subscribe (Y/N) to a term deposit (outcome variable)?

Complexity of decision trees

One reason for overfitting: more complex models increase the chances of overfitting. • The complexity of a decision tree increases with the number of decision nodes (i.e., # of splits). [Graph: error vs. tree complexity; left side = underfitting, the bow of the curve = sweet spot, right side = overfitting.] • To avoid overfitting we can limit the growth of the tree (a fully fit tree will usually produce perfect prediction on the training set but low performance on the validation set).
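One common way to limit tree growth in scikit-learn is the max_depth parameter; a minimal sketch on toy data (the arrays and the depth value are hypothetical):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1], [2], [3], [4], [5], [6]])  # toy predictor
y = np.array([0, 0, 0, 1, 1, 1])              # toy binary outcome

# Capping the depth limits the number of decision nodes (splits),
# reducing the chance of overfitting
tree = DecisionTreeClassifier(max_depth=2, random_state=1)
tree.fit(X, y)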

___________ is a method for feature extraction.

PCA

Decision Trees: Recursive Partitioning

- Pick one of the predictor variables, say Income in our case.
- Pick a value of Income, say s, that divides the training data into two portions (nodes).
- Measure how "pure" or homogeneous each of the resulting nodes is ("pure" = containing records of mostly one class).
- Try all predictors, and all possible split values s, to maximize purity.
- After you get a maximum-purity split, repeat the process for a second split (on any variable), then a third split, etc., until a stopping condition is met.

Prevalence of Majority class when four samples = 0, and two samples = 1?

Pm = # of records in the majority class / total # of records = 4/6 ≈ 0.67; the majority class is the negative class (0).

What is predictive modeling and explanatory modeling? What are the differences?

Predictive: the purpose is to predict unseen data accurately.
- Measured by predictive accuracy, mainly Mean Error and RMSE (MAE, MAPE also used).
- Overfitting is a problem, so data is split into training and validation sets: training data is used to fit the model, and performance is assessed using the validation set.
- Focused on the predictions (Y) in MLR.
Explanatory: we are interested in predicting averages and understanding the association between the outcome variable and the other variables.
- Performance measured by goodness of fit (R^2).
- Focused on the coefficients (β).
- The entire dataset is used, to maximize information.

How does PCA work?

Principal component analysis (PCA) uses weighted averages of the original features to create new features with minimal information loss (minimal loss of variance). Each principal component is a weighted average of the original variables. It is a dimension reduction method. "PCA can be thought of as looking at the same data from a different angle!"
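A sketch with scikit-learn's PCA (assuming the numeric predictor matrix X defined elsewhere in these cards; standardizing first, since PCA is sensitive to scale):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)  # standardize before PCA
pca = PCA()
pca.fit(X_std)
print(pca.explained_variance_ratio_)  # share of total variance captured by each PC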

Decision trees pros vs cons

Pros:
• Easily understandable classification rules.
• Flexible: can be used both for classification and regression.
• Non-linear (as opposed to MLR or logistic regression), so it does not suffer from multicollinearity of predictors (we can use drop_first=False in the dummy conversion).
Cons:
• Unstable: with a slightly different training set, the resulting decision tree can be significantly different.
• A fully fit, complex tree will lead to overfitting.

random forest algorithm

Random forest is an ensemble learning algorithm. • Instead of making predictions using a single tree, we fit many decision trees (using different depths and/or partitions of the data) and combine the predictions from those trees. • Advantage: more robust performance and less chance of overfitting (the wisdom of the crowds in data analytics). • Disadvantage: difficult to interpret many trees simultaneously.
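A sketch of a random forest for the wine regression task (assuming train_X, train_y, and valid_X from the partition above; the hyperparameter values are hypothetical):

from sklearn.ensemble import RandomForestRegressor

# An ensemble of decision trees: each tree is fit on a bootstrap sample
# of the data, and predictions are averaged across trees
rf = RandomForestRegressor(n_estimators=200, random_state=5)
rf.fit(train_X, train_y)
rf_pred = rf.predict(valid_X)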

Multicollinearity occurs when ___________.

Predictors in a regression model are correlated. When predictors are highly correlated, it becomes difficult for the model to estimate the true relationship between each predictor and the response variable independently, because the predictors tend to change in unison.

6. What are various terms used for observation?

Records, Case, Sample Ex: House 1, House 2, House 3

Which method(s) can be used for feature elimination?

Regularization through LASSO: trying to find a trade-off between prediction error and the number of variables. Regularization adds another term to the cost function, which you can think of as the cost of having extra features in the model. LASSO (Least Absolute Shrinkage and Selection Operator) defines a penalty term that shrinks some regression coefficients to zero. Cost = C_E + C_V: with few features, C_E (error cost) is high but C_V (variable cost) is low; with many features, C_E is low but C_V is high.
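A sketch with scikit-learn's Lasso (assuming train_X and train_y from above; the penalty value alpha=0.1 is hypothetical):

from sklearn.linear_model import Lasso

# The alpha penalty shrinks some coefficients exactly to zero,
# effectively removing those features from the model
lasso = Lasso(alpha=0.1)
lasso.fit(train_X, train_y)
print(lasso.coef_)  # zero entries correspond to eliminated features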

Decision Trees

Series of decision rules (Yes/No questions) to divide data into subgroups represented in a tree structure • Easy to understand and interpret • Flexible: Works for classification and regression problems

Two ways of normalizing the data are ____________ .

Standardizing and rescaling.
- Standardizing: subtract the mean from each value and divide by the standard deviation.
- Rescaling: for each variable, subtract the min and divide by (max - min); brings the data to a [0, 1] scale.

9. What is supervised and unsupervised learning?

Supervised learning: predicts an outcome variable, GIVEN an outcome variable in the training data. Unsupervised learning: finds associations and patterns, WITHOUT a given outcome variable.

How do we interpret variance results of PCA?

The higher the variance in data, the more information it contains, the better the model (assuming the data is not very noisy)

The dimensionality of the data is ___________________ .

The number of features (predictors, variables) in the data

print the intercept and beta coefficients of the model

print('intercept ', wine_lm.intercept_)
print(pd.DataFrame({'Predictor': X.columns, 'coefficient': wine_lm.coef_}))

What is Supervised Learning? and what are the two types?

There exists an outcome variable of interest that we want to predict. Two types: regression and classification.
Regression: numerical outcome. Example: predict the amount spent on a fraudulent transaction.
Classification: categorical outcome. Example: predict the probability that a transaction is fraudulent (probability of class membership), or predict whether or not a transaction is fraudulent (predicted class membership).

What is linear regression?

Used for making predictions in predictive and explanatory modeling.
Y = β0 + β1X1 + β2X2 + ... + βpXp + e
Y = outcome; β0 = intercept; β1...βp = beta coefficients; X1...Xp = predictors (β0 + β1X1 + ... + βpXp is the explained part of the data); e = error (noise, residual), the unexplained part of the data. Caution: avoid collinearity.

3. The four Vs of big data are __________________.

Volume: amount of data. Velocity: speed/flow rate of data. Variety: types of data. Veracity: trustworthiness/accuracy of the data (much of it generated organically by "us").

ROC (Receiver Operating Characteristics) Curve

We want to choose the threshold that gives the best trade-off between sensitivity and specificity. ● Sensitivity on the y-axis; 1 - specificity (false positive rate) on the x-axis. ● Captures all cutoffs (t) simultaneously. ● Trade-off between FN and FP: ○ if t is large, false positives ↓ and false negatives ↑; ○ if t is small, false positives ↑ and false negatives ↓.
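A sketch plotting the ROC curve and computing AUC with scikit-learn on toy data (the label and propensity arrays are hypothetical):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1])      # actual classes (toy data)
p = np.array([0.1, 0.4, 0.35, 0.8])  # estimated propensities (toy data)

fpr, tpr, thresholds = roc_curve(y_true, p)  # one point per cutoff t
plt.plot(fpr, tpr)  # sensitivity vs. 1 - specificity
plt.xlabel('1 - specificity (FPR)')
plt.ylabel('sensitivity (TPR)')
plt.show()
print(roc_auc_score(y_true, p))  # area under this curve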

How to mathematically quantify Purity with Gini index?

We use the Gini Index to quantify the purity/homogeneity of a split: I(A) = 1 - Σ p_k², where p_k is the proportion of class k records in node A. • Takes values in [0, 0.5] for two classes. • The lower the Gini Index, the purer/more homogeneous the split. • I(A) = 0 when all records belong to the same class. • I(A) = 0.5 when all classes are equally represented in the split.
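A small sketch of the Gini calculation for a binary node (the class counts are hypothetical):

def gini(counts):
    """Gini index of a node given its class counts, e.g. [n_class0, n_class1]."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(gini([10, 0]))  # 0.0 -> all records in one class (pure)
print(gini([5, 5]))   # 0.5 -> classes equally represented (least pure)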

Regression Trees

What you need to know is that the ideas of classification trees can be generalized; hence, decision trees can be used for regression problems with minor modifications: 1. The definition of terminal nodes 2. Calculation of Impurity measures 3. Performance evaluation (RMSE, MAE)

convert categorical variables into dummy variables

X = pd.get_dummies(X, drop_first=True)  # drop one dummy per categorical variable to avoid collinearity

Define predictors and the outcome variable. We use "quality" as the outcome variable and all the other variables as predictors

X = wine_df.drop(columns=['quality'])
y = wine_df['quality']
We drop "quality" from the predictor matrix X because it is the outcome variable.

Print the name of predictors

X.columns

In Lasso, by increasing the penalty parameter lambda, we are shrinking _____________.

More beta coefficients toward zero, which removes features from the model (feature selection). LASSO penalizes regression coefficients and shrinks more of them to zero as the penalty increases.

2. Machine learning can be defined as __________________.

algorithms that learn directly from the data Ex. Decision Trees, Linear Regression

Propensities

Propensities are estimated probabilities of class membership. To assign classes, most algorithms first estimate the probability that a record belongs to each of the classes.

We use _________ to visualize the entire distribution of a variable

boxplots and histograms

We need to dummy code _____________ .

Categorical variables. A categorical variable with m categories becomes m - 1 dummy (0/1) variables in MLR: always drop one variable!

A numerical variable can be defined as ___________.

continuous or integer continuous = (1.2, 2.4, 5.6) integer = (1, 3, 5, 7)

11. The performance of data mining algorithms can be improved by limiting variables and by__________________.

data partitioning

Mean error gives us information ____________ .

On whether the predictor is biased, i.e., whether the predictions are, on average, over- or under-predicting the outcome variable. With error = y - ŷ: Error = -1 means overestimated by 1; Error = 1 means underestimated by 1; Error = 0 means unbiased on average.

Beta coefficients represent ______________ .

Estimated coefficients that minimize error (cost). Beta coefficients = β in MLR. In business problems, making a prediction error is costly, so we can think of error as cost; denote the cost from prediction error as C_E = RMSE². Then β̂ = argmin_β RMSE(β)² = argmin_β C_E.

What are the subset selections?

exhaustive search, forward selection, backward elimination

Decision tree structure

Root node: the first node. Branches: the connecting lines. Terminal nodes: the last nodes, without branches.

PCA is more effective _______________ .

for use with numerical variables

12. When is oversampling rare events useful?

When identifying rare events such as outliers (e.g., fraudulent behavior).

1. The term "business analytics" may be best defined as _____________________.

The practice of bringing quantitative data to bear on choices.

The higher the error , the ______ the model.

less accurate

If we have p features in the data, we will have __________ PCs

p PCs. Example: 5 features gives 5 PCs.

The key limitation of exhaustive search is _________.

That computation is slow and costly. In both exhaustive search and iterative search methods, we build multiple models to find the best set of features, which is slow and computationally expensive.

In PCA, attribute weights quantify ___________

the contribution of variable i to the first principal component.

Naive predictor in regression is?

the mean of the outcome variable in the validation set

What is the P-Value and when do we reject the null hypothesis?

The probability of obtaining the observed data assuming the null hypothesis is true. If the p-value < .05, we reject the null. Example: null hypothesis: the coin flip is fair; alternative hypothesis: the coin flip isn't fair.

Scatter plots represent ___________ .

the relationship between two numerical variables

The main reason for doing data partitioning is _____________ .

To avoid overfitting. Training partition: used to develop the model. Validation partition: used to assess the predictive performance of each model. Test partition: used to assess the performance of the chosen model; useful if there is enough data and many models are being tested.

17. Why do we use data partitions in data mining?

To avoid overfitting. Training partition: used to develop the model. Validation partition: used to assess the predictive performance of each model. Test partition: used to assess the performance of the chosen model; useful if there is enough data and many models are being tested.

Output the number of rows and columns of prediction matrix for the training data

train_X.shape
Output: a tuple (rows, columns).

We estimate the beta coefficients of a MLR model using ________ data.

training Estimated beta coefficients are the ones that minimize the prediction error of the training set

If for Variable i, P_i < .05, then what is the association between variable i and the outcome variable?

Variable i is significantly associated with the outcome variable. This is because the p-value is less than .05, so we reject the null hypothesis, indicating there is an association.

4. A categorical variable may be defined as a __________________.

variable that takes on one of several fixed values (high, medium, low)

Output descriptive statistics of the data

wine_df.describe()

Print bottom 5 rows of wine_df

wine_df.tail(5)

Fit a linear regression model

wine_lm = LinearRegression() wine_lm.fit(train_X, train_y)

