JANG
How do we run the "Classification Trees" classification technique?
*Graphical representation of a set of rules for classifying observations into two or more groups *Pruning and requiring a minimum number of observations in each terminal node help prevent overfitting
How do we run the "Neural Networks" classification technique?
*Uses multiple layers of estimation so that it can capture complicated patterns *Mimics the operation of the human brain by receiving stimuli, then processing them with interconnected sets of neurons and determining a response
How do we run the "k Nearest Neighbors" classification technique?
*Look at the k nearest data points, determine which group the majority of them belong to, and assign the new point to that group *Find the k observations that are nearest to the new observation *Nearness is measured with Euclidean distance
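The majority-vote idea on this card can be sketched in a few lines of Python. This is a minimal illustration only: the function name `knn_classify` and the toy data are hypothetical, and ties are broken arbitrarily.

```python
import math
from collections import Counter

def knn_classify(train, new_point, k=3):
    """train is a list of (features, label) pairs; returns the majority label."""
    # Sort training observations by Euclidean distance to the new point.
    nearest = sorted(train, key=lambda row: math.dist(row[0], new_point))[:k]
    # Majority vote among the labels of the k nearest observations.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two features, two groups.
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
         ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B")]
prediction = knn_classify(train, (2.0, 2.0), k=3)
print(prediction)
```

An odd k is a common choice for two groups because it avoids tied votes.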
Based on the confusion matrix of training dataset analysis in DA, what is the model accuracy of positive predictions? (pic in pdf)
.813
Which of the following information is redundant when finding a solution to this problem?
0.5𝐷 + 1𝑊 ≤ 60
You throw a coin 10 times and get {THHTTTHTTT}. What is the "best guess" for probability of getting a tail, using the data?
0.7
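The "best guess" here is the sample proportion of tails, which is also the maximum likelihood estimate for a coin with unknown bias. A quick check in Python (variable names are illustrative):

```python
# Observed sequence from the card: 7 tails and 3 heads in 10 tosses.
tosses = "THHTTTHTTT"

# Sample proportion of tails.
p_tail = tosses.count("T") / len(tosses)
print(p_tail)
```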
What are the 5 types of classification techniques?
1. Discriminant Analysis 2. Logistic Regression 3. K Nearest Neighbor 4. Classification Trees 5. Neural Networks
What are the 7 steps in the data mining process?
1. Identify opportunity 2. Collect data 3. Explore, understand, and prepare data 4. Identify task and tools 5. Partition data 6. Build and evaluate models 7. Deploy models
The following linear programming problem has been written to plan the production of two products. The company wants to maximize its profits. 𝑋1 = number of product 1 produced in each batch; 𝑋2 = number of product 2 produced in each batch. MAX: 150𝑋1 + 250𝑋2 S.T. 2𝑋1 + 5𝑋2 ≤ 200 --> resource 1; 3𝑋1 + 7𝑋2 ≤ 175 --> resource 2; 𝑋1, 𝑋2 ≥ 0
3
[SA] Here is a poll result from Pew Research about people's use of online dating apps. What is the standard deviation of the sample? Write the answer in percentage points. (Say we use a z-score of 2 for the 95% confidence interval.) Today, 79% of online dating users ages 18 to 29 report ever using Tinder. Data in this report is drawn from the panel wave conducted from July 5 to 17, 2022. A total of 2,500 panelists responded out of 4,900 who were sampled. The margin of error at the 95% confidence interval is plus or minus 2 percentage points.
50%
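One way to reach this answer, assuming the usual relationship margin of error = z × (standard deviation / √n), is to solve for the standard deviation. The variable names below are illustrative.

```python
import math

z = 2.0                # z-score given in the question for 95% confidence
margin_of_error = 2.0  # in percentage points
n = 2500               # number of panelists who responded

# Rearranging margin_of_error = z * sd / sqrt(n) for sd.
sd = margin_of_error * math.sqrt(n) / z
print(sd)  # in percentage points
```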
Understand the predicted values for the new dataset
Apply the different techniques to the new dataset and have them predict the outcome for each trial. Take the validity of each technique into account when deciding whether to trust its output.
How do we run the "Discriminant Analysis" classification technique?
Assigns each observation to a group based on its distance from each group's centroid (distance can be calculated using 2 different methods)
What is the step we need to do right after collecting the dataset?
Explore the data
The following diagram shows the constraints for a LP model. Assume the point (0,0) satisfies constraint (B,J) but does not satisfy constraints (D,H) or (C,I). Which set of points on this diagram defines the feasible
I, F, G, J
The following questions (Q19-Q20) use the information below Sanderson Manufacturing produces ornate, decorative wood frame doors and windows. Each item produced goes through three manufacturing processes: cutting, sanding, and finishing. Each door produced requires 1 hour in cutting, 30 minutes in sanding, and 30 minutes in finishing. Each window requires 30 minutes in cutting, 45 minutes in sanding, and 1 hour in finishing. In the coming week Sanderson has 40 hours of cutting capacity available, 40 hours of sanding capacity, and 60 hours of finishing capacity. Assume all doors produced can be sold for a profit of $500 and all windows can be sold for a profit of $400. Let D represent Number of doors produced and W represent Number of windows produced.
IV. Objective function 500𝐷 + 400𝑊 ≤ 1,000
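The Sanderson problem can be checked numerically. The sketch below converts all times to hours and, as an assumption of this note rather than the course's method, enumerates the corner points of the feasible region instead of using a solver. The objective is maximized with no upper limit of its own, and the finishing constraint 0.5𝐷 + 1𝑊 ≤ 60 has slack at the optimum, which is consistent with the card that calls it redundant.

```python
from itertools import combinations

# Each constraint is (a, b, rhs), meaning a*D + b*W <= rhs, in hours.
constraints = {
    "cutting":   (1.0, 0.5, 40.0),
    "sanding":   (0.5, 0.75, 40.0),
    "finishing": (0.5, 1.0, 60.0),
}
# Boundary lines: the three resource lines plus the axes D = 0 and W = 0.
lines = list(constraints.values()) + [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

def feasible(d, w, tol=1e-9):
    """True if (d, w) is non-negative and satisfies every resource constraint."""
    return d >= -tol and w >= -tol and all(
        a * d + b * w <= rhs + tol for a, b, rhs in constraints.values())

best = None
for (a1, b1, c1), (a2, b2, c2) in combinations(lines, 2):
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        continue  # parallel lines have no single intersection point
    # Cramer's rule for the intersection of the two boundary lines.
    d = (c1 * b2 - c2 * b1) / det
    w = (a1 * c2 - a2 * c1) / det
    if feasible(d, w):
        profit = 500 * d + 400 * w
        if best is None or profit > best[0]:
            best = (profit, d, w)

profit, doors, windows = best
finishing_used = 0.5 * doors + 1.0 * windows
print(doors, windows, profit, finishing_used)
```

Corner-point enumeration only scales to very small problems; it is used here because it needs nothing beyond the standard library.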
The following linear programming problem has been written to plan the production of three products. 𝑋1, 𝑋2, 𝑋3 = the number of product 1, 2, 3 produced in each batch MAX: 20𝑋1 + 30𝑋2 + 60𝑋3 S.T. 2𝑋1 + 5𝑋2 ≤ 100 --> resource 1 10𝑋2 + 12𝑋3 ≤ 160 --> resource 2 𝑋1 − 3𝑋3 ≤ 15 --> resource 3 As a data analyst, you write the objective function and the constraints. Which of the following statements is incorrect?
The constraint regarding resource 2 is redundant.
Blue Ridge Hot Tubs manufactures and sells two models of hot tubs: the Aqua-Spa and the Hydro-Lux. Howie Jones, the owner and manager of the company, needs to decide how many of each type of hot tub to produce during his next production cycle. MAX 350𝑋1 + 300𝑋2 S.T. 𝑋1 + 𝑋2 ≤ 200 → pumps 9𝑋1 + 6𝑋2 ≤ 1,520 → labor 12𝑋1 + 16𝑋2 ≤ 2,650 → tubing 𝑋1, 𝑋2 ≥ 0 Suppose Howie Jones has to purchase a single piece of equipment for $1,000 in order to produce any Aqua-Spas or Hydro-Luxes. You use 𝑌1, 𝑌2 as the production of each product, and 𝑀1 and 𝑀2 to denote the upper bound for each product line. How will the fixed cost affect the formulation of the model of his decision problem? Which of the following problem formulations is incorrect?
The new set of decision variables is 𝑋1 and 𝑋2.
When the objective function can increase without ever contacting a constraint in the LP, we call such a solution:
Unbounded solution
How do we compare different techniques?
We compare different techniques (models) looking at the area under the curve (AUC)
What is incorrect about hypothesis testing?
We fail to reject the null when the p-value is smaller than the significance level (𝛼)
Agri-Pro is a company that sells agricultural products to farmers in a number of states. They offer custom feed mixing, in which a farmer can order a specific amount of livestock feed and specify the amount of corn, grain, and minerals the feed should contain. Agri-Pro has received an order for 10,000 pounds of chicken feed to be mixed from the following feeds. The order must contain at least 15% corn, 15% grain, and 20% minerals. Which of the following statements is incorrect? (4)
We set F6 ≤ G6 as a constraint
What are the three common elements of an optimization problem?
decisions, constraints, an objective.
What happens in the "Partition data" step of data mining?
Divide the data into subsets; train the model (algorithm) on one subset and test it on the other subsets
How do we interpret the coefficient?
how close the regression function's coefficients are to the actual, unknown coefficients.
In using neural networks, an analyst must decide __________ and ___________
how many hidden layers to use and how many nodes to use in each of the hidden layers
The first step in creating a classification tree involves
recursively partitioning the independent variables using the outcome of previous partitions.
A production optimization problem has 4 decision variables and resource 1 limits how many of the 4 products can be produced. Which of the following constraints reflects this fact?
𝑓(𝑋1, 𝑋2, 𝑋3, 𝑋4) ≤ 𝑏1
You throw a fair coin N number of times and let X be the number of heads. The probability distribution for getting x number of heads should follow:
Binomial Distribution
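For reference, the binomial probability of x heads in N tosses is C(N, x) · p^x · (1 − p)^(N − x), with p = 0.5 for a fair coin. A short sketch (the function name `binomial_pmf` and the choice N = 10 are illustrative):

```python
from math import comb

def binomial_pmf(x, n, p=0.5):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Distribution of the number of heads in 10 tosses of a fair coin.
probs = [binomial_pmf(x, 10) for x in range(11)]
print(probs[3])  # probability of exactly 3 heads
```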
EXAM 2
Which of the following represents a regression model?
𝑌 = 𝑓(𝑋1, 𝑋2, ... , 𝑋𝑘) + 𝜀
What does the Excel "=SUMPRODUCT(A1:A5,C6:C10)" function do?
Multiplies each pair of cells in the two arrays, matched by position, and sums the products
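The same operation can be sketched in Python. The helper name `sumproduct` and the sample values are hypothetical stand-ins for the two Excel ranges.

```python
def sumproduct(xs, ys):
    """Multiply values matched by position and add up the products."""
    if len(xs) != len(ys):
        raise ValueError("both ranges must have the same number of cells")
    return sum(x * y for x, y in zip(xs, ys))

a = [1, 2, 3, 4, 5]        # stands in for A1:A5
c = [10, 20, 30, 40, 50]   # stands in for C6:C10
total = sumproduct(a, c)   # 10 + 40 + 90 + 160 + 250
print(total)
```

In LP spreadsheets this is typically how the objective and each constraint's left-hand side are computed from the decision-variable cells.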
Blue Ridge Hot Tubs manufactures and sells two models of hot tubs: the Aqua-Spa and the Hydro-Lux. Howie Jones, the owner and manager of the company, needs to decide how many of each type of hot tub to produce during his next production cycle. MAX 350𝑋1 + 300𝑋2 S.T. 𝑋1 + 𝑋2 ≤ 200 → pumps 9𝑋1 + 6𝑋2 ≤ 1,520 → labor 12𝑋1 + 16𝑋2 ≤ 2,650 → tubing 𝑋1, 𝑋2 ≥ 0 Suppose Howie Jones has to purchase a single piece of equipment for $1,000 in order to produce any Aqua-Spas or Hydro-Luxes. You use 𝑌1, 𝑌2 as the production of each product, and 𝑀1 and 𝑀2 to denote the upper bound for each product line. How will the fixed cost affect the formulation of the model of his decision problem? What is incorrect about 𝑀?
𝑀1 is the maximum value among (200/1 , 1520/9 , 2650/12 ).
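The statement on this card is the incorrect one. A common way to pick the bound, assumed here, is to take the smallest of the ratios, because 𝑋1 can never exceed the tightest single-resource limit. A quick computation (the dictionary layout is illustrative):

```python
# (coefficient of X1, right-hand side) for each resource constraint.
resources = {"pumps": (1, 200), "labor": (9, 1520), "tubing": (12, 2650)}

# The most X1 that each resource alone would allow.
ratios = {name: rhs / coef for name, (coef, rhs) in resources.items()}

# The tightest limit serves as the upper bound M1.
m1 = min(ratios.values())
print(ratios, m1)
```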
For a simple linear regression model, a 100(1 − 𝛼)% prediction interval for a new value of 𝑌 when 𝑋 = 𝑋ℎ is computed as
𝑌̂ℎ ± 𝑡(1−𝛼/2, 𝑛−2) 𝑆𝑝
If a company selects either Project 2 or Project 3 (but not both 2 and 3) then it must also select Project 1. Write the constraint that enforces this condition. We denote the selection of each project by 𝑋1, 𝑋2, 𝑋3, each being a binary variable.
(𝑋2 + 𝑋3) − 𝑋1 ≤ 0
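To see what this constraint allows, one can enumerate all eight binary combinations and keep those that satisfy it. This is a small illustrative check, not part of the original card.

```python
from itertools import product

# Keep every (X1, X2, X3) combination that satisfies (X2 + X3) - X1 <= 0.
feasible = [(x1, x2, x3)
            for x1, x2, x3 in product((0, 1), repeat=3)
            if (x2 + x3) - x1 <= 0]

for combo in feasible:
    print(combo)
```

Project 2 or Project 3 appears only together with Project 1, and selecting both 2 and 3 is ruled out because 𝑋1 cannot exceed 1.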
How do we run the "Logistic Regression" classification technique?
*Computes a function that maps the independent variables into a probability of membership*Estimates the probability of an observation belonging to each group
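The mapping from independent variables to a probability is usually the logistic function applied to a linear combination. A minimal sketch, with hypothetical coefficient values and a hypothetical function name:

```python
import math

def logistic_probability(x, intercept, coefs):
    """Map independent variables to a probability of group membership."""
    # Linear combination of the independent variables.
    z = intercept + sum(b * v for b, v in zip(coefs, x))
    # The logistic function squeezes z into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted coefficients and one observation.
p = logistic_probability(x=(2.0, 1.0), intercept=-1.5, coefs=(0.8, -0.4))
predicted_group = 1 if p >= 0.5 else 0  # 0.5 cutoff is an assumption
print(p, predicted_group)
```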
Based on the confusion matrix of the validation dataset analysis in DA, how good is the model at detecting the actual positives? (pic in pdf)
.778
Confusion Matrix: What is the correct classification rate?
Actual \ Predicted    0     1     Total
0                     9     4     13
1                     2     10    12
Total                 11    14    25
19/25 = 76%
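The rate can be reproduced from the four cells of the matrix, along with the two measures asked about on other cards (accuracy of positive predictions, and detection of actual positives). Variable names are illustrative, and group 1 is treated as the positive class by assumption.

```python
# Cells of the confusion matrix (rows = actual, columns = predicted).
tn, fp = 9, 4    # actual 0: predicted 0, predicted 1
fn, tp = 2, 10   # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # correct classification rate
precision = tp / (tp + fp)     # accuracy of positive predictions
recall = tp / (tp + fn)        # share of actual positives detected
print(accuracy, precision, recall)
```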
In this problem, we rescaled the objective function and constraints because scaling problems
All of these choices are correct.
What is the difference between Euclidean distance and Mahalanobis distance?
Euclidean distance assumes that the variables are uncorrelated, while Mahalanobis distance takes into account the covariance between variables.
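The difference shows up when two variables are correlated. The sketch below handles only the two-variable case so that the covariance matrix can be inverted by hand; the function name and the covariance values are hypothetical.

```python
import math

def mahalanobis_2d(p, q, cov):
    """Mahalanobis distance between 2-D points p and q for a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of the 2x2 covariance matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = p[0] - q[0], p[1] - q[1]
    squared = (dx * (inv[0][0] * dx + inv[0][1] * dy)
               + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(squared)

center = (0.0, 0.0)
cov = ((1.0, 0.8), (0.8, 1.0))   # hypothetical, strongly positive correlation
along = (1.0, 1.0)               # lies along the correlation direction
against = (1.0, -1.0)            # lies against the correlation direction

print(math.dist(center, along), math.dist(center, against))
print(mahalanobis_2d(along, center, cov), mahalanobis_2d(against, center, cov))
```

Both points are equally far from the center in Euclidean terms, but the point that runs against the correlation is farther in Mahalanobis terms. With an identity covariance matrix the two distances coincide.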
What is incorrect about the DA and logistic regression models that we trained? (pic in pdf)
It is better to use DA model because it has lower AUC
How do we interpret the log-transformed variable? (Note: when both sides are log transformed)
Multiplicative change in independent variable = Multiplicative change in dependent variable (Transform multiplication to addition)
What is the category of data mining tasks in which a researcher attempts to predict the value of a continuous response variable based on the data set?
Prediction
You also predicted the classes of five new students. Which student receives different predicted values from the two models? (pic in pdf)
Record 1 student
Which of the following arguments is incorrect about standard error?
Standard error is calculated differently depending on the population distribution.
What do we call a type of machine learning where the computer is trained to make predictions or decisions based on labeled data?
Supervised learning
After getting the regression model, how do we make a prediction?
The y value can be predicted by plugging in the values of each x variable into the regression model.
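As a small illustration, prediction is just evaluating the fitted equation. The coefficients below are hypothetical, not taken from any model in the course.

```python
def predict(intercept, coefs, x):
    """Plug the x values into the fitted regression equation."""
    return intercept + sum(b * v for b, v in zip(coefs, x))

# Hypothetical fitted model: y = 10 + 2*x1 + 0.5*x2
y_hat = predict(intercept=10.0, coefs=(2.0, 0.5), x=(3.0, 4.0))
print(y_hat)  # 10 + 6 + 2
```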
When do alternate optimal solutions occur in LP models?
When a binding constraint is parallel to a level curve.
What is a Confusion Matrix?
a table summarizing the accuracy of a classification technique for two or more groups
The terms 𝑏0 and 𝑏1 are
all of the above
The regression function indicates the
average value the dependent variable assumes for a given value of the independent variable
K nearest neighbor
identifies the k observations in training data that are most similar to a new observation we want to classify
Rounding the LP relaxation solution up or down to the nearest integer may:
produce an infeasible solution.
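A small made-up example shows how this can happen. For the hypothetical problem MAX 5𝑋1 + 4𝑋2 subject to 6𝑋1 + 4𝑋2 ≤ 24 and 𝑋1 + 2𝑋2 ≤ 6, the LP relaxation is optimal at (3, 1.5); the sketch below checks the two rounded points.

```python
import math

def is_feasible(x1, x2):
    """Check the two hypothetical constraints and non-negativity."""
    return (6 * x1 + 4 * x2 <= 24
            and x1 + 2 * x2 <= 6
            and x1 >= 0 and x2 >= 0)

relaxed = (3.0, 1.5)  # optimal corner of the LP relaxation

rounded_up = (relaxed[0], math.ceil(relaxed[1]))     # (3, 2)
rounded_down = (relaxed[0], math.floor(relaxed[1]))  # (3, 1)

print(rounded_up, is_feasible(*rounded_up))
print(rounded_down, is_feasible(*rounded_down))
```

Rounding up violates the first constraint, while rounding down stays feasible. In general, even a feasible rounded point is not guaranteed to be the optimal integer solution.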
Which of the following is the general format of an objective function?
𝑓(𝑋1, 𝑋2, ... , 𝑋𝑛)