CS 4407 - Data Mining and Machine Learning - Study


True or False: The fix() function identifies inconsistent values within a data frame and automatically corrects them.

False

True or False: The snowflake schema differs from the star schema in that the tables holding the dimensional data are normalized.

True

The income of a company that produces disaster equipment has been expressed as a linear regression model based upon the input variable, which is the number of hurricanes projected for the upcoming hurricane season. The model is expressed as Y = mX + b, where Y is the estimated sales in millions of dollars, m = .67 and b = 8.2. Assuming that the weather service is predicting 12 hurricanes during the season, what are the sales in millions of dollars expected to be?

16.24

What does ETL stand for?

Extract, Transform, Load

True or False: In agglomerative hierarchical clustering, the cluster builds top down from the root node to the leaves.

False

The value of K should typically be an odd number for what reason?

It ensures that, when classifying a point between two classes, the vote will not end in a tie

Which of the following is an example of an unsupervised learning algorithm?

K-Means

What type of network does the following diagram illustrate? (input vector x1, x2, and xn)

Kohonen (self-organizing) Network

You have a dataset which produces the following plot and you need to create a predictive model. Which of the following techniques are you most likely to use?

Linear Regression

The names() function within R:

Lists all of the column names in the data frame provided as an argument to the function.

The first artificial neural network model was developed by:

McCulloch and Pitts

______________ wrote a book in 1969 describing the limitations of the perceptron which led to a drop in interest in neural networks for many years.

Minsky and Papert

Residual plots are a useful tool for identifying:

Non-linearity

When using a relational database engine as the backend for analytics processing, the acronym _____ is used to describe it.

ROLAP

Assuming K=3, how would the point X be classified using KNN? Training points (point, distance, class): d1, 4, Red; d2, 3, Blue; d3, 5, Red; d4, 7, Red

Red

Assuming K=3, how would the point X be classified using KNN? Training points (point, distance, class): d1, 2, Blue; d2, 2.5, Red; d3, 1.75, Red; d4, 4, Blue; d5, 5, Red; d6, 6.5, Red

Red

Assuming K=5, how would the point X be classified using KNN? Training points (point, distance, class): d1, 2, Blue; d2, 2.5, Red; d3, 1.75, Red; d4, 4, Blue; d5, 5, Red; d6, 6.5, Red

Red
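
The KNN questions above all reduce to the same two steps: sort the training points by distance, then take a majority vote among the K closest. A minimal Python sketch of that vote follows. The function name `knn_classify` is made up for illustration, and it assumes the distances are already computed, as they are in the questions.

```python
from collections import Counter

def knn_classify(neighbors, k):
    """Return the majority class among the k nearest training points.

    neighbors: list of (distance, label) pairs with precomputed distances.
    Ties in the vote are broken by whichever label was counted first.
    """
    nearest = sorted(neighbors, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Distances and classes from the six-point questions above
points = [(2, "Blue"), (2.5, "Red"), (1.75, "Red"),
          (4, "Blue"), (5, "Red"), (6.5, "Red")]
print(knn_classify(points, 3))  # Red
print(knn_classify(points, 5))  # Red
```

With K=3 the three nearest points are Red, Blue, Red, and with K=5 the vote is three Red to two Blue, so both calls agree with the listed answers.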

Assuming that we have a data set that includes sales data for every customer over the course of several years and we want to use this data to predict future sales, which would be the most appropriate technique to investigate?

Regression

True or False: Unsupervised learning is often performed as part of an exploratory data analysis.

True

True/False: Supervised learning features both input variables or attributes and an output or predicted variable.

True

Which command will provide descriptive statistics for the Boston data frame?

summary(Boston)

The sales of a company (in millions of dollars) for each year are shown in the table below. Identify the linear regression model in the form y = mx + b and report the values of m (slope) and b (intercept) as well as the estimated value of y when the value of x is 10. x (year): 2005, 2006, 2007, 2008, 2009; y (sales): 12, 19, 29, 37, 45. NOTE: You should consider the value x as the elapsed time. For 2005 this would be 0 years, for 2006 it would be 1 year, and for 2012 it would be 7 years. What is the predicted value of y (in millions of dollars) when the year is 2012?

70.4

Assume that you have a data set which produces the following data plot. You wish to predict if a new case would be a 'red' case as opposed to a 'blue' case based upon the input attribute data. Which technique should you use?

Logistic Regression

Assuming you have the following data values (12,11,2,13,8,10), what is the min-max normalized value for 6? Provide your response rounded to the thousandths place.

0.364
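
Min-max normalization rescales a value by the range of the data: (value - min) / (max - min). A short Python sketch, with the hypothetical helper name `min_max_normalize`, reproduces the answer above:

```python
def min_max_normalize(value, data):
    """Scale value relative to the minimum and maximum of data."""
    lo, hi = min(data), max(data)
    return (value - lo) / (hi - lo)

# (6 - 2) / (13 - 2) = 4 / 11
print(round(min_max_normalize(6, [12, 11, 2, 13, 8, 10]), 3))  # 0.364
```

The value being scored does not need to appear in the data set; only the minimum and maximum of the set matter.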

Assuming a data set with 1000 training outcomes of which 332 are positive, what is the Entropy? Round your answer to the nearest thousandths place.

0.917

Assuming a data set with 20 training outcomes of which 12 are positive, what is the Entropy? Round your answer to the nearest thousandths place.

0.971

Assuming a data set with 512 training outcomes of which 292 are negatives, what is the Entropy? Round your answer to the nearest thousandths place.

0.986
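
The three entropy questions above use the two-class entropy formula, -p log2(p) - (1 - p) log2(1 - p), where p is the fraction of outcomes in one class. The sketch below is one way to check them in Python; the function name `binary_entropy` is an illustrative choice. Because the formula is symmetric in the two classes, it does not matter whether the count passed in is the positives or the negatives.

```python
import math

def binary_entropy(count, total):
    """Entropy in bits of a two-class split where one class has `count` of `total` cases."""
    p = count / total
    if p in (0.0, 1.0):
        return 0.0  # a pure split has no uncertainty
    q = 1.0 - p
    return -(p * math.log2(p) + q * math.log2(q))

print(round(binary_entropy(332, 1000), 3))  # 0.917
print(round(binary_entropy(12, 20), 3))     # 0.971
print(round(binary_entropy(292, 512), 3))   # 0.986
```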

Regression analysis involves developing a model where one or more inputs are used to predict an output variable. Regression, in this context, represents what kind of learning?

Supervised learning

A linear regression model is expressed as y ≈ β0 + β1x, where β0 is the intercept and β1 is the slope of the line. The following equations can be used to compute the value of the coefficients β0 and β1. Using the following set of data, find the coefficients β0 and β1 rounded to the nearest thousandths place and the predicted value of y when x is 10. {(-1, 0), (0, 2), (1, 4), (2, 5)}

β0 = 1.9 β1 = 1.7 y = 18.9
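
The equations the question refers to did not survive in this copy. Assuming they are the usual least-squares formulas, β1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and β0 = ȳ - β1x̄, the following Python sketch reproduces the answer. The function name `fit_line` is an illustrative choice.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

b0, b1 = fit_line([-1, 0, 1, 2], [0, 2, 4, 5])
print(round(b0, 3), round(b1, 3), round(b0 + b1 * 10, 3))  # 1.9 1.7 18.9
```

The same function applied to the elapsed-year sales data (x = 0 to 4, y = 12, 19, 29, 37, 45) gives an intercept of 11.6 and a slope of 8.4, which is where the 70.4 prediction for x = 7 comes from.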

True or False: Information gain is the reduction of entropy.

True

True or False: Qualitative variables are often referred to as categorical.

True

True or False: Residual plots are a useful tool for identifying non-linearity.

True

True or False: Weights in the artificial neuron are adjusted to ensure that the inputs will produce the output while training.

True

True or False: When training a neural network, the goal is to minimize error, which is defined as the difference between the target output and the actual output.

True

True/False: A regression model has an R^2 statistic of .15. This indicates that the regression model is NOT a good fit and does a poor job of predicting the outcome based upon the input variables.

True

True/False: Reinforcement learning features elements of both supervised learning and unsupervised learning as the outcome variable or predicted values are validated over time and feedback is used to continuously train the learning algorithm.

True

The objective of ________ is to identify valid, novel, potentially useful, and understandable correlations and patterns in existing data.

data mining

Which of the following functions is used to generate a linear regression model within R?

lm()

True/False: A regression model has an R^2 statistic of .95. This indicates that the regression model is NOT a good fit and does a poor job of predicting the outcome based upon the input variables.

False

True/False: According to our textbook, residual plots are a useful tool for identifying clusters.

False

True/False: Shared nothing architectures distribute the processing of queries to access large volumes of data and provide near linear scalability in both storage volume and query performance.

True

Which of the following statements will generate a multiple linear regression model within R where the output or predicted variable is sales and the predictor variables are temperature and unemploymentrate?

lm(sales~temperature+unemploymentrate)

Assuming you have the following sample data values (11,5,2,12,9), what is the Z-Score normalized value for 7? Provide your response rounded to the hundredths place.

-0.19

Assuming you have the following data values (3,6,9,14,2), what is the Z-Score normalized value for 5. X* = (Xv - mean(X)) / SD(X) Where X is the set of data values and Xv is the value to score. Provide your response rounded to the thousandths place:

-0.37
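
The two Z-score answers above can be checked with the formula from the question, X* = (Xv - mean(X)) / SD(X). The sketch below assumes SD is the sample standard deviation (dividing by n - 1), since that is the version that reproduces both listed answers. The helper name `z_score` is illustrative.

```python
import statistics

def z_score(value, data):
    """Z-score of value against data, using the sample standard deviation (n - 1)."""
    return (value - statistics.mean(data)) / statistics.stdev(data)

print(round(z_score(7, [11, 5, 2, 12, 9]), 2))  # -0.19
print(round(z_score(5, [3, 6, 9, 14, 2]), 3))   # -0.37
```

Swapping `statistics.stdev` for `statistics.pstdev` (population standard deviation) gives different values, so it is worth confirming which definition a given course or tool expects.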

Assuming you have the following data values (4,6,9,20,8,7), what is the min-max normalized value for 6? X* = (Xv - min(X)) / (max(X) - min(X)), where X is the set of data values and Xv is the value to score. Provide your response rounded to the thousandths place:

0.125

Assume a linear model, with m = .05 and b = 10, that explains the relationship between income and credit extended. If income is 50,000, what credit will be extended?

2510

8.41

Assume that you are the data scientist for a new email provider called fastmail.com. Fastmail intends to differentiate its email service from other email providers by providing excellent spam detection and elimination. You have been tasked with developing a machine learning solution that can detect whether an email message received into a user's inbox is spam. Which of the following techniques are you most likely to use?

A Classifier based upon Bayes theorem

Assuming K=1, how would the point X be classified using KNN? Training points (point, distance, class): d1, 4, Red; d2, 3, Blue; d3, 5, Red; d4, 7, Red

Blue

Assume that you had a variety of data including medical history, diet, heredity factors on individuals who developed cancer and you wanted to use this data to determine whether a person is likely to develop cancer. Which technique would be the most promising to start with?

Classification

A database where all of the values for a particular column are stored contiguously is called?

Column-oriented storage

Which of the following is not a category of transfer function in neural networks?

Correlation

The following diagram represents which technique? (Illustrates a curve diagram)

Curvilinear Regression

As a new data scientist for the Moogle corporation, you are asked to develop an algorithm that can detect spam emails and deliver them into a spam folder instead of an inbox. After looking at a plot of the data it looks like the following. Which technique would you use for your algorithm?

Logistic Regression

The term OLAP stands for?

Online Analytical Processing

Which of the following is NOT a statistical processing software package?

Vertica

When data observations are placed into specific groups according to observed characteristics in training data, this is known as ______________

Classification

Which of the following is an example of a NoSQL analytics database?

Cassandra

True or False: In unsupervised learning, the learning algorithm must be trained using data attributes that have been paired with an outcome variable.

False

True or False: Information Retrieval or text analytics is NOT a form of data mining.

False

True or False: Linear regression is considered a non-parametric approach.

False

True or False: Logistic regression can be used to predict a continuous variable.

False

True or False: Map/Reduce refers to an optimized approach to process SQL queries.

False

True or False: NoSQL databases provide greater performance at the expense of availability.

False

True or False: Unsupervised learning involves building a statistical model for predicting, or estimating an output based upon one or more inputs.

False

True/False: A linear regression model can be used to predict categorical data values.

False

True or False: Collinearity refers to a situation in which two or more predictor variables are closely related to each other.

True

True or False: Data Mining can be said to be a process designed to detect patterns in data sets.

True

11.6

A farmer's yield of corn is expressed as a linear regression model based upon the input variable, which is the number of days of sunlight during the growing season. The model is expressed as Y = mX + b, where Y is the estimated corn yield in bushels per acre, m = 1.38 and b = 42. Assuming that 67 days of sun are predicted during the growing season, what will the corn yield be in bushels per acre?

134.46

The income of a company that produces disaster equipment has been expressed as a linear regression model based upon the input variable, which is the number of hurricanes projected for the upcoming hurricane season. The model is expressed as Y = mX + b, where Y is the estimated sales in millions of dollars, m = .76 and b = 5. Assuming that the weather service is predicting 6 hurricanes during the season, what are the sales in millions of dollars expected to be?

9.56

Assume that you are the data scientist for the GreatFoods! Supermarket chain. In an effort to increase sales of locally produced food such as eggs, milk, and bread, your manager asks you to develop a data mining solution that can identify the probability that a customer will purchase eggs when they purchase milk and vice versa. Which technique are you most likely to use?

Bayes Classifier

Assuming K=3, how would the point X be classified using KNN? Training points (point, distance, class): d1, 4, Red; d2, 3, Blue; d3, 5, Blue; d4, 7, Red

Blue

True or False: Principal components analysis is primarily a technique used to compute statistical variables but is not intended to be used as a tool to aid in the visualization of data.

False

True or False: The following data plot represents data that is linearly separable? (Everything is scattered)

False

True or False: The library() function lists all of the libraries that are loaded into memory within R

False

True or False: A predicted outcome variable must be categorical.

False

True or False: An artificial neuron, like a biological neuron, can have multiple inputs and multiple outputs.

False

True or False: Assuming we have 30,000,000 species and we eliminate one of them we can say that this decision has a large information gain?

False

True or False: Each training instance of data is presented to the network exactly once during training with the back propagation algorithm which is why a very large amount of training data is required.

False

True or False: In a biological neuron, signals are received through the axon to the synapses where an activation decision is made.

False

True or False: In a data warehouse, unidimensional data is stored in a star schema format.

False

True or False: In a supervised learning model, Bias refers to the error that is introduced from the assumptions of the data analyst.

False

What type of network does the following diagram illustrate? (Dots connected to each other)

Feedback network

The concept of the perceptron was developed in the late 1950s and early 1960s by:

Frank Rosenblatt

True or False: Entropy is a measure of the uncertainty or variability in a decision or choice within a decision tree.

True

Which of the following is NOT a machine learning technique?

Linear Components Analytics

Assume that you are the data scientist for ACME sporting goods and you have been asked to develop a predictive model that will help the company determine which products to produce and in what quantities for the upcoming sports season. You have historical data from many years with data that includes all of the factors that impacted sales of different types of products. Which of the following techniques are you most likely to use?

Linear Regression

Which of the following is an example of a parametric approach?

Linear Regression

You are training the following perceptron. The neuron in this perceptron has a sigmoid activation function, represented by the equation f(x,w) = 1 / (1 + e^(-(w1x1 + w2x2))). Using the update function for the weights, wi(t+1) = wi(t) + η(y - f(x,w)) f(x,w) (1 - f(x,w)) xi, with a learning rate of η = 1, and assuming that the current weights are w1 = 0.1 and w2 = 0.1, compute one iteration of the new weights by computing the error and applying it back to the inputs. Inputs: x1 = 0.5, x2 = 0.5; target output y = 0.25.

New W1 = 0.0657 New W2 = 0.0657

New W1 = 0.095 New W2 = 0.091

You are training the following perceptron. The neuron in this perceptron has a sigmoid activation function, represented by the equation f(x,w) = 1 / (1 + e^(-(w1x1 + w2x2))). Using the update function for the weights, wi(t+1) = wi(t) + η(y - f(x,w)) f(x,w) (1 - f(x,w)) xi, with a learning rate of η = 1, and assuming that the current weights are w1 = 0.2 and w2 = 0.3, compute one iteration of the new weights by computing the error and applying it back to the inputs. Inputs: x1 = 1, x2 = 0; target output y = 1.

New W1 = 0.311 New W2 = 0.3
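
The weight update in the two perceptron questions above can be traced step by step in Python. This sketch assumes the standard logistic sigmoid, 1 / (1 + e^(-z)), which is the form that reproduces the listed answers. It also reads the first question's diagram as inputs 0.5 and 0.5 with a target of 0.25. The function names are illustrative.

```python
import math

def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def update_weights(weights, inputs, target, learning_rate=1.0):
    """Apply one step of w_i <- w_i + eta * (y - f) * f * (1 - f) * x_i."""
    net = sum(w * x for w, x in zip(weights, inputs))
    out = sigmoid(net)
    delta = learning_rate * (target - out) * out * (1.0 - out)
    return [w + delta * x for w, x in zip(weights, inputs)]

# First question: weights 0.1 and 0.1, inputs 0.5 and 0.5, target 0.25
print([round(w, 4) for w in update_weights([0.1, 0.1], [0.5, 0.5], 0.25)])  # [0.0657, 0.0657]
# Second question: weights 0.2 and 0.3, inputs 1 and 0, target 1
print([round(w, 3) for w in update_weights([0.2, 0.3], [1, 0], 1)])         # [0.311, 0.3]
```

In the second case the weight on the zero input is unchanged, because the update is multiplied by the input value.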

Which of the following is NOT a classification technique?

Principal components analysis

True or False: In the KNN algorithm, a small value for K provides the most flexible fit (low bias/high variance).

True

True/False: A learning algorithm that is very accurate on the available training data, but when exposed to new cases has a high degree of variance and error, is said to be "overfit".

True

True or False: Bayes theorem classifies cases by calculating the probability that the case belongs to each class and then selecting the one with the highest probability.

True

True or False: Clustering refers to a broad set of techniques for finding subgroups or clusters in a data set.

True

Assume that you have the following perceptron with the assigned weights and the input values as shown. Further assuming that your perceptron will use an activation function based upon a simple step function with a threshold value of .5 (the summation of weighted inputs must exceed .5), will the neuron fire and output a value, and if so, what value? (Play Baseball?)

Will not fire / no value

The values of x and their corresponding values of y are shown in the table below. Identify the linear regression model in the form y = mx + b and report the values of m (slope) and b (intercept) as well as the estimated value of y when the value of x is 10. x: 0, 1, 2, 3, 4; y: 2, 3, 5, 4, 6

b = 2.2 m = 0.9 y = 11.2

The values of x and their corresponding values of y are shown in the table below. Identify the linear regression model in the form y = mx + b and report the values of m (slope) and b (intercept) as well as the estimated value of y when the value of x is 3. Round to the nearest hundredths place. x: -4, -2, 0, 2, 1; y: 2, 3, 4.3, 6.1, 5

b = 4.48 m = 0.67 y = 6.48

What R command could we use to generate a scatterplot diagram of our data to determine if it forms a linear pattern that would be suitable for linear regression or a non-linear pattern that would require some other technique?

plot()
