Data Mining

The following table records the color and type of vehicle, whether a speeding ticket (1) or a warning (0) was received, and whether there was a prior driving violation (yes or no). Using the naive Bayes calculation, what is the conditional probability of receiving a ticket with a red vehicle?

0.53

The partial data set in the table shows hours spent shopping online by age and income. The mean and standard deviation of income for the full data set are $47,667 and $14,292, respectively. Using z-scores to standardize the observations, what is the average standardized income (z-score) for the three observations provided? ID 1: 62000 - 48 - 2 ID 2: 58000 - 52 - 4 ID 3: 53000 - 22 - 5

0.6997
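The arithmetic behind this answer can be checked with a short Python sketch. The variable names are illustrative; the mean and standard deviation are the full-data-set values stated in the question.

```python
# Income values for IDs 1-3, as listed in the question.
incomes = [62000, 58000, 53000]
mean_income = 47667  # mean of the full data set (given)
sd_income = 14292    # standard deviation of the full data set (given)

# A z-score is the distance from the mean in units of standard deviation.
z_scores = [(x - mean_income) / sd_income for x in incomes]

# Average of the three standardized incomes.
avg_z = sum(z_scores) / len(z_scores)
print(round(avg_z, 4))  # 0.6997
```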

Using the table below, find the k nearest neighbors of record 4 by age, using k = 3.

24, 31, and 34

Consider the following estimated linear trend model and make a forecast for t = 18: ŷ = 11.23 + 1.04t

29.95
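As a quick check, the forecast is the trend equation evaluated at t = 18. The function name below is illustrative; the coefficients are the ones given in the question.

```python
def linear_trend(t, intercept=11.23, slope=1.04):
    """Evaluate the estimated linear trend y = intercept + slope * t."""
    return intercept + slope * t

forecast = linear_trend(18)
print(round(forecast, 2))  # 29.95
```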

Mark is using a 3-period moving average to forecast the number of filters needed for the fourth quarter. Using the following data, what is the forecasted amount?

39 Filters

Sydney is evaluating monthly sales for her Etsy account. Based on the given data y(1) = 4,321; y(2) = 3,876; y(3) = 4,190, what is her 3-period moving average?

4,129
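A 3-period moving average is the mean of the three most recent observations. The sketch below uses the sales figures from the question; the helper name is an assumption made for illustration.

```python
# Monthly sales y(1), y(2), y(3) from the question.
sales = [4321, 3876, 4190]

def moving_average(series, periods=3):
    """Mean of the last `periods` observations in the series."""
    window = series[-periods:]
    return sum(window) / len(window)

average = moving_average(sales)
print(average)  # 4129.0
```

When a new observation arrives, the window slides forward, so the oldest value (not the largest) drops out.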

Consider the following quadratic trend model and make a forecast for t = 10: ŷ = 15.84 + 0.98t + 0.02t^2

43.44

The marketing department is examining data pulled from the retail stores for the month of December. In this period, three items are of interest: sound bars, LED under-counter lights, and shelving units. To determine whether the third item is purchased when the other two are, a confidence of 0.575 was calculated, with an expected confidence of 0.10. Calculate the lift ratio.

5.75
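The lift ratio is the rule's confidence divided by its expected confidence. A minimal check with the two values from the question:

```python
confidence = 0.575           # confidence of the rule (given)
expected_confidence = 0.10   # expected confidence of the consequent (given)

# Lift ratio = confidence / expected confidence.
lift = confidence / expected_confidence
print(round(lift, 2))  # 5.75
```

A lift ratio above 1 points to a positive association between the antecedent and the consequent.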

What is the estimated probability that the cheese sample tested in NW will be Gouda? k=3

67%

If the performance measures are based on a cutoff value of 0.5, then if we lower the cutoff value, more cases will be in the target class, resulting in different performance measurement values. What chart can be used to review the data that are independent of the cutoff value?

All options are independent of the cutoff value

Which option is not one of the three common strategies used in creating ensemble models?

Bootstrapping

When a target variable is categorical, the CART algorithm produces a ___________ tree to predict the class memberships of new cases.

Classification

This chart measures the effectiveness of a predictive model, containing both a baseline and a lift curve

Cumulative lift chart

The process of dividing a data set into a training, a validation, and an optional test data set is called:

Data partitioning

Cross-industry standard process for Data Mining (CRISP-DM) consists of six phases. Of the six, which represents the phase where data wrangling occurs?

Data preparation

Which chart allows for the categorization of large data sets from high to low values, dividing sets of observations into an easy visual representation of the data

Decile-wise chart

When visually inspecting data to confirm the existence of a trend, a scatterplot of the data with a superimposed linear trend line is advisable to view the series over time

True

Consider the following table of the derivations for the MSE, MAD, and MAPE in the validation set. Based on the results, which model is preferred and why?

Exponential, because the MSE, MAD, and MAPE are consistently lower

A diagram that represents the information in equal-sized intervals, deciles, is called a cumulative lift chart

False

A pure subset contains leaf nodes where cases have contradicting values to the target variable, to enhance the variable case outcomes and allow for further splits.

False

If-Then logical statements are constructed with the If portion being the consequent and the Then being the antecedent

False

In a 3-period moving average, when a new observation becomes available, the highest numerical observation is dropped

False

In reviewing stock growth in Amazon, the linear trend model would be best used when an increase in the series happens over time

False

KNN belongs to a category of mining techniques called computer-based reasoning

False

The Jaccard's coefficient is appropriate when it is more informative to match negative outcomes between observations

False

The most commonly used approach for hierarchical clustering is divisive clustering

False

The naive Bayes method is an unsupervised data mining technique that uses partitioning to assess model performance

False

The overall MSE split for Age = 25 is $22987.29 and for Age = 23 is $21983.40. Of the two presented, Age = 25 is slightly higher and has a lower level of impurity for constructing a regression tree

False

The use of quantitative forecast can be criticized because biases in optimism and overconfidence may skew the results

False

Under the association rule, a lift ratio between 0 and 1 indicates a positive association

False

When a time series exhibits seasonal variations, the Holt exponential smoothing method, or double exponential smoothing method, is appropriate to capture the upward and downward movement of the time series

False

When using R, after the data is imported, set.seed function is used to set the random seed and the k function sets the k parameters to preselect the number of clusters

False

When using k-means clustering, the number of clusters is specified at the end of the analysis to remove overlapping clusters.

False

The best-pruned tree is the smallest, least complex tree with the smallest validation error.

False

While k-nearest neighbors is effective as a classifier, it provides no information on predictor importance

True

Using the Manhattan distance between pairwise observations, which pairwise observation is most similar? Observation 1: 2 - 3 Observation 2: 6 - 4 Observation 3: 8 - 2

Observation 2 & 3
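The pairwise distances can be verified with a short Python sketch. The observation values come from the question; the function and variable names are illustrative.

```python
from itertools import combinations

# Observations from the question, as (x1, x2) pairs.
observations = {1: (2, 3), 2: (6, 4), 3: (8, 2)}

def manhattan(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Distance for every pair of observations.
distances = {
    (i, j): manhattan(observations[i], observations[j])
    for i, j in combinations(observations, 2)
}
closest_pair = min(distances, key=distances.get)

print(distances)     # {(1, 2): 5, (1, 3): 7, (2, 3): 4}
print(closest_pair)  # (2, 3)
```

The smallest distance marks the most similar pair.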

In the k-Means Clustering Method, there is a general process of how k-means clustering algorithm can be classified. Which one of the following is not one of the general processes?

Reassign each observation to the nearest observation point

Of the following selections, which is NOT a descriptor of principal component analysis?

The first principal component is not suitable for analysis

Which is the best-fit definition for the use of Principal Component Analysis (PCA)?

The transformation of a large number of correlated variables into a smaller number of uncorrelated variables

Aimee's bookstore had a 45% increase in profits on Wednesday, June 12th, over the previous year's sales. Without the presence of a holiday, events in the area, or sale promotion, this business event is considered random.

True

Before constructing a decision tree, one of the first steps is identifying possible splits of the predictor variable

True

By combining the validation and the training set, the sample is larger for estimation and includes the most recent validation set for predictions

True

Decision trees produced by the CART algorithm are binary, meaning that there are two branches for each decision node

True

If a time series reverses direction, then a quadratic trend model will allow for the curvature to be graphed

True

In real-world situations, data sets contain many variables. If some variables are eliminated, valuable information may be lost

True

In understanding the association rules, it is best to think of them as an If-Then statement

True

KNN is a simple data mining tool, known for developing personalized recommendations for many online company applications

True

Normalization is the process that makes the numerical data independent of scale

True

Oversampling involves intentionally selecting more samples from one class than from other classes to adjust the class distribution of a data set

True

The Ward's method is the use of a different algorithm to minimize the dissimilarity within clusters by using error sum of squares

True

The forming of groups into internally homogeneous groups where each has a unique characteristic, different from other groups, is called cluster analysis.

True

The key distinction between supervised and unsupervised data mining is that a target variable is identified in supervised data mining

True

The principal component analysis (PCA) is a dimension reduction technique used to reduce variables without removing variables

True

The process of applying a set of analytical techniques for the development of machine learning is called data mining

True

The triple exponential smoothing method uses seasonality variations in the analysis of the data

True

The use of classifying or predicting the value to create an outcome is called scoring a record

True

To measure impurity in a regression tree, mean square error (MSE) is used.

True

When a time series is expected to grow by fixed amounts each time period, then the linear trend model should be used

True

When evaluating large data sets, it is customary to cluster large data sets using the k-means to reduce the computation of measures during each iteration compared to hierarchical clustering methods

True

When performing a naive Bayes analysis, all predictor variables must be categorical

True

Which one of the following is not a step in cross-validation with time series?

Use both the training and validation set to re-estimate the preferred model

When using R for Agglomerative Clustering, the plot function is used to create the dendrogram as well as a banner plot. What function is used to split these results into distinct clusters?

cutree

When a predictive model is made overly complex to fit in the quirks of given sample data, it is called:

overfitting

If predictor variables are highly correlated, then repeated sampling of the training data, and a random selection of features are used to construct trees. This is an example of which strategy?

Random forest

An issue with the naive Bayes classifier is determining rare outcomes because the estimate is 0. To overcome this problem, the algorithm allows a replacement of zero probability with a nonzero value. This technique is called

smoothing
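As an illustration, here is a minimal sketch of Laplace (add-one) smoothing, one common variant; the card does not say which variant the course uses. The function name and all counts below are hypothetical.

```python
def smoothed_probability(count, class_total, n_categories, alpha=1):
    """Laplace-smoothed estimate of P(predictor value | class).

    count        -- cases in the class with this predictor value
    class_total  -- all cases in the class
    n_categories -- number of distinct values the predictor can take
    alpha        -- smoothing constant (1 gives add-one smoothing)
    """
    return (count + alpha) / (class_total + alpha * n_categories)

# Hypothetical example: a predictor value never observed among the
# 20 cases of a class, for a predictor with 3 possible values.
unsmoothed = 0 / 20
smoothed = smoothed_probability(0, 20, 3)

print(unsmoothed)          # 0.0
print(round(smoothed, 4))  # 0.0435
```

Without smoothing, the zero estimate would force the whole naive Bayes product for that class to zero.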

Which method would be the best fit for a sample containing seasonality, but no trend, and is further divided into structures depending on the type of seasonality exhibited by the series?

The Holt-Winters exponential smoothing method

Based on the following 20 sorted values for age, what are the possible split points? {20, 22, 24, 26, 28, 31, 32, 33, 35, 40, 42, 43, 45, 47, 49, 50, 52, 53, 55, 57}

{21, 23, 25, 27, 29.5, 31.5, 32.5, 34, 37.5, 41, 42.5, 44, 46, 48, 49.5, 51, 52.5, 54, 56}
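The split points are the midpoints between consecutive sorted values. The sketch below assumes the list holds 20 ages including 32, which is the value implied by the 31.5 and 32.5 split points in the answer; variable names are illustrative.

```python
# Sorted ages; 32 is assumed present because the answer's split points
# 31.5 and 32.5 require a value between 31 and 33.
ages = [20, 22, 24, 26, 28, 31, 32, 33, 35, 40,
        42, 43, 45, 47, 49, 50, 52, 53, 55, 57]

# Each candidate split is the midpoint of two neighboring values.
split_points = [(low + high) / 2 for low, high in zip(ages, ages[1:])]

print(len(split_points))  # 19
print(split_points)
```

With n distinct sorted values there are n - 1 candidate splits, so 20 ages give 19 split points.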

