DSC 441 Final Exam Prep
Which of the following are methods of dimension reduction?
- Feature selection
- Data instance sampling
- Feature extraction
- Forward selection and backward selection
- Noise reduction
- Attribute relevance analysis (e.g. information gain)
Feature selection; data instance sampling; feature extraction; forward selection and backward selection
Which of the following are true about Forward Selection?
Forward selection is a feature selection method, keeping a subset of the original variables to make a reduced-complexity model.
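For intuition, here is a minimal greedy forward-selection sketch (my illustration, not from the course materials; it assumes scikit-learn and uses the iris data): start with no features, repeatedly add whichever feature most improves cross-validated accuracy, and stop when no addition helps.

```python
# Greedy forward selection sketch (assumes scikit-learn; data/model are illustrative).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

selected, remaining = [], list(range(X.shape[1]))
while remaining:
    # Score each candidate feature added to the current subset.
    scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)
    # Stop when no candidate improves the cross-validated accuracy.
    if selected and scores[best] <= cross_val_score(model, X[:, selected], y, cv=5).mean():
        break
    selected.append(best)
    remaining.remove(best)

print("Selected feature indices:", selected)
```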
What are the different methods to look at node impurity?
Gini Index, Information Gain, Gain Ratio, Misclassification Error
Supervised learning problems include:
Regression and classification
Recall is the same as
Sensitivity
ROC Curves makes use of
Sensitivity (the true positive rate) and 1 - Specificity (the false positive rate)
A perfect classifier will have a 100% sensitivity and 100% specificity. T or F
T
Data discretization is part of data reduction. T or F
T
During the process of building a decision tree, at each node, the algorithm decides which attribute to use for a new split based on some criteria evaluated for each possible attribute. T or F
T
In classification, a model or classifier is constructed to predict class (categorical) labels. T or F
T
Internal node in a decision tree denotes a test on a feature. T or F
T
Range, quartiles, variance, standard deviation and interquartile range can be used to measure dispersion or spread of numeric data. T or F
T
What are methods to measure accuracy of a classifier?
- Bootstrap
- Cross validation
- Hold out
- Random sub sampling
Bootstrap; cross validation; hold out; random sub sampling
Select all of the following that are true about Data Warehouses.
- These are considered the smallest useful unit of data.
- Data will not be modified by the end user.
- Data may be integrated and cleaned from many large sources.
- Data must be accessed through OLAP.
Data will not be modified by the end user; data may be integrated and cleaned from many large sources.
Select examples of supervised learning. Assume you have the appropriate data.
- Examine a web page, and classify whether the content on the web page should be considered "child friendly" or "adult."
- Discover that there are different categories or "types" of patients in terms of how they react to a new experimental drug
- In farming, given data on crop yields over the last 20 years, learn to predict next year's crop yields.
- Learn from historical data and determine whether a new user will respond to an ad campaign (or not).
Examine a web page, and classify whether the content on the web page should be considered "child friendly" or "adult"; in farming, given data on crop yields over the last 20 years, learn to predict next year's crop yields; learn from historical data and determine whether a new user will respond to an ad campaign (or not).
The area under the ROC curve is a measure of the _____________________ of the model.
Accuracy
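A quick sketch of computing the ROC curve and its area (assumes scikit-learn; the synthetic data and logistic regression model are just for illustration):

```python
# ROC curve and AUC sketch (assumes scikit-learn; binary toy problem).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, probs)       # fpr = 1 - specificity, tpr = sensitivity
print("AUC:", roc_auc_score(y_te, probs))  # 1.0 = perfect, 0.5 = random guessing
```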
Which of the following is true about data normalization?
- Normalization scales the range of the data into some (generally smaller) specified range.
- Z-Score normalization is useful for finding outliers because each point is represented by how far from the mean it is
- When subtracting an offset and dividing by a range we change the mean and standard deviation of data without actually changing the shape of its distribution (as seen in a histogram)
- All of the above
All of the above
Which of the following are fields that contribute to Data Mining?
- Algorithms
- Machine Learning and Statistics
- Visualization
- Hardware development (for fast computation)
- Databases
All of them
Decision trees are an algorithm for which machine learning task?
Classification
The output of KDD is
Useful Information
The two major types of data reduction are:
Dimensionality reduction and numerosity reduction (reducing the number of variables and the number of data points, respectively)
All results are interesting. T or F
F
Order the steps in the data mining pipeline.
- Pattern evaluation and knowledge presentation
- Data cleaning and preprocessing
- Data mining
- Choosing tasks/functions of data mining
- Creating a target data set
- Learning the application domain
- Choosing the mining algorithm(s)
- Data reduction and transformation
- Use of discovered knowledge
1. Learning the application domain 2. Creating a target data set 3. Data cleaning and preprocessing 4. Data reduction and transformation 5. Choosing tasks/functions of data mining 6. Choosing the mining algorithm(s) 7. Data mining 8. Pattern evaluation and knowledge presentation 9. Use of discovered knowledge
Match the sampling method to its properties.
- Repeatedly generate random samples for training (using remainder for testing).
- Split data into train and test sets randomly or with purpose.
- Split data into k pieces and evaluate with each piece as the test set while the remaining are used for training.
1. Bootstrapping
2. k-fold cross validation
3. hold out
1. Bootstrapping - Repeatedly generate random samples for training (using the remainder for testing). 2. k-fold cross validation - Split data into k pieces and evaluate with each piece as the test set while the remaining are used for training. 3. Hold out - Split data into train and test sets randomly or with purpose.
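A short sketch contrasting hold-out and k-fold evaluation (assumes scikit-learn; the decision tree and iris data are illustrative):

```python
# Hold-out vs. k-fold evaluation sketch (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Hold out: one random train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print("hold-out accuracy:", clf.fit(X_tr, y_tr).score(X_te, y_te))

# k-fold cross validation: each of k pieces serves once as the test set.
print("5-fold accuracies:", cross_val_score(clf, X, y, cv=5))
```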
A model with perfect accuracy will have an area under receiver operator characteristics curve of _______.
1.0
How many classes can SVM handle?
2 (a basic SVM is a binary classifier; multi-class problems are typically handled by combining several binary SVMs, e.g. one-vs-rest)
Which of these describes the shape of all k-means clusters?
Convex
Which of the following statements are true?
- Correlation between two features ranges between [-1, 0]
- If correlation is equal to 0 then two features are perfectly positively correlated
- If correlation is equal to -1 then two features are perfectly negatively correlated
- Correlation between two features ranges between [-1, 1]
- If correlation is equal to 1 then two features have no correlation
- If correlation is equal to 1 then two features are perfectly positively correlated
- If correlation is equal to 0 then two features have no correlation
- All of the above
If correlation is equal to -1 then two features are perfectly negatively correlated; correlation between two features ranges between [-1, 1]; if correlation is equal to 1 then two features are perfectly positively correlated; if correlation is equal to 0 then two features have no correlation.
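A small NumPy check of these facts (the series are made up for illustration):

```python
# Correlation sketch with NumPy: values lie in [-1, 1].
import numpy as np

x = np.arange(1000, dtype=float)
print(np.corrcoef(x, 2 * x + 1)[0, 1])    # 1.0: perfectly positively correlated
print(np.corrcoef(x, -3 * x + 5)[0, 1])   # -1.0: perfectly negatively correlated
rng = np.random.default_rng(0)
print(np.corrcoef(x, rng.normal(size=1000))[0, 1])  # near 0: no linear relation
```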
Which of the following are true of the k-means algorithm?
- It requires specifying k (number of clusters) in advance.
- It's not good at oddly-shaped clusters or when there are outliers.
- It creates hierarchical clusters.
- It only requires a pairwise distance matrix.
- It's sensitive to the initialization (starting positions).
- It is relatively computationally efficient (runs fast).
It requires specifying k (number of clusters) in advance; it's not good at oddly-shaped clusters or when there are outliers; it's sensitive to the initialization (starting positions); it is relatively computationally efficient (runs fast).
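A minimal k-means sketch (assumes scikit-learn; the blob data is illustrative). Note that k is fixed up front and the n_init restarts hedge against bad initializations:

```python
# k-means sketch (assumes scikit-learn): k is fixed in advance, and the
# result depends on initialization, so n_init restarts are commonly used.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # one centroid per cluster
print(km.labels_[:10])       # cluster assignment for each point
```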
We discussed one method of Feature Extraction, Principal Component Analysis (PCA). Which of the following describes PCA?
- PCA creates subsets of the original data features clustered by their variance.
- PCA creates new features from the original attributes which can efficiently account for most of the variance of the data with fewer dimensions.
- PCA automatically finds the best subset of the original data attributes that account for the most variance possible.
- PCA creates a new set of features with more dimensions than the original, maximizing the total variance.
PCA creates new features from the original attributes which can efficiently account for most of the variance of the data with fewer dimensions.
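A short PCA sketch (assumes scikit-learn; iris data is illustrative), showing how few new features can capture most of the variance:

```python
# PCA sketch (assumes scikit-learn): new features (components) ordered by
# the share of total variance they capture.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # first two components capture most variance
X_reduced = pca.transform(X)          # 4 original features -> 2 new features
print(X_reduced.shape)
```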
Which of these are classification metrics?
- Precision and Recall
- Sensitivity and Specificity
- Correlation of features with target
- Accuracy
Precision and recall; sensitivity and specificity; accuracy
We've discussed several uses of clustering. Which of the following are included?
- Smoothing noise
- Numerosity reduction
- Finding outliers
- Creating a model to classify data points
Smoothing noise; numerosity reduction; finding outliers
Hierarchical clustering is much slower than k-means. Which of these are potential advantages of hierarchical clustering over k-means?
- The number of clusters to examine can be chosen after the algorithm is run.
- The algorithm builds clusters up from the lowest scatter criteria value to the highest.
- Running hierarchical clustering multiple times with the same parameters returns the same result.
- It's not necessary to know the data values themselves, just the pairwise distances, so categorical values can be dealt with.
The number of clusters to examine can be chosen after the algorithm is run; running hierarchical clustering multiple times with the same parameters returns the same result; it's not necessary to know the data values themselves, just the pairwise distances, so categorical values can be dealt with.
Select the pieces of advice that were provided for people new to using data mining.
Understand the needs of the end user (along with the problem domain); data preparation (cleaning and pre-processing); the only way to be sure a technique works well is to try it with your data; remember that business or scientific need is a higher priority than technical excitement.
Which of the following are ways to deal with missing data values?
- Use a special value like "unknown" to capture that there is meaning to the fact that the value is missing.
- All you can do is use only the data mining algorithms that can handle data with values missing.
- Replace with the average value of the attribute among data points with the same class.
- Predict the missing value with a model based on the data you do have (i.e. classification or regression).
Use a special value like "unknown" to capture that there is meaning to the fact that the value is missing; replace with the average value of the attribute among data points with the same class; predict the missing value with a model based on the data you do have (i.e. classification or regression).
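A sketch of the class-conditional mean replacement (assumes pandas; the tiny data frame is made up for illustration):

```python
# Class-conditional mean imputation sketch (assumes pandas; toy data).
import numpy as np
import pandas as pd

df = pd.DataFrame({"cls": ["a", "a", "a", "b", "b", "b"],
                   "x":   [1.0, 3.0, np.nan, 10.0, np.nan, 14.0]})
# Replace each missing x with the mean of x among rows of the same class.
df["x"] = df.groupby("cls")["x"].transform(lambda s: s.fillna(s.mean()))
print(df)  # the "a" gap becomes 2.0, the "b" gap becomes 12.0
```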
The result of clustering is
A grouping of data points (each data point in a group); equivalently, a label for each data point showing which group it belongs in.
Binning numerical data into chunks (bins) can be useful for
Dealing with noisy data by smoothing out lots of variation into chunks with reasonable ranges; drawing a histogram.
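A minimal equal-width binning sketch (assumes pandas; the values are made up, and replacing by bin means is one of several smoothing options):

```python
# Equal-width binning sketch (assumes pandas); bin means smooth the noise.
import pandas as pd

values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = pd.cut(values, bins=3)                      # 3 equal-width chunks
smoothed = values.groupby(bins).transform("mean")  # replace by the bin mean
print(pd.DataFrame({"value": values, "bin": bins, "smoothed": smoothed}))
```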
Which of these indicate high quality of clustering?
- high intra-class similarity
- low inter-class similarity
- corresponds to some partitioning of the data with real-life meaning
- low sum over data of distance to adjacent cluster
High intra-class similarity; low inter-class similarity; corresponds to some partitioning of the data with real-life meaning.
As opposed to supervised learning, unsupervised learning
Includes clustering; finds groups in data without provided labels.
Clustering points ...
Is related to anomaly detection; is based on the similarities between them; does not require any labelled data.
What is the minimum and maximum value for Misclassification Error?
0 and 0.5 (two-class case; the maximum occurs when the classes are evenly split)
What is the minimum and maximum value of GINI?
0 and 0.5 (two-class case)
What is the minimum and maximum value for entropy?
0 and 1 (two-class case, using log base 2)
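A small sketch (assumes NumPy) evaluating the standard two-class impurity formulas at their extremes, confirming the ranges above:

```python
# Impurity measures for a two-class node with class probabilities (p, 1 - p).
import numpy as np

def gini(p):              # max 0.5 at p = 0.5, min 0 at p in {0, 1}
    return 1 - p**2 - (1 - p)**2

def entropy(p):           # max 1.0 at p = 0.5 (log base 2), min 0 at p in {0, 1}
    terms = [x * np.log2(x) for x in (p, 1 - p) if x > 0]
    return -sum(terms)

def misclass_error(p):    # max 0.5 at p = 0.5, min 0 at p in {0, 1}
    return 1 - max(p, 1 - p)

for p in (0.0, 0.5, 1.0):
    print(p, gini(p), entropy(p), misclass_error(p))
```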
Arrange the following in sequence: Information, Data, Wisdom, Knowledge
1. Data 2. Information 3. Knowledge 4. Wisdom
Match the chart type to the data properties.
- Good for categorical X values and cases where the Y value is ratio scaled.
- Implies some importance of the connection between the data points
- Can handle multiple Y values per X value
1. Bar Chart
2. Line Graph
3. Scatter Plot
1. Bar Chart - Good for categorical X values and cases where the Y value is ratio scaled. 2. Line Graph - Implies some importance of the connection between the data points. 3. Scatter Plot - Can handle multiple Y values per X value.
Match the normalization method to its description.
- Decimal scaling
- Z-score normalization (standardization)
- Min-max normalization
1. the new values tell how many standard deviations the sample is from the mean of the original data.
2. result is guaranteed to be between -1 and 1, but original zeros stay zero
3. the values are linearly scaled from one interval into another; the middle value means nothing special.
1. Z-score normalization - the new values tell how many standard deviations the sample is from the mean. 2. Decimal scaling - the result is guaranteed to be between -1 and 1, but original zeros stay zero. 3. Min-max normalization - the values are linearly scaled from one interval into another.
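A NumPy sketch of all three normalizations (the values are made up; the decimal-scaling exponent is the usual smallest j with max|v| / 10^j < 1):

```python
# Normalization sketch with NumPy: min-max, z-score, and decimal scaling.
import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

minmax = (x - x.min()) / (x.max() - x.min())  # linearly mapped into [0, 1]
zscore = (x - x.mean()) / x.std()             # standard deviations from the mean
j = np.floor(np.log10(np.abs(x).max())) + 1   # smallest j with max|x|/10^j < 1
decimal = x / 10 ** j                         # |result| < 1; zeros stay zero

print(minmax, zscore, decimal, sep="\n")
```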
Decision trees are created by solving an optimization problem over the set of all possible attribute-node pairs. T or F
F
During the process of building a decision tree, at each node, we consider the similarities between all the points and take the most similar ones to create the next nodes based on a criterion evaluated at each similar group of data. T or F
F
If all available data cleaning algorithms are run in sequence, there is no need to include human judgement in the process. T or F
F
If the covariance between two variables x and y is equal to 0 then we can say for certain that x and y are independent. T or F
F
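A concrete counterexample sketch (assumes NumPy): y is completely determined by x, yet their covariance is 0 because x is symmetric about 0:

```python
# Cov(x, y) = 0 does not imply independence: y = x**2 depends on x exactly,
# yet the covariance vanishes when x is symmetric about 0.
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2
print(np.cov(x, y)[0, 1])  # 0.0, even though y is a function of x
```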
Internal node in a decision tree has more than one incoming edge and two or more outgoing edges. T or F
F
Non-homogeneous class distributions are preferred when determining the best split. T or F
F
Running k-means multiple times with the same number of clusters and same distance function will always produce the same result. T or F
F
Scatter plot is not an effective graphical method to look for correlation between two numerical variables. T or F
F
The cost associated with a false positive (incorrectly yet conservatively labeling a noncancerous patient as cancerous) is far greater than a false negative (such as incorrectly predicting that a cancerous patient is not cancerous). T or F
F
Identify all the ways to measure the central tendency for a set of data objects.
Mean, median, midrange, and mode
Data mining is
Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
The main criterion optimized in methods for projecting high-dimensional data to 2D (like MDS) is that
Pairwise distances between points in the new 2D space are as close as possible to the corresponding distances in high-dimensional space.
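A minimal MDS sketch (assumes scikit-learn; iris is illustrative). The stress_ value it reports is the mismatch between the 2D and original pairwise distances that MDS minimizes:

```python
# MDS sketch (assumes scikit-learn): embed 4-D iris data into 2-D while
# preserving pairwise distances as well as possible.
from sklearn.datasets import load_iris
from sklearn.manifold import MDS

X, _ = load_iris(return_X_y=True)
mds = MDS(n_components=2, random_state=0)
X_2d = mds.fit_transform(X)         # 150 x 2 embedding
print(X_2d.shape, mds.stress_)      # stress: total pairwise-distance mismatch
```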
A good classifier will have a ROC curve that
Passes close to the upper left-hand corner (high sensitivity at a low false positive rate)
Looking at clusters when smoothing data helps you see outliers with respect to multiple variables at a time, as opposed to outliers in just one dimension. T or F
T
Receiver operating characteristic curves are a useful visual tool for comparing two classification models. T or F
T
Root node in a decision tree has no incoming edges and zero or more outgoing edges. T or F
T
SVMs treat all data as numeric variables. T or F
T
Terminal node in a decision tree has exactly one incoming edge and no outgoing edge. T or F
T
The closer the area is to 0.5, the less accurate the corresponding model is. T or F
T
The key difference between supervised and unsupervised machine learning problems is due to the presence or absence of labeled data. T or F
T
The process of building a decision tree is recursive. It is built one node at a time and the algorithm is applied at each node the same way. T or F
T
The true positives, true negatives, false positives, and false negatives are also useful in assessing the costs and benefits (or risks and gains) associated with a classification model. T or F
T
There is no cloud. T or F
T
Which of these describes a linear separator type of classifier?
The classifier is a hyperplane separating the data of different classes
Which of the following are the parameters to hierarchical clustering?
The distance metric for data points (or their pairwise distances) and the distance function for clusters.
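A short agglomerative-clustering sketch (assumes SciPy; the points are made up), showing both parameters explicitly:

```python
# Agglomerative clustering sketch (assumes SciPy): needs a point-distance
# metric and a cluster-distance (linkage) function; k can be chosen afterwards.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])
D = pdist(X, metric="euclidean")        # pairwise point distances
Z = linkage(D, method="average")        # average-link cluster distance
print(fcluster(Z, t=3, criterion="maxclust"))  # cut the tree into 3 clusters
```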
Which of these describes the problem of estimating accuracy on a classifier?
- The standard deviation is more important than accuracy
- The model learning process means that accuracy on the training data is not representative of general performance, so separate testing data is needed
- Accuracy can only measure where the classifier is correct on a given sample
- Making a prediction to test is too expensive
The model learning process means that accuracy on the training data is not representative of general performance, so separate testing data is needed
Which of these are true of using clustering for smoothing?
- Clustering is used for replacing missing values, not smoothing.
- We replace data points by an average or representatives of points in their cluster.
- Each cluster must have the same number of points.
- The best smoothing for a point uses centers of other clusters.
We replace data points by an average or representatives of points in their cluster.
Which of the following are steps of the data science pipeline (note: processes not products)?
a. Data Selection
b. Data Preprocessing
c. Data transformation
d. Databases
e. Data Mining
f. Evaluation/Interpretation
g. Knowledge
a. Data Selection; b. Data Preprocessing; c. Data transformation; e. Data Mining; f. Evaluation/Interpretation
Using slack variables in the SVM optimization allows us to
aim for a more general model by sacrificing correctness on some of the training data
In data science, visualization is used
at the beginning of the process to understand data, in the middle to debug results and at the end to communicate.
What are other names for features?
Attributes, predictors, explanatory variables
_________ is a summarization of the general characteristics or features of a specific set of data.
a. Data selection
b. Data characterization
c. Data discrimination
d. Data classification
b. Data characterization
Which of the following is an example of data mining?
a. Looking up a number in a phone book
b. Discovering groups of companies that are performing similarly on the stock exchange
c. Finding out how many graduate students are in your database of students
d. Querying an existing search engine
b. Discovering groups of companies that are performing similarly on the stock exchange
Which of the following is not a data mining functionality?
a. Clustering and Outlier Analysis
b. Selection and interpretation
c. Characterization and Discrimination
d. Classification and regression
b. Selection and interpretation
A classifier built on data with 3 categorical variables and 6 numerical variables...
can only be used to classify data with the same variables in the same order
To classify a point v with SVM...
check the sign of w dot v + b (if positive, predict positive; if negative, predict negative)
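A sketch verifying this rule (assumes scikit-learn; toy data, and a linear-kernel SVC so that w and b are exposed as coef_ and intercept_):

```python
# SVM decision rule sketch: sign(w . v + b) gives the predicted class.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [4, 4], [5, 5]])
y = np.array([0, 0, 1, 1])
svm = SVC(kernel="linear").fit(X, y)

w, b = svm.coef_[0], svm.intercept_[0]
v = np.array([4.5, 4.0])
print(np.sign(w @ v + b))   # +1 -> positive class, -1 -> negative class
print(svm.predict([v]))     # agrees with the sign test
```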
Recall can be thought of as a measure of _____________
completeness
A classifier is used to
discover a pattern that can predict the class that a new data instance falls into
Text data can be stored in a matrix with a "bag-of-words" model. This means:
each row represents a unit of text (e.g. document) and each column represents a word.
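A bag-of-words sketch (assumes scikit-learn's CountVectorizer; the documents are made up):

```python
# Bag-of-words sketch: rows = documents, columns = words (counts).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat", "the cat and the dog"]
vec = CountVectorizer()
M = vec.fit_transform(docs)             # sparse document-by-word count matrix
print(vec.get_feature_names_out())      # the column (word) order
print(M.toarray())                      # one row per document
```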
Precision can be thought of as a measure of ____________
exactness
The dot product between an SVM model's w and a new vector v tells
how far the data point is from the hyperplane
True positives refer to
positive tuples that were correctly classified as such by the classifier
The special subset of data critical to defining the hyperplane are called ____, and the fact that they are the only points that actually matter makes the classifier ____
support vectors, robust to noise
The special column of a data table that specifies the desired output of a trained classifier is called
target class label
Line graphs show a different type of pattern from bar graphs because
the lines imply there is a connection between the plotted points, often helping showcase a trend.
Which of the following best describes the intuition behind SVM as a specific linear separator?
the separating hyperplane is chosen to maximize the margin between different classes to improve generalizability
The kernel trick in SVM allows us to change...
the shape of the hyperplane, so it can curve in the data space
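A sketch of the effect (assumes scikit-learn; the two-rings data is a standard illustration where no straight hyperplane can separate the classes):

```python
# Kernel trick sketch: an RBF kernel lets the decision boundary curve
# in the original space (here, a circle separating two rings).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
linear = SVC(kernel="linear").fit(X, y)  # straight hyperplane: fails here
rbf = SVC(kernel="rbf").fit(X, y)        # curved boundary: separates the rings
print("linear accuracy:", linear.score(X, y))
print("rbf accuracy:   ", rbf.score(X, y))
```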
Classification algorithms learn from data examples. These data are called ____ and they must include a special variable called the ____
training data, label
What is the relationship between w and SVM's separating hyperplane?
w is a vector learned by the algorithm; it helps define the hyperplane and points perpendicular to it