Data Mining Final Exam
What does data mining aim to do?
1) Identify patterns in data 2) Validate patterns identified in the data 3) Use the patterns identified in the data for prediction
How do you construct a decision tree?
1) Select an attribute to place at the root node and make one branch for each possible value. 2) Split the instances into subsets, one for each branch extending from the node. 3) Repeat recursively for each branch using only instances of that branch. 4) Stop if all instances have the same class
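A minimal Python sketch of this recursive procedure, assuming nominal attributes and a small in-memory dataset (all names here are illustrative, and attribute selection is stubbed out):

    from collections import Counter

    def majority_class(rows):
        # Most common class label among the instances.
        return Counter(r["class"] for r in rows).most_common(1)[0][0]

    def choose_attribute(rows, attributes):
        # Placeholder: a real implementation picks the attribute with the
        # highest information gain (see the attribute-selection card below).
        return attributes[0]

    def build_tree(rows, attributes):
        classes = {r["class"] for r in rows}
        if len(classes) == 1:          # all instances share one class: stop
            return classes.pop()
        if not attributes:             # nothing left to split on
            return majority_class(rows)
        attr = choose_attribute(rows, attributes)
        branches = {}
        for value in {r[attr] for r in rows}:
            subset = [r for r in rows if r[attr] == value]
            rest = [a for a in attributes if a != attr]
            branches[value] = build_tree(subset, rest)   # recurse per branch
        return (attr, branches)

    rows = [{"outlook": "sunny", "windy": "false", "class": "no"},
            {"outlook": "sunny", "windy": "true", "class": "no"},
            {"outlook": "rainy", "windy": "false", "class": "yes"}]
    print(build_tree(rows, ["outlook", "windy"]))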
A "divide-and-conquer" approach to the problem of learning data attribute values results in what kind of model?
A Decision Tree Model
What is a Classification Rule?
A classification rule predicts the class or category of an instance based on the values of its attributes.
What are Lift Charts?
A lift chart is a visual data mining tool for targeting problems such as a promotional mailout to 1,000,000 households, where the goal is to identify the subset of, say, 100,000 households most likely to respond. It is about identifying which subset of the population will likely respond in order to reduce costs. The overall goal is to find subsets of test instances that have a higher proportion of positive instances than the test set as a whole.
What is Prepruning?
A pruning strategy where you stop growing the decision tree when there is no statistically significant association between any attribute and the class at a particular node. Early stopping is a risk: attributes that are uninformative individually but informative in combination may be discarded, so crucial information can be lost.
What is an Association Rule?
An association rule focuses on discovering relationships and patterns between different items in a dataset, typically in the context of market basket analysis or similar scenarios.
What is the Zero-Frequency problem?
An attribute value does not occur with every class value. You solve this by adding 1 to the count for every attribute value-class combination and the probabilities will never be zero (the Laplace estimator).
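A tiny numeric illustration of the Laplace correction (the counts are made up):

    count = 0    # times attribute value v occurred with class c
    total = 8    # training instances of class c
    k = 3        # number of possible values of the attribute

    p_unsmoothed = count / total              # 0.0 -- zeroes the whole product
    p_laplace = (count + 1) / (total + k)     # 1/11 -- never zero
    print(p_unsmoothed, p_laplace)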
In this type of learning, there is no specified class.
Association Learning
What is Association Learning?
Association Learning (Apriori Algorithm) - a learning scheme focused on looking at a dataset and extracting important association information based on the present data. Association learning can predict any attribute's values, not just the class, and more than one attribute's value at a time.
What are association rules? Name the algorithm that is used for association rules. How can the compute-intensive problem of the algorithm be circumvented?
Association rules can predict any attribute, not just the class. It involves the coverage of an association rule (also called support) and the accuracy of an association rule (also called confidence). The algorithm that is used for association rules is called Apriori. The compute intensive problem of the algorithm can be circumvented by restricting the output to only showing the most predictive associations (those with high support and confidence).
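A short sketch of support and confidence on toy transactions (the basket contents are invented; a full Apriori implementation would also prune candidate itemsets whose subsets are infrequent):

    transactions = [
        {"bread", "milk"},
        {"bread", "diapers", "beer"},
        {"milk", "diapers", "beer"},
        {"bread", "milk", "diapers"},
        {"bread", "milk", "beer"},
    ]

    def support(itemset):
        # Fraction of transactions containing every item in the itemset.
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent):
        # Of the transactions the antecedent covers, the fraction that
        # also contain the consequent.
        return support(antecedent | consequent) / support(antecedent)

    print(support({"bread", "milk"}))        # coverage of {bread, milk}: 0.6
    print(confidence({"bread"}, {"milk"}))   # accuracy of bread -> milk: 0.75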
What is the basic idea of the Perceptron & Winnow?
At each iteration, if the observation given to the algorithm is correctly classified, do nothing. If the observation is incorrectly classified, move the decision boundary toward that point.
What two things do Bayesian modeling of data assume?
Bayesian modeling assumes that the data attributes are independent and that the attributes are equally important.
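A minimal Naive Bayes sketch for nominal attributes that leans on both assumptions (per-attribute likelihoods are simply multiplied, and every attribute contributes equally); the data layout and names are illustrative, with the Laplace estimator applied to avoid zero probabilities:

    from collections import Counter

    def naive_bayes_predict(rows, attributes, instance):
        class_counts = Counter(r["class"] for r in rows)
        n_values = {a: len({r[a] for r in rows}) for a in attributes}
        best, best_score = None, -1.0
        for c, cc in class_counts.items():
            score = cc / len(rows)                       # prior P(c)
            for a in attributes:
                match = sum(1 for r in rows
                            if r["class"] == c and r[a] == instance[a])
                # Independence assumption: multiply per-attribute
                # likelihoods; the +1 is the Laplace estimator.
                score *= (match + 1) / (cc + n_values[a])
            if score > best_score:
                best, best_score = c, score
        return best

    rows = [{"outlook": "sunny", "class": "no"},
            {"outlook": "sunny", "class": "no"},
            {"outlook": "overcast", "class": "yes"},
            {"outlook": "rainy", "class": "yes"}]
    print(naive_bayes_predict(rows, ["outlook"], {"outlook": "sunny"}))  # no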
What is Classification Learning?
Classification Learning (supervised) - the model is trained to learn the relationship between input data and their corresponding classes/labels. The classification learning algorithm uses the provided scheme along with the actual outcomes to learn the patterns in the data.
What is a concept? There are four to focus on within the field of data mining.
Classification Learning, Numeric Prediction, Association Learning, Clustering
What is Clustering?
Clustering (unsupervised) focuses on grouping items in a dataset together based on some set of similar features. You want to group items that are more similar to each other than they are to items in other clusters.
What is Data Mining?
Data mining is a method where you are trying to extract information from data that is potentially useful, previously unknown, or implicit. Essentially, we want to detect patterns and regularities in the data for prediction purposes. If there are strong patterns or regularities in the data, the model for predicting will be very good!
The correct order of the life cycle of a data mining project involves these four steps:
Data understanding, Data preparation, Data modeling, and Evaluation
How do Decision Trees and Covering Rules differ?
Decision tree splits take all classes into account, whereas the covering algorithm covers only a single class at a time.
How many rule sets do decision trees use for each class?
Decision trees use MORE THAN ONE rule set for each class
Explain Decision Trees
Decision trees utilize a divide-and-conquer approach when learning from a set of independent instances. Each node in a decision tree involves testing a particular attribute. Each test compares an attribute value with a constant.
Explain the Covering Algorithm for generating classification rules
Essentially, you keep adding conditions to a rule in order to improve its accuracy; at each step, you add the condition that improves the accuracy the most.
What is understood by evaluating a model used for data mining?
Evaluating a model used for data mining is understanding how the chosen model will perform on real-world data it has not seen. The model must take in numeric attributes, allow missing values, and deal with noisy data effectively. When performed correctly, evaluation takes place after the model has been trained, using independent test data, and the resulting estimate guides predictive modeling, regression, and classification.
Finding the telephone number of your friend from a phone directory is a data mining activity.
False
What is Holdout Estimation?
If data is limited, reserve a certain amount for testing and use the remainder for training. Typically, 1/3 is for testing and the other 2/3 is for training. Stratification is used so that each class is properly represented in both subsets.
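A quick holdout split, here sketched with scikit-learn's train_test_split on made-up data; stratify=y is what performs the stratification:

    from sklearn.model_selection import train_test_split

    X = [[0], [1], [2], [3], [4], [5], [6], [7], [8]]
    y = [0, 0, 0, 1, 1, 1, 0, 1, 0]

    # Reserve one third for testing; stratify=y keeps the class
    # proportions roughly equal in both subsets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=42)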
What is supervised learning?
In supervised learning, the machine is trained on a set of labeled data, which means that the input data is paired with the desired output. It is often used for tasks like classification and regression.
What is unsupervised learning?
In unsupervised learning, the machine is trained on a set of unlabeled data, which means that the input data is not paired with the desired output. It is often used for tasks like clustering and dimensionality reduction.
Decision Trees use what quantity to choose a node?
Information Gain
What is Instance-Based Representation?
Instead of trying to create rules, work directly from the examples themselves. So you utilize the instances themselves to represent what is learned rather than inferring a rule set or decision tree and storing it instead. You also compare each new instance with existing ones using a distance metric, and the closest existing instance is used to assign the class to the new one.
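A bare-bones 1-nearest-neighbor sketch with Euclidean distance (the stored examples are the model; the data is invented):

    import math

    def nearest_neighbor(instances, query):
        # instances: list of (attribute_vector, class). Nothing is inferred
        # or stored beyond the examples themselves.
        def distance(a, b):
            return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
        closest = min(instances, key=lambda inst: distance(inst[0], query))
        return closest[1]    # class of the closest stored instance

    train = [((1.0, 1.0), "a"), ((5.0, 5.0), "b"), ((1.5, 2.0), "a")]
    print(nearest_neighbor(train, (1.2, 1.1)))   # -> "a"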
What is Instance Based Learning also referred to?
It can also be called plain memorization.
In a rule based classifier, the precondition of a rule is called what?
It is called an antecedent.
In a rule based classifier, the conclusion that gives the class is called what?
It is called the consequent.
Why is a hidden layer in a neural network important?
It is required to represent non-linearity.
What is Bootstrap?
It is sampling with replacement. Bootstrap samples a dataset of "n" instances "n" times with replacement to form a new dataset of n instances and uses this as the training set; the instances that were never picked form the test set.
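A small simulation of the resampling (the data values are arbitrary):

    import random

    random.seed(1)
    data = list(range(1000))                   # n instances

    # Sample n times *with* replacement to form the training set.
    train = [random.choice(data) for _ in data]
    # Instances never picked serve as the test set.
    test = set(data) - set(train)

    # Roughly 63.2% of the distinct instances land in the training set:
    print(len(set(train)) / len(data))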
Machine learning involves what?
Machine learning involves techniques that capture structural descriptions from data.
What is Market Basket Analysis?
Market Basket Analysis is an unsupervised data mining technique used to uncover purchasing patterns in a retail setting. Essentially, it is analyzing the combination of products which have been bought together.
How does 1R deal with missing values and numeric attributes?
Missing values are treated as another attribute value. Numeric attributes are converted into nominal attributes using a simple discretization method.
What do Radial Basis Function networks use?
Like multilayer perceptrons, RBF networks are feedforward networks, but their hidden units compute radial basis functions centered on particular instances (the centers serve as the hidden units).
Is a maximum margin hyperplane one that classifies all the data in a two class problem correctly?
No, a maximum margin hyperplane does not classify all the data in a binary classification problem correctly.
Is backpropagation used to train SVMs?
No, backpropagation is not used to train SVMs.
Are attributes with a large number of possible values preferred in decision trees?
No. Attributes with many possible values fragment the data into small subsets and invite overfitting, so decision trees favor attributes with fewer values; measures such as the gain ratio correct information gain's bias toward many-valued attributes.
Are linear models capable of representing non-linear boundaries between classes?
No, linear models are limited to representing linear relationships between input features and the output. Linear models assume that the decision boundary, which separates different classes in a classification task or models the relationship in a regression task, is a linear function of the input features.
Are the most sophisticated models the best classifiers?
No, the simplest models that also have a high predictive accuracy are the best classifiers.
How are Nominal and Numeric attributes tested in a Decision Tree?
Nominal attributes are tested based on equality and inequality conditions. Each branch from a node corresponds to a specific category or value of the nominal attribute (car color is either black, white, or gray). Nominal attributes will only be tested once in a decision tree. Numeric attributes use inequality conditions to split the data. The goal is to find a threshold value that optimally separates the data into two subsets. Numeric attributes may be tested multiple times from root to leaf, with each test involving a different threshold value.
What is Numeric Prediction?
Numeric Prediction (supervised) is a variant of classification learning. Numeric Prediction is where the "class" is numeric and we are predicting some quantity. For example, play time (numeric) is predicted based on outlook, temperature, humidity, and windy.
How are numeric data attributes handled when building decision trees?
Numeric attributes are handled by creating binary splits.
What are Classification Rules?
One rule is generated for each leaf. The antecedent includes a condition for every node on the path down the tree. The consequent is the class assigned by the leaf. In other words, the antecedent describes the conditions to be satisfied and the consequent specifies the predicted class label based on the antecedent's conditions.
A neural network model with several hidden layers may suffer from what?
Overfitting and curse of dimensionality.
What is PRISM?
PRISM is a separate-and-conquer method for constructing rules, and it only generates correct or "perfect" rules. It does this by continuously adding clauses to each rule until its accuracy, measured as p/t (positive instances covered over total instances covered), reaches 100%.
What three problems are present in data mining?
Problem 1: Most patterns are not interesting Problem 2: Patterns may be inexact Problem 3: Data may be garbled or missing
What are receiver operating characteristic curves? How are they useful in choosing models?
Receiver Operating Characteristic Curves (ROC Curves) are graphical representations used to assess the performance of a classification model. They originate in signal detection, where they depict the tradeoff between the hit rate (TPR) and the false alarm rate (FPR) over a noisy channel. ROC curves are useful in choosing models because the curve that lies closer to the top-left corner (a higher true-positive rate for a given false-positive rate) indicates the better model in that operating region.
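A small sketch that traces out ROC points by sweeping a threshold over predicted scores (the scores and labels are fabricated):

    def roc_points(scores, labels):
        # At each threshold, record false-positive rate (x) against
        # true-positive rate (y).
        pos = sum(labels)
        neg = len(labels) - pos
        points = []
        for t in sorted(set(scores), reverse=True):
            tp = sum(s >= t and y == 1 for s, y in zip(scores, labels))
            fp = sum(s >= t and y == 0 for s, y in zip(scores, labels))
            points.append((fp / neg, tp / pos))
        return points

    scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
    labels = [1,   1,   0,   1,   0,    1,   0,   0]
    print(roc_points(scores, labels))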
What models do Support Vector Machines use?
SVMs use linear models to represent non-linear class boundaries. SVMs are algorithms for learning linear classifiers. They are resilient to overfitting because they learn a linear decision boundary called the maximum margin hyperplane.
What is the most common strategy for covering algorithms?
Separate-and-conquer
How are missing values handled in decision trees?
Sometimes they're treated as attribute values, or sometimes they're sent down the most popular branch.
Market basket data is an example of what data?
Sparse data
Signatures in data that can be examined, reasoned about and used to inform future decisions are called
Structural Patterns
Given a set of features that describe a class, each record of that data table can be described as a rule or using decision trees. What are these called?
Structural Patterns.
Instances closest to the maximum margin hyperplane are called what?
Support Vectors. The maximum margin they define helps reduce overfitting.
What is the standard way of measuring the error rate of a learning scheme?
Tenfold Cross Validation
How does the 1R classifier classify data?
The 1R Classifier classifies data by testing a single attribute. For each feature, the 1R classifier evaluates the error rate of predicting the most common class based on that feature alone.
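A compact 1R sketch for nominal attributes (the data layout is illustrative):

    from collections import Counter, defaultdict

    def one_r(rows, attributes):
        # Try each attribute alone; keep the one with the fewest errors.
        best = None
        for a in attributes:
            by_value = defaultdict(Counter)
            for r in rows:
                by_value[r[a]][r["class"]] += 1
            # One rule per value: predict that value's most common class.
            rules = {v: cnt.most_common(1)[0][0]
                     for v, cnt in by_value.items()}
            errors = sum(sum(cnt.values()) - cnt[rules[v]]
                         for v, cnt in by_value.items())
            if best is None or errors < best[0]:
                best = (errors, a, rules)
        return best    # (error count, chosen attribute, value -> class)

    rows = [{"outlook": "sunny", "class": "no"},
            {"outlook": "rainy", "class": "yes"},
            {"outlook": "sunny", "class": "no"}]
    print(one_r(rows, ["outlook"]))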
What is the Laplace Estimator?
The Laplace estimator, also referred to as pseudocounts, is used to account for a value that is unseen in the data but could be present in the universe of data.
Describe the minimum description length principle
The MDL Principle refers to the fact that simple theories are preferable to complex theories. It references Occam's Razor, which states "the best scientific theory is the one that is the smallest and can explain all the facts." The goal of the MDL principle is to find a classifier with minimal description length. The description length is the space required to describe a theory plus the space required to describe the theory's mistakes. The theory is the classifier and the theory's mistakes are the errors on the training data.
Compare and contrast the Zero-R and 1-R classifiers.
The Zero-R classifier differs from the 1-R classifier because it assigns new data to the majority class whereas the 1-R classifier classifies data by testing a single attribute. Both of these classifiers classify data using nominal data attributes.
What is Cross Validation?
The basic idea is to divide the dataset into "k" equal-sized subsets. Subsets are commonly referred to as folds and are usually stratified (equal class representation). The model is trained on k-1 of these folds and tested on the remaining fold, rotating so that each fold serves as the test set once, and the error estimates are averaged.
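An unstratified sketch of the fold bookkeeping (real k-fold CV would shuffle and stratify the data, then average the k test-fold error rates):

    def k_fold_indices(n, k):
        # Split indices 0..n-1 into k roughly equal folds; train on k-1
        # folds and test on the held-out fold, rotating through all k.
        folds = [list(range(i, n, k)) for i in range(k)]
        for i in range(k):
            train = [j for f in folds if f is not folds[i] for j in f]
            yield train, folds[i]

    for train_idx, test_idx in k_fold_indices(10, 5):
        print(test_idx)    # each instance is tested exactly once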
What is 0.632 Bootstrap?
The chance of a particular instance not being picked for the training set is (1 - 1/n)^n ≈ e^-1 ≈ 0.368, so the training set contains about 63.2% of the distinct instances, and the never-picked 36.8% form the test set. Bootstrap may be the best way of estimating the error on very small datasets.
Regardless of the type of learning involved, the thing to be learned is referred to as what?
The concept
The output produced by a learning scheme is called what?
The concept description
When Generating Good Rules, why have a growing set and a pruning set?
The growing set is used to form a rule using a covering algorithm, and the pruning set is used for reducing errors (reduced error pruning). REP involves some original data being held back and used as an independent test set.
The computational complexity problems associated with SVMs are solved using what trick?
The kernel trick. This technique helps address computational complexity problems associated with explicitly transforming the data into a higher-dimensional space.
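A tiny demonstration of the idea: a degree-2 polynomial kernel computed in the original space matches a dot product in an explicitly mapped space, so the mapping never has to be materialized (the points are arbitrary):

    import math

    def phi(x):
        # Explicit degree-2 feature map for 2-D input:
        # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
        x1, x2 = x
        return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

    def poly_kernel(x, z):
        # Same quantity computed in the original space: (x . z)^2
        return (x[0] * z[0] + x[1] * z[1]) ** 2

    x, z = (1.0, 2.0), (3.0, 0.5)
    explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
    print(explicit, poly_kernel(x, z))   # both 16.0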
Rules that classify a small portion of the data may be overfitting the data. How does this make the model perform?
The model will perform poorly on the test data set.
In subtree raising, the raising is restricted to what branch?
The most popular branch.
What is Leave-One-Out Cross Validation?
The number of folds is equal to the number of instances in the dataset. Essentially, for "n" training instances, build the classifier "n" times. This form of CV makes the best use of data and involves no random subsampling.
Explain Linear Models
The output of a linear regression model is a weighted sum of the attribute values, with a weight applied to each attribute. All the attribute values are numeric. The point of a linear model is to predict a numeric quantity.
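A short least-squares sketch with NumPy on fabricated numeric data; the fitted model predicts by taking a weighted sum of the attribute values:

    import numpy as np

    X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])
    y = np.array([5.0, 3.5, 7.0, 11.0])

    # Append a bias column, then solve least squares for the weights.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

    print(Xb @ w)    # predictions: weighted sums of the attributes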
What is the perceptron learning rule? What type of data can be classified using a perceptron?
The perceptron learning rule is a supervised learning algorithm for binary classification tasks. The perceptron is a type of artificial neuron that takes multiple binary inputs, applies weights to those inputs, sums them up, and passes the result through an activation function to produce an output. The perceptron learning rule is used to adjust the weights of the inputs during the training process. The perceptron learning rule works well when data is linearly separable.
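A minimal perceptron training loop on a linearly separable toy problem (AND); labels are in {-1, +1} and the bias is folded in as a constant input:

    def train_perceptron(data, epochs=20):
        w = [0.0] * (len(data[0][0]) + 1)
        for _ in range(epochs):
            for x, y in data:
                xb = list(x) + [1.0]    # constant input for the bias
                pred = 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else -1
                if pred != y:
                    # Additive update: nudge the boundary toward the
                    # misclassified point (Winnow would instead multiply
                    # the weights by a user-set factor).
                    w = [wi + y * xi for wi, xi in zip(w, xb)]
        return w

    data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]  # AND
    print(train_perceptron(data))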
If an attribute is continuous, then the probability of the attribute to have a particular value can be calculated using what function?
The probability density function
What is meant by "learning scheme"?
The process by which an algorithm learns to make predictions.
Justify the statement "The error rate on training set is not likely to be a good indicator of future performance"
The statement "the error rate on training set is not likely to be a good indicator of future performance" is true because we use algorithms to train a certain set of data in order to effectively test with it. So, if the error rate is high in the training set, then the test set is likely to perform poorly.
What is an Example?
These are also referred to as instances. These are the things to be classified, associated, or clustered.
What are Association Rules?
These rules can predict any attribute, not just the class. The coverage of an association rule is the number of instances for which it predicts correctly. The accuracy of an association rule is the number of instances that it predicts correctly, expressed as a proportion of all instances to which it applies.
What are Multilayer Perceptrons?
They consist of an input layer, hidden layers, and an output layer. The parameters of MLPs are found through back-propagation. An MLP is learned by determining the weights of a fixed network structure and uses gradient descent to minimize error.
How does the Perceptron & Winnow differ?
They're both methods for binary classification. The perceptron and winnow, however, differ in how their weights are updated. The perceptron uses an additive rule that alters the weight vector by adding/subtracting the instance's attribute vector. Winnow uses multiplicative updates and alters weights individually by multiplying them by a user-specified parameter.
What is Subtree Raising?
This involves having a lower subtree replace a subtree above.
In this type of model, the output is just a weighted sum of the attribute values, with the weights chosen to match the desired output.
This is a linear model.
When pruning decision trees, prepruning is a suitable choice for attributes that are informative when taken together.
This is inaccurate because prepruning decisions are typically made based on individual features and their relationships with the target variable, not on combinations of attributes.
In this pruning technique, some of the original data is held back and used as an independent test set.
This is referred to as reduced-error pruning.
What is denormalization?
This is the process of generating a flat file, where several relations are joined together to make one flat file.
Trees with numeric attributes are complex, and tests on a numeric attribute are not located together but can be scattered along the path.
This is true.
What is an Attribute?
This is what each example/instance is characterized by. These are the nominal, ordinal, interval, and ratio values: the measured aspects of each example/instance.
What is Subtree Replacement?
This refers to the process of selecting some subtrees and replacing them with single leaves. This process proceeds from the leaves and works back up toward the root.
This technique involves partitioning the sequence of data by placing a breakpoint in it.
This technique is called discretization.
Out of the following propositional logic: OR, AND, XOR, and NOT, which cannot be represented by a perceptron?
XOR cannot be represented by a perceptron.
Do covering algorithms and decision trees treat numeric attributes in the same way?
Yes
Can missing and numeric values be handled by 1R?
Yes, 1R can handle missing and numeric values.
Is subtree replacement a bottom up approach?
Yes, subtree replacement is a bottom up approach.
Can Decision Trees easily express the disjunction implied among different rules?
Yes, the structure of the tree inherently allows for the expression of disjunctions. Each path from the root to a leaf node represents a rule or condition, and the decision at each internal node is based on a feature test.
How can missing values be handled?
You can treat "missing" as a value of the attribute, or you can send the instance down the most popular branch.
Explain a Simple Covering Algorithm
You generate a rule by adding tests that maximize a rule's accuracy. This involves finding an attribute to split on. The covering algorithm chooses an attribute-value pair to maximize the probability of the desired classification. Each new test reduces the rule's coverage.
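A PRISM-flavored sketch of growing one rule by the p/t criterion (the data layout is illustrative; tie-breaking and the outer loop over classes are omitted):

    def best_test(rows, target_class, used):
        # Pick the attribute-value pair maximizing p/t: of the t covered
        # instances, p belong to the target class.
        best, best_ratio = None, -1.0
        for a in rows[0]:
            if a == "class" or a in used:
                continue
            for v in {r[a] for r in rows}:
                covered = [r for r in rows if r[a] == v]
                p = sum(r["class"] == target_class for r in covered)
                if p / len(covered) > best_ratio:
                    best, best_ratio = (a, v), p / len(covered)
        return best

    def grow_rule(rows, target_class):
        # Keep adding conditions until the rule is "perfect" (covers only
        # the target class); each new test shrinks the rule's coverage.
        conditions, used = [], set()
        while rows and any(r["class"] != target_class for r in rows):
            test = best_test(rows, target_class, used)
            if test is None:
                break
            a, v = test
            conditions.append(test)
            used.add(a)
            rows = [r for r in rows if r[a] == v]
        return conditions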
If a nominal attribute is used to branch on, then what is the case?
You have used all of the information it has to offer.
What is Incremental Reduced Error Pruning?
You simplify each rule as soon as it is built. RIPPER is an algorithm that does this and it involves global optimization.
What is Post-Pruning?
You take a fully grown decision tree and discard unreliable parts. This involves subtree replacement (a bottom-up approach and considers replacing a tree only after considering all subtrees) OR subtree raising (deleting the nodes and redistributing instances)
What is a Covering Algorithm?
You take each class and seek a way of covering all the instances while excluding the instances not in the class. This is called the "covering approach" because at each stage you identify a rule that "covers" some instances. This leads to a set of rules, not a decision tree.
How are attributes for a decision tree selected?
You want to get the smallest tree, so go for the attribute that produces the purest nodes. Node purity is related to information gain. Choose the attribute that will lead to the greatest information gain.
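A short entropy/information-gain sketch for nominal attributes (illustrative data layout):

    import math
    from collections import Counter

    def entropy(rows):
        counts = Counter(r["class"] for r in rows)
        n = len(rows)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def information_gain(rows, attr):
        # Gain = entropy before the split minus the weighted average
        # entropy of the subsets it produces; purer subsets mean more gain.
        remainder = 0.0
        for v in {r[attr] for r in rows}:
            subset = [r for r in rows if r[attr] == v]
            remainder += len(subset) / len(rows) * entropy(subset)
        return entropy(rows) - remainder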
Why do we prune decision trees?
You're "trimming" the tree in order to prevent overfitting.
The Zero-R Method classifies data by
assigning the new data to the majority class and does not consider any features.