Data Mining Final Exam Review
As a business, what could I do if an association rules analysis tells me two products are highly related?
Product placement, bundling, making recommendations
The assumption in item based collaborative filtering is that items are similar based on what?
Ratings
What does support tell me about a rule?
Relevance
Outputs are otherwise known as ________ or ________
Responses or dependent variables
The hyperbolic tangent function has what kind of shape?
S-shaped, very much like the logit function, but bounded by 1 and -1
What is an advantage of model-based approaches over memory-based approaches?
Scalability. We can train the model and do the bulk of the calculations ahead of time based on historical data.
Which metric would be appropriate when examining a test for a fatal disease?
Sensitivity
The logistic function describes a _______ curve
Sigmoid (S) shaped
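For illustration, a minimal NumPy sketch (not from the course materials) comparing the two shapes; `sigmoid` is a hypothetical helper name. The logistic output stays between 0 and 1, while tanh stays between -1 and 1:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: S-shaped, bounded between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))   # approaches 0 for large negative z, 1 for large positive z
print(np.tanh(z))   # same S shape, but bounded by -1 and 1
```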
What does confidence tell me about a rule?
Strength
What is the role of the cost function in back propagation?
The role of the cost function is to determine two things: in what direction and by how much I should alter my weights and biases
A nearest neighbor is what?
They rate things the way that you rate things
Identify the roles of training, validation, and test data sets in model development and evaluation
Training data is used to construct the classification model.
Validation data is used to fine-tune the models, assess their performance, and select the "best" model for a given phenomenon.
Test data is used to estimate the accuracy/future performance of the selected model.
New/unseen data contains only inputs; the predicted outputs enable decision makers to extract value from the data.
True or false: Inputs are known as predictors or independent variables
True
True or false: Neural networks are easy to overfit
True
True or false: Outputs are known as responses or dependent variables
True
All of the model comparison charts with the exception of the ROC chart start with what?
We predict probabilities, sort the probabilities, then split the data into bins
In classification techniques, there are ________ outputs
categorical
What does classification predict?
categorical class labels (discrete or nominal)
When your dependent variable is nominal/categorical, a _______ is equivalent to each possible level of that variable
class
What machine learning algorithms are supervised?
classification regression
Models are otherwise known as ________
classifiers
In regression techniques there are __________ outputs
continuous
What kind of model is classification?
A predictive model
In the context of supervised machine learning, the role of the training data set is to what?
To train the model, i.e., to allow the algorithm to extract the rules
Identify strategies for blending social and technical elements of modeling
Blending social and technical elements of modeling can be a challenging task, but there are some strategies that can be used to help achieve this goal:
•Involve Stakeholders: Modelers should involve stakeholders in the model development process to ensure that the model reflects their needs and priorities. This can be achieved through workshops, focus groups, and other engagement activities.
•Communication: Effective communication is critical when blending social and technical elements of modeling. Modelers should communicate technical concepts in a way that is accessible to non-technical stakeholders, and ensure that feedback and concerns are addressed in a timely and effective manner.
•Transparency: Modelers should be transparent about the assumptions, data, and methods used in the modeling process. This can help build trust with stakeholders and ensure that the model is perceived as credible.
•Flexibility: Modelers should be flexible and willing to adapt their approach based on stakeholder feedback and changing circumstances. This can help ensure that the model remains relevant and useful over time.
•Participatory Modeling: Participatory modeling involves engaging stakeholders in the modeling process, from problem definition through to model development and implementation. This approach can help ensure that the model reflects stakeholder needs and priorities and can increase stakeholder buy-in and support for the model.
•Ethical Considerations: Modelers should consider the ethical implications of their modeling activities and ensure that the model does not perpetuate or exacerbate social inequalities or biases.
By implementing these strategies, modelers can effectively blend social and technical elements of modeling, resulting in models that are relevant, credible, and useful for stakeholders.
Compare and contrast neural networks to logistic regression
Both neural networks and logistic regression are machine learning models used for classification tasks. However, they differ in the following ways:
•Neural networks can handle more complex relationships between the input features and the output variable than logistic regression, as they can learn non-linear representations of the data through multiple layers of neurons.
•Logistic regression assumes a linear relationship between the input features and the log-odds of the output variable, while neural networks can learn non-linear relationships.
•Neural networks require more data and computational resources than logistic regression due to their complexity.
Interpret the results of CaRT/ID3 models
CaRT/ID3 models output a decision tree that represents the hierarchy of input features and their corresponding splits that lead to the predicted classes. The tree can be interpreted as a set of rules that describe the decision-making process of the model. The results can be evaluated using metrics such as accuracy, precision, recall, and F1 score. The model can also be visualized to aid in understanding and interpretation. The results can be interpreted as the predicted class of the input based on the decision rules of the tree.
Identify the requirements for CaRT/ID3 models
CaRT/ID3 models require a set of labeled training data and the assumption that the data can be split into binary categories based on the input features. They also require the data to be in numerical form, as the algorithm works with numerical inputs. Additionally, the algorithm assumes that the input features are independent and that the target variable has a discrete set of values.
Total weighted entropy is a measure of what?
Certainty: perfectly certain if it is 0, perfectly uncertain if it is 1
what machine learning algorithms exploit patterns for prediction?
Classification and regression, both supervised machine learning algorithms
Differentiate between classification and other predictive techniques
Classification is a type of predictive modeling technique used to classify or categorize data into predefined classes or categories based on certain features or attributes. It is a supervised learning technique where a model is trained on labeled data to make predictions on new, unseen data. Other predictive techniques include regression, clustering, and association rule mining. Regression is used to predict a continuous numerical value or outcome, while clustering is used to group similar objects together based on their features or attributes. Association rule mining is used to discover associations or patterns among a set of variables.
What machine learning algorithms are unsupervised?
Clustering, Association, Dimensionality Reduction
What is required to demonstrate causality?
Covariation, non-spuriousness, temporal precedence
What does the IDF portion of tf-idf do?
Dampens the effect of words that appear across the entire corpus at a high rate; it suppresses their importance
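For illustration, a rough pure-Python sketch on a hypothetical toy corpus, using one common unsmoothed IDF variant (idf = log(N/df)); real tf-idf implementations often add smoothing:

```python
import math

# Hypothetical toy corpus: 4 documents
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
    "the market rallied today",
]

def idf(term, docs):
    """idf = log(N / df); one common (unsmoothed) variant."""
    df = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / df)

print(idf("the", docs))     # appears in all 4 docs -> log(4/4) = 0, importance suppressed
print(idf("market", docs))  # appears in 2 docs -> log(4/2) ~ 0.69
print(idf("cat", docs))     # appears in 2 docs -> ~ 0.69
```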
Identify the steps of the latent semantic analysis algorithm
1. Data Collection: The first step is to collect a large dataset of documents, such as news articles or academic papers.
2. Text Preprocessing: The text is then preprocessed, which includes tasks such as tokenization, stemming, and stop-word removal. This step is important to reduce the dimensionality of the data and improve the efficiency of the algorithm.
3. Term-Document Matrix Creation: The next step is to create a term-document matrix, which represents the frequency of each term in each document. This matrix is typically very large and sparse, and is therefore compressed using techniques such as singular value decomposition (SVD).
4. Dimensionality Reduction: The dimensionality of the matrix is reduced using SVD, which converts the matrix into a lower-dimensional space while preserving the most important relationships between terms and documents.
5. Calculation of Semantic Similarity: Once the dimensionality of the matrix has been reduced, the semantic similarity between words and documents can be calculated. This is done by measuring the cosine similarity between the vectors representing the terms and documents in the reduced dimensional space.
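For illustration, a minimal sketch assuming scikit-learn is available (TfidfVectorizer, TruncatedSVD, cosine_similarity); the documents here are hypothetical:

```python
# Minimal LSA sketch, assuming scikit-learn is installed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "stocks fell as the market reacted to rate hikes",
    "the market rallied after strong earnings reports",
    "the cat chased the mouse around the house",
]

# Term-document matrix (here weighted by tf-idf) with stop-word removal
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# Dimensionality reduction via truncated SVD (the "latent semantic" space)
svd = TruncatedSVD(n_components=2, random_state=0)
X_lsa = svd.fit_transform(X)

# Semantic similarity between documents in the reduced space
print(cosine_similarity(X_lsa))
```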
____________ variables have a finite number of values
Discrete
Identify the steps of the association rule algorithm
1. Data Preparation: This involves collecting and cleaning the data and converting it into a format suitable for analysis.
2. Support Calculation: This involves calculating the frequency of occurrence of each itemset in the dataset.
3. Itemset Generation: This involves generating all possible itemsets that meet the minimum support threshold.
4. Rule Generation: This involves generating association rules by testing all possible combinations of itemsets and evaluating their confidence levels.
5. Rule Evaluation: This involves evaluating the generated rules based on user-defined metrics such as support, confidence, and lift.
6. Rule Pruning: This involves eliminating uninteresting or redundant rules to produce a set of actionable insights.
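For illustration, a rough pure-Python sketch computing support, confidence, and lift for one hypothetical rule ({bread} -> {butter}) over toy transactions:

```python
# Hypothetical transactions; computes support, confidence, and lift
# for the rule {bread} -> {butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

n = len(transactions)
support_a  = sum("bread" in t for t in transactions) / n              # P(A)
support_b  = sum("butter" in t for t in transactions) / n             # P(B)
support_ab = sum({"bread", "butter"} <= t for t in transactions) / n  # P(A and B)

confidence = support_ab / support_a   # P(B | A): strength of the rule
lift       = confidence / support_b   # >1 means A and B co-occur more than chance

print(f"support={support_ab:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```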
Identify common challenges faced by modelers on real projects
•Data Quality: Data quality is one of the most common challenges faced by modelers. The data used to train the model needs to be accurate, complete, and consistent. If the data is not clean or contains errors, the model's accuracy and reliability may be compromised.
•Data Availability: In some cases, the data needed to train the model may not be readily available or may be difficult to collect. This can be a significant challenge, especially if the data is proprietary or sensitive.
•Feature Engineering: The selection and engineering of the right features is a critical step in the model development process. Selecting the wrong features or failing to engineer them correctly can result in a model with poor performance.
•Overfitting: Overfitting occurs when the model is too complex and fits the training data too closely, resulting in poor generalization to new data. This is a common challenge, and modelers need to ensure that their models are not overfitting.
•Model Interpretability: Understanding how a model arrives at its predictions is important in many applications. However, some models, such as neural networks, are inherently complex and difficult to interpret. Modelers must balance model performance with interpretability.
•Scalability: As the volume of data and complexity of models increase, the computational requirements for training and deploying models also increase. Ensuring that models can be trained and deployed efficiently at scale can be a significant challenge.
•Deployment and Maintenance: Once a model has been developed, it needs to be deployed and maintained in production. This requires careful consideration of infrastructure, monitoring, and updates to ensure that the model continues to perform well over time.
______________ is a set of nested tests we use to divide and conquer a prediction problem
Decision tree
What is LSA?
Decomposition of the term-document matrix into 3 matrices (via singular value decomposition)
What is the difference between collaborative filtering and content-based filtering?
Collaborative filtering bases recommendations on user behavior (ratings from similar users), while content-based filtering bases them on the attributes of the items themselves; in essence, the difference is between the user and the item
A ____________ is when a group of models are used to make a prediction
Ensemble
True or false: Association is a supervised machine learning algorithm
False
True or false: Neural networks are easy to interpret
False
True or false: Outputs are known as predictors or independent variables
False
True or False: Logistic regression models are evaluated against one another based on ChiSquare
False. They are compared based on accuracy or some measure of error; in other words, performance
True or false: association rules implies event A causes event B
False NOT CAUSALITY
What do machine learning algorithms do?
Find patterns
The role of the activation function is _________
Firing: taking the aggregated data that comes in and determining if it is high enough to fire
Interpret the results of the association rules models
•Frequent Itemsets: The algorithm can identify frequent itemsets, which are sets of items that frequently co-occur in the dataset. These itemsets can provide insights into which products or services are often purchased or used together.
•Association Rules: The algorithm can generate association rules that describe the relationships between different items or attributes in the dataset. These rules can be used to identify patterns or trends in consumer behavior or to make recommendations for cross-selling or up-selling opportunities.
•Metrics: The algorithm can calculate various metrics such as support, confidence, and lift for each association rule. These metrics can be used to evaluate the strength and significance of the relationships between different items or attributes.
What is the purpose of the validation data set?
Generally validation is used to select among competing models
_____ & ________ are issues associated with using linear regression for a binary dependent variable
Heteroskedasticity and non-conforming probabilities
Dividing your data into training, validation, and testing data is called what?
Hold-out approach
What is the difference between unsupervised and supervised learning?
In unsupervised learning, the computer is tasked with assigning each observation to a group without knowing what those groups really are; groups are constructed based on similarities in the data for the given objects. In supervised machine learning, the computer examines data where we know to what class each observation belongs. It then attempts to find the patterns in the data (extract rules) that it can use to classify a new observation for which the class is unknown.
How is classification different from association rule mining and clustering?
It is a predictive model, unlike association rule mining and clustering which are descriptive
A value of 2.0 for the odds ratio for a variable means
It positively influences the event occurring (any odds ratio greater than 1 does); an odds ratio of 2.0 means a one-unit increase in the variable doubles the odds of the event
Identify the requirements for latent semantic analysis models
•Large Textual Dataset: The LSA algorithm requires a large textual dataset to perform the analysis.
•Text Preprocessing Techniques: The algorithm requires effective text preprocessing techniques such as tokenization, stemming, and stop-word removal.
•SVD Implementation: The algorithm requires an implementation of SVD to compress the term-document matrix and reduce the dimensionality of the data.
Interpret the results of Naïve Bayes models
Naïve Bayes models output the probability of each class given the input features. The class with the highest probability is the predicted class. The results can be interpreted as the likelihood of the input belonging to each class based on the available evidence. The output probabilities can also be used to calculate the expected utility or cost of each decision based on the predicted class.
Identify the requirements for Naïve Bayes models
Naïve Bayes models require a set of labeled training data and the assumption of conditional independence between the features given the class. They also require the data to be in numerical form, as the algorithm works with probabilities and requires numerical inputs. Naïve Bayes relies on the assumption that predictors are statistically independent.
______ is a modeling approach which uses MLE to predict a nominal variable
Logistic Regression
Interpret the results of logistic regression models
Logistic regression models output the predicted probability of the output variable (binary) given the input features. The model coefficients represent the strength and direction of the relationship between each input feature and the log-odds of the output variable. These coefficients can be used to calculate the odds ratio, which represents the change in odds of the output variable given a one-unit increase in the corresponding input feature. The model can also be evaluated using metrics such as accuracy, precision, recall, and F1 score. The results can be interpreted as the likelihood of the input belonging to the positive class based on the available evidence.
Identify the requirements for logistic regression models
Logistic regression models require a set of labeled training data and the assumption of a linear relationship between the input features and the log-odds of the output variable. They also require the data to be in numerical form, as the algorithm works with probabilities and requires numerical inputs. Additionally, logistic regression assumes that the data follows a binomial distribution and that the observations are independent.
Collaborative filtering is what?
Memory Based
What is the classification process?
Model building, validation, testing, application
What do models do?
Models predict an output given a set of inputs
What are the different types of classification algorithms to choose from?
Naive Bayes, Logistic Regression, Perceptron, ..., Decision Trees/Random Forests, Neural Networks
Interpret the results of neural network models
Neural network models output the predicted probability of the output variable (binary or multi-class) given the input features. The model weights represent the strength and direction of the connections between the neurons and can be used to understand the learned representations of the input data. The model can also be evaluated using metrics such as accuracy, precision, recall, and F1 score. The results can be interpreted as the likelihood of the input belonging to each class based on the learned relationships between the input features.
Identify the requirements for neural network models
Neural network models require a set of labeled training data and a large number of training iterations to learn the optimal weights of the connections between the neurons. They also require the data to be in numerical form, as the algorithm works with numerical inputs. Additionally, the number of neurons, layers, and activation functions must be specified, along with the learning rate and other hyperparameters that affect the training process.
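For illustration, a minimal NumPy sketch of a forward pass through a small hypothetical network; the weights here are random placeholders, since backpropagation is what would actually learn them:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy network: 3 inputs -> 4 hidden neurons -> 1 output. The weights are arbitrary
# placeholders here; training via backpropagation is what actually sets them.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

x = np.array([0.5, -1.2, 3.0])
hidden = np.tanh(x @ W1 + b1)       # hidden layer with tanh activation
output = sigmoid(hidden @ W2 + b2)  # output interpreted as a class probability
print(output)
```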
________ is when a model is trained too well and models the idiosyncrasies of the training data
Overfitting
What is the purpose of the test data set?
Predict error, predict the performance of the model with respect to error
Identify the steps of the collaborative filtering algorithm
Predict, Rank, Recommend
1. Data Collection: The first step is to collect data on user behavior, such as purchases, ratings, or reviews.
2. User Similarity Calculation: The algorithm then calculates the similarity between different users based on their behavior. This can be done using different methods, such as Pearson correlation or cosine similarity.
3. Neighborhood Selection: The algorithm selects a subset of users who are most similar to the target user, based on the similarity calculation from the previous step.
4. Item Recommendation: The algorithm then recommends items that have been positively rated by the users in the neighborhood, but have not yet been rated by the target user.
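For illustration, a rough NumPy sketch of user-based collaborative filtering on a hypothetical ratings matrix, using cosine similarity and a similarity-weighted average:

```python
import numpy as np

# Hypothetical user-item rating matrix (rows = users, cols = items, 0 = not rated)
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target = 0                      # predict item 2 for user 0 (currently unrated)
sims = np.array([cosine(R[target], R[u]) for u in range(len(R))])
sims[target] = 0                # exclude the target user themselves

# Neighborhood = users who rated item 2; weight their ratings by similarity
rated = R[:, 2] > 0
pred = (sims[rated] @ R[rated, 2]) / sims[rated].sum()
print(pred)                     # predicted rating for user 0 on item 2
```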
Identify the steps of the logistic regression algorithm
1. Prepare the data by converting it into numerical form and splitting it into training and test sets.
2. Initialize the model parameters (coefficients) randomly.
3. Calculate the probabilities of the output variable (binary) based on the input features using the logistic function.
4. Calculate the cost function (negative log-likelihood) to measure the error between the predicted probabilities and the actual labels.
5. Update the model parameters using gradient descent to minimize the cost function.
6. Repeat steps 3-5 until convergence or a stopping criterion is met.
7. Make predictions on new, unseen data by applying the trained model to the input features.
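For illustration, a minimal NumPy sketch of these steps on a hypothetical one-feature dataset, using batch gradient descent on the negative log-likelihood (a real project would normally use a library implementation):

```python
import numpy as np

# Hypothetical toy data: one feature, binary label
X = np.array([[0.5], [1.5], [2.5], [3.5], [4.5], [5.5]])
y = np.array([0, 0, 0, 1, 1, 1])
X = np.hstack([np.ones((len(X), 1)), X])      # add an intercept column

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(X.shape[1])                      # initialize coefficients
lr = 0.1
for _ in range(5000):                         # gradient descent on the negative log-likelihood
    p = sigmoid(X @ w)                        # predicted probabilities
    grad = X.T @ (p - y) / len(y)             # gradient of the cost
    w -= lr * grad

print(w)                   # fitted coefficients (log-odds scale)
print(np.exp(w[1]))        # odds ratio for the feature
```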
Identify the steps of the Naïve Bayes algorithm
1. Prepare the data by converting it into numerical form.
2. Calculate the prior probabilities of each class.
3. Calculate the likelihood of each feature given the class.
4. Calculate the posterior probabilities using Bayes' theorem.
5. Make predictions based on the highest posterior probability.
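For illustration, a rough pure-Python sketch of these steps on a hypothetical categorical dataset (no Laplace smoothing, kept simple for clarity):

```python
# Minimal Naive Bayes sketch on a hypothetical dataset.
# Features: (outlook, windy); class: play yes/no.
data = [
    ("sunny", "no",  "yes"), ("sunny", "yes", "no"),
    ("rainy", "no",  "yes"), ("rainy", "yes", "no"),
    ("sunny", "no",  "yes"), ("rainy", "no",  "yes"),
]

classes = {"yes", "no"}
# Prior probability of each class
prior = {c: sum(1 for *_, y in data if y == c) / len(data) for c in classes}

def likelihood(feature_index, value, c):
    """P(feature = value | class = c), estimated from counts."""
    in_class = [row for row in data if row[-1] == c]
    return sum(1 for row in in_class if row[feature_index] == value) / len(in_class)

# Posterior (up to a constant) for a new observation: outlook=sunny, windy=no
x = ("sunny", "no")
scores = {c: prior[c] * likelihood(0, x[0], c) * likelihood(1, x[1], c) for c in classes}
print(max(scores, key=scores.get), scores)   # predicted class and unnormalized posteriors
```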
________ is a numerical measurement of the likelihood of an event occurring
Probability
Choose the appropriate evaluation metric(s) for a given business problem
The appropriate evaluation metric(s) for a given business problem depend on the specific requirements and goals of the problem. For example, if the goal is to maximize overall accuracy, then accuracy may be the most appropriate metric. However, if there is a class imbalance in the data, then metrics such as precision and recall may be more appropriate. It is important to consider the business context and use case when selecting evaluation metrics.
•Accuracy: Proportion of correct predictions out of the total number of predictions. Useful when the business problem requires an overall evaluation of the model's performance without focusing on any specific class. For example, in a credit scoring model, accuracy can measure the overall proportion of correct predictions of whether a customer will default on a loan.
•Sensitivity/Recall (True Positive Rate): Proportion of true positives (TP) out of all actual positives (TP+FN). Useful when the business problem requires the model to correctly identify as many positive cases as possible. For example, in a medical diagnosis model, sensitivity can measure the proportion of correctly identified patients who have a particular disease, so that they can receive the appropriate treatment.
•Specificity (True Negative Rate): Proportion of true negatives (TN) out of all actual negatives (TN+FP). Useful when the business problem requires the model to correctly identify as many negative cases as possible. For example, in a fraud detection model, specificity can measure the proportion of correctly identified non-fraudulent transactions, so that legitimate transactions are not unnecessarily flagged as fraudulent.
•False Positive Rate (FPR): Proportion of actual negatives that were incorrectly classified as positives, i.e., the ratio of false positives (FP) to the sum of true negatives (TN) and false positives (FP). Useful when the business problem requires minimizing false alarms. For example, in a spam email filter, FPR can measure the proportion of legitimate emails that are incorrectly classified as spam.
•False Negative Rate (FNR): Proportion of actual positives that were incorrectly classified as negatives, i.e., the ratio of false negatives (FN) to the sum of true positives (TP) and false negatives (FN). Useful when the business problem requires minimizing missed opportunities. For example, in a medical diagnosis model, FNR can measure the proportion of patients who have a particular disease but were not identified by the model, potentially leading to delayed treatment.
•Precision: Proportion of true positives (TP) out of all predicted positives (TP+FP). Useful when the business problem requires a high degree of confidence in the positive predictions. For example, in a credit scoring model, precision can measure the proportion of correctly identified customers who are likely to default on a loan, so that the business can take appropriate risk management measures.
Interpret the confusion matrix and associated metrics
The confusion matrix is a table that shows the predicted and actual values of a classification model. It contains four metrics: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). These metrics can be used to calculate additional metrics such as accuracy, precision, recall, and F1 score, which provide information on the performance of the model.
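For illustration, a minimal pure-Python sketch computing the confusion-matrix cells and the derived metrics from hypothetical predictions (positive class = 1):

```python
# Hypothetical actual and predicted labels
actual    = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)      # recall / true positive rate
specificity = tn / (tn + fp)      # true negative rate
precision   = tp / (tp + fp)
f1          = 2 * precision * sensitivity / (precision + sensitivity)
print(accuracy, sensitivity, specificity, precision, f1)
```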
What does lift tell me about a rule?
The direction of the relationship, in essence: lift greater than 1 means the items occur together more often than expected by chance, lift less than 1 means less often
Interpret the results of collaborative filtering models
The results of the Collaborative Filtering algorithm can be interpreted in terms of the recommendations it generates. These recommendations can be evaluated using different metrics, such as precision, recall, or F1 score. The algorithm can also be used to generate insights into user behavior, such as identifying patterns of behavior or predicting future purchases. Additionally, the algorithm can be used to improve the user experience in an online application, by providing personalized recommendations that are tailored to each user's preferences.
Interpret the results of the latent semantic analysis models
The results of the LSA algorithm can be interpreted in terms of the relationships between words and documents. The reduced-dimensional space allows for easier visualization of these relationships and can be used to identify clusters of related terms or documents. The algorithm can also be used for tasks such as document classification, information retrieval, and sentiment analysis. The performance of the LSA algorithm can be evaluated using metrics such as precision, recall, and F1 score.
Identify the requirements for collaborative filtering models
•User Behavior Data: The algorithm requires a dataset that includes information on user behavior, such as purchases, ratings, or reviews.
•Similarity Calculation Method: The algorithm requires a method for calculating the similarity between different users, such as Pearson correlation or cosine similarity.
•Neighborhood Selection Criteria: The algorithm requires criteria for selecting a subset of users who are most similar to the target user. This can be based on the similarity calculation, or other factors such as user location or preferences.
Common challenges:
•Cold-start problem: limited knowledge of users means it is difficult to determine similarity
•Sparsity of records: with a large set of items, users will likely only have rated a few items
•First-rater problem: cannot predict a rating for a new item until some users have rated it
•Popularity bias: cannot recommend items to someone with unique tastes
•Scalability: computations become slower as the number of users and items increases
Machine learning is the
field of study that gives computers the ability to learn without being explicitly programmed
A human analog to the perceptron is a ________
neuron
Inputs are otherwise known as ________ or ________
predictors or independent variables
Classification is a _________ method and the model is constructed using a training data set
supervised
Why are decision trees considered greedy?
They look at each variable and make the best decision they can at the current split, then commit to it and move on to the next split without revisiting earlier choices
Identify the steps of the CaRT/ID3 algorithm
Tree construction is performed in a top-down, recursive, divide-and-conquer manner. The CaRT (Classification and Regression Trees) and ID3 (Iterative Dichotomiser 3) algorithms are decision tree algorithms that involve the following steps:
1. Using your training data, select the best attribute to split on
2. Identify all possible values for that attribute
3. For each value, create a new child node
4. Allocate the observations to the appropriate child node
5. For each child node:
•If the node is pure, STOP
•Else, recursively call the algorithm to split again
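For illustration, a rough pure-Python sketch of step 1 (choosing the attribute to split on) via total weighted entropy on a hypothetical dataset; the lower the weighted entropy, the better the split:

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def weighted_entropy(rows, attribute_index):
    """Total weighted entropy of splitting on one attribute (0 = perfectly pure split)."""
    total = len(rows)
    result = 0.0
    for value in {r[attribute_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attribute_index] == value]
        result += (len(subset) / total) * entropy(subset)
    return result

# Hypothetical training data: (outlook, windy, class)
rows = [
    ("sunny", "no", "play"), ("sunny", "yes", "stay"),
    ("rainy", "yes", "stay"), ("rainy", "no", "play"),
]
# Lower weighted entropy = better split; ID3 would choose "windy" here.
print(weighted_entropy(rows, 0))   # split on outlook -> 1.0 (no help)
print(weighted_entropy(rows, 1))   # split on windy   -> 0.0 (perfectly pure)
```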
Why might cross validation be favored over hold out?
With a single hold-out split, the error estimate depends on which observations happen to land in the test set; with cross-validation we wash that randomness out and do the error estimation iteratively over multiple splits
Identify the requirements for association rule models
•Association rules require data in transactional format; meaningful rules can only be computed for nominal data
•It is important to recognize that the support, confidence, and lift of each rule are needed in order to determine its value to the business
•A rule has to meet a minimum support and a minimum confidence level
•Both thresholds are determined by the modeler
•Consider an association rule found in a cell phone company database containing all call destinations for each account:
Construct and interpret model assessment charts
Many of these charts share a set of algorithmic steps:
1. Using the model, produce estimated probabilities for the target event for each case
2. Sort all cases by decreasing estimated probability
3. Split the cases evenly into n bins so that bin #1 has the highest probabilities and bin #n the lowest probabilities
4. Now look at the number of cases where the target event actually happens
5. Calculate the statistic of interest for each bin
We often evaluate model performance visually using charts. In many cases, these charts are produced automatically by the data mining tool being used. Model assessment charts such as ROC (receiver operating characteristic) curves and lift charts can be used to evaluate the performance of a classification model. ROC curves plot the true positive rate (TPR) against the false positive rate (FPR) for different classification thresholds, while lift charts show the ratio of the true positive rate to the expected rate for different deciles of the data. These charts can be used to compare the performance of different models and to select an appropriate classification threshold based on the business requirements and trade-offs.
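For illustration, a minimal NumPy sketch of the shared binning steps above on hypothetical model output, producing per-bin response rates and lift:

```python
import numpy as np

# Hypothetical model output: estimated probability of the target event, plus the actual outcome
p_hat  = np.array([0.92, 0.85, 0.77, 0.64, 0.58, 0.43, 0.31, 0.22, 0.15, 0.05])
actual = np.array([1,    1,    1,    0,    1,    0,    0,    0,    1,    0   ])

order = np.argsort(-p_hat)                 # step 2: sort cases by decreasing probability
bins = np.array_split(actual[order], 5)    # step 3: split into n bins (here n = 5)

overall_rate = actual.mean()
for i, b in enumerate(bins, start=1):
    rate = b.mean()                        # steps 4-5: actual event rate per bin
    print(f"bin {i}: response rate {rate:.2f}, lift {rate / overall_rate:.2f}")
```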
Identify approaches for estimating error in models
There are multiple methods commonly used to gather data for the evaluation of classification models:
•Hold out: splitting the data into training and testing sets once and evaluating the performance on the testing set.
•Cross-validation: splitting the data into training and testing sets multiple times and averaging the performance across each split.
•Bootstrapping: resampling the data with replacement to create multiple training and testing sets and evaluating the performance across each set.
•Bayesian methods: using prior distributions and posterior probabilities to estimate the uncertainty and error of the model.
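For illustration, a minimal sketch assuming scikit-learn, contrasting a single hold-out estimate with 5-fold cross-validation (the dataset and model choices here are arbitrary):

```python
# Sketch assuming scikit-learn; contrasts a single hold-out estimate with k-fold CV.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Hold-out: one random split, so the estimate depends on which rows landed in the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print("hold-out accuracy:", model.fit(X_tr, y_tr).score(X_te, y_te))

# Cross-validation: averaging over 5 splits washes out that randomness
print("5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```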
Differentiate between supervised and unsupervised learning methods
Unsupervised Learning:
•The computer is presented only with inputs (independent variables)
•The computer attempts to classify things based on similarity/dissimilarity
Supervised Learning:
•The computer is presented with inputs (independent variables) and associated labels indicating the class of the observation (dependent variable)
•The computer attempts to learn the rule that maps inputs to each class
•New data is classified based on the rule learned by the computer