Data Mining Final Exam Review
As a business, what could I do if an association rules analysis tells me two products are highly related?
Product placement, bundling, making recommendations
The assumption in item based collaborative filtering is that items are similar based on what?
Ratings
What does support tell me about a rule?
Relevance
Outputs are otherwise known as ________ or ________
Responses or dependent variables
The hyperbolic tangent function has what kind of shape?
S-shaped, very much like the logit function, but bounded by 1 and -1
What is an advantage of model-based approaches over memory-based approaches?
Scalability. We can train the model and do the bulk of the calculations ahead of time based on historical data.
Which metric would be appropriate when examining a test for a fatal disease?
Sensitivity
The logistic function describes a _______ curve
Sigmoid (S) shaped
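For illustration, a minimal NumPy sketch (not from the course materials) comparing the two shapes; `sigmoid` is a hypothetical helper name. The logistic output stays between 0 and 1, while tanh stays between -1 and 1:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: S-shaped, bounded between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))   # approaches 0 for large negative z, 1 for large positive z
print(np.tanh(z))   # same S shape, but bounded by -1 and 1
```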
What does confidence tell me about a rule?
Strength
What is the role of the cost function in back propagation?
The role of the cost function is to determine two things: in what direction and by how much I should alter my weights and biases
A nearest neighbor is what?
They rate things the way that you rate things
Identify the roles of training, validation, and test data sets in model development and evaluation
Training data is used to construct the classification model.
Validation data is used to fine-tune the models, assess their performance, and select the "best" model for a given phenomenon.
Test data is used to estimate the accuracy/future performance of the selected model.
New/unseen data contains only inputs; the predicted outputs enable decision makers to extract value from the data.
True or false: Inputs are known as predictors or independent variables
True
True or false: Neural networks are easy to overfit
True
True or false: Outputs are known as responses or dependent variables
True
All of the model comparison charts with the exception of the ROC chart start with what?
We predict probabilities, sort the probabilities, then split the data into bins
In classification techniques, there are ________ outputs
categorical
What does classification predict?
categorical class labels (discrete or nominal)
When your dependent variable is nominal/categorical, a _______ is equivalent to each possible level of that variable
class
What machine learning algorithms are supervised?
classification regression
Models are otherwise known as ________
classifiers
In regression techniques there are __________ outputs
continuous
What kind of model is classification?
A predictive model
In the context of supervised machine learning, the role of the training data set is to what?
To train the model, i.e., to allow the algorithm to extract the rules
Identify strategies for blending social and technical elements of modeling
Blending social and technical elements of modeling can be a challenging task, but there are some strategies that can be used to help achieve this goal:
•Involve Stakeholders: Modelers should involve stakeholders in the model development process to ensure that the model reflects their needs and priorities. This can be achieved through workshops, focus groups, and other engagement activities.
•Communication: Effective communication is critical when blending social and technical elements of modeling. Modelers should communicate technical concepts in a way that is accessible to non-technical stakeholders, and ensure that feedback and concerns are addressed in a timely and effective manner.
•Transparency: Modelers should be transparent about the assumptions, data, and methods used in the modeling process. This can help build trust with stakeholders and ensure that the model is perceived as credible.
•Flexibility: Modelers should be flexible and willing to adapt their approach based on stakeholder feedback and changing circumstances. This can help ensure that the model remains relevant and useful over time.
•Participatory Modeling: Participatory modeling involves engaging stakeholders in the modeling process, from problem definition through to model development and implementation. This approach can help ensure that the model reflects stakeholder needs and priorities and can increase stakeholder buy-in and support for the model.
•Ethical Considerations: Modelers should consider the ethical implications of their modeling activities and ensure that the model does not perpetuate or exacerbate social inequalities or biases.
By implementing these strategies, modelers can effectively blend social and technical elements of modeling, resulting in models that are relevant, credible, and useful for stakeholders.
Compare and contrast neural networks to logistic regression
Both neural networks and logistic regression are machine learning models used for classification tasks. However, they differ in the following ways:
•Neural networks can handle more complex relationships between the input features and the output variable than logistic regression, as they can learn non-linear representations of the data through multiple layers of neurons.
•Logistic regression assumes a linear relationship between the input features and the log-odds of the output variable, while neural networks can learn non-linear relationships.
•Neural networks require more data and computational resources than logistic regression due to their complexity.
Interpret the results of CaRT/ID3 models
CaRT/ID3 models output a decision tree that represents the hierarchy of input features and their corresponding splits that lead to the predicted classes. The tree can be interpreted as a set of rules that describe the decision-making process of the model. The results can be evaluated using metrics such as accuracy, precision, recall, and F1 score. The model can also be visualized to aid in understanding and interpretation. The results can be interpreted as the predicted class of the input based on the decision rules of the tree.
Identify the requirements for CaRT/ID3 models
CaRT/ID3 models require a set of labeled training data and the assumption that the data can be split into binary categories based on the input features. They also require the data to be in numerical form, as the algorithm works with numerical inputs. Additionally, the algorithm assumes that the input features are independent and that the target variable has a discrete set of values.
Total weighted entropy is a measure of what?
Certainty: perfectly certain if it is 0, perfectly uncertain if it is 1
what machine learning algorithms exploit patterns for prediction?
Classification and regression, both supervised machine learning algorithms
Differentiate between classification and other predictive techniques
Classification is a type of predictive modeling technique used to classify or categorize data into predefined classes or categories based on certain features or attributes. It is a supervised learning technique where a model is trained on labeled data to make predictions on new, unseen data. Other predictive techniques include regression, clustering, and association rule mining. Regression is used to predict a continuous numerical value or outcome, while clustering is used to group similar objects together based on their features or attributes. Association rule mining is used to discover associations or patterns among a set of variables.
What machine learning algorithms are unsupervised?
Clustering, Association, Dimensionality Reduction
What is required to demonstrate causality?
Covariation, non-spuriousness, temporal precedence
What does the IDF portion of tf-idf do?
Dampens the effect of words that appear across the entire corpus at a high rate; it suppresses their importance
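For illustration, a rough pure-Python sketch on a hypothetical toy corpus, using one common unsmoothed IDF variant (idf = log(N/df)); real tf-idf implementations often add smoothing:

```python
import math

# Hypothetical toy corpus: 4 documents
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
    "the market rallied today",
]

def idf(term, docs):
    """idf = log(N / df); one common (unsmoothed) variant."""
    df = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / df)

print(idf("the", docs))     # appears in all 4 docs -> log(4/4) = 0, importance suppressed
print(idf("market", docs))  # appears in 2 docs -> log(4/2) ~ 0.69
print(idf("cat", docs))     # appears in 2 docs -> ~ 0.69
```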
Identify the steps of the latent semantic analysis algorithm
1. Data Collection: The first step is to collect a large dataset of documents, such as news articles or academic papers.
2. Text Preprocessing: The text is then preprocessed, which includes tasks such as tokenization, stemming, and stop-word removal. This step is important to reduce the dimensionality of the data and improve the efficiency of the algorithm.
3. Term-Document Matrix Creation: The next step is to create a term-document matrix, which represents the frequency of each term in each document. This matrix is typically very large and sparse, and is therefore compressed using techniques such as singular value decomposition (SVD).
4. Dimensionality Reduction: The dimensionality of the matrix is reduced using SVD, which converts the matrix into a lower-dimensional space while preserving the most important relationships between terms and documents.
5. Calculation of Semantic Similarity: Once the dimensionality of the matrix has been reduced, the semantic similarity between words and documents can be calculated. This is done by measuring the cosine similarity between the vectors representing the terms and documents in the reduced dimensional space.
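For illustration, a minimal sketch assuming scikit-learn is available (TfidfVectorizer, TruncatedSVD, cosine_similarity); the documents here are hypothetical:

```python
# Minimal LSA sketch, assuming scikit-learn is installed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "stocks fell as the market reacted to rate hikes",
    "the market rallied after strong earnings reports",
    "the cat chased the mouse around the house",
]

# Term-document matrix (here weighted by tf-idf) with stop-word removal
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# Dimensionality reduction via truncated SVD (the "latent semantic" space)
svd = TruncatedSVD(n_components=2, random_state=0)
X_lsa = svd.fit_transform(X)

# Semantic similarity between documents in the reduced space
print(cosine_similarity(X_lsa))
```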
____________ variables have a finite number of values
Discrete
Identify the steps of the association rule algorithm
1. Data Preparation: This involves collecting and cleaning the data and converting it into a format suitable for analysis.
2. Support Calculation: This involves calculating the frequency of occurrence of each itemset in the dataset.
3. Itemset Generation: This involves generating all possible itemsets that meet the minimum support threshold.
4. Rule Generation: This involves generating association rules by testing all possible combinations of itemsets and evaluating their confidence levels.
5. Rule Evaluation: This involves evaluating the generated rules based on user-defined metrics such as support, confidence, and lift.
6. Rule Pruning: This involves eliminating uninteresting or redundant rules to produce a set of actionable insights.
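For illustration, a rough pure-Python sketch computing support, confidence, and lift for one hypothetical rule ({bread} -> {butter}) over toy transactions:

```python
# Hypothetical transactions; computes support, confidence, and lift
# for the rule {bread} -> {butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

n = len(transactions)
support_a  = sum("bread" in t for t in transactions) / n              # P(A)
support_b  = sum("butter" in t for t in transactions) / n             # P(B)
support_ab = sum({"bread", "butter"} <= t for t in transactions) / n  # P(A and B)

confidence = support_ab / support_a   # P(B | A): strength of the rule
lift       = confidence / support_b   # >1 means A and B co-occur more than chance

print(f"support={support_ab:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```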
Identify common challenges faced by modelers on real projects
•Data Quality: Data quality is one of the most common challenges faced by modelers. The data used to train the model needs to be accurate, complete, and consistent. If the data is not clean or contains errors, the model's accuracy and reliability may be compromised.
•Data Availability: In some cases, the data needed to train the model may not be readily available or may be difficult to collect. This can be a significant challenge, especially if the data is proprietary or sensitive.
•Feature Engineering: The selection and engineering of the right features is a critical step in the model development process. Selecting the wrong features or failing to engineer them correctly can result in a model with poor performance.
•Overfitting: Overfitting occurs when the model is too complex and fits the training data too closely, resulting in poor generalization to new data. This is a common challenge, and modelers need to ensure that their models are not overfitting.
•Model Interpretability: Understanding how a model arrives at its predictions is important in many applications. However, some models, such as neural networks, are inherently complex and difficult to interpret. Modelers must balance model performance with interpretability.
•Scalability: As the volume of data and complexity of models increase, the computational requirements for training and deploying models also increase. Ensuring that models can be trained and deployed efficiently at scale can be a significant challenge.
•Deployment and Maintenance: Once a model has been developed, it needs to be deployed and maintained in production. This requires careful consideration of infrastructure, monitoring, and updates to ensure that the model continues to perform well over time.
______________ is a set of nested tests we use to divide and conquer a prediction problem
Decision tree
What is LSA?
Decomposition of the term-document matrix into 3 matrices (via singular value decomposition)
What is the difference between collaborative filtering and content-based filtering?
Collaborative filtering bases recommendations on user behavior (ratings from similar users), while content-based filtering bases them on the attributes of the items themselves; in essence, the difference is between the user and the item
A ____________ is when a group of models are used to make a prediction
Ensemble
True or false: Association is a supervised machine learning algorithm
False
True or false: Neural networks are easy to interpret
False
True or false: Outputs are known as predictors or independent variables
False
True or False: Logistic regression models are evaluated against one another based on ChiSquare
False. They are compared based on accuracy or some measure of error; in other words, performance
True or false: association rules implies event A causes event B
False NOT CAUSALITY
What do machine learning algorithms do?
Find patterns
The role of the activation function is _________
Firing: taking the aggregated data that comes in and determining if it is high enough to fire
Interpret the results of the association rules models
•Frequent Itemsets: The algorithm can identify frequent itemsets, which are sets of items that frequently co-occur in the dataset. These itemsets can provide insights into which products or services are often purchased or used together.
•Association Rules: The algorithm can generate association rules that describe the relationships between different items or attributes in the dataset. These rules can be used to identify patterns or trends in consumer behavior or to make recommendations for cross-selling or up-selling opportunities.
•Metrics: The algorithm can calculate various metrics such as support, confidence, and lift for each association rule. These metrics can be used to evaluate the strength and significance of the relationships between different items or attributes.
What is the purpose of the validation data set?
Generally validation is used to select among competing models
_____ & ________ are issues associated with using linear regression for a binary dependent variable
Heteroskedasticity and non-conforming probabilities
Dividing your data into training, validation, and testing data is called what?
Hold-out approach
What is the difference between unsupervised and supervised learning?
In unsupervised learning, the computer is tasked with assigning each observation to a group without knowing what those groups really are; groups are constructed based on similarities in the data for the given objects. In supervised machine learning, the computer examines data where we know to what class each observation belongs. It then attempts to find the patterns in the data (extract rules) that it can use to classify a new observation for which the class is unknown.
How is classification different from association rule mining and clustering?
It is a predictive model, unlike association rule mining and clustering which are descriptive
A value of 2.0 for the odds ratio for a variable means
It positively influences the event occurring (any odds ratio greater than 1 does); an odds ratio of 2.0 means a one-unit increase in the variable doubles the odds of the event
Identify the requirements for latent semantic analysis models
•Large Textual Dataset: The LSA algorithm requires a large textual dataset to perform the analysis.
•Text Preprocessing Techniques: The algorithm requires effective text preprocessing techniques such as tokenization, stemming, and stop-word removal.
•SVD Implementation: The algorithm requires an implementation of SVD to compress the term-document matrix and reduce the dimensionality of the data.
Interpret the results of Naïve Bayes models
Naïve Bayes models output the probability of each class given the input features. The class with the highest probability is the predicted class. The results can be interpreted as the likelihood of the input belonging to each class based on the available evidence. The output probabilities can also be used to calculate the expected utility or cost of each decision based on the predicted class.
Identify the requirements for Naïve Bayes models
Naïve Bayes models require a set of labeled training data and the assumption of conditional independence between the features given the class. They also require the data to be in numerical form, as the algorithm works with probabilities and requires numerical inputs. Naïve Bayes relies on the assumption that predictors are statistically independent.
______ is a modeling approach which uses MLE to predict a nominal variable
Logistic Regression
Interpret the results of logistic regression models
Logistic regression models output the predicted probability of the output variable (binary) given the input features. The model coefficients represent the strength and direction of the relationship between each input feature and the log-odds of the output variable. These coefficients can be used to calculate the odds ratio, which represents the change in odds of the output variable given a one-unit increase in the corresponding input feature. The model can also be evaluated using metrics such as accuracy, precision, recall, and F1 score. The results can be interpreted as the likelihood of the input belonging to the positive class based on the available evidence.
Identify the requirements for logistic regression models
Logistic regression models require a set of labeled training data and the assumption of a linear relationship between the input features and the log-odds of the output variable. They also require the data to be in numerical form, as the algorithm works with probabilities and requires numerical inputs. Additionally, logistic regression assumes that the data follows a binomial distribution and that the observations are independent.
Collaborative filtering is what?
Memory Based
What is the classification process?
Model building, validation, testing, application
What do models do?
Models predict an output given a set of inputs
What are the different types of classification algorithms to choose from?
Naive Bayes, Logistic Regression, Perceptron, ..., Decision Trees/Random Forests, Neural Networks
Interpret the results of neural network models
Neural network models output the predicted probability of the output variable (binary or multi-class) given the input features. The model weights represent the strength and direction of the connections between the neurons and can be used to understand the learned representations of the input data. The model can also be evaluated using metrics such as accuracy, precision, recall, and F1 score. The results can be interpreted as the likelihood of the input belonging to each class based on the learned relationships between the input features.
Identify the requirements for neural network models
Neural network models require a set of labeled training data and a large number of training iterations to learn the optimal weights of the connections between the neurons. They also require the data to be in numerical form, as the algorithm works with numerical inputs. Additionally, the number of neurons, layers, and activation functions must be specified, along with the learning rate and other hyperparameters that affect the training process.
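For illustration, a minimal NumPy sketch of a forward pass through a small hypothetical network; the weights here are random placeholders, since backpropagation is what would actually learn them:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy network: 3 inputs -> 4 hidden neurons -> 1 output. The weights are arbitrary
# placeholders here; training via backpropagation is what actually sets them.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

x = np.array([0.5, -1.2, 3.0])
hidden = np.tanh(x @ W1 + b1)       # hidden layer with tanh activation
output = sigmoid(hidden @ W2 + b2)  # output interpreted as a class probability
print(output)
```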
________ is when a model is trained too well and models the idiosyncrasies of the training data
Overfitting
What is the purpose of the test data set?
Predict error, predict the performance of the model with respect to error
Identify the steps of the collaborative filtering algorithm
Predict, Rank, Recommend
1. Data Collection: The first step is to collect data on user behavior, such as purchases, ratings, or reviews.
2. User Similarity Calculation: The algorithm then calculates the similarity between different users based on their behavior. This can be done using different methods, such as Pearson correlation or cosine similarity.
3. Neighborhood Selection: The algorithm selects a subset of users who are most similar to the target user, based on the similarity calculation from the previous step.
4. Item Recommendation: The algorithm then recommends items that have been positively rated by the users in the neighborhood, but have not yet been rated by the target user.
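For illustration, a rough NumPy sketch of user-based collaborative filtering on a hypothetical ratings matrix, using cosine similarity and a similarity-weighted average:

```python
import numpy as np

# Hypothetical user-item rating matrix (rows = users, cols = items, 0 = not rated)
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target = 0                      # predict item 2 for user 0 (currently unrated)
sims = np.array([cosine(R[target], R[u]) for u in range(len(R))])
sims[target] = 0                # exclude the target user themselves

# Neighborhood = users who rated item 2; weight their ratings by similarity
rated = R[:, 2] > 0
pred = (sims[rated] @ R[rated, 2]) / sims[rated].sum()
print(pred)                     # predicted rating for user 0 on item 2
```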
Identify the steps of the logistic regression algorithm
1. Prepare the data by converting it into numerical form and splitting it into training and test sets.
2. Initialize the model parameters (coefficients) randomly.
3. Calculate the probabilities of the output variable (binary) based on the input features using the logistic function.
4. Calculate the cost function (negative log-likelihood) to measure the error between the predicted probabilities and the actual labels.
5. Update the model parameters using gradient descent to minimize the cost function.
6. Repeat steps 3-5 until convergence or a stopping criterion is met.
7. Make predictions on new, unseen data by applying the trained model to the input features.
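For illustration, a minimal NumPy sketch of these steps on a hypothetical one-feature dataset, using batch gradient descent on the negative log-likelihood (a real project would normally use a library implementation):

```python
import numpy as np

# Hypothetical toy data: one feature, binary label
X = np.array([[0.5], [1.5], [2.5], [3.5], [4.5], [5.5]])
y = np.array([0, 0, 0, 1, 1, 1])
X = np.hstack([np.ones((len(X), 1)), X])      # add an intercept column

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(X.shape[1])                      # initialize coefficients
lr = 0.1
for _ in range(5000):                         # gradient descent on the negative log-likelihood
    p = sigmoid(X @ w)                        # predicted probabilities
    grad = X.T @ (p - y) / len(y)             # gradient of the cost
    w -= lr * grad

print(w)                   # fitted coefficients (log-odds scale)
print(np.exp(w[1]))        # odds ratio for the feature
```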
Identify the steps of the Naïve Bayes algorithm
1. Prepare the data by converting it into numerical form.
2. Calculate the prior probabilities of each class.
3. Calculate the likelihood of each feature given the class.
4. Calculate the posterior probabilities using Bayes' theorem.
5. Make predictions based on the highest posterior probability.
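For illustration, a rough pure-Python sketch of these steps on a hypothetical categorical dataset (no Laplace smoothing, kept simple for clarity):

```python
# Minimal Naive Bayes sketch on a hypothetical dataset.
# Features: (outlook, windy); class: play yes/no.
data = [
    ("sunny", "no",  "yes"), ("sunny", "yes", "no"),
    ("rainy", "no",  "yes"), ("rainy", "yes", "no"),
    ("sunny", "no",  "yes"), ("rainy", "no",  "yes"),
]

classes = {"yes", "no"}
# Prior probability of each class
prior = {c: sum(1 for *_, y in data if y == c) / len(data) for c in classes}

def likelihood(feature_index, value, c):
    """P(feature = value | class = c), estimated from counts."""
    in_class = [row for row in data if row[-1] == c]
    return sum(1 for row in in_class if row[feature_index] == value) / len(in_class)

# Posterior (up to a constant) for a new observation: outlook=sunny, windy=no
x = ("sunny", "no")
scores = {c: prior[c] * likelihood(0, x[0], c) * likelihood(1, x[1], c) for c in classes}
print(max(scores, key=scores.get), scores)   # predicted class and unnormalized posteriors
```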
________ is a numerical measurement of the likelihood of an event occurring
Probability
Choose the appropriate evaluation metric(s) for a given business problem
The appropriate evaluation metric(s) for a given business problem depend on the specific requirements and goals of the problem. For example, if the goal is to maximize overall accuracy, then accuracy may be the most appropriate metric. However, if there is a class imbalance in the data, then metrics such as precision and recall may be more appropriate. It is important to consider the business context and use case when selecting evaluation metrics.
•Accuracy: Proportion of correct predictions out of the total number of predictions. Useful when the business problem requires an overall evaluation of the model's performance without focusing on any specific class. For example, in a credit scoring model, accuracy can measure the overall proportion of correct predictions of whether a customer will default on a loan.
•Sensitivity/Recall (True Positive Rate): Proportion of true positives (TP) out of all actual positives (TP+FN). Useful when the business problem requires the model to correctly identify as many positive cases as possible. For example, in a medical diagnosis model, sensitivity can measure the proportion of correctly identified patients who have a particular disease, so that they can receive the appropriate treatment.
•Specificity (True Negative Rate): Proportion of true negatives (TN) out of all actual negatives (TN+FP). Useful when the business problem requires the model to correctly identify as many negative cases as possible. For example, in a fraud detection model, specificity can measure the proportion of correctly identified non-fraudulent transactions, so that legitimate transactions are not unnecessarily flagged as fraudulent.
•False Positive Rate (FPR): Proportion of actual negatives that were incorrectly classified as positives, i.e., the ratio of false positives (FP) to the sum of true negatives (TN) and false positives (FP). Useful when the business problem requires minimizing false alarms. For example, in a spam email filter, FPR can measure the proportion of legitimate emails that are incorrectly classified as spam.
•False Negative Rate (FNR): Proportion of actual positives that were incorrectly classified as negatives, i.e., the ratio of false negatives (FN) to the sum of true positives (TP) and false negatives (FN). Useful when the business problem requires minimizing missed opportunities. For example, in a medical diagnosis model, FNR can measure the proportion of patients who have a particular disease but were not identified by the model, potentially leading to delayed treatment.
•Precision: Proportion of true positives (TP) out of all predicted positives (TP+FP). Useful when the business problem requires a high degree of confidence in the positive predictions. For example, in a credit scoring model, precision can measure the proportion of correctly identified customers who are likely to default on a loan, so that the business can take appropriate risk management measures.
Interpret the confusion matrix and associated metrics
The confusion matrix is a table that shows the predicted and actual values of a classification model. It contains four metrics: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). These metrics can be used to calculate additional metrics such as accuracy, precision, recall, and F1 score, which provide information on the performance of the model.
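For illustration, a minimal pure-Python sketch computing the confusion-matrix cells and the derived metrics from hypothetical predictions (positive class = 1):

```python
# Hypothetical actual and predicted labels
actual    = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)      # recall / true positive rate
specificity = tn / (tn + fp)      # true negative rate
precision   = tp / (tp + fp)
f1          = 2 * precision * sensitivity / (precision + sensitivity)
print(accuracy, sensitivity, specificity, precision, f1)
```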
What does lift tell me about a rule?
The direction of the relationship, in essence: lift greater than 1 means the items occur together more often than expected by chance, lift less than 1 means less often
Interpret the results of collaborative filtering models
The results of the Collaborative Filtering algorithm can be interpreted in terms of the recommendations it generates. These recommendations can be evaluated using different metrics, such as precision, recall, or F1 score. The algorithm can also be used to generate insights into user behavior, such as identifying patterns of behavior or predicting future purchases. Additionally, the algorithm can be used to improve the user experience in an online application, by providing personalized recommendations that are tailored to each user's preferences.
Interpret the results of the latent semantic analysis models
The results of the LSA algorithm can be interpreted in terms of the relationships between words and documents. The reduced-dimensional space allows for easier visualization of these relationships and can be used to identify clusters of related terms or documents. The algorithm can also be used for tasks such as document classification, information retrieval, and sentiment analysis. The performance of the LSA algorithm can be evaluated using metrics such as precision, recall, and F1 score.
Identify the requirements for collaborative filtering models
•User Behavior Data: The algorithm requires a dataset that includes information on user behavior, such as purchases, ratings, or reviews.
•Similarity Calculation Method: The algorithm requires a method for calculating the similarity between different users, such as Pearson correlation or cosine similarity.
•Neighborhood Selection Criteria: The algorithm requires criteria for selecting a subset of users who are most similar to the target user. This can be based on the similarity calculation, or other factors such as user location or preferences.
Common challenges:
•Cold-start problem: limited knowledge of users means it is difficult to determine similarity
•Sparsity of records: with a large set of items, users will likely only have rated a few items
•First-rater problem: cannot predict a rating for a new item until some users have rated it
•Popularity bias: cannot recommend items to someone with unique tastes
•Scalability: computations become slower as the number of users and items increases
Machine learning is the
field of study that gives computers the ability to learn without being explicitly programmed
A human analog to the perceptron is a ________
neuron
Inputs are otherwise known as ________ or ________
predictors or independent variables
Classification is a _________ method and the model is constructed using a training data set
supervised
Why are decision trees considered greedy?
They look at each variable and make the best decision they can at the current split, then commit to it and move on to the next split without revisiting earlier choices
Identify the steps of the CaRT/ID3 algorithm
Tree construction is performed in a top-down, recursive, divide-and-conquer manner. The CaRT (Classification and Regression Trees) and ID3 (Iterative Dichotomiser 3) algorithms are decision tree algorithms that involve the following steps:
1. Using your training data, select the best attribute to split on
2. Identify all possible values for that attribute
3. For each value, create a new child node
4. Allocate the observations to the appropriate child node
5. For each child node:
•If the node is pure, STOP
•Else, recursively call the algorithm to split again
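For illustration, a rough pure-Python sketch of step 1 (choosing the attribute to split on) via total weighted entropy on a hypothetical dataset; the lower the weighted entropy, the better the split:

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def weighted_entropy(rows, attribute_index):
    """Total weighted entropy of splitting on one attribute (0 = perfectly pure split)."""
    total = len(rows)
    result = 0.0
    for value in {r[attribute_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attribute_index] == value]
        result += (len(subset) / total) * entropy(subset)
    return result

# Hypothetical training data: (outlook, windy, class)
rows = [
    ("sunny", "no", "play"), ("sunny", "yes", "stay"),
    ("rainy", "yes", "stay"), ("rainy", "no", "play"),
]
# Lower weighted entropy = better split; ID3 would choose "windy" here.
print(weighted_entropy(rows, 0))   # split on outlook -> 1.0 (no help)
print(weighted_entropy(rows, 1))   # split on windy   -> 0.0 (perfectly pure)
```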
Why might cross validation be favored over hold out?
With a single hold-out split, the error estimate depends on which observations happen to land in the test set; with cross-validation we wash that randomness out and do the error estimation iteratively over multiple splits
Identify the requirements for association rule models
•Association rules require data in transactional format; meaningful rules can only be computed for nominal data
•It is important to recognize that the support, confidence, and lift of each rule are needed in order to determine its value to the business
•A rule has to meet a minimum support and a minimum confidence level
•Both thresholds are determined by the modeler
•Consider an association rule found in a cell phone company database containing all call destinations for each account:
Construct and interpret model assessment charts
Many of these charts share a set of algorithmic steps:
1. Using the model, produce estimated probabilities for the target event for each case
2. Sort all cases by decreasing estimated probability
3. Split the cases evenly into n bins so that bin #1 has the highest probabilities and bin #n the lowest probabilities
4. Now look at the number of cases where the target event actually happens
5. Calculate the statistic of interest for each bin
We often evaluate model performance visually using charts. In many cases, these charts are produced automatically by the data mining tool being used. Model assessment charts such as ROC (receiver operating characteristic) curves and lift charts can be used to evaluate the performance of a classification model. ROC curves plot the true positive rate (TPR) against the false positive rate (FPR) for different classification thresholds, while lift charts show the ratio of the true positive rate to the expected rate for different deciles of the data. These charts can be used to compare the performance of different models and to select an appropriate classification threshold based on the business requirements and trade-offs.
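For illustration, a minimal NumPy sketch of the shared binning steps above on hypothetical model output, producing per-bin response rates and lift:

```python
import numpy as np

# Hypothetical model output: estimated probability of the target event, plus the actual outcome
p_hat  = np.array([0.92, 0.85, 0.77, 0.64, 0.58, 0.43, 0.31, 0.22, 0.15, 0.05])
actual = np.array([1,    1,    1,    0,    1,    0,    0,    0,    1,    0   ])

order = np.argsort(-p_hat)                 # step 2: sort cases by decreasing probability
bins = np.array_split(actual[order], 5)    # step 3: split into n bins (here n = 5)

overall_rate = actual.mean()
for i, b in enumerate(bins, start=1):
    rate = b.mean()                        # steps 4-5: actual event rate per bin
    print(f"bin {i}: response rate {rate:.2f}, lift {rate / overall_rate:.2f}")
```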
Identify approaches for estimating error in models
There are multiple methods commonly used to gather data for the evaluation of classification models:
•Hold out: splitting the data into training and testing sets once and evaluating the performance on the testing set.
•Cross-validation: splitting the data into training and testing sets multiple times and averaging the performance across each split.
•Bootstrapping: resampling the data with replacement to create multiple training and testing sets and evaluating the performance across each set.
•Bayesian methods: using prior distributions and posterior probabilities to estimate the uncertainty and error of the model.
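For illustration, a minimal sketch assuming scikit-learn, contrasting a single hold-out estimate with 5-fold cross-validation (the dataset and model choices here are arbitrary):

```python
# Sketch assuming scikit-learn; contrasts a single hold-out estimate with k-fold CV.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Hold-out: one random split, so the estimate depends on which rows landed in the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print("hold-out accuracy:", model.fit(X_tr, y_tr).score(X_te, y_te))

# Cross-validation: averaging over 5 splits washes out that randomness
print("5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```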
Differentiate between supervised and unsupervised learning methods
Unsupervised Learning:
•The computer is presented only with inputs (independent variables)
•The computer attempts to classify things based on similarity/dissimilarity
Supervised Learning:
•The computer is presented with inputs (independent variables) and associated labels indicating the class of the observation (dependent variable)
•The computer attempts to learn the rule that maps inputs to each class
•New data is classified based on the rule learned by the computer