Data mining Exam 2
Divisive
(Top Down) •Start with one all-inclusive cluster •Repeatedly divide into smaller clusters •One common method is to recursively use k-means •Less popular
Basic Agglomerative Nesting Algorithm (AGNES)
1.Let each data point be a cluster 2.Compute distances between clusters 3.Merge the two closest clusters 4.Repeat steps 2 and 3 5.End when a single cluster remains
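A minimal sketch of AGNES-style agglomerative clustering using SciPy's hierarchical clustering routines; the small 2-D dataset and the average-linkage choice are illustrative assumptions, not part of the course material.

```python
# Minimal sketch of agglomerative (AGNES-style) clustering with SciPy.
# The 2-D points below are made up purely for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2],
              [5.1, 4.8], [9.0, 9.1], [8.8, 9.3]])

# Each point starts as its own cluster; linkage() repeatedly merges the two
# closest clusters until a single cluster remains.
Z = linkage(X, method="average")  # "single", "complete", or "centroid" also work

# Cut the resulting tree to recover, e.g., 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the nested tree structure
# (see the Dendrogram entry further down).
```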
market basket analysis
Association rules are heavily utilized in marketing; in that context the technique is commonly known by this name
Agglomerative
Bottom Up. •Begin with n-clusters (each observation is a singleton cluster) •Keep joining clusters with smallest distance until one cluster is left (the entire data set) •Most popular approach
First-rater problem
Cannot predict rating for new item until some users have rated it
Models
Classifiers
Scalability
Computations become slower as the number of users and items increases
Lemmatization
Reducing words to their dictionary base form (lemma) so that related forms map to the same term (e.g., ran → run)
Group Average (Average Linkage)
Distance between two clusters is the average pairwise distance between points in the (different) clusters
Distance Between Centers
Distance between two clusters is the distance between the cluster centroids
MIN (Single Linkage)
Distance between two clusters is the distance between the two closest points in the (different) clusters. •Is better at handling non-elliptical shapes •Will likely result in cleaner (more interpretable) clusters •More sensitive to noise and outliers
MAX (Complete Linkage)
Distance between two clusters is the distance between the two farthest points in the (different) clusters. •MAX has a tendency to "jump gaps" •It often breaks large clusters •Its results would be less interpretable •MAX manages noise and outliers better
Machine Learning
Field of study that gives computers the ability to learn without being explicitly programmed
The Apriori Algorithm addresses two sub-problems
Find all frequent itemsets (i.e., all sets of items with support above min_sup) •From each frequent itemset, generate and retain strong rules that use items from that itemset (i.e., all rules that have confidence above min_conf); see the sketch below
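A brute-force sketch of both sub-problems on a made-up transaction list (it enumerates candidate itemsets level by level rather than using Apriori's candidate-generation-and-pruning step); the items and the min_sup / min_conf thresholds are illustrative assumptions.

```python
# Brute-force sketch of the two sub-problems; items and thresholds are made up.
from itertools import combinations

transactions = [{"beer", "diapers", "chips"},
                {"beer", "diapers"},
                {"beer", "chips"},
                {"diapers", "chips"},
                {"beer", "diapers", "chips"}]
n = len(transactions)
min_sup, min_conf = 0.4, 0.7

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

# Sub-problem 1: find all frequent itemsets (support >= min_sup).
items = sorted({i for t in transactions for i in t})
frequent = {}
for k in range(1, len(items) + 1):
    level = {frozenset(c): support(frozenset(c)) for c in combinations(items, k)}
    level = {s: sup for s, sup in level.items() if sup >= min_sup}
    if not level:
        break  # no frequent k-itemsets means no frequent (k+1)-itemsets
    frequent.update(level)

# Sub-problem 2: from each frequent itemset, keep rules with confidence >= min_conf.
for itemset, sup in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            confidence = sup / support(antecedent)
            if confidence >= min_conf:
                print(set(antecedent), "->", set(itemset - antecedent),
                      f"support={sup:.2f} confidence={confidence:.2f}")
```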
Cold-start problem
Limited knowledge of users means it is difficult to determine similarity
Association Phases
Phase 1: Finding Frequent Itemsets Phase 2: Finding Association Rules
Accuracy
Proportion of correct predictions. Accuracy = (a + d) / (a + b + c + d)
Specificity (True Negative Rate):
Proportion of negative cases correctly classified as negative. Specificity = d / (c + d)
False Positive Rate
Proportion of negative cases incorrectly classified as positive. False Positive Rate = c / (c + d)
Sensitivity (True Positive Rate):
Proportion of positive cases correctly classified as positive. Sensitivity = a / (a + b)
False Negative Rate
Proportion of positive cases incorrectly classified as negative. False Negative Rate = b / (a + b)
Precision
Proportion of predicted positive cases that were correctly classified. Precision = a / (a + c)
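A small sketch tying the six metrics above to the same a/b/c/d cell labels (a = true positives, b = false negatives, c = false positives, d = true negatives); the counts are made up.

```python
# Confusion-matrix metrics using the a/b/c/d labels from the cards above;
# the counts are made up for illustration.
a, b, c, d = 40, 10, 5, 45   # a=TP, b=FN, c=FP, d=TN

accuracy            = (a + d) / (a + b + c + d)
sensitivity         = a / (a + b)   # true positive rate
false_negative_rate = b / (a + b)
specificity         = d / (c + d)   # true negative rate
false_positive_rate = c / (c + d)
precision           = a / (a + c)

print(accuracy, sensitivity, specificity, false_positive_rate,
      false_negative_rate, precision)
```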
Stemming
Removal of term suffixes (big, bigger, biggest → big-)
Outputs
Responses or Dependent Variables
Tokenization
Splitting up a string of characters into a set of tokens to build a dictionary
Sparsity of records
With a large set of items, users will likely only have rated a few items
Information retrieval
automatic retrieval of all relevant documents while retrieving as few of the non-relevant documents as possible
Popularity bias
cannot recommend items to someone with unique tastes
Information extraction
extracting relevant facts from given documents
Multicollinearity
highly correlated independent variables
Hierarchical vs. partitional clustering
in hierarchical clustering a given observation may be present in multiple clusters as you move up the hierarchy. In partitional cluster analysis, a given observation is included in one, and only one, cluster.
Confidence
is a measure of the strength of the rule. It is the conditional probability that a transaction containing the antecedent also contains the consequent
Classification
is a predictive method, unlike association rule mining and clustering which are descriptive
root mean squared error (RMSE)
is the square root of MSE and is often used because it has the same units as the original data
Text mining
is the process of applying data mining algorithms and approaches to textual, rather than numeric or categorical, data. The goal of text mining is often to be able to identify similarities between documents within a corpus. •The problem is that text is unstructured •While organized, it is not organized in a manner that is easily interpretable by computers
training data set
is used to construct the classification model. •More training data → better classifier (to a point)
Support
it is the probability that a transaction contains the antecedent and the consequent
Mean squared error (MSE)
measures the mean squared difference between actual and predicted values
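A minimal sketch of MSE and RMSE; the actual and predicted values are made up.

```python
# MSE and RMSE on made-up actual/predicted values.
import numpy as np

actual    = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

mse  = np.mean((actual - predicted) ** 2)  # mean squared error
rmse = np.sqrt(mse)                        # same units as the original data
print(mse, rmse)
```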
Odds
odds = P(event occurs) / P(event does not occur) = p / (1 - p)
Probability
p = (number of outcomes of interest) / (number of all possible outcomes)
odds ratio
represents how the odds of the event occurring change with a one unit increase in the associated variable, all other things being equal
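A short sketch relating probability, odds, and the odds-ratio interpretation of a logistic regression coefficient; the probability and coefficient values are made up.

```python
# Probability -> odds, and the odds ratio exp(beta) for a logistic
# regression coefficient; the numbers are made up.
import math

p = 0.75
odds = p / (1 - p)           # 3.0, i.e., 3-to-1 odds

beta = 0.4                   # a (made-up) logistic regression coefficient
odds_ratio = math.exp(beta)  # ~1.49: odds rise ~49% per one-unit increase,
                             # all other things being equal
print(odds, odds_ratio)
```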
Goal of hierarchical clustering
to identify the hierarchies between the n objects in a dataset such that they can be represented in a nested tree structure
test data set
used to estimate the accuracy/future performance of the selected model
Linear regression
uses ordinary least squares to allow us to predict a target (dependent) variable based on one or more input (independent) variables
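A minimal sketch of ordinary least squares with NumPy; the x/y data are made up.

```python
# Ordinary least squares fit of y on x with NumPy; the data are made up.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Add an intercept column and solve for the coefficients that minimize
# the sum of squared residuals.
X = np.column_stack([np.ones_like(x), x])
(intercept, slope), *_ = np.linalg.lstsq(X, y, rcond=None)

print(intercept, slope)        # fitted coefficients
print(intercept + slope * x)   # predicted values of the target variable
```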
Natural language processing
using computers to interpret language delivered in natural form
Leave-One-Out Cross Validation
•A special case of k-fold where k=n, the number of observations in the data set •For each iteration, one observation is used for testing with the rest used for training •Computationally expensive
Traditional DB Queries
•Can be tedious and difficult to quantify •Supports hypothesis verification about relationships (e.g., do diapers and beer co-occur)
Applications for classification
•Credit approval - high vs low risk •Targeted marketing - loyal vs non-loyal customers •Medical diagnosis - cancerous vs benign cells •Fraud - genuine vs fraudulent transactions
k-Fold Cross Validation
•Data is randomly split into k subsets of (approximately) equal size •For each iteration, one subset is used for testing and the rest are used for training •Cross validation uses sampling without replacement, thus test sets will not overlap
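A sketch of k-fold cross validation with scikit-learn; the iris data, the decision tree learner, and k = 5 are illustrative assumptions.

```python
# 5-fold cross validation with scikit-learn; dataset and learner are
# illustrative choices only.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each observation lands in exactly one test fold (sampling without
# replacement), so the test sets do not overlap.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kfold)
print(scores, scores.mean())
```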
Decision Tree Strengths
•Easily interpreted •Easy to implement •Relatively efficient •Can handle mixed measurement scales •Can handle missing values •Relatively robust •Extremely popular
Neural Network Advantages
•Easy to implement •Can be trained to address very complex problems •Can handle non-linear and non-normal data
Irrelevant Variables
•Inclusion of irrelevant variables can result in poor model fit •Pay attention to variable significance!
Neural Networks
•Model "learns" the structure of the data from a representative sample •The user needs to have some •Heuristic knowledge of how to select/prepare data •How to select an appropriate neural network •How to interpret the results •Uses a series of weights, biases, and hidden neurons to detect complex relationships •Can perform well in the presence of complicated, noisy, and/or imprecise data •Appropriate for - classification, regression, time series analysis and clustering
Omitted Variables
•Model leaves out one or more important causal factors •Biases the coefficients produced by the model
Hierarchical Cluster Analysis
•Multiple partitions of the data depending on the level of hierarchy •Number of clusters is not required in advance •SLOW on large datasets •May be used (with caution) on differently shaped data •Repeatable results
Neural Network Disadvantages
•Neural networks are "black boxes" •Training can be slow •May require very large training data set
Inputs
•Predictors or Independent Variables
memory based
•Ratings dataset directly used to find neighbors and make predictions •Efficiency suffers as the entire database is needed for each prediction (i.e., it does not scale) •Overfits the data as it ascribes all variability in rating to user preference
Association Rule Mining
•Relatively easy to automatically discover association rules from data •User does not have to specify what to look for in advance (data driven) •Potential for finding unexpected correlations
Filtering
•Removing unique terms (terms appearing only once in the entire corpus) •Removing common words or compiling a stopwords dictionary (the, a, an, etc.)
Logistic regression
•Selects regression coefficients to force predicted values for Y to fall between 1 and 0 •Produces an s-shaped (sigmoid) curve rather than a straight line to model probabilities •Selects coefficients using Maximum Likelihood Estimation (MLE) rather than Ordinary Least Squares (OLS)
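A sketch of the sigmoid function and a logistic regression fit (coefficients chosen by MLE) with scikit-learn; the synthetic dataset is an illustrative assumption.

```python
# Sigmoid curve plus a scikit-learn logistic regression fit; the synthetic
# dataset is made up for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # The s-shaped curve that keeps predicted probabilities between 0 and 1.
    return 1.0 / (1.0 + np.exp(-z))

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)      # coefficients via maximum likelihood
print(model.predict_proba(X[:3]))           # predicted probabilities in [0, 1]
print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # ~0.12, 0.5, ~0.88
```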
k-Means Cluster Analysis (partitional)
•Single partition of the data •Number of clusters must be specified a priori •Relatively fast on large datasets •Ideally clusters are hyper-spherical •Non-repeatable due to random selection of initial seeds
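A sketch of k-means with scikit-learn; the blob data and the choice of k = 3 are illustrative assumptions.

```python
# k-means on made-up blob data; k must be specified a priori.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

# Fixing random_state makes the otherwise non-repeatable random seed
# selection reproducible.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)
print(kmeans.cluster_centers_)   # one centroid per cluster
print(kmeans.labels_[:10])       # cluster assignment for the first 10 points
```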
Unsupervised Learning
•The computer is presented only with inputs (independent variables) •The computer attempts to classify things based on similarity/dissimilarity
Supervised Learning
•The computer is presented with inputs (independent variables) and associated labels indicating the class of the observation (dependent variable) •The computer attempts to learn the rule that maps inputs to each class •New data is classified based on the rule learned by the computer
Improper Functional Form
•The relationships between variables are not linear •Can result in biased coefficient estimates •Try different functional forms of the independent variables (log, squared terms, etc.)
Decision Tree Construction
•Tree construction is performed in a top-down, recursive, divide-and-conquer manner •All training examples begin in the root node •Attributes are assumed to be nominal (could be discretized interval) variables •Examples are partitioned recursively based on selected attributes, making locally optimal (greedy) decisions at each split
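A sketch of top-down tree construction with scikit-learn; note that scikit-learn splits on numeric thresholds rather than nominal attributes, and the iris data and depth limit are illustrative assumptions.

```python
# Top-down, recursive tree construction with scikit-learn; dataset and
# depth limit are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# The tree is grown recursively, choosing a locally optimal split at each node.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(export_text(tree))  # text rendering of the nested tests
```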
Collaborative Filtering
•Two forms: user-based collaborative filtering and item-based collaborative filtering •"What is popular among my peers?" •Based on the user's past behavior and the behavior of those similar to the user
model based
•Use an alternative, traditional machine learning algorithm to make predictions (i.e., clustering, deep neural network, etc.) •Often involve matrix factorization approaches which reduce dimensionality and alleviate the sparse matrix problem •Computationally expensive "model-learning" phase completed offline
Decision Tree Weaknesses
•Volatile •Sensitive to outliers •Can result in large error
Content-Based Filtering
•"What else might I like?" •Based on the similarities shared by items the user has previously liked/purchased/etc. •May be over-specialized
Dendrogram
•a graphical representation of the hierarchical structure of the clusters •Height of each connection reflects the distance between clusters
Bootstrapping
•a procedure that uses random sampling with replacement •A dataset of n instances is randomly sampled n times (with replacement) to form training data •Note that the same observation could be selected more than once for use in the training data set in the same iteration •Items not selected for the training set are used for testing
Latent semantic analysis (LSA)
•also known as latent semantic indexing (LSI) is a method of automatic indexing and retrieval •It reveals the essence of a text by discarding surface elements and deducing a new vector space corresponding to "hidden" terms •LSA utilizes singular value decomposition (SVD) to reduce vector space dimensionality •It builds a semantic space in which similar words and documents are near one another
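A sketch of LSA/LSI built from TF-IDF vectors followed by truncated SVD; the tiny corpus and the two-component choice are illustrative assumptions.

```python
# LSA/LSI sketch: TF-IDF term-document vectors reduced with truncated SVD.
# The corpus and number of components are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = ["the cat sat on the mat",
          "a cat and a dog played",
          "stock prices fell sharply",
          "markets and stocks declined"]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)

# SVD reduces the vector space to a small number of "hidden" dimensions.
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(tfidf)
print(doc_vectors)  # similar documents end up near one another in this space
```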
Random Forests
•are an ensemble approach to classification •Rather than using a single decision tree, multiple trees are constructed •Each tree performs the classification and the results are aggregated
backpropagation
•By calculating the partial derivatives of the cost function with respect to each weight and bias, weights are iteratively adjusted to descend the error function •Backpropagation is a special case of gradient descent
input layer
•connects to a layer of neurons called a hidden layer, which, in turn, connects to a final layer called the output layer.
Recommender systems
•guide people to interesting material based on information (i.e., they help match users with items). Recommender systems help overcome information overload
association rule
•is a pattern that suggests when one event occurs, another event is likely to occur as well •Association rules are structured as sets of if/then statements •Each rule suggests co-occurrence, not causality
decision tree
•is a set of nested tests which we use to "divide and conquer" a prediction problem •Each branch represents a test •Each node represents the result of a test •Each leaf (terminal node) represents a final classification decision
Entropy
•is the information required to predict an event with certainty •Information is measured in bits
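A small sketch of entropy in bits for a class distribution; the example counts are made up.

```python
# Entropy (in bits) of a class distribution; example counts are made up.
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([9, 5]))   # ~0.94 bits
print(entropy([7, 7]))   # 1.0 bit  (maximum uncertainty for two classes)
print(entropy([14, 0]))  # 0.0 bits (the outcome is already certain)
```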
Web mining
•is the process of discovering useful and previously unknown information from web content and usage •Web mining can be used to investigate various aspects of the web
Receiver Operating Characteristics (ROC) Curve
•is used in signal detection to characterize the tradeoff between hit rate and false alarm rate •Characterizes the performance of a model using a wide range of cutoff values
validation data set
•is used to fine tune the models, assess their performance, and select the "best" model for a given phenomenon. •More validation/testing data → more accurate error estimate
Overfitting
•occurs when we use an overly flexible model that accommodates the nuances of the random noise in the training data
Lift
•provides information about the increase in probability of the consequent, given the antecedent, i.e., whether including the antecedent improves the probability of finding the consequent beyond random chance •Lift takes into account statistical (in)dependence
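A short sketch computing support, confidence, and lift for one made-up rule (diapers → beer) on a toy transaction list.

```python
# Support, confidence, and lift for the rule {diapers} -> {beer};
# the transactions are made up for illustration.
transactions = [{"diapers", "beer"}, {"diapers", "beer", "chips"},
                {"diapers", "beer"}, {"chips", "milk"}, {"milk"}]
n = len(transactions)

antecedent, consequent = {"diapers"}, {"beer"}

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

sup_rule   = support(antecedent | consequent)   # P(antecedent and consequent)
confidence = sup_rule / support(antecedent)     # P(consequent | antecedent)
lift       = confidence / support(consequent)   # > 1 means better than chance
print(sup_rule, confidence, lift)               # 0.6, 1.0, ~1.67
```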
Multi-layer perceptron models
•were originally inspired by neurophysiology and the interconnections between biological neurons. The basic model form arranges neurons in layers.