Data mining Exam 2

Divisive

(Top Down) •Start with one all-inclusive cluster •Repeatedly divide into smaller clusters •One common method is to recursively use k-means •Less popular

Basic Agglomerative Nesting Algorithm (AGNES)

1. Let each data point be a cluster 2. Compute distances between clusters 3. Merge the two closest clusters 4. Repeat steps 2 and 3 5. End when a single cluster remains
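
A minimal sketch of agglomerative nesting, assuming SciPy is available; the toy data and cluster count are illustrative. The `method` argument selects the linkage criterion (single, complete, average, centroid) described in the cards below.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: each row starts as its own singleton cluster
X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9], [9.0, 0.5]])

# Agglomerative merging; method can be 'single', 'complete', 'average', or 'centroid'
Z = linkage(X, method='average')   # each row of Z records one merge: [idx1, idx2, distance, size]

# Cut the tree to recover a flat partition, e.g. 3 clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)
```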

market basket analysis

Association rules are heavily utilized in marketing, and in that context the technique is commonly known by this name

Agglomerative

Bottom Up. •Begin with n-clusters (each observation is a singleton cluster) •Keep joining clusters with smallest distance until one cluster is left (the entire data set) •Most popular approach

First-rater problem

Cannot predict rating for new item until some users have rated it

Models

Classifiers

Scalability

Computations become slower as the number of users and items increase

Lemmatization

Reducing related terms to their common base (dictionary) form (e.g., ran → run)

Group Average (Average Linkage)

Distance between two clusters is the average pairwise distance between points in the (different) clusters

Distance Between Centers

Distance between two clusters is the distance between the cluster centroids

MIN (Single Linkage)

Distance between two clusters is the distance between the two closest points in the (different) clusters. •Better at handling non-elliptical shapes •Will likely result in cleaner (more interpretable) clusters •More sensitive to noise and outliers

MAX (Complete Linkage)

Distance between two clusters is the distance between the two farthest points in the (different) clusters. •MAX has a tendency to "jump gaps" •It often breaks large clusters •Its results would be less interpretable •MAX manages noise and outliers better

Machine Learning

Field of study that gives computers the ability to learn without being explicitly programmed

The Apriori Algorithm addresses two sub-problems

•Find all frequent itemsets (i.e., all sets of items with support above min_sup) •From each frequent itemset, generate and retain strong rules that use items from that itemset (i.e., all rules that have confidence above min_conf)
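
A minimal, pure-Python sketch of the two sub-problems on a toy transaction list; the items and thresholds are illustrative, and the level-wise search below skips Apriori's candidate-pruning optimization for brevity.

```python
from itertools import combinations

transactions = [{'milk', 'bread'}, {'milk', 'diapers', 'beer'},
                {'bread', 'diapers', 'beer'}, {'milk', 'bread', 'diapers', 'beer'}]
min_sup, min_conf = 0.5, 0.7
n = len(transactions)

def support(itemset):
    # fraction of transactions containing the itemset
    return sum(itemset <= t for t in transactions) / n

# Sub-problem 1: level-wise search for frequent itemsets
items = sorted({i for t in transactions for i in t})
frequent = {}
for k in range(1, len(items) + 1):
    level = {frozenset(c) for c in combinations(items, k) if support(set(c)) >= min_sup}
    if not level:
        break
    frequent.update({s: support(s) for s in level})

# Sub-problem 2: from each frequent itemset, retain rules with confidence >= min_conf
for itemset in (s for s in frequent if len(s) > 1):
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            conf = frequent[itemset] / frequent[antecedent]
            if conf >= min_conf:
                print(set(antecedent), '->', set(itemset - antecedent), round(conf, 2))
```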

Cold-start problem

Limited knowledge of users means it is difficult to determine similarity

Association Phases

Phase 1: Finding Frequent Itemsets Phase 2: Finding Association Rules

Accuracy

Proportion of correct predictions. Accuracy = (a + d) / (a + b + c + d)

Specificity (True Negative Rate):

Proportion of negative cases correctly classified as negative. Specificity = d / (c + d)

False Positive Rate

Proportion of negative cases incorrectly classified as positive. False Positive Rate = c / (c + d)

Sensitivity (True Positive Rate):

Proportion of positive cases correctly classified as positive. Sensitivity = a / (a + b)

False Negative Rate

Proportion of positive cases incorrectly classified as negative. False Negative Rate = b / (a + b)

Precision

Proportion of predicted positive cases that were correctly classified. Precision = a / (a + c)
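
A small helper that computes all six metrics, assuming the cell labels implied by the formulas above (a = true positives, b = false negatives, c = false positives, d = true negatives); the counts passed in are made up.

```python
def confusion_metrics(a, b, c, d):
    """a = true positives, b = false negatives, c = false positives, d = true negatives."""
    return {
        'accuracy':            (a + d) / (a + b + c + d),
        'sensitivity (TPR)':   a / (a + b),
        'specificity (TNR)':   d / (c + d),
        'false positive rate': c / (c + d),
        'false negative rate': b / (a + b),
        'precision':           a / (a + c),
    }

print(confusion_metrics(a=40, b=10, c=5, d=45))
```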

Stemming

Removal of term suffixes (big, bigger, biggest → big-)

Outputs

Responses or Dependent Variables

Tokenization

Splitting up a string of characters into a set of tokens to build a dictionary
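
A minimal sketch of tokenization plus stopword filtering in plain Python; the text and the tiny stopwords list are illustrative (in practice a library such as NLTK supplies tokenizers, stemmers, and lemmatizers).

```python
import re
from collections import Counter

text = "The quick brown fox jumps over the lazy dog. The dog barks."

# Tokenization: split the character string into lower-cased word tokens
tokens = re.findall(r"[a-z]+", text.lower())

# Filtering: drop common words using a (tiny, illustrative) stopwords list
stopwords = {"the", "a", "an", "over"}
filtered = [t for t in tokens if t not in stopwords]

# Build a dictionary of term frequencies from the remaining tokens
print(Counter(filtered))
```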

Sparsity of records

With a large set of items, users will likely only have rated a few items

Information retrieval

automatic retrieval of all relevant documents while retrieving as few of the non-relevant documents as possible

Popularity bias

cannot recommend items to someone with unique tastes

Information extraction

extracting relevant facts from given documents

Multicollinearity

highly correlated independent variables

Hierarchical vs. partitional clustering

in hierarchical clustering a given observation may be present in multiple clusters as you move up the hierarchy. In partitional cluster analysis, a given observation is included in one, and only one, cluster.

Confidence

is a measure of the strength of the rule; it is the conditional probability that a transaction containing the antecedent also contains the consequent

Classification

is a predictive method, unlike association rule mining and clustering which are descriptive

root mean squared error (RMSE)

is the square root of MSE; it is often used because it has the same units as the original data

Text mining

is the process of applying data mining algorithms and approaches to textual, rather than numeric or categorical, data. The goal of text mining is often to be able to identify similarities between documents within a corpus. •The problem is that text is unstructured •While organized, it is not organized in a manner that is easily interpretable by computers

training data set

is used to construct the classification model. •More training data → better classifier (to a point)

Support

it is the probability that a transaction contains the antecedent and the consequent

Mean squared error (MSE)

measures the mean squared difference between actual and predicted values
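
A quick illustration of both error measures on made-up actual and predicted values:

```python
import numpy as np

actual    = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 3.0, 8.0])

mse  = np.mean((actual - predicted) ** 2)   # mean squared difference
rmse = np.sqrt(mse)                          # same units as the original data
print(mse, rmse)
```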

Odds

odds = P(event occurs) / P(event does not occur) = p / (1 - p)

Probability

p = (outcomes of interest) / (all possible outcomes)

odds ratio

represents how the odds of the event occurring change with a one unit increase in the associated variable, all other things being equal
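A small worked example tying the three definitions together; the counts and the logistic regression coefficient are illustrative (in logistic regression the odds ratio for a variable is exp(coefficient)).

```python
import math

# Probability: outcomes of interest / all possible outcomes
p = 30 / 40                      # e.g. 30 of 40 customers responded

# Odds: p / (1 - p)
odds = p / (1 - p)               # 3.0, i.e. "3 to 1"

# Odds ratio from a logistic regression coefficient (illustrative value):
# how the odds change for a one-unit increase in that variable
coef = 0.4
odds_ratio = math.exp(coef)
print(p, odds, round(odds_ratio, 3))
```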

Goal of hierarchical clustering

to identify the hierarchies between the n objects in a dataset such that they can be represented in a nested tree structure

test data set

used to estimate the accuracy/future performance of the selected model

Linear regression

uses ordinary least squares to allow us to predict a target (dependent) variable based on one or more input (independent) variables

Natural language processing

using computers to interpret language delivered in natural form

Leave-One-Out Cross Validation

•A special case of k-fold where k=n, the number of observations in the data set •For each iteration, one observation is used for testing with the rest used for training •Computationally expensive

Traditional DB Queries

•Can be tedious and difficult to quantify •Supports hypothesis verification about relationships (e.g., do diapers and beer co-occur)

Applications for classification

•Credit approval - high vs low risk •Targeted marketing - loyal vs non-loyal customers •Medical diagnosis - cancerous vs benign cells •Fraud - genuine vs fraudulent transactions

k-Fold Cross Validation

•Data is randomly split into k subsets of (approximately) equal size •For each iteration, one subset is used for testing and the rest are used for training •Cross validation uses sampling without replacement, thus test sets will not overlap
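
A minimal sketch of both resampling schemes, assuming scikit-learn; the dataset and classifier are just placeholders. Leave-one-out is simply the k = n special case described in the card above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# k-fold: k non-overlapping test subsets (sampling without replacement)
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))

# Leave-one-out: the special case where k = n (computationally expensive)
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print(kfold_scores.mean(), loo_scores.mean())
```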

Decision Tree Strengths

•Easily interpreted •Easy to implement •Relatively efficient •Can handle mixed measurement scales •Can handle missing values •Relatively robust •Extremely popular

Neural Network Advantages

•Easy to implement •Can be trained to address very complex problems •Can handle non-linear and non-normal data

Irrelevant Variables

•Inclusion of irrelevant variables can result in poor model fit •Pay attention to variable significance!

Neural Networks

•Model "learns" the structure of the data from a representative sample •The user needs to have some •Heuristic knowledge of how to select/prepare data •How to select an appropriate neural network •How to interpret the results •Uses a series of weights, biases, and hidden neurons to detect complex relationships •Can perform well in the presence of complicated, noisy, and/or imprecise data •Appropriate for - classification, regression, time series analysis and clustering

Omitted Variables

•Model leaves out one or more important causal factors •Biases the coefficients produced by the model

Hierarchical Cluster Analysis

•Multiple partitions of the data depending on the level of hierarchy •Number of clusters is not required in advance •SLOW on large datasets •May be used (with caution) on differently shaped data •Repeatable results

Neural Network Disadvantages

•Neural networks are "black boxes" •Training can be slow •May require very large training data set

Inputs

•Predictors or Independent Variables

memory based

•Ratings dataset directly used to find neighbors and make predictions •Efficiency suffers as the entire database is needed for each prediction (i.e., it does not scale) •Overfits the data as it ascribes all variability in rating to user preference

Association Rule Mining

•Relatively easy to automatically discover association rules from data •User does not have to specify what to look for in advance (data driven) •Potential for finding unexpected correlations

Filtering

•Removing unique terms (terms appearing only once in the entire corpus) •Removing common words or compiling a stopwords dictionary (the, a, an, etc.)

Logistic regression

•Selects regression coefficients to force predicted values for Y to fall between 0 and 1 •Produces an s-shaped (sigmoid) curve rather than a straight line to model probabilities •Selects coefficients using Maximum Likelihood Estimation (MLE) rather than Ordinary Least Squares (OLS)
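
A minimal sketch of how the sigmoid turns a linear predictor into a probability between 0 and 1; the coefficients and the income predictor are made-up (in practice they would be estimated by MLE).

```python
import numpy as np

def sigmoid(z):
    # S-shaped curve: squeezes any real number into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative coefficients (in practice chosen by maximum likelihood estimation)
b0, b1 = -4.0, 0.08
income = np.array([20, 50, 80, 110])        # hypothetical predictor

p_event = sigmoid(b0 + b1 * income)         # predicted probabilities between 0 and 1
print(p_event.round(3))
```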

k-Means Cluster Analysis (partitional)

•Single partition of the data •Number of clusters must be specified a priori •Relatively fast on large datasets •Ideally clusters are hyper-spherical •Non-repeatable due to random selection of initial seeds
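
A minimal k-means sketch, assuming scikit-learn; the toy data and k = 3 are illustrative. Fixing `random_state` pins the initial seeds, which is what makes otherwise non-repeatable runs repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9], [9.0, 0.5], [9.1, 0.7]])

# k must be specified a priori
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_, km.cluster_centers_)
```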

Unsupervised Learning

•The computer is presented only with inputs (independent variables) •The computer attempts to classify things based on similarity/dissimilarity

Supervised Learning

•The computer is presented with inputs (independent variables) and associated labels indicating the class of the observation (dependent variable) •The computer attempts to learn the rule that maps inputs to each class •New data is classified based on the rule learned by the computer

Improper Functional Form

•The relationships between variables are not linear •Can result in biased coefficient estimates •Try different functional forms of the independent variables (log, squared terms, etc.)

Decision Tree Construction

•Tree construction is performed in a top-down, recursive, divide-and-conquer manner •All training examples begin in the root node •Attributes are assumed to be nominal (could be discretized interval) variables •Examples are partitioned recursively based on selected attributes, making locally optimal decisions at each split
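
A minimal sketch, assuming scikit-learn, that fits a shallow tree and prints its nested tests; the dataset and depth limit are just placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Top-down, recursive partitioning with locally optimal (greedy) splits
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=load_iris().feature_names))
```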

Collaborative Filtering

•Two forms: user-based collaborative filtering and item-based collaborative filtering •"What is popular among my peers?" •Based on the user's past behavior and the behavior of those similar to the user
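
A very small user-based sketch on a toy ratings matrix: find how similar the peers are to the target user, then take a similarity-weighted average of their ratings for the item. The matrix and the treatment of 0 as "not rated" are illustrative simplifications.

```python
import numpy as np

# Rows = users, columns = items; 0 means "not yet rated"
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

target, item = 0, 2                          # predict user 0's rating of item 2
peers = [1, 2]
sims = np.array([cosine(ratings[target], ratings[p]) for p in peers])
peer_ratings = np.array([ratings[p, item] for p in peers])

# Similarity-weighted average of the peers' ratings for the item
prediction = np.dot(sims, peer_ratings) / sims.sum()
print(round(prediction, 2))
```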

model based

•Use an alternative, traditional machine learning algorithm to make predictions (i.e., clustering, deep neural network, etc.) •Often involve matrix factorization approaches which reduce dimensionality and alleviate the sparse matrix problem •Computationally expensive "model-learning" phase completed offline

Decision Tree Weaknesses

•Volatile •Sensitive to outliers •Can result in large error

Content-Based Filtering

•"What else might I like?" •Based on the similarities shared by items the user has previously liked/purchased/etc. •May be over-specialized

Dendrogram

•a graphical representation of the hierarchical structure of the clusters •Height of each connection reflects the distance between clusters

Bootstrapping

•a procedure that uses random sampling with replacement •A dataset of n instances is randomly sampled n times (with replacement) to form training data •Note that the same observation could be selected more than once for use in the training data set in the same iteration •Items not selected for the training set are used for testing
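
A minimal bootstrap sketch with NumPy; the data are just indices of n = 10 observations. Note the same observation can appear more than once in the training sample.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)                         # indices of n = 10 observations

# Sample n times WITH replacement to form the training set
train_idx = rng.choice(data, size=len(data), replace=True)

# Observations never selected are used for testing
test_idx = np.setdiff1d(data, train_idx)
print(sorted(train_idx), test_idx)
```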

Latent semantic analysis (LSA)

•also known as latent semantic indexing (LSI) is a method of automatic indexing and retrieval •It reveals the essence of a text by discarding surface elements and deducing a new vector space corresponding to "hidden" terms •LSA utilizes singular value decomposition (SVD) to reduce vector space dimensionality •It builds a semantic space in which similar words and documents are near one another
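
A minimal sketch, assuming scikit-learn, that builds term weights and applies truncated SVD to project documents into a low-dimensional "semantic" space; the four toy documents are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "a cat and a dog played",
    "stock prices fell on the market",
    "the market rallied as stocks rose",
]

# Term weights, then SVD to reduce the vector space to 2 hidden dimensions
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
print(lsa.round(2))    # similar documents end up near one another in this space
```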

Random Forests

•are an ensemble approach to classification •Rather than using a single decision tree, multiple trees are constructed •Each tree performs the classification and the results are aggregated
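
A minimal sketch, assuming scikit-learn, of the ensemble idea: many trees are grown and their votes aggregated; the dataset and tree count are placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# An ensemble of trees; each tree classifies and the results are aggregated
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.predict(X[:5]), rf.score(X, y))
```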

backpropagation

•By calculating partial derivatives of the cost function with respect to each weight and bias, weights are iteratively adjusted to descend the error surface •Backpropagation is a special case of gradient descent

input layer

•connects to a layer of neurons called a hidden layer, which, in turn, connects to a final layer called the output layer.

Recommender systems

•guide people to interesting material based on information (i.e., they help match users with items). Recommender systems help overcome information overload

association rule

•is a pattern that suggests when one event occurs, another event is likely to occur as well •Association rules are structured as sets of if/then statements •Each rule suggests co-occurrence, not causality

decision tree

•is a set of nested tests which we use to "divide and conquer" a prediction problem •Each branch represents a test •Each node represents the result of a test •Each leaf (terminal node) represents a class assignment

Entropy

•is the information required to predict an event with certainty •Information is measured in bits
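
A tiny worked example of entropy measured in bits (the probability values are illustrative):

```python
import math

def entropy(probabilities):
    # Information (in bits) needed to predict the outcome with certainty
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))    # 1 bit: a fair coin flip
print(entropy([0.9, 0.1]))    # ~0.47 bits: a biased coin is more predictable
```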

Web mining

•is the process of discovering useful and previously unknown information from web content and usage •Web mining can be used to investigate various aspects of the web

Receiver Operating Characteristics (ROC) Curve

•is used in signal detection to characterize the tradeoff between hit rate and false alarm rate •Characterizes the performance of a model using a wide range of cutoff values

validation data set

•is used to fine-tune the models, assess their performance, and select the "best" model for a given phenomenon •More validation/testing data → more accurate error estimate

Overfitting

•occurs when we use an overly flexible model that accommodates the nuances of the random noise in the training data

Lift

•provides information about the increase in probability of the consequent, given the antecedent. I.e., does including the antecedent improve the probability of finding the consequent over random chance. •Lift takes into account statistical (in)dependence

Multi-layer perceptron models

•were originally inspired by neurophysiology and the interconnections between biological neurons. The basic model form arranges neurons in layers.

