MGT 4500 Final Exam
Confidence is ________________________. Note that conf(X → Y) may not equal conf(Y → X), although both have the same support!
Directional
Support vs. Confidence
- Both are measures of how strong a relationship is
- Support is needed to calculate confidence
- Support often serves as a "filter" for finding candidate association rules (i.e., two itemsets need to occur frequently enough before we consider them)
Partition-Based Method
- Directly partition the data into K groups, where K is the desired number of clusters
- Example: K-Means Clustering
Hierarchical Method
- Forming larger clusters from smaller clusters
- Example: Hierarchical Clustering
Q1: What kinds of patterns (output)?
Q2: From what types of data (input)?
Q3: How to find these patterns (technique)?
Q4: What counts as meaningful (interpretation and evaluation, decision-making)?
1. Association between objects
2. Transaction data
3. Support, Confidence, Apriori Algorithm
4. Lift measure + subjective managerial judgement
What are the two broad types of predictive analytics?
1. Classification: predicts categorical labels of the outcome variable. Techniques: K-Nearest-Neighbors, Naive Bayes, Decision Trees, etc.
2. Numeric Prediction: predicts continuous/numeric values of the outcome variable. Techniques: K-Nearest-Neighbors, Regression Trees, etc.
We want clustering results to have which two properties?
1. High intra-similarity: data points in the same cluster should be similar to each other
2. Low inter-similarity: data points in different clusters should be different from each other
What are the 3 Measures of Central Tendency
1. Mean: could be heavily influenced by outliers
2. Median: the value in the middle after sorting; 50% of values are larger/smaller than the median
3. Mode: the most frequent value(s)
What are the 2 measures of dispersion
1. Range: max − min
2. Variance: intuitively, how far values are from the mean on average
Min-Max Normalization
A data point with value x on this attribute should be normalized to z = (x − min) / (max − min). The normalized value always falls within [0, 1].
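A minimal sketch of min-max normalization in plain Python (the `ages` values are made up for illustration):

```python
def min_max_normalize(values):
    """Rescale a list of numbers to [0, 1] using min-max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # avoid division by zero for a constant attribute
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

ages = [18, 25, 40, 62]
print(min_max_normalize(ages))        # [0.0, 0.159..., 0.5, 1.0]
```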
F-measure
A special-purpose measure to combine precision and recall
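The card does not give the formula; the standard F1 variant is the harmonic mean of the two: F1 = 2 × Precision × Recall / (Precision + Recall). It is high only when both precision and recall are high.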
Cross-Validation
A widely used method to understand the performance of your model.
Parameter: K, how many folds to create.
- In each round, use K − 1 folds to build the model and evaluate its performance on the remaining fold
- Repeat for K rounds
- Average the performance across the K rounds
More robust than a single training-validation split: less susceptible to an "unlucky" split.
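A minimal sketch of the K-fold loop in Python; `build_model` and `evaluate` are hypothetical placeholders for whatever model and performance metric you use:

```python
import random

def k_fold_cv(records, k, build_model, evaluate, seed=0):
    """Average validation performance across k folds (build_model/evaluate are user-supplied)."""
    records = records[:]                       # copy so we don't shuffle the caller's list
    random.Random(seed).shuffle(records)
    folds = [records[i::k] for i in range(k)]  # k roughly equal-sized folds
    scores = []
    for i in range(k):
        validation = folds[i]                  # hold out fold i this round
        training = [r for j in range(k) if j != i for r in folds[j]]
        model = build_model(training)          # fit on the other k-1 folds
        scores.append(evaluate(model, validation))
    return sum(scores) / k                     # average performance across k rounds
```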
Stopping Criteria: When to Stop Splitting Nodes?
- All data points associated with a node are from the same class: the node becomes a leaf node of that class
- There are no remaining attributes to further split the data, or no attribute increases "purity" much: use a "majority vote" to label that node
Recall_Good / Recall_Bad
Among all actually good/bad records, the percentage that the model correctly predicts
Precision_Good / Precision_Bad
Among all records predicted as good/bad, the percentage that are actually good/bad
Accuracy
Among all predictions, the percentage of correct predictions (an "overall" measure)
An ___________________________ describes a relationship between two itemsets: X → Y
Association Rule
- X and Y are two non-overlapping itemsets
- This association rule reads "if X then Y"
- X is called the antecedent, or left-hand side (LHS)
- Y is called the consequent, or right-hand side (RHS)
- In the context of shopping, this rule means "customers who buy X are likely to also buy Y"
Nominal
Categorical values are just different "names"; there is no order. EX: ID, gender, eye color, ethnicity
Ordinal
Categorical values have an implied order. EX: education/income levels; "low, medium, high"
The output of Hierarchical clustering is a graph called the ____________________________.
Dendrogram (means "tree" in Greek)
- The dendrogram contains solutions for any number of clusters you may want
- Shows which clusters were merged at what time, indicating their similarity
- Helps you visually determine the natural number of clusters
K-Means Clustering
Directly partition your data into K clusters, then try to adjust the partition to make it more appropriate.
Procedure:
- Step 1: choose K data points at random (K is user-specified); they represent the centroids (centers) of your clusters
- Step 2: assign each remaining data point to the cluster whose centroid is closest
- Step 3: update the cluster centroids based on the newly formed clusters, by finding the mean of all data points within each cluster
- Repeat Steps 2 and 3 until the algorithm converges, i.e., the clusters don't change anymore
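A minimal NumPy sketch of this procedure (random initialization and Euclidean distance assumed; not production code):

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Minimal k-means: X is an (n, d) array, k the user-specified number of clusters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # Step 1: pick k random points
    for _ in range(n_iters):
        # Step 2: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):             # converged: clusters stopped changing
            return labels, centroids
        centroids = new_centroids
    return labels, centroids
```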
Association Rule Mining
Discovering associations and correlations among items/events from transaction data.
Market Basket Analysis: which items are purchased together
- e.g., when people buy bread, they also tend to buy milk: {bread} → {milk}
Applications in many settings:
- In the store, in the catalog: place items closer together
- Product recommendation: personalization, targeted mailings
Measure Similarity between Data Points: If your data is numeric?
- Euclidean Distance
- Manhattan Distance
- Max-Coordinate Distance
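A quick NumPy illustration of all three (the two example points are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # sqrt(9 + 4 + 0) ≈ 3.61
manhattan = np.sum(np.abs(a - b))          # 3 + 2 + 0 = 5.0
max_coord = np.max(np.abs(a - b))          # largest single-coordinate gap = 3.0
```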
Exploratory (Descriptive)
Explore/discover meaningful patterns in the data. "Let the data speak."
Techniques: Association Rules; Clustering
EX: Explore/discover consumer groups in the data
Classification vs. Numeric Prediction
For a classification model, performance depends on whether the model can put data points into the correct categories For a numeric prediction model, performance depends on how close the predictions are to the actual values
Understand how to convert categorical and continuous data into binary format
- For categorical data, create one new variable for each category
- For numeric data, discretize into categorical data first, then convert to binary
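A small pandas sketch of both conversions (column names and bin edges are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "age":   [23, 47, 35]})

# Categorical: one new 0/1 variable per category
color_bin = pd.get_dummies(df["color"], prefix="color")

# Numeric: discretize into categories first, then convert those to binary
age_cats = pd.cut(df["age"], bins=[0, 30, 50], labels=["young", "older"])
age_bin = pd.get_dummies(age_cats, prefix="age")

print(pd.concat([color_bin, age_bin], axis=1))
```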
Imagine you are a store manager at Target. After working with your data team, you find out that customers who buy beer are likely to buy diapers. In other words, you find an association rule {beer} → {diapers}. Now, what should you do to increase the sales of your store?
Here are some possible strategies:
- Put diapers next to beer in your store
- Put diapers away from beer in your store (why?)
- Bundle beer and diapers into a "New Parent Coping Kit"
- Lower the price of beer, raise the price of diapers
K-Nearest-Neighbors (k-NN)
Intuition: classify a record as the majority class among its "neighbors", i.e., other records near it.
Why it works: records near each other are similar in their attributes, and similar records are likely to have the same class/label.
Procedure of k-NN:
- User picks a distance metric and a specific k
- Normalize the data if needed!
- For every unlabeled record, identify the k nearest labeled records
- Classify that unlabeled record as the majority class among the k nearest records; in case of a tie, randomly choose a class
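A minimal sketch of the classification step, assuming already-normalized numeric attributes and Euclidean distance:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k):
    """Label x_new with the majority class among its k nearest labeled records."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # distance to every labeled record
    nearest = np.argsort(dists)[:k]                  # indices of the k closest records
    votes = Counter(y_train[i] for i in nearest)
    # Note: the card says to break ties randomly; this sketch just takes the first top count
    return votes.most_common(1)[0][0]
```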
Apriori Algorithm: Pruning Infrequent Itemsets
Intuition: if an itemset X is NOT frequent, then any larger itemset containing X cannot be frequent.
First check all 1-item itemsets and keep only the frequent ones; then check the 2-item itemsets made from frequent 1-item itemsets and keep only the frequent ones; and so on.
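A compact (unoptimized) sketch of this level-wise search; `transactions` is a list of item sets and `min_support` a fraction:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining with Apriori pruning."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # Level 1: keep only the frequent 1-item itemsets
    levels = [{frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support}]
    k = 2
    while levels[-1]:
        prev = levels[-1]
        # Candidate k-itemsets: unions of frequent (k-1)-itemsets with exactly k items
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Apriori pruning: keep a candidate only if ALL its (k-1)-subsets are frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Keep only the candidates that are themselves frequent
        levels.append({c for c in candidates
                       if sum(c <= t for t in transactions) / n >= min_support})
        k += 1
    return [s for level in levels for s in level]

print(apriori([{"bread", "milk"}, {"bread", "milk"}, {"bread"}, {"milk"}], 0.5))
```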
Hierarchical Clustering
Intuition: start from individual data points or smaller clusters, then try to form larger clusters in a hierarchical manner. Also called the Bottom-Up method.
Procedure:
- Step 1: assign each data point as its own cluster, i.e., 1 cluster per data point
- Step 2: merge the 2 clusters that are nearest to each other
- Step 3: repeat Step 2 until there is only 1 cluster left, i.e., all data points are assigned to one big cluster
Distance Matrix: suppose you have N data points to cluster; the distance matrix is an N × N matrix where each element a_ij is the distance between data point i and data point j. This matrix makes it easy to find the clusters that are nearest to each other.
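For reference, SciPy implements this bottom-up procedure; a hedged sketch (single-linkage chosen arbitrarily, toy points made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.5, 4.5], [9.0, 1.0]])

Z = linkage(X, method="single")                  # repeatedly merge the two nearest clusters
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the hierarchy into 3 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree (requires matplotlib)
```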
How to Pick the "Splitting Point"?
Intuition: we want to pick the attribute that maximizes leaf node "purity" after the split.
- There are mathematical metrics that measure "purity". Use them to choose the attribute that splits the data in the most informative way.
Information Gain (Entropy)
- Suppose your data contains 70% class "1" and 30% class "0". The entropy of your data is: entropy = −[0.7 × log2(0.7) + 0.3 × log2(0.3)] ≈ 0.88
- The higher the entropy, the less "pure" your data is
- Combined entropy of a split = weighted average of the entropies of the resulting portions
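A small sketch that reproduces the 0.88 number above:

```python
import math

def entropy(proportions):
    """Entropy of a class distribution; higher entropy = less 'pure'."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.7, 0.3]))  # ≈ 0.881, matching the example above
print(entropy([1.0]))       # 0.0: a perfectly pure node
```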
Transaction data usually takes one of two formats: ________________________ or ____________________________.
Item List, Binary Matrix
An __________________ is a set of items in a particular application domain
Itemset
- 1-item itemset: {coffee}
- 2-item itemset: {coffee, tea}
- 3-item itemset: {tea, bagel, cookie}
Sum-of-Squared-Errors (SSE)
Let the centroid of each cluster be m_1, m_2, ..., m_K. For a given data point x, the error is defined as the distance to its own cluster centroid m_i, i.e., d(x, m_i). We add up the squared errors of all data points.
- Lower individual cluster SSE = a better cluster
- Lower total SSE = a better set of clusters
- More clusters will reduce SSE
- Reducing SSE within a cluster increases cohesion (what we want)
What does this situation mean: Lift(X →Y ) = 5
Lift(X → Y) = 5 > 1 means customers who buy X are 5 times as likely to buy Y as customers in general, i.e., 400% more likely (5 − 1 = 4).
Support Count
The raw count of transactions containing both X and Y, denoted supp(X → Y)
Measure Similarity between Data Points: If your data is categorical?
Matching Distance
- Intuition: number of "mismatches" divided by the total number of attributes
- Typically used for symmetric binary data, i.e., N_00 and N_11 are equally "important"
Jaccard Distance
- Intuition: exclude "matches" where a_i = 0 and b_i = 0
- Typically used for asymmetric binary data, where N_00 is not as "important" as N_11 (e.g., in a supermarket, "two people both buy X" is more informative than "neither person buys X")
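A plain-Python sketch of both distances on binary records (the two example vectors are made up):

```python
def matching_distance(a, b):
    """Fraction of attribute positions where a and b disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jaccard_distance(a, b):
    """Like matching distance, but 0-0 'matches' are ignored (asymmetric binary data)."""
    kept = [(x, y) for x, y in zip(a, b) if not (x == 0 and y == 0)]
    return sum(x != y for x, y in kept) / len(kept)

a = [1, 0, 1, 0, 0]
b = [1, 1, 0, 0, 0]
print(matching_distance(a, b))  # 2 mismatches / 5 attributes = 0.4
print(jaccard_distance(a, b))   # 2 mismatches / 3 kept positions ≈ 0.67
```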
____________________________ is an essential step before calculating distances, to make sure all attributes of your data take value from the same range
Normalization
Naïve (or Majority) Rule
Not a "real" classification method. It can serve as a benchmark against which to evaluate other methods Simply classify all records as the majorityclass EX: if your training data has 70% records with class "1" and 30% with class "0", the naive classifier simply classifies all records as class "1"
Numerical
Numerical values reflect a measurement. EX: number of "Likes", height, weight
Clustering
Organizing data points/objects (e.g., customers) into homogeneous (and, hopefully, meaningful) groups Each group is called a cluster
What is the measure for evaluation?
Sum-of-Squared-Errors (SSE)
A _____________________ is a particular itemset
Transaction (Multiple transactions comprise a dataset)
How to Build a Decision Tree?
We use an approach called Recursive Partitioning:
- Pick one of the attributes (i.e., predictor variables), X_i
- Pick a value of X_i (say s_i) that divides the training data into two (not necessarily equal) portions, one with X_i > s_i and the other with X_i ≤ s_i
- Measure how "pure" each of the resulting portions is; more "pure" ⇔ containing records of mostly one class
- Repeat the process
It is not necessary to normalize your data, because no distance is being calculated.
What can we Use Clustering for?
Understanding: discover the natural groups and patterns in your data; helps to gain insights into your data
Summarization: instead of looking at each individual data point (which can be a lot in a large dataset), you can look at each cluster and study its features
Customization: businesses can set different, customized strategies for each cluster, catering to its specific characteristics
Apriori Algorithm
With a large number of items, there could be a huge (exponential) number of itemsets to consider. The Apriori algorithm is a smart way to reduce this burden.
Predictive
You know exactly what you are looking for: predict a certain well-defined outcome using data.
Techniques: Classification
EX: Predict which group Consumer A belongs to
Support
A measure of how frequently X and Y occur together.
Intuition: X and Y must frequently occur together.
Lift
A measure of how much more likely two itemsets are to co-occur than by pure chance.
Lift(X → Y) = supp(X → Y) / (supp(X) × supp(Y))
- Here, we must use support percentages in the calculation
- supp(X) × supp(Y) is the probability of seeing X co-occur with Y by pure chance, if X and Y are actually independent of each other
Confidence
A measure of how often Y appears within transactions that contain X.
Intuition: among the transactions that contain X, many also contain Y.
Denoted conf(X → Y) and calculated as conf(X → Y) = supp(X → Y) / supp(X).
Conceptually related to the conditional probability Pr(Y|X).
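A toy example tying the three measures together (the five transactions are made up):

```python
transactions = [{"bread", "milk"}, {"bread", "milk"}, {"bread"},
                {"milk", "eggs"}, {"beer"}]
n = len(transactions)

def supp(itemset):
    """Support percentage: fraction of transactions containing the whole itemset."""
    return sum(itemset <= t for t in transactions) / n

X, Y = {"bread"}, {"milk"}
support    = supp(X | Y)                        # 2/5 = 0.4
confidence = supp(X | Y) / supp(X)              # 0.4 / 0.6 ≈ 0.67
lift       = supp(X | Y) / (supp(X) * supp(Y))  # 0.4 / (0.6 * 0.6) ≈ 1.11
```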
Decision Tree
A set of decision rules, organized as a tree, that classifies data into pre-defined classes based on attributes.
- Decision nodes contain the attribute on which the tree splits
- Leaf nodes contain the final prediction
- Read down any path from the root to a leaf to derive an IF-THEN decision rule
EX: IF (salary >= $50,000) AND (commute > 1 hour) THEN (decline offer)
What are the two data types?
categorical & numerical
Data Scientists spend most of their time _________________ data
cleaning
Noisy
Containing errors or outliers.
E.g., data entry errors; highly non-representative data (outliers).
Remove outliers when calculating summary statistics.
External data
data acquired from external sources
Internal data/In-House data
data collected and stored by an organization itself
Inconsistency
Different measurement scales, different ways to store the same value, etc.
E.g., height, phone numbers, etc.
Clustering analysis is a type of _____________________ data analytics
exploratory
- Clusters come from the data
- Once the clusters are discovered from the data, we can then describe and name them
Support Percentage
fraction of transactions containing both X and Y
How to choose k?
k is the number of nearby neighbors used to classify the new record.
- Typically choose the value of k that has the lowest error rate on validation data
- Try different k values and see which one gives you the lowest error rate (i.e., the most accurate predictions) on the validation data
Incomplete
missing some values
TP + TN + FP + FN = ?
number of all records
Dependent variables are the __________________ and independent variables are the ____________________
outcomes, attributes
Bar chart
shows the distribution of categorical variables
Histogram
shows the distribution of numerical variables
Boxplot
Shows the distribution of one numerical variable and enables side-by-side comparisons of different groups on that variable.
- Q1 and Q3 are the 25th and 75th percentiles, respectively
- The "min" and "max" are not the actual minimum and maximum; they are indicators of a "normal" range
Scatterplot matrix
shows the pair-wise relationship among multiple numerical variables
Scatterplot
shows the relationship between two numerical variables
Labeled data
The data for which you know the outcome that you are trying to predict.
Randomly split your labeled data into two parts:
1. Training data: to build your model
2. Validation/Test data: to evaluate the performance of your model
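A minimal sketch of the random split (the 70/30 fraction is an arbitrary illustrative choice):

```python
import random

def train_validation_split(records, train_fraction=0.7, seed=0):
    """Randomly split labeled records into (training, validation) lists."""
    records = records[:]                  # copy before shuffling
    random.Random(seed).shuffle(records)
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]
```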
Unlabeled data
the data for which you want to predict the outcome
Market Segmentations
the process of dividing a broad consumer or business market, normally consisting of existing and potential customers, into sub-groups of consumers based on some type of shared characteristics
How to choose the number of clusters for K-Means Clustering?
Try different K and choose the one with the best clustering results.
We can also calculate and plot the SSE for different cluster numbers (i.e., different K) and look for the "elbow" of the SSE plot.
- The elbow is a point where SSE drops a lot before it but not much after it
- SSE tends to drop as we increase the number of clusters; the "elbow" typically means we have hit the natural number of clusters
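A hedged sketch of the elbow procedure using scikit-learn (the random data is just a placeholder; `inertia_` is scikit-learn's name for the total SSE):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))  # placeholder data
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # plot SSE vs. k and look for the elbow
```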
How to find the # of potential itemsets?
With N items, there will be 2^N potential itemsets (e.g., N = 5 items gives 2^5 = 32 potential itemsets).
We use ____________ to represent the distance between A and B
d(A, B)