IDSC 4444
Association Rules mining usually makes use of...
"transactions"datasets • Example 1: All the combination of products bought by different customers • Example 2: All the combinations of symptoms in different patients • Each row of the dataset is a "transaction"
DESCRIPTIVE analytics
(exploratory methods; you don't know exactly what you are looking for) - Association Rules (traditional, a good way to ease into data analysis) - Cluster Analysis (finds groups that form naturally in the data)
PREDICTIVE analytics
(the largest class of methods in machine learning; you know exactly what you are looking for, with a clear objective) - Classification - Numeric Prediction
Distance Measures
- To measure similarity between individual data points we use Distance Measures • Different Distance Measures exist for different data types • Do not compare distances computed with different measures to each other: pick ONE measure and decide whether points are similar or not based on that measure alone
Matching Distance
Binary coding: 0 = no, 1 = yes. N00 = mutual absences (A and B both have a 0), N01 = mismatch (A = 0, B = 1), N10 = mismatch (A = 1, B = 0), N11 = mutual presences. Intuition: the number of mismatches divided by the total number of attributes k, i.e. D(A,B) = (N01 + N10) / k. Used for Symmetric Binary Data, where N00 and N11 are equally important (e.g., knowing that an individual has a driver's license is as informative as knowing they do not). Example: D(A,B) = (2 + 0) / 3 ≈ 0.67. The range is always [0, 1]: the higher the value, the more distant ("different") the data points.
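A minimal sketch in Python of how the matching distance could be computed (the vectors A and B below are hypothetical, chosen to reproduce the 2/3 example above):

```python
def matching_distance(a, b):
    """Matching distance for symmetric binary data:
    number of mismatching attributes divided by the total number of attributes k."""
    assert len(a) == len(b)
    mismatches = sum(1 for ai, bi in zip(a, b) if ai != bi)
    return mismatches / len(a)

# Hypothetical 3-attribute example: two mismatches out of k = 3 attributes
A = [1, 0, 1]
B = [0, 1, 1]
print(matching_distance(A, B))  # 2/3 ≈ 0.67
```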
Data Pre-Processing
1. Data Cleaning 2. Data Integration
Applications of Clustering
1. Discover natural groups and patterns in the data, which helps gain insights • Examples: • Marketing Analytics: Create groups of similar customers (segments) • Finance Analytics: Discover which stocks have similar price fluctuations • Health Analytics: Group patients based on how they respond to treatments 2. Facilitate the analysis of very large datasets • Instead of looking at each individual data point, we can look at each cluster and study its features
k-means clustering procedure
1. Example: let us say we set k = 2. The algorithm picks 2 data points at random; they represent the initial cluster centroids 2. Next, it assigns the other data points to the cluster centroid to which they are closest, based on a specified distance measure (usually squared Euclidean) 3. Then, it updates (re-calculates) the cluster centroids based on the new clusters 4. Repeat step 2: re-assign points to the cluster centroid to which they are closest 5. Recalculate the centroids for the updated clusters 6. Keep repeating until convergence: that is, until the point re-assignment does not change anymore • In particular: further rearranging the points does not improve the within-cluster variance • The algorithm stops when the within-cluster variance is minimized
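A minimal sketch of running k-means with scikit-learn, assuming the library is available; the data points and k = 2 are hypothetical, mirroring the example above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data points
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# k = 2 clusters; n_init repeats the algorithm with different random centroids
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster assignment for each point
print(kmeans.cluster_centers_)  # final centroids
print(kmeans.inertia_)          # total within-cluster sum of squares (WSS)
```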
Desired properties of a cluster
1. High intra-similarity 2. Low inter-similarity
Evaluate the quality and meaningfulness of the clusters obtained
1. Need a measure to decide whether a point is similar or different enough from the others 2. Are groups close to one another or not? 3. How will the clustering be done? There are many ways to cluster (Hierarchical, K-Means) 4. Stopping criteria (e.g., for K-Means: stop at a chosen number of clusters, such as 4) 5. The algorithm gives a solution: do the clusters make sense? Do they meet the objectives well enough?
Dendrogram
A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering; it is one of the main outputs of Hierarchical Clustering. It shows the cluster hierarchy: which points/clusters are merged together at the different iterations. The length of the "lines" is proportional to the distance between the clusters involved. Points joined lowest in the tree are more similar to each other than to any other point, since they were clustered together first; moving up the tree shows the next most similar merges. • We can use the dendrogram to identify clustering solutions with a chosen number of clusters • Example: we want the best 2-cluster solution • Start from the top and identify the last "bar" (clade) • Identify the two branches of the clade and the clusters they point to
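A minimal sketch of producing a dendrogram with SciPy and matplotlib, assuming both are available; the data, linkage method, and distance metric are hypothetical choices:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Hypothetical 2-D data
X = np.array([[1, 2], [2, 2], [8, 9], [9, 9], [5, 5]])

# Agglomerative clustering with average linkage and Euclidean distance
Z = linkage(X, method="average", metric="euclidean")

# Plot the dendrogram; cutting it at a chosen height gives a k-cluster solution
dendrogram(Z)
plt.xlabel("Data points")
plt.ylabel("Merge distance")
plt.show()
```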
Max-Coordinate Distance
Take the absolute differences between the coordinates in each dimension and keep the largest one (INSTEAD OF ADDING them). Considers the max among the absolute differences.
Hierarchical clustering
Agglomerative or Bottom-Up Clustering • Forming larger clusters from smaller ones • Agglomerative or Bottom-Up (HAC) • Start from the individual data points or smaller clusters, and form larger clusters in a hierarchical manner • We do not need to specify the number of clusters; the algorithm produces a hierarchy of clusters from which we can choose • We need to specify which distance measure (Euclidean, Manhattan, etc.) and which linkage method (single, average, centroid, etc.) to use
Non-comparable data
Attributes that take values in very different ranges (for example, age and income)
Scatterplots:
Best for: displaying the relationship between two continuous variables.
What is Business Analytics or Data Science for Business?
Business analytics (BA) refers to the skills, technologies, and practices for the continuous exploration and investigation of data to extract knowledge, gain insights, and drive business decisions and strategies.
Lift = 1
X and Y are completely independent (pure chance): customers who buy Bread are AS LIKELY to buy Milk as other customers
Lift > 1
Customers who buy X are ___% more likely to buy Y compared to other customers in the data. Example: customers who buy Diapers are 25% more likely to buy {Milk, Beer} than other customers. Lift > 1 means that customers who buy X are more likely to buy Y.
Unsupervised Learning Methods
Data is mined to uncover previously unknown, useful patterns without having a clear outcome in mind. They make use of unlabeled data: there is no outcome variable to predict or classify. As such, it may be harder to establish whether the method performed well or not. Examples: What items do customers frequently buy together? Do customers naturally fall into different groups based on their features (age, location, etc.)? Methods we will cover: Association Rules, Cluster Analysis
Mixed-Data
Different types of attributes (numerical, binary, etc.)
Manhattan Distance
The distance between two points is the sum of the absolute differences of their coordinates. The absolute value (or modulus) operator | | transforms any number inside it into a positive number. Used where rectilinear distance is relevant: you "go around" instead of taking a straight line. In short: take the absolute difference between each pair of features and add them together.
Clustering methods
Hierarchical Clustering, K-Means Clustering
Descriptive and Predictive Business Analytics
It requires a mixture of skills at the intersection of math, statistics, computer science, communication, and business
•An Association Rule describes the relationship between
Item-sets • X → Y is read: "If X then Y" • Example: X = {Coffee}, Y = {Bagel}; Association Rule X → Y: "If Coffee Then Bagel" • X and Y must be non-overlapping item-sets (they do not share any item in common, i.e., X intersect Y is empty) • Example: {Coffee, Milk} → {Bagel} • Any combination of items can form an item-set, as long as the two sides do not overlap.
Partitioning-based clustering
K-Means Clustering (most common approach) • Directly partition the data into k groups, where k is the number of clusters (pre-specified)
lift
LIFT: a measure of how much more likely two item-sets are to co-occur than by pure chance. Formula: Lift(X → Y) = S(X and Y) / (S(X) * S(Y)), or equivalently Confidence(X → Y) / S(Y) • Here, we must use the support percentage S() in the calculation • S(X) * S(Y) is the probability of seeing X co-occurring with Y by pure chance • So, if the numerator > denominator, the association is more likely than pure chance • Note: Lift has no direction
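A minimal sketch of the lift calculation from support percentages; the support values used are hypothetical:

```python
def lift(support_xy, support_x, support_y):
    """Lift(X -> Y) = S(X and Y) / (S(X) * S(Y)),
    equivalently Confidence(X -> Y) / S(Y)."""
    return support_xy / (support_x * support_y)

# Hypothetical supports: S(X and Y) = 0.4, S(X) = 0.6, S(Y) = 0.5
print(lift(0.4, 0.6, 0.5))  # ≈ 1.33 > 1: more likely than pure chance
```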
Min-max normalization
MIN/MAX, 0-1: rescale the attributes to have values between 0 and 1 using the min and max. A point with value X is normalized to: NewValue = (X - min) / (max - min). "Has driving license" is already between 0 and 1, so we do not need to apply Min-Max to it. When the data does not follow a normal distribution, Min-Max can be used to bring values into the 0-1 range.
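A minimal sketch of min-max normalization applied to a hypothetical list of ages:

```python
ages = [22, 35, 58, 41, 29]          # hypothetical attribute values

age_min, age_max = min(ages), max(ages)
# NewValue = (X - min) / (max - min), rescaling every value into [0, 1]
normalized = [(x - age_min) / (age_max - age_min) for x in ages]

print(normalized)  # the minimum maps to 0.0, the maximum maps to 1.0
```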
Binary Data (0-1 or data with only 2 categories)
Matching Distance, Jaccard Distance. Both are still between 0 and 1: close to 0 = similar, close to 1 = different, same interpretation as above
Jaccard Distance
A measure of dissimilarity between observations based on Jaccard's coefficient. Used for Asymmetric Binary Data, where N00 is not as important as N11: cases in which knowing the mutual presences is more important than knowing the mutual absences, so mutual absences are not taken into account. Example: grocery store transactions. Knowing that two customers did NOT buy a product is not informative, because there are so many products that most people do not buy most of them; what matters is what they do buy and where they mismatch. The mutual zeros (N00) are dropped from the calculation.
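A minimal sketch of the Jaccard distance for two hypothetical binary vectors, dropping the mutual absences (N00) from the denominator:

```python
def jaccard_distance(a, b):
    """Jaccard distance for asymmetric binary data:
    (N01 + N10) / (N01 + N10 + N11), i.e. mutual absences are excluded."""
    n01 = sum(1 for ai, bi in zip(a, b) if ai == 0 and bi == 1)
    n10 = sum(1 for ai, bi in zip(a, b) if ai == 1 and bi == 0)
    n11 = sum(1 for ai, bi in zip(a, b) if ai == 1 and bi == 1)
    return (n01 + n10) / (n01 + n10 + n11)

# Hypothetical market-basket vectors (1 = bought, 0 = not bought)
A = [1, 0, 1, 0, 0]
B = [1, 1, 0, 0, 0]
print(jaccard_distance(A, B))  # 2 / 3 ≈ 0.67
```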
Apriori Algorithm process
Once we have found all the item-sets (of any size) that are frequent, that is, the item-sets that satisfy support >= minsupp, generate all possible association rules between those frequent item-sets. • Compute the Confidence measure for all the generated association rules and check whether confidence >= minconfidence • minconfidence is the minimum threshold of confidence acceptable to the data scientist • Similarly to minsupp, minconfidence is set based on domain knowledge and goals
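A minimal sketch of this two-step workflow using the mlxtend library (assuming it is installed; the exact API can vary slightly across versions, and the transactions and thresholds are hypothetical):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical transactions dataset
transactions = [["Milk", "Bread", "Diapers"],
                ["Bread", "Beer"],
                ["Milk", "Diapers", "Beer"],
                ["Milk", "Bread", "Diapers", "Beer"],
                ["Bread", "Diapers"]]

# One-hot encode the transactions into a 0/1 DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Step 1: find frequent item-sets with support >= minsupp
frequent = apriori(onehot, min_support=0.4, use_colnames=True)

# Step 2: generate rules and keep those with confidence >= minconfidence
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```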
k-means pros and cons
Pros: • Computationally less demanding - time complexity is linear, O(n) • Very scalable Cons: • Poor initialization (initial random centroids) can lead to bad results • Not ideal for clusters with irregular (non-convex) shapes, noisy data (outliers), or clusters with different densities
Hierarchical Clustering pros and cons
Pros: • Flexible, data-driven, no need to pre-specify the number of clusters • Produces a solution for different numbers of clusters • It is good at identifying "small" clusters • It works better at identifying "weirder" cluster shapes Cons: • Computationally demanding - the time complexity of most hierarchical clustering algorithms is quadratic, i.e. O(n^2), where n is the number of data points • As such, it is not very scalable
Knowledge Discovery from Data
Selection: what data should I collect? Preprocessing: raw data is messy; structure and clean the data so the methods can be applied. Transformation: reshape the data for a specific method. Data mining: implement the methods. Interpretation: does the result make sense? The overall goal is to turn raw data into new insights.
Histograms
Show the distribution of a variable/attribute Best for: distribution of a single continuous variable May be used for: • Distribution assumption checking • Anomaly detection
To measure "similarity" For clusters (groups of points) we use the linkage methods
Single, Complete, Average, Centroid and Ward
Numerical Data • Euclidean Distance
The length of the line segment between two points: "the straight-line distance". Square the difference in each dimension, sum them, and take the square root; this extends to a third dimension, a fourth, and so on. Larger value = more different; smaller value = more similar; 0 = identical to the other data point (e.g., another customer).
Ward's Method
The objective of Ward's linkage is to minimize the within-cluster sum of squares (within-cluster variance). Ward's is the same criterion that underlies K-means: merge the two clusters whose union gives the smallest possible increase in within-cluster variance, keeping points close to their centroid. It is robust to outliers and tends to create denser clusters with spherical shapes.
Box-Plot
Used for: • Comparing groups • Outlier detection. Best for side-by-side comparisons of subgroups on a single continuous variable, e.g., displaying the distribution of a variable (Revenue) for different quarters (4 sub-groups). Q1: the value below which 25% of the values lie. Q3: the value below which 75% of the values lie. Upper whisker (Max) = Q3 + 1.5(Q3 - Q1); lower whisker (Min) = Q1 - 1.5(Q3 - Q1). Outliers: values > Max or < Min. Details may differ across software.
Types of "similarities" measures
Ways to think about similarity: based on distance and based on context (the answer can change with context). Distance: based on direct distance, one might assume points B and C are more likely to be in the same cluster. Contextual: taking the context into consideration, points A and B belong to the same cluster. Most clustering techniques deal with these two types of similarity.
How to Cluster
When the data is high-dimensional (we have a lot of attributes), we need to: Understand how to measure similarity between individual data-points and clusters
Categorical Data (More than 2 categories)
While there are specific measures that can be used if your data is (all) categorical, a practical approach consists of assigning a number to each category and treating it as a numerical variable. Example: Brown = 1, Blue = 2, Green = 3
Association Rule: If X -> Y: "If Coffee Then Bagel" ANTECEDENT (BODY)
X
Association rules are directional.
X → Y can be different from Y → X. Example: X = {Coffee, Muffin}, Y = {Bagel}. X → Y: {Coffee, Muffin} → {Bagel} IS NOT THE SAME AS Y → X: {Bagel} → {Coffee, Muffin}. They are different association rules.
Association Rule: If X -> Y: "If Coffee Then Bagel" CONSEQUENT (HEAD)
Y
Linkage Methods
agglomerative methods of hierarchical clustering that cluster objects based on a computation of the distance between them
• To measure "similarity": When data is mixed and in different ranges
apply normalization
A high confidence is a good start when comparing association rules, but confidence alone is not enough. We need to rely on lift as well, to make sure the item-sets
are not associated by pure chance (example: high-volume, frequently bought products). Even if confidence (frequency) is slightly lower, a lift > 1 tells us that when the co-occurrence happens, it is less likely to be a coincidence.
An observation or data-point in our data can be represented
as a point on a plane, as a function of the respective dimensions/attributes.
Average Linkage
The average pairwise distance between points from two different clusters: compute ALL the distances between points across the two clusters and take the average; that average IS the distance between the two clusters. (Compare with Centroid Linkage: the centroid is the mean of the cluster, computed attribute by attribute, and does not have to be an actual data point; the distance between the two centroids is the distance between the clusters.) Average Linkage and Centroid Linkage tend to produce similar results, with clusters that tend to have spherical shapes.
Text Mining
can be both descriptive and predictive: extracts information from text
Dataset
collection of data
item set
A collection of items selected from the set of items in the store; it can contain ANY combination of the existing items in the domain
Data Integration
combine different data sources
Lift < 1
customers who buy X are less likely to buy Y than other customers
Low inter-similarity
data points in different clusters should be different "enough" from each other
High intra-similarity
data points in the same cluster should be similar to each other
Centroid Linkage
The distance between the two clusters' centroids, i.e., the cluster means: compute each cluster's centroid, then compute the distance between the centroids. Average Linkage and Centroid Linkage tend to produce similar results, with clusters that tend to have spherical shapes.
Summary of distance measures (slides)
Distances: EUCLIDEAN, MANHATTAN, MAX-COORDINATE. How to interpret them: the higher the distance, the more different the two points; the lower the distance, the more similar the two points. k = number of dimensions.
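A minimal sketch of the three numerical distance measures for two hypothetical points with k = 2 dimensions:

```python
def euclidean(a, b):
    # square the coordinate differences, sum them, then take the square root
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def manhattan(a, b):
    # sum of the absolute coordinate differences
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def max_coordinate(a, b):
    # largest single absolute coordinate difference
    return max(abs(ai - bi) for ai, bi in zip(a, b))

# Hypothetical points A and B
A, B = (1, 2), (4, 6)
print(euclidean(A, B))       # 5.0
print(manhattan(A, B))       # 7
print(max_coordinate(A, B))  # 4
```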
Variable/Attribute/Feature
each column. Each column captures a feature for each observation.
Observation/Data Entry/Record
each row. Different datasets may have different units of observation. For example, each observation (or record) corresponds to a customer.
Support Percentage
The fraction of transactions containing both X and Y: SUPPORT COUNT / TOTAL NUMBER OF TRANSACTIONS. It is the empirical probability (computed from the data you have, as a frequency within the given transactions) that a transaction selected at random contains the item-set(s), which can be individual items. Support count and support percentage are both not directional. Example: • S(X) = S({Milk, Diapers}) = 3/5 = 0.6 • S(Y) = S({Coke}) = 2/5 = 0.4
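A minimal sketch of support count and support percentage on a hypothetical five-transaction dataset, chosen to reproduce the 3/5 and 2/5 values above:

```python
# Hypothetical transactions dataset (5 transactions)
transactions = [{"Milk", "Diapers", "Coke"},
                {"Milk", "Bread"},
                {"Milk", "Diapers", "Beer"},
                {"Bread", "Diapers", "Coke"},
                {"Milk", "Diapers", "Bread"}]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in the item-set."""
    return sum(1 for t in transactions if itemset <= t)

def support_pct(itemset, transactions):
    """Support count divided by the total number of transactions."""
    return support_count(itemset, transactions) / len(transactions)

print(support_pct({"Milk", "Diapers"}, transactions))  # 3/5 = 0.6
print(support_pct({"Coke"}, transactions))             # 2/5 = 0.4
```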
stopping criteria
how many clusters we should have
Association Rules •Algorithms
(how the measures are used when searching for association rules): Apriori Algorithm
Data Cleaning
identify and correct errors, and/or data inconsistencies • Addressing missing values • Identify and correct data entry errors • Identify and deal with outliers • Correct inconsistencies: • Unify units of measures, international standards
Since clustering is based on the features available in the data (so on the dimensions)
if the data is 3D or less, clustering is very straightforward: in low-dimensional spaces, clusters can even emerge from simple plots
clusters
Investigate whether data points naturally group together in an interesting way. Example of variables to exclude: states (no pattern, just names: an ID variable) and regions (a grouping we already know about, no value added, since we are looking for NEW, interesting patterns). Main idea: organizing data into its most natural groups, called clusters.
Complete Linkage
The maximum pairwise distance between points from two different clusters: compute ALL the distances between points across the two clusters and pick the maximum. The distance between the two clusters is the largest difference we can find (the far edges of the clusters). It tends to break large clusters and to create clusters with similar diameters.
standardization
Transforms the data to have a mean of 0 and a standard deviation of 1. A point with value X is standardized to: NewValue = (X - sample mean) / (sample standard deviation). Important: if using Standardization, we need to transform the Binary data as well. It can be skewed by outliers and works best when the data is roughly normally distributed. Sample standard deviation (e.g., for Age): sum the squared deviations (Age_i - sample mean)^2 over all observations, divide by the number of observations minus 1, and take the square root.
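A minimal sketch of standardization (z-scores) for a hypothetical list of ages, using the sample standard deviation (divide by n - 1):

```python
import statistics

ages = [22, 35, 58, 41, 29]    # hypothetical attribute values

mean = statistics.mean(ages)
sd = statistics.stdev(ages)    # sample standard deviation (divides by n - 1)

# NewValue = (X - sample mean) / (sample standard deviation)
standardized = [(x - mean) / sd for x in ages]
print(standardized)            # resulting values have mean ~0 and SD ~1
```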
centroid
The mean point of a cluster; its coordinates are the mean values of each dimension. It does not have to exist in the dataset.
Confidence
Measures how often the items in Y appear within transactions that contain X: conditional on knowing that a transaction has all the items in X, what is the probability that it also contains Y? • "Given we have one thing, what is the probability that we will see the other" • It is the estimated conditional probability that a randomly selected transaction includes all the items in Y, given that the transaction includes all the items in X • Confidence of X → Y does not necessarily equal the confidence of Y → X • Formula: Confidence(X → Y) = S(X and Y) / S(X), i.e., the ratio of transactions containing X and Y together to transactions containing just X; if every transaction with X also has Y, the confidence is 1 • Recall: S(X) is the probability that a randomly chosen transaction contains X (e.g., Milk and Beer); S(X and Y) is the probability that it contains X and Y together (e.g., Milk, Beer, and Bagels)
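A minimal sketch of the confidence calculation from support percentages; the values used are hypothetical:

```python
def confidence(support_xy, support_x):
    """Confidence(X -> Y) = S(X and Y) / S(X):
    the estimated probability of seeing Y, given the transaction contains X."""
    return support_xy / support_x

# Hypothetical values: S(X and Y) = 0.4, S(X) = 0.6
print(confidence(0.4, 0.6))  # ≈ 0.67

# Directional: Confidence(Y -> X) uses S(Y) in the denominator instead
print(confidence(0.4, 0.5))  # 0.8, if S(Y) = 0.5
```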
Single Linkage
The minimum pairwise distance between points from two different clusters: compute ALL the distances between points across the two clusters (using the individual-point distance measures from before) and pick the minimum one. Used, for example, to decide whether two clusters should be combined or kept separate. Single Linkage is sensitive to noise and outliers but good for "weird" cluster shapes.
Transaction
An instance of an item-set (a given combination of items). Example: a single purchase at the grocery store is a TRANSACTION; in other contexts, such as SYMPTOMS, each transaction is the collection of symptoms a patient shows at a given time • Multiple transactions comprise a dataset
support count
The raw count of transactions containing the item-sets of interest, denoted supp(X → Y): how often the items appear in the data, e.g., how many transactions have MILK AND DIAPERS. Not directional.
APRIORI ALGORITHM key idea: if an item-set X is NOT frequent
then any larger item-sets containing X cannot be frequent
To measure "similarity": For individual data points,
we use distance measures: Euclidean, Manhattan, Max-Coordinate for numerical data; Matching and Jaccard for binary data
Data Normalization
We need to transform our attributes so they take values in a common range. The objective is to eliminate specific units of measurement and transform the attributes to a common scale. The term normalization is loosely used to refer to any method that scales attributes; technically, normalization means rescaling an attribute to have values between 0 and 1. Once your data is normalized, you can apply one of the numerical distance measures.
Within-Cluster Variance Or Within Sum of Squared Errors (WSS)
• A measure of how cohesive the clusters are internally • More specifically, the WSS measures how "close" the points within a cluster are to their centroid • The lower the WSS, the closer the points within a cluster are to its centroid, and the more cohesive the cluster is within == High Intra-Similarity • Within-cluster variance is the sum of squared distances between the cluster centroid and the cluster's points • When comparing two clusters (e.g., in Ward's method), compare the within-cluster variance obtained if the two clusters were merged into one to the sum of the within-cluster variances of the two kept separate • Assessing cluster "quality": for each data point x in the cluster, the error is the (Euclidean) distance to its own cluster center, d(x, m); sum the squared errors for all data points in the cluster to get the cluster's WSS; repeat for every cluster; sum the results across clusters to get the Total WSS (TWSS) • The lower the (T)WSS, the more cohesive the clusters are within == High Intra-Similarity
How to Find Association Rules
• All the measures introduced before serve to assess how meaningful an association rule is • But how do we generate all the rules that would be "good" candidates? • In an ideal world: look at all possible combinations of items in your dataset • Problem: with a large number of items, there is a huge (exponential) number of item-sets to consider • In fact, with N items, there are 2^N - 1 potential (non-empty) item-sets • Example: consider a dataset of 20 items. There would be 2^20 - 1 (over a million) different item-sets to consider! • Solution: only consider combinations of items that occur with higher frequency in the dataset, i.e., the frequent item-sets
Choosing the Number of Clusters
• Clustering is exploratory in nature; there is no "right" number of clusters - BUT there are bad solutions and good solutions • It depends on needs and it is subjective • May be informed by business goals and domain knowledge • Nevertheless, there are some tools we can use to make a "reasonable" decision • With hierarchical clustering, the Dendrogram can help • Sometimes clusters may emerge "naturally" by looking at the dendrogram • Analyze the candidate clustering solution by looking at the features that characterize each cluster
Real Association or Coincidence?
• Consider the following situation: in a supermarket, 90% of all customers buy bread, and 95% of all customers buy milk • By pure chance, about 85% (0.9 * 0.95 = 0.855) of customers would buy bread and milk together, simply because those are basic grocery items used frequently by households (high-volume products that people buy regardless, not because they are associated) • As such, the association rule Bread → Milk may have strong confidence even if there is no real association between the two • We need a metric to assess that: Lift
Between Sum of Squared Errors (BSS): Assessing Cluster's "Quality"
• Define m* as the centroid of the whole dataset • For each cluster, compute the (Euclidean) distance of the cluster's centroid from m*, square it, and weight it by n_i, the number of points in that cluster • Sum the results for all the clusters to obtain the BSS • The higher the BSS, the larger the distance of each cluster centroid from the data centroid, and the more the clusters are separated from each other == Low Inter-Similarity
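A minimal sketch of computing the Total WSS and the BSS for a hypothetical k-means solution, assuming NumPy and scikit-learn are available:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data and a 2-cluster solution
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
              [8.0, 8.0], [9.0, 11.0], [8.5, 9.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

labels, centroids = km.labels_, km.cluster_centers_
overall_centroid = X.mean(axis=0)   # m*: centroid of the whole dataset

# Total WSS: squared Euclidean distance of each point to its own centroid
twss = sum(np.sum((X[labels == k] - centroids[k]) ** 2)
           for k in range(km.n_clusters))

# BSS: squared distance of each cluster centroid to m*, weighted by cluster size
bss = sum(np.sum(labels == k) * np.sum((centroids[k] - overall_centroid) ** 2)
          for k in range(km.n_clusters))

print(twss)  # matches km.inertia_; lower = more cohesive clusters
print(bss)   # higher = more separated clusters
```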
Association Rule Mining
• Discovering interesting relationships among items/events/variables • Find out which items predict the occurrence of other items • Also known as Affinity Analysis or Market Basket Analysis because it was born in Marketing Analytics • Used to find out which products tend to be purchased together • Other example: • Healthcare, analyze which symptoms and illnesses manifest together BEER AND DIAPERS
Elbow Plot
• We can implement the clustering algorithm (e.g., k-means) for different values of k, where k is how many clusters we want • Example: varying k from 1 to 10. For each k, compute the Total Within Sum of Squared Errors (WSS) • Plot the WSS as a function of the number of clusters k • Pick the number of clusters that corresponds to the "Elbow point": the point where the curve bends • While the Elbow plot mostly relies on the WSS, we can also look at the BSS: the higher the BSS, the lower the inter-similarity • In that case, for each k compute both WSS and BSS and plot them as functions of k
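A minimal sketch of an elbow plot, varying k and recording the total WSS (scikit-learn's inertia_), on hypothetical data and assuming scikit-learn and matplotlib are available:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Hypothetical data: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])

# For each candidate k, fit k-means and record the total WSS
ks = range(1, 11)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]

plt.plot(ks, wss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Total within-cluster sum of squares (WSS)")
plt.show()   # the "elbow" is where the curve bends
```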
a lift < 1 does give an indication of a "negative"association, is it as informative as a lift > 1?
• Example. Think of the healthcare context. We find that two symptoms are LESS likely to occur together (lift < 1). Is this more important than knowing which symptoms DO occur together?
Hierarchical clustering procedure
• First, we need to compute the distance between points based on the distance measure of choice, and create a Distance Matrix • Distance Matrix: a square matrix containing the distances, taken pairwise, between the data points in your data • The distance matrix is fed to the algorithm and used to decide which points to cluster together • The algorithm considers each data point individually, as its own 1-point cluster • It merges the two 1-point clusters that are nearest to each other (based on the distance matrix) and forms a new cluster • It continues by merging the next two points or clusters (of any size) closest to each other (based on the distance measure and linkage method selected) • It repeats the process until there is only 1 cluster left (all data points assigned to one big cluster)
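A minimal sketch of this procedure with SciPy, assuming it is available; the data, distance measure, and linkage method are hypothetical choices:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D data
X = np.array([[1, 2], [2, 2], [8, 9], [9, 9], [5, 5]])

# Distance matrix: pairwise Euclidean distances between all points
dist = pdist(X, metric="euclidean")
print(squareform(dist))            # square form, for inspection

# Agglomerative merging using the pairwise distances and average linkage
Z = linkage(dist, method="average")

# Cut the hierarchy to obtain, e.g., a 2-cluster solution
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```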
Data Visualization
• Graphical representation of information and data • Used before, while and after implementing Data-Mining methods • Descriptive graphs/plots, used to display general data patterns, summary statistics, usually before applying data-mining methods • Common plots include: histograms, box-plots, scatterplots, etc.. • Specific graphs produced to visualize intermediate and final results generated by data-mining methods
"desirable" Clusters should have
• High intra-similarity (the lower the WSS, the more cohesive the clusters are within) • Low inter-similarity (the higher the BSS, the larger the distance of each cluster centroid from the data centroid, and the more the clusters are separated from each other)
steps to find association rules
• How frequently should an item-set appear, to be considered a frequent item-set? • We (the data scientists) need to specify a minimum threshold. • Specify the minimum support (minsupp) • frequent item-sets are the item-sets for which support >= minsupp • How to decide on the minsupp? Based on domain knowledge or business goals
Why do we need algorithms?
• If the data has 3 dimensions or fewer, clustering is very straightforward • Observation/Data point: each row • Variable/Feature/Dimension: each column that captures a feature (or dimension) • A 2D dataset has 2 dimensions, so 2 features; a 3D dataset has 3 dimensions, and so on • With 3 dimensions or fewer, clustering is straightforward; with any more, it becomes complex and we need algorithms
How to Deal with Some of the Issues of K-Means
• Pre-process the data: • Deal with missing values • Remove outliers (reduce the dispersion of the data points by removing atypical data) • Normalize the data (preserve the differences in the data while dampening the effect of variables with wider ranges) • How to choose the initial random centroids: while there is no single best way to choose the initial points, data scientists usually perform multiple runs: repeat the clustering using different random starting points and see how the result changes
Take Action in regard to the Apriori algorithm, with a Caveat
• Remember: Association Rules are exploratory in nature. • They provide some initial directions to work on. • Setting specific business strategies requires domain expertise and more careful analysis and testing
Apriori Algorithm definition
• Still, if we have a large dataset with many items, checking whether each item-set is frequent, that is checking that support >= minsupp, can take forever • Many techniques have been proposed to reduce the computational burden • The classic algorithm is the Apriori Algorithm
Association Rules •Measures (commonly computed metrics)
• Support Count • Support Percentage • Confidence • Lift
Data Transformation
• Transform, reduce or discretize your data -strongly depends on the context and type of analysis • Examples: • Normalization or Standardization • Scale values to a common range • Attribute/Variable Construction • Calculate new attribute or variable based on other observed attributes
Cluster analysis is an exploratory tool
• Useful when it produces meaningful clusters • Be aware of chance results: the data may not have definite "real" clusters • An "optimal" or "best" clustering solution is not guaranteed - BUT there are good solutions and bad solutions
N-Dimensional Spaces
• We usually have to deal with N-Dimensional Spaces! • In other words, our dataset will likely have more than 3 dimensions (attributes) • Generally, a dataset of N columns-features and M rows (records) can be considered as having M observations of N dimensions (M times N Space)
assess k-means clusters
• Within Sum of Squared Errors (WSS) • Between Sum of Squared Errors (BSS)
Once Confident in strong association rule, Take Action
• You find an association rule {Beer} → {Diapers} and conclude it is strong enough. Now what? Possible Marketing Actions: • Put diapers next to beer in your store • Or, put diapers away from beer in your store • Bundle beer and diapers in a "new parent coping kit" • Lower the price of diapers, raise it on beer