MGT 4500 Final Exam

Confidence is ________________________. conf(X → Y) may not equal conf(Y → X), although both have the same support!

Directional

Support vs. Confidence

- Both are measures of how strong a relationship is
- Support is needed to calculate confidence
- Support often serves as a "filter" for finding candidate association rules (i.e., two itemsets need to occur frequently enough before we consider them)

Partition-Based Method

- Directly partition the data into K groups, where K is the desired number of clusters
- K-Means Clustering

Hierarchical Method

- Form larger clusters from smaller clusters
- Hierarchical Clustering

Q1: What kinds of patterns (output)?
Q2: From what types of data (input)?
Q3: How to find these patterns (technique)?
Q4: What counts as meaningful (interpretation and evaluation, decision-making)?

1. Association between objects
2. Transaction data
3. Support, Confidence, Apriori Algorithm
4. Lift measure + subjective managerial judgement

What are the two broad types of predictive analytics?

1. Classification: predicts categorical labels of the outcome variable. Techniques: K-Nearest-Neighbors, Naive Bayes, Decision Trees, etc.
2. Numeric Prediction: predicts continuous/numeric values of the outcome variable. Techniques: K-Nearest-Neighbors, Regression Trees, etc.

We want clustering results to have which two properties?

1. High intra-similarity: data points in the same cluster should be similar to each other
2. Low inter-similarity: data points in different clusters should be different from each other

What are the 3 Measures of Central Tendency?

1. Mean: can be heavily influenced by outliers
2. Median: the value in the middle after sorting; 50% of values are larger/smaller than the median
3. Mode: the most frequent value(s)

What are the 2 measures of dispersion?

1. Range: max − min
2. Variance: intuition: how far away values are from the mean
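A minimal sketch computing these central-tendency and dispersion measures with Python's statistics module (the values are made up):

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]

print(statistics.mean(values))       # mean: sensitive to outliers
print(statistics.median(values))     # median: middle value after sorting
print(statistics.mode(values))       # mode: most frequent value
print(max(values) - min(values))     # range: max - min
print(statistics.pvariance(values))  # variance: avg squared distance from the mean
```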

Min-Max Normalization

A data point with value x on this attribute should be normalized to z = (x − min) / (max − min). The normalized value always falls within [0, 1]
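A minimal sketch of this formula (the attribute values are made up):

```python
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    # z = (x - min) / (max - min); every z falls within [0, 1]
    return [(x - lo) / (hi - lo) for x in values]

print(min_max_normalize([10, 20, 30, 50]))  # [0.0, 0.25, 0.5, 1.0]
```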

F-measure

A special-purpose measure to combine precision and recall
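The most common form is the F1 score, the harmonic mean of precision and recall; a minimal sketch:

```python
def f1_score(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.8, 0.5))  # 0.615...
```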

Cross-Validation

A widely used method to understand the performance of your model.
- Parameter: K, how many folds to create
- In each round, use K − 1 folds to build the model, then evaluate its performance on the remaining 1 fold
- Repeat for K rounds
- Average performance across the K rounds
- More robust than a single training-validation split: less susceptible to an "unlucky" split
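A minimal sketch of the procedure; `build_model` and `evaluate` are hypothetical stand-ins for whatever model and performance metric you use:

```python
import random

def cross_validate(data, K, build_model, evaluate):
    data = data[:]
    random.shuffle(data)                    # random fold assignment
    folds = [data[i::K] for i in range(K)]  # K roughly equal folds
    scores = []
    for i in range(K):
        validation = folds[i]               # hold out 1 fold
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        model = build_model(training)       # build on the other K-1 folds
        scores.append(evaluate(model, validation))
    return sum(scores) / K                  # average across K rounds
```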

Stopping Criteria: When to Stop Splitting Nodes?

- All data points associated with a node are from the same class: the node becomes a leaf node of that class
- There are no remaining attributes to further split the data, or no attribute further increases "purity" much: use a "majority vote" to label that node

Recall_Good / Recall_Bad

Among all actually good/bad records, the percentage that the model correctly predicts

Precision_Good / Precision_Bad

Among all predictions of good/bad, the percentage of correct predictions of good/bad

Accuracy

Among all predictions, the percentage of correct predictions (an "overall" measure)

An ___________________________ describes a relationship between two itemsets: X → Y

Association Rule
- X and Y are two non-overlapping itemsets
- This association rule reads "if X then Y"
- X is called the antecedent, or left-hand side (LHS)
- Y is called the consequent, or right-hand side (RHS)
- In the context of shopping, this rule means "customers who buy X are likely to also buy Y"

Nominal

Categorical values are just different "names"; there is no order. EX: ID, gender, eye color, ethnicity

Ordinal

Categorical values have an implied order. EX: Education/income levels; "low, medium, high"

The output of Hierarchical clustering is a graph called the ____________________________.

Dendrogram ("dendron" means "tree" in Greek)
- The dendrogram contains solutions for any number of clusters you may want
- It shows which clusters were merged at what time, indicating their similarity
- It helps you visually determine the natural number of clusters

K-Means Clustering

Directly partition your data into K clusters, then try to adjust the partition to make it more appropriate.
Procedure (sketched below):
- Step 1: choose K data points at random, where K is user-specified; they represent the centroids (centers) of your clusters
- Step 2: assign the remaining data points to the cluster to which they are closest
- Step 3: update the cluster centroids based on the newly formed clusters, by finding the mean of all data points within each cluster
- Repeat Steps 2 and 3 until the algorithm converges, i.e., the clusters don't change anymore
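A minimal sketch of this procedure for 2-D points, assuming Euclidean distance (the usage data is made up):

```python
import math
import random

def k_means(points, K, max_iter=100):
    centroids = random.sample(points, K)             # Step 1: random initial centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(K)]
        for p in points:                             # Step 2: assign to nearest centroid
            i = min(range(K), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        new_centroids = [                            # Step 3: recompute cluster means
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:               # converged: clusters don't change
            break
        centroids = new_centroids
    return clusters, centroids

points = [(1, 1), (1, 2), (8, 8), (9, 8)]
print(k_means(points, K=2))
```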

Association Rule Mining

Discovering associations and correlations among items/events from transaction data.
Market Basket Analysis: which items are purchased together, e.g., when people buy bread, they also tend to buy milk: {bread} → {milk}
Applications in many settings:
- In the store, in the catalog: place items closer together
- Product recommendation: personalization, targeted mailings

Measure Similarity between Data Points: If your data is numeric?

- Euclidean Distance
- Manhattan Distance
- Max-Coordinate Distance
All three are sketched below.
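Minimal sketches of the three measures for two points given as equal-length coordinate lists:

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def max_coordinate(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

print(euclidean([0, 0], [3, 4]))       # 5.0
print(manhattan([0, 0], [3, 4]))       # 7
print(max_coordinate([0, 0], [3, 4]))  # 4
```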

Exploratory (Descriptive)

Explore/discover meaningful patterns in the data. "Let the data speak" Association Rules; Clustering EX: Explore/discover consumer groups in the data

Classification vs. Numeric Prediction

For a classification model, performance depends on whether the model can put data points into the correct categories For a numeric prediction model, performance depends on how close the predictions are to the actual values

Understand how to convert categorical and continuous data into binary format

For categorical data, create one new variable for each category.
For numeric data, discretize into categorical data first, then convert to binary.
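A minimal sketch of both conversions; the category lists and age bins are made up for illustration:

```python
def one_hot(value, categories):
    # categorical -> binary: one new 0/1 variable per category
    return {f"is_{c}": int(value == c) for c in categories}

def discretize_age(age):
    # numeric -> categorical first, then convert to binary
    bucket = "young" if age < 30 else "middle" if age < 60 else "senior"
    return one_hot(bucket, ["young", "middle", "senior"])

print(one_hot("red", ["red", "green", "blue"]))
# {'is_red': 1, 'is_green': 0, 'is_blue': 0}
print(discretize_age(45))
# {'is_young': 0, 'is_middle': 1, 'is_senior': 0}
```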

Imagine you are a store manager at Target. After working with your data team, you find out that customers who buy beer are likely to buy diapers. In other words, you find an association rule {beer} → {diaper}. Now, what should you do to increase the sales of your store?

Here are some possible strategies:
- Put diapers next to beer in your store
- Put diapers away from beer in your store (why?)
- Bundle beer and diapers into a "New Parent Coping Kit"
- Lower the price of beer, raise the price of diapers

K-Nearest-Neighbors (k-NN)

Intuition: classify a record as the majority class among its "neighbors", i.e., other records near it.
Why it works: records near each other are similar in their attributes, and similar records are likely to have the same class/label.
Procedure of k-NN (sketched below):
- User picks a distance metric and a specific k
- Normalize the data if needed!
- For every unlabeled record, identify the k nearest labeled records
- Classify that unlabeled record as the majority class among the k nearest records; in case of a tie, randomly choose a class
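A minimal sketch of the procedure; labeled records are (point, label) pairs, Euclidean distance is assumed, and the data is made up (note this sketch breaks ties by first-encountered class rather than at random):

```python
import math
from collections import Counter

def knn_classify(labeled, new_point, k):
    # identify the k nearest labeled records
    nearest = sorted(labeled, key=lambda r: math.dist(r[0], new_point))[:k]
    # majority vote among the k neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

labeled = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_classify(labeled, (2, 2), k=3))  # "A"
```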

Apriori Algorithm: Pruning Infrequent Itemsets

Intuition: if an itemset X is NOT frequent, then any larger itemset containing X cannot be frequent.
First check all 1-item sets and keep only the frequent ones; then check 2-item sets made from frequent 1-item sets and keep only the frequent ones; and so on (sketched below).
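A minimal sketch of this level-wise pruning; support is measured as a raw count and the baskets are made up:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support_count):
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([i]) for t in transactions for i in t}  # all 1-item sets
    frequent, size = [], 1
    while current:
        # keep only itemsets that occur in enough transactions
        kept = [c for c in current
                if sum(c <= t for t in transactions) >= min_support_count]
        frequent.extend(kept)
        size += 1
        # build next-size candidates only by combining frequent smaller itemsets
        current = {a | b for a, b in combinations(kept, 2) if len(a | b) == size}
    return frequent

baskets = [{"bread", "milk"}, {"bread", "milk", "eggs"}, {"milk", "eggs"}]
print(frequent_itemsets(baskets, min_support_count=2))
```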

Hierarchical Clustering

Intuition: starting from individual data points or smaller clusters, try to form larger clusters in a hierarchical manner. Also called the Bottom-Up method.
Procedure (sketched below):
- Step 1: assign each data point as its own cluster, i.e., 1 cluster per data point
- Step 2: merge the 2 clusters that are nearest to each other
- Step 3: repeat Step 2 until there is only 1 cluster left, i.e., all data points are assigned to one big cluster
Distance Matrix: suppose you have N data points to cluster; the distance matrix is an N × N matrix where each element a_ij is the distance between data point i and data point j. This matrix makes it easy to find the clusters that are nearest to each other.
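A minimal sketch using SciPy's hierarchical clustering utilities, assuming scipy and matplotlib are installed (the points are made up):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

points = [[1, 1], [1.5, 1], [5, 5], [5, 5.5], [9, 1]]
Z = linkage(points, method="single")  # repeatedly merge the nearest clusters
dendrogram(Z)                         # the full merge history, drawn as a tree
plt.show()
```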

How to Pick the "Splitting Point"?

Intuition: we want to pick the attribute that maximizes leaf node "purity" after the split. There are mathematical metrics that measure "purity"; use them to choose the attribute that splits the data in the most informative way.
Information Gain (Entropy): suppose your data contains 70% class "1" and 30% class "0"; the entropy of your data is:
entropy = −[0.7 × log2(0.7) + 0.3 × log2(0.3)] ≈ 0.88
The higher the entropy, the less "pure" your data is. The combined entropy of a split is the weighted average of the entropies of the resulting portions (sketched below).
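A minimal sketch of the entropy calculation above and the weighted-average "combined" entropy of a candidate split:

```python
import math

def entropy(class_proportions):
    return -sum(p * math.log2(p) for p in class_proportions if p > 0)

print(round(entropy([0.7, 0.3]), 2))  # 0.88, matching the example above

def combined_entropy(partitions):
    # weighted average of each partition's entropy, weighted by its size
    n = sum(len(p) for p in partitions)
    def proportions(labels):
        return [labels.count(c) / len(labels) for c in set(labels)]
    return sum(len(p) / n * entropy(proportions(p)) for p in partitions)

# the two portions produced by one candidate split (toy labels)
print(combined_entropy([["1", "1", "1", "0"], ["0", "0"]]))  # about 0.54
```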

Transaction data usually takes one of two formats: ________________________ or ____________________________.

Item List, Binary Matrix

An __________________ is a set of items in a particular application domain

Itemset
- 1-item itemset: {coffee}
- 2-item itemset: {coffee, tea}
- 3-item itemset: {tea, bagel, cookie}

Sum-of-Squared-Errors (SSE)

Let's say the centroids of the clusters are m1, m2, ..., mK. For a given data point x, the error is defined as the distance to its own cluster center m_i, i.e., d(x, m_i). We add up the squared errors of all data points (sketched below).
- Lower individual cluster SSE = a better cluster
- Lower total SSE = a better set of clusters
- More clusters will reduce SSE
- Reducing SSE within a cluster increases cohesion (what we want)
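A minimal sketch of the total SSE for a set of clusters, each paired with its centroid (Euclidean distance assumed; the numbers are made up):

```python
import math

def sse(clusters, centroids):
    total = 0.0
    for cluster, m in zip(clusters, centroids):
        # squared distance of each point to its own cluster center
        total += sum(math.dist(x, m) ** 2 for x in cluster)
    return total

clusters = [[(1, 1), (2, 2)], [(8, 8), (9, 9)]]
centroids = [(1.5, 1.5), (8.5, 8.5)]
print(sse(clusters, centroids))  # 2.0
```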

What does this situation mean: Lift(X →Y ) = 5

Lift(X → Y) > 1 means customers who buy X are more likely to buy Y than customers in general; Lift = 5 means they are 5 times as likely, i.e., 400% (5 − 1 = 4) more likely.

Support Count

raw count of transactions containing both X and Y, denoted as supp(X →Y)

Measure Similarity between Data Points: If your data is categorical?

Matching Distance
- Intuition: number of "mismatches" divided by the total number of attributes
- Typically used for symmetric binary data, i.e., N00 and N11 are equally "important"
Jaccard Distance
- Intuition: exclude the "matches" where a_i = 0 and b_i = 0
- Typically used for asymmetric binary data, where N00 is not as "important" as N11 (e.g., in a supermarket, "two people both buy X" is more informative than "neither person buys X")
Both are sketched below.
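Minimal sketches of the two measures for binary records of equal length:

```python
def matching_distance(a, b):
    # number of mismatches / total number of attributes
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jaccard_distance(a, b):
    # exclude positions where both records are 0 (the 0-0 "matches")
    considered = [(x, y) for x, y in zip(a, b) if x or y]
    return sum(x != y for x, y in considered) / len(considered)

a, b = [1, 0, 1, 0, 0], [1, 1, 0, 0, 0]
print(matching_distance(a, b))  # 0.4     (2 mismatches out of 5 attributes)
print(jaccard_distance(a, b))   # 0.666...(2 mismatches out of 3 considered)
```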

____________________________ is an essential step before calculating distances, to make sure all attributes of your data take values from the same range

Normalization

Naïve (or Majority) Rule

Not a "real" classification method. It can serve as a benchmark against which to evaluate other methods Simply classify all records as the majorityclass EX: if your training data has 70% records with class "1" and 30% with class "0", the naive classifier simply classifies all records as class "1"

Numerical

Numerical values reflect a measurement. EX: number of "Likes", height, weight

Clustering

Organizing data points/objects (e.g., customers) into homogeneous (and, hopefully, meaningful) groups Each group is called a cluster

What is the measure for evaluating clustering results?

Sum-of-Squared-Errors (SSE)

A _____________________ is a particular itemset

Transaction (Multiple transactions comprise a dataset)

How to Build a Decision Tree?

We use an approach called Recursive Partitioning (sketched below):
- Pick one of the attributes (i.e., predictor variables) X_i
- Pick a value of X_i (say s_i) that divides the training data into two (not necessarily equal) portions, one with X_i > s_i and the other with X_i ≤ s_i
- Measure how "pure" each of the resulting portions is; more "pure" ⇔ containing records of mostly one class
- Repeat the process
It is not necessary to normalize your data, because no distance is being calculated.
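A minimal sketch of one round of this search: scan candidate split values s_i for a single numeric attribute and keep the split with the lowest combined entropy (records are (x_value, class_label) pairs; toy data):

```python
import math

def entropy_of(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_split(records):
    best = None
    for s in sorted({x for x, _ in records})[:-1]:  # candidate split points
        left = [c for x, c in records if x <= s]    # portion with x <= s
        right = [c for x, c in records if x > s]    # portion with x > s
        combined = (len(left) * entropy_of(left)
                    + len(right) * entropy_of(right)) / len(records)
        if best is None or combined < best[1]:
            best = (s, combined)
    return best  # (split value, combined entropy after the split)

records = [(1, "0"), (2, "0"), (7, "1"), (9, "1")]
print(best_split(records))  # (2, 0.0): both portions are perfectly pure
```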

What can we Use Clustering for?

Understanding: discover the natural groups and patterns in your data; helps to gain insights into your data
Summarization: instead of looking at each individual data point (which can be a lot in a large dataset), you can look at each cluster and study its features
Customization: businesses can set different, customized strategies for each cluster, catering to its specific characteristics

Apriori Algorithm

With a large number of items, there could be a huge (exponential) number of itemsets to consider. The Apriori algorithm is a smart way to reduce this burden.

Predictive

You know exactly what you are looking for (predict a certain well-defined outcome using data). Classification. EX: Predict which group Consumer A belongs to

Support

a measure of how frequently X and Y occur together Intuition: X and Y must frequently occur together

Lift

a measure of how much more likely two itemsets are to co-occur than by pure chance
Lift(X → Y) = supp(X → Y) / (supp(X) × supp(Y))
- Here, we must use support percentages in the calculation
- supp(X) × supp(Y) is the probability of seeing X co-occur with Y by pure chance, if X and Y are actually independent of each other

Confidence

a measure of how often Y appears within transactions that contain X
- Intuition: among the transactions that contain X, many also contain Y
- Denoted as conf(X → Y)
- Calculated as: conf(X → Y) = supp(X → Y) / supp(X)
- Conceptually related to the conditional probability Pr(Y|X)
A sketch computing support, confidence, and lift together follows.
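A minimal sketch computing support, confidence, and lift for a rule X → Y from a made-up list of transactions:

```python
def supp(itemset, transactions):
    # support percentage: fraction of transactions containing the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

transactions = [{"beer", "diaper"}, {"beer", "diaper", "milk"},
                {"beer"}, {"milk"}, {"diaper", "milk"}]
X, Y = {"beer"}, {"diaper"}

support = supp(X | Y, transactions)           # supp(X -> Y) = 0.4
confidence = support / supp(X, transactions)  # 0.4 / 0.6 = 0.667
lift = support / (supp(X, transactions) * supp(Y, transactions))

print(support, round(confidence, 3), round(lift, 2))  # 0.4 0.667 1.11
```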

Decision Tree

a set of decision rules, organized as a tree
- Classifies data into pre-defined classes based on attributes
- Decision nodes contain the attribute on which the tree splits
- Leaf nodes contain the final prediction
- Read down any path from the root to a leaf to derive an IF-THEN decision rule
EX: IF (salary >= $50,000) AND (commute > 1 hour) THEN (decline offer)

What are the two data types?

categorical & numerical

Data Scientists spend most of their time _________________ data

cleaning

Noisy

containing errors or outliers. E.g., data entry errors; highly non-representative data (outliers). Remove outliers when calculating summary statistics.

External data

data acquired from external sources

Internal data/In-House data

data collected and stored by an organization itself

Inconsistency

different measurement scales, different ways to store the same value, etc. E.g., height, phone numbers

Clustering analysis is a type of _____________________ data analytics

exploratory
- Clusters come from the data
- Once the clusters are discovered from the data, we can then describe and name them

Support Percentage

fraction of transactions containing both X and Y

How to choose k?

k is the number of nearby neighbors used to classify the new record.
- Typically choose the value of k that has the lowest error rate on the validation data
- Try different k values and see which one gives you the lowest error rate (i.e., the most accurate predictions) on the validation data

Incomplete

missing some values

TP + TN + FP + FN = ?

number of all records

Dependent variables are the __________________ and independent variables are the ____________________

outcomes, attributes

Bar chart

shows the distribution of categorical variables

Histogram

shows the distribution of numerical variables

Boxplot

shows the distribution of one numerical variable and enables side-by-side comparisons of different groups on that variable. Q1 and Q3 are the 25th and 75th percentiles, respectively. The "min" and "max" whiskers are not the actual minimum and maximum; they indicate a "normal" range.

Scatterplot matrix

shows the pair-wise relationship among multiple numerical variables

Scatterplot

shows the relationship between two numerical variables

Labeled data

the data for which you know the outcome that you are trying to predict. Randomly split your labeled data into two parts:
1. Training data: to build your model
2. Validation/Test data: to evaluate the performance of your model

Unlabeled data

the data for which you want to predict the outcome

Market Segmentations

the process of dividing a broad consumer or business market, normally consisting of existing and potential customers, into sub-groups of consumers based on some type of shared characteristics

How to choose the number of clusters for K-Means Clustering?

Try different K and choose the one with the best clustering results. We can also calculate and plot the SSE for different cluster numbers (i.e., different K) and look for the "elbow" of the SSE plot (sketched below). The elbow is the point where SSE drops a lot before it but not much after it. SSE tends to drop as we increase the number of clusters; the "elbow" typically means we have hit the natural number of clusters.
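A minimal sketch of the elbow plot, assuming scikit-learn and matplotlib are installed (the data points are made up; `inertia_` is scikit-learn's name for the SSE):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8], [15, 1], [15, 2]]
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 7)]
plt.plot(range(1, 7), sse, marker="o")  # look for the "elbow" in this curve
plt.xlabel("K (number of clusters)")
plt.ylabel("SSE")
plt.show()
```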

How to find the # of potential itemsets?

with N items, there will be 2^𝑁 potential itemsets

We use ____________ to represent the distance between A and B

𝑑(𝐴,𝐵)

