ISM Final Exam


Dendrograms

- A treelike diagram that summarizes the process of clustering - At the bottom are the records - Similar records are joined by lines whose vertical length reflects the distance between the records - For any given number of clusters we want, we can determine the records in the clusters by sliding a horizontal line up and down until the number of vertical intersections of the horizontal line equals the number of clusters desired.

Average Distance

- Also called average linkage - Distance between two clusters is the average of all possible pairwise distances between records in one cluster and records in the other - Average(distance(Ai, Bj)), i=1,2,...,m, j=1,2,...,n.

Maximum Distance

- Also called complete linkage - Distance between two clusters is the distance between the pair of records Ai and Bj that are farthest from each other - Max(distance(Ai, Bj)), i=1,2,...,m, j=1,2,...,n.

Text Table

- Also called cross-tabs or pivot tables - First placing one dimension on the Rows shelf and another dimension on the Columns shelf. - Then complete the view by dragging one or more measures to Text on the Marks card.

Minimum Distance

- Also called single linkage - Distance between two clusters is the distance between the pair of records Ai and Bj that are closest to each other - Min(distance(Ai, Bj)), i=1,2,...,m, j=1,2,...,n.
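The minimum, maximum, and average distance definitions can be compared side by side. The sketch below assumes Euclidean distance between records and uses two small made-up clusters; the function names are illustrative only.

```python
from itertools import product
from math import dist

def pairwise_distances(cluster_a, cluster_b):
    """All m*n Euclidean distances between records Ai and Bj."""
    return [dist(a, b) for a, b in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    """Minimum distance: the closest pair of records."""
    return min(pairwise_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    """Maximum distance: the farthest pair of records."""
    return max(pairwise_distances(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    """Average distance: the mean over all pairs."""
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)

# Hypothetical clusters, each with two records measured on two variables.
A = [(0.0, 0.0), (0.0, 1.0)]
B = [(3.0, 0.0), (4.0, 0.0)]

print(single_linkage(A, B))    # closest pair: (0,0) and (3,0)
print(complete_linkage(A, B))  # farthest pair: (0,1) and (4,0)
print(average_linkage(A, B))   # mean of the four pairwise distances
```

The three measures always satisfy minimum <= average <= maximum, which is why the choice of linkage can change which clusters are merged first.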

Summary - Classification Tree

- An easily understandable and transparent method for predicting or classifying new records - A graphical representation of a set of rules - Trees must be pruned to avoid over-fitting of the training data - Trees usually require large samples, because trees do not make any assumptions about the data structure

Pruning

- CART lets the tree grow to its full extent, then prunes it back - The idea is to find the point at which the validation error begins to rise - Generate successively smaller trees by pruning leaves - At each pruning stage, multiple trees are possible

Examples of Clustering applications

- Chemistry - Periodic table of the elements - Biology - Classification of species - Grouping securities in portfolios - Grouping firms for structural analysis of the economy - Army uniform sizes - Land use - Grouping areas of similar land use in an earth observation database - City planning - Grouping houses by house type, value, and location - Marketing - Help discover distinct groups of customers and use this knowledge to develop targeted marketing programs; segmentation and price discrimination - Insurance - Identify groups of policy holders with high average claim cost - Earthquake studies - Observed earthquake epicenters should be clustered along continental faults

Cluster Analysis Summary

- Cluster analysis is an exploratory tool, useful only when it produces meaningful clusters - Hierarchical clustering gives a visual representation of different levels of clustering - Due to its iterative nature, it can be unstable - It can vary highly depending on settings and be relatively computationally expensive - Non-hierarchical clustering is computationally cheap and more stable, but requires the user to set k in advance - Be wary of chance results; data may not have definitive "real" clusters

How Does PCA Do This?

- Create new variables that are linear combinations of the original variables (i.e., they are weighted averages of the original variables) - These linear combinations are uncorrelated (no information overlap), and only a few of them contain most of the original information - The new variables are called principal components. - Normalization (= standardization) is usually performed in PCA; otherwise measurement units affect results (use correlation matrix instead)
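To make the idea of a linear combination concrete, here is a minimal sketch that standardizes two made-up, highly correlated variables, builds their correlation matrix, and finds the first principal component. It uses power iteration as a stand-in for a full eigen-decomposition; that choice, the data, and all function names are assumptions for illustration.

```python
from math import sqrt

def standardize(columns):
    """Convert each variable (a list of values) to z-scores."""
    result = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        result.append([(v - mean) / sd for v in col])
    return result

def correlation_matrix(z_columns):
    """Correlation matrix of already-standardized variables."""
    n = len(z_columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in z_columns]
            for ci in z_columns]

def first_component(matrix, iterations=200):
    """Power iteration: weights and variance of the first principal component."""
    p = len(matrix)
    w = [1.0 / (i + 1) for i in range(p)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        v = [sum(matrix[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    variance = sum(w[i] * matrix[i][j] * w[j] for i in range(p) for j in range(p))
    return w, variance

# Hypothetical data: x2 is roughly twice x1, so the two variables overlap heavily.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]

z = standardize([x1, x2])
corr = correlation_matrix(z)
weights, variance = first_component(corr)
share = variance / len(corr)  # fraction of total variance captured
scores = [sum(w * col[k] for w, col in zip(weights, z)) for k in range(len(x1))]

print(weights)  # weights of the linear combination
print(share)    # close to 1: one component holds most of the information
```

With standardized variables the total variance equals the number of variables, so dividing the component's variance by that count gives the share of information it retains.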

Data Reduction Summary

- Data reduction is useful for compressing the information in the data into a smaller subset - Categorical variables can be reduced by combining similar categories - Principal components analysis transforms an original set of numerical variables into a smaller set of weighted averages of the original variables, which contain most of the original information - Regression and subset selection algorithms can be used to choose a subset of predictor variables to reduce dimension - Classification trees and regression trees can be used to determine the important predictors

Centroid Distance

- Distance between two clusters is the distance between the two cluster centroids - Centroid is the vector of variable averages for all records in a cluster
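A minimal sketch of centroid distance, assuming Euclidean distance and two made-up clusters (the names below are illustrative):

```python
from math import dist

def centroid(cluster):
    """Vector of variable averages over all records in the cluster."""
    n = len(cluster)
    return tuple(sum(values) / n for values in zip(*cluster))

def centroid_distance(cluster_a, cluster_b):
    """Distance between the two cluster centroids."""
    return dist(centroid(cluster_a), centroid(cluster_b))

# Hypothetical clusters with two variables per record.
A = [(0.0, 0.0), (0.0, 2.0)]   # centroid (0, 1)
B = [(2.0, 4.0), (4.0, 6.0)]   # centroid (3, 5)

print(centroid(A), centroid(B))
print(centroid_distance(A, B))  # 5.0
```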

Advantages

- Easy to use, understand - Produce rules that are easy to interpret & implement - Variable selection & reduction is automatic - Does not require the assumptions of statistical models - Can work without extensive handling of missing data

Misclassification in Classification

- Error = classifying a record as belonging to one class when it belongs to another class. - Error rate = percent of misclassified records out of the total records in the validation data - The confusion matrix
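The error rate and the confusion matrix can both be computed from paired actual and predicted classes. The sketch below uses made-up validation labels; the function names are illustrative.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count records for each (actual class, predicted class) pair."""
    return Counter(zip(actual, predicted))

def error_rate(actual, predicted):
    """Share of records whose predicted class differs from the actual class."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

# Hypothetical validation data: 1 = important class, 0 = other class.
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 0]

matrix = confusion_matrix(actual, predicted)
for a in (1, 0):
    print(f"actual {a}:", [matrix[(a, p)] for p in (1, 0)])
print(error_rate(actual, predicted))  # 0.25 (2 of 8 records misclassified)
```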

Clustering

- Grouping a set of data objects into clusters - Unsupervised classification: no predefined classes - Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups. - Intra-cluster distances minimized, inter-cluster distances maximized

Euclidean Distance

- Highly scale dependent - Completely ignores the relationship between the measurements - Sensitive to outliers, scale, and variance

Histogram

- Shows how your data are distributed across groups - Group the data into categories, then plot them with vertical bars along an axis to see the distribution - Example: you've got 100 pumpkins and you want to know how many weigh 2 pounds or less, 3-5 pounds, 6-10 pounds, etc.; the histogram shows the distribution of your pumpkins according to weight.

Scatter Plot

- Investigating the relationship between different variables - Effective way to give you a sense of trends, concentrations and outliers that will direct you to where you want to focus your investigation efforts further. - Examples: Male versus female likelihood of having lung cancer at different ages, technology early adopters' and laggards' purchase patterns of smart phones, shipping costs of different product categories to different regions.

Dimension Reduction (cont.)

- The goal is to reduce the number of variables or categories by removing redundant variables (those most likely to be highly correlated) and combining similar categories, using: (1) Domain knowledge (2) Data summaries to detect information overlap between variables (3) Data conversion techniques such as converting categorical variables into numerical variables (4) Automated reduction techniques, such as principal components analysis (PCA) (5) Data mining methods such as regression models and classification and regression trees

Lift chart with continuous y

- A lift chart is a visual aid for measuring model performance - It consists of a lift curve and a baseline - The greater the area between the lift curve and the baseline, the better the model.
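The card does not say how the two curves are built. One common construction, assumed here, ranks records by predicted y (highest first) and accumulates the actual y values; the baseline accumulates the overall average instead. The data and names are made up.

```python
def cumulative_lift(actual, predicted):
    """Return (lift_curve, baseline) as lists of cumulative actual y."""
    # Rank records from highest to lowest predicted value.
    ranked = sorted(zip(predicted, actual), key=lambda pair: pair[0], reverse=True)
    lift_curve, running = [], 0.0
    for _, y in ranked:
        running += y
        lift_curve.append(running)
    # Baseline: what we would accumulate by picking records at random.
    mean_y = sum(actual) / len(actual)
    baseline = [mean_y * (i + 1) for i in range(len(actual))]
    return lift_curve, baseline

# Hypothetical validation records.
actual    = [10.0, 40.0, 20.0, 30.0]
predicted = [12.0, 35.0, 25.0, 28.0]

lift_curve, baseline = cumulative_lift(actual, predicted)
print(lift_curve)  # [40.0, 70.0, 90.0, 100.0]
print(baseline)    # [25.0, 50.0, 75.0, 100.0]
```

Both curves end at the same total; a useful model keeps the lift curve above the baseline before that point.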

Disadvantages

- May not perform well where there is structure in the data that is not well captured by horizontal or vertical splits - Since the process deals with one variable at a time, no way to capture interactions between variables

Stopping Tree Growth

- Natural end of process is 100% purity in each leaf - This overfits the data, which ends up fitting noise in the data - Overfitting leads to low predictive accuracy of new data - Past a certain point, the error rate (i.e., misclassification rate) for the validation data starts to increase

Measure Predictive Error in Regression

- Note that a model with high predictive accuracy might not coincide with the model that fits the training data best - We want to know how well the model predicts NEW DATA, not how well it fits the data it was trained with (training set) - A key component of most measures is the difference between actual y and predicted y (ŷ): ei = yi - ŷi

Line Chart

- One of the most frequently used chart types, along with bar charts - Connects individual numeric data points to view trends in data over time - Primarily displays trends over a period of time (e.g., stock price change over a five-year period, website page views during a month, revenue growth by quarter)

Recursive Partitioning Steps

- Pick one of the predictor variables, xi - Pick a value of xi, say si, that divides the training data into two (not necessarily equal) portions - Measure how "pure" or homogeneous each of the resulting portions are - "Pure" = containing records of mostly one class - Algorithm tries different values of xi, and si to maximize purity in initial split - After you get a "maximum purity" split, repeat the process for a second split, and so on
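A minimal sketch of the search for the first split on a single predictor. The card does not name a purity measure, so the Gini index is assumed here as one common choice; the income values and class labels are made up.

```python
def gini(labels):
    """Gini impurity: 0 when all records belong to one class."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Try midpoints between consecutive distinct values of one predictor.

    Returns (split value, weighted impurity of the two portions).
    """
    distinct = sorted(set(values))
    best = None
    for low, high in zip(distinct, distinct[1:]):
        s = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= s]
        right = [y for x, y in zip(values, labels) if x > s]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or impurity < best[1]:
            best = (s, impurity)
    return best

# Hypothetical training data: income (in $000s) and class (1 = acceptor).
income = [40, 60, 85, 100, 120, 150]
accept = [0, 0, 0, 1, 1, 1]

print(best_split(income, accept))  # (92.5, 0.0): both portions are pure
```

A full tree repeats this search over every predictor, takes the best split overall, and then applies the same procedure to each resulting portion.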

Using Validation Error to Prune

- The pruning process yields a set of trees of different sizes and associated error rates - Minimum error tree: the tree with the lowest error rate on the validation data

Bar Chart

- Quick to compare information, revealing highs and lows at a glance - The most common ways to visualize data to compare data across categories. (e.g., volume of shirts in different sizes, website traffic by origination site, percent of spending by department) - Effective when you have numerical data that splits nicely into different categories so you can quickly see trends within your data.

Hierarchical Clustering Steps (Using Agglomerative Method)

- Start with n clusters (each record is its own cluster) - Merge the two closest records into one cluster - At each successive step, the two clusters closest to each other are merged - Keep merging until there is just one cluster left at the end, which consists of all the records
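The steps above can be sketched in a few lines. This version assumes Euclidean distance with single linkage (minimum distance) and uses four made-up records; any other between-cluster distance could be passed in instead.

```python
from math import dist

def single_linkage(cluster_a, cluster_b):
    """Minimum distance between any record in A and any record in B."""
    return min(dist(a, b) for a in cluster_a for b in cluster_b)

def agglomerate(records, linkage=single_linkage):
    """Merge the two closest clusters until one is left; return the merge log."""
    clusters = [[r] for r in records]  # start: each record is its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest between-cluster distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        distance = linkage(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], distance))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical records measured on two variables.
records = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0)]

for left, right, distance in agglomerate(records):
    print(left, "+", right, "at distance", round(distance, 3))
```

The merge distances (1.0, 2.0, then about 6.403 here) are the heights at which a dendrogram would join the corresponding branches.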

Dimension Reduction

- The dimension of a dataset is the number of variables - The dimensionality of a model is the number of independent or input variables used by the model - In AI, dimension reduction is referred to as factor selection or feature extraction - Dimensionality must be reduced for data mining algorithms to run efficiently

PCA in Classification/Prediction

- Use PCA when the goal of the data reduction is to have a smaller set of variables as predictors - Apply PCA to the training data - Decide how many Principal Components are used - Apply the resulting variable weights of the PC's obtained from training data to the validation set, which yields a set of principal scores - These new variables are then treated as the new predictors

Naïve Rule in Classification

- classify all records as belonging to the most prevalent class - often used as a benchmark; we hope to do better than that
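A minimal sketch of the naïve rule as a benchmark, using made-up training and validation labels (function names are illustrative):

```python
from collections import Counter

def naive_rule(training_labels):
    """Return the most prevalent class in the training data."""
    return Counter(training_labels).most_common(1)[0][0]

def benchmark_error(training_labels, validation_labels):
    """Error rate from assigning every validation record to the majority class."""
    majority = naive_rule(training_labels)
    wrong = sum(1 for y in validation_labels if y != majority)
    return wrong / len(validation_labels)

# Hypothetical labels: 0 = nonacceptor, 1 = acceptor.
train = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
valid = [0, 0, 0, 1, 0]

print(naive_rule(train))              # 0
print(benchmark_error(train, valid))  # 0.2
```

A classifier is only worth keeping if its validation error rate is lower than this benchmark.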

Summary of Hierarchical Clustering

Advantages: - Very appealing in that it needs no specification of the number of clusters and is purely data driven - Easier to understand, i.e., through dendrograms
Limitations: - Requires the computation and storage of an n×n distance matrix - Records that are allocated incorrectly early in the process cannot be reallocated subsequently - Low stability: results are sensitive to reordering or dropping records, the choice of distance, and outliers

Normalizing: example

For 22 utilities: Avg. sales = 8,914 Std. dev. = 3,550 Normalized score for Arizona sales: (9,077-8,914)/3,550 = 0.046 {9,077 given}

Trees and Rules

Goal: Classify or predict an outcome based on a set of predictors. The output is a set of rules.
Example: The goal is to classify a record as "will accept credit card offer" or "will not accept". A rule might be "IF (Income > 92.5) AND (Education < 1.5) AND (Family <= 2.5) THEN Class = 0 (nonacceptor)".
Also called CART, Decision Trees, or just Trees. Rules are represented by tree diagrams.

Principal Components Analysis (PCA)

Goal: Reduce a set of numerical variables. The idea: Remove the overlap of information between these variables. ["Information" is measured by the sum of the variances of the variables.] Final product: A smaller number of numerical variables that contain most of the information

Hierarchical Methods

Hierarchical algorithms create a hierarchical decomposition of the set of objects using some criterion.
Agglomerative Methods: - Begin with n clusters (each record is its own cluster) - Keep joining records into clusters until one cluster is left (the entire data set) - Most popular
Divisive Methods: - Start with one all-inclusive cluster - Repeatedly divide into smaller clusters

ROC Curve

How do we interpret performance from an ROC curve?
Sensitivity: ability to detect the important class members (1) correctly = n1,1 / (n1,0 + n1,1), i.e., classifying C1 as C1
1 - Specificity: 1 - (ability to detect the unimportant class members (0) correctly) = n0,1 / (n0,0 + n0,1), i.e., classifying C0 as C1
The closer the curve is to the top-left corner, the better the performance.
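The two rates can be computed directly from confusion-matrix counts. In the sketch below, n_ij is read as the number of records of actual class i classified as class j, matching the formulas above; the counts themselves are made up.

```python
def sensitivity(n11, n10):
    """Share of actual class-1 records classified as class 1."""
    return n11 / (n10 + n11)

def one_minus_specificity(n01, n00):
    """Share of actual class-0 records classified as class 1."""
    return n01 / (n00 + n01)

# Hypothetical counts for one cutoff value.
n11, n10 = 40, 10    # actual 1: classified as 1, classified as 0
n01, n00 = 20, 180   # actual 0: classified as 1, classified as 0

y = sensitivity(n11, n10)            # vertical axis of the ROC curve
x = one_minus_specificity(n01, n00)  # horizontal axis of the ROC curve
print((x, y))  # (0.1, 0.8)
```

Each cutoff value produces one such (x, y) point; tracing the points over all cutoffs gives the ROC curve.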

When One Class is More Important

In many cases it is more important to identify members of one class: - Tax fraud, Credit default, Response to promotional offer, Detecting electronic network intrusion, Predicting delayed flights - In such cases, we are willing to tolerate greater overall error, in return for better identifying the important class for further attention

Normalizing

Problem: Raw distance measures are highly influenced by the scale of measurements.
Solution: Normalize (standardize) the data first - Subtract the mean, divide by the std. deviation (also called z-scores) - z = (x - μ)/σ
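The z-score formula can be checked against the utilities example (sales of 9,077, average 8,914, std. dev. 3,550). The `normalize` helper and its sample column are additions for illustration, and it assumes the sample standard deviation is the one wanted.

```python
import statistics

def z_score(x, mean, std_dev):
    """z = (x - mean) / std. deviation."""
    return (x - mean) / std_dev

# Arizona sales from the utilities example.
arizona = z_score(9077, 8914, 3550)
print(round(arizona, 3))  # 0.046

def normalize(column):
    """Normalize a whole column (assumes the sample std. deviation)."""
    mean = statistics.mean(column)
    std_dev = statistics.stdev(column)
    return [z_score(x, mean, std_dev) for x in column]

# Hypothetical column of values.
print(normalize([10.0, 20.0, 30.0]))
```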

Desirable Cluster Features

Interpretability - explore the characteristics of each resulting cluster (e.g., summary statistics of each cluster on each measurement used in clustering) Stability - check if resulting clusters and cluster assignments are sensitive to changes in inputs Separation - check the ratio of between-cluster variation to within-cluster variation (higher, better)

Error Measures with continuous Y

- MAE or MAD: Mean absolute error (deviation); gives an idea of the magnitude of errors - Average error: Gives an idea of systematic over- or under-prediction - MAPE: Mean absolute percentage error - RMSE (root-mean-squared error): Square the errors, find their average, take the square root - Total SSE: Total sum of squared errors
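A minimal sketch of these measures, taking each error as actual y minus predicted y. The data and the function name are made up, and MAPE is expressed in percent, which assumes no actual value is zero.

```python
from math import sqrt

def error_measures(actual, predicted):
    """Compute common predictive error measures for a continuous outcome."""
    errors = [y - y_hat for y, y_hat in zip(actual, predicted)]
    n = len(errors)
    sse = sum(e * e for e in errors)
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "average_error": sum(errors) / n,
        "MAPE": 100 * sum(abs(e / y) for e, y in zip(errors, actual)) / n,
        "RMSE": sqrt(sse / n),
        "SSE": sse,
    }

# Hypothetical validation data.
actual    = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 380.0]

measures = error_measures(actual, predicted)
for name, value in measures.items():
    print(name, round(value, 3))
```

Here the average error is positive, which signals that the predictions are too low on balance, while MAE and RMSE describe the typical size of an error regardless of sign.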

Key Ideas of Trees

Recursive partitioning: repeatedly split the records into two parts so as to achieve maximum homogeneity within the new parts Pruning the tree: Simplify the tree by pruning peripheral branches to avoid overfitting

k-Means Strengths & Weaknesses

Strengths: - Efficient - Implementation is straightforward
Weaknesses: - Need to specify k (number of clusters) in advance - Very sensitive to outliers: objects with extremely large values may substantially distort the distribution of data - Not suitable for discovering clusters with odd (non-convex) shapes - Applicable only when the mean is defined
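A minimal k-means sketch that shows both the simplicity of the algorithm and its sensitivity to outliers. The records, the starting centroids, and the fixed iteration count are all assumptions for illustration.

```python
from math import dist

def k_means(records, initial_centroids, iterations=10):
    """Alternate between assigning records and recomputing centroids."""
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each record joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for r in records:
            nearest = min(range(len(centroids)), key=lambda i: dist(r, centroids[i]))
            clusters[nearest].append(r)
        # Update step: each centroid becomes the mean of its cluster.
        centroids = [
            tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical records forming two obvious groups.
records = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
start = [(0.0, 0.0), (10.0, 10.0)]  # k = 2 must be chosen in advance

centroids, clusters = k_means(records, start)
print(centroids)  # one centroid per group

# Adding a single extreme record changes the result completely.
outlier_centroids, outlier_clusters = k_means(records + [(100.0, 100.0)], start)
print(outlier_clusters)  # the outlier ends up alone; the two groups are merged
```

In the second run the outlier drags one centroid so far away that both real groups fall into the other cluster, which is the distortion the weakness list warns about.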

Variance (2)

The variance of a population is: σ² = Σ(xi - μ)² / N (where N is the population size). The variance of a sample is: s² = Σ(xi - x̄)² / (n - 1) (where n is the sample size).
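The two formulas differ only in the divisor. The sketch below computes both on a small made-up data set and compares them with Python's `statistics.pvariance` and `statistics.variance`.

```python
import statistics

def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

# Hypothetical data with mean 5 and squared deviations summing to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_variance(data))  # 4.0   (32 / 8)
print(sample_variance(data))      # 32 / 7, about 4.571
print(statistics.pvariance(data), statistics.variance(data))
```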

Variance

Variance and its related measure, standard deviation, are arguably the most important statistics. Used to measure variability, they also play a vital role in almost all statistical inference procedures. - Population variance is denoted by σ² (Lower case Greek letter "sigma" squared) - Sample variance is denoted by s² (Lower case "S" squared)

Clustering Algorithms: Hierarchical Clustering

e.g., agglomerative clustering (sequential merge) - useful when a goal is to arrange the clusters into a natural hierarchy.

Clustering Algorithms: Non-Hierarchical Clustering

e.g., k-means clustering (k-clusters selection) - less computationally intensive and thus preferred for large datasets

Tableau - Dimensions and Measures

• Dimensions set the granularity, or the level of detail in the view. Think of them as the things you group by or drill down by. Dimensions are usually discrete, categorical fields such as Order Priority and City. • E.g., Order Priority has 4 categories, so it would give us 4 marks. • Measures are usually numerical, continuous data like Shipping Cost. Inside of Tableau, measures are aggregations - they're aggregated up to the granularity set by the dimensions in the view. Think of them as the data elements that you want to perform calculations on. • E.g., the result for the sum of Shipping Cost (double click it) is different if we have no dimensions in the view (just a single overall sum) versus when we add Order Priority (double click it) as a dimension - now we have a sum for each priority level. • Dimensions come out onto the view as themselves! • Measures come out onto the view as aggregates!

Measures of Central Location

• The arithmetic mean, a.k.a. average, shortened to mean, is the most popular & useful measure of central location. • It is computed by simply adding up all the observations and dividing by the total number of observations: mean= Sum of the observations / Number of observations

