BUSA 3131 Chapters 3, 4, 5
scatter chart matrix
- A graphical presentation that uses multiple scatter charts arranged as a matrix to illustrate the relationships among multiple variables. - Allows the reader to easily see the relationships among multiple variables
heat map
- A two-dimensional graphical representation of data that uses different shades of color to indicate magnitude. - Helps the reader easily identify trends.
Data visualization involves
- Creating a summary table for the data. - Generating charts to help interpret, analyze, and learn from the data.
uses of data visualization
- Helpful for identifying data errors. - Reduces the size of your data set by highlighting important relationships and trends in the data
data-ink ratio
- Measures the proportion of what Tufte terms "data-ink" to the total amount of ink used in a table or chart. - Edward R. Tufte first described the data-ink ratio. - Increasing the data-ink ratio by adding labels to axes and removing unnecessary lines can improve readability. - White space in a table or chart can also improve readability.
Goal of preprocessing
- generate a list of most-relevant terms that is sufficiently small so as to lend itself to analysis
Tables should be used when:
1. Reader needs to refer to specific numerical values 2. Reader needs to make precise comparisons between different values and not just relative comparisons 3. Values being displayed have different units or very different magnitudes
Scatter charts
A graphical representation of the relationship between two quantitative variables. A trendline is a line that provides an approximation of the relationship between the variables.
Why is data mining with text data more challenging than data mining with traditional numerical data?
A collection of text documents to be analyzed is called a corpus. Text data requires more preprocessing to convert the text to a format amenable for analysis.
bubble chart
A graphical means of visualizing three variables in a two-dimensional graph; it is therefore sometimes a preferred alternative to a 3-D graph.
cluster column or bar chart
A special type of column (bar) chart in which multiple bars are clustered in the same class to compare multiple variables; also known as a side-by-side-column (bar) chart.
stacked column chart
A special type of column (bar) chart in which multiple variables appear on the same bar. - allows the reader to compare the relative values of quantitative variables for the same category in a bar chart
Association rules
An if-then statement describing the relationship between item sets.
Table design principles
Avoid the use of unnecessary ink in tables. Horizontal lines are necessary only for separating column titles from data values or when indicating that a calculation has taken place. Columns of numerical values in a table should be right-aligned because this makes it easy to see differences in the magnitude of values. If you are showing digits to the right of the decimal point, all values should include the same number of digits to the right of the decimal. It is generally best to left-align text values within a column in a table, since aligning the first letter of each data entry promotes readability. Centering text is appropriate only if the text values are approximately the same length.
multivariate predictive analysis
Cluster analysis is one of the many models in this area. These notes give only a brief introduction to cluster analysis.
Market segmentation
Commonly used in marketing to divide customers into different homogeneous groups.
hierarchical clustering
Determines the similarity of two clusters by considering the similarity between the observations composing either cluster. Starts with each observation in its own cluster and then iteratively combines the two clusters that are the most similar into a single cluster. - Suitable when we have a small data set and want to easily examine solutions with increasing numbers of clusters. - Convenient method if you want to observe how clusters are nested. - Very sensitive to outliers.
addition law for mutually exclusive events
Events A and B are mutually exclusive if the occurrence of one event precludes the occurrence of the other. Thus, a requirement for A and B to be mutually exclusive is that their intersection must contain no sample points. In that case the addition law reduces to P(A or B) = P(A) + P(B).
This measure of uncertainty is often communicated through a probability distribution:
Extremely helpful in providing additional information about an event. Can be used to help a decision maker evaluate possible actions and determine the best course of action.
K means clustering
Given a value of k (specified by the analyst), the k-means algorithm randomly assigns each observation to one of the k clusters. After all observations have been assigned to a cluster, the resulting cluster centroids are calculated. Using the updated cluster centroids, all observations are reassigned to the cluster with the closest centroid; these two steps repeat until there is no change in the clusters or a specified maximum number of iterations is reached. One can measure the strength of a cluster by comparing the average distance within a cluster to the distance between cluster centroids. One rule of thumb is that the ratio of between-cluster distance to within-cluster distance should exceed 1.0 for useful clusters. K represents the number of clusters. - Not suitable for binary data. - Partitions the observations; appropriate if trying to summarize the data with K "average" observations that describe the data with the minimum amount of error. - Suitable when you know how many clusters you want and you have a large data set.
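The k-means loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `centroid` and `k_means`, the fixed random seed, the sample points, and the way an empty cluster is handled are all choices made for this example.

```python
import math
import random


def centroid(points):
    """Average observation of a cluster: the mean of each variable."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def k_means(points, k, max_iter=100, seed=0):
    """Sketch of k-means with a random initial assignment."""
    rng = random.Random(seed)
    # Randomly assign each observation to one of the k clusters.
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # Calculate the centroid of every cluster.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random observation.
            centroids.append(centroid(members) if members else rng.choice(points))
        # Reassign each observation to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # no change in the clusters, so stop
            break
        labels = new_labels
    return labels, centroids


# Hypothetical observations forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (9.0, 9.2), (9.1, 8.8), (8.9, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 or 1 depends on the random start, so only the grouping itself is meaningful, not the label numbers.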
complement of an Event
Given an event A, the complement of A is defined to be the event consisting of all outcomes that are not in A. P(A) = 1 - P(A^c)
Prior Probability
Initial estimate of the probabilities of events.
Data-ink
Ink used in a table or chart that is necessary to convey the meaning of the data to the audience.
Non-data-ink:
Ink used in a table or chart that serves no useful purpose in conveying the data to the audience
multiplication law
It can be used to calculate the probability of the intersection of two events: P(A and B) = P(B) P(A | B) = P(A) P(B | A).
What is not true of Euclidean distance?
It is not affected by the scale on which variables are measured; it is used to measure dissimilarity between categorical variable observations; it increases as the variable values become more similar.
Jaccard's coefficient
(Number of variables with matching nonzero value for observations u and v) / [(total number of variables) - (number of variables with matching zero values for observations u and v)]. Does not count matching zero entries. A measure of similarity between observations consisting solely of binary categorical variables that considers only matches of nonzero entries.
matching coefficient
(Number of variables with matching value for observations u and v) / (total number of variables). When clustering only by dummy variables that represent categorical variables, this is the simplest measure of similarity between two observations. A weakness: if two observations both have a 0 entry for a categorical variable, that is counted as a sign of similarity between the two observations even though that is sometimes not the case.
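To see how the matching coefficient and Jaccard's coefficient differ on the same data, here is a small sketch. The two observations `u` and `v` are made-up binary (dummy-variable) records, and the function names are just labels chosen for this example.

```python
def matching_coefficient(u, v):
    """Share of variables on which u and v have the same value."""
    matches = sum(1 for a, b in zip(u, v) if a == b)
    return matches / len(u)


def jaccard_coefficient(u, v):
    """Like the matching coefficient, but matching zeros are ignored."""
    both_one = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    both_zero = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    denom = len(u) - both_zero
    # Assumption: two all-zero observations are given a similarity of 0.
    return both_one / denom if denom else 0.0


# Hypothetical observations over five binary variables.
u = [1, 0, 0, 1, 0]
v = [1, 0, 0, 0, 0]
print(matching_coefficient(u, v))  # 4 matches out of 5 -> 0.8
print(jaccard_coefficient(u, v))   # 1 shared nonzero / (5 - 3 matching zeros) -> 0.5
```

The three shared zeros make the pair look very similar under the matching coefficient, while Jaccard's coefficient discounts them.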
Data preprocessing
Parses the original text data down to the set of tokens deemed relevant for the topic being studied.
cross tabulation
Provides a tabular summary of data for two variables.
Support Count
The number of times that a collection of items occur together in a transaction data set.
independent events
The outcome of one event does not affect the outcome of the second event. If the probability of event D is not changed by the existence of event M, then we would say that events D and M are independent events. Otherwise, the events are dependent.
joint probabilities
The probability of two events both occurring; in other words, the probability of the intersection of two events.
marginal probabilities
The probability of a single event occurring, regardless of the outcome of the other event; these are the values in the margins of the joint probability table. A marginal probability is found by summing the joint probabilities in the corresponding row or column of the joint probability table.
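As a quick illustration of summing a row or column of a joint probability table, consider the sketch below. The table values and the outcome names ("late", "rain", and so on) are invented for the example, and `marginal` is a hypothetical helper.

```python
# Hypothetical joint probability table. Each key is (row outcome, column outcome).
joint = {
    ("late", "rain"): 0.15,
    ("late", "dry"): 0.05,
    ("on_time", "rain"): 0.15,
    ("on_time", "dry"): 0.65,
}


def marginal(table, index, value):
    """Sum the joint probabilities whose key equals `value` at position `index`."""
    return sum(p for key, p in table.items() if key[index] == value)


p_late = marginal(joint, 0, "late")  # row total: 0.15 + 0.05
p_rain = marginal(joint, 1, "rain")  # column total: 0.15 + 0.15
print(round(p_late, 2), round(p_rain, 2))
```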
Frequency term document matrix
Used when the frequency of word occurrence is important to the context of the business problem. Rows represent documents; columns represent tokens. Entries in the matrix are the frequency of occurrence of each token in each document.
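A frequency term-document matrix can be built in a few lines. The sketch below assumes a tiny made-up corpus and a very simple tokenizer (lowercase, letters only); it performs no stemming or removal of irrelevant tokens, which real preprocessing would normally add.

```python
import re
from collections import Counter

# Hypothetical corpus of three short documents.
corpus = [
    "Great service and great food",
    "Terrible service",
    "Food was great",
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Columns: the sorted set of tokens that appear anywhere in the corpus.
tokens = sorted({tok for doc in corpus for tok in tokenize(doc)})

# Rows: one per document. Entries: how often each token occurs in that document.
matrix = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    matrix.append([counts[tok] for tok in tokens])

print(tokens)
for row in matrix:
    print(row)
```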
3-d effect on charts
adds unnecessary detail that does not help explain the data. Instead, consider using multiple lines on a line chart, employing multiple charts, or creating bubble charts in which the size of the bubble can represent the z-axis value.
Pie charts
are another common form of chart used to compare categorical data. Making visual comparisons is much easier in a bar chart than in a pie chart.
Line charts
are similar to scatter charts, but a line connects the points in the chart. Line charts are very useful for time series data collected over a period of time.
K means clustering
assigns each observation to one of k clusters such that observations assigned to the same cluster are as similar as possible
Lift ratio
confidence / (support of consequent / total number of transactions). The ratio of the performance of a data mining model measured against the performance of a random choice. In the context of association rules, the lift ratio is the ratio of the probability of the consequent occurring in a transaction that satisfies the antecedent versus the probability that the consequent occurs in a randomly selected transaction.
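The lift ratio builds on support counts and confidence, so a small worked example helps tie the three together. The transactions below are invented shopping baskets, and `support_count` is a hypothetical helper written for this sketch.

```python
# Hypothetical transaction data set (each set is one shopping basket).
transactions = [
    {"bread", "milk"},
    {"bread", "jelly", "milk"},
    {"bread", "jelly"},
    {"milk"},
    {"bread", "jelly", "peanut butter"},
]


def support_count(items):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)


# Rule under study: if {bread} then {jelly}.
antecedent = {"bread"}
consequent = {"jelly"}

# Confidence = support of (antecedent and consequent) / support of antecedent.
confidence = support_count(antecedent | consequent) / support_count(antecedent)

# Lift = confidence / (support of consequent / total number of transactions).
lift = confidence / (support_count(consequent) / len(transactions))

print(confidence, round(lift, 2))
```

Here bread appears in 4 baskets and bread with jelly in 3, so the confidence is 3/4. Jelly appears in 3 of 5 baskets, giving a lift ratio above 1, which suggests the rule does better than a random guess on this toy data.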
single linkage
considers only the two most similar observations between the two clusters (nearest neighbor). A measure of dissimilarity between two clusters based only on the two most similar observations between the two clusters. Can result in long, elongated clusters rather than compact, circular clusters.
group average linkage
The distances between each pair of observations, one from each cluster, are added up and divided by the number of pairs to get an average. Creates clusters that are less dominated by the similarity between single pairs of observations. If cluster 1 consists of n1 observations and cluster 2 consists of n2 observations, the similarity of these clusters would be the average of n1 x n2 similarity measures.
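The pairwise linkage measures can be compared side by side on two small clusters. This is only a sketch: the clusters are made-up 2-D points, Euclidean distance is assumed as the dissimilarity measure, and complete linkage (defined elsewhere in these notes) is included for contrast.

```python
import math
from itertools import product


def single_linkage(c1, c2):
    """Distance between the two most similar observations (nearest neighbor)."""
    return min(math.dist(a, b) for a, b in product(c1, c2))


def complete_linkage(c1, c2):
    """Distance between the two most dissimilar observations."""
    return max(math.dist(a, b) for a, b in product(c1, c2))


def group_average_linkage(c1, c2):
    """Average of all n1 x n2 pairwise distances."""
    pairs = list(product(c1, c2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Hypothetical clusters of 2-D observations.
cluster_1 = [(0.0, 0.0), (0.0, 1.0)]
cluster_2 = [(3.0, 0.0), (4.0, 0.0)]

print(single_linkage(cluster_1, cluster_2))
print(round(group_average_linkage(cluster_1, cluster_2), 3))
print(round(complete_linkage(cluster_1, cluster_2), 3))
```

The group average always falls between the single-linkage and complete-linkage values, which is why it is less dominated by any one pair of observations.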
mutually exclusive events
events that have no outcomes in common
Sparkline
is a minimalist type of line chart that can be placed directly into a cell in Excel. - Contains no axes and displays only the line for the data. - Can be used effectively to provide information on overall trends for time series data. - Takes up very little space.
a random experiment
is a process that generates well defined outcomes
Euclidean distance
is the most common method to measure dissimilarity between observations. - Euclidean distance becomes smaller as a pair of observations become more similar with respect to their variable values. - Is highly influenced by the scale on which variables are measured.
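The sensitivity to scale is easy to see numerically. In this sketch the two "customers" and their age and income values are invented, and `euclidean` is simply the usual square root of summed squared differences.

```python
import math


def euclidean(u, v):
    """Square root of the summed squared differences across variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical customers described by (age in years, income in dollars).
u = (25, 50_000)
v = (45, 51_000)
d_dollars = euclidean(u, v)  # dominated by the 1,000-dollar income gap

# Same customers with income expressed in thousands of dollars.
u_k = (25, 50)
v_k = (45, 51)
d_thousands = euclidean(u_k, v_k)  # now dominated by the 20-year age gap

print(round(d_dollars, 2), round(d_thousands, 2))
```

Nothing about the customers changed between the two calculations, only the unit of measurement, which is why variables are often standardized before clustering.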
Probability
is the numerical measure of the likelihood that an event will occur.
Sentiment analysis
is the process of clustering/categorizing comments or reviews as positive, negative, or neutral
tokenization
is the process of dividing text into separate terms, referred to as tokens
The goal of cluster analysis
is to group observations into clusters such that observations within a cluster are similar and observations in different clusters are dissimilar
goal of clustering
is to segment observations into similar groups based on observed variables.
text data
is unstructured data because, in its raw form, it cannot be stored in a traditional structured database. Text mining is the process of extracting useful information from text data. Other examples of unstructured data: video and audio data.
median linkage
Uses the median distance between each observation in one cluster and each observation in the other cluster. Reduces the effect of outliers.
Ward's method
merges two clusters such that the dissimilarity of the observations within the resulting single cluster increases as little as possible. It tends to produce clearly defined clusters of similar size. It computes the centroid of the resulting merged cluster and then calculates the sum of squared dissimilarity between this centroid and each observation in the union of the two clusters.
Complete Linkage
Considers only the two most dissimilar observations between the two clusters (farthest neighbor). Two clusters are considered close if their most different pair of observations is close, so every observation in one cluster is close to every observation in the other. It can be distorted by outliers.
Bar charts
provide a graphical summary of categorical data - Use horizontal bars to display the magnitude of the quantitative variable
Addition law
provides a way to compute the probability that event A or event B occurs or both events occur. Used to find the probability of the union of two events. The intersection of A and B is the event containing outcomes that belong to both A and B.
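The addition law, P(A or B) = P(A) + P(B) - P(A and B), can be checked directly on a small sample space. The die-rolling experiment and the events A and B below are assumptions chosen for this sketch.

```python
from fractions import Fraction

# Hypothetical experiment: roll one fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # event: the roll is even
B = {4, 5, 6}  # event: the roll is greater than 3


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))


# Addition law: subtract the intersection so it is not counted twice.
p_union = prob(A) + prob(B) - prob(A & B)

print(p_union)      # 2/3
print(prob(A | B))  # 2/3, counted directly from the union {2, 4, 5, 6}
```

Without the subtraction, outcomes 4 and 6 would be counted in both P(A) and P(B).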
stemming
is the process of converting a word to its stem or root word. For example, stemming would drop the "ing" and "ed" from "stacking" and "stacked" and place only "stack" in the list of words to be tracked.
bottom up hierarchical clustering
starts with each observation belonging to its own cluster and then sequentially merges the most similar clusters to create a series of nested clusters
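The bottom-up merging process can be sketched for one-dimensional data. This is an illustration only: it assumes single linkage with absolute difference as the dissimilarity measure, uses made-up values, and the function name `agglomerate` is hypothetical.

```python
def agglomerate(values):
    """Bottom-up clustering of 1-D values using single linkage."""
    clusters = [[v] for v in values]  # each observation starts in its own cluster
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        history.append([list(c) for c in clusters])
    return history


# Hypothetical observations.
history = agglomerate([1, 2, 6, 8, 20])
for step in history:
    print(step)
```

Each printed step has one fewer cluster than the one before it, and every cluster at a later step is a union of clusters from earlier steps, which is the nesting the definition refers to.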
confidence
(support of antecedent and consequent) / (support of antecedent). The conditional probability that the consequent of an association rule occurs given that the antecedent occurs.
random variable
takes on different numeric values based on chance
sample space
the set of all outcomes
Column charts
use vertical bars to display the magnitude of the quantitative variable. Good at making comparisons between categorical variables.
Centroid linkage
uses the distance between cluster centroids (means). The centroid for cluster k, denoted c_k, is found by calculating the average value for each variable across all observations in the cluster. The centroid is the "average observation" of a cluster.
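Computing a centroid and the resulting centroid linkage takes only a few lines. The clusters below are made-up 2-D observations, and Euclidean distance between centroids is assumed.

```python
import math


def centroid(cluster):
    """Average value of each variable across all observations in the cluster."""
    n = len(cluster)
    return tuple(sum(obs[i] for obs in cluster) / n for i in range(len(cluster[0])))


def centroid_linkage(c1, c2):
    """Distance between the centroids of two clusters."""
    return math.dist(centroid(c1), centroid(c2))


# Hypothetical clusters.
cluster_1 = [(1.0, 1.0), (3.0, 1.0)]              # centroid (2.0, 1.0)
cluster_2 = [(5.0, 5.0), (5.0, 7.0), (5.0, 3.0)]  # centroid (5.0, 5.0)

print(centroid(cluster_1), centroid(cluster_2))
print(centroid_linkage(cluster_1, cluster_2))  # sqrt(3**2 + 4**2) = 5.0
```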
conditional probability
when the probability of one event is dependent on whether some related event has already occurred. It can be computed as the ratio of a joint probability to a marginal probability. We are considering the probability of event A given the condition that event B has occurred: the probability of A given B, written P(A | B).
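The ratio of a joint probability to a marginal probability can be worked through with a few numbers. All probabilities below are invented for illustration, and the variable names are chosen for this sketch.

```python
# Hypothetical joint probabilities for events A and B.
p_a_and_b = 0.10
p_a_and_not_b = 0.20
p_not_a_and_b = 0.30
p_not_a_and_not_b = 0.40

# Marginal probabilities: sum the joint probabilities that include the event.
p_a = p_a_and_b + p_a_and_not_b
p_b = p_a_and_b + p_not_a_and_b

# Conditional probability: P(A | B) = P(A and B) / P(B).
p_a_given_b = p_a_and_b / p_b

# If P(A | B) equals P(A), the events are independent. Here it does not.
independent = abs(p_a_given_b - p_a) < 1e-9

print(round(p_a, 2), round(p_b, 2), round(p_a_given_b, 2), independent)
```

Knowing that B occurred lowers the probability of A from 0.30 to 0.25 in this toy table, so A and B are dependent. Multiplying P(B) by P(A | B) recovers the joint probability of 0.10.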