BUSA 3131 Chapters 3, 4, 5

scatter chart matrix

- A graphical presentation that uses multiple scatter charts arranged as a matrix to illustrate the relationships among multiple variables. - Allows the reader to easily see the relationships among multiple variables

heat map

- A two-dimensional graphical representation of data that uses different shades of color to indicate magnitude. - Helps the reader easily identify trends.

Data visualization involves

- Creating a summary table for the data. - Generating charts to help interpret, analyze, and learn from the data.

uses of data visualization

- Helpful for identifying data errors. - Reduces the size of your data set by highlighting important relationships and trends in the data

data-ink ratio

- Measures the proportion of what Tufte terms "data-ink" to the total amount of ink used in a table or chart. - Edward R. Tufte first described the data-ink ratio. - Increasing the data-ink ratio by adding labels to axes and removing unnecessary lines can improve readability; white space in a table or chart can also improve readability.

Goal of preprocessing

- generate a list of most-relevant terms that is sufficiently small so as to lend itself to analysis

Tables should be used when:

1. Reader needs to refer to specific numerical values 2. Reader needs to make precise comparisons between different values and not just relative comparisons 3. Values being displayed have different units or very different magnitudes

Scatter charts

A graphical representation of the relationship between two quantitative variables. A trendline is a line that provides an approximation of the relationship between the variables.

Why is data mining with text data more challenging than data mining with traditional numerical data?

A collection of text documents to be analyzed is called a corpus. Text data requires more preprocessing to convert the text to a format amenable for analysis.

bubble chart

A graphical means of visualizing three variables in a two-dimensional graph; it is therefore sometimes a preferred alternative to a 3-D graph.

clustered column or bar chart

A special type of column (bar) chart in which multiple bars are clustered in the same class to compare multiple variables; also known as a side-by-side-column (bar) chart.

stacked column chart

A special type of column (bar) chart in which multiple variables appear on the same bar. - allows the reader to compare the relative values of quantitative variables for the same category in a bar chart

Association rules

An if-then statement describing the relationship between item sets.

Table design principles

Avoid the use of unnecessary ink in tables. Horizontal lines are necessary only for separating column titles from data values or for indicating that a calculation has taken place. Columns of numerical values should be right-aligned because this makes it easy to see differences in the magnitude of values. If you are showing digits to the right of the decimal point, all values should include the same number of digits to the right of the decimal. It is generally best to left-align text values within a column. Aligning the first letter of each data entry promotes readability; center text only if the text values are all approximately the same length.

multivariate predictive analysis

Cluster analysis is one of the many models in this area. This set gives only a brief introduction to cluster analysis.

Market segmentation

Commonly used in marketing to divide customers into different homogeneous groups.

hierarchical clustering

Determines the similarity of two clusters by considering the similarity between the observations composing either cluster. Starts with each observation in its own cluster and then iteratively combines the two clusters that are the most similar into a single cluster. - suitable when we have a small data set and want to easily examine solutions with increasing numbers of clusters - convenient method if you want to observe how clusters are nested - very sensitive to outliers

addition law for mutually exclusive events

Events A and B are mutually exclusive if the occurrence of one event precludes the occurrence of the other. Thus, a requirement for A and B to be mutually exclusive is that their intersection must contain no sample points. In that case P(A ∩ B) = 0, so the addition law reduces to P(A ∪ B) = P(A) + P(B).

This measure of uncertainty is often communicated through a probability distribution:

Extremely helpful in providing additional information about an event. Can be used to help a decision maker evaluate possible actions and determine best course of action

K means clustering

Given a value of k (specified by the analyst), the k-means algorithm randomly assigns each observation to one of the k clusters. After all observations have been assigned to a cluster, the resulting cluster centroids are calculated. Using the updated cluster centroids, all observations are reassigned to the cluster with the closest centroid, and this repeats until there is no change in the clusters or a specified maximum number of iterations is reached. One can measure the strength of a cluster by comparing the average distance within a cluster to the distance between cluster centroids. One rule of thumb is that the ratio of between-cluster distance to within-cluster distance should exceed 1.0 for useful clusters. K represents the number of clusters. - not suitable for binary data - partitions the observations; appropriate if trying to summarize the data with K "average" observations that describe the data with a minimum amount of error - suitable when you know how many clusters you want and you have a large data set
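
The loop described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not a production implementation: the function name `k_means`, the `seed` parameter, the handling of an initially empty cluster, and the single-variable data are all hypothetical choices made for this example.

```python
import math
import random


def k_means(observations, k, max_iterations=100, seed=0):
    """Minimal k-means sketch following the steps in the definition above."""
    rng = random.Random(seed)
    # Step 1: randomly assign each observation to one of the k clusters.
    labels = [rng.randrange(k) for _ in observations]
    centroids = [None] * k
    for _ in range(max_iterations):
        # Step 2: compute the centroid (variable-wise mean) of each cluster.
        for c in range(k):
            members = [obs for obs, lab in zip(observations, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
            elif centroids[c] is None:
                # Assumption: an initially empty cluster is seeded with a random observation.
                centroids[c] = tuple(rng.choice(observations))
        # Step 3: reassign every observation to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(obs, centroids[c]))
            for obs in observations
        ]
        if new_labels == labels:  # no change in the clusters: stop
            break
        labels = new_labels
    return labels, centroids


# Hypothetical single-variable data with two obvious groups.
data = [(1.0,), (1.2,), (2.0,), (10.0,), (10.7,), (11.5,)]
labels, centroids = k_means(data, k=2)
print(labels, centroids)
```

Which cluster receives the label 0 or 1 depends on the random start, so only the grouping itself is meaningful.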

complement of an Event

Given an event A, the complement of A is defined to be the event consisting of all outcomes that are not in A. P(A) = 1 - P(A^c)

Prior Probability

Initial estimate of the probabilities of events.

Data-ink

Ink used in a table or chart that is necessary to convey the meaning of the data to the audience.

Non-data-ink:

Ink used in a table or chart that serves no useful purpose in conveying the data to the audience

multiplication law

It can be used to calculate the probability of the intersection of two events: P(A ∩ B) = P(B) P(A | B).

What is not true of Euclidean distance?

All of the following statements are false: it is not affected by the scale on which variables are measured; it is used to measure dissimilarity between categorical variable observations; it increases as the variable values become more similar.

Jaccard's coefficient

(Number of variables with matching nonzero values for observations u and v) / [(total number of variables) - (number of variables with matching zero values for observations u and v)]. Does not count matching zero entries. A measure of similarity between observations consisting solely of binary categorical variables that considers only matches of nonzero entries.
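
As a small illustration of the formula, here is a sketch in Python. The function name `jaccard_coefficient`, the two observations, and the choice to return 0.0 when every entry is a matching zero are assumptions made for this example.

```python
def jaccard_coefficient(u, v):
    """Jaccard's coefficient for two observations of binary (0/1) variables."""
    if len(u) != len(v):
        raise ValueError("observations must have the same number of variables")
    matching_nonzero = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    matching_zero = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    denominator = len(u) - matching_zero
    if denominator == 0:
        # Assumption: the coefficient is undefined when all entries are
        # matching zeros; this sketch returns 0.0 in that case.
        return 0.0
    return matching_nonzero / denominator


# Hypothetical observations on five binary variables.
u = [1, 0, 0, 1, 0]
v = [1, 0, 0, 0, 0]
print(jaccard_coefficient(u, v))  # 1 matching nonzero / (5 - 3 matching zeros) = 0.5
```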

matching coefficient

(Number of variables with matching values for observations u and v) / (total number of variables). The simplest measure of similarity between two observations when clustering only on dummy variables that represent categorical variables. A weakness: if two observations both have a 0 entry for a categorical variable, that is counted as a sign of similarity between the two observations, even though that is sometimes not the case.
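
A minimal sketch of this ratio in Python; the function name `matching_coefficient` and the two observations are hypothetical. Note how the three shared 0 entries count as matches here, which is exactly the weakness described above.

```python
def matching_coefficient(u, v):
    """Share of variables on which two observations have the same value."""
    if len(u) != len(v):
        raise ValueError("observations must have the same number of variables")
    matches = sum(1 for a, b in zip(u, v) if a == b)
    return matches / len(u)


# Hypothetical observations on five dummy (0/1) variables.
u = [1, 0, 0, 1, 0]
v = [1, 0, 0, 0, 0]
print(matching_coefficient(u, v))  # 4 matching values / 5 variables = 0.8
```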

Data preprocessing

Parses the original text data down to the set of tokens deemed relevant for the topic being studied.

cross tabulation

Provides a tabular summary of data for two variables.

Support Count

The number of times that a collection of items occur together in a transaction data set.

independent events

The outcome of one event does not affect the outcome of the second event. If the probability of event D is not changed by the existence of event M, then we would say that events D and M are independent events. Otherwise, the events are dependent.

joint probabilities

The probability of two events both occurring; in other words, the probability of the intersection of two events.

marginal probabilities

The probability of each event separately, shown in the margins of the joint probability table. It is found by summing the joint probabilities in the corresponding row or column of the joint probability table.

Frequency term-document matrix

Used when the frequency of word occurrence is important to the context of the business problem. Rows represent documents, columns represent tokens. Entries in the matrix are the frequency of occurrence of each token in each document.
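
To make the row/column layout concrete, here is a rough sketch in Python. The two-document corpus, the function names, and the very simple tokenizer (lowercase runs of letters) are assumptions for illustration; real preprocessing would typically also remove irrelevant terms.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z]+", text.lower())


def frequency_term_document_matrix(documents):
    """Return (tokens, matrix): one row per document, one column per token."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    counts = [Counter(doc) for doc in tokenized]
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix


# Hypothetical two-document corpus.
corpus = ["The flight was late, late again.", "Great flight and great crew."]
vocabulary, matrix = frequency_term_document_matrix(corpus)
print(vocabulary)
for row in matrix:
    print(row)
```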

3-d effect on charts

adds unnecessary detail that does not help explain the data. Instead, consider using multiple lines on a line chart, employing multiple charts, or creating bubble charts in which the size of the bubble represents the z-axis value.

Pie charts

are another common form of chart used to compare categorical data. Making visual comparisons is much easier in a bar chart than in a pie chart.

Line charts

are similar to scatter charts, but a line connects the points in the chart. Line charts are very useful for time series data collected over a period of time.

K means clustering

assigns each observation to one of k clusters so that observations assigned to the same cluster are as similar as possible

Lift ratio

confidence / (support of consequent / total number of transactions). The ratio of the performance of a data mining model measured against the performance of a random choice. In the context of association rules, the lift ratio is the ratio of the probability of the consequent occurring in a transaction that satisfies the antecedent versus the probability that the consequent occurs in a randomly selected transaction.
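
The arithmetic can be sketched as follows. The five transactions and the rule {bread} -> {jelly} are invented for illustration, and the helper names are hypothetical; the sketch also assumes the antecedent and the consequent each appear in at least one transaction, since otherwise a division by zero occurs.

```python
def support_count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent divided by support of antecedent."""
    both = support_count(transactions, set(antecedent) | set(consequent))
    return both / support_count(transactions, antecedent)


def lift_ratio(transactions, antecedent, consequent):
    """Confidence divided by the share of transactions containing the consequent."""
    p_consequent = support_count(transactions, consequent) / len(transactions)
    return confidence(transactions, antecedent, consequent) / p_consequent


# Hypothetical transaction data.
transactions = [
    {"bread", "milk"},
    {"bread", "jelly", "milk"},
    {"bread", "jelly"},
    {"milk"},
    {"bread", "jelly", "peanut butter"},
]

# Rule: if {bread}, then {jelly}.
print(confidence(transactions, {"bread"}, {"jelly"}))  # 3 / 4
print(lift_ratio(transactions, {"bread"}, {"jelly"}))  # (3 / 4) / (3 / 5)
```

A lift ratio above 1.0 suggests the antecedent makes the consequent more likely than it is in a randomly selected transaction.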

single linkage

considers only the two most similar observations between the two clusters (nearest neighbor). A measure of dissimilarity between two clusters based only on the two most similar observations between the two clusters. Can result in long, elongated clusters rather than compact, circular clusters.

group average linkage

distance between each pair of observations (one from each cluster) is added up and divided by the number of pairs to get an average. Creates clusters that are less dominated by the similarity between single pairs of observations. If cluster 1 consists of n1 observations and cluster 2 consists of n2 observations, the similarity of these clusters is the average of the n1 x n2 similarity measures.
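
The pairwise linkage measures in this set differ only in how they summarize the n1 x n2 distances between two clusters. The sketch below assumes Euclidean distance as the dissimilarity measure and uses two small hypothetical clusters; the function names are illustrative.

```python
import math
from itertools import product


def pairwise_distances(cluster_1, cluster_2):
    """Euclidean distance for every pair with one observation from each cluster."""
    return [math.dist(u, v) for u, v in product(cluster_1, cluster_2)]


def single_linkage(cluster_1, cluster_2):
    """Distance between the two most similar observations (nearest neighbor)."""
    return min(pairwise_distances(cluster_1, cluster_2))


def complete_linkage(cluster_1, cluster_2):
    """Distance between the two most dissimilar observations (farthest neighbor)."""
    return max(pairwise_distances(cluster_1, cluster_2))


def group_average_linkage(cluster_1, cluster_2):
    """Average of all n1 x n2 pairwise distances."""
    distances = pairwise_distances(cluster_1, cluster_2)
    return sum(distances) / len(distances)


# Hypothetical clusters of two observations each.
cluster_1 = [(1.0, 1.0), (2.0, 1.0)]
cluster_2 = [(5.0, 1.0), (9.0, 1.0)]
print(single_linkage(cluster_1, cluster_2))         # 3.0
print(complete_linkage(cluster_1, cluster_2))       # 8.0
print(group_average_linkage(cluster_1, cluster_2))  # (4 + 8 + 3 + 7) / 4 = 5.5
```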

mutually exclusive events

events that have no outcomes in common

Sparkline

is a minimalist type of line chart that can be placed directly into a cell in Excel - Contains no axes and displays only the line for the data; can be effectively used to provide information on overall trends for time series data - Takes up very little space

a random experiment

is a process that generates well defined outcomes

Euclidean distance

is the most common method to measure dissimilarity between observations - Euclidean distance becomes smaller as a pair of observations become more similar with respect to their variable values - is highly influenced by the scale on which variables are measured
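
A short sketch of the calculation, which also shows the scale sensitivity noted above. The two customers and their values are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Square root of the summed squared differences across variables."""
    if len(u) != len(v):
        raise ValueError("observations must have the same number of variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical observations: (age in years, income in dollars).
customer_a = (25, 50_000)
customer_b = (45, 51_000)

# With income in dollars, the income difference dominates the distance.
raw = euclidean_distance(customer_a, customer_b)

# The same two customers with income expressed in thousands of dollars.
rescaled = euclidean_distance((25, 50), (45, 51))

print(raw, rescaled)
```

Because the result changes with the units, variables are often standardized before distances are computed.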

Probability

is the numerical measure of the likelihood that an event will occur.

Sentiment analysis

is the process of clustering/categorizing comments or reviews as positive, negative, or neutral

tokenization

is the process of dividing text into separate terms, referred to as tokens

The goal of cluster analysis

is to group observations into clusters such that observations within a cluster are similar and observations in different clusters are dissimilar

goal of clustering

is to segment observations into similar groups based on observed variables.

text data

is unstructured data because, in its raw form, it cannot be stored in a traditional structured database. Text mining is the process of extracting useful information from text data. Other examples of unstructured data: video and audio data.

median linkage

median distance between each observation in one cluster and each observation in the other cluster. Reduces effect of outliers.

Ward's method

merges two clusters such that the dissimilarity of the observations within the resulting single cluster increases as little as possible. It tends to produce clearly defined clusters of similar size. Computes the centroid of the resulting merged cluster and then calculates the sum of squared dissimilarity between this centroid and each observation in the union of the two clusters.

Complete Linkage

considers only the two most dissimilar observations between the two clusters (farthest neighbor). Two clusters are considered close if their most different pair of observations is close, so everything in the merged cluster is close to everything else. It can be distorted by outliers.

Bar charts

provide a graphical summary of categorical data - Use horizontal bars to display the magnitude of the quantitative variable

Addition law

provides a way to compute the probability that event A or event B occurs or both events occur. Used to find the probability of the union of two events: P(A ∪ B) = P(A) + P(B) - P(A ∩ B). The intersection of A and B is the event containing outcomes that belong to both A and B.
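
The law can be checked by counting outcomes. The sketch below assumes one roll of a fair six-sided die as the random experiment; the events A, B, and C are hypothetical choices for illustration.

```python
from fractions import Fraction

# Assumed experiment: one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}


def probability(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))


A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is greater than 3
C = {1, 3}     # has no outcomes in common with A

# Addition law: P(A or B) = P(A) + P(B) - P(A and B).
union_direct = probability(A | B)
union_by_law = probability(A) + probability(B) - probability(A & B)
print(union_direct, union_by_law)

# For mutually exclusive events the intersection term is zero.
print(probability(A | C), probability(A) + probability(C))

# Complement: P(A) = 1 - P(A^c).
print(probability(A), 1 - probability(sample_space - A))
```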

stemming

is the process of converting a word to its stem or root word. Would drop the "ing" and "ed" and place only the stem in the list of words to be tracked.
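
A deliberately naive sketch of the idea in Python: it only strips the two suffixes mentioned above and will mishandle many English words (for example, doubled consonants), so it should be read as an illustration rather than a real stemmer. The function name, the minimum-length rule, and the sample words are assumptions.

```python
def naive_stem(word):
    """Strip a trailing 'ing' or 'ed' when enough of the word remains."""
    word = word.lower()
    for suffix in ("ing", "ed"):
        # Assumption: keep at least three letters so short words like "red" survive.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical tokens; all three forms collapse to one tracked term.
tokens = ["delay", "delayed", "delaying"]
tracked_terms = sorted({naive_stem(tok) for tok in tokens})
print(tracked_terms)
```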

bottom up hierarchical clustering

starts with each observation belonging to its own cluster and then sequentially merges the most similar clusters to create a series of nested clusters

confidence

(support of antecedent and consequent) / (support of antecedent). The conditional probability that the consequent of an association rule occurs given that the antecedent occurs.

random variable

takes on different numeric values based on chance

sample space

the set of all outcomes

Column charts

use vertical bars to display the magnitude of the quantitative variable. Good at making comparisons between categorical variables.

Centroid linkage

uses the distance between cluster centroids (means). Uses the averaging concept of cluster centroids: the centroid for cluster k, denoted c_k, is found by calculating the average value for each variable across all observations in the cluster. The centroid is the "average observation" of a cluster.

conditional probability

when the probability of one event is dependent on whether some related event has already occurred. It can be computed as the ratio of a joint probability to a marginal probability. We are considering the probability of event A given the condition that event B has occurred: the probability of A given B, written P(A | B).
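
The ratio of a joint probability to a marginal probability can be illustrated with a small joint probability table. The four probabilities below are invented for the example, and the helper name `marginal` is hypothetical.

```python
from fractions import Fraction

# Hypothetical joint probability table for events A and B.
joint = {
    ("A", "B"): Fraction(2, 10),
    ("A", "not B"): Fraction(2, 10),
    ("not A", "B"): Fraction(1, 10),
    ("not A", "not B"): Fraction(5, 10),
}


def marginal(table, position, outcome):
    """Sum the joint probabilities in the row or column for `outcome`."""
    return sum(p for key, p in table.items() if key[position] == outcome)


p_a = marginal(joint, 0, "A")          # row total for A
p_b = marginal(joint, 1, "B")          # column total for B
p_a_given_b = joint[("A", "B")] / p_b  # joint probability / marginal probability

print(p_a, p_b, p_a_given_b)

# A and B are independent only if conditioning on B leaves P(A) unchanged.
print(p_a_given_b == p_a)

# Multiplication law: P(A and B) = P(B) * P(A | B).
print(p_b * p_a_given_b == joint[("A", "B")])
```

In this made-up table P(A | B) differs from P(A), so the two events are dependent.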

