BA CH3&4


Corpus

A collection of text documents to be analyzed is called a corpus.

PivotTable

A crosstabulation in Microsoft Excel.

Unsupervised learning:

A descriptive data-mining technique used to identify relationships between observations. Thought of as high-dimensional descriptive analytics. There is no outcome variable to predict; instead, qualitative assessments are used to assess and compare the results.

Line chart:

A line connects the points in the chart. Useful for time series data collected over a period of time (minutes, hours, days, years, etc.).

Trendline:

A line that provides an approximation of the relationship between the variables.

Geographic information system (GIS)

A system that merges maps and statistics to present data collected over different geographic areas. Helps in interpreting data and observing patterns.

Heat map:

A two-dimensional graphical representation of data that uses different shades of color to indicate magnitude.

Crosstabulation

A useful type of table for describing data of two variables.
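
As a rough illustration of the idea, a crosstabulation can be built by counting how often each pair of values occurs. This is a minimal sketch using only the Python standard library; the function name `crosstab`, the variable names, and the restaurant-style records are hypothetical and made up for the example.

```python
from collections import Counter

def crosstab(rows, var_a, var_b):
    """Count observations for every combination of two variables."""
    counts = Counter((r[var_a], r[var_b]) for r in rows)
    a_levels = sorted({r[var_a] for r in rows})
    b_levels = sorted({r[var_b] for r in rows})
    # Rows of the table are levels of var_a, columns are levels of var_b.
    return {a: {b: counts[(a, b)] for b in b_levels} for a in a_levels}

# Made-up illustrative records (not from the source).
rows = [
    {"quality": "good", "price": "low"},
    {"quality": "good", "price": "high"},
    {"quality": "very good", "price": "high"},
    {"quality": "good", "price": "low"},
    {"quality": "excellent", "price": "high"},
]
table = crosstab(rows, "quality", "price")
for level, counts in table.items():
    print(level, counts)
```

In Excel, a PivotTable produces the same kind of summary without code.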

Stacked-column chart:

Allows the reader to compare the relative values of quantitative variables for the same category in a bar chart.

market basket analysis

Although association rules are an important tool in market basket analysis, they are also applicable to other disciplines.

Clustered-column (or bar) chart:

An alternative chart to stacked-column chart for comparing quantitative variables.

Median linkage:

Analogous to group average linkage except that it uses the median of the similarities computed between all pairs of observations between the two clusters.

Key performance indicators (KPIs) in dashboards:

Automobile dashboard: current speed, fuel level, and oil pressure. Business dashboard: financial position, inventory on hand, and customer service metrics.

Table Design Principles

Avoid using vertical lines in a table unless they are necessary for clarity. Horizontal lines are generally necessary only for separating column titles from data values or when indicating that a calculation has taken place.

Parallel-coordinates plot

Chart for examining data with more than two variables: Includes a different vertical axis for each variable. Each observation is represented by drawing a line on the parallel-coordinates plot connecting each vertical axis. The height of the line on each vertical axis represents the value taken by that observation for the variable corresponding to the vertical axis.

Pie chart:

Common form of chart used to compare categorical data.

market segmentation

Commonly used in marketing to divide customers into different homogeneous groups; known as market segmentation.

Data visualization involves

Creating a summary table for the data. Generating charts to help interpret, analyze, and learn from the data.

Data dashboard:

Data-visualization tool that illustrates multiple metrics and automatically updates these metrics as new data become available.

Group Average linkage:

Defines the similarity between two clusters to be the average similarity computed over all pairs of observations between the two clusters.

Hierarchical Clustering

Determines the similarity of two clusters by considering the similarity between the observations composing either cluster. Starts with each observation in its own cluster and then iteratively combines the two most similar clusters into a single cluster. Given a way to measure similarity between observations, several methods are available for comparing the observations in two clusters to obtain a cluster similarity measure: single linkage, complete linkage, group average linkage, median linkage, and centroid linkage.
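
The bottom-up procedure and three of the linkage methods can be sketched as follows. This is a simplified illustration, not a reference implementation: it assumes Euclidean distance as the dissimilarity measure (so "most similar" means "smallest distance"), uses `math.dist` from Python 3.8+, and the names `cluster_distance` and `agglomerate` are hypothetical.

```python
import math
from itertools import product

def cluster_distance(cluster_a, cluster_b, linkage="single"):
    """Dissimilarity between two clusters under a given linkage method."""
    # Distances over all pairs of observations, one from each cluster.
    d = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if linkage == "single":      # most similar pair
        return min(d)
    if linkage == "complete":    # most different pair
        return max(d)
    if linkage == "average":     # group average over all pairs
        return sum(d) / len(d)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(points, n_clusters, linkage="single"):
    """Start with one cluster per observation; merge the closest pair repeatedly."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], linkage))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Made-up illustrative observations.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerate(points, 3, linkage="complete"))
```

Recording which clusters merge at each step, and at what distance, gives the information a dendrogram displays.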

k-Means Clustering:

Given a value of k, the k-means algorithm randomly assigns each observation to one of the k clusters. After all observations have been assigned to a cluster, the resulting cluster centroids are calculated. Using the updated cluster centroids, all observations are reassigned to the cluster with the closest centroid.
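
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: squared Euclidean distance decides which centroid is closest, the reassignment step repeats until no label changes (or `max_iter` is reached), and an empty cluster is re-seeded with a random observation. The empty-cluster handling and the names `k_means` and `squared_distance` are choices made for this example, not part of the definition.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly assign each observation to one of the k clusters.
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # Step 2: calculate the centroid (mean) of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                # Assumption: re-seed an empty cluster with a random observation.
                members = [rng.choice(points)]
            centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
        # Step 3: reassign each observation to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no observation changed cluster
        labels = new_labels
    return labels, centroids

# Made-up illustrative observations.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Because the initial assignment is random, different seeds can lead to different final clusters.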

Bubble chart:

Graphical means of visualizing three variables in a two-dimensional graph that sometimes is a preferred alternative to a 3-D graph.

Scatter chart:

Graphical presentation of the relationship between two quantitative variables.

Uses of data visualization

Helpful for identifying data errors. Reduces the size of your data set by highlighting important relationships and trends in the data.

Confidence

Helps identify reliable association rules: confidence = (support of {antecedent and consequent}) / (support of antecedent).

Association rules:

If-then statements which convey the likelihood of certain items being purchased together.

Lift ratio

Measure to evaluate the efficiency of a rule: lift ratio = confidence / (support of consequent / total number of transactions).

Data-ink ratio

Measures the proportion of what Tufte terms "data-ink" to the total amount of ink used in a table or chart.

Support count of an item set

Number of transactions in the data that include that item set.
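
Support count, confidence, and lift ratio can be computed directly from a list of transactions. The sketch below uses the standard definitions: confidence is the support of the antecedent and consequent together divided by the support of the antecedent, and lift ratio is confidence divided by the share of all transactions containing the consequent. The function names and the grocery transactions are hypothetical and made up for illustration.

```python
def support_count(transactions, item_set):
    """Number of transactions that include every item in item_set."""
    items = set(item_set)
    return sum(1 for t in transactions if items <= set(t))

def confidence(transactions, antecedent, consequent):
    both = support_count(transactions, set(antecedent) | set(consequent))
    return both / support_count(transactions, antecedent)

def lift_ratio(transactions, antecedent, consequent):
    # Baseline: how often the consequent appears regardless of the antecedent.
    baseline = support_count(transactions, consequent) / len(transactions)
    return confidence(transactions, antecedent, consequent) / baseline

# Made-up illustrative transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "jelly", "milk"},
    {"bread", "jelly"},
    {"milk"},
    {"bread", "jelly", "peanut butter"},
]

# Rule: if {bread}, then {jelly}.
print(support_count(transactions, {"bread", "jelly"}))  # 3
print(confidence(transactions, {"bread"}, {"jelly"}))   # 0.75
print(lift_ratio(transactions, {"bread"}, {"jelly"}))   # approximately 1.25
```

A lift ratio above 1 suggests the antecedent makes the consequent more likely than it is in a randomly chosen transaction.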

Observation:

Set of recorded values of variables associated with a single entity.

Principles of Effective Data Dashboards (continued):

Should provide timely summary information on KPIs that are important to the user. Should present all KPIs on a single screen that a user can quickly scan to understand the business's current state of operations. The KPIs displayed in the data dashboard should convey meaning to its user and be related to the decisions the user makes. A data dashboard should call attention to unusual measures that may require attention. Color should be used to call attention to specific values and to differentiate categorical variables, but the use of color should be restrained.

Sparkline

Special type of line chart: a minimalist line chart that can be placed directly into a cell in Excel. Contains no axes; displays only the line for the data. Takes up very little space and can be effectively used to provide information on overall trends for time series data.

unstructured data

Text data is often referred to as unstructured data because, in its raw form, it cannot be stored in a traditional structured database. Audio and video data are also examples of unstructured data. Data mining with text data is more challenging than data mining with traditional numerical data because it requires more preprocessing to convert the text to a format amenable to analysis.

Antecedent:

The collection of items (or item set) corresponding to the if portion of the rule.

Consequent

The item set corresponding to the then portion of the rule.

Tables should be used when

The reader needs to refer to specific numerical values. The reader needs to make precise comparisons between different values and not just relative comparisons. The values being displayed have different units or very different magnitudes.

Single linkage:

The similarity between two clusters is defined by the similarity of the pair of observations (one from each cluster) that are the most similar.

matching coefficient

The simplest overlap measure is called the matching coefficient.

Complete linkage:

This clustering method defines the similarity between two clusters as the similarity of the pair of observations (one from each cluster) that are the most different.

Jaccard's coefficient

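To avoid misstating similarity due to the absence of a feature, a similarity measure called Jaccard's coefficient does not count matching zero entries and is computed as: (number of variables with matching nonzero values) / (total number of variables minus number of variables with matching zero values).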
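
The difference between the matching coefficient and Jaccard's coefficient shows up clearly on a pair of 0/1 observations. This is a minimal sketch assuming binary variables coded as 0 and 1; the function names and the example vectors are hypothetical.

```python
def matching_coefficient(u, v):
    """Share of variables on which the two observations agree (0-0 or 1-1)."""
    matches = sum(1 for a, b in zip(u, v) if a == b)
    return matches / len(u)

def jaccard_coefficient(u, v):
    """Like the matching coefficient, but matching zero entries are not counted."""
    both_one = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    both_zero = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    # Note: undefined (division by zero) if every entry is a matching zero.
    return both_one / (len(u) - both_zero)

# Made-up illustrative observations on five binary variables.
u = [1, 0, 0, 1, 0]
v = [1, 0, 0, 0, 0]
print(matching_coefficient(u, v))  # 0.8
print(jaccard_coefficient(u, v))   # 0.5
```

Three of the four matches here are shared absences, which is why Jaccard's coefficient reports a lower similarity.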

PivotChart:

To summarize and analyze data with both a crosstabulation and charting, Excel pairs PivotCharts with PivotTables.

Bar Charts

Use horizontal bars to display the magnitude of the quantitative variable.

Column Charts

Use vertical bars to display the magnitude of the quantitative variable.

Scatter-chart matrix

Useful chart for displaying multiple variables.

Treemap:

Useful for visualizing hierarchical data along multiple dimensions.

Charts (or graphs):

Visual methods of displaying data.

k-means clustering

assigns each observation to one of k clusters in a manner such that the observations assigned to the same cluster are as similar as possible.

When McQuitty's method

considers merging two clusters A and B, the dissimilarity of the resulting cluster AB to any other cluster C is calculated as: ((dissimilarity between A and C) + (dissimilarity between B and C)) divided by 2.

A dendrogram

is a chart that depicts the set of nested clusters resulting at each step of aggregation.

A frequency term-document matrix

is a matrix whose rows represent documents and columns represent tokens, and the entries in the matrix are the frequency of occurrence of each token in each document.

A presence/absence or binary term-document matrix

is a matrix with the rows representing documents and the columns representing words. Entries in the columns indicate either the presence or the absence of a particular word in a particular document.

Euclidean distance

is the most common method to measure dissimilarity between observations.
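
For two observations measured on the same variables, Euclidean distance is the square root of the sum of squared differences across those variables. A minimal sketch, with a hypothetical function name:

```python
import math

def euclidean_distance(u, v):
    """Square root of the sum of squared differences between two observations."""
    if len(u) != len(v):
        raise ValueError("observations must have the same number of variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

Variables on very different scales can dominate this measure, so observations are often standardized before distances are computed.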

Stemming

is the process of converting a word to its stem or root word.

Tokenization

is the process of dividing text into separate terms, referred to as tokens. Symbols and punctuation must be removed from the document, and all letters should be converted to lowercase. Different forms of the same word, such as "stacking," "stacked," and "stack," probably should not be considered distinct terms.
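
Tokenization and the two kinds of term-document matrix can be sketched together. This is a simplified illustration: it strips ASCII punctuation, lowercases, and splits on whitespace, and it does not perform stemming. The function names and the two example documents are hypothetical.

```python
import string

def tokenize(text):
    """Lowercase the text, remove punctuation, and split into tokens."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def term_document_matrices(documents):
    """Return the vocabulary plus frequency and binary term-document matrices.

    Rows represent documents; columns represent tokens in vocabulary order.
    """
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    frequency = [[doc.count(term) for term in vocabulary] for doc in tokenized]
    binary = [[1 if n > 0 else 0 for n in row] for row in frequency]
    return vocabulary, frequency, binary

# Made-up illustrative corpus.
documents = ["The service was great, great food!", "Food was cold."]
vocabulary, frequency, binary = term_document_matrices(documents)
print(vocabulary)  # ['cold', 'food', 'great', 'service', 'the', 'was']
print(frequency)   # [[0, 1, 2, 1, 1, 1], [1, 1, 0, 0, 0, 1]]
print(binary)      # [[0, 1, 1, 1, 1, 1], [1, 1, 0, 0, 0, 1]]
```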

Text mining

is the process of extracting useful information from text data.

Evaluating Association Rules

An association rule is ultimately judged on how actionable it is and how well it explains the relationship between item sets. For example, Walmart mined its transactional data to uncover strong evidence of the association rule, "If a customer purchases a Barbie doll, then a customer also purchases a candy bar." An association rule is useful if it is well supported and explains an important, previously unknown relationship.

Ward's method

merges two clusters such that the dissimilarity of the observations with the resulting single cluster increases as little as possible.

Bottom-up hierarchical clustering

starts with each observation belonging to its own cluster and then sequentially merges the most similar clusters to create a series of nested clusters.

Data-ink

The ink used in a table or chart that is necessary to convey the meaning of the data to the audience.

Non-data-ink

The ink used in a table or chart that serves no useful purpose in conveying the data to the audience.

Centroid linkage

uses the averaging concept of cluster centroids to define between-cluster similarity.

