Business Analytics Exam 1

Percent Change

(End Value - Beg. Value)/Beg. Value
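As a minimal sketch of the formula above (the function name and the sample values are illustrative assumptions, not from the notes), percent change can be computed like this:

```python
def percent_change(beginning_value, ending_value):
    """Return (End Value - Beg. Value) / Beg. Value as a fraction."""
    if beginning_value == 0:
        raise ValueError("percent change is undefined when the beginning value is 0")
    return (ending_value - beginning_value) / beginning_value


# Hypothetical values: a quantity that grows from 200 to 250.
change = percent_change(200, 250)
print(f"{change:.1%}")  # prints 25.0%
```

The function returns a fraction (0.25); multiply by 100 or use the `%` format specifier to show it as a percentage.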

Data visualization involves:

- Creating a summary table for the data
- Generating charts to help interpret, analyze, and learn from the data

Effective Design Techniques:

- Data-ink ratio: Measures the proportion of what Tufte terms "data-ink" to the total amount of ink used in a table or chart. Helpful for creating effective tables and charts for data visualization.
-- Data-ink: Ink used in a table or chart that is necessary to convey the meaning of the data to the audience.
-- Non-data-ink: Ink used in a table or chart that serves no useful purpose in conveying the data to the audience.

Uses of data visualization:

- Helpful for identifying data errors
- Helps reduce the size of the data set by highlighting important relationships and trends in the data

Data Cleansing

-- Missing Data
-- Blakely Tires
-- Identification of Erroneous Outliers and Other Erroneous Values
-- Variable Representation

Missing Data:

- Data sets commonly include observations with missing values for one or more variables.
- In some cases missing data occur naturally; these are called legitimately missing data.
-- Generally, no remedial action is taken for legitimately missing data.
- In other cases missing data occur for other reasons; these are called illegitimately missing data.
-- The primary options for addressing such missing data are:
1. Discard observations (rows) with any missing values.
2. Discard any variable (column) with missing values.
3. Fill in missing entries with estimated values (imputation).
4. Apply a data-mining algorithm that can handle missing values.
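The first three options can be sketched in plain Python. The rows below are hypothetical, `None` is assumed to mark a missing value, and mean imputation is only one of several possible ways to estimate a missing entry:

```python
from statistics import mean

# Hypothetical observations; None marks a missing value.
rows = [
    {"age_years": 2.0, "miles": 21000, "tread_depth": 7.5},
    {"age_years": 4.0, "miles": None, "tread_depth": 4.0},
    {"age_years": 5.0, "miles": 52000, "tread_depth": 2.5},
]

# Option 1: discard observations (rows) with any missing value.
complete_rows = [r for r in rows if None not in r.values()]

# Option 2: discard variables (columns) with any missing value.
complete_columns = [k for k in rows[0] if all(r[k] is not None for r in rows)]
reduced_rows = [{k: r[k] for k in complete_columns} for r in rows]


# Option 3: impute missing entries, here with the mean of the observed values.
def impute_mean(rows, column):
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


imputed_rows = impute_mean(rows, "miles")
print(len(complete_rows), complete_columns, imputed_rows[1]["miles"])
```

Option 1 keeps two of the three rows, option 2 keeps the two fully observed columns, and option 3 keeps everything but replaces the missing mileage with an estimate.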

Identification of Erroneous Outliers and other Erroneous Values:

- Examining the variables in the data set with summary statistics, frequency distributions, bar charts and histograms, z-scores, scatter plots, correlation coefficients, and other tools can uncover data-quality issues and outliers.
-- Many software packages ignore missing values when calculating summary statistics.
-- If missing values in a data set are indicated with a unique value (such as 9999999), software may treat those values as real data when calculating summary statistics.
-- Both cases can result in misleading values for summary statistics.
-- Many analysts prefer to deal with missing data issues before using summary statistics to identify erroneous outliers and other erroneous values in the data.
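A small illustration of the second point, assuming a hypothetical mileage column in which 9999999 codes a missing value:

```python
from statistics import mean

SENTINEL = 9999999  # unique value used to mark a missing entry
mileage = [21000, 35000, SENTINEL, 52000]  # hypothetical values

# Treating the sentinel as real data badly distorts the mean.
naive_mean = mean(mileage)

# Dropping the sentinel first gives a mean of the observed values only.
observed = [m for m in mileage if m != SENTINEL]
clean_mean = mean(observed)

print(naive_mean, clean_mean)
```

The naive mean is in the millions, while the mean of the three observed values is 36000, which is why missing-value codes are usually handled before computing summary statistics.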

Boxplots

A boxplot is a graphical summary of the distribution of data. It is developed from the quartiles of the data set.

Histograms

A common graphical presentation of quantitative data.
-- Constructed by placing the variable of interest on the horizontal axis and the selected frequency measure (absolute frequency, relative frequency, or percent frequency) on the vertical axis.
-- The frequency measure of each class is shown by drawing a rectangle whose base spans the class limits on the horizontal axis and whose height is the corresponding frequency measure.
-- Histograms provide information about the shape, or form, of a distribution.
-- Skewness, or lack of symmetry, is an important characteristic of the shape of a distribution.

Pivot Tables

A crosstabulation in Microsoft Excel.

Percentiles

A percentile is the value of a variable at which a specified (approximate) percentage of observations are below that value. The pth percentile is the point in the data where approximately p percent of the observations have values less than the pth percentile and approximately (100 - p) percent have values greater than the pth percentile.

Scatter Charts:

A scatter chart is a useful graph for analyzing the relationship between two variables. The scatter chart in Figure 2.26 is an example of a positive relationship, because when one variable (high temperature) increases, the other variable (sales of bottled water) generally also increases. The scatter chart also suggests that a straight line could be used as an approximation for the relationship between high temperature and sales of bottled water.

Crosstabulation

A useful type of table for describing data of two variables.
-- Excel's COUNTIFS() function can be used to build a crosstabulation (see the restaurant data set).
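Outside Excel, the same counting idea can be sketched with `collections.Counter`. The (quality rating, price range) pairs below are hypothetical and are not the restaurant data set mentioned above:

```python
from collections import Counter

# Hypothetical (quality rating, price range) pairs, one per restaurant.
restaurants = [
    ("Good", "$10-19"), ("Very Good", "$20-29"), ("Good", "$10-19"),
    ("Excellent", "$30-39"), ("Very Good", "$10-19"), ("Good", "$20-29"),
]

# Count each (row, column) combination, much as COUNTIFS does with two criteria.
counts = Counter(restaurants)
row_labels = sorted({quality for quality, _ in restaurants})
col_labels = sorted({price for _, price in restaurants})

# table[row][column] is the number of observations in that cell.
table = {q: {p: counts[(q, p)] for p in col_labels} for q in row_labels}

for q in row_labels:
    print(q, [table[q][p] for p in col_labels])
```

Combinations that never occur get a count of 0 because `Counter` returns 0 for missing keys.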

Advanced Data Visualization

- Advanced Charts
- Geographic Information Systems Charts

Big Data

Any set of data that is too large or too complex to be handled by standard data-processing techniques and typical desktop software. IBM describes the phenomenon of big data through the four Vs: Volume, Velocity, Variety, and Veracity.

Table Design Principles

Avoid using vertical lines in a table unless they are necessary for clarity. Horizontal lines are generally necessary only for separating column titles from data values or when indicating that a calculation has taken place.

bar/column chart

Bar Charts: Use horizontal bars to display the magnitude of the quantitative variable. Column Charts: Use vertical bars to display the magnitude of the quantitative variable. Bar and column charts are very helpful in making comparisons between categorical variables.

Quantitative and Categorical Data:

Categorical (qualitative) data are pieces of information that allow us to classify the objects under investigation into various categories. Quantitative data are responses that are numerical in nature and with which we can perform meaningful arithmetic calculations.

types of charts

Charts (or graphs): Visual methods of displaying data.
Scatter chart: Graphical presentation of the relationship between two quantitative variables.
Trendline: A line that provides an approximation of the relationship between the variables.
Line chart: A line connects the points in the chart. Useful for time series data collected over a period of time (minutes, hours, days, years, etc.).

Covariance:

Covariance is a descriptive measure of the linear association between two variables.
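The notes do not give a formula, so the sketch below assumes the usual sample covariance, which divides the sum of cross-products of deviations by n - 1. The temperature and sales values are hypothetical:

```python
from statistics import mean


def sample_covariance(x, y):
    """Sample covariance: sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar, y_bar = mean(x), mean(y)
    cross_products = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    return cross_products / (len(x) - 1)


# Hypothetical daily high temperatures and bottled water sales.
high_temps = [70, 75, 80, 85, 90]
water_sales = [20, 25, 32, 38, 45]
print(sample_covariance(high_temps, water_sales))  # 78.75
```

A positive result indicates that the two variables tend to move in the same direction; the magnitude depends on the units of both variables.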

Tools of business analytics can aid decision making by

- Creating insights from data.
- Improving our ability to more accurately forecast for planning.
- Helping us quantify risk.
- Yielding better alternatives through analysis and optimization.

Cross-Sectional and Time Series Data

Cross-sectional data: Data collected from several entities at the same, or approximately the same, point in time.
Time series data: Data collected over several time periods.
-- Graphs of time series data are frequently found in business and economic publications.
-- Graphs help analysts understand what happened in the past, identify trends over time, and project future levels for the time series.

Cumulative Distributions

Cumulative frequency distribution: A variation of the frequency distribution that provides another tabular summary of quantitative data.
-- Uses the number of classes, class widths, and class limits developed for the frequency distribution.
-- Shows the number of data items with values less than or equal to the upper class limit of each class.
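Given an existing frequency distribution, the cumulative version is a running total. A minimal sketch with hypothetical class limits and frequencies:

```python
from itertools import accumulate

# Hypothetical frequency distribution: upper class limits and class frequencies.
upper_limits = [14, 19, 24, 29, 34]
frequencies = [4, 8, 5, 2, 1]

# Running total: number of items less than or equal to each upper class limit.
cumulative = list(accumulate(frequencies))

for limit, total in zip(upper_limits, cumulative):
    print(f"<= {limit}: {total}")
```

The last cumulative frequency always equals the total number of observations (20 here).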

Overview of Using Data: Definitions and Goals

Data: The facts and figures collected, analyzed, and summarized for presentation and interpretation.
Variable: A characteristic or a quantity of interest that can take on different values.
Observation: A set of values corresponding to a set of variables.
Variation: The difference in a variable measured over observations.
Random variable/uncertain variable: A quantity whose values are not known with certainty.

Blakely Tires:

Ex. A U.S. producer of automobile tires wants to learn about the conditions of its tires on automobiles in Texas. New tires have a tread depth of 10/32 inch. Tires last for 4-6 years or 40K-60K miles. The data obtained includes the position of the tire on the automobile, age of the tire, mileage on the tire, and depth of the remaining tread on the tire.

Sources of Data:

Experimental study: A variable of interest is first identified. Then one or more other variables are identified and controlled or manipulated so that data can be obtained about how they influence the variable of interest. Nonexperimental study or observational study: Makes no attempt to control the variables of interest. A survey is perhaps the most common type of observational study.

Frequency Distributions for Categorical Data:

Frequency distribution: A summary of data that shows the number (frequency) of observations in each of several nonoverlapping classes, typically referred to as bins.

Geographic Information Systems Charts:

Geographic information system (GIS): A system that merges maps and statistics to present data collected over different geographic areas. Helps in interpreting data and observing patterns.

Variable Representation:

In many data-mining applications, it may be prohibitive to analyze the data because of the number of variables recorded. Dimension reduction is the process of removing variables from the analysis without losing crucial information. A critical part of data mining is determining how to represent the measurements of the variables and which variables to consider. Often data sets contain variables that, considered separately, are not particularly insightful but that, when appropriately combined, result in a new variable that reveals an important relationship.

Location of pth percentile:

Lp = (p/100)(n+1)
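A sketch of how this location can be turned into a percentile value. It assumes the data are sorted in ascending order, that positions are counted from 1, and that a fractional location is resolved by linear interpolation between the two neighboring values; the clamping of out-of-range locations is also an assumption, and the data are hypothetical:

```python
import math


def percentile(values, p):
    """pth percentile using the location Lp = (p/100)(n + 1)."""
    if not values:
        raise ValueError("values must not be empty")
    data = sorted(values)
    n = len(data)
    location = (p / 100) * (n + 1)
    # Assumption: locations outside 1..n are clamped to the smallest/largest value.
    location = min(max(location, 1), n)
    lower = math.floor(location)
    fraction = location - lower
    if fraction == 0:
        return data[lower - 1]  # positions are 1-based
    # Interpolate between the values at positions lower and lower + 1.
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])


data = [10, 20, 30, 40, 50, 60, 70]  # hypothetical, n = 7
print(percentile(data, 50))  # location 4, prints 40
print(percentile(data, 25))  # location 2, prints 20
```

For p = 80 the location is 6.4, so the result lies 40% of the way from the 6th value (60) to the 7th value (70), which is approximately 64.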

Conditional Formatting of Data in Excel:

Makes it easy to identify data that satisfy certain conditions in a data set. Quick Analysis button appears just outside the bottom-right corner of a group of selected cells. It provides shortcuts for Conditional Formatting, adding Data Bars, and other operations.

Missing data (cont.)

Missing completely at random (MCAR): The tendency for an observation to be missing the value for some variable is entirely random; whether data are missing does not depend on either the value of the missing data or the value of any other variable in the data.
Missing at random (MAR): The tendency for an observation to be missing a value for some variable is related to the value of some other variable(s) in the data.
Missing not at random (MNAR): The tendency for the value of a variable to be missing is related to the value that is missing.

Identifying Outliers:

Outliers: Extreme values in a data set. They can be identified using standardized values (z-scores). Any data value with a z-score less than -3 or greater than +3 is treated as an outlier. Such data values can then be reviewed to determine their accuracy and whether they belong in the data set.
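The z-score rule can be sketched as follows. The readings are hypothetical, and the sample standard deviation (`statistics.stdev`) is assumed for the z-scores:

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardized values: (x - mean) / sample standard deviation."""
    x_bar, s = mean(values), stdev(values)
    return [(v - x_bar) / s for v in values]


def find_outliers(values, threshold=3):
    """Values whose z-score is below -threshold or above +threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Hypothetical readings: nineteen values near 50 and one suspicious entry.
readings = [48, 52, 50, 49, 51, 47, 53, 50, 50, 49,
            51, 48, 52, 50, 49, 51, 50, 50, 50, 150]
print(find_outliers(readings))  # [150]
```

Flagged values are candidates for review, not automatic deletions; the reading of 150 might be a data-entry error or a legitimate extreme value.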

Advanced Charts:

Parallel-coordinates plot: Chart for examining data with more than two variables.
-- Includes a different vertical axis for each variable.
-- Each observation is represented by drawing a line on the parallel-coordinates plot connecting each vertical axis.
-- The height of the line on each vertical axis represents the value taken by that observation for the variable corresponding to the vertical axis.
Treemap: Useful for visualizing hierarchical data along multiple dimensions.

pie/bubble/heat map charts

Pie chart: Common form of chart used to compare categorical data.
Bubble chart: Graphical means of visualizing three variables in a two-dimensional graph that sometimes is a preferred alternative to a 3-D graph.
Heat map: A two-dimensional graphical representation of data that uses different shades of color to indicate magnitude.

Population and Sample Data:

Population: All elements of interest.
Sample: A subset of the population.
Random sampling: A sampling method used to gather a representative sample of the population data.

Quartiles:

Quartiles: The values that divide the data into four equal parts.
-- Each part contains approximately 25% of the observations.
-- The difference between the third and first quartiles is often referred to as the interquartile range, or IQR; it spans the middle 50% of the data.
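A sketch using the standard library's `statistics.quantiles`. Its "exclusive" method places quartiles using n + 1 positions with linear interpolation, which is assumed here to match the intended convention; the data are hypothetical:

```python
from statistics import quantiles

# Hypothetical data, n = 9.
data = [15, 20, 25, 25, 27, 28, 30, 34, 40]

# n=4 requests the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1  # interquartile range: spread of the middle 50%

print(q1, q2, q3, iqr)  # 22.5 27.0 32.0 9.5
```

Other software may use a different interpolation rule and report slightly different quartiles for the same data.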

Relative Frequency and Percent Frequency Distributions:

Relative frequency distribution: A tabular summary of data showing the relative frequency for each bin.
Percent frequency distribution: Summarizes the percent frequency of the data for each bin.
-- Used to provide estimates of the relative likelihoods of different values of a random variable.
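A minimal sketch, assuming the usual definitions (relative frequency = bin frequency / n, percent frequency = relative frequency x 100) and hypothetical categorical responses:

```python
from collections import Counter

# Hypothetical categorical responses.
responses = ["A", "B", "A", "C", "A", "B", "A", "B", "C", "A"]

frequency = Counter(responses)
n = len(responses)

# Relative frequency: share of observations in each bin.
relative = {label: count / n for label, count in frequency.items()}
# Percent frequency: the same share expressed out of 100.
percent = {label: 100 * count / n for label, count in frequency.items()}

for label in sorted(frequency):
    print(label, frequency[label], relative[label], percent[label])
```

The relative frequencies sum to 1 and the percent frequencies sum to 100.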

Measures of Association Between Two Variables

- Scatter Charts
- Covariance
- Correlation Coefficient
- Nonlinear Relationships

Charts

- Scatter Charts
- Recommended Charts in Excel
- Line Charts
- Bar Charts and Column Charts
- A Note on Pie Charts and Three-Dimensional Charts
- Bubble Charts
- Heat Maps
- Additional Charts for Multiple Variables
- PivotCharts in Excel

Sparkline chart

Special type of line chart:
- Minimalist type of line chart that can be placed directly into a cell in Excel.
- Contains no axes; it displays only the line for the data.
- Takes up very little space and can be effectively used to provide information on overall trends for time series data.

Additional Charts for Multiple Variables:

Stacked-column chart: Allows the reader to compare the relative values of quantitative variables for the same category in a bar chart.
Clustered-column (or bar) chart: An alternative chart to stacked-column chart for comparing quantitative variables.
Scatter-chart matrix: Useful chart for displaying multiple variables.

Tables

Topics:
- Table Design Principles
- Crosstabulation
- PivotTables in Excel
- Recommended PivotTables in Excel
Tables should be used when:
1. The reader needs to refer to specific numerical values.
2. The reader needs to make precise comparisons between different values and not just relative comparisons.
3. The values being displayed have different units or very different magnitudes.

Correlation Coefficient:

The correlation coefficient measures the strength and direction of the linear relationship between two variables. Unlike covariance, it is not affected by the units of measurement for x and y.
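To illustrate the point about units, the sketch below assumes the Pearson sample correlation coefficient and hypothetical temperature and sales values. Converting temperature from Fahrenheit to Celsius leaves the coefficient unchanged up to floating-point rounding:

```python
import math
from statistics import mean


def correlation(x, y):
    """Pearson correlation: cross-products scaled by the spread of x and y."""
    x_bar, y_bar = mean(x), mean(y)
    sum_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sum_xx = sum((a - x_bar) ** 2 for a in x)
    sum_yy = sum((b - y_bar) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical daily high temperatures (Fahrenheit) and bottled water sales.
temps_f = [70, 75, 80, 85, 90]
sales = [20, 25, 32, 38, 45]
temps_c = [(f - 32) * 5 / 9 for f in temps_f]  # same temperatures in Celsius

r_f = correlation(temps_f, sales)
r_c = correlation(temps_c, sales)
print(round(r_f, 4), round(r_c, 4))
```

Both values are close to +1, indicating a strong positive linear relationship in this made-up data.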

Business analytics

The scientific process of transforming data into insight for making better decisions. Used for data-driven or fact-based decision making, which is often seen as more objective than other alternatives for decision making.

z-Scores:

The z-score measures the relative location of a value in the data set. Helps to determine how far a particular value is from the mean relative to the data set's standard deviation. Often called the standardized value.
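In symbols, a common way to write this, assuming x-bar denotes the sample mean and s the sample standard deviation (notation not given in these notes):

```latex
z_i = \frac{x_i - \bar{x}}{s}
```

A z-score of 1.5, for example, means the value lies 1.5 standard deviations above the mean; a negative z-score means the value lies below the mean.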

Interpretation of Correlation Coefficient:

There is a (weak/moderate/strong), (positive/negative) linear relationship between x and y.
-- Coefficient < 0: negative linear relationship.
-- Coefficient near 0: no linear relationship.
-- Coefficient > 0: positive linear relationship.

Frequency Distributions for Quantitative Data:

Three steps necessary to define the classes for a frequency distribution with quantitative data:
1. Determine the number of nonoverlapping bins.
2. Determine the width of each bin.
3. Determine the bin limits.
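The three steps can be sketched for integer-valued data. The width rule used here (range divided by the number of bins, rounded up so every value is covered) and starting the first bin at the smallest value are assumptions; in practice the width and limits are often rounded to more convenient numbers. The data are hypothetical:

```python
def frequency_distribution(values, number_of_bins):
    """Return [((lower, upper), count), ...] for integer-valued data."""
    # Step 1: the number of nonoverlapping bins is supplied by the caller.
    lowest, highest = min(values), max(values)
    # Step 2: choose a width large enough that the bins cover every value.
    width = (highest - lowest) // number_of_bins + 1
    # Step 3: bin limits, starting at the smallest value (an assumption).
    limits = [(lowest + i * width, lowest + (i + 1) * width - 1)
              for i in range(number_of_bins)]
    counts = [0] * number_of_bins
    for v in values:
        counts[(v - lowest) // width] += 1
    return list(zip(limits, counts))


# Hypothetical observations.
times = [12, 14, 19, 18, 15, 22, 27, 33, 21, 16]
for (lower, upper), count in frequency_distribution(times, 5):
    print(f"{lower}-{upper}: {count}")
```

With five bins the width works out to 5, giving the bins 12-16, 17-21, 22-26, 27-31, and 32-36.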

PivotCharts in Excel:

To summarize and analyze data with both a crosstabulation and charting, Excel pairs PivotCharts with PivotTables.

Empirical Rule:

When the distribution of data exhibits a symmetric bell-shaped distribution, the empirical rule can be used to determine the percentage of data values that are within a specified number of standard deviations of the mean. For data having a bell-shaped distribution:
-- Approximately 68% of the data values will be within 1 standard deviation of the mean.
-- Approximately 95% of the data values will be within 2 standard deviations of the mean.
-- Almost all the data values will be within 3 standard deviations of the mean.
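These percentages can be checked against an exact normal distribution, which is assumed here as the idealized bell shape; real data will only match approximately:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def coverage(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(coverage(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The rounded figures of 68%, 95%, and "almost all" in the empirical rule come from these values.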

