Data Visualization Exam 2
common mistakes in data dashboard design
- Not considering the reason the organization wants to develop a data dashboard, i.e., the organization's needs and objectives. This includes:
  - neglecting information that is important with respect to the objectives of the organization
  - focusing on information that is not meaningful with respect to the objectives of the organization
- Neglecting to obtain sufficient input from the actual users throughout the data dashboard design process
- Not considering the environment(s) in which the data dashboard will be used
- Failing to position complementary components (charts, tables, etc.) on the data dashboard in a manner that facilitates their simultaneous use
- Using an inappropriate or ineffective type of chart for the data and its message
- Neglecting the principles of good chart and table design when creating the individual components of the data dashboard
- Creating a data dashboard that is too cluttered
- Designing an unattractive visual display
- Not considering the organization's and users' potential future needs
50th percentile
The median
strip chart
A chart consisting of sorted variable values along either the horizontal or vertical axis. One shortcoming of both histograms and frequency polygons is that the specific values of the smallest and largest observations are difficult to discern from the visualization due to the binning of values. A strip chart is useful when we want to display a small set of values in a manner that shows the individual values. Occlusion in strip charts can be mitigated by (1) plotting hollow dots rather than filled dots and (2) jittering the observations. Jittering an observation involves slightly adjusting the value of one or more of the variables comprising the observation.
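Jittering can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's procedure: the function name `jitter_positions`, the `amount` parameter, and the sample ages are all hypothetical. It keeps each observed value intact and adds a small random offset on the other axis so that repeated values no longer sit on top of each other.

```python
import random

def jitter_positions(values, amount=0.2, seed=0):
    """Pair each value with a small random offset on the other axis.

    The observed value is left unchanged; only the plotting position
    across the strip is moved, by at most `amount` in either direction.
    """
    rng = random.Random(seed)  # fixed seed so the chart is reproducible
    return [(v, rng.uniform(-amount, amount)) for v in values]

# Hypothetical data with repeated values that would overlap in a strip chart
ages = [77, 77, 77, 80, 84, 84]
points = jitter_positions(ages, amount=0.2)
```

Each tuple in `points` can then be plotted as (value, offset) with hollow markers.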
PivotChart
A chart that allows the user to interact with the data by applying filters to select various aspects of the data to be displayed in the chart.
slope chart
A chart that shows the change of a single variable over time for multiple entities by connecting pairs of data points for each entity
interactive dashboard
A dashboard that allows users to customize the data dashboard display, effectively allowing a user to filter the data displayed on the dashboard.
dynamic dashboard
A dashboard that automatically receives and incorporates new data into the dashboard as the new data become available.
noninteractive dashboard
A dashboard that does not allow users to customize the data dashboard display.
static dashboard
A dashboard that may periodically be updated manually as new data and information are collected.
Analytical dashboard
A dashboard typically used by analysts to identify and investigate trends, predict outcomes, and discover insights in large volumes of data. Because analytical dashboards usually support exploration of long-term issues, these dashboards are generally updated less frequently than operational, strategic, or tactical dashboards.
Strategic dashboards
A dashboard typically used by executives to monitor the status of KPIs relevant to overarching organizational objectives.
Operational dashboard
A dashboard typically used by lower level managers to monitor rapidly changing critical business conditions.
Tactical dashboard
A dashboard typically used by mid-level managers to identify and assess the organization's strengths and weaknesses in support of the development of organizational strategies.
Considering the needs of the data dashboard's users
A data dashboard should be designed to assist a particular user or group of users with specific tasks associated with the management of an organization. Thus, it is important that the dashboard design team both understands the needs of the dashboard's end users and recognizes how addressing these needs will ultimately support the organization's dashboard objectives. This knowledge, in conjunction with adherence to the principles of effective data visualization, will help the dashboard developer determine the information that the dashboard should convey and the most effective manner for presenting this information to its intended audience.
Data Dashboard Users' Needs
A data dashboard should be designed to assist a particular user or group of users with specific tasks associated with the management of an organization.
• The dashboard design team must understand:
  • the needs of the dashboard's end users
  • how addressing these needs supports the organization's objectives
• This knowledge will help the dashboard developer determine:
  • what information the dashboard should include
  • the most effective manner for presenting it
data dashboard
A data visualization tool that gives multiple outputs and may update in real time.
probability distribution
A description of the range and relative likelihood of possible values of a random variable. A percent frequency distribution can be used to provide estimates of the relative likelihoods of different values for a random variable. So, by constructing a percent frequency distribution from observations of a random variable, we can estimate the probability distribution that characterizes its variability.
Time interval widget
A feature that allows the user to specify the time period to be displayed on a data dashboard.
Customization tools
A feature that allows the user to tailor the dashboard to specific needs.
Drilling down
A feature that provides the user with more specific and detailed information on a particular element, variable, or KPI.
Hierarchical filtering
A feature that provides the user with the capability to restrict the data displayed to a specific segment by systematically selecting values of several categories or values of variables in a nested manner.
relative frequency
A frequency measure in a distribution analysis that computes the fraction or proportion of observations in each of several nonoverlapping bins (classes). Relative frequency of a bin = (frequency of the bin) / n, where n is the total number of observations.
percent frequency
A frequency measure in a distribution analysis that computes the percentage of observations in each of several nonoverlapping bins (classes). A percent frequency distribution can be used to provide estimates of the relative likelihoods of different values for a random variable.
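As a small worked example of relative and percent frequency, the sketch below counts how often each category occurs and divides by the number of observations. The function name `relative_frequencies` and the soft drink purchases are made up for illustration.

```python
from collections import Counter

def relative_frequencies(observations):
    """Return {bin: frequency of bin / n} for a list of categorical observations."""
    counts = Counter(observations)
    n = len(observations)
    return {label: count / n for label, count in counts.items()}

# Hypothetical sample of five soft drink purchases
drinks = ["Cola", "Diet Cola", "Cola", "Lemon-Lime", "Cola"]
rel = relative_frequencies(drinks)

# Percent frequency is the relative frequency multiplied by 100
pct = {label: 100 * value for label, value in rel.items()}
```

Here "Cola" appears in 3 of 5 observations, so its relative frequency is 0.6 and its percent frequency is 60.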
mean
A measure of central location computed by summing the data values and dividing by the number of observations.
mode
A measure of central location defined as the value (or range of values) that occurs with the greatest frequency.
median
A measure of central location provided by the value in the middle when the data are arranged in ascending order. The median is the 50th percentile.
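The three measures of central location can be compared on one small data set. This is a sketch using Python's standard `statistics` module; the data values are hypothetical.

```python
import statistics

# Hypothetical data, already in ascending order
data = [2, 3, 3, 5, 7]

mean_value = statistics.mean(data)      # sum of the values divided by the number of observations
median_value = statistics.median(data)  # middle value of the sorted data (the 50th percentile)
mode_value = statistics.mode(data)      # value that occurs with the greatest frequency
```

For these values the mean is 20 / 5 = 4, while the median and the mode are both 3.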
Skewness
A measure of the lack of symmetry in a distribution.
big associated number (BAN)
A number associated with a visualization that is displayed in very large font for emphasis or to guide the audience's attention.
random variable
A quantity whose values are not known with certainty.
dot matrix chart
A simple chart that uses dots, or another simple graphic, to represent an item or groups of an item. It is useful for providing additional context to the audience for large numerical values.
storyboard
A simple visual organization of the main points of the story used to provide structure of the narrative to be developed for the audience.
Sparkline
A special type of line chart that indicates the trend of data but not magnitude. A sparkline does not include axes or labels.
Simple linear regression
A statistical procedure predicting the value of one dependent variable with the value of one independent variable through a linear equation.
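A simple linear regression can be sketched with the usual least squares formulas. The notes do not say how the line is estimated, so treating it as a least squares fit is an assumption here, and the function name `fit_line` and the data are hypothetical.

```python
def fit_line(x, y):
    """Least squares estimates (intercept, slope) for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data that lie exactly on the line y = 1 + 2x
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
intercept, slope = fit_line(x, y)

# Predict the dependent variable for a new value of the independent variable
predicted = intercept + slope * 5
```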
sample
A subset of the population
PivotTable
A table that allows the user to interact with the data by applying filters to select various aspects of the data to be displayed in the table.
Crosstabulation
A tabular summary of data for two variables. The classes of one variable are represented by the rows; the classes for the other variable are represented by the columns.
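A crosstabulation can be built by counting each (row class, column class) pair. The sketch below uses only the standard library; the region and customer status values are hypothetical.

```python
from collections import Counter

# Hypothetical records: (region, customer status)
records = [
    ("North", "New"),
    ("North", "Existing"),
    ("South", "New"),
    ("North", "New"),
]

counts = Counter(records)
row_classes = sorted({region for region, _ in records})
col_classes = sorted({status for _, status in records})

# Rows are the classes of one variable, columns the classes of the other
table = {
    region: {status: counts[(region, status)] for status in col_classes}
    for region in row_classes
}
```

A PivotTable in a spreadsheet produces the same kind of summary interactively.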
Lurking variable
A third variable associated with two variables being studied that results in a correlation between the two variables, falsely implying a causal relationship between the pair.
slicer
A tool that allows the spreadsheet user to filter the data to be displayed in PivotTables and PivotCharts.
Key performance indicator
A value a manager uses to operate and maintain their businesses effectively and efficiently. Also known as a KPI.
key performance indicators
A value a manager uses to operate and maintain their businesses effectively and efficiently. Also known as a KPI. The outputs provided by the dashboard are a set of KPIs for the organization that are aligned with the organization's goals and can be used to monitor current and potential future performance on a continual basis. By consolidating and presenting data from a number of sources in a data visualization designed for a specific set of purposes, data dashboards can help an organization better understand and use its data to improve decision making. The KPIs displayed in the data dashboard should quickly and clearly convey meaning to its user and be related to decisions the user makes.
Outlier
An unusually small or unusually large data value.
base year
Arbitrary year chosen to be the common year to measure economic values such as costs and prices to adjust for inflation.
Width of the Bins
As a general guideline, we recommend that the width be the same for each bin. Thus, the choices of the number of bins and the width of bins are not independent decisions. A larger number of bins means a smaller bin width and vice versa.
Storytelling with Charts
As we have discussed in this chapter, a goal of storytelling with data is to make it easy for the audience to interpret the insights from the data and compel the audience to act on those insights in some way. This starts with understanding the audience and is enabled by creating empathy with the data. Effective storytelling with data also requires you to use the correct chart for the data and the insights you are trying to convey to the audience.
Survivor bias
Bias that occurs when a sample data set consists of a disproportionately large number of observations corresponding to positive outcomes for a particular event.
Selection bias
Bias that occurs when data are drawn from a sample that has not been properly randomized to represent the intended population.
Data Dashboard Display
Considerations on how the information is displayed on a dashboard should include:
• selection of appropriate types of charts
• effective use of preattentive attributes, Gestalt principles, and color
• a practical layout that enables end users to quickly find the information they need and relate information from various charts in the dashboard
• strategies for avoiding overcrowding and unnecessary complexity, such as organizing information into subsets and using interactive tools
• the device used to access the dashboard, ambient lighting, size, and resolution of the display
Cross-sectional data
Data collected from several entities at the same or approximately the same point in time.
Data Dashboard Purpose
Data dashboard purpose: to support an organization's operations, decision-making, and strategic planning.
Data dashboard objectives may include:
• tracking KPIs and monitoring processes
• assessing attainment of goals and objectives
• developing/enhancing insight
• sharing information
• measuring performance
• forecasting and data exploration
Categorical variable
Data for which categories of like items are identified by labels or names. Arithmetic operations cannot be performed on categorical variables.
Stacked data
Data organized such that the values for categorical variables are in a single column
Time series data
Data that are collected at intervals over time.
Time series data
Data that are collected over a period of time (minutes, hours, days, months, years, etc.).
Geospatial data
Data that include information on the geographic location of each record.
quantitative variable
Data for which numerical values are used to indicate magnitude, such as how many or how much. Arithmetic operations such as addition, subtraction, and multiplication can be performed on a quantitative variable.
Variability
Differences in values of a variable over observations.
variation
Differences in values of a variable over observations. Practically every challenge an organization or individual faces is concerned with the impact that the possible values of relevant variables will have on an outcome of interest. Thus, we are concerned with how the value of a variable can vary.
Considering the organization's needs
Failure to consider the organization's motivations for creating a dashboard will leave the dashboard design team directionless, which can slow the development of the dashboard and potentially result in development of a dashboard that does not address the needs of the organization.
Determining how to deal with illegitimately missing data 2
MNAR - If a variable has observations for which the missing values are MNAR, the observation with missing values cannot be ignored because any analysis that includes the variable with MNAR values will be biased. Furthermore, there is no satisfactory manner to address a variable with missing data that are MNAR because it is the (unknown) values of the missing data that are causing them to be missing. If the variable with MNAR values is thought to be redundant with another variable in the data for which there are few or no missing values, removing the MNAR variable from consideration may be an option. In particular, if the
price index
Measure of the relative change in the price of a standard set of products and services over time. There are many popular price indexes that are tracked by economic organizations and used for adjusting for inflation, including the consumer price index (CPI) and the producer price index (PPI). Use the price index that is most closely associated with the type of products being analyzed.
Missing at random (MAR)
Missing data for which the tendency for an observation to be missing a value for a variable is related to the value of some other variable(s) in the observation. The occurrence of some missing values may not be completely at random: if the tendency for an observation to be missing a value for some variable is related to the value of some other variable(s) in the data, the missing data are MAR. For data that are MAR, the reason for the missing values may determine its importance. For example, if the responses to one survey question collected by a specific employee were lost due to a data entry error, then the treatment of the missing data may be less critical. However, in a health care study, suppose observations corresponding to patient visits are missing the results of diagnostic tests whenever the doctor deems the patient too sick to undergo the procedure. In this case, the absence of a variable measurement actually provides additional information about the patient's condition, which may be helpful in understanding other relationships in the data.
Missing not at random (MNAR)
Missing data for which the tendency for an observation to be missing a value of a variable is related to the missing value. Data are MNAR if there is a tendency for a missing entry of a variable to be related to its value. For example, survey respondents with extremely high or extremely low annual incomes may be less inclined than respondents with moderate annual incomes to respond to the question on annual income, and so these missing data for annual income are MNAR.
Missing completely at random (MCAR)
Missing data for which the tendency for an observation to be missing the value for a variable is entirely random and does not depend on either the missing value or the value of any other variable in the observation. For example, if a missing value for a question on a survey is completely unrelated to the value that is missing and is also completely unrelated to the value of any other question on the survey, the missing value is MCAR.
Data Dashboard Engineering
Once the dashboard design team understands the organization's objectives for the dashboard and the dashboard end users' related needs, the team should turn its attention to the information that should be displayed in the dashboard. All information displayed must meet the end users' needs, and the dashboard design team should work with the end users to ensure this occurs. The team should also determine the manner in which the information will be displayed. This includes selection of appropriate types of charts, effective use of preattentive attributes and Gestalt principles, appropriate use of color, and an effective layout that enables end users to easily find the information they need and relate information from various charts in the dashboard. It is also important that the dashboard is easy to read and interpret, and that the display is not too sparse, too crowded, or overly complex. In addition, at this stage the dashboard design team should consider the environment in which the data dashboard will be used. Data dashboards are most often accessed from desktop computers. However, some data dashboards are accessed with other devices such as tablets or smartphones on factory floors or retail showrooms, in automobiles, or outdoors. It is important to consider factors such as the device used to access the dashboard, ambient lighting, size and resolution of the display, likely distance of the user from the display, and whether a touch screen will be utilized when designing a data dashboard. A well-designed data dashboard can quickly lose its value to its users if it is difficult to update and maintain, so the design team should consider the skills and capabilities of the individual or team that will be responsible for its maintenance and updates.
nominal values
Raw values that have not been adjusted for inflation or other important factors.
Biased data
Sample data that are not representative of the population that is under study.
storytelling
Specific to stories generated from data, storytelling refers to the ability to build a narrative from the data that is meaningful for the audience, is memorable for the audience, and is likely to influence the audience.
Data Dashboard Organization
Store the dashboard data source in a separate worksheet.
• The dashboard should not draw information directly from the source data worksheet.
• Extract the data needed to create a chart or table from the source data worksheet and store it in a separate worksheet.
• The dashboard's users should not be allowed to make permanent changes to the dashboard.
Predictive analytics
Techniques that use models constructed from past data to predict the future or to ascertain the impact of one variable on another. For example, past data on product sales may be used to construct a mathematical model to predict future sales. This model can factor in the product's growth trajectory and seasonality based on past patterns. However, it is unrealistic to expect that these predicted point estimates have no error. That is, there is uncertainty in how close a predicted point estimate will be to the corresponding future observation.
Quartile
The 25th, 50th, and 75th percentiles, referred to as the first quartile, second quartile, and third quartile, respectively. The quartiles can be used to divide a set of data into four parts, each containing approximately 25% of the observations.
Logos
The ability to connect with the audience through logic and reasoning.
pathos
The ability to connect with the audience using emotion.
Ethos
The ability to show credibility in a story to the audience.
Empathy
The ability to understand and share in the feelings of others. Being able to empathize with data means remembering that data are not just numbers from a spreadsheet or database, but that these data often represent real people, and that the decisions made based on our analysis may have a substantial impact on real people. Therefore, it is important to consider how we can create data visualizations that can help others generate empathy. By empathizing with the data, we can create data visualizations to effectively impact decisions. Two common challenges for creating empathy with data are that audiences can lose the ability to associate meaning when considering large numerical values and that it can be difficult to consider individual cases when looking at aggregate statistics. Here, we will discuss ways of dealing with each of these challenges. We can try to create more empathy with the data by focusing not just on the aggregate statistics but also including something specific that makes these numbers seem more personable and relatable. One way to do this is to include pictures in the data visualization. This is even more effective if we can include individual characteristics so the audience can relate to this specific individual.
Data Dashboard Context
The context for the information included in the data dashboard may be provided by:
• showing how a KPI varies over time
• comparing the value of a KPI to an organizational goal
• comparing the value of a KPI:
  • internally across divisions, departments, or geographies of an organization
  • externally across customers
  • externally across competitors or organizations in the same industry
Data Dashboard Testing
The dashboard design team must test its work extensively at each step to minimize a data dashboard's errors, miscommunications, and misunderstandings.
• Failure to do so may result in:
  • the creation of a faulty final product, leading to poor decisions and missed opportunities
  • time-consuming and costly revisions
  • damaging the credibility of the data dashboard design team
Data Dashboard Engineering
The dashboard design team must understand how:
• to organize the charts and tables on the dashboard to facilitate users' analyses and limit eye travel
• to maintain and assess the data dashboard's effectiveness
• the organization's objectives may shift in the future. Specifically:
  • will the data dashboard need to reflect these shifts?
  • what new KPIs are likely to become important to the organization in the future?
  • what may be the source or format for the future incorporation of new data into the dashboard?
interquartile range
The difference between the third and first quartiles.
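The quartiles and the interquartile range can be computed with the standard `statistics` module. Several conventions exist for interpolating percentiles, and the notes do not say which one is intended, so the `method="inclusive"` choice below is an assumption; other tools may return slightly different quartiles for the same data. The data values are hypothetical.

```python
import statistics

# Hypothetical data
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for the three cut points that split the data into four parts
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1                       # interquartile range: third quartile minus first quartile
data_range = max(data) - min(data)  # range: largest value minus smallest value
```

With this convention the quartiles are 3, 5, and 7, so the interquartile range is 4, while the range is 8.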
Univariate analysis
The examination of the data for an individual variable.
Inflation
The general increase in prices over time.
bins
The nonoverlapping groupings of data used to create a frequency distribution. Bins for categorical data are also known as classes.
Data cleansing
The process of ensuring data is accurate and consistent through the identification and correction of errors and missing values.
aspect ratio
The proportion between the chart's width and its height.
Quantitative scales
The range of quantitative values along the horizontal and vertical axes in a chart.
Aspect ratio
The ratio of the width of a chart to the height of a chart.
geographic charts
The use of geographic maps for data visualization introduces additional possibilities for misleading the audience. Choropleth maps that use different shades of a color to represent quantitative variables are a common type of data visualization for exploring and examining data related to different geographic regions. Choropleth maps are commonly used to examine the differences in many different economic and health variables such as employment rates, income levels, cancer rates, political support, and average life spans across geographic regions such as counties, states, regions, and countries. In most cases, it is best to use a value relative to the population of the region when creating choropleth maps.
Steps to telling an effective story
To be effective at storytelling, we need to understand our audience. We also need to understand the story, or key insight(s), that we want to convey from the data. Once we know who our audience is and what story we want to tell, we can then start to think about what type of data visualization is most effective for that audience and that story. We can also think about specific design attributes and formatting that we should use in the data visualization to best convey our story to the audience.
Know your message
To best explain your data to an audience, you also need to ensure that you know what insight(s) you are trying to convey to the audience. This means that you need to understand the data well enough that you can communicate the insights clearly and succinctly. Not only should you be able to explain what the data mean to the audience, but you should also be able to explain the limitations inherent in the data and how these limitations affect the insights drawn from the data. To best explain the data, we need to understand what types of insights will help the decision maker. The goal of all analytical methods is to influence the audience in a way that facilitates better decisions.
real values
Values that have been adjusted for inflation.
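Converting nominal values to real values can be sketched as follows. The formula used here (nominal value times the base-year index divided by that year's index) is the common deflation formula and is an assumption, since the notes do not spell it out. The function name `to_real`, the sales figures, and the index values are hypothetical.

```python
def to_real(nominal, price_index, base_index=100.0):
    """Express a nominal value in base-year prices.

    Assumes the price index equals `base_index` in the base year.
    """
    return nominal * base_index / price_index

# Hypothetical nominal sales and a hypothetical price index with 2020 as the base year
nominal_sales = {2020: 1000.0, 2021: 1100.0}
price_index = {2020: 100.0, 2021: 110.0}

real_sales = {
    year: to_real(nominal_sales[year], price_index[year])
    for year in nominal_sales
}
```

In this example nominal sales grow by 10%, but prices also rise by 10%, so real sales are unchanged at 1000.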
Rhetorical Triangle
Visual illustration proposed by Aristotle to define three general areas in which a story should connect with an audience: ethos, logos, pathos
Freytag's Pyramid
Visual illustration that defines the five common elements of the structure of an effective story: introduction, rising action, climax, falling action, conclusion.
errors in data
Data visualization can help detect outliers. However, not all erroneous values in a data set are extreme, and these erroneous values are much more difficult to find. If the variable with suspected erroneous values has a relatively strong relationship with another variable in the data, we can explore the data set through data visualization tools such as scatter charts to help us identify data errors.
dual-axis chart
A chart that makes use of a secondary axis to represent one of the variables so that both variables can be shown on the same chart. However, in most cases, dual-axis charts are difficult for the audience to interpret, and there is often a better way to present the data.
Types of charts best suited for exploratory data analysis
scatter chart, histogram, box plot
Symmetric
Describes a histogram in which the left tail mirrors the shape of the right tail.
Trend
the long-run pattern in a time series observable over several periods of time
Which way is Irving skewed?
to the right - positively skewed
kernel density chart
A chart for visualizing the distribution that smooths the bin frequency values of a histogram representation by using kernel density estimation. It is a "continuous" alternative to histograms designed to overcome the reliance of histograms on the choice of number of bins and bin width. Kernel density charts employ a smoothing technique known as kernel density estimation to generate a more robust visualization of the distribution of a set of values. As an example, a kernel density chart for the 30 observations in the Death30 file is displayed with this note. Comparing the kernel density chart to the histograms in Figure 5.11, we observe that the kernel density chart smooths the extremes of the histograms in an attempt to generalize the patterns in the data. Excel does not have built-in functionality to construct a kernel density chart (which is not the same as a frequency polygon), but many statistical software packages such as R do.
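Kernel density estimation can be illustrated with a short sketch. The Gaussian kernel and the bandwidth of 5 are assumptions chosen for illustration (the notes do not specify either), the function name `make_kde` is hypothetical, and the ages are made-up values rather than the Death30 data.

```python
import math

def make_kde(data, bandwidth):
    """Return a function that estimates the density at x with a Gaussian kernel."""
    n = len(data)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Each observation contributes a small bell curve centered on its value
        total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
        return total / norm

    return density

# Hypothetical ages (not the Death30 file)
ages = [62, 70, 71, 77, 80, 81, 84, 90]
density = make_kde(ages, bandwidth=5.0)

# Evaluate the smooth curve on a grid from 40 to 120 in steps of 0.5
step = 0.5
grid = [40 + step * i for i in range(161)]
curve = [density(x) for x in grid]
```

Plotting `curve` against `grid` gives the kernel density chart. A larger bandwidth gives a smoother curve, which plays a role similar to choosing wider bins in a histogram.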
histogram
A columnar presentation of a frequency distribution, relative frequency distribution, or percent frequency distribution of quantitative data constructed by placing the bin intervals on the horizontal axis and the frequencies, relative frequencies, or percent frequencies on the vertical axis. A histogram is simply a column chart with no spaces between the columns, whose heights represent the frequencies of the corresponding bins. Eliminating the space between the columns allows a histogram to reflect the continuous nature of the variable of interest.
Tall data
A data set with many observations (rows). As data grow taller or wider, the possibility of data errors and missing values increases. In addition, wide data becomes increasingly arduous to explore because there are a large number of possible combinations of variables to examine.
Wide data
A data set with many variables (columns).
violin chart
A graphical method that encases the elements of a box and whisker chart inside a rotated and mirrored kernel density chart.
Scatter-chart matrix
A graphical presentation that uses multiple scatter charts arranged as a matrix to illustrate the relationships among multiple variables.
box and whisker chart
A graphical summary of data based on the quartiles of a distribution. Vertical lines, called whiskers, extend from the top and bottom sides of the box. Under the usual convention, the top whisker is drawn up to the largest value in the data that is less than or equal to the third quartile plus 1.5 times the interquartile range.
Moderately skewed right
A histogram is said to be skewed to the right if its tail extends farther to the right than to the left.
Equal-area cartogram
A map-like diagram that uses relative geographic positioning of regions, but uses shapes of equal (or near-equal) size to represent regions.
range
A measure of variability defined to be the largest value minus the smallest value
exception to using a line chart
An exception to using a line chart for time series data is when we want to emphasize or compare individual values of one or more variables over time. Then a good choice may be to use a column chart. Figure 6.22 provides an example of a (stacked) column chart in which total annual sales over time are being compared for different DCs and by customer status (new or existing)
power law
An observation that in some data the relative change in one variable results in a proportional relative change in another variable. If the relative frequency distribution of the first digits of these data deviates substantially from the relative frequency distribution in Figure 5.7, then there may be systemic errors in the data or the data may be fraudulent.
Ordinal variable
Data for which categories of like items are identified by labels or names and there is an inherent rank or order of the categories. In Step 5, we ordered the data from largest to smallest to facilitate the comparison of the different soft drink types. This reordering of the data is acceptable because there is no ordinal relationship between soft drinks. However, if an ordinal relationship exists between bin categories, then it would not be appropriate to sort the bins by the number of observations in each bin. Instead, the bins should be ordered according to their ordinal relationship.
Frequency Distribution for Quantitative Variable
For quantitative data, each bin in the frequency distribution is based on the range of values that the bin contains. To create a frequency distribution for quantitative data, three features need to be defined: (1) the number of nonoverlapping bins, (2) the width (numerical range) of each bin, and (3) the range spanned by the set of bins. To interpret a bin such as (77, 84], we note that a square bracket indicates that the end value is included in the bin, and a round parenthesis indicates that the end value is excluded. So when this bin has the highest frequency, the most common ages at death occur in the range greater than 77 years old and less than or equal to 84 years old. The choice of number of bins and bin width can strongly affect a histogram's display of a distribution.
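The binning rule above, with bins that exclude the lower limit and include the upper limit, can be sketched as follows. The function name `frequency_distribution`, its parameters, and the ages are hypothetical. The three arguments correspond to the three features to define: where the bins start, the bin width, and the number of bins.

```python
import math

def frequency_distribution(values, start, width, num_bins):
    """Count values in equal-width bins of the form (lower, upper]."""
    counts = [0] * num_bins
    for v in values:
        # ceil(...) - 1 places a value equal to an upper limit in that bin,
        # and a value equal to `start` in no bin at all (open lower limit)
        index = math.ceil((v - start) / width) - 1
        if 0 <= index < num_bins:
            counts[index] += 1
    labels = [
        f"({start + i * width}, {start + (i + 1) * width}]"
        for i in range(num_bins)
    ]
    return dict(zip(labels, counts))

# Hypothetical ages at death
ages = [71, 77, 78, 80, 84, 85, 90]
dist = frequency_distribution(ages, start=70, width=7, num_bins=3)
```

Here 77 falls in (70, 77] and 84 falls in (77, 84], so (77, 84] holds three observations and is the most common bin. Changing `width` or `num_bins` changes the counts, which is why these choices affect how a histogram looks.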
Illegitimately missing data
Missing data that do not occur naturally. These cases can result for a variety of reasons, such as a respondent electing not to answer a question that the respondent is expected to answer, a respondent dropping out of a study before its completion, or sensors or other electronic data collection equipment failing during a study. Remedial action is considered for illegitimately missing data. After detecting illegitimately missing data, the primary options for addressing them are (1) to discard observations (rows) with any missing values, (2) to fill in missing entries with estimated values, or (3) to treat missing data as a separate category if dealing with a categorical variable.
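The three options can be sketched on a small, made-up data set. The rows, column names, and the choice of the median as the estimated value are illustrative assumptions, not part of the source material.

```python
from statistics import median

# Hypothetical survey rows; None marks an illegitimately missing entry.
rows = [
    {"age": 34, "region": "East"},
    {"age": None, "region": "West"},
    {"age": 51, "region": None},
    {"age": 42, "region": "East"},
]

# Option 1: discard observations (rows) with any missing value.
complete = [r for r in rows if all(v is not None for v in r.values())]

# Option 2: fill in missing quantitative entries with an estimate
# (here the median of the observed ages).
age_median = median(r["age"] for r in rows if r["age"] is not None)
imputed = [
    {**r, "age": age_median if r["age"] is None else r["age"]} for r in rows
]

# Option 3: treat missing values of a categorical variable as a category.
categorized = [
    {**r, "region": "Missing" if r["region"] is None else r["region"]}
    for r in rows
]
```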
Legitimately missing data
Missing data that occur naturally. For example, respondents to a survey may be asked if they belong to a fraternity or a sorority, and then in the next question are asked how long they have belonged to a fraternity or a sorority. If a respondent does not belong to a fraternity or a sorority, the respondent should skip the ensuing question about how long they have belonged.
Multivariate analysis
The examination of patterns by considering two or more variables at once. Multivariate analysis often involves pairs of variables (and thus is bivariate) but may involve three or more variables. Whether the variables of interest are categorical or quantitative dictates the statistical summary and visualization technique deployed.
frequency polygon
A chart used to display a distribution by using lines to connect the frequency values of each bin. Like a histogram, a frequency polygon plots frequency counts of observations in a set of bins. However, a frequency polygon uses lines to connect the counts of different bins, in contrast to a histogram, which uses columns to depict the counts in different bins. While frequency polygons provide for a more transparent comparison of two or more distributions, for a single distribution they do not support the magnitude comparison of different bins as well as a histogram. Therefore, histograms are typically preferred for the visualization of a single variable's distribution.
Determining how to deal with illegitimately missing data
Whether the missing values are MCAR, MAR, or MNAR, the first course of action when faced with missing values is to try to determine the actual value that is missing by examining the source of the data or logically determining the likely value that is missing. If the missing values cannot be determined, we must determine how to handle them. MCAR: If a variable has observations for which the missing values are MCAR, then discarding the observations with missing values may be a good choice if there are a relatively small number of observations with missing values. When missing data are MCAR, removing observations with missing values is equivalent to randomly culling the rows of the data set. We will certainly lose information if the observations that are missing values for the variable are ignored, but the results of an analysis of the data will not be biased by the missing values. As an alternative to discarding observations with missing values that are MCAR, it may be useful to replace the missing entries for a variable with the variable's median, mean, or mode. MAR: If the missing values are MAR, then there is a relationship between the likelihood of a variable having a missing value and the value of another variable in the observation. If missing data are MAR, then discarding observations with missing values can alter the observed patterns in the remaining data. However, it may be possible to estimate an observation's missing value of the variable based on the values of the other variables in the observation. MNAR: If the variable with MNAR values is highly correlated with another variable that is known for a majority of observations, the loss of information may be minimal.
time series chart
A chart where a measure of time is represented on the horizontal axis and a variable of interest is shown on the vertical axis. Temporally consecutive data points are generally connected with straight lines. Connecting the consecutive observations with line segments in a time series chart accentuates the temporal nature of the data and the inherent relationship between consecutive time periods.
Know Your Audience
A data visualization or presentation that is effective for one audience may not be effective for another audience due to differences in the audience's interests, their roles within the organization, or their level of comfort with analytical methods and tools. Therefore, our first goal in effective storytelling is to ensure we understand our audience. In particular, we want to determine (1) the needs of our audience from our data visualization or presentation; and (2) the level of analytical comfort in our audience. Audience needs are either (1) high-level understanding or (2) detailed understanding. Audience needs inform which type of story will be most effective and hence suggest that certain types of data visualizations may be more effective than others. For audiences that need a high-level understanding from the data, it is best to use simple charts that clearly communicate the main insight from the data. For audiences that need a more detailed understanding, data visualizations may be more sophisticated so that the audience can understand more details of the analysis.
Choropleth map
A geographic visualization that uses shades of a color, different colors, or symbols to indicate the values of a variable associated with a region. While choropleth maps may provide a good visual display of changes in a variable between geographic areas, they can also be misleading. If the location data are not granular enough so that the value of the displayed variable is relatively uniform over the respective areas over which it is displayed, then the values of the variable within regions and between regions may be misrepresented. The choropleth map may mask substantial variation of the variable within an area of the same color shading. Further, the choropleth map may suggest abrupt changes in the variable between region boundaries while the actual changes across boundaries may be more gradual. Choropleth maps are the most reliable when the variable displayed is relatively constant within the different locations to be colored. If this is not the case and a choropleth map is desired, the likelihood that the map conveys erroneous insights is reduced when (1) variable measures are density based (quantity divided by land area or population) or (2) the colored regions are roughly equal-sized so there are no regions that are visually distracting. As Figure 6.46 demonstrates, choropleth maps are typically better for displaying relative comparisons of magnitude than conveying absolute measures of magnitude. Indeed, the strength of a choropleth map is the identification of the high-level characteristics of a variable with respect to geographic positioning. Another weakness of Figure 6.46 is that it masks the income distributions within each state.
Cartogram
A map-like diagram that uses geographic positioning but purposefully represents map regions in a manner that does not necessarily correspond to land area. A cartogram often leverages the audience's familiarity with the geography of the displayed regions to convey its message. In Figure 6.49, we observe the tiny size of many western states (Alaska, Nevada, Idaho, Nebraska, North Dakota, and South Dakota) and the white space paired with the audience's tacit knowledge of a standard U.S. map conveys the low population density in these areas. A strength of a cartogram is that the area displayed is proportional to the variable being measured, thus avoiding any misleading impressions. A weakness of a cartogram is that the sizing of the regions according to the displayed variable may distort enough to render the relative geographic positioning meaningless and the standard area-based geography unrecognizable.
Measures of central location
A measure of (central) location identifies a single value of a variable that in some manner best characterizes the entire set of values. In this sense, a measure of location is a measure of a variable's center around which other values are distributed. The three measures are the mean, median, and mode. Although the mean is a commonly used measure of central location, its calculation is influenced by outlying values (extremely small and extremely large values). Therefore, the median is often the preferred measure of central location as its calculation is resistant to outlying values. We can generalize, saying that whenever a data set contains extreme values or is severely skewed, the median is the preferred measure of central location; this is particularly true for data sets with relatively few observations. If no value in the data occurs more than once, we say the data have no mode. The mode can be a useful measure of central location for variables that have a relatively small set of distinct values. For variables with many possible values (such as the value of home sales in the CincySales file or the race times in the HalfMarathon file), the frequency that defines the mode will either be small or the mode may not exist. For variables with many possible values, it may be best to construct a histogram and apply the notion of the mode to refer to the bin (range of values) with the most observations. That is, the bin in a histogram with the most observations (the tallest column) may then be referred to as the mode.
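A short sketch of why the median resists outlying values while the mean does not. The prices are made up for illustration, and `mode_or_none` is a hypothetical helper that applies the "no mode" rule stated above.

```python
from statistics import mean, median, multimode

def mode_or_none(values):
    """Return the most frequent value(s), or None if nothing repeats."""
    if len(set(values)) == len(values):
        return None  # no value occurs more than once, so there is no mode
    return multimode(values)

# Hypothetical home prices in thousands of dollars.
prices = [100, 110, 110, 120, 130]
with_outlier = prices + [900]  # add one extremely large value

mean_before, mean_after = mean(prices), mean(with_outlier)
median_before, median_after = median(prices), median(with_outlier)
# The mean moves from 114 to 245; the median only moves from 110 to 115.
```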
standard deviation
A measure of variability that captures how much a set of values deviates from the mean. The standard deviation of a sample of a variable's values can be viewed as the average amount that an observation in the sample deviates from the sample mean. Standard deviation is a reliable measure of variability when the values of a variable resemble the histogram in Figure 5.21, in which values are distributed symmetrically around a single mode. For such bell-shaped distributions, we can use the standard deviation to describe the variability of the distribution using intervals. Specifically, ≈ 68% of data values lie in the interval [mean − st. dev., mean + st. dev.]; ≈ 95% of data values lie in the interval [mean − 2 × st. dev., mean + 2 × st. dev.]; and > 99% of data values lie in the interval [mean − 3 × st. dev., mean + 3 × st. dev.]. However, because its calculation relies on the mean, the standard deviation can also be heavily influenced by extreme values. For skewed distributions, the standard deviation cannot be reliably used to provide an interpretable measure of the variability of a set of values.
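The interval statements can be checked on simulated bell-shaped data. This is only a sketch: the mean of 50, standard deviation of 10, sample size, and seed are arbitrary assumptions.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the simulation is repeatable
values = [random.gauss(50, 10) for _ in range(10_000)]  # bell-shaped sample

m, s = mean(values), stdev(values)

def share_within(k):
    """Fraction of values in [mean - k * st. dev., mean + k * st. dev.]."""
    return sum(m - k * s <= v <= m + k * s for v in values) / len(values)

share_1, share_2, share_3 = share_within(1), share_within(2), share_within(3)
# For bell-shaped data these come out near 0.68, 0.95, and above 0.99.
```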
Moving average
A method of smoothing time series data that uses the average of the most recent m values. A moving average is computed by averaging the last m values observed. That is, at a point in time, future observations and observations from more than m periods in the past are not included in the calculation of the moving average. As Figure 6.38 shows, the moving average smoothing becomes more stable as the number of periods on which it is calculated increases.
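A minimal sketch of the calculation, using a made-up series. The function name `moving_average` is hypothetical; each output value averages only the current period and the m − 1 periods before it.

```python
def moving_average(values, m):
    """Average of the most recent m values at each period that has m values."""
    return [
        sum(values[i - m + 1 : i + 1]) / m  # window ends at period i
        for i in range(m - 1, len(values))
    ]

sales = [10, 20, 30, 40, 50]  # hypothetical time series
smoothed = moving_average(sales, m=3)
```

The first m − 1 periods have no moving average because fewer than m past values exist for them.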
Seasonality
A pattern in time series data in which the values demonstrate predictable changes at regular time intervals. Seasonality may be difficult to clearly identify in charts that display all of the data linearly from oldest to most recent. Instead, the presence of seasonality recurring at a time interval is often best examined by plotting the data using multiple lines that correspond to a specific time interval.
Correlation
A standardized measure of linear association between two variables that takes on values between −1 and +1. Values near −1 indicate a strong negative linear relationship, values near +1 indicate a strong positive linear relationship, and values near zero indicate the lack of a linear relationship. The closer a correlation value is to +1, the more closely the data points on the scatter chart of the two variables resemble a straight line that trends upward to the right (positive slope). The closer a correlation value is to −1, the more closely the data points on the scatter chart of the two variables resemble a straight line that trends downward to the right (negative slope). While the sign (positive or negative) of the correlation is depicted by the slope (positive or negative) of the linear trendline, the strength of the correlation between two variables is not related to the steepness of the slope of the linear trendline. The slope reflects the unit change in the variable on the vertical axis given a unit change in the variable on the horizontal axis. Thus, while the slope is affected by the units in which the variables are expressed, the correlation is not.
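The last point can be illustrated with a small sketch: re-expressing one variable in different units changes the trendline slope but leaves the correlation unchanged. The height and weight values are invented, and the slope is assumed to be the ordinary least-squares trendline slope.

```python
from math import sqrt
from statistics import mean

def correlation(x, y):
    """Correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def slope(x, y):
    """Slope of the least-squares linear trendline of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sum((a - mx) ** 2 for a in x)

# Hypothetical data: the same heights in meters and in centimeters.
height_m = [1.5, 1.6, 1.7, 1.8, 1.9]
height_cm = [h * 100 for h in height_m]
weight_kg = [52, 58, 63, 72, 80]

r_m, r_cm = correlation(height_m, weight_kg), correlation(height_cm, weight_kg)
b_m, b_cm = slope(height_m, weight_kg), slope(height_cm, weight_kg)
# r_m and r_cm agree; b_m (kg per meter) is 100 times b_cm (kg per cm).
```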
frequency distribution
A summary of data that shows the number (frequency) of observations in each of several nonoverlapping bins (classes). A frequency distribution can be created for both a categorical variable and a quantitative variable. When we collect data, we are gathering past observed values, or realizations, of a random variable. The role of descriptive analytics is to analyze and visualize data to gain a better understanding of variation and its impact.
Table lens
A tabular-like visualization in which each column corresponds to a variable and the magnitudes of a variable's values are represented by horizontal bars. A table lens can be a useful visualization tool for wide and tall data sets as the insight on the relationships between variables remains evident even if the display is "zoomed out" to show the table in its entirety or near-entirety. We interpret a table lens by comparing the large values to small values pattern in the column we sorted (in this case Median Monthly Rent) to the patterns in the other columns. Because Percentage College Graduates also displays a large values to small values pattern, we can deduce that this variable has a positive association with Median Monthly Rent. Conversely, Poverty Rate displays a small values to large values pattern, so we can deduce that this variable has a negative association with Median Monthly Rent. The Commute Time column displays no pattern, so we can deduce that this variable has no relationship with Median Monthly Rent. By sorting the table on the values of a different variable, the relationships between different pairs of variables can be analyzed.
percentile
A value such that approximately p% of the observations have values less than the pth percentile; hence, approximately (100 − p)% of the observations have values greater than the pth percentile. The 25th, 50th, and 75th percentiles are the first, second, and third quartiles. Notice that the 50th percentile and the median have the same value. That is, 50% of the observations have values less than the median, which matches the definition of the median. The use of percentiles and the interquartile range to measure variability has advantages over the range and standard deviation. First, extreme values do not distort the value of percentiles. Second, percentiles do not require a variable's distribution to be bell-shaped to accurately convey its variability.
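A brief sketch of the first advantage, using Python's `statistics.quantiles`. The data values are made up, and different software packages use slightly different percentile conventions, so exact quartile values may differ from a textbook or spreadsheet calculation.

```python
from statistics import median, quantiles

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _, q3 = quantiles(values, n=4)  # 25th, 50th, and 75th percentiles
    return q3 - q1

def value_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)

data = [12, 15, 17, 21, 24, 28, 33, 41, 95]      # hypothetical values
extreme = [12, 15, 17, 21, 24, 28, 33, 41, 950]  # largest value made extreme

q2 = quantiles(data, n=4)[1]  # the 50th percentile
# q2 equals the median; the IQR is the same for both lists, the range is not.
```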
trellis display
A vertical or horizontal arrangement of individual charts of the same type, size, scale, and formatting that differ only by the data they display. When comparing many distributions (three or more), frequency polygons plotted on the same chart can become cluttered, and a trellis display is a useful alternative. Figure 5.15 contains a vertical trellis display of the length of stay distributions of three hospitals using frequency polygons. This trellis display facilitates distribution shape comparisons, but it is not as useful for magnitude comparisons.
Spurious relationship
An apparent association between two variables that is not causal, but is coincidental or caused by a third (lurking) variable; that is, a relationship in which there is no cause-and-effect between the two variables. A spurious relationship between two variables can arise when (1) both variables are affected by a third variable, called a lurking variable, (2) the data are biased and not a representative sample, or (3) the data are insufficient to distinguish it from random coincidence.
confidence interval
An estimate of a population parameter that provides an interval of the form point estimate plus or minus margin of error, believed to contain the value of the parameter with a specified confidence. For example, if an interval estimation procedure provides intervals such that 95% of the intervals formed using the procedure will include the population parameter, the interval estimate is said to be constructed at the 95% confidence level. In short: sample proportion or sample mean plus or minus the margin of error. Because the sample mean cannot be expected to provide the exact value of the population mean, a confidence interval is computed by adding and subtracting a value, called the margin of error, to the sample mean: sample mean ± margin of error. The purpose of a confidence interval is to provide information about how close the sample mean may be to the value of the population mean. While the derivation of the formula for the margin of error for a confidence interval on a mean is beyond the scope of this book, we note that it is dependent on three factors: (1) the sample size, (2) how variable the sample values are (as measured by the sample standard deviation), and (3) the confidence level. As the sample size increases, the margin of error decreases. As the sample standard deviation increases, the margin of error increases. As the required confidence level increases, the margin of error increases. If we must state an interval with more confidence, then we must be more conservative with that interval and increase its width.
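The notes do not give the margin of error formula, so the sketch below swaps in a common large-sample approximation, margin = z × (sample standard deviation) / √n, where z comes from the standard normal distribution. That formula and the numbers used are assumptions for illustration; the point is only to show how the three factors move the margin.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(sample_sd, n, confidence=0.95):
    """Large-sample approximation (an assumption): z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return z * sample_sd / sqrt(n)

def confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Sample mean plus or minus the margin of error."""
    moe = margin_of_error(sample_sd, n, confidence)
    return sample_mean - moe, sample_mean + moe

base = margin_of_error(sample_sd=10, n=100)             # baseline
larger_sample = margin_of_error(sample_sd=10, n=400)    # smaller margin
more_variable = margin_of_error(sample_sd=20, n=100)    # larger margin
more_confident = margin_of_error(10, 100, confidence=0.99)  # larger margin
```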
prediction interval
An interval estimate of the prediction of a future value of the dependent variable such that there is a specified confidence that this interval will contain the future value of the dependent variable. The uncertainty in a model's predicted values of future observations can be expressed using a prediction interval. The 95% prediction interval corresponds to the range that we are 95% confident will contain the value of the dependent variable in a future observation with a specified value of the independent variable. The lines marking the limits of the prediction interval are not straight; instead, they are slightly curved to depict that the width of the prediction interval is the narrowest near the mean value of number of requests. That is, the width of the prediction interval depends on the value of the independent variable for the observation being predicted. To maintain a level of 95% confidence, the time series model must quote a wider prediction interval as it makes predictions further into the future. That is, the width of the prediction interval for a time series model depends on how far into the future the prediction is. This is reflected by the growing distance between the lower and upper limits of the 95% prediction interval in Figure 5.35 for predictions further in the future.
Benford's Law
An observation that the leading digit in many naturally occurring data sets approximately obeys a known frequency distribution. In particular, the leading digit is likely to be small with 1 being the most likely leading digit and 9 the least likely.
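The known frequency distribution is commonly written as P(d) = log10(1 + 1/d) for leading digit d. As a minimal sketch, the code below compares that expected distribution with the observed leading digits of a data set; the powers of 2 are used only as a convenient illustrative data set, and the helper name is hypothetical.

```python
from collections import Counter
from math import log10

# Expected share of each leading digit 1..9 under Benford's Law.
expected = {d: log10(1 + 1 / d) for d in range(1, 10)}

def first_digit_shares(values):
    """Observed share of each leading digit among the nonzero values."""
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

# Illustrative data: the first 1,000 powers of 2.
observed = first_digit_shares([2 ** k for k in range(1000)])

# Largest gap between observed and expected shares across the nine digits.
largest_gap = max(abs(observed[d] - expected[d]) for d in range(1, 10))
```

A large gap for real data would be a reason to inspect the data for errors, not proof of fraud.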
outlier
An unusually small or unusually large data value. It is a good idea to inspect the records corresponding to these outliers to confirm that these are accurately reported and not the result of an error. If the value of an outlier is the result of an error or if the observation occurred in a circumstance that makes it inappropriate for an analytical study, the observation may be removed from consideration. However, removing outlier observations without warrant can distort analysis by artificially reducing the variation in a variable.
Number of Bins
Bins are formed by specifying the ranges used to group the data. As a general guideline, we recommend using from 5 to 20 bins. Using too many bins results in a histogram in which many bins contain only a few observations. With too many bins, the histogram does not capture generalizable patterns in the distribution and instead may appear jagged and "noisy." Using too few bins results in a histogram that aggregates observations with too wide a range of values into the same bins. With too few bins, the histogram fails to accurately capture the variation in the data and presents only blurred high-level patterns. For a small number of observations, as few as five or six bins may be used to summarize the data. For a larger number of observations, more bins are usually required. The determination of the number of bins is an inherently subjective decision, and the notion of a "best" number of bins depends on the subject matter and goal of the analysis. Because the number of observations in the Death file is relatively large, we should choose a larger number of bins. We will use 16 bins to match Figure 5.9.
categorical variable
Data for which categories of like items are identified by labels or names. Arithmetic operations cannot be performed on categorical variables.
Quantitative variable
Data for which numerical values are used to indicate magnitude, such as how many or how much. Arithmetic operations such as addition, subtraction, and multiplication can be performed on a quantitative variable.
Unstacked data
Data organized such that the values for a categorical variable correspond to labels for separate columns and the columns contain observations corresponding to these respective category values. For example, the unstacked version of the data facilitates the construction of team-specific line charts. To quickly construct several line charts to explore interesting patterns in time series data, we construct sparklines. Because different arrangements of data can facilitate different visualizations, it is useful to be able to transform stacked data to unstacked data and vice versa.
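A minimal sketch of the two arrangements and the conversion between them, using plain dictionaries. The team names, years, and win counts are invented, and `unstack` and `stack` are hypothetical helper names rather than functions from any particular tool.

```python
def unstack(rows, index, column, value):
    """Turn stacked rows into {index value: {category: value}}."""
    table = {}
    for r in rows:
        table.setdefault(r[index], {})[r[column]] = r[value]
    return table

def stack(table, index, column, value):
    """Turn an unstacked table back into one row per (index, category)."""
    return [
        {index: i, column: c, value: v}
        for i, cols in table.items()
        for c, v in cols.items()
    ]

# Stacked: one row per (year, team) pair; team is the categorical variable.
stacked = [
    {"year": 2021, "team": "A", "wins": 10},
    {"year": 2021, "team": "B", "wins": 7},
    {"year": 2022, "team": "A", "wins": 12},
    {"year": 2022, "team": "B", "wins": 9},
]

# Unstacked: each team becomes its own column of wins, keyed by year.
unstacked = unstack(stacked, index="year", column="team", value="wins")
```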
range spanned by bins
Once we have set the number of bins and the bin width, the remaining decision is to set the value at which the first bin begins. We must ensure that the set of bins spans the range of the data so that each observation belongs to exactly one bin. In the example, the smallest data value is 0 and the largest data value is 109, so the range of the data is 109 while the range of the bins is 112. We note that the choice of the number of bins (and the corresponding bin width) may change the shape of the histogram (particularly for small data sets). Therefore, it is common to determine the number of bins and the appropriate bin width by trial and error. Once a possible number of bins is chosen, Equation (5.2) is used to find the approximate bin width. The process can be repeated for several different numbers of bins.
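Equation (5.2) is not reproduced in these notes; the sketch below assumes it is the usual rule, approximate bin width = (largest value − smallest value) / number of bins, rounded up to a convenient width. Starting the first bin at the smallest value is also an assumption. With 16 bins and data from 0 to 109, this gives a width of 7 and bins spanning 112, which agrees with the figures quoted above.

```python
import math

def approximate_bin_width(smallest, largest, n_bins):
    """Assumed form of the bin width rule: data range / number of bins."""
    return (largest - smallest) / n_bins

smallest, largest, n_bins = 0, 109, 16

raw_width = approximate_bin_width(smallest, largest, n_bins)  # 6.8125
width = math.ceil(raw_width)  # round up so the bins cover all of the data

# Bin edges, assuming the first bin begins at the smallest data value.
edges = [smallest + i * width for i in range(n_bins + 1)]
span = edges[-1] - edges[0]  # range spanned by the set of bins
```

Repeating the calculation for several values of `n_bins` mirrors the trial-and-error process described above.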
Occlusion
The inability to distinguish some individual data points because they are hidden behind others with the same or nearly the same value.
statistical inference
The process of making estimates and drawing conclusions about one or more characteristics of a population (the value of one or more parameters) through analysis of sample data drawn from the population. Failure to appropriately convey the inherent uncertainty in these estimates may lead the audience to develop a false sense of confidence in these point estimates. We want to construct visualizations that help the audience comprehend these sample-based estimates as intervals rather than points.
Jittering
The process of slightly altering the actual values of one or more variables in the observations of a data set so that identical observations occupy slightly different positions when plotted. Occlusion in strip charts can be mitigated by (1) plotting hollow dots rather than filled dots and (2) jittering the observation. Jittering an observation involves slightly adjusting the value of one or more of the variables comprising the observation. In the example, we jitter by adding a small random number between zero and one to the height at which male and female half-marathon times are plotted. Compared to Figure 5.17, the hollowing and jittering of the plotted values in Figure 5.18 results in a strip chart that more clearly displays the density of similar half-marathon times. Recall the vertical axis has no meaning, so adding a small random value between zero and one to the y-series values does not alter the interpretation of the chart at all, but it allows the audience to visually discern between similar half-marathon times. If necessary, we could have also jittered the x-series values by adding and subtracting relatively small values to the half-marathon times without qualitatively changing the insight derived from the chart, but that was not necessary in this case.
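A minimal sketch of vertical jittering for a strip chart. The race times and the seed are made up; only the plotting height (y) gets a random value between zero and one, so the times themselves (x) are unchanged.

```python
import random

def jitter_vertically(times, rng):
    """Pair each time with a random plotting height in [0, 1)."""
    return [(t, rng.random()) for t in times]

# Hypothetical half-marathon times in minutes, with repeated values that
# would sit on top of one another (occlusion) if plotted at the same height.
times = [95.2, 95.2, 95.2, 101.7, 101.7, 120.4]

rng = random.Random(42)  # seeded so the jitter is repeatable
points = jitter_vertically(times, rng)

xs = [x for x, _ in points]  # plotted times, identical to the input
ys = [y for _, y in points]  # jittered heights on the meaningless y-axis
```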
Exploratory data analysis (EDA)
The process of using summary statistics and visualization to gain an understanding of the data, including the identification of patterns. The objectives of EDA include (1) detection of errors, missing values, and any other unusual observations; (2) characterization of the distribution of the values for the individual variables; and (3) identification of patterns and relationships between variables. Visual display is an essential principle of EDA, as it allows the analyst to translate the information contained in the rows and columns of data into charts, providing "first looks at the data" that achieve the EDA objectives.
Temporal frequency
The rate at which time series data is displayed in a chart. How frequently we plot time series data can dramatically affect what we see. In a time series chart, the rate at which we display the data (typically along the horizontal axis) is called the temporal frequency, for example month to month or quarter to quarter.
population
The set of all elements of interest in a particular study.
margin of error
The value added to and subtracted from a point estimate in order to develop a confidence interval for a population parameter.
dependent variable
The variable that is being predicted or explained. It is generally plotted on the vertical axis. Also sometimes referred to as the response variable or target variable.
independent variable
The variable used for predicting values of the dependent variable. It is generally plotted on the horizontal axis.
Measures of Variability
Measures of location do not convey any information regarding the variability in the values; measures of variability do. Examples: range, standard deviation, percentiles.