Data Analysis & Presentation (Ch. 7) - ACC 421

three steps to creating a predictive analytic model

(1). select the target outcome (2). find and prepare the appropriate data (3). create and validate a model

what happens once a predictive model has been created?

- it must be validated - validation tests balance the accuracy of predicting the target outcome correctly with overfitting the data by examining the performance of the model on the test dataset - once the model has been validated, it can be used in business decision modeling

choosing the right visualization: other visualization purposes

- *spatial data* such as maps with data overlays and network diagrams or Sankey diagrams that show the flow of data - visualization types can also be combined to fulfill multiple purposes

data presentation

- the common expression, "*a picture is worth a thousand words*" actually conveys what researchers have found about the human brain being programmed to process visual information better than written information - to be helpful, data needs to be presented using the right visualization, and the visualization needs to be designed correctly - different visualizations are designed to convey different messages

range of the data

- the difference between the lowest and highest values

type II error

- the failure to reject a false null hypothesis - if an alarm does not go off while there is a fire

type I error

- the incorrect rejection of a true null hypothesis - if an alarm goes off while there is no fire

when the target outcome is a numeric value vs categorical value

- when the target outcome is a numeric value, use one of the many forms of regression for prediction - when the target outcome is a categorical value, use *classification analysis* (various techniques that identify characteristics of groups and then try to use those characteristics to classify new observations into one of those groups) -> EX: whether a customer will be a repeat customer
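
A minimal sketch of the two cases, assuming scikit-learn and invented customer data (neither comes from the chapter): a numeric target calls for regression, a categorical target for classification.

```python
# Hedged sketch: scikit-learn and the customer data below are assumptions for illustration.
from sklearn.linear_model import LinearRegression, LogisticRegression

# hypothetical features: [prior purchases, months as a customer]
X = [[1, 3], [4, 12], [2, 6], [7, 24], [5, 18], [0, 1]]

# numeric target ("how much will the customer spend?") -> regression
spend = [40, 160, 75, 300, 210, 10]
reg = LinearRegression().fit(X, spend)
print(reg.predict([[3, 9]]))   # predicted dollar amount

# categorical target ("will the customer be a repeat customer?") -> classification
repeat = [0, 1, 0, 1, 1, 0]    # 1 = repeat customer, 0 = not
clf = LogisticRegression().fit(X, repeat)
print(clf.predict([[3, 9]]))   # predicted class label
```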

what are several benefits of visualizing data relative to reading that researchers have identified?

(1). visualized data is processed faster than written or tabular information (2). visualizations are easier to use. Users need less guidance to find information with visualized data (3). visualization supports the dominant learning style of the population because most learners are visual learners

common problems with data analytics: failing to consider the variation

- failing to consider the variation (the spread of the data about a prediction) inherent in a model

common problems with data analytics: extrapolation beyond the range of data

- a process of estimating a value that is beyond the range of the data used to create the model - it's important to use a model created using data as similar as possible to what is being predicted
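
A small illustration of the risk, assuming numpy and a made-up curved relation: a straight-line model built on x-values 0 through 9 gives a sensible estimate inside that range but can be far off at x = 50.

```python
# Hedged sketch: numpy and the simulated data are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(10)                           # model is built on data from 0 to 9
y = x ** 1.5 + rng.normal(0, 0.5, size=10)  # the true relation keeps curving upward

slope, intercept = np.polyfit(x, y, 1)      # straight-line model
print(slope * 9 + intercept)                # inside the data range: reasonable
print(slope * 50 + intercept)               # extrapolation: likely far from the true value
```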

visualization (viz)

- any visual representation of data, such as a graph, diagram, or animation

central tendency of the data

- refers to determining a value that reflects the center of the data distribution - the most common measures of central tendency are the mean and median - comparing mean and median values can provide insight into the central tendency, as means can be highly influenced by outliers whereas medians are influenced significantly less by outliers
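
A minimal sketch, using Python's statistics module and invented salary figures, of how a single outlier pulls the mean but barely moves the median.

```python
from statistics import mean, median

salaries = [48_000, 52_000, 50_000, 51_000, 49_000]
print(mean(salaries), median(salaries))   # 50000 and 50000 - the two agree

salaries.append(1_000_000)                # add one extreme outlier
print(mean(salaries), median(salaries))   # mean jumps to ~208,333; median is ~50,500
```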

weighting

- refers to the amount of attention an element attracts (*visual weight*) - there are various techniques to increase visual weight, including color, complexity, contrast, density, and size - these techniques can be used in combination to create even greater emphasis than using them separately - as with highlighting, be careful that the use of visual weighting to create emphasis doesn't create an overly complex visualization that reduces simplicity

spread of the data

- refers to the dispersion of data around the central value - the most common measures of spread are the range of the data and the standard deviation of the data - quartiles are often used to talk about the spread
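
A short sketch of the common spread measures, assuming numpy and made-up values.

```python
import numpy as np

data = np.array([12, 15, 17, 18, 21, 22, 25, 30, 31, 45])
print(data.max() - data.min())            # range: highest minus lowest value
print(data.std(ddof=1))                   # sample standard deviation
print(np.percentile(data, [25, 50, 75]))  # quartile cutoffs used to discuss spread
```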

mean

- the average amount - that is, the sum of all values divided by the number of observations in the dataset

types of errors and example

- type I & type II EXAMPLE 1: *NULL*: there's no relation between pay and leaving the company -> *type I*: concluding there is a relation, either positive or negative, between paying employees more and them leaving the company when in fact there is none -> *type II*: finding no relation between pay and leaving the company when the true relation is that paying more does decrease the rate at which employees leave the company

what are the four categories of data analytics and how do they differ?

1. *descriptive*: computations that address the basic question of "what happened?" 2. *diagnostic*: goes beyond examining what happened to try to answer the question, "why did this happen?" 3. *predictive*: goes a step further than diagnostic analytics to answer the question "what is likely to happen in the future?" 4. *prescriptive*: answers the question "what should be done?" - the categories differ in terms of their complexity and the value they add to the organization (this is the order of increasing complexity and value)

choosing the right visualization: correlation

- *comparing how much two numeric variables fluctuate with each other* EXAMPLES: (1). *scatterplot*: a numeric variable is listed on the x-axis, a different numeric variable listed on the y-axis, and the values of each are plotted in the data area (regression line) (2). *heatmap*: looks like a data table, but instead of showing data values it shows colors that relate to the magnitude of the different entries
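
A sketch of both correlation visuals, assuming matplotlib/numpy and invented sales data (not from the chapter).

```python
import numpy as np
import matplotlib.pyplot as plt

ad_spend = np.array([10, 20, 30, 40, 50, 60])
sales    = np.array([15, 22, 35, 41, 48, 62])
units    = np.array([5, 9, 14, 16, 20, 26])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# scatterplot: one numeric variable per axis
ax1.scatter(ad_spend, sales)
ax1.set_xlabel("ad spend")
ax1.set_ylabel("sales")

# heatmap: colors reflect the magnitude of each correlation
corr = np.corrcoef([ad_spend, sales, units])
im = ax2.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax2.set_xticks(range(3))
ax2.set_xticklabels(["ad", "sales", "units"])
ax2.set_yticks(range(3))
ax2.set_yticklabels(["ad", "sales", "units"])
fig.colorbar(im, ax=ax2)
plt.show()
```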

how can you avoid data deception?

(1). show representations of numbers proportional to the reported number (starting the y-axis at zero helps to ensure this) (2). in vizs designed to depict trends, show time progressing from left to right on the x-axis (3). present complete data given the context

what are the steps in the basic process of testing a hypothesis?

(1). state a null and alternative hypothesis (2). select a level of significance for refuting the null hypothesis (3). collect a sample of data and compute the probability value (4). compare the computed probability against the level of significance and determine if the evidence refutes the null hypothesis. *Refuting* the null hypothesis is seen as support of the alternative hypothesis
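
The four steps can be sketched with an independent-samples t-test; scipy and the pay/tenure numbers below are assumptions for illustration.

```python
from scipy import stats

# (1) H0: average tenure is equal for higher-paid and lower-paid employees
# (2) select a level of significance
alpha = 0.05

# (3) collect sample data and compute the probability (p) value
tenure_high_pay = [5.1, 6.3, 4.8, 7.0, 5.9, 6.4]
tenure_low_pay  = [3.2, 4.1, 3.8, 4.5, 3.0, 3.9]
t_stat, p_value = stats.ttest_ind(tenure_high_pay, tenure_low_pay)

# (4) compare the p-value against the level of significance
if p_value < alpha:
    print(f"p = {p_value:.4f}: refute H0; the evidence supports the alternative")
else:
    print(f"p = {p_value:.4f}: fail to refute H0")
```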

choosing the right visualization: comparison

- *comparing data across categories or groups* - requires both numeric and categorical variables - EXAMPLES: (1). *bar chart*: puts the categorical data variable on the x-axis and then plots the numerical value on the other axis (2). *bullet graph*: adds a "bullet" or a small line by each bar that indicates an important benchmark -> *benchmarks* include things like budgeted amounts, goals, expected progress, etc.

simplification: distance

- the technique of distance refers to how far apart related information is presented - removing distance aids in understanding - a side benefit of removing distance is that often you remove other unnecessary information - simplify through the distance between relevant comparison groups

simplification

- refers to making a visualization easy to interpret and understand - a visualization can be simplified by considering three important techniques that will enhance the design of all visualizations: (1). quantity (2). distance (3). orientation - visualizations are more effective when they simplify the presentation of data to clearly and concisely communicate the objective of the visualization

when examining new data (descriptive).... ?

- seek to understand the central tendency of the data, the spread of the data, the distribution of the data, and correlations in the data

categorical data

- take on a limited number of assigned values to represent different groups, while numeric values are continuous EX: a prediction of whether men or women are more likely to purchase a product would be a *categorical value* (i.e., male/female), whereas a prediction of how much a customer is likely to spend would be a *numeric value*

choosing the right visualization: trend evaluation

- *show changes over an ordered variable, most often a measurement of time* - the difference between visualizations showing trends and correlations is that the axis in a trend viz is ordered EXAMPLES: (1). *line chart*: the x-axis is an ordered unit such as days, months, or years (2). *area chart*: same as a line chart except the area between the line(s) and the x-axis are filled in (helps focus on a trend progression over time)

choosing the right visualization: distribution

- *show the spread of numeric data values* - showing distribution can help develop a deeper understanding of data than by just examining simple descriptive statistics like the minimum, maximum, and average EXAMPLES: (1). *histogram*: a single numeric variable is divided into equal-sized bins, and the bin sizes are listed on the x-axis. Then, a bar is used to show the count of each value that falls into the bins. (2). *boxplot*: draws a line at the median value for a numeric variable and then shows another line for the upper and lower quartiles
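
A sketch of both distribution visuals, assuming matplotlib and simulated invoice amounts.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
invoice_amounts = rng.normal(loc=500, scale=120, size=300)  # simulated data

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(invoice_amounts, bins=15)   # equal-sized bins with the count in each bin
ax1.set_title("histogram")
ax2.boxplot(invoice_amounts)         # median line plus upper/lower quartile box
ax2.set_title("boxplot")
plt.show()
```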

choosing the right visualization: part-to-whole

- *show which items make up the parts of a total* EXAMPLES: (1). *pie charts*: most overused and misused visualization type in practice -> most appropriate when showing percentages that sum up to 100% and the data only has a few categories (2). *treemaps*: use nested rectangles to show the amount that each group or category contributes

wording and types of a hypothesis

- a *hypothesis* should be worded as a testable statement, not a question, about a general relationship between two ideas, groups, or concepts -> if we pay employees more, they will be less likely to leave our company - a hypothesis should be thought of as two statements: (1). null hypothesis (2). alternative hypothesis

outlier

- a data point, or a few data points, that lie at an abnormal distance from other values in the data - identifying outliers is important because they can exert undue influence on the computation of many analytics - which may lead to erroneous interpretations of the data

null hypothesis

- a statement of equality, suggesting there is no relationship between concepts or ideas in the hypothesis

alternative hypothesis

- a statement of inequality, suggesting that one concept, idea, or group is related to another concept, idea, or group

standard deviation

- a statistical computation that measures the dispersion of data around the mean

distribution of the data

- a statistical term that refers to how often values in the data occur or repeat - distributions are important for understanding the shape of the data and for determining which statistical tests may be properly applied to analyze data - the validity of each statistical test is dependent on the data meeting the test's assumed distribution -> the most common distribution is the normal distribution, which looks like the famous bell-shaped curve - understanding the distribution of data is also helpful in identifying outliers

machine learning

- an application of AI that allows computer systems to improve and update prediction models on their own - machine learning allows computers to learn and to improve over time without human intervention - machine learning and predictive analytics are closely related, as machine learning algorithms are often used in predictive analytics

prescriptive analytics

- answers the question "what should be done?" - it can provide either recommendations for actions to take OR - programmed actions a system can take based on predictive analytics results - it uses techniques such as artificial intelligence, machine learning, and other statistical methods to generate predictions -> the key to being successful is the development of initial predictive models and then applying appropriate learning algorithms so those models continue to improve their recommendations over time - still an emerging area expected to grow and mature over the coming years

emphasis

- ensuring the most important message is easily identifiable - a high-quality viz should emphasize the data that is most relevant, important, or timely for the decision maker - understanding what to emphasize depends on the objective of the situation

informal diagnostic analysis

- builds on descriptive analytics - it includes using logic and basic tests to try to reveal relationships in the data that explain why something happened - a general rule of thumb is the "5 Why's" principle, which states that it often requires asking "Why?" five times in order to uncover the true reason why something happened

formal diagnostic analysis

- can employ *confirmatory data analysis* techniques - confirmatory data analysis tests a hypothesis and provides statistical measures of the likelihood that the evidence (data) refutes or supports a hypothesis

choosing the right visualization

- choosing the right type of visualization strengthens the ability of the viz to communicate effectively - data can be presented in many forms, including: -> static graphics, tables, videos, static and dynamic models, etc. - five main purposes of visualizations: (1). comparison (2). correlation (3). distribution (4). trend evaluation (5). part-to-whole

ethical data presentation

- refers to avoiding the intentional or unintentional use of deceptive practices that can alter the user's understanding of the data being presented - data deception is "a graphical depiction of information, designed with or without an intent to deceive, that may create a belief about the message and/or its components, which varies from the actual message" - one way to visually distort information is to use visual weight inappropriately - ethical data presentation shows complete, accurate, consistent, timely, and valid data

descriptive analytics

- computations that address the basic question of "what happened?" - it uses *exploratory data analysis* techniques - central tendency - spread of data - distribution of the data - central tendency, spread, distributions, and correlations are often depicted visually because visual representations of these concepts are quick to interpret, easy to understand, and indicate areas that need further exploration - descriptive analytics can also be performed on qualitative data by first transforming the qualitative data into numbers (statistics can't be computed directly on text) - external auditors use many descriptive analytics, including computing profit margins and leverage ratios to examine if business risk changed significantly during a period and to identify possible fraud - corporate accountants compute descriptive analytics to understand how the business is performing

correlation

- correlations in data refer to how closely two items fluctuate together in a dataset - the most common measure of correlation is a correlation coefficient, measured as a value from -1 to +1 - note that correlation is distinct from causation *-1*: two variables are negatively correlated (as one variable goes up, the other goes down by the same relative amount) *+1*: two variables are positively correlated (as one variable goes up, the other goes up by the same relative amount) *0*: no relation in the movement of the variables
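
A minimal sketch of computing a correlation coefficient, assuming numpy and invented training data.

```python
import numpy as np

hours_trained = [2, 4, 6, 8, 10]
error_rate    = [9, 7, 6, 4, 2]

r = np.corrcoef(hours_trained, error_rate)[0, 1]
print(round(r, 3))   # close to -1: as training hours go up, errors go down
```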

simplification: orientation

- data is easier to understand if it is oriented in the correct fashion - one way to improve orientation is to change the direction of an entire chart - bar charts are most often printed with the bars in a vertical format -> however, the same chart can be turned so that the bars are presented in a horizontal fashion, making it much easier to read the labels - orientation also applies to how the data is sorted - when presenting data, the information can be sorted based on the labels, typically alphabetically, or based on the values of the data, typically in ascending or descending order - the choice of sorting order or sorting attribute can simplify finding the correct information and processing what the information means relative to other groups
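
A short sketch of both orientation ideas, assuming matplotlib and made-up expense categories: horizontal bars keep long labels legible, and sorting by value aids comparison.

```python
import matplotlib.pyplot as plt

expenses = {"Research and development": 120,
            "Selling, general and administrative": 310,
            "Cost of goods sold": 540,
            "Depreciation and amortization": 80}

# sort by value instead of leaving a random (insertion) order
items = sorted(expenses.items(), key=lambda kv: kv[1])
labels, values = zip(*items)

plt.barh(labels, values)   # horizontal bars make the long labels easy to read
plt.xlabel("amount")
plt.tight_layout()
plt.show()
```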

predictive analytic model: select the target outcome

- decide what outcome is to be predicted - the target outcome is also called a target variable (either a categorical value or a numeric value), an outcome variable, or a dependent variable - a categorical value answers the question "which one?" - a numeric value answers a question such as "how much?"

designing high-quality visualizations

- follow three important design principles: (1). simplification (2). emphasis (3). ethical presentation - within each of these design principles are different techniques and options to improve the ability of the viz to communicate effectively - the principles and techniques can be applied to the *four main parts of a viz*: the title, axes (including labels, tick marks, and lines), legend, and data area - simplification, emphasis, and ethical presentation can enhance the ability of each of these parts of a viz, individually and in combination, to effectively communicate a message

predictive analysis

- goes a step further than diagnostic analytics to answer the question "what is likely to happen in the future?" - predictive analytics use historical data to find patterns likely to manifest themselves in the future -> the more data, the better the chance of finding patterns - the dramatic increases in computing power and in available historical data allow computers to find relations that humans cannot - successful predictive analytics can be transformative in an organization - to be successful, predictive analytics require that future events are predictable based on past data and that the organization has collected the necessary data for prediction

diagnostic analytics

- goes beyond examining what happened to try to answer the question, "why did this happen?" - within diagnostic analysis, both informal and formal analyses can be conducted - analysts often must ask follow-up questions and perform analyses related to these questions to uncover the true underlying cause - with the null and alternative hypotheses specified, two errors are possible - the analysis of data can take many forms, including t-tests, regressions, analysis of variance (ANOVA), etc. - because it's expensive and time-consuming, many business decisions are made without formal hypothesis testing

what does hypothesis testing reveal?

- hypothesis testing only reveals whether there is a relation (or not) between the two variables measured - it does not indicate the *importance of the relation* - the *effect size*, which is a quantitative measure of the magnitude of the effect, reveals the importance of the relation
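
One common effect-size measure is Cohen's d; the chapter does not name a specific formula, so this is an assumed sketch using numpy and invented group values.

```python
import numpy as np

group_a = np.array([5.1, 6.3, 4.8, 7.0, 5.9, 6.4])
group_b = np.array([3.2, 4.1, 3.8, 4.5, 3.0, 3.9])

# Cohen's d: difference in means divided by the pooled standard deviation
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
d = (group_a.mean() - group_b.mean()) / pooled_sd
print(round(d, 2))   # magnitude of the difference in standard-deviation units
```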

level of significance

- in statistics, the criteria for choosing between the null and alternative hypothesis - the level of significance is the probability of making a type I error - general scientific rules of thumb use probability levels of 0.05, or sometimes a more lenient 0.10 -> can never "prove" a hypothesis is true; only provide evidence favoring one or the other hypothesis

highlighting

- includes using colors, contrasts, call-outs, labeling, fonts, arrows, and any other technique that brings attention to an item - while highlighting can be applied to all areas of the viz, most often highlighting is applied to the data area of the viz - colors are a particularly valuable highlighting tool - depending on the purpose of the visualization, different color schemes can help highlight, and thus emphasize, what is most important - when visualizations don't use different colors, using different shading as a contrast is highly effective - other ways to highlight information are to use labels, arrows, and graphics - when deciding on what to highlight, remember the principle of simplification and don't use so many highlights that the viz becomes cluttered

data overfitting

- occurs when a model fits training data very well but does not predict well when applied to other datasets - if one tests their model on the same data used to train it, there is a danger of *data overfitting* - using separate training and test data sets is a valuable guard against overfitting
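
A quick sketch of the danger, assuming numpy and a simulated roughly linear relation: a high-degree polynomial hugs the training points but usually does worse on held-out test points than a simple line.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 10, 8)
y_train = 2 * x_train + rng.normal(0, 1, size=8)   # roughly linear training data
x_test = np.linspace(0.5, 9.5, 8)
y_test = 2 * x_test + rng.normal(0, 1, size=8)     # separate test data

for degree in (1, 5):
    coefs = np.polyfit(x_train, y_train, degree)
    test_error = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(degree, round(test_error, 2))  # the flexible degree-5 fit typically
                                         # shows the larger test error
```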

common problems with data analytics: data overfitting

- occurs when a model is designed to fit training data very well but does not predict well when applied to other datasets - producing an analysis that corresponds too exactly to a set of data, such that when additional data is used with the model, it doesn't predict future observations reliably. Determining the correct model requires testing and evaluation using a test dataset

what do you do once a level of significance is decided on?

- one must collect and analyze data - the ideal way to collect data is to collect a random sample of data

common problems with data analytics

- the acronym *GIGO* stands for "*garbage in, garbage out*" and refers to the concept that data analysis is of no value if the underlying data is not of high quality 1. data overfitting 2. extrapolation beyond the range of data 3. failing to consider the variation - *variation and extrapolation* go together because as one extrapolates at greater distances from the data values used to create the model to the point of prediction, the variation increases - analyses based on poor data are common when the data architecture is not properly designed, maintained, and documented - often, individuals will misuse data analyses by reporting a single number as a prediction and believing that the outcome will be that number - predicting outcomes from new data outside of the range of data on which the prediction model was built will result in a greater likelihood of prediction error

ordering

- the intentional arranging of visualization items to produce emphasis (*data ordering*) - the two most common ways of ordering data are (1) by using categories on the axes & (2) by the values of the data - these methods are almost always superior to just ordering the data in a random form - data ordering can be combined with other techniques to enhance emphasis

purpose and challenges of performing the steps of hypothesis testing?

- the purpose of performing the steps of hypothesis testing is to help understand why a phenomenon happened - the challenges with this method are that it requires careful design to get proper inference. Even in the era of big data, one often finds that the needed data has not been collected or is not available

median

- the value that separates the higher half of the values from the lower half of the values

predictive analytic model: find and prepare the appropriate data

- this entails the ETL process - predictive analytics perform better when they are developed with a variety of data about many potential causes of the outcome - collecting data that may only be tangentially related to the outcome often can be valuable because data scientists are finding that such data can be predictive of outcomes

quartile points

- three quartile points create the four ranges: (1). starting from the lowest point, the first quartile cutoff is the point between the lowest 25% of the data and the highest 75% of the data (2). the second quartile cutoff is the same as the median (3). the third quartile cutoff is the point between the lowest 75% of the data and the highest 25% of the data
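
A minimal sketch of locating the three cutoffs with Python's statistics module (the data values are invented).

```python
from statistics import quantiles

data = [12, 15, 17, 18, 21, 22, 25, 30, 31, 45, 50, 60]
q1, q2, q3 = quantiles(data, n=4)   # the three cut points between the four ranges
print(q1, q2, q3)                   # q2 equals the median
```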

exploratory data analysis

- used by descriptive analytics - an approach that explores data without formal models or hypotheses - this type of analysis is often used for the following: (1) to find mistakes in the data (2). to understand the structure of data (3). to check assumptions required by more formal statistical modeling techniques (4). to determine the size, direction, and strength of relationships between variables
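
A sketch of a quick exploratory pass, assuming pandas and an invented two-column dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "invoice_amount": [120, 135, 150, 110, 980, 140, 125],  # note one outlier
    "days_to_pay":    [30, 45, 31, 28, 90, 33, 29],
})

print(df.describe())    # count, mean, std, min, quartiles, max for each column
print(df.isna().sum())  # simple check for mistakes such as missing values
print(df.corr())        # size and direction of relationships between variables
```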

test dataset

- used to assess how well the model predicts the target outcome

training dataset

- used to create the model for future prediction

predictive analytic model: create and validate a model

- variable selection is important in the creation of a model - for predictive analytics, models can be generated and tested with all possible combinations of input data - models are evaluated according to their overall fit to the data and their ability to predict future outcomes - to test a model, the data should be split into: -> a training dataset -> a test dataset - splitting the data guards against data overfitting (see the sketch below)
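
A sketch of the split-and-validate step, assuming scikit-learn and simulated data (none of the names come from the chapter).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                                        # input variables
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

# training dataset: used to create the model
# test dataset: used to assess how well it predicts the target outcome
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_train, y_train))   # fit on data the model has already seen
print(model.score(X_test, y_test))     # validation: fit on held-out data
```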

simplification: quantity

- visualizations are most impactful when they follow the Goldilocks principle of containing not too much and not too little, but just the right amount of data - for the areas that contain text, such as the titles, labels, and legend, a poorly designed viz typically contains too little information in the title and too much information in the labels and the legend - titles serve as an important way to orient readers - when examining the quantity of information, examine each of the separate elements of a viz and then consider how the elements work together - reducing the quantity of information displayed in the data section will improve the ability of users to interpret the viz and more clearly communicate a message - avoid wordiness, information overload, and too many formats

