Ch. 2 - Descriptive Analytics I: Nature of Data, Statistical Modeling, and Visualization

What is data visualization? Why is it needed?

"the use of visual representations to explore, make sense of, and communicate data

management

(dashboard) displaying operational data that identify what actions to take to resolve a problem

analysis

(dashboard) summarized dimensional data to analyze the root cause of problems

monitoring

(dashboard) graphical, abstracted data to monitor key performance metrics

How do we know if the model is good enough?

- R^2 (R-square) - p-values - error measures (for prediction problems)
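
For illustration only (not part of the original card): a minimal Python sketch, assuming the statsmodels library and hypothetical data, showing how R^2, p-values, and an error measure can be read off a fitted regression model.

```python
# Illustrative sketch: fit a simple linear model and inspect goodness-of-fit measures.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)                 # hypothetical explanatory variable
y = 3.0 + 2.0 * x + rng.normal(0, 1, 100)   # hypothetical response with noise

X = sm.add_constant(x)                      # add intercept term
results = sm.OLS(y, X).fit()                # ordinary least squares fit

print(results.rsquared)                     # R^2: share of variance explained
print(results.pvalues)                      # p-values for intercept and slope
print(np.mean(np.abs(results.resid)))       # an error measure (mean absolute error)
```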

regression

- a part of "inferential" statistics - the "most widely known and used" analytics technique in statistics - used to characterize relationships between "explanatory" (input) and "response" (output) variables. Used in: - hypothesis testing (explanation) - forecasting (prediction)

linear regression assumptions don't hold

- compromises the validity of the model - What do we do then? Identify the violations of the assumptions and use techniques to mitigate them

information visualization

- descriptive, backward focused - "what happened", "what is happening"

dashboard-type reports

- graphical presentation of several performance indicators in a single page using dials/gauges

metric management reports

- help manage business performance through metrics: SLAs (service-level agreements) for externals and KPIs (key performance indicators) for internals - can be used as part of Six Sigma and/or TQM (total quality management)

balanced scorecard-type reports

- include financial indicators and non-financial indicators (customer, business processes, and learning & growth)

categorical data

- nominal data - ordinal data

predictive analytics

- predictive, future focused - "what will happen", "why will it happen"

numerical data

- ratio data - interval data

How do we develop linear regression models?

- scatter plots (visualization - for simple regression) - ordinary least squares method (a line that minimizes the sum of squared errors)
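
A minimal sketch (with assumed example data) of the ordinary least squares method: computing the line that minimizes the sum of squared errors directly from the closed-form formulas.

```python
# Simple linear regression by ordinary least squares on hypothetical data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # explanatory (input) variable
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])  # response (output) variable

# Closed-form OLS estimates: slope = cov(x, y) / var(x); intercept = y_bar - slope * x_bar
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

y_hat = intercept + slope * x
sse = np.sum((y - y_hat) ** 2)            # the sum of squared errors being minimized
print(slope, intercept, sse)
```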

What are the main data preprocessing steps?

Data consolidation Data cleaning Data transformation Data reduction
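
A hypothetical pandas sketch (column names and data are invented for illustration) walking through the four preprocessing steps listed above.

```python
# Hypothetical sketch of the four data preprocessing steps using pandas.
import pandas as pd

# Data consolidation: combine data from multiple (hypothetical) sources
orders = pd.DataFrame({"cust_id": [1, 2, 2], "amount": [100.0, None, 250.0]})
customers = pd.DataFrame({"cust_id": [1, 2], "region": ["east", "WEST"]})
df = orders.merge(customers, on="cust_id", how="left")

# Data cleaning: handle missing values and inconsistent coding
df["amount"] = df["amount"].fillna(df["amount"].median())
df["region"] = df["region"].str.lower()

# Data transformation: normalize a numeric variable to the [0, 1] range
df["amount_scaled"] = (df["amount"] - df["amount"].min()) / (df["amount"].max() - df["amount"].min())

# Data reduction: keep only the variables relevant to the analysis
df = df[["cust_id", "region", "amount_scaled"]]
print(df)
```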

analytics study

Identifying, accessing, obtaining, and processing of relevant data are the most essential tasks in ____________ ____________.

What is an information dashboard? Why are they so popular?

Information dashboards are common components of most, if not all, BI or business analytics platforms, business performance management systems, and performance measurement software suites. Dashboards provide visual displays of important information that is consolidated and arranged on a single screen so that information can be digested at a single glance and easily drilled in and further explored. A typical executive dashboard, such as the one shown in Figure 2.27 of the text, displays a variety of KPIs.

yes

Is time series forecasting different from simple linear regression?

What are the main differences among line, bar and pie charts? When should you use one over the others?

Line - time series data. Bar - numerical data that splits nicely into categories so you can see comparative results. Pie - proportions, best with about four or fewer categories; otherwise use a bar chart.
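
An illustrative matplotlib sketch (data values are hypothetical) showing the three chart types side by side and the kind of data each suits.

```python
# Illustrative sketch: line for time series, bar for category comparison, pie for proportions.
import matplotlib.pyplot as plt

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))

# Line chart: a time series (e.g., monthly values)
ax1.plot(range(1, 13), [5, 6, 7, 9, 8, 10, 12, 11, 13, 14, 13, 15])
ax1.set_title("Line: time series")

# Bar chart: numerical data split into categories for comparison
ax2.bar(["North", "South", "East", "West"], [120, 95, 140, 80])
ax2.set_title("Bar: category comparison")

# Pie chart: proportions of a small number of categories
ax3.pie([45, 30, 15, 10], labels=["A", "B", "C", "D"], autopct="%1.0f%%")
ax3.set_title("Pie: proportions")

plt.tight_layout()
plt.show()
```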

What is logistic regression? How does it differ from linear regression?

Logistic regression's output (y value) is a class; it is categorical, such as yes/no or red/green/blue. That is, whereas linear regression is used to estimate a continuous numerical variable, logistic regression is used to classify a categorical variable.
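
A short sketch (assuming scikit-learn and invented data) contrasting the two: linear regression predicts a number, logistic regression predicts a class and its probability.

```python
# Sketch: linear vs. logistic regression on hypothetical data.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))               # single explanatory variable

y_numeric = 2.5 * X[:, 0] + rng.normal(0, 1, 100)   # continuous response
y_class = (X[:, 0] > 5).astype(int)                 # categorical response (0 = no, 1 = yes)

lin = LinearRegression().fit(X, y_numeric)
log = LogisticRegression().fit(X, y_class)

print(lin.predict([[7.0]]))        # estimated numeric value
print(log.predict([[7.0]]))        # predicted class label (0 or 1)
print(log.predict_proba([[7.0]]))  # class membership probabilities
```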

List and briefly define the central tendency measures of descriptive statistics.

Mean (arithmetic average), median (middle value of the sorted data), and mode (most frequently occurring value)
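
A minimal sketch using Python's standard statistics module (example values assumed) to compute the three central tendency measures.

```python
# Central tendency measures on a small hypothetical sample.
import statistics

values = [3, 7, 7, 2, 9, 7, 4, 5]

print(statistics.mean(values))    # mean: arithmetic average
print(statistics.median(values))  # median: middle value of the sorted data
print(statistics.mode(values))    # mode: most frequently occurring value
```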

estimate the future values

Time series models are focused on extrapolating their time-varying behavior to ________ ___ ______ _______.

data visualization techniques and tools

____ _____________ __________ and _____ make the users of business analytics and BI systems better information consumers

linear regression

______ __________ models suffer from highly restrictive assumptions

logistic regression

_________ __________ is a probability-based classification algorithm

readying the data for analytics

_____________ ___ ____ ___ __________ is a tedious, time-demanding, yet crucial task

statistical methods; descriptive and inferential measures

_____________ _____________ are used to prepare data as input to produce both _____________ and ______________ _______________.

statistics

a collection of mathematical techniques to characterize and interpret data

business report

a written document that contains information regarding business matters

NCAA Bowl Game outcomes

analytics process to develop "prediction models (both regression and classification type)" for this

report

any communication artifact prepared to convey specific information in a presentable form

geographic map

are typically used together with other charts and graphs, as opposed to by themselves, and show postal codes, country names, latitude/longitude, etc.

business report source

data from inside and outside the organization (via the use of ETL - extraction, transformation, and loading of the data)

critical to analytics

data quality and data integrity

descriptive statistics

describing the data (as it is)

visualization

differs from traditional charts and graphs in complexity of data sets and use of multiple dimensions and measures

inferential statistics

drawing inferences about the population based on sample data

bubble chart

enhanced variant of a scatter plot because it adds a dimension via the size of the dot

nominal data

ex: - the code values for the variable "marital status", S-single, M-married, D-divorced

ordinal data

ex: the data field for the variable "credit score" can be generally categorized as (1) low, (2) medium, or (3) high

hierarchy chart

helpful when illustrating the hierarchy of employees in a company

Where does the data for business analytics come from?

It comes from modern-day data collection mechanisms that use the Internet and/or sensor/RFID-based computerized networks. These automated data collection systems not only enable us to collect larger volumes of data but also enhance data quality and integrity.

PERT chart

network diagrams; show precedence relationships among the project activities/tasks

dashboards

provide visual displays of important information that is consolidated and arranged on a "single screen" so that information can be digested at a "single glance" and "easily drilled in and further explored"

List and briefly define the dispersion measures of descriptive statistics.

Range, standard deviation, variance, mean absolute deviation, quartiles, and interquartile range
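
A sketch of the dispersion measures computed with numpy on a small hypothetical sample.

```python
# Dispersion measures for a hypothetical sample.
import numpy as np

x = np.array([4.0, 8.0, 6.0, 5.0, 3.0, 9.0, 7.0, 5.0])

data_range = x.max() - x.min()                 # range
std_dev = x.std(ddof=1)                        # sample standard deviation
variance = x.var(ddof=1)                       # sample variance
mad = np.mean(np.abs(x - x.mean()))            # mean absolute deviation
q1, q2, q3 = np.percentile(x, [25, 50, 75])    # quartiles
iqr = q3 - q1                                  # interquartile range

print(data_range, std_dev, variance, mad, (q1, q2, q3), iqr)
```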

data reduction

reduce dimension, reduce volume, and balance data

What is data?

refers to a collection of facts usually obtained as the result of experiments, observations, transactions, or experiences. Data may consist of numbers, letters, words, images, voice recordings, and so on, as measurements of a set of variables (characteristics of the subject or event that we are interested in studying). Data are often viewed as the lowest level of abstraction from which information and then knowledge is derived.

report considerations

- the key to any successful business report is clarity, brevity, completeness, and correctness - the traditional reporting process is a manual process of collecting and aggregating financial and other information - traditional reporting may be flat, slow to develop, and difficult to apply to specific situations - traditional reporting is still used in corporations - the "last mile" is the most challenging stage of the reporting process, in which consolidated figures are cited, formatted, and described to form the final text of the report

types of business reports

1. metric management reports 2. dashboard-type reports 3. balanced scorecard-type reports

monitoring, analysis, management

3 layers of information on a dashboard

What is time series? What are the main forecasting techniques for time series data?

A time series is a sequence of data points of the variable of interest, measured and represented at successive points in time spaced at uniform time intervals. Examples of time series include monthly rain volumes in a geographic area and the daily closing values of stock market indexes. The main forecasting techniques are averaging methods, which include simple average, moving average, and weighted moving average, and exponential smoothing.
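
A sketch (with a hypothetical monthly series) of the averaging-based forecasting methods named above: simple average, moving average, weighted moving average, and simple exponential smoothing.

```python
# Averaging-based forecasts for the next period of a hypothetical series.
import numpy as np

y = np.array([112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0])

simple_average = y.mean()                          # forecast = mean of all past values
moving_average = y[-3:].mean()                     # forecast = mean of the last 3 periods

weights = np.array([0.2, 0.3, 0.5])                # heavier weight on recent periods; weights sum to 1
weighted_moving_average = np.dot(weights, y[-3:])

alpha = 0.3                                        # smoothing constant
level = y[0]
for obs in y[1:]:
    level = alpha * obs + (1 - alpha) * level      # exponential smoothing update
exponential_smoothing_forecast = level

print(simple_average, moving_average, weighted_moving_average, exponential_smoothing_forecast)
```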

What are the two most commonly used shape characteristics to describe a data distribution?

Skewness is a measure of asymmetry (sway) in a distribution of the data that portrays a unimodal structure (only one peak exists in the distribution of the data). Because the normal distribution is a perfectly symmetric unimodal distribution, it does not have skewness; that is, its skewness measure (i.e., the value of the coefficient of skewness) is equal to zero. Kurtosis is another measure used to characterize the shape of a unimodal distribution. As opposed to the sway in shape, kurtosis characterizes the peaked/tall/skinny nature of the distribution. Specifically, kurtosis measures the degree to which a distribution is more or less peaked than a normal distribution. Whereas a positive kurtosis indicates a relatively peaked/tall distribution, a negative kurtosis indicates a relatively flat/short distribution.
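
A sketch (assuming scipy and a hypothetical sample) computing the two shape measures; for normal-like data both should come out near zero.

```python
# Shape measures for a roughly symmetric hypothetical sample.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.0, scale=1.0, size=1000)

print(skew(sample))      # near 0 for a symmetric (normal-like) distribution
print(kurtosis(sample))  # excess kurtosis: near 0 for normal; >0 peaked/tall, <0 flat/short
```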

What are the main differences between descriptive and inferential statistics?

The main difference between descriptive and inferential statistics is the data used in these methods: whereas descriptive statistics is all about describing the sample data on hand, inferential statistics is about drawing inferences or conclusions about the characteristics of the population.

What are the main categories of data?

Structured data are either categorical or numeric (quantitative). Unstructured data include text, multimedia, and code.
• Ordinal data contain codes assigned to objects or events as labels that also represent the rank order among them. For example, the variable credit score can be generally categorized as (1) low, (2) medium, or (3) high. Similar ordered relationships can be seen in variables such as age group (i.e., child, young, middle-aged, elderly) and educational level (i.e., high school, college, graduate school). Some predictive analytic algorithms, such as ordinal multiple logistic regression, take into account this additional rank-order information to build a better classification model.
• Numeric data represent the numeric values of specific variables. Examples of numerically valued variables include age, number of children, total household income (in U.S. dollars), travel distance (in miles), and temperature (in Fahrenheit degrees). Numeric values representing a variable can be integer (taking only whole numbers) or real (taking also fractional numbers). Numeric data may also be called continuous data, implying that the variable contains continuous measures on a specific scale that allows insertion of interim values. Unlike a discrete variable, which represents finite, countable data, a continuous variable represents scalable measurements, and it is possible for the data to contain an infinite number of fractional values.
• Interval data are variables that can be measured on interval scales. A common example of interval scale measurement is temperature on the Celsius scale. In this particular scale, the unit of measurement is 1/100 of the difference between the melting temperature and the boiling temperature of water at atmospheric pressure; that is, there is not an absolute zero value.
• Ratio data include measurement variables commonly found in the physical sciences and engineering. Mass, length, time, plane angle, energy, and electric charge are examples of physical measures that are ratio scales. The scale type takes its name from the fact that measurement is the estimation of the ratio between a magnitude of a continuous quantity and a unit magnitude of the same kind. Informally, the distinguishing feature of a ratio scale is the possession of a nonarbitrary zero value. For example, the Kelvin temperature scale has a nonarbitrary zero point of absolute zero, which is equal to -273.15 degrees Celsius. This zero point is nonarbitrary because the particles that comprise matter at this temperature have zero kinetic energy.
Examples of unstructured data are images, audio, video, and JSON files.
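
A small sketch (assuming pandas; values invented) of representing ordinal data such as credit score so that the rank order (low < medium < high) is preserved.

```python
# Ordered categorical representation of a hypothetical credit-score variable.
import pandas as pd

scores = pd.Categorical(
    ["low", "high", "medium", "low", "high"],
    categories=["low", "medium", "high"],
    ordered=True,
)
s = pd.Series(scores)
print(s.min(), s.max())       # rank order is respected: low ... high
print(s.cat.codes.tolist())   # underlying ordinal codes: 0, 2, 1, 0, 2
```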

ratio data

commonly found in physical science, such as mass, length, time, but in business, the values for the variable "salary" are ratio data

performance dashboards

commonly used in BPM software suites and BI platforms

What is regression, and what statistical purpose does it serve?

Regression has become the statistical technique for characterizing relationships between explanatory (input) variable(s) and response (output) variable(s). Regression can be used for one of two purposes: hypothesis testing (investigating potential relationships between different variables) and prediction/forecasting (estimating values of a response variable based on one or more explanatory variables).

student attrition

represents students who drop out or fail to complete a course of study

interval

scale measurement is "temperature" on the Celsius scale where the unit of measurement is 1/100 of the difference between the melting temperature and the boiling temperature of water in atmospheric pressure; that is, there is not an absolute zero value

analytics

starts with data

visual analytics

the combination of: information visualization + predictive analytics

dashboard design

the fundamental challenge of dashboard design is to display all the required information on a "single screen, clearly and without distraction", in a manner that can be "assimilated quickly"

Identify and comment on the information dimensions captured in the Napoleon march diagram.

the size of the army, direction of movement, geographic locations, outside temperature, etc.

time series forecasting

the use of mathematical modeling "to predict future values" of the variable of interest based on previously observed values

data visualization

the use of visual representations to explore, make sense of, and communicate data. Related to information graphics, scientific visualization, and statistical graphics

prediction results

for NCAA Bowl Game outcomes: 1. classification 2. regression

business report purpose

to improve managerial decisions

line chart

used to show how donations to United Way Giving Fund increased over the past five years

pie chart

used to show relative "proportions" of majors declared by college students in their sophomore year

bar chart

useful in displaying nominal data or numerical data that splits nicely into different categories so you can quickly see comparative results and trends within your data

What are the most common metrics that make for analytics-ready data?

• Data currency/data timeliness
• Data granularity
• Data validity
• Data relevancy

• Data currency/data timeliness means that the data should be up-to-date (or as recent/new as it needs to be) for a given analytics model. It also means that the data is recorded at or near the time of the event or observation so that time-delay-related misrepresentation (incorrectly remembering and encoding) of the data is prevented. Because accurate analytics rely on accurate and timely data, an essential characteristic of analytics-ready data is the timeliness of the creation of and access to data elements.
• Data granularity requires that the variables and data values be defined at the lowest (or as low as required) level of detail for the intended use of the data. If the data is aggregated, it may not contain the level of detail needed for an analytics algorithm to learn how to discern different records/cases from one another. For example, in a medical setting, numerical values for laboratory results should be recorded to the appropriate decimal place as required for the meaningful interpretation of test results and proper use of those values within an analytics algorithm. Similarly, in the collection of demographic data, data elements should be defined at a granular level to determine the differences in outcomes of care among various subpopulations. One thing to remember is that data that is aggregated cannot be disaggregated (without access to the original source), but it can easily be aggregated from its granular representation.
• Data validity is the term used to describe a match/mismatch between the actual and expected data values of a given variable. As part of data definition, the acceptable values or value ranges for each data element must be defined. For example, a valid data definition related to gender would include three values: male, female, and unknown.
• Data relevancy means that the variables in the data set are all relevant to the study being conducted. Relevancy is not a dichotomous measure (whether a variable is relevant or not); rather, it has a spectrum of relevancy from least relevant to most relevant. Based on the analytics algorithms being used, one may choose to include only the most relevant information (i.e., variables) or, if the algorithm is capable enough to sort them out, may choose to include all the relevant ones, regardless of their relevancy level. One thing that analytics studies should avoid is including totally irrelevant data in the model building, as this may contaminate the information for the algorithm, resulting in inaccurate and misleading results.
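
A hypothetical pandas sketch of a data validity check as described above: flagging values that fall outside the defined acceptable set for a variable (the column name and allowed values are invented for illustration).

```python
# Flag records whose values fall outside the defined acceptable set.
import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "unknown", "n/a", "female"]})

valid_values = {"male", "female", "unknown"}       # acceptable values per the data definition
df["gender_is_valid"] = df["gender"].isin(valid_values)

print(df)
print("invalid records:", (~df["gender_is_valid"]).sum())
```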

