wpc 300 final
T-statistics
Is used when the population standard deviation (SD) is unknown. Good for small samples (n < 30) when the underlying population distribution is normal.
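As an illustration, the one-sample t-statistic can be computed from the sample mean and the sample SD. This is a minimal sketch: the fill amounts are made-up numbers, and `t_statistic` is a helper name chosen for this example.

```python
import math
import statistics

def t_statistic(sample, mu0):
    """One-sample t statistic: (sample mean - mu0) / (s / sqrt(n))."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample SD (n - 1 in the denominator)
    return (sample_mean - mu0) / (s / math.sqrt(n))

# Hypothetical fill amounts in ounces (illustrative data only)
fills = [15.8, 16.1, 15.9, 15.7, 16.0, 15.6, 15.9, 16.2]
t = t_statistic(fills, mu0=16.0)
degrees_of_freedom = len(fills) - 1
print(f"t = {t:.3f} with {degrees_of_freedom} degrees of freedom")
```

The resulting value would then be compared against a t distribution with n - 1 degrees of freedom.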
Correlation
Is a measure of the linear relationship between two variables, X and Y, which does not depend on the units of measurement. Correlation is measured by the correlation coefficient, also known as the Pearson product-moment correlation coefficient. The correlation coefficient is scaled between -1 and 1.
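The claim that correlation does not depend on units can be checked numerically. A minimal sketch, assuming made-up height and weight values; `pearson_r` is a helper written for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: heights in centimeters and weights in kilograms
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 85]

r_cm = pearson_r(heights_cm, weights_kg)
# Converting heights to inches rescales X but leaves r unchanged
heights_in = [h / 2.54 for h in heights_cm]
r_in = pearson_r(heights_in, weights_kg)
print(round(r_cm, 4), round(r_in, 4))
```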
Find an appropriate solution framework
Break down the problem into pieces. Iterative process (agile vs. waterfall). Identify appropriate analytical/modeling techniques.
Routinize the procedure
Documentation. The next similar questions can be solved quickly. Build a system (macros, code, programs).
Secondary Data
Firm's proprietary database. Internet data (crawlers) [Scrapy, Beautiful Soup]. Stock/capital market data [Compustat, CRSP]. Accounting disclosure data [from 10-K, 10-Q].
Mode:
The most frequently occurring value in a data set. Applicable to all levels of data measurement (nominal, ordinal, interval, and ratio).
Tell an interesting and complete story
The problem you address should be meaningful. The solution could be reused for related problems. State assumptions and boundaries.
Predictive Analytics
Predictive analytics is the use of data, statistical algorithms, and machine learning techniques to identify the likelihood of future outcomes based on historical data. The goal is to go beyond knowing what has happened to providing a best assessment of what will happen in the future. Questions: (1) What will happen next? (2) Why will it happen next? Methods: (1) data mining, (2) text mining, (3) forecasting. Outcome: accurate projections of future outcomes and events.
Prescriptive Analytics
Prescriptive analytics answers the question of what to do by providing information on optimal decisions based on the predicted future scenarios. The key to prescriptive analytics is being able to use big data, contextual data, and lots of computing power to produce answers in real time. Questions: (1) What should be done about it? (2) Why should you do it? Methods: (1) optimization, (2) simulation, (3) expert systems. Outcome: best possible business decision and outcome.
Descriptive Analytics
This is a preliminary stage of data processing that creates a summary of historical data to yield useful information and possibly prepare the data for further analysis. Questions: (1) What happened? (2) What is happening? Methods: (1) standard reporting, (2) dashboards, (3) visual analytics. Outcome: well-defined business problems and opportunities.
Measures of central tendency
Yield information about the center, or middle part, of a group of numbers. Measures: mean, median, mode, percentiles, quartiles.
Diagnostic/Explanatory Analytics
This is about looking into the past and determining why a certain thing happened. This type of analytics usually revolves around working on a dashboard. Questions: (1) Why did it happen? (2) How did it happen? Methods: inferential statistics, visual analytics. Outcome: discover/understand causal relationships of an outcome.
Estimation
Using an experiment guarantees that you learn something about what you want to know.
Control
Using an experiment is the only reliable way to measure the response to changing variables.
Negatives of Analytical Decision Making
Delayed action. Lack of flexibility. Frustrations in teams.
Null hypothesis
A statement that generally assumes nothing has changed. Example: Avg. amount of drink = 16 oz.
Observational Study example
A study took a random sample of people and examined their social media habits. Each person was classified as either a light, moderate, or heavy social media user. The researchers looked at which groups tended to be happier
Decoy Effect Bias
According to economic theory, we make decisions based on what will have the most utility to us. Consumers will tend to have a specific change in preference between two options when presented with a third option that is asymmetrically dominated
Experimental Study example
Another study took a group of adults and randomly divided them into two groups. One group was told to drink tea every night for a week, while the other group was told not to drink tea that week. Researchers then compared when each group fell asleep.
Mean Absolute Deviation
Average of the absolute deviations from the mean: $\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$
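A short worked example of the mean absolute deviation, using a made-up data set; `mean_absolute_deviation` is a helper name chosen for this sketch.

```python
def mean_absolute_deviation(values):
    """Average of the absolute deviations from the mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)

data = [2, 4, 6, 8, 10]  # illustrative values; the mean is 6
# Absolute deviations are 4, 2, 0, 2, 4, which sum to 12
mad = mean_absolute_deviation(data)
print(mad)
```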
Confounder
Is an extraneous variable in an observational study that correlates with both the dependent and independent variables. Example: "Regular consumption of organic food will keep you in a good mood." The confounder could be money: you need money to buy organic food, and having money may itself put you in a good mood.
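A small simulation can make the organic-food example concrete. This is a sketch under stated assumptions: the variables are synthetic, "money" drives both of the other variables by construction, and organic food has no direct effect on mood. The names `money`, `organic`, `mood`, and `pearson_r` are illustrative.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000
money = [rng.gauss(0, 1) for _ in range(n)]
# Both variables depend on money plus independent noise; neither affects the other
organic = [m + rng.gauss(0, 1) for m in money]
mood = [m + rng.gauss(0, 1) for m in money]

r_raw = pearson_r(organic, mood)

# Remove the confounder's contribution and correlate what is left
organic_rest = [o - m for o, m in zip(organic, money)]
mood_rest = [d - m for d, m in zip(mood, money)]
r_adjusted = pearson_r(organic_rest, mood_rest)

print(f"raw r = {r_raw:.2f}, after removing money r = {r_adjusted:.2f}")
```

The raw correlation is clearly positive even though the simulation contains no direct link between organic food and mood; once money is accounted for, the correlation is close to zero.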
p-value
If the p-value is lower than α, reject the null hypothesis.
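The decision rule can be sketched with a one-sided z-test on the drink-size example (null: mean = 16 oz, alternative: mean < 16 oz). The sample mean, the population SD, and the sample size below are assumed values for illustration, and `one_sided_p_value` is a hypothetical helper.

```python
import math
from statistics import NormalDist

def one_sided_p_value(sample_mean, mu0, sigma, n):
    """P(Z <= z) under the null, for the alternative 'mean < mu0'."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    return NormalDist().cdf(z)

alpha = 0.05
# Assumed inputs: sample mean 15.9 oz, known SD 0.3 oz, 36 drinks measured
p_value = one_sided_p_value(sample_mean=15.9, mu0=16.0, sigma=0.3, n=36)
reject_null = p_value < alpha
print(f"p-value = {p_value:.4f}, reject null: {reject_null}")
```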
Experimental Study
Another study randomly assigned volunteers to one of two groups. One group was directed to use social media sites as they usually do. The other group was blocked from social media sites. The researchers looked at which group tended to be happier.
Zero risk bias
We love certainty and therefore tend to ignore risk when making decisions. What would you decide if you were offered the following two options? (1) Bet $10 on a lottery with a 50% chance of winning $100. (2) Bet nothing and get $10 for free.
quantitative data
Can be counted, measured, and expressed using numbers
Categorical: Ordinal
Categorical data can be on an ordinal scale. Numbers are used to indicate rank or order. The relative magnitude of numbers is meaningful. Differences between numbers are not comparable. Example: the difference between "strongly agree" and "agree" is not necessarily the same as the difference between "disagree" and "strongly disagree". Another example (rank values): 1 for President, 2 for Vice President, 3 for Plant Manager. Cannot add or subtract.
Block
Is the arranging of experimental units in groups (blocks) that are similar to one another.
Simulated Data
Data based on assumptions and simulation. Used a lot in scheduling, routing, and queuing.
Numerical: Interval
Distances between consecutive integers are equal. The relative magnitude of numbers is meaningful. Differences between numbers are comparable. The location of the origin, "zero", is arbitrary. Data are always numerical. Example: temperature in different rooms of a home. Cannot multiply or divide.
Observational Study example
Effect of drinking tea before bedtime: a study took a random sample of adults and asked them about their bedtime habits. The data showed that people who drank a cup of tea before bedtime were more likely to go to sleep earlier than those who didn't drink tea.
Experimental Study
Establish causality from an observational study in a controlled environment. Design an experiment to study a certain effect by intervention. You plan for the data before you collect it.
Why Experimental Study?
Experiments allow us to set up a direct comparison between the treatments of interest. We can design experiments to minimize any bias in the comparison. We can design experiments so that the error in the comparison is small. We are in control of experiments, and having that control allows us to make stronger inferences about the nature of differences that we see in the experiment. Specifically, we may make inferences about causation.
Data Extraction
Extract data from primary/secondary source.
Bandwagon Effect
Groupthink: adopting a decision based on the number of people who hold a certain belief. The most famous and commonly cited example of groupthink is how the US Navy treated the threat of a Japanese attack on Pearl Harbor in Hawaii. Everyday example: joining a long line to dine at a famous restaurant (think Yelp).
Observational Study
Studies how different parameters in the population behave together, that is, whether or not they move together in the same direction. Draws conclusions on correlations. No outside intervention during the study. You use the data available to you.
Categorical: Nominal
In nominal measurement the numerical values just "name" the attribute uniquely. A player with number 24 is not more of anything than a player with number 23, and is certainly not better than number 23. Numbers are used to classify (male or female) or categorize (color); values can be stored as "word", "text", or "nominal code". Example: employment classification, with 1 for Educator, 2 for Construction Worker, 3 for Manufacturing Worker. Cannot find the mean. Can only compare whether values are equal. Cannot add or subtract.
Sunk-cost Fallacy Bias
Individuals commit the sunk cost fallacy when they continue a behavior or endeavor as a result of previously invested resources. Examples: "I might as well keep eating because I already bought the food." "I might as well continue dating someone bad for me because I've already invested so much in them."
qualitative data
Is descriptive and conceptual and cannot be measured
Mean:
Is the average of a group of numbers. Not applicable for nominal (categorical) or ordinal data. Affected by each value in the data set, including extreme values. Computed by summing all values in the data set and dividing the sum by the number of values in the data set.
Analytics
Learns by analyzing. Uses a step-by-step procedure. Values quantitative information and models. Builds mathematical models and algorithms. Seeks an optimal solution.
Heuristics
Learns by acting. Uses trial and error. Values experience and effort reduction. Relies on common sense. Seeks a satisficing solution. Fast and frugal. May lead to decision biases!
Data Load
Load data into final target database, more specifically an operational data store, data mart or data warehouse
median
Middle value in an ordered array of numbers. Applicable for ordinal, interval (quantitative), and ratio data. Example: median housing price in a state. Not applicable for nominal data (why not?). Unaffected by extremely large and extremely small values (how?).
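One way to see how the median resists extreme values is to add a single outlier to a small data set and compare it with the mean. A minimal sketch using Python's standard `statistics` module; the housing prices (in thousands) are made-up numbers.

```python
import statistics

prices = [200, 210, 210, 220, 230]   # hypothetical prices, in thousands
with_outlier = prices + [2000]       # one extremely large value added

mean_before = statistics.mean(prices)
mean_after = statistics.mean(with_outlier)        # pulled far upward
median_before = statistics.median(prices)
median_after = statistics.median(with_outlier)    # barely moves

# The mode also works on nominal data, where the mean and median do not apply
favorite_color = statistics.mode(["red", "blue", "red", "green"])

print(mean_before, mean_after)
print(median_before, median_after)
print(favorite_color)
```

The median depends only on the ordering of the values, so it needs at least ordinal data; nominal categories have no order to take a middle of.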
Alternative hypothesis
Opposite of null, typically your claim. Avg. amount of drink < 16 Oz.
Tools for A/B Testing
Optimizely, Visual Website Optimizer (VWO), Adobe Target, Google Content Experiments.
Anchoring Bias
Over-reliance on the first piece of information you hear. Most buying decisions are affected by the anchoring effect. What do you think Black Friday sales are driven by? Have you ever wondered why the retail price of a product tends to be $39.99, not $40?
Availability Heuristics
Overestimate the importance of information that is available. Example: After you see a movie about a nuclear disaster, you might become convinced that a nuclear war or accident is highly likely. A person might argue that smoking is not unhealthy as his father who lived 100 years was a chain smoker and smoked 3 packs a day for 70 years!
Online Survey
Polls are completed only by visitors to the site. Those with an interest in the website's mission are the only ones who will participate.
Predictive Modeling Applications
Predict water leakage in a city water pipe network. Predict when a person may become depressed. Predict criminal activity in Los Angeles (LAPD). Predict performance of certain stock portfolios. Forecast sales demand. Predict whether a customer is likely to buy certain products or services.
Prescriptive Data Modeling
Prescribes the best course of action when making complex decisions involving tradeoffs between business goals and constraints, using optimization technology. It basically uses simulation and optimization to ask "What should a business do?" Prescriptive analytics is a combination of data, mathematical models, and various business rules.
responses
The outcomes that we observe after applying a treatment to an experimental unit
Numerical: Ratio
Ratio is very similar to the interval scale, with the difference that it has a true zero point. This scale is commonly used for values that are measured in numbers, such as length, height, weight, or monetary values like cost and revenue. The relative magnitude of numbers is meaningful. Differences between numbers are comparable. The location of the origin, zero, is absolute (natural). Examples: height, weight, and volume; monetary variables such as profit and loss, and revenues.
The endowment effect
is the phenomenon in which most people would demand a considerably higher price for a product that they own than they would be prepared to pay for it (Weber 1993). The endowment effect is a hypothesis that people value a good more once their property right to it has been established.
Non-response
Some individuals are less likely to respond to a survey, e.g., one that asks for their opinion about smoking weed.
Primary Data
Surveys. Interviews (e.g., a marketing firm's telephone interviews). Used a lot in marketing research.
Principles of Problem Framing
Tell, find and routinize
Clustering illusion
Tendency to see patterns in random events. Example: gambler's fallacy.
range
The difference between the largest and the smallest values in a set of data. Simple to compute. Ignores all data points except the two extremes.
Analytics
is the process of developing actionable decisions or recommendations for actions based on insights generated from historical data
Confidence level
is the proportion of samples that will yield a confidence interval that actually contains the population mean.
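This "proportion of samples" reading can be illustrated by simulation: draw many samples from a population with a known mean, build an interval from each, and count how often the interval contains the true mean. A sketch under assumptions: a normal population with known SD, and illustrative parameter values; `coverage` is a hypothetical helper.

```python
import math
import random
from statistics import NormalDist

def coverage(confidence=0.95, mu=50.0, sigma=10.0, n=40, trials=2000, seed=1):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n
        if sample_mean - half_width <= mu <= sample_mean + half_width:
            hits += 1
    return hits / trials

observed = coverage()
print(f"Intervals containing the true mean: {observed:.1%}")
```

With a 95% confidence level, the observed proportion should land close to 0.95.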
Experimental units
The things on which the experiment is done. ex: students
Sample Study
This is mostly done if you want to estimate the parameters of the population. Inferential statistics enables us to determine such parameters. You make sure the sample is representative of the population before analyzing it.
Overconfidence
Too confident about your ability, especially when you are considered an expert in your field. Example: A person who is convinced he is going to get into Harvard and who only applies to Harvard. In this case, the overconfidence of the person could result in him not getting into any schools if Harvard rejects him.
Data Transformation
Transform / clean data into proper format or structure for the purpose of querying & analysis
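The extraction, transformation, and load steps described on these cards can be sketched end to end with the standard library. Everything here is illustrative: the inline CSV text stands in for a real source, the `sales` table and its columns are hypothetical, and an in-memory SQLite database stands in for a data store.

```python
import csv
import io
import sqlite3

# Stand-in for a primary/secondary source (made-up rows)
RAW_CSV = """customer,amount,date
 alice ,10.50,2024-01-03
BOB,,2024-01-04
carol,7.25,2024-01-05
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing amount, normalize names, cast types."""
    clean = []
    for row in rows:
        if not row["amount"].strip():
            continue
        clean.append((row["customer"].strip().lower(), float(row["amount"]), row["date"]))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, date TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(row_count, total)
```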
Data Association
Two variables have a strong statistical relationship with one another if they appear to move together. When two variables appear to be related, you might suspect a cause-and-effect relationship. Sometimes, however, statistical relationships exist even though a change in one variable is not caused by a change in the other.
Typical sampling mistake
Unrepresentative sample. Biased respondents. Low response rate (non-response bias) or small sample size. Biased questions.
Prescriptive Data Modeling Application
Used in producing credit scores, which help financial institutions assess the probability of a customer paying credit bills on time. Asset management in utility companies. Optimized operating conditions to maximize production and minimize risks. Better utilization of capital, personnel, equipment, vehicles, and facilities.
Social Desirability
Example: you want to study what factors lead to academic dishonesty. Who will be participating?
Decision Making Biases
We tend to believe or seek out information that preserves our own opinions or beliefs. This can cause a gap between how we reason and how we should reason, which leads us to make bad decisions. Remember: we make better decisions by thinking critically and being a bit analytical.
Planning of an experiment
You have to decide: (1) what measurement to make (the response); (2) what conditions to study (the treatments); (3) what experimental materials to use (the units).
Process of A/B Testing
You take a webpage or app screen and modify it to create a second version of the same page. The modification should be limited to a single change (for example, moving the "sign in" tab from left to right). Use a script to randomly show half of your visitors the original version of the page (known as the control) and the other half the modified version of the page (the variation).
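The assignment step might look like the sketch below. Note the swap: instead of a fresh random draw on every page view, it uses deterministic hash bucketing so that a returning visitor always sees the same version. The function name `assign_variant`, the experiment label, and the visitor IDs are all hypothetical.

```python
import hashlib

def assign_variant(visitor_id: str, experiment: str = "signin-tab") -> str:
    """Bucket a visitor into 'control' or 'variation' (about a 50/50 split)."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    return "control" if int(digest, 16) % 2 == 0 else "variation"

# Simulate 10,000 hypothetical visitors and count the split
counts = {"control": 0, "variation": 0}
for i in range(10_000):
    counts[assign_variant(f"visitor-{i}")] += 1
print(counts)
```

Hashing on the experiment name plus the visitor ID keeps assignments stable within one experiment while letting different experiments split the same visitors differently.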
Sample
A portion of the whole; a subset of the population. Must be large enough to represent the whole.
Measurement units
actual objects on which the response is measured
Variance and Standard Deviation
Variance = average of the squared deviations from the arithmetic mean. Standard deviation = square root of the variance.
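A quick numeric check with the standard `statistics` module, on a made-up data set. The population versions divide by n, which matches the "average of the squared deviations" wording; the sample versions divide by n - 1.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values; the mean is 5

# Squared deviations from the mean sum to 32
population_variance = statistics.pvariance(data)  # 32 / 8
population_sd = statistics.pstdev(data)           # square root of the variance
sample_variance = statistics.variance(data)       # 32 / 7

print(population_variance, population_sd, round(sample_variance, 3))
```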
The framing effect
is an example of cognitive bias, in which people react to a particular choice in different ways depending on how it is presented; e.g. as a loss or as a gain.
3 principles of describing data
center, spread, shape
factors
Combine to form treatments. The individual settings for each factor are called the levels of the factor.
Measures of Shape
describe the skewness of a set of data
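Skewness can be computed from the second and third central moments. This sketch uses the moment-based (population) definition, which is one of several conventions; the data sets are made up and `skewness` is a helper written for this example.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (0 for a symmetric data set)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]           # long tail to the right
left_skewed = [-v for v in right_skewed]     # mirror image

print(skewness(symmetric), skewness(right_skewed), skewness(left_skewed))
```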
Treatments
different procedures we want to compare
Census
gathering data from the entire population
Covariance
Is a measure of the linear association between two variables, X and Y. Like the variance, different formulas are used for populations and samples. Population covariance: $\sigma_{XY} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_X)(y_i - \mu_Y)$
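The population and sample versions differ only in the divisor (N versus n - 1). A minimal sketch with made-up data; both function names are chosen for this illustration.

```python
def population_covariance(xs, ys):
    """Average product of deviations from the means (divide by N)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def sample_covariance(xs, ys):
    """Same sum of products, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
# Products of deviations are 8, 2, 0, 2, 8, which sum to 20
print(population_covariance(xs, ys), sample_covariance(xs, ys))
```

Unlike correlation, covariance depends on the units of X and Y, so its magnitude is hard to compare across data sets.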
A/B testing
Is a method of comparing two versions of a webpage or app against each other to determine which one performs better. Two or more variants of a page are shown to users at random, and statistical analysis is used to determine which variation performs better for a given conversion goal.
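One common choice for the "statistical analysis" step is a two-proportion z-test on conversion rates; the source does not say which test is used, so treat this as one possible sketch. The visitor and conversion counts are made up, and `two_proportion_z_test` is a hypothetical helper.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (B minus A)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Assumed results: control converts 200 of 5000, variation converts 250 of 5000
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```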
placebo
Is a null treatment that is used when the act of applying a treatment (any treatment) has an effect.
Random sampling
Is a part of the sampling technique in which each sample has an equal probability of being chosen. A sample chosen randomly is meant to be an unbiased representation of the total population. An unbiased random sample is important for drawing conclusions about the population.
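Drawing a simple random sample without replacement is a one-liner with the standard `random` module. The population of 1,000 numbered members and the sample size of 50 are arbitrary choices for this sketch.

```python
import random

population = list(range(1, 1001))   # hypothetical member IDs 1..1000
rng = random.Random(42)             # fixed seed so the draw is repeatable

# Every subset of 50 members is equally likely to be selected
sample = rng.sample(population, k=50)
print(sorted(sample)[:10])
```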
confidence interval
is a range of values (based on the sample mean, the sample size, and either the sample or the population standard deviation) that is likely to contain the population mean
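A sketch of the interval computation, assuming the normal (z) critical value is appropriate; with a small sample and only a sample SD, a t critical value would be used instead. The input numbers are assumed for illustration and `mean_confidence_interval` is a hypothetical helper.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sd, n, confidence=0.95):
    """Interval: sample mean plus or minus z * sd / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Assumed inputs: mean 15.9 oz, SD 0.3 oz, 36 observations
low, high = mean_confidence_interval(sample_mean=15.9, sd=0.3, n=36)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

A larger sample size or a lower confidence level gives a narrower interval.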
spurious correlation
is a relationship between two variables that appear to have interdependence or association with each other but actually do not
Experimental Design
Is the process of planning a study to meet specified objectives. Planning an experiment properly is very important in order to ensure that the right type of data and a sufficient sample size and power are available to answer the research questions of interest as clearly and efficiently as possible.
Blinding
Occurs when the evaluators of a response do not know which treatment was given to which unit.
experimental error
random variation present in all experimental results
Inferential statistics
(Study sample data.) Use some members of the data to infer about the population, and estimate the uncertainty using probability. You don't have access to the entire population, so you randomly select a sample.
Z-statistics
Is good for larger samples (n > 30); the underlying distribution of the population may or may not be normal.
Statistics
Is the science concerned with developing and studying methods for collecting, analyzing, interpreting, and presenting empirical data to assist in making effective decisions.
low kurtosis
Distributions with low kurtosis tend to have a flat top near the mean rather than a sharp peak.
Descriptive statistics
(Study data in its entirety.) Three principles of describing data: center, spread, and shape.
high kurtosis
Distributions with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails.
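The contrast between low and high kurtosis can be checked with the moment-based excess kurtosis, which is 0 for a normal distribution under this convention. Both data sets are made up to exaggerate the two shapes, and `excess_kurtosis` is a helper written for this sketch.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]       # evenly spread, no peak
peaked = [0, 5, 5, 5, 5, 5, 5, 5, 5, 10]     # sharp peak plus two extreme values

print(excess_kurtosis(flat), excess_kurtosis(peaked))
```

The flat data set gives a negative value and the peaked, heavy-tailed one a positive value.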
Population
The whole; a collection of persons, objects, or items under study.
Inferential statistics
use a random sample of data taken from a population to describe and make inferences about the population.
Randomization
use of a known, understood probabilistic mechanism for the assignment of treatments to units
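A sketch of randomized treatment assignment: shuffle the experimental units with a seeded generator, then deal the treatments out in turn so the groups are balanced. The student labels, the tea/no-tea treatments, and the helper name `randomize` are illustrative.

```python
import random

def randomize(units, treatments, seed=None):
    """Randomly assign treatments to units in (near-)equal group sizes."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {unit: treatments[i % len(treatments)] for i, unit in enumerate(shuffled)}

students = [f"student-{i}" for i in range(1, 21)]   # hypothetical units
assignment = randomize(students, ["tea", "no tea"], seed=7)

tea_group = [unit for unit, treatment in assignment.items() if treatment == "tea"]
print(len(tea_group), len(assignment) - len(tea_group))
```

Because the mechanism is a known shuffle rather than the experimenter's judgment, the assignment cannot systematically favor one treatment.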
Efficiency
Using an experiment designed so that you learn the most from it.
Measures of variability
Describe the spread or the dispersion of a set of data. Common measures of variability: range, interquartile range, mean absolute deviation, variance, standard deviation.
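The range and the interquartile range react very differently to a single extreme value. A minimal sketch with made-up data; quartile conventions vary, and this uses the default method of `statistics.quantiles`.

```python
import statistics

def data_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

def interquartile_range(values):
    """Third quartile minus first quartile."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

data = [2, 4, 4, 5, 7, 9, 10, 12, 40]
more_extreme = [2, 4, 4, 5, 7, 9, 10, 12, 400]   # only the maximum changes

print(data_range(data), interquartile_range(data))
print(data_range(more_extreme), interquartile_range(more_extreme))
```

The range jumps with the outlier because it uses only the two extremes, while the interquartile range, which looks at the middle half of the data, stays the same.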