BSTATS Final

Probability of Type II Error Insights

α is the probability of Type I Error. Increasing α decreases the probability of a Type II Error: to decrease β, increase α; to decrease α, increase β. The only way to decrease BOTH α and β is to increase n (sample size). (Note, we want both of these values to be small, since they are the probabilities of our two error types.) Small deviations between μ0 and μ1 → β increases. Large deviations between μ0 and μ1 → β decreases. Type I Error = α = probability of incorrectly rejecting the null hypothesis. Power = 1 − β = probability of correctly rejecting the null hypothesis.

Prediction Interval

•Interval of plausible values for y at x=x0
•Tries to capture the range where a single new observation will fall
•Always wider than the confidence interval because there is an extra term under the square root
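For reference, the standard simple-linear-regression interval formulas behind that statement (standard results; the card itself omits them). The "extra term" is the leading 1 under the PI's square root:

```latex
\text{CI for } E(y)\text{ at } x_0:\quad \hat{y}_0 \pm t_{1-\alpha/2,\,n-2}\; s\sqrt{\tfrac{1}{n}+\tfrac{(x_0-\bar{x})^2}{S_{xx}}}
\qquad
\text{PI for } y\text{ at } x_0:\quad \hat{y}_0 \pm t_{1-\alpha/2,\,n-2}\; s\sqrt{1+\tfrac{1}{n}+\tfrac{(x_0-\bar{x})^2}{S_{xx}}}
```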

Contingency Table

•A Contingency Table is a cross-tabulation of frequencies into rows and columns, where the intersection of the rows and columns are cells that display frequencies of the joint events associated with each cell's row and column. (Note, occasionally you will see contingency tables in units of probability rather than raw counts.)

Probability Distribution

•A Probability Distribution is a function or rule that assigns a probability to each value of a random variable •If you are keeping track, we have two functions (or rules) going on ... •Random Variable •Probability Distribution

P-value Summary

A p-value is the probability, if the test statistic really were distributed according to the stated null, of observing a test statistic (as extreme as, or more extreme than) the one actually observed.

Designing Engineering Experiments

ANOVA is often considered a key element of designed experiments. Every experiment involves a sequence of activities:
1. Conjecture - the original hypothesis that motivates the experiment.
2. Experiment - the test performed to investigate the conjecture.
3. Analysis - the statistical analysis of the data from the experiment.
4. Conclusion - what has been learned about the original conjecture from the experiment.
Often the experiment will lead to a revised conjecture, a new experiment, and so forth.

Training Data Metrics

Additional Criteria (Performance Metrics) Include: •Mallows's Cp •Akaike Information Criterion (AIC) •Bayesian Information Criterion (BIC) Each of these criteria is to be minimized. Knowing the formulas for calculating these criteria is not important; just realize that they are good for comparing models side-by-side, and software (not Excel, but R or SAS) will compute them for us. For linear regression, Mallows's Cp and AIC are equivalent.
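A minimal sketch of such a side-by-side comparison in software (statsmodels is an assumption, and the data are made up for illustration):

```python
# Compare two candidate models by AIC/BIC; lower is better.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)              # a noise variable unrelated to y
y = 2 + 3 * x1 + rng.normal(size=n)

m1 = sm.OLS(y, sm.add_constant(np.column_stack([x1]))).fit()
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print("Model 1: AIC=%.1f  BIC=%.1f" % (m1.aic, m1.bic))
print("Model 2: AIC=%.1f  BIC=%.1f" % (m2.aic, m2.bic))
```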

Step 5 in testing of hypothesis

Collect sample observations. Compute the value of the test statistic (TS), setting μ = μ0 (the hypothesized population mean), and ACCEPT H0 if the TS is in the acceptance region, or REJECT H0 if the TS is in the rejection region.

Step 2 in Testing of Hypothesis

Determine the TEST STATISTIC to be used and identify its probability distribution. A TEST STATISTIC is a statistic obtained from the sample that can be used to test the parameter about which statements are made in the hypotheses. This statistic is the estimator of the parameter being tested, or a function of that estimator. We know from earlier developments that the estimator of the population mean, μ, is the sample mean, x̄.

Naming Missing Values

Example:
•Original variables:
 •Y: Target variable
 •X: Input variable (has missing values)
•Imputed variables:
 •Y: Target variable
 •X: Input variable (still has missing values)
 •IMP_X: New input variable with all missing values "fixed"
 •M_X: Flag variable
  •M_X = 0 means the value is original
  •M_X = 1 means the value was imputed

Data Preparation Outliers

Extreme data points that are significantly larger or smaller than remaining data points are called outliers. • Outliers present some issues with modeling: •Can exert significant influence on model parameters •Model may be less accurate •Model may give a different interpretation or understanding than what actually exists

Getting Started with Data Analysis

•Motivation—We have a problem and we think that some sort of data analysis is appropriate
•What is the problem?—it depends on your point of view and your goals!
 •Sometimes easy; sometimes not obvious
•We develop a set of goals for analysis
 •What do we hope to prove or know when done?
 •Who do we aim to persuade and how?

Use method of Least squares to Obtain Regression Equation

•Obtain b0 and b1 (the estimates of β0 and β1) that minimize the sum of the squared error terms → SSE (error sum of squares). Find the b0 and b1 that minimize this quantity.
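Carrying out that minimization (standard least-squares results, not printed on the card) gives:

```latex
\min_{b_0,\,b_1}\ \mathrm{SSE}=\sum_{i=1}^{n}\bigl(y_i-b_0-b_1x_i\bigr)^2
\;\Longrightarrow\;
b_1=\frac{S_{xy}}{S_{xx}}=\frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2},
\qquad b_0=\bar{y}-b_1\bar{x}
```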

Useful Method for categorical Transforms

Flag variables:
•Many times a category needs to be converted into a series of flag variables for analytic purposes.
•Things to consider:
 •If desired, every category may have its own flag variable.
 •If desired, some categories may be ignored (no flag).
 •If desired, several categories can be combined.
Example:
•"Marital status" (Married, Single, Divorced, Widow, Unknown) might be grouped as:
 •Married_YES = Marital_Status in ("Married");
 •Married_NO = Marital_Status in ("Single");
 •Married_OTHER = Marital_Status in ("Divorced","Widow");
•In the above example, Married and Single were kept as independent categories; Divorced and Widow were combined; and Unknown was ignored (see the sketch below).
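A minimal pandas sketch of that grouping (pandas and the literal values are assumptions for illustration; the course itself works in Excel):

```python
import pandas as pd

df = pd.DataFrame({"Marital_Status":
                   ["Married", "Single", "Divorced", "Widow", "Unknown"]})

# One flag per desired group; "Unknown" gets 0 in every flag, i.e., ignored.
df["Married_YES"]   = df["Marital_Status"].isin(["Married"]).astype(int)
df["Married_NO"]    = df["Marital_Status"].isin(["Single"]).astype(int)
df["Married_OTHER"] = df["Marital_Status"].isin(["Divorced", "Widow"]).astype(int)
print(df)
```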

Data Preparation

For data preparation, we will review how to handle the following: •Missing values •Outliers •Transforming variables • Note: "Data preparation" is also referred to as "data cleaning," "data wrangling," and so on. This course uses the phrase "data preparation" since the process is typically a precursor to building a predictive model

Model Fit

For multiple linear regression, we can use RSE and R2, just as we did for simple linear regression. For the Advertising data you will find the R2 with all three advertising media in the model to be 0.8972; however, if you only included TV and radio, then you would get an R2 of 0.89719. That is, there is only a small increase in R2 when we include a variable in the model that is not significant. Thus, there is an additional metric for multiple linear regression known as Adjusted R2. R2 will always increase as variables are included in the model. Adjusted R2 is a metric that penalizes the addition of a variable that does not contribute to model fit. So, in theory, maximizing Adjusted R2 will allow you to include the correct variables without adding noise variables.
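The usual formula (standard result; n = number of observations, p = number of predictors):

```latex
R^2_{\text{adj}} \;=\; 1-\bigl(1-R^2\bigr)\,\frac{n-1}{n-p-1}
```

Unlike R2, this quantity can decrease when a noise variable is added, because the penalty term grows with p.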

Uses of Regression

Generally, regression is used to:
1. Describe the form of the relationships involved
2. Provide an equation (or rule) for predicting the response variable from the predictor variables
3. Identify which predictor variables are related to the response variable
4. Assess the strength of the relationship between each of the predictor variables individually and the response variable
5. Assess the accuracy of the combined effect of these predictor variables on the response variable
6. Identify interaction effects among the predictor variables on the response variable

Percentiles and Quartiles

•A percentile provides information about how the data are spread over the interval from the smallest to the largest value •The pth percentile of a dataset is the value below which p% of the observations fall •This measure is often used to establish a "threshold of acceptance" •In Excel, "=PERCENTILE.INC(Range of Data, Desired p-level)" •Earlier versions, "=PERCENTILE(Range of Data, Desired p-level)" • •A quartile is one of three points that divide an ordered set of data into four equal groups, each representing a fourth of the data •First quartile (Q1, 25th percentile, cut off of lowest 25%) •Second quartile (Q2, 50th percentile, median, etc.) •Third quartile (Q3, 75th percentile, cut off of highest 25%)

Categorical Variables

If a qualitative predictor (also known as a factor) only has two levels, then you need to create a dummy variable that can take on either 0 or 1. For example, if we have a height factor, we would introduce a dummy variable:
x = 1 if person is 72 inches or taller; 0 if person is less than 72 inches
The model would be: Y = β0 + β1X + ε
Where the interpretation would be:
β0 is the outcome if person is less than 72 inches
β0 + β1 is the outcome if person is 72 inches or taller
β1 is the difference in outcome between the two groups
If the categorical variable has more than two levels, then you will need multiple dummy variables. For example, for a height variable with 3 values, we would need 2 dummy variables.
x1 = 1 if person is 72 inches or taller; 0 if person is not
x2 = 1 if person is shorter than 66 inches; 0 if not
The model would be: Y = β0 + β1X1 + β2X2 + ε
Where the interpretation would be:
β0 is the outcome for people between 66 and 72 inches
β0 + β1 is the outcome for 72 inches or taller
β0 + β2 is the outcome for shorter than 66 inches

ANOVA: Analysis of Variance

If you are hoping to reject H0 and conclude there is a linear relationship, then: •SSE: want to be small. •SSR: want to be large. •Basically, we want the variability in y to be explained by x. •In other words, we want the variability explained by the regression model (not the error). •Thus we want SSR to be large and SSE to be small.

Transforming Variables

In a variable's initial ("raw") state, it might not be suitable for use in a predictive model. When this occurs, a variable might need to be transformed. • The two most likely situations where numeric data needs to be transformed are: •Missing data •Outliers • The two most likely situations where categorical or "class" data needs to be transformed are: •Missing data •Converting categories to flag variables • Missing data transformation was covered separately.

Multiple Linear Regression

In practice, we often have more than one predictor. For example, for Advertising data we have the amount of money spent advertising on radio, TV, and in newspapers. We denote multiple linear regression as follows: Y = β0 + β1X1 + β2X2+ ... + βpXp + ε We interpret a single βj as the average effect on Y of a one unit increase in Xj, holding all other predictors fixed. For our advertising example, then we can regress sales by fitting the model: sales ≈ β0 + β1(TV) + β2(radio) + β3(newspaper)

Handling Outliers

Increase size of data set or number of variables used:
•Not always possible to get more data
•Not always possible to add more variables (they may not have predictive power, or may be highly correlated with other variables and cannot be used)
•Adding complexity to a model adds problems: difficult to interpret, difficult to put into production

Remove outlier data prior to building model:
•Effective at eliminating the effects of the outliers
•May have adverse effects on the accuracy of the model; assumes the outlier is wrong (without an investigation)
•In some heavily regulated industries (credit, insurance), doing so might not be legal

Build multiple models, for instance a model for the normal data and a model for the outliers:
•Example: one predictive model for the "super rich" (Bill Gates, Warren Buffett, etc.) and one for income < $1 Million/year
•Benefits: may improve accuracy; different models might uncover different behavior between groups
•Potential problems: there may not be enough data to build a separate model for outliers; some industries are heavily regulated, so it might not be legal; building and testing multiple models might be expensive; no guarantee that multiple models will be better

Use a different modeling technique:
•Decision trees are immune to outliers, but the output from trees may not be granular enough to be useful, and trees may not be legal to use in some industries
•Random forests and gradient boosting are based on decision trees, so they are also immune to outliers; they may not be legal in some industries; their highly granular output makes them desirable, but it is almost impossible to interpret the results (similar to a neural network in its "black box" qualities)

Transform variables:
•Effective for reducing the magnitude of outliers
•Altering a variable may introduce more error than the outlier itself
•Many types of transformations, including: truncating values at a percentile (e.g., above the 99th percentile, impute to the 99th percentile; below the 1st percentile, impute to the 1st percentile), log transforms, standardization, normalization, binning data, and combining several techniques (e.g., "log followed by binning")

The only way to know which way is best is through trial and error!

Predictions

Multiple linear regression yields a prediction equation: ŷ = b0 + b1x1 + b2x2 + ... + bpxp. For our advertising example: sales ≈ 2.939 + 0.046(TV) + 0.189(radio) − 0.001(newspaper). Note, we can also build prediction intervals given particular predictor values (similarly, you can build confidence intervals on the mean value as well).
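A sketch of interval predictions at chosen predictor values (statsmodels, the advertising.csv file, and the new values are all assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("advertising.csv")   # hypothetical file with TV, radio, newspaper, sales
fit = smf.ols("sales ~ TV + radio + newspaper", data=df).fit()

new = pd.DataFrame({"TV": [100.0], "radio": [20.0], "newspaper": [10.0]})
pred = fit.get_prediction(new)
# mean_ci_* columns: confidence interval on the mean response;
# obs_ci_* columns: (wider) prediction interval for a single new observation.
print(pred.summary_frame(alpha=0.05))
```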

Def of Statistic Summary

•A statistic is a summary measure of sample data used to describe a characteristic of a population or to draw inferences about the population. •100 owners of a certain car reported a total of 85 problems in the first 90 days of ownership. The statistic "85" describes the number of problems per 100 cars during the first 90 days of ownership, and might suggest that the entire population of owners of these cars experience an average of 0.85 problems per car.

Guidelines for Setting up H1

One-sided Alternative:
i. If you believe the parameter is larger than the value stated in H0, set up H1 using >.
ii. If you believe the parameter is smaller than the value stated in H0, set up H1 using <.
Two-sided Alternative:
If you believe the value has changed from the stated H0 but are not sure if it is larger or smaller, set up H1 using ≠.
Compound Inequality:
If the claim involves a compound inequality (i.e., ≥, ≤), state H0 with equality and set up H1 in the opposite direction (i.e., <, >).
Example: Mean is "at most 25"
H0: μ = 25 (μ ≤ 25 is implied)
H1: μ > 25

Probability of Type II Error (B)

Recall, the probability of Type II Error was defined as: β = P[accept H0 | H0 is false] = Type II Error.
Consider the following:
H0: μ = μ0
H1: μ > μ0
Acceptance Region: TS ≤ z(1−α) or t(1−α, n−1)
Rejection Region: TS > z(1−α) or t(1−α, n−1)
Assume the TS is Z = (x̄ − μ0)/(σ/√n).
H1 states that μ > μ0. If in fact H1 is true, then the mean μ is actually some specific value (let us call this value μ1, which is given to us), which has to be greater than μ0 (that is, μ1 > μ0).
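A minimal sketch of the resulting β computation for this one-sided z-test (scipy and all numbers are illustrative assumptions); it also demonstrates the earlier card's point that increasing n shrinks β at fixed α:

```python
from scipy.stats import norm
import math

def beta_one_sided(mu0, mu1, sigma, n, alpha):
    """P[accept H0 | mu = mu1] for H1: mu > mu0 with sigma known."""
    z_crit = norm.ppf(1 - alpha)                  # acceptance region: Z <= z_crit
    shift = (mu1 - mu0) / (sigma / math.sqrt(n))  # how far the truth shifts the TS
    return norm.cdf(z_crit - shift)

# Larger n (or larger alpha, or larger mu1 - mu0) -> smaller beta:
for n in (25, 100):
    b = beta_one_sided(mu0=50, mu1=52, sigma=10, n=n, alpha=0.05)
    print(f"n={n}: beta={b:.3f}, power={1 - b:.3f}")
```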

Introduction to Regression

Regression refers to a broad set of methodologies used to predict a response variable from one or more predictor variables. Linear regression means that the relationship between the response variable and the predictor variables is linear. Other forms of regression include (but are not limited to): polynomial, multivariate, logistic, Poisson, time-series, nonlinear, nonparametric, and robust.

Fixed vs Random - Effects Models

•Often the factor of interest has a large number of possible levels.
•The analyst (you) is interested in drawing conclusions about the whole population of factor levels.
•If you randomly select a sample of these levels from the entire population, we call this a random factor.
•Since the levels were chosen randomly, our conclusions will be valid for the entire population of factor levels.
•The number of possible factor levels must be large or "infinite."
•Note: This is different from the four levels of hardwood content.

Confidence Interval for u, sigma known

•Only valid if the parent distribution follows a normal distribution, or by the CLT (i.e., n ≥ 30).
•The objective is to construct an interval (with a lower limit and an upper limit) which contains the population mean (μ) with some specified probability (= 1−α). That is, we want to find numbers L and U such that: P[L < μ < U] = 1−α
•L: the Lower Confidence Limit
•U: the Upper Confidence Limit
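The resulting interval (the standard σ-known result, which the card leaves implicit):

```latex
L,\;U \;=\; \bar{x}\;\pm\; z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}},
\qquad\text{so that}\qquad P[L<\mu<U]=1-\alpha
```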

Note about R squared

•R2 is not a measure of correlation (it is the square of the correlation, but that is not relevant to our class). It is the percent (or proportion) of variability in the response variable explained by the model.
•The key takeaways from the regression output are whether or not something is significant, such as the Significance F value and the p-values (less than 0.01 or 0.05 means significant; higher means not significant), and the values of the coefficients (e.g., slopes).
•Do not confuse "correlation" with "significant."

Useful Functions for Transforming Variables

•Absolute value (i.e., the positive value) [ABS(X)] •Returns the positive value •Useful when a mathematical operation requires positive values •Sign value (i.e., determines if a value is positive, zero, or negative) [SIGN(X)] •Returns a 1, 0, or -1 (for positive, zero, and negative) •Useful to return an absolute value back to its original form •Logarithm (i.e., what exponent will return this value) [LN(X)] •Typically, Logarithm Base 10 or Logarithm Base e •Logarithms require X > 0 •Logarithms are generally used when data sets have outlier values that need to be constrained

Confidence Interval for E(y) at x=x0

•Estimates the mean of y at x=x0
•A 100(1−α)% confidence interval for E(y) at x=x0: ŷ0 ± t(1−α/2, n−2) · s · √(1/n + (x0 − x̄)²/Sxx)
•Again, estimate σ2 using s2

Inferences Made about Parameters

•First, we will look at µ, and then σ2. Remember, the population mean and variance are parameters. •Two types of inferences can be made about parameters: •Estimation •Point Estimation: uses a single value from the sample to estimate the parameter. •Interval Estimation: constructs an interval, based on the sample for the parameter. •Testing of Hypotheses

Step 3 in testing of hypothesis

Specify the PROBABILITY OF TYPE I ERROR. Two types of errors are possible when testing hypotheses, called Type I error and Type II error. In Testing of Hypotheses, the decision maker specifies the Probability of Type I error (α). The most commonly used values are 0.1, 0.05, 0.01, 0.005, and 0.001. The probability of Type I Error is also called the level of significance.
α = P[reject H0 | H0 is true] = Type I Error
β = P[accept H0 | H0 is false] = Type II Error

Step 1 in Testing of Hypothesis

Step 1: Determine the NULL and ALTERNATIVE Hypotheses. A hypothesis is a statement concerning the parameter of a population. There are two types of hypotheses in statistics: the Null Hypothesis (denoted by H0) and the Alternative Hypothesis (denoted by H1). Each of these hypotheses is either TRUE or FALSE, and we must decide whether to "accept" a hypothesis as true or false. If we "accept" H0 as TRUE, then H1 is FALSE, and vice versa. "Accept" is often stated as "fail to reject." Objective: decide, based on sample information, which of the two hypotheses is correct (or true). The problem will be formulated so that one of the claims is initially favored (H0). This initially favored claim will not be rejected (or considered false) in favor of the alternative claim unless sample evidence contradicts it and provides strong support for the alternative assertion.

Assumptions Revisited and Other Issues

The error term, ε, includes four key assumptions: 1.Mean of 0 2.Normally distributed 3.Constant variance 4.Independent (uncorrelated) Some other issues: •Nonlinear relationship between Xs and Y •Collinearity of predictor variables •Outliers (unusual value for Y) •High-leverage points (unusual value for X)

Estimating the Sample Mean

The sample mean, x̄, obtained from a sample of size n:
•Follows a normal distribution if the population is known to follow a normal distribution (any sample size).
•Follows a normal distribution if the sample size is large (i.e., n ≥ 30), by the CLT. True for any population distribution.
•Mean of x̄: μ (the mean of the population)
•Variance of x̄: σ²/n (the variance of the population divided by the sample size)
•Description: x̄ ~ N(μ, σ²/n)
•Transformation: Z = (x̄ − μ)/(σ/√n)

Joint Probability

a probability based on a specific cell (intersection) within the table that satisfies both conditions. Example:What is the probability that a crash involves a male driver under 21?

Conditional Probability

a probability based on a specific cell (intersection) within the table that satisfies both conditions, divided by the total of the "given" column or row (the restriction). Example: What is the probability that a crash involves a male driver, given that the driver is under 21?

Incomplete Block Design

the number of experimental units available in a block is less than the number of treatments.

Confidence Interval for B0 (y intercept)

β0 = y-intercept; meaningful ONLY if the scope of your model includes x = 0. A 100(1−α)% confidence interval for β0: b0 ± t(1−α/2, n−2) · s(b0)

Error Terms

1. Error terms are normally distributed: ε ~ N
2. Error terms have a mean of zero: ε ~ N(0, )
3. Error terms have constancy of variance: ε ~ N(0, σ²)
4. Error terms are not correlated

Five Most Common Ways to Deal With Missing Values

1. Delete records with missing values.
2. Avoid using a variable that has a missing value.
3. Use a business rule to fix a missing value.
4. Fill in the missing value with a central value (e.g., mean, median, mode).
5. Use a decision tree (or similar tool) to build a model to impute the missing value.

Important Features of the Model

1. Yi is the sum of two components:
 •A constant term, (β0 + β1X), and
 •A random term, εi
 →Thus, the Yi's are random variables
2. The εi's are assumed to be normally distributed with a mean of zero: ε ~ N(0, )
3. The εi's are assumed to have a constant variance σ²: σε² = σ², so ε ~ N(0, σ²)
4. The εi's are assumed to be independent (uncorrelated)

Discrete or Continuous

A random variable is either discrete or continuous, depending on the nature of its sample space. •A Random Variable is Discrete if it assumes a countable number of distinct values •Daily demand for books at a bookstore •Sample Space is all non-negative integers •We'll focus on discrete random variables in this lesson (and the next) •A Random Variable is Continuous if it assumes an uncountable number of distinct values •Length of calls in minutes at a call center •Sample Space is all non-negative real numbers •We'll focus on continuous random variables in later lessons •Note that "countable" doesn't mean "finite". •A set is countable if its items can be placed in a one-to-one relationship with the positive integers (which we know are infinite).

Two Types of Statistics

Descriptive Statistics helps us make statements like: - "The sample mean resistance is 981.5 ohms." Using Inferential Statistics, we can make statements like: - "The population mean resistance lies between 971.5 and 991.5 ohms with probability 0.95." (process control) - "The resistance of the next resistor selected at random from the process will have a value between 970 and 990 with probability 0.90." (used in measuring quality)

Step 4 in testing of hypothesis

Determine the ACCEPTANCE REGION and the REJECTION REGION

Sampling Bias

Even if you have a sample of adequate size, randomly selected, and framed (to the population) correctly, you can still have bias: • •Response bias (e.g., some people don't do surveys) • •Push polls (e.g., commentary in the questions/introduction that would influence answer) • •Others (e.g., using hospital patients; since they were there for some reason already ...)

Normal Equations

Found by taking the partial derivatives with respect to each coefficient and setting them equal to zero:
(XᵀX)b = Xᵀy → b = (XᵀX)⁻¹Xᵀy
This is how we find the coefficients for our prediction equation. (XᵀX)⁻¹ is known as the variance-covariance matrix (more precisely, σ²(XᵀX)⁻¹ is the variance-covariance matrix of b).
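A minimal numpy sketch of solving the normal equations directly (random data purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n),          # dummy x0 = 1 for the intercept
                     rng.normal(size=n),
                     rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)     # b = (X'X)^{-1} X'y
print(b)                                  # estimated coefficients
```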

Bayes Theorem

Given two events A and B:
When P(A) is known: P(B|A) = P(A|B)·P(B) / P(A)
When P(A) is unknown, expand the denominator with the law of total probability:
P(B|A) = P(A|B)·P(B) / [P(A|B)·P(B) + P(A|B′)·P(B′)]
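A tiny numeric check of the "P(A) unknown" form (all probabilities are made up for illustration):

```python
p_B = 0.01                  # P(B): prior
p_A_given_B = 0.95          # P(A|B)
p_A_given_notB = 0.05       # P(A|B')

p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)   # total probability
p_B_given_A = p_A_given_B * p_B / p_A                  # Bayes' theorem
print(round(p_B_given_A, 3))                           # ~0.161
```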

Matrix Notation

Introduce a dummy variable so that the y-intercept has an x variable: x0i = 1.
Rewrite the model as: yi = β0x0i + β1x1i + β2x2i + ... + βkxki + εi
In matrix form: Y = Xβ + ε

What Affects Outlier Impact

Large data sets reduce the effects of outliers:
•As the size of the data set increases, the effect of outliers is reduced.
•As more variables are used in a model, the effect of the outlier is reduced.
Quantity and magnitude of outliers:
•As the number of outliers increases, the effect of outliers increases.
•As the magnitude of the outliers increases, the effect of the outlier increases.

Hypothesis Testing

Many times, we are not directly concerned with the actual value of the parameter (i.e., population mean, population variance). Instead, we want to know whether its value exceeds a given number, is less than a given number, or falls into a given interval. We want to decide whether a statement concerning the parameter is true or false; that is, we want to test a hypothesis about the parameter. Example: In USA TODAY (10-12-1995) it was reported that the overall mean score for the SAT test in the U.S. was 910. Suppose that in a school district, the average SAT score of 15 students in that year was 930 (i.e., x̄ = 930, n = 15). Can the superintendent of that school district conclude that the students in his/her school scored higher than the national average? Basically, is μ > 910, where μ is the mean SAT score of this particular school district?
- At first glance, we are inclined to say "yes".
- We know x̄ varies around μ.
- Is the higher score due to sampling variability?

Residual Analysis

Need to verify that error is independent and identically distributed (normally). Recall: εi ~ N(0, σ²). We could do Hypothesis Tests to confirm (i.e., on the variance and the residuals); however, we will focus more on "graphical interpretation" to determine if the assumptions are correct.
Assumptions:
1. The distribution of error is Normal.
2. Identical (i.e., variance is constant throughout).
3. Independent.
We will use:
- a Residuals versus Fits plot (want randomness about zero),
- a histogram (want center at 0 and a bell curve),
- a normality plot (want points all covered by a "pen").
We can also look at the correlation (of the residuals) to test independence, or residuals versus order of data if data was entered (into the graph) in the same order as it was taken.

Mutually Exclusive (Disjoint)

No element of A is contained in B and no element of B is contained in A.

Effects of Outliers

Outliers can significantly impact a predictive model. For example, in a REGRESSION equation, an outlier can cause a large difference in the coefficient or "beta" value. • Do outliers always affect a predictive model? •Not necessarily. If the outlier exhibits the same relationship to the target variable as the rest of the data, then it will not have a perceived effect. ... But the outlier may still be influencing the model

Distribution Functions

PDF is short for Probability Distribution Function. Example (rolling a single fair die):
•P(X = 1) = 1/6
•P(X = 2) = 1/6
•P(X = 6) = 1/6
CDF is short for Cumulative Probability Distribution Function. Example (rolling a single fair die):
•P(X ≤ 1) = 1/6
•P(X ≤ 2) = 2/6 = 1/3
•P(X ≤ 6) = 6/6 = 1
Note that PDFs and CDFs contain the same information; that is, you can calculate one from the other.

Missing Values: Imputation Methods

Quick approaches: •Numerical variable: •Impute missing values with the mean or median (or mode) • •Categorical variable: •Create an additional category (e.g., Male, Female, and Unknown) • A longer, more detailed approach is to use a decision tree or similar tools.

Type 1 Error

Rejecting the null hypothesis when it is true (a false positive)

Interactions

The linear models that we have evaluated thus far have assumed that the independent variables do not have any interaction or relationship with each other. However, that is not always the case. For example, in our Advertising example, suppose that spending money on radio advertising actually increased the effectiveness of TV advertising, so that the slope term on TV increases as radio expenditures increase. This is known as the synergy effect in marketing. In machine learning and statistics it is known as interaction. sales ≈ b0 + b1(TV) + b2(radio) + b3(newspaper) + b4(TV*radio) {The interaction term is where you multiply the two variables together to create an interaction variable.}
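A minimal sketch of fitting that interaction model (statsmodels formula syntax; the advertising.csv file and its columns are assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("advertising.csv")
# TV:radio adds the product term; statsmodels builds it from the two columns.
fit = smf.ols("sales ~ TV + radio + newspaper + TV:radio", data=df).fit()
print(fit.params)   # b4 is the coefficient on the TV*radio product
```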

Intersection

all elements contained in both A and B

Set Definition

a collection of objects (e.g., the items in my refrigerator)

Type 2 error

A false negative: failing to reject the null hypothesis when it is false

Treatments

The levels of a factor are often called treatments

For a random effects model, the appropriate hypotheses to test are

that the individual treatment effects are meaningless: H0: στ² = 0 versus H1: στ² > 0 (i.e., test whether the treatment variance component is zero).

Method of Least Squares

used to estimate coefficients, just as in the simple linear regression case: Min SSE = Σ ei². The only difference is that matrices will be used.

each term on the right hand side is called a

variance component

Mathematical Model: predicting y given x

Yi = β0 + β1Xi + εi

Multiple Linear Regression Excel Comments

•Independent Variables (X variables) need to be in contiguous columns (or rows) within Excel in order to use the Regression feature in the Data Analysis ToolPak. • •Excel can only do 16 predictor variables simultaneously

Complete Block Design

•each block consists of one complete replication of the set of treatments.

Regression Output

•R2: coefficient of determination -Percent of variability in y (dependent variable) explained by x's (independent variables). -Percent of variability in y explained by the regression model. -Always between 0 and 1. -Note: Adjusted R2 has similar interpretation, used when having multiple predictor (x) variables, Adjusted R2 will always be LESS than R2. •Significance F (p-value for your MODEL): -Provided with ANOVA table in Excel. -Lower value indicates better model. -Usually want below 0.10 (90% confident) or 0.05 (95% confident). -Always between 0 and 1. •Coefficients -Intercept (b0) -Slope(s) •(b1, b2, ...) as needed -Use these to build your model: •y = b0 + b1x1 + b2x2 + ... •P-values (associated with slope coefficients) -Lower value indicates better model. -Usually want below 0.10 (90% confident) or 0.05 (95% confident). -Always between 0 and 1.

Central Limit Theorem

•Regardless of the probability distribution of the underlying population, the probability distribution of the sample mean, x̄, is approximately normal, if the sample size n is sufficiently large. •When the population is normally distributed (that is, the parent population is normal), then the distribution of the sample mean, x̄, is always normally distributed, for any sample size. •When the population is NOT normally distributed (that is, the parent population is NOT normal), then a large sample size (usually n ≥ 30) is required in order for the sample mean, x̄, to be normally distributed. •Caution: just because you can use the normal distribution to examine the sample mean does not mean that you can or should use the normal distribution to examine the distribution (pdf, cdf) of a Random Variable.

Regression Analysis

•Regression Analysis is a statistical tool that examines the relationship between two or more variables so that one may be predicted from the other(s). •If the relationship between the variables is linear, then it is called Linear Regression. •If, in addition to being linear, there are only 2 variables involved (one predictor variable and one variable being predicted), it is Simple Linear Regression.

Hypothesis Testing in SLR

•SLR → analyzing the relationship between two variables •Two parameters: β0 (intercept) and β1 (slope). •Can test on the significance of either. However, the "significance" of an intercept is not as interesting as the "significance" of a slope. •If a slope that is unequal to zero exists, then there is a linear relationship between the two variables. •Hypothesis Testing of β1 (slope): -H0: β1 = 0(slope equal to zero, thus no linear relationship) -H1: β1 ≠ 0 (slope unequal to zero, there is a linear relationship) •Use t - statistic -Denominator: standard deviation of b1 -Degrees of freedom: n-2

Useful Methods for Numeric Transforms

•Trimming the data
 •When a variable exceeds a certain limit, it is simply truncated so that it cannot exceed the limit
 •Example: Limit X to a range of -5 to +10
 •Typically done with IF/ELSE type statements
•Standardizing the data
 •For example, a normal transform (Z transform)
 •Subtract the mean and divide by the standard deviation; this usually results in a value +/- 3 (unless there are some serious outliers)
•Log transform
 •Logarithms require X > 0, so this transform will do the following:
 •Take the absolute value of X and add 1
 •Perform the log function (Log10, LN, or a log of any other base)
 •Multiply by the original sign of "X"
•Binning (quantiles and buckets)
 •Binning is the process of placing the X values into a predefined number of groups. The grouping can be done in any way desired. Some common approaches include:
 •BUCKETS: The bins are equally spaced, but membership in each bin might not be equal.
 •QUANTILES: Grouping is done so that there is nearly equal membership in each bin.
 •AD HOC: Grouping is done based on a business rule or some other reasoning.
(See the sketch below for the signed log and binning.)
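A minimal sketch of the signed-log and binning transforms above (pandas/numpy are assumptions; the course itself works in Excel):

```python
import numpy as np
import pandas as pd

x = pd.Series([-120.0, -3.0, 0.0, 2.0, 45.0, 10_000.0])

# Signed log: take abs, add 1, log10, then restore the original sign.
signed_log = np.sign(x) * np.log10(np.abs(x) + 1)

buckets   = pd.cut(x, bins=3)    # BUCKETS: equal-width bins, unequal membership
quantiles = pd.qcut(x, q=3)      # QUANTILES: (near) equal membership per bin

print(pd.DataFrame({"x": x, "signed_log": signed_log,
                    "bucket": buckets, "quantile": quantiles}))
```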

Analysis of a Problem

•Understand and agree upon problem definition •Understand problem context and measurement/metrics •Design of tests and experiments •Data collection effort •Preparation of data •Consideration of the robustness and range of the model •Sensitivity of the model to changes •Documentation of analytical results and preliminary results •Return to earlier stages if necessary to refine •Derive final insight

Set Relationships: Four Operations

•Union • Intersection • Complement • Mutually Exclusive (also known as Disjoint) These can be illustrated graphically via Venn Diagrams, where we let the sample space be a rectangle, and represent the events by circles inside the rectangle.

Estimators

•Values computed from the sample data to estimate the parameters of the population or provide point estimates of the parameters.
•The parameters (µ and σ²) are constants, whereas their estimators (x̄ and s²) are random variables (because they vary from sample to sample).
•This means that x̄ and s² have their own probability distributions (pdf's), means, and variances. (Because each RV has its own pdf, mean, and variance.)
•µ is estimated by x̄
•σ² is estimated by s²
•In order to evaluate the "effectiveness" of these estimators, we need to know their probability distributions, means, and variances.

Permutation

•We ARE interested in the order •Want the number of ways in which three people can be placed into three positions (ABC, ACB, BAC, BCA, CAB, CBA are all different)

Combination

•We are not interested in the order •Want the number of ways in which we can select three representatives from our class (i.e., ABC, CAB, etc. are all equivalent to us).
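The counting formulas behind both cards (standard results for choosing x of n objects; not printed on the cards themselves):

```latex
P(n,x)=\frac{n!}{(n-x)!},
\qquad
C(n,x)=\binom{n}{x}=\frac{n!}{x!\,(n-x)!}
```

For the examples above: P(3,3) = 3! = 6 orderings of three people; C(n,3) counts unordered committees of three from a class of n.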

Intervals

•You can create Confidence Intervals and Prediction Intervals for both the slope and the intercept. • •Confidence intervals are for the true mean of the population based on a sample. This is useful for telling you the center of the distribution. • •Prediction intervals are likely values for the next observation (sample size = 1). This is useful for showing you how wide the distribution is. • •Thus, prediction intervals will be much wider since those include plausible observations, whereas confidence intervals are for the mean.

P-values

•p-values are interpreted the same way regardless of the test •If p-value < alpha; you Reject H0. •If p-value >= alpha; you Fail to Reject H0. • •In most situations you design a Hypothesis Test such that you'd like to Reject H0; thus, having a small p-value is your goal.

A randomized block design is a restricted

•randomization design in which the experimental units are first sorted into homogeneous groups, called blocks, and the treatments are then assigned at random within the blocks.

Confidence Interval for B1 (slope)

•ß1=slope = average increase (decrease) in y for a unit increase (decrease) in x •Can be negative •In general, a confidence interval looks like the following: Pt. estimate ± (constant)*(std. dev. of the pt. estimate)
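Filling that pattern in for the slope (the standard SLR result, not printed on the card):

```latex
b_1 \;\pm\; t_{1-\alpha/2,\,n-2}\; s(b_1),
\qquad
s(b_1)=\frac{s}{\sqrt{S_{xx}}}=\sqrt{\frac{\mathrm{MSE}}{\sum_i (x_i-\bar{x})^2}}
```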

Complement

elements from the Universe not contained in A

Sampling Methods

•Simple Random Sample -Use items without any pattern to the selection process (i.e., randomly) from a list • •Systematic Sample -Select every kth item from a list or sequence (e.g., every 6th customer) • •Stratified Sample -Select randomly from within a pre-defined strata (i.e., within an age range, specific occupation, gender) • •Cluster Sample -Similar to stratified, but based on location (physical proximity) (e.g., using ZIP codes)

Simple Exercise in Data Scrubbing

•Sometimes (most of the time!) data is not in the format we want or need •Sometimes data is inappropriate or just wrong •Data Scrubbing and Data Manipulation is often required

Correlation

•The Correlation Coefficient (r) describes the degree of linearity between paired observations between two variables. •r varies between -1 and 1 •Near 0 means no linear relationship between the two variables •Near +1 means a strong positive relationship between the two variables •Near -1 means a strong negative relationship between the two variables •Excel, "=CORREL" function

Excel Formulas for Rules of Counting

Factorial (n!); Excel Formula is "=FACT(n)" Permutation; Excel Formula is "=PERMUT(n,x)" Combination; Excel Formula is "=COMBIN(n,x)" [Note, n is always larger (or equal to if choosing all values; which is a trivial case) than x. That is, you are choosing x objects from n. Also, n and x are always non-negative integers {0, 1, 2, 3, ...} (if your fancy calculator or computer calculates it for negative numbers and decimals, it is approximating it using the Gamma Function).]

Permutation >= Combination

For a given n and x, the permutation is always greater than (or equal to) the combination, because the denominator of a combination has an extra term. It will be strictly greater than (>) in most cases. (It is equal only when x = 0 or 1, which are trivial cases because you are picking 0 or 1 out of n choices.)

Subsets

If A and B are sets, then A is a subset of B if all elements in A are also part of B.

Observations, Variables, and Data Sets

Observation: a single member of a collection of items that we want to study, such as a person, firm, or region
Variable: a characteristic of the subject or individual, such as an employee's income or an invoice amount
Data set: consists of all the values of all of the variables for all of the observations we have chosen to observe

Model Types

1. Iconic Model (Scale): physical model scaled up or down (e.g., CAD drawing of a building, model train)
2. Analog Model: one property is substituted for another (e.g., clock)
3. Symbolic Model (Math): system behavior is represented by mathematical symbols and functions [can be optimized]
4. Simulation Model: computer program that attempts to replicate system behavior
[Note, some would split #4 into simulation models and computer models.]

Reasons for Modeling

1.to describe a system to gain better understanding 2.to facilitate taking measurements on the real system 3.to predict future results 4.to aid in decision-making*** (key reason why this is important/core class)

Bernoulli Distribution

A random experiment with only two possible outcomes (that is, a random experiment with binary events) is called a Bernoulli random experiment To obtain an equivalent random variable, we (arbitrarily) label one of these "success" (say X=s) and the other "failure" (say X=f) The probability of success is denoted by p (so the probability of failure = 1 - p) It is common to choose "success" and "failure" such that "success" has the lower probability, but this isn't a hard-and-fast rule Examples: Success and failure depends on the situation Success could be reaching someone on the phone or passing an exam (with failures defined in obvious ways) Key feature is that the sample space (and therefore random variable) is binary [mutually exclusive] (e.g., Yes or No, True or False; 1 or 0).

Random Experiments

Connection to Random Experiments and Their Sample Spaces •Recall that we defined •a Random Experiment as an observational process whose results cannot be known in advance, and •a Sample Space as the set of all possible outcomes of a random experiment •Using these definitions, many phenomena of interest can be thought of as random experiments, even though the term "experiment" may not seem to fit: •Tracking defects at a work center •Sample Space is all non-negative integers •A random variable would be the number of defects per hour •Timing customer visits to a retail store •Sample Space is all non-negative numbers •A random variable might be the average visit time for females between the ages of 18 and 24 •All we're doing here is giving a name - Random Variable to the functions or rules that assign numbers to the events - the thing being counted or measured in these situations.

Types of Statistics

Descriptive statistics:
•Characterize and summarize features of the data
•Mean, median, percentile, standard deviation
•Graphs, tables, charts
•Attribute data; numerical data
Inferential statistics:
•Drawing conclusions about a population based on information from the sample
•T-test, chi-square test (examples) in Hypothesis testing

Binomial Distribution Examples

Examples of binomial distribution problems:
On average, 20% of emergency room patients at a particular hospital are uninsured. In a random sample of 10 patients:
•What is the probability that 5 will be uninsured?
•What is the probability that none will be uninsured?
•What is the probability that all will be uninsured?
An analysis of payment behavior in a hotel chain reveals that 32% of the customers pay with an American Express (AmX) credit card. Of the next 100 customers:
•What is the probability that 30 will pay with AmX?
•What is the probability that more than 60 will pay with AmX?
•What is the probability that 28 or fewer will pay with AmX?
What's the underlying Bernoulli experiment in each case?
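A minimal scipy sketch of those calculations (scipy is an assumption; the parameters come from the problems above):

```python
from scipy.stats import binom

# Hospital: n = 10 trials, p = 0.2 (success = uninsured)
print(binom.pmf(5, n=10, p=0.2))     # P(exactly 5 uninsured)
print(binom.pmf(0, n=10, p=0.2))     # P(none uninsured) = 0.8**10
print(binom.pmf(10, n=10, p=0.2))    # P(all uninsured)  = 0.2**10

# Hotel: n = 100 trials, p = 0.32 (success = pays with AmX)
print(binom.pmf(30, n=100, p=0.32))  # P(exactly 30)
print(binom.sf(60, n=100, p=0.32))   # P(more than 60) = 1 - P(X <= 60)
print(binom.cdf(28, n=100, p=0.32))  # P(28 or fewer)
```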

PDF's and CDF's

•Since a continuous random variable can assume an infinite number of values, we cannot enumerate the PDF as we did in the discrete case.
•Instead, we express the PDF of a continuous random variable as a formula, denoted f(x).
•This formula must, of course, define a continuous function.
 •Intuitively, the domain of a continuous function is any real number (can't have gaps), and if one approaches a point on the graph from either direction, one ends up at the same point (in other words, the values that the function assumes are "smooth" in the sense that the graph of the function doesn't "jump" up or down).
•Example: the PDF of the continuous distribution of highway speeds (figure not reproduced here).
•The CDF of a continuous random variable is also a formula, not an enumeration. The CDF is denoted F(x).
•The value of the CDF at a point, say x1, is the probability that the random variable is less than or equal to x1.
•In words, the value of the cumulative distribution function at a point x1 is the area under the PDF to the left of x1.

Hypergeometric Distribution

Recall that the binomial distribution involved analyzing the probability of events from a finite binary (success or failure) sample space with replacement for some number of independent trials (i.e., Bernoulli Trials). The hypergeometric distribution is similar to the binomial distribution except that the events are a sample chosen from a finite population of N items without replacement (i.e., not independent from trial to trial). The distribution is described by three parameters:
•N = number of items in the population
•n = sample size
•s = number of successes in the population

Distribution Expected Value (Mean) (Discrete Distributions)

Recall the Population Mean: the average of all the observations in the population, denoted by the Greek letter µ. If the population size is finite and equal to N, then µ = (Σ xi)/N. The population mean is called the EXPECTED VALUE, denoted by E(X), and can be calculated from the probability distribution as: E(X) = Σ x·P(X = x). The expected value indicates the central location (or center of gravity) of the probability distribution of the random variable X.

Distribution Variance (Discrete Distributions)

Recall the Population Variance: the 'average' of the squared deviations of all the observations in a population from the population mean, denoted by the Greek letter σ². The population variance, or the VARIANCE of X, is denoted V(X) and can be calculated from the probability distribution as: V(X) = Σ (x − µ)²·P(X = x). Or, more simply: V(X) = E(X²) − [E(X)]². The variance measures the spread or the variability of the probability distribution of X. Note, the standard deviation is the square root of the variance.
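A minimal numeric check of both formulas using a fair die (numpy is an assumption):

```python
import numpy as np

x = np.arange(1, 7)          # outcomes 1..6
p = np.full(6, 1 / 6)        # P(X = x) = 1/6 each

ex  = np.sum(x * p)                 # E(X)  = sum of x * P(x)        -> 3.5
ex2 = np.sum(x**2 * p)              # E(X^2)
var = ex2 - ex**2                   # V(X) = E(X^2) - [E(X)]^2       -> ~2.917
print(ex, var, np.sqrt(var))        # mean, variance, standard deviation
```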

Elephant in the Room

The answer is that we must follow a two step process: Step 1: Develop a random sample of the phenomenon in question (or use the entire population if feasible) For example, count customer arrivals, or compute customer purchase frequencies, or track product demand at a store That is ... collect the data. Step 2: Attempt to "fit" this empirical data to a known probability distribution There are statistical tests to determine the "goodness of fit" of particular distributions to given empirical data There are software packages that examine many candidate distributions and recommend particular distributions (and their parameters) given your empirical data That is ... create a model (probability distribution) and test how good it is and refine as needed.
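A minimal sketch of the two-step process (the exponential candidate, the scipy calls, and the data are all illustrative assumptions, not course requirements):

```python
import numpy as np
from scipy import stats

# Step 1: "collect" data (here, a simulated stand-in sample).
data = np.random.default_rng(7).exponential(scale=2.0, size=200)

# Step 2: fit a candidate distribution, then test goodness of fit.
loc, scale = stats.expon.fit(data, floc=0)                    # estimate parameters
stat, pvalue = stats.kstest(data, "expon", args=(0, scale))   # Kolmogorov-Smirnov
print(scale, pvalue)   # a large p-value -> no evidence against the fitted model
```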

Binomial and Hypergeometric Distributions

Similarities and Differences:
Similarities:
•Samples of size n
•X is the number of successes in a sample
Differences:
•The binomial situation replaces each sample back into the population (or in some cases, like flipping a coin n times, the population is infinite so replacement isn't an issue), while the hypergeometric situation requires that the sample not be replaced in the finite population
Approximating the hypergeometric distribution with the binomial distribution:
•If replacement is not occurring but the size of the sample (n) relative to the size of the population (N) is very small (typically n/N ≤ 0.05), then the probability of success on each hypergeometric draw is nearly constant
•In this case, the binomial distribution with p = s/N is a good approximation to the hypergeometric distribution (see the sketch below)
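A quick numeric check of that approximation (scipy and all numbers are illustrative assumptions):

```python
from scipy.stats import hypergeom, binom

N_pop, s, n = 10_000, 500, 20   # population size, successes in population, sample size
# scipy's hypergeom takes (k, M=population, n=successes, N=sample size):
print(hypergeom.pmf(3, N_pop, s, n))   # exact, without replacement
print(binom.pmf(3, n, s / N_pop))      # approximation with p = s/N
# n/N = 20/10000 = 0.002 <= 0.05, so the two values should be very close.
```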

Statistics Definition

Statistics definition: The science of •collecting, •organizing, •analyzing, •interpreting, and •presenting data for the purpose of gaining insight and making good decisions.

Binomial Distribution

The Binomial Distribution arises when a Bernoulli experiment is repeated n times, where each of these n trials is independent of the others. Since the trials are independent, the probability of success is constant across all trials, and is equal to the Bernoulli distribution's parameter p. For the binomial distribution, we will denote this probability p. In a Binomial experiment, we are interested in the number of successes, say X, in the n trials, so the Binomial Random Variable is the sum of n independent Bernoulli random variables, and the Binomial Probability Distribution is the probability distribution of this random variable. Recognizing Problems that require the binomial distribution • •All binomial distribution problems have five main elements: •There are a fixed number of "trials" n •There are only 2 outcomes for each trial (the Bernoulli requirement) •The probability of success for each trial p remains constant across all trials •The trials are independent - that is, the outcome of one trial has no bearing on the outcome of any other trial •The random variable X is the number of successes - the number of successes, or its converse (the number of failures) is what is being asked for

Poisson Distribution

The Poisson Distribution describes the number of occurrences of an event within a random interval of time (minute, day) or space (square foot, linear mile). Named for the French mathematician and physicist, Simeon D. Poisson, 1781-1840. To apply, the events must occur randomly and independently over a continuum of time or space. We will often refer to this continuum as "time" since the Poisson distribution is most commonly used to model arrivals per unit of time. When modeling arrivals, the independence assumption is crucial! In this case, independence means that the timing of each event's occurrence has no effect on the timing of other events. Power interruptions tend to occur in bunches - the probability of a second occurrence happening very soon is greater, for example, given that the first has just occurred - so the occurrences are NOT independent These examples are more likely to be independent (but would still need to be studied empirically to confirm this): Arrival of customers at an ATM Number of Airbus 330 engine shutdowns per 100,000 hours Arrival of asthma patients at a clinic Recognizing Problems that require the Poisson distribution All Poisson distribution problems have four main elements: An event of interest occurs randomly over time or space The average rate λ at which the event occurs is known and remains constant The timing of event occurrences are independent of each other The random variable of interest, X, is the number of events occurring within a specific time interval
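A minimal scipy sketch (the rate λ = 4 arrivals per hour is an assumption for illustration):

```python
from scipy.stats import poisson

lam = 4.0                         # average arrivals per hour
print(poisson.pmf(0, lam))        # P(no arrivals in an hour) = e**-4
print(poisson.pmf(6, lam))        # P(exactly 6 arrivals)
print(1 - poisson.cdf(6, lam))    # P(more than 6 arrivals)
```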

Establishing Cause and Effect

Three questions we should ask and answer •Is there a real association between the variables in question, or is it just correlation? •Is there any important timing relationship between the variables and what is it—lead or lag or other? •Are there other explanations for relationships—coincidence, error, etc?

Sum of Two Random Variables

We frequently encounter problems that require that we know the probability distribution of the sum of two random variables. For example, a Distribution Center (DC) provides products to 2 retail stores. We are trying to determine how much inventory to hold at the DC to protect against spikes in demand at the stores. We know the probability distributions of demand at the stores individually, but to solve this problem we need to know the probability of demand at the DC, whose demand is the sum of the stores' demand. In cases such as this, we have two (or more) random variables X and Y, and:
•The mean of the random variable X+Y, the sum of the two variables, is the sum of their individual means: μX+Y = μX + μY
•The variance of the random variable X+Y...
 •IF X AND Y ARE INDEPENDENT: σ²X+Y = σ²X + σ²Y
 •IF X AND Y ARE NOT INDEPENDENT: σ²X+Y = σ²X + σ²Y + 2σXY, where σXY is the covariance of X and Y
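A minimal simulation check of the dependent case (numpy; all numbers are illustrative, with a shared shock creating the dependence between the two "stores"):

```python
import numpy as np

rng = np.random.default_rng(3)
common = rng.normal(size=100_000)                   # shared shock -> dependence
x = 100 + 10 * rng.normal(size=100_000) + 5 * common
y = 80 + 8 * rng.normal(size=100_000) + 5 * common

lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, ddof=0)[0, 1]
print(lhs, rhs)   # nearly equal; the 2*cov term is required when X, Y are dependent
```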

When to use exponential distribution

When to use:
- Time between failures (e.g., time between light bulb burnouts)
- Service times
- Survival times of diseases
- Same λ as the Poisson Distribution: if X (Poisson) is a RV denoting the # of occurrences per time interval or area, with average = λ, then Y (Expo), representing the time or space between occurrences, is exponential with the same λ and has average time/space between occurrences µ = 1/λ.

Marginal Probability

a probability based on either a row total or a column total Example:What is the probability that a crash involves a male driver?

Union

all outcomes that are in A, or B, or both

Random experiment, sample space, event

•A Random Experiment is an observational process whose results cannot be known in advance •A Sample Space for a random experiment is the set of all possible outcomes (S) for the random experiment •A sample space with a countable number of outcomes is Discrete (e.g., number of people in line). Otherwise, it is Continuous (e.g., height of the people in line). -If the sample space is continuous then it cannot be listed but can be described by a rule •An Event is any subset of outcomes in the sample space -A Simple Event or Elementary Event is a single outcome. The term "event" can refer to a simple event or a collection of simple events.

Random Variable

•A Random Variable is a function or rule that assigns a numerical value to each outcome in the sample space of a random experiment. • •Example 1: •Random experiment: flipping a coin •One outcome: {head} •Random variable: {head} = 1, {tail} = 0 • •Example 2: •Random experiment: flipping a coin 5 times •One outcome: {head, tail, tail, tail, head} •Random variable: number of heads in 5 flips • •Usually arises from counting or measuring something that has an uncertain outcome.

Data Availability

•Am I data rich or poor? •Will I need observational or experimental data ? •If I collect data, I need to think about: •Sampling? •Experimental design? •Sources of data? •Is it squeaky clean or dirty? (that is, is it ready to use; or do you need to clean it up?)

Stratified Sample

•Appropriate when the population can be divided into relatively homogenous subgroups which are expected to exhibit different characteristics. The subgroups are called strata. • •Then use simple random sampling within each stratum. -Predetermined sample size for each stratum.

Certainty and Sampling

•As long as you are sampling you are not able to make any assertions with certainty •If we are estimating a population parameter like the mean, our point estimate is usually associated with a confidence interval •Avoid point estimates of population parameters—The average salary of a WM graduate 5 yrs out of school is $78,865.63

Data Classification

•By Type of Data •Cross-Sectional - measurements taken at one time period or where time dimension is unimportant •Time series - data collected over time and the time dimension is critical to understanding •Pooled - is a combination of both cross-sectional and time series (e.g., monthly unemployment rates for all 50 states for the last 5 years) •By Number of Variables •Univariate- data consisting of a single variable to measure some entity •Multivariate- data consisting of two or more variables to measure some entity

Qualitative-nominal (named) data

•Categorical—data that can be organized by placing them in categories. Place of birth, gender, etc. The best we can do is count the proportion of observations in a category. An average of the data collected is meaningless.
•Ranked/Ordinal Scale—data that is categorized, but also can be ranked, and the scale can be any numerical representation that maintains the ranking: Excellent=4, Good=2, Poor=0 ... is the same as Excellent=10, Good=6, Poor=2
•Proportions of observations in a category are valid; if it is ordinal data, computations based on the ranking process can be applied
•Nominal/Categorical: differences in the data are meaningless—if Chihuahua is assigned 3 and Rottweiler is 8, then 8 − 3 is meaningless
•Ordinal: more statistical techniques are applicable than for Nominal/Categorical data, but not as many as for Interval or Ratio data

Standardized Data

•Chebyshev's Theorem •For any dataset, however distributed, the proportion of observations that lie within k standard deviations (σ) of the mean (µ) must be at least: 1 − 1/k² •Example: Let the population mean and standard deviation be µ = 50 and σ = 5. •If k = 2, then at least 1 − 1/4 = 0.75 of the observations must lie within 2*5 = 10 units of 50. Thus, 75% of the observations should be between 40 and 60. •Note, for more well-known distributions (e.g., the Normal Distribution), we have much stronger results we can use.

Cluster Sample

•Cluster sampling is a sampling technique used when "natural" but relatively heterogeneous groupings are evident in a population - usually based on physical proximity. •One-stage Cluster sampling: randomly choose which clusters to sample, then sample all elements within each of those clusters. •Two-stage Cluster sampling: randomly choose which clusters to sample, and then choose a random sample within each of those clusters. • •Often used in marketing research (i.e., where samples are interviewed). •A common motivation for cluster sampling is to reduce the total number of interviews and costs given a desired accuracy. -Note, there is going to be some loss of reliability by choosing this sampling technique. And this loss of reliability must be acceptable within the confines of the experiment/goal of the study. •Assuming a fixed sample size, the technique is more accurate when the variation within the population is within the clusters, not between the clusters.

Intersection: Think Multiplication

•Critical question: are the A and B events related to each other?
•In other words... are A and B statistically independent or conditional?
•If events are statistically independent, the occurrence (or non-occurrence) of one event has absolutely no effect or bearing on the occurrence of the other event:
 •P(A∩B) = P(A) × P(B)
 •P(A|B) = P(A) [Conditional]
 •Formulas only valid for independent events.

Summary: Thinking of Goals is Essential

•Data Analysis provides a disciplined approach to designing a study to answer a question •It will ensure that we are headed toward the right goal(s) •It won't guarantee that there will not be other goals to appear later—learning is always on-going •Sets up the next phase, and we will likely return to more data analysis

Data Classification by Measurement

•Data results from measurement of observed activity—an experiment, survey, etc. Also, data is a plural noun—datum is singular. •Data is either quantitative or qualitative; numerical or categorical •Assignment of a numerical value to categorical data does not make categorical data numerical—e.g., if you are from Texas, you are assigned a 1. •Measurement is not always accurate/what you think it is.

Standard Normal Random Variable

•The standard normal distribution is the normal distribution with µ = 0 and σ = 1 •Any normal value x converts to it via the z-score: z = (x-μ)/σ • •CDF in Excel: = NORM.S.DIST(z, True) •Or = NORM.S.DIST(z, 1) • •Random Data in Excel: = NORM.S.INV(RAND())
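A quick worked value (for illustration): = NORM.S.DIST(1.96, TRUE) returns approximately 0.975, i.e., P(Z ≤ 1.96) ≈ 0.975, the familiar two-tailed 95% cutoff.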

Normal Random Variable

•Defined by two parameters §N(mean, standard deviation²) • •N(µ, σ²) • •PDF in Excel: = NORM.DIST(x, µ, σ, False) •CDF in Excel: = NORM.DIST(x, µ, σ, True) • •Random Data in Excel: = NORM.INV(RAND(), µ, σ) • •Note, if you want to go "backwards" (i.e., you know the cumulative probability as a decimal, but need to find x), then use the following: = NORM.INV(Cumulative Probability, µ, σ)
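A worked example (values assumed for illustration): for X ~ N(50, 5²), P(X ≤ 60) is = NORM.DIST(60, 50, 5, TRUE), which returns about 0.9772; going backwards, = NORM.INV(0.9772, 50, 5) returns about 60.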

Sample Size

•Depends on the variability of the population being measured and the required precision of the estimate. -High variability means larger sample size -More precision means larger sample size

Statistical Methodology

•Descriptive statistics - collection, organization, and description of data •Statistical inference - drawing conclusions about unknown characteristics of a population based on samples •Predictive statistics - inferring future values based on historical data

Deterministic vs. Probabilistic

•Deterministic models - assume away uncertainty and use point estimates •Probabilistic models - deal explicitly with uncertainty and say "I'm not afraid of you Mr./Ms. Uncertainty!"

Discrete vs Continuous

•Discrete or continuous •Attributes: discrete data obtained from counting and categorized into some classification scheme; can't be reduced any further •E.g., number of defects per unit of output, percentage of on-time flight arrivals, number of complaints per customer, percentage of "top box" responses in a satisfaction survey •Variables: continuous numerical data that can be measured on a continuum; can take just about any value and can be subdivided into smaller increments (time, cost, revenue, volume, length, temperature, etc.) •E.g., delivery times, number of ounces in a bottle of soda, monthly revenues, dollars spent on maintenance, balance in a customer account, time spent on homework, delays (or early arrivals) in ship arrival times

Inferential Statistics

•Drawing conclusions and making predictions about a population based on information obtained from a sample drawn from that population. • •Three main inferences about the population are made from the information obtained from the sample: •pdf of the population f(x) •mean of the population µ •variance of the population σ2

Box Plots

•Easy and visually informative plot •Also known as the box-and-whisker diagram •Based on a five number summary: •Min, Q1, Median, Q3, Max

Approaches to Probability

•Empirical (or Relative Frequency) -Probabilities are estimated from observed relative frequencies (sampling) -The Law of Large Numbers shows us that as the sample size increases, any empirical probability approaches its theoretical limit •Similarly, as sample size increases there is an increasing probability that any event - even a rare event - will occur (i.e., same lottery number, gambling streaks) •The Law of Large Numbers is important because it tells us that random events will have stable long-term results (e.g., a single blackjack player may have a hot streak, but casino profits will approach a predictable level in the long term) •Classical -Probabilities known in advance by the nature of the experiment (e.g., 50% chance of heads on a coin flip, 1 out of 6 for any face on a fair die roll) •Subjective -Probabilities are based on informed opinion or judgement •A panel of experts estimated that there was a 45% chance that the U.K. would exit the E.U. •Commonly used in business situations when empirical approaches are not possible (or, even more commonly, when the skills to collect and analyze the data are not available)

Normally Distributed Data

•For normally distributed data -68.26% of the observations fall within 1 standard deviation of the mean -95.44% fall within 2 standard deviations of the mean -99.73% fall within 3 standard deviations of the mean •Can standardize a normal distribution by expressing the number of standard deviations an observation is from the mean (i.e., a z-score) •±3σ is an unusual observation •±4σ is an outlier •Too many unusual observations or outliers would indicate problems with the sampling methods or that the distribution is not normal.
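To compute a z-score in Excel (a sketch; the cell reference is an assumption): = STANDARDIZE(A2, µ, σ), or equivalently = (A2 - µ)/σ. Values beyond ±3 would flag unusual observations.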

EXCEL: if

•IF is an Excel function •Used to perform an IF/THEN filter. •The syntax of the function is: =IF(logical_test,[value_if_true],[value_if_false]) •logical_test - a TRUE or FALSE test, usually on data contained in another cell •value_if_true - the output (can be another formula) if the logical test in the first argument is TRUE; if left blank (with the comma kept), Excel returns 0 •value_if_false - the output (can be another formula) if the logical test in the first argument is FALSE; if omitted, the function returns FALSE when the test fails
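A simple usage example (the cell reference is assumed for illustration): = IF(B2>=60, "Pass", "Fail") returns "Pass" when the score in B2 is at least 60, and "Fail" otherwise. IFs can also be nested: = IF(B2>=90, "A", IF(B2>=80, "B", "C")).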

With or Without Replacement

•If we allow duplicates when sampling, then we are sampling with replacement. • •If we do not allow duplicates when sampling, then we are sampling without replacement. • •Sampling without replacement can introduce bias into the sample since eventually the items not yet sampled have greater probability of being chosen than the items already sampled. (This becomes a problem as the sample size approaches the population size. A rule of thumb is 5%; that is, if your sample size is going to be more than 5% of your population size then sampling without replacement can introduce bias.)
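One way to sample with replacement in Excel (a sketch; the range name Population is an assumption): = INDEX(Population, RANDBETWEEN(1, COUNTA(Population))). Copied down, each draw is independent of the others, so duplicates can occur.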

Stratified vs. Cluster

•In cluster sampling, a cluster is selected at random, whereas in stratified sampling, members from each stratum are selected at random. • •In stratified sampling, each stratum is assumed to be composed of homogeneous members. In cluster sampling, each cluster is assumed to be composed of heterogeneous members. • •Stratified sampling is slower (i.e., a geographically dispersed sample), whereas cluster sampling is faster (i.e., a geographically close sample). • •Stratified samples have less error because they factor in the presence of each group within the population and adapt the method to obtain a better estimate. Cluster sampling has an inherently higher percentage of error.

Odds versus Probability

•In gambling, sports, and games of chance, we often hear the term "odds" rather than "probability" •The Odds in Favor of an event A is the ratio of the probability that event A will occur to the probability that event A will not occur. The Odds Against are defined analogously... •Odds in Favor of A = P(A) / P(A') = P(A) / [1 - P(A)] •Odds Against A = P(A') / P(A) = [1 - P(A)] / P(A) • Example: •The IRS tax audit rate for a particular income bracket is 1.41%. If A is the event that a taxpayer in this income bracket is audited, what are the odds against a taxpayer being audited? •P(A) = 0.0141 •Odds Against A = [1 - 0.0141]/0.0141 = 69.92, or ~70 to 1. • Note: In horse racing (and sports), odds are usually quoted against winning. •If the odds against an event are quoted at b to a, then the implied probability of the event happening (winning) is P(A) = a / (a + b). •So if, at a horse race, the "favorite" horse is quoted at 6 to 5 odds, then P("Favorite" Horse Winning) = 5/(5+6) ≈ 45%.

Continuous Random Variables

•In the discrete case, we calculated probabilities using the idea of "relative frequency" §This allowed us to compute the probability of a particular event by dividing the frequency with which it occurs (f) by the number of possible ways that something can occur (n), or f/n §P(rolling a 7) = f/n = 6/36 •Because a continuous random variable can assume an infinite number of values, however, n is infinite, so computing f/n yields 0 for any finite f (or the indeterminate form ∞/∞). •For this reason, it isn't meaningful to assign a probability to any single value in the domain of a continuous probability distribution. §Instead, we can only calculate probabilities over ranges of values §P(X > 5) or P(20 <= X <= 50), for example
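In practice (a sketch; the distribution parameters are assumed for illustration), range probabilities come from differencing the CDF. For X ~ N(50, 10²): P(20 <= X <= 50) is = NORM.DIST(50, 50, 10, TRUE) - NORM.DIST(20, 50, 10, TRUE), which returns about 0.4987.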

Inferential vs Deductive

•Inferential - a top-down approach that assumes the modeler can use their understanding of variables to build the model - can lead to "data poor" modeling (e.g., simulation and optimization) •Deductive - a bottom-up approach that focuses on data to build understanding of variables and parameters - can lead to "data rich" modeling (e.g., regression or database queries)

Simple Random Sample

•Simple Random Sample -Use items without any pattern to the selection process (i.e., randomly) from a list -Need to have a predetermined sample size

Quantitative

•Interval—a defined/meaningful and equal distance between each interval, but no absolute zero value (temperature in centigrade or Fahrenheit, etc.) •Ratio—a defined distance between each interval, but with an absolute zero value (revenue, length, other physically measurable observations—0 revenue is the total absence of revenue) •Sometimes difficult to tell if ratio or interval—ask yourself if the ratio of one outcome vs. another makes sense. (Is 78 degrees twice as hot as 39 degrees?) •Interval •Differences in data are meaningful •Therefore, averages for comparative purposes are meaningful •Ratio •Differences are meaningful •Ratios are meaningful

Mutually Exclusive or Independent

•Mutually Exclusive -Mutually Exclusive events are events where only one specific outcome can occur; the possible events are disjoint from one another. -For example, if a high school student applies to W&M, there are three outcomes of that application (Accept, Reject, Wait List). Never more than one. When the potential student receives the letter from W&M it will have only ONE outcome, never a combination. • •Independent -Independent events can happen at the same time, and those events don't impact each other. -For example, W&M is playing U of R in football, while Virginia Tech is playing UVA in football. The games occur at the same time. It is possible that W&M and Virginia Tech both win because they are not playing each other. The games are independent of each other (i.e., one game does not impact the other game's outcome). •Note, two (or more) events cannot be both independent and mutually exclusive of each other [unless one never happens, P(Event) = 0].

Non-Random Sampling Techniques

•Non-random sampling techniques -Judgement samples: use subject matter experts or expert knowledge to make a choice -Convenience samples: use a sample that happens to be available (e.g., ask your coworkers during break for their opinions) -Focus groups: in-depth dialogues with a representative sample of users (e.g., iPad users)

Populations and Samples

•Population - all items of interest for a particular decision or investigation •All married drivers in the U.S. over age 25 •All US Army personnel below the rank of Lieutenant •Sample - a subset of a population •Nielsen sample of TV viewers for a population of interest •Audit team sample of inventory accuracy in a munitions magazine •Samples are used •To reduce costs of data collection and provide convenience •Make assertions about populations •When a full census cannot be taken

Discrete Uniform Distribution

•Random between a and b, where values must be integers. • •Each value of the random variable is equally likely to occur! • •Since the distribution is discrete (not continuous), a single value can have a positive probability. • •In Excel: = RANDBETWEEN(a,b), where b > a. [Note, you could use RAND() and manipulate from there, but RANDBETWEEN is easier]
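For example (a standard die, for illustration): = RANDBETWEEN(1, 6) simulates one roll of a fair die, and each of the 6 values has probability 1/(b - a + 1) = 1/6.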

Measures of Variability

•Range: the difference between the smallest and largest values •Variance and Standard Deviation •Measure the degree to which values are dispersed from the mean of the values •Note, with respect to samples and populations (Lesson 4), the Variance and Standard Deviation have different formulas depending on whether the data is a population or a sample •Mean Absolute Deviation •The average distance from the "center" of the values •In Excel, the function is "=AVEDEV" •Frequently used in Forecasting •Coefficient of Variation (CV) = Standard Deviation / Mean •A measure of dispersion relative to the mean •Dimensionless (i.e., units cancel out) •Useful to compare different distributions (of different units or scales) to one another
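In Excel, with data in a range (the cell range is assumed for illustration): range = MAX(A2:A50) - MIN(A2:A50); sample variance = VAR.S(A2:A50); MAD = AVEDEV(A2:A50); CV = STDEV.S(A2:A50)/AVERAGE(A2:A50).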

Sampling Concepts

•Sample or Census -A sample looks only at some items from a population, whereas a census looks at all items in a population. •Parameters and Statistics -A parameter is a measurement or characteristic of the population (e.g., mean, proportion, standard deviation) •Parameters are usually unknown since we can rarely observe the entire population. They are usually represented by Greek letters (µ or π). -A statistic is a numerical value calculated from a sample (e.g., mean, proportion, standard deviation) •Statistics are usually represented by Roman letters (x̄ or p).

Sampling Key Ideas

•Sampling -To examine part of a whole; in order to gain insights on the whole •Randomize -To make the sample representative of the population •Sample size -The actual number of items sampled (not the fraction of the population!) is what is important -The larger your sample size, the less error (less variability)

Systematic Sample

•Select every kth item from a list or sequence (e.g., every 6th customer) -Often used when there is an inherent randomness in the order of the sequence -k is called the periodicity of the systematic sample •Useful in unlistable or infinite populations -Production processes (e.g., sample every 100th item being made) -Political polling (e.g., sample every 5th person leaving a polling station) •A systematic sample of size n from a population of size N requires the periodicity k to be approximately N/n -For example, to choose 25 companies from the Fortune 500, you would need to choose every 20th company on the list (k = 500/25 = 20) •Choose the starting point randomly (i.e., choose the first item to be sampled within the first k items randomly, and then take every kth item) -Continuing the Fortune 500 example: in an alphabetical list of Fortune 500 companies, randomly choose a number between 1 and 20, say it is 13; then you would sample company 13, then 33, then 53, etc. on the alphabetical list.
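One way to pull every kth row in Excel (a sketch; the range name List, the start value, and k are all assumptions): = INDEX(List, start + (ROW(A1) - 1)*k), copied down, with start chosen once via = RANDBETWEEN(1, k).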

Measure of Center: Shape

•Shape: measures the relationship between the mean and the median •Mean < Median: Skewed left (called negative skewness, tail of the histogram points left) •Mean = Median: Symmetric (tails of histogram are balanced) •Mean > Median: Skewed right (called positive skewness, tail of the histogram points right)

Covariance

•The Covariance describes the degree to which two quantitative variables change together. •Covariance is denoted by Cov(X, Y), sXY, or σXY •Cov(X, Y) > 0 means that they move in the same direction •Cov(X, Y) = 0 means that X and Y are unrelated •Cov(X, Y) < 0 means that they move in opposite directions •Excel: "=COVARIANCE.S()" or "=COVARIANCE.P()" depending on sample or population • •The units of Covariance are hard to interpret, since its scale changes dramatically with the units and variances of the variables • •Thus, the Correlation Coefficient is used much more often (as it is dimensionless and standardized between -1 and 1)
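With paired data (the cell ranges are assumed for illustration): = COVARIANCE.S(A2:A51, B2:B51) gives the sample covariance, and = CORREL(A2:A51, B2:B51) gives the dimensionless correlation coefficient between -1 and 1.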

An Important Continuous PDF

•The Normal (Gaussian) probability distribution: -Bell-shaped function developed by the German mathematician Gauss in the 18th century. - -Examples of continuous RV's that might follow the normal distribution •Heights •Weights •Test scores •Quality characteristics such as diameter or thickness, some resistances, etc.

Probability Def

•The Probability of an event is a number (between 0 and 1, inclusive) that measures the relative likelihood that the event will occur -For example, the probability of drawing an ace from a standard deck of cards is calculated by dividing the number of different ways an ace can be drawn by the total number of different outcomes that are possible when drawing a card. •Ways to draw an ace: 4 •Number of different outcomes that are possible: 52 •P(drawing an ace): 4/52 = 0.0769 • •The probability of event A [denoted by P(A)] must lie within the interval from 0 to 1 inclusive - that is, [0, 1]: -If P(A) = 0, then the event A cannot occur -If P(A) = 1, then the event A is certain to occur

Target Population

•The target population is the population we are interested in studying (e.g., U.S. gasoline prices). •The population must be carefully specified and the sample must be drawn scientifically so that the sample is representative (of the population). -The sampling frame is the group from which we take the sample (e.g., 115,000 stations). -Examples (today): •Phone Directories •Voter Registration Lists •Alumni Association Mailing Lists -The frame should not differ from the population.

Exponential Distribution

•This distribution is closely related to the Poisson distribution - a discrete distribution we discussed earlier. -Recall that a Poisson distribution is often used to model the number of arrivals per unit of time. -The only parameter necessary to describe the Poisson distribution is the arrival rate λ. •When the number of arrivals has a Poisson distribution with arrival rate λ, the time between arrivals (or interarrival time) has an Exponential Distribution with a mean interarrival time of 1/λ. •Excel Formulas: • •CDF: = EXPON.DIST(x, λ, True) •Or = EXPON.DIST(x, λ, 1) • •Random data in Excel: •= -LN(RAND())/λ • • •Always right skewed.
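A quick worked value (λ assumed for illustration): if arrivals occur at λ = 3 per hour, the mean interarrival time is 1/3 hour, and P(next arrival within half an hour) is = EXPON.DIST(0.5, 3, TRUE), which returns about 0.7769.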

EXCEL: VLOOKUP

•VLOOKUP is an Excel function •Used to search for a value in the leftmost column of a table and then return a value in the same row from a column you specify. •VLOOKUP and HLOOKUP are similar, with VLOOKUP looking for values in columns and HLOOKUP looking for values in rows •The syntax of the function is: =VLOOKUP(lookup_value,table_array,col_index_num,range_lookup) •Lookup_value - the value to be found in the first column of the array (can be a value, a reference, or a text string) •Table_array - the table of information where you are searching (can be a reference to a range or a range name) •Col_index_num - the column number in the table array from which the matching value must be returned (1 returns the value in the first column, 2 returns the second, etc.) •Range_lookup - optional argument. If TRUE or omitted, an approximate match is returned. If FALSE, the function only returns an exact match (or #N/A if no match is found) •NOTE: RECOMMENDED to enter TRUE or FALSE explicitly in the optional Range_lookup argument to avoid making a mistake.
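A usage example (the table layout and part number are assumed for illustration): with part numbers in column A and prices in column C of the range A2:C50, = VLOOKUP("D204", A2:C50, 3, FALSE) returns the price for part D204, or #N/A if that part number is not found.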

Grouped Data

•The Weighted Mean is a common example of Grouped Data •Suppose the grade for a course is computed as follows: •30% for homework •25% for midterm •40% for final exam •5% for class participation •Suppose a student's grades were as follows: •75% for homework •89% for midterm •65% for final exam •3% for class participation •Then the Weighted Mean is computed as follows: = (30%)(75%) + (25%)(89%) + (40%)(65%) + (5%)(3%) = 70.9% •Similarly, one could calculate descriptive statistics (e.g., mean, variance) for grouped data
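The same calculation in Excel (the cell ranges are assumed for illustration): with the weights in B2:B5 and the grades in C2:C5, = SUMPRODUCT(B2:B5, C2:C5) returns 70.9% directly.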

Explain/Interpret the Vessel Data

•What is the first question you ask as a user of the data? •What is contained in the data? •What is not contained? •What might its purpose be? •How effectively does it present information, and on what basis? •How would data like this be used?

Measures of Center

•Where are the data values concentrated? •What is a typical "middle" of the data values? •Is there a central tendency? •Measures of Center •Mean (Average) •Median (the mid-point of the sorted values) •Median of {1, 2, 3} is 2. •Median of {1, 2, 3, 4} is 2.5. •Median of {1, 1, 2, 2, 3, 4, 5} is 2. •Mode (most frequently occurring data value) [may have multiple modes or no modes] •Mode is the only meaningful measure of center for categorical data. •Midrange (the point halfway between the lowest and the highest values) •The average of the minimum and the maximum.
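The corresponding Excel functions (the cell range is assumed for illustration): = AVERAGE(A2:A50), = MEDIAN(A2:A50), = MODE.SNGL(A2:A50), and midrange = (MIN(A2:A50) + MAX(A2:A50))/2.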

Empirical Probability

•Relies on the relative frequency of an event to determine the probability (chance) of that event. • •For example, if out of 1000 pins inspected, 50 pins are rejected, then the probability that the next pin inspected is rejected is determined as: • •P(A) = f/n = 50/1000 = 0.05

Classical Probability

•Uses sample spaces to determine the probability. It assumes that all outcomes in a sample space are equally likely (at least most of the time). • •Ex: Measuring the resistance of resistors, assuming that the instrument can measure to the nearest ten ohms and that the range of resistance measurements is from 900 to 970 ohms: S = {900, 910, 920, 930, 940, 950, 960, 970} •If we assume that the 8 values are equally likely, then: P(the next resistor will have a resistance value of 940 ohms) = 1/8 = 0.125

