Intro to Managerial Statistics
mean formula
"The sum of observations / n"
normal distribution
Bell-shaped, symmetric, and centered at the mean; the area under the curve represents probability. A function that represents the distribution of a variable as a symmetrical bell-shaped graph. Used to describe the variability of sample proportions taken from repeated samples, and of many other statistics.
Normal distribution assumptions
1. Independent observations. 2. Large enough sample: at least 10 expected successes and 10 expected failures in the sample.
mosaic plot
Displays 2 categorical variables: the bigger the area, the bigger the proportion. Uses the areas of rectangles to display the relative frequency of every combination of the two categorical variables. Use the explanatory variable for the first split.
95% confidence interval
The 68-95-99.7 rule tells us that, when the distribution is normal, about 95% of observations are within 2 standard errors of the mean. More precisely, the point estimate we observe will be within 1.96 standard errors of the true value of interest 95% of the time, so we are 95% confident the interval captures that value.
data distribution
A listing of the values or responses associated with a particular variable in a data set.
range
max - min; can be inflated by extreme values (outliers)
center
mean/average: measure center of a distribution of data
best to use when data is skewed
IQR and range together
95% for finding the true population proportion/mean
If 95% of sample proportions/means fall within two standard errors of the population proportion/mean, then 95% of the time the sample proportion/mean is within 2 standard errors of the true value. For a proportion: p-hat ± 2 × sqrt(p-hat × q-hat / n), where q-hat = 1 - p-hat.
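A minimal sketch of the formula above. The function name and the counts (60 successes out of 100) are made up for illustration.

```python
import math

def proportion_ci_95(successes, n):
    """Approximate 95% CI: p_hat +/- 2 * sqrt(p_hat * q_hat / n)."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    se = math.sqrt(p_hat * q_hat / n)  # standard error of p_hat
    return p_hat - 2 * se, p_hat + 2 * se

# Hypothetical data: 60 successes in a sample of 100.
low, high = proportion_ci_95(60, 100)
print(round(low, 3), round(high, 3))
```

The multiplier 2 is the rounded version of 1.96 used on this card.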
Do all 95% confidence intervals include the population parameter?
No, only about 95% of such confidence intervals do.
Tidy Data
One observation per row. One variable per column. One value per cell.
categorical nominal
No intrinsic order. Ex: favorite color, football team.
Investigative Cycle
Problem
-Identify the problem
-Define the research question
Plan
-Prepare or examine the sampling plan and/or experimental design
-Collect the data (if not given)
Data
-Identify the explanatory and response variables
-Identify the variables as categorical or quantitative
-Ask questions about the data
-Are there missing observations?
-Where did the data come from?
-Is there anyone missing?
Analysis
-Examine the data by finding statistical and graphical summaries
-Determine the appropriate approach
Conclusion
-What is the conclusion of the analysis?
-Can you extend it to the population?
-Can you show cause and effect? Why?
-What are the next steps?
IQR
Q3 (75th percentile) - Q1 (25th percentile): the spread of the middle 50% of the data
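A small sketch of the IQR calculation using Python's standard library. The data values are invented, and the "inclusive" quartile method is just one convention; a textbook or other software may give slightly different quartiles.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical observations

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q3, iqr)
```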
empirical rule
The rule gives the approximate percentage of observations within 1 standard deviation (68%), 2 standard deviations (95%), and 3 standard deviations (99.7%) of the mean when the histogram is well approximated by a normal curve.
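The three percentages can be checked against the standard normal curve. A quick sketch with the standard library's `NormalDist` (variable names are illustrative):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area under the curve within k standard deviations of the mean.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}
for k, area in within.items():
    print(k, round(area, 4))
```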
categorical data display
contingency table, bar plot, stacked bar plot, dodged bar plot, standardized bar plot, mosaic plot
Principles of Experimental Design
control, randomize, replicate, block
margin of error
Describes how far the point estimate is likely to be from the true parameter (half the width of the confidence interval); z multiplier × the standard error
median higher than mean
left skewed
histogram
A bar graph depicting a frequency distribution; shows data density. Shape can be unimodal, bimodal, or multimodal, and right skewed or left skewed. Mode = a peak in the distribution.
bar plot
a common way to display a single categorical variable
sampling distribution
A distribution of statistics obtained by selecting all the possible samples of the same size from a population; shows how sample statistics vary from one sample to another.
point estimate
a summary statistic from a sample used as an estimate of the population parameter
explanatory variable
a variable that we think explains or causes changes in the response variable
prospective study
an observational study in which subjects are followed to observe future outcomes
retrospective study
an observational study in which subjects are selected and then their previous conditions or behaviors are determined
mode best used when data is
bimodal or unimodal to find peaks
sampling with replacement
Used in bootstrapping. Once a member of the population is selected for inclusion in a sample, that member is returned to the population before the next individual is selected. Taking repeated samples from a population is usually impossible, so we bootstrap: resample from the sample.
multistage
Cluster + strata: like a cluster sample, but rather than keeping all observations in each cluster, we collect a random sample within each selected cluster. More economical; helpful when there is a lot of case-to-case variability within a cluster. Ex: someone wanted to survey parents of elementary school kids. They randomly picked 20 school districts in the US and then randomly picked 2 elementary schools within each district. For each school they took a random sample of 100 parents.
census
Collecting data from an entire population. Hard to do and expensive; it is difficult to identify the entire population of interest.
ridge plot
combines density plots for various groups drawn on the same scale in a single plotting window
stacked bar plot
Displays the distributions of two categorical variables on one bar plot; useful for visualizing the relationship between them. One variable is explanatory and one is the response.
null distribution
The distribution of simulated statistics that represent what could have happened in the study assuming the null hypothesis was true. Always centered at the null value (the parameter value stated in the null hypothesis).
stratified
Divide + conquer. 1. The population is divided into groups called strata; strata are chosen so that similar cases are grouped together. 2. Usually a simple random sample is taken within each stratum. Useful when cases within each stratum are very similar with respect to the outcome of interest: the population estimate is more precise if each group estimate is more precise. (A lot of case-to-case variability within a group favors cluster sampling instead.)
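The two steps can be sketched as below. This is only an illustration: the function name, the `region` field, and the population of 30 cases are all hypothetical.

```python
import random

def stratified_sample(population, strata_key, per_stratum, seed=0):
    """Group cases into strata, then take a simple random sample in each."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = {}
    for case in population:
        strata.setdefault(strata_key(case), []).append(case)
    sample = []
    for cases in strata.values():
        sample.extend(rng.sample(cases, per_stratum))  # without replacement
    return sample

# Hypothetical population: 10 cases in each of three regions.
regions = ["north"] * 10 + ["south"] * 10 + ["west"] * 10
population = [{"id": i, "region": r} for i, r in enumerate(regions)]
sample = stratified_sample(population, lambda c: c["region"], per_stratum=2)
print(sample)
```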
difference between a scatterplot and dotplot
A dot plot displays one variable; a scatterplot displays two.
simple random sample
each case in the population has an equal chance of being included and knowing that a case is included in a sample does not provide useful information about what other cases were included
robust statistic
Extreme observations have little effect on its value. Ex: IQR, median.
was the sample mean unusually high
Find the z-score of the sample mean, then find the probability of observing a sample mean that large or larger.
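A rough sketch of that two-step check, assuming the sampling distribution of the mean is approximately normal and the population standard deviation is known. The function name and all numbers are invented for illustration.

```python
import math
from statistics import NormalDist

def upper_tail_for_sample_mean(x_bar, mu, sigma, n):
    """Return the z-score of x_bar and P(sample mean >= x_bar)."""
    se = sigma / math.sqrt(n)      # standard error of the sample mean
    z = (x_bar - mu) / se          # step 1: z-score
    p = 1 - NormalDist().cdf(z)    # step 2: upper-tail probability
    return z, p

# Hypothetical values: population mean 50, sd 10, sample of 100, x_bar 52.
z, p = upper_tail_for_sample_mean(x_bar=52, mu=50, sigma=10, n=100)
print(round(z, 2), round(p, 4))
```

A small probability suggests the sample mean was unusually high.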
normal quantile plot
If the data mostly follow a straight line, they are approximately normal; if the points curve up or down at the ends, they are not approximately normal.
What does changing the confidence level affect?
Increasing the confidence level increases the width of the interval.
How does changing the sample size affect the CI?
Increasing the sample size decreases the width of the CI.
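Both effects show up in a quick sketch of the interval width for a proportion. The function name and inputs (p-hat = 0.5, n = 100 or 400) are hypothetical, and the normal model is assumed to apply.

```python
import math
from statistics import NormalDist

def ci_width(p_hat, n, level):
    """Width of a normal-model CI for a proportion: 2 * z* * SE."""
    z_star = NormalDist().inv_cdf(0.5 + level / 2)  # z multiplier
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return 2 * z_star * se

w_95 = ci_width(0.5, 100, 0.95)
w_99 = ci_width(0.5, 100, 0.99)      # higher confidence level
w_95_big = ci_width(0.5, 400, 0.95)  # larger sample
print(round(w_95, 3), round(w_99, 3), round(w_95_big, 3))
```

Quadrupling the sample size halves the width, because n sits under a square root.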
assumptions needed to use the t model to make a one-sample t interval for the mean
Independence assumption: check the randomization condition and the 10% condition. Normal population assumption: check the nearly normal condition. If n < 30, the data should show no outliers; if n >= 30, the condition is generally satisfied.
convenience sample
individuals who are easily accessible are more likely to be included in the sample
categorical ordinal
Intrinsic order. Ex: somewhat agree, somewhat disagree, disagree; great, bad, terrible; severity 1-10; star ratings.
data frame
is a convenient and common way to organize data, where each row is a unique case (observational unit), each column is a variable, and each cell is a single value
summary statistic
is a single number summarizing data from a sample
sampling distribution of the sample proportion
All the possible values of the sample proportion from taking samples of the same size from the population, under these conditions. Independence assumption: sampled values are independent. Randomization condition: random samples or random assignment of treatments. 10% condition: the sample should not be larger than 10% of the population. Success/failure condition: there must be at least 10 expected successes (np) and 10 expected failures (nq); note q = 1 - p.
zscore
The number of standard deviations an observation is away from the mean. Ex: if an observation is one standard deviation above the mean, its z-score is 1. Observations below the mean have negative z-scores; observations above the mean always have positive z-scores; if an observation equals the mean, its z-score is 0. An observation x1 is more unusual than another observation x2 if the absolute value of its z-score is larger.
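A tiny sketch comparing two observations by z-score. The function name and the means and standard deviations are made up.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / sd

# Hypothetical observations from two different distributions.
z1 = z_score(130, mean=100, sd=15)  # above the mean: positive
z2 = z_score(40, mean=50, sd=10)    # below the mean: negative

# The observation with the larger |z| is the more unusual one.
more_unusual = "x1" if abs(z1) > abs(z2) else "x2"
print(z1, z2, more_unusual)
```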
sample distribution
one possible sample simulation
scatterplots
One type of graph used to study the relationship between two numerical variables. If the two variables show some connection, they are associated; variables that are not related are independent.
Dot Plot
One quantitative variable. A graphical device that summarizes data by the number of dots above each data value on the horizontal axis.
variability is small
The original statistic is likely close: we expect the sample statistic to be close to the true parameter.
variability is large (bootstrap)
The original statistic may be far from the true population parameter.
bias
Over-representing someone's interests. Ex: nonresponse bias, when only 30% of the people sampled actually respond.
mu
population mean
confidence interval for the population proportion
Question data: bar graph or pie chart.
Variables: categorical with yes/no outcomes.
Parameter: p = the population proportion of successes; this value is unknown.
Assumptions that must be met to use a CI for p:
-Independence assumption: check the randomization condition and the 10% condition.
-Sample size assumption: check the success/failure condition. If it is not met, we cannot use the normal model and thus cannot proceed with the interval.
Different approaches:
-Classical approach
-Bootstrap approach: percentiles (find the middle 95% of the bootstrap sample proportions in the bootstrap distribution) or standard error approach (use p-hat ± 2 × SE_bootstrap)
bootstrap percentile confidence interval
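A range of plausible values for the true proportion, taken from the middle percentiles of the bootstrap distribution (2.5th to 97.5th for a 95% interval)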
measures of spread (variability) (dispersion)
range, variance, standard deviation
Bootstrapping
Repeatedly sample from the sample, with replacement. Best for modeling studies where the data were generated through random sampling from a population. Models how a statistic varies from one sample to another. Goal: to produce an interval estimate for the population parameter.
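A minimal sketch of a bootstrap percentile interval for a proportion. The function name, the number of resamples, and the 0/1 sample (60 successes, 40 failures) are assumptions made for illustration; the percentile cutoffs are approximate.

```python
import random

def bootstrap_percentile_ci(sample, reps=2000, level=0.95, seed=0):
    """Approximate percentile CI for the proportion of 1s in a 0/1 sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    boot_stats = []
    for _ in range(reps):
        resample = rng.choices(sample, k=n)  # sample WITH replacement
        boot_stats.append(sum(resample) / n)  # bootstrap statistic
    boot_stats.sort()
    tail = (1 - level) / 2
    lower = boot_stats[int(tail * reps)]
    upper = boot_stats[int((1 - tail) * reps) - 1]
    return lower, upper

# Hypothetical sample: 60 successes (1) and 40 failures (0).
sample = [1] * 60 + [0] * 40
low, high = bootstrap_percentile_ci(sample)
print(low, high)
```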
observational study
Researchers collect data in a way that does not interfere with how the data arise. Ex: surveys, reviewing records.
mean higher than median
Right skewed. The median is the preferred measure of center; IQR and range are better descriptions of spread.
standard deviation
s: a measure of variability that describes the typical distance of the observations from the mean. If the points are closer to the mean, the standard deviation is smaller.
variance
s^2: the standard deviation squared (measured in squared units, not the same units as the data)
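A short sketch of s^2 and s with the standard library; the data values are invented. `statistics.variance` and `statistics.stdev` are the sample versions, which divide by n - 1.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations, mean = 5

s2 = statistics.variance(data)  # sample variance, in squared units
s = statistics.stdev(data)      # sample standard deviation, in data units
print(round(s2, 3), round(s, 3))
```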
Histogram
Shape: skewed left or right. Outliers: between. Center: mean, median, mode. Spread: standard deviation, variance, IQR, range.
quantitative data display
side by side box plot, box plot, histogram, ridge plot, faceting, scatterplot, dot plot
4 types of sampling
simple, clustered, multistage, stratified
median best used when data is
Skewed. The median is better here because the mean is pulled toward the extreme values in the skewed tail.
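To see the pull on the mean, here is a small sketch with one extreme value in an otherwise tight data set. The numbers are invented.

```python
import statistics

# Hypothetical right-skewed data: one very large value.
incomes = [30, 32, 35, 38, 40, 42, 45, 400]

mean = statistics.mean(incomes)      # dragged upward by 400
median = statistics.median(incomes)  # barely affected
print(mean, median)
```

The mean ends up larger than every value but one, while the median stays in the middle of the bulk of the data.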
For a t distribution...
The smaller the sample size (fewer degrees of freedom), the more spread there is in the tails.
Faceting
Splits the graphical display of the data across plotting windows based on groups.
sample
subset of whole
box plot
Summarizes the data with 5 statistics (min, Q1, median, Q3, max) and identifies unusual observations.
contingency table (cross tabulation)
Summarizes data for two categorical variables; each value in the table represents the number of times a particular combination of variable outcomes occurred.
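A minimal sketch of building such a table by counting combinations. The two variables (pet ownership and housing type) and the six cases are hypothetical.

```python
from collections import Counter

# Hypothetical cases: (owns_pet, housing_type) for each observational unit.
cases = [
    ("yes", "house"), ("yes", "house"), ("no", "house"),
    ("yes", "apartment"), ("no", "apartment"), ("no", "apartment"),
]

table = Counter(cases)  # count of each combination of outcomes

row_values = sorted({r for r, _ in cases})
col_values = sorted({c for _, c in cases})
print("owns_pet", col_values)
for r in row_values:
    print(r, [table[(r, c)] for c in col_values])
```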
mean best used when data is
symmetrical, bell shaped
dodged bar plot
Tends to use too much horizontal space; it can be difficult to tell whether there is a relationship.
for the t distribution: higher the df
the closer it gets to the z shape
Central Limit Theorem
For a large enough sample size, the distribution of all the sample means taken from the same population (with mean equal to mu and standard deviation equal to sigma) is approximately a bell-shaped curve.
sampling distribution of the sample mean
The distribution of all the sample means from samples of the same size n taken from the same population (mean mu, standard deviation sigma); it is centered at mu with standard error sigma / sqrt(n). Conditions: large enough sample (n >= 30). Independence assumption: sampled values are independent. Randomization condition: random sample or random assignment of treatments. 10% condition: the sample should not be larger than 10% of the population.
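A simulation sketch of this sampling distribution. Everything here is an assumption for illustration: the population is an exponential distribution with mean 1 and standard deviation 1 (strongly right skewed), n = 30, and 2000 simulated samples stand in for "all possible samples".

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 30                  # sample size
reps = 2000             # number of simulated samples

# Each entry is the mean of one sample of size n from the population.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

center = statistics.fmean(sample_means)  # should be near mu = 1
spread = statistics.stdev(sample_means)  # should be near sigma / sqrt(n)
expected_se = 1.0 / math.sqrt(n)
print(round(center, 3), round(spread, 3), round(expected_se, 3))
```

Even though the population is skewed, a histogram of `sample_means` would look roughly bell shaped.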
bootstrap distribution
the distribution of many bootstrap statistics approximating the sampling distribution
standard error
the standard deviation of a sampling distribution
How is the 95% confidence interval derived using the classical approach?
The standard error is estimated from the theoretical sampling distribution of the statistic. The point estimate is the center of the interval, and the margin of error is the z multiplier × the standard error.
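A sketch of the classical recipe for a mean: point estimate ± z multiplier × standard error. The function name and sample are hypothetical, and the z multiplier assumes a large sample; for small samples the t distribution would be used instead.

```python
import math
import statistics
from statistics import NormalDist

def classical_ci_mean(sample, level=0.95):
    """CI for a mean: x_bar +/- z* * s / sqrt(n)."""
    n = len(sample)
    x_bar = statistics.mean(sample)                   # point estimate
    se = statistics.stdev(sample) / math.sqrt(n)      # standard error
    z_star = NormalDist().inv_cdf(0.5 + level / 2)    # about 1.96 for 95%
    margin = z_star * se                              # margin of error
    return x_bar - margin, x_bar + margin

sample = list(range(1, 101))  # hypothetical sample of 100 values
low, high = classical_ci_mean(sample)
print(round(low, 2), round(high, 2))
```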
population
the whole we are interested in
side by side box plot
two box plots one for each group
scatterplot
two quantitative variables
What are Confidence Interval Interpretations about
unknown population parameters NOT sample statistics or individuals
t distribution
Used for quantitative data (one mean, or paired data). Symmetric, bell shaped, and centered at zero, with more spread in the tails than the z distribution. Smaller sample: more variability, fatter tails. Degrees of freedom for one mean: df = n - 1.
sample
used to provide estimate of population average
standardized bar plot
Useful for understanding fractions and associations, and useful if the primary variable in the stacked bar plot is relatively unbalanced; the drawback is that we lose the sense of how many cases each bar represents.
quantitative discrete
A variable that is counted. Ex: # of absences; numbers with jumps between possible values; whole numbers.
quantitative continuous
variable that can take on any value within a range (usually measuring) ex sq footage, height, weight
best to use when data is symmetric
variance and standard deviation
randomized experiment
when individuals are randomly assigned to a group
statistic
A number calculated from a sample of data.
Parameter
A number calculated from an entire population: the true value. We use statistics to estimate the parameter.
we use a ____ confidence interval if we want to be more certain that we capture the parameter.
wider
cluster
Does not include every group in the population. 1. Break up the population into groups (clusters). 2. Sample a fixed # of clusters and include all observations from each of those clusters in the sample. Used because it is more economical and because of geographic limitations. Ex: the mayor of Gainesville would like to take a survey of Gainesville residents. He decides to send out pollsters to randomly selected city blocks and randomly select participants from each city block.
estimate mu using
x bar