BSTAT
Interval Scale
We can categorize and rank the data, and the differences between scale values are meaningful. (Fahrenheit scale, SAT score, credit score)
Explanatory variable
We use the information on the explanatory variable to predict and/or describe changes in the response variable. Also called the independent variable, predictor variable, control variable, or regressor.
Normal probability distribution
Also known as a normal distribution and bell shaped distribution. Most extensively used probability distribution in statistical work. It closely approximates the probability distribution for a wide range of random variables of interest. (height and weight of newborn babies, scores on the SAT, cumulative debt of college graduates, advertising expenditure of firms, rate of return on an investment). The normal distribution is symmetric around the mean. The mean, median, and mode are all equal for a normally distributed random variable. The normal distribution is completely described by two parameters, the population mean and the population variance. The normal distribution is asymptotic in the sense that the tails get closer and closer to the horizontal axis but never touch it.
Stratified random sampling
An attempt to ensure that each area of the country, each ethnic group, each religious group, and so forth, is appropriately represented in the sample.
Mean Absolute Deviation (MAD)
An average of the absolute differences between the observations and the mean. Sample MAD = Sum of |Xi - xbar| / n; population MAD = Sum of |Xi - M| / N.
Quantitative Variable
A variable that assumes meaningful numerical values; either discrete or continuous.
The required sample size when estimating the population mean
For a desired margin of error E, the minimum sample size n required to estimate a 100(1 - a)% confidence interval for the population mean M is n = (za/2 x sigma hat/E)^2, where sigma hat is a reasonable estimate of sigma in the planning stage.
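A minimal Python sketch of this calculation using scipy's normal quantile function; the planning estimate sigma hat = 11, margin of error E = 2, and 95% confidence level are illustrative numbers, not values from the text:

```python
import math
from scipy.stats import norm

confidence = 0.95             # illustrative confidence level
alpha = 1 - confidence
sigma_hat = 11.0              # illustrative planning estimate of sigma
E = 2.0                       # illustrative desired margin of error

z = norm.ppf(1 - alpha / 2)   # z_{a/2}
n = (z * sigma_hat / E) ** 2
print(math.ceil(n))           # round up to the next whole observation -> 117
```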
The required sample size when estimating the population proportion
For a desired margin of error E, the minimum sample size n required to estimate a 100(1 - a)% confidence interval for the population proportion p is n = (za/2/E)^2 x phat(1 - phat), where phat is a reasonable estimate of p in the planning stage.
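A similar sketch for the proportion case; phat = 0.5 is the most conservative planning estimate, and E = 0.03 is an illustrative margin of error:

```python
import math
from scipy.stats import norm

alpha = 0.05        # 95% confidence (illustrative)
p_hat = 0.5         # planning estimate of p; 0.5 maximizes the required n
E = 0.03            # illustrative desired margin of error

z = norm.ppf(1 - alpha / 2)
n = (z / E) ** 2 * p_hat * (1 - p_hat)
print(math.ceil(n))  # about 1068 for these inputs
```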
Residual
For a given value of x, the observed and the predicted values of the response variable are likely to differ, since many factors besides x influence y. The difference between the observed and the predicted value of y, that is, y - yhat, is called the residual.
Chebyshev's Theorem
For any data set, the proportion of observations that lie within k standard deviations from the mean is at least 1 - 1/k^2, where k is any number greater than 1. Applies to all data sets and can be used to estimate the percentage of values in a distribution that lie within k standard deviations of the mean.
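A short sketch of the bound; the values k = 2 and k = 3 are illustrative inputs:

```python
def chebyshev_lower_bound(k: float) -> float:
    # minimum proportion of observations within k standard deviations of the mean (k > 1)
    return 1 - 1 / k ** 2

print(chebyshev_lower_bound(2))  # 0.75     -> at least 75% within 2 standard deviations
print(chebyshev_lower_bound(3))  # 0.888... -> at least 88.9% within 3
```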
Cluster sampling
Formed by dividing the population into groups, such as geographic areas, and then selecting a sample of the groups for the analysis.
T distribution
If a random sample of size n is taken from a normal population with a finite variance, then the statistic tdf = (xbar - M)/(s/square root of n) follows the t distribution with (n - 1) degrees of freedom. A family of distributions that are similar to the z distribution except that they have broader tails. Bell shaped and symmetric around 0.
Big data
Massive volume of both structured and unstructured data that are extremely difficult to manage using traditional data processing tools.
Selection Bias
Occurs when portions of the population are underrepresented in the sample.
Nonresponse Bias
Occurs when those responding to a survey or poll differ systematically from the nonrespondents.
Social desirability bias
Occurs when voters provide incorrect answers to a survey or poll because they think that others will look unfavorably on their ultimate choices
The decision to reject or not reject the null hypothesis
On the basis of sample information, we either "reject the null hypothesis", or "do not reject the null hypothesis"
Variance
One of the most widely used measures of dispersion. The average of the squared differences between the observations and the mean. The squaring of differences from the mean emphasizes larger differences more than smaller ones; the MAD weighs large and small differences equally. Expressed in units squared and never negative. Sample variance s^2 = Sum of (Xi - xbar)^2 / (n - 1); population variance = Sum of (Xi - M)^2 / N.
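A small sketch contrasting the MAD and the variance on made-up data (the five observations are illustrative):

```python
import numpy as np

x = np.array([4.0, 7.0, 9.0, 10.0, 15.0])            # illustrative sample data
mean = x.mean()

mad = np.mean(np.abs(x - mean))                       # mean absolute deviation
sample_var = np.sum((x - mean) ** 2) / (len(x) - 1)   # sample variance: divide by n - 1
pop_var = np.sum((x - mean) ** 2) / len(x)            # population variance: divide by N

print(mean, mad, sample_var, pop_var)                 # 9.0 2.8 16.5 13.2
```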
Limitations of correlation analysis
Only captures a linear relationship -may not be reliable when outliers are present in one or both of variables -correlation does not imply causation
Critical value for a right tailed test
P(Z>=za)=a. The resulting rejection region includes values greater than Za.
Sampling distribution
Probability distribution of an estimator, such as the sample mean xbar.
Inferential statistics
Refers to drawing conclusions about a large set of data, called a population, based on a smaller set of sample data. Includes hypothesis testing and confidence intervals, which test whether sample results generalize to the population. Makes inferences about the population.
Descriptive Statistics
Refers to the summary of important aspects of a data set. Includes collecting the data, organizing the data, and then presenting the data in the form of charts and tables. Covers measures of central location, dispersion, and skewness. Describes the data.
Central Limit Theorem
States that the sum or the average of a large number of independent observations from the same underlying distribution has an approximate normal distribution. The normal distribution approximation is justified when n>=30.
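A quick simulation that illustrates the theorem; the exponential population and the sample size of 30 are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # a clearly skewed population

n = 30                                                 # n >= 30 justifies the approximation
sample_means = [rng.choice(population, size=n).mean() for _ in range(5_000)]

# The sample means cluster around the population mean, roughly normally,
# with spread close to sigma / sqrt(n).
print(np.mean(sample_means), population.mean())
print(np.std(sample_means), population.std() / np.sqrt(n))
```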
Interpreting the 95% confidence interval
Technically, the 95% confidence interval for the population mean M implies that for 95% of the samples, the procedure (formula) produces an interval that contains M.
Unstructured data (unmodeled data)
Tends to be textual (written reports, email messages, open-ended survey results). Does not conform to a row-column model.
Empirical Rule
The approximate percentage of observations that fall within 1, 2, or 3 standard deviations from the mean for a bell-shaped distribution. 1 deviation: 68%, 2 deviations: 95%, 3 deviations: 99.7% (almost all).
General format of the confidence interval for M and p
The confidence interval for the population mean and the population proportion is constructed as point estimate +- margin of error
Outliers
Extremely small or large values. In their presence, the mean can give a misleading description of the center of the distribution.
Degrees of freedom
The number of independent pieces of information that go into the calculation of a given statistic. Many probability distributions are identified by the degrees of freedom.
The standard error of the sample mean
The standard deviation of the sample mean xbar. It equals the population standard deviation divided by the square root of the sample size: se(xbar) = sigma/square root of n.
The standard error of the sample proportion
The standard deviation of the sample proportion: se(phat) = square root of p(1 - p)/n.
Test statistic for M when sigma is known
The value of the test statistic for the hypothesis test of the population mean M when the population standard deviation sigma is known is computed as z = (xbar - Mo)/(sigma/square root of n), where Mo is the hypothesized value of the population mean. Only valid if xbar follows a normal distribution.
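A minimal sketch of the z test statistic; the numbers (xbar = 52, Mo = 50, sigma = 8, n = 36) are illustrative:

```python
import numpy as np
from scipy.stats import norm

x_bar, mu_0, sigma, n = 52.0, 50.0, 8.0, 36
z = (x_bar - mu_0) / (sigma / np.sqrt(n))   # test statistic
print(z)                                    # 1.5

# Right-tailed test at a = 0.05: reject H0 if z exceeds the critical value z_a
print(z > norm.ppf(0.95))                   # False -> do not reject H0
```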
Test statistic for M when sigma is unknown
The value of the test statistic for the hypothesis test of the population mean M when the population standard deviation sigma is unknown is computed as tdf = (xbar - Mo)/(s/square root of n).
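When sigma is unknown, the t statistic can be computed directly or via scipy; the six observations and Mo = 50 below are illustrative:

```python
import numpy as np
from scipy.stats import ttest_1samp

sample = np.array([48.0, 53.0, 51.0, 49.0, 55.0, 50.0])   # illustrative data
mu_0 = 50.0

# scipy computes t_df = (xbar - Mo) / (s / sqrt(n)) with df = n - 1
t_stat, p_value = ttest_1samp(sample, popmean=mu_0)
print(t_stat, p_value)
```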
Hypothesis test for the population proportion
The variable of interest is qualitative rather than quantitative. The population proportion p is the essential descriptive measure. The parameter p represents the proportion of observations with a particular attribute.
Z score
Use the z score to find the relative position of a sample value within the data set by dividing the deviation of the sample value from the mean by the standard deviation. Measures the distance of a given sample value from the mean in terms of standard deviations. Converting sample data into z scores is also called standardizing the data. Almost all observations fall within three standard deviations of the mean; a value more than three standard deviations from the mean may be considered an outlier. z = (x - xbar)/s for a sample, or z = (x - M)/sigma for a population.
Skew
When the mean is greater than the median, the set is positively skewed. When the mean is less than the median, the set is negatively skewed
Variable
a characteristic of interest that differs in kind or degree among the observations
If sample correlation coefficient= -1
a perfect negative linear relationship exists if rxy equals -1.
If the minimum and maximum values of the population are available
a rough approximation for the population standard deviation is given by sigma hat = range/4
Standard normal distribution
a special case of the normal distribution with a mean equal to zero and a standard deviation (or variance) equal to one.
margin of error
a value that accounts for the standard error of the estimator and the desired confidence level of the interval
Method of least squares
also referred to as ordinary least squares. We use OLS to estimate the parameters Bo and B1. Chooses the line whereby the error sum of squares is minimized. Produces the straight line that is closest to the data
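A sketch of the OLS estimates computed from their textbook formulas; the x and y values are made up for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative explanatory variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # illustrative response variable

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope estimate
b0 = y.mean() - b1 * x.mean()                                               # intercept estimate

y_hat = b0 + b1 * x
sse = np.sum((y - y_hat) ** 2)   # the error sum of squares that OLS minimizes
print(b0, b1, sse)
```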
Mean and standard error of the sample proportion Pbar
are given by E(Pbar)=p and se(Pbar)= square root of p(1-p)/n, respectively.
Discrete random variable
assumes a countable number of distinct values such as x1,x2, x3, and so on.
Test of individual significance
can be implemented in the context of the simple and multiple regression models
Regression Analysis
captures the causal relationship between two or more variables. With a single explanatory variable, it is referred to as the simple linear regression model. One of the most widely used statistical methodologies in business, engineering, and the social sciences.
Continuous
characterized by uncountable values within an interval. (Weight, height, time, investment return)
Type 2 error
committed when we fail to reject the null hypothesis when the null hypothesis is actually false.
Type 1 error
committed when we reject the null hypothesis when the null hypothesis is actually true
Alternative hypothesis
contradicts the default state or status quo. We use the alternative hypothesis as a vehicle to establish something new, that is, to contest the status quo. In most applications, the null hypothesis regarding a particular population parameter of interest is specified with one of the following signs: =, <=, >=; the alternative hypothesis is then specified with the corresponding opposite sign: ≠, >, <.
Null hypothesis
corresponds to a presumed default state of nature or the status quo.
Discrete
countable number of values. (number of children in a family, number of points scored in a basketball game)
Cross sectional data
data collected by recording a characteristic of many subjects at the same point in time, or without regard to differences in time. (current price of gasoline in different states across the country)
Time Series Data
data collected over several time periods focusing on certain groups of people, specific events, or objects. (hourly body temperature)
Sample proportion Pbar
estimates the population proportion p. The normal approximation for its sampling distribution is considered valid when np >= 5 and n(1 - p) >= 5. Since p is not known, we typically check the sample size requirement under the hypothesized value of the population proportion po. In most applications, the sample size is large and the normal distribution approximation is justified. However, when the sample size is not deemed large enough, the statistical methods suggested here for inference regarding the population proportion are no longer valid.
central limit theorem for the sample proportion
for any population proportion p, the sampling distribution of phat is approximately normal if the sample size n is sufficiently large. The normal distribution approximation is justified when np>=5, and n(1-p)>=5
Sample correlation coefficient
gauges the direction and the strength of the linear relationship between two variables x and y. We should only comment on the direction of the relationship if the correlation coefficient is found to be statistically significant
Structured data
generally refers to data that have a well-defined length and format. Information that is not open to interpretation (numbers, dates, groups of words).
Scatter Plot
graphically shows the relationship between two variables
Qualitative data
labels or names used to identify the distinguishing characteristics of each observation
Critical value approach
makes the comparison directly in terms of the value of the test statistic. Specifies a region of values, such that if the value of the test statistic falls into this region, then we reject the null hypothesis. The critical value is a point that separates the rejection region from the non-rejection region.
Median
measure of central location; the middle value of the data set. If the mean and median differ significantly, it is likely that the data are skewed or contain outliers. When outliers are present, the median most accurately reflects the center.
Sample covariance
measures the direction of the linear relationship between two variables, x and y. Cannot comment on the strength of the linear relationship.
Multiple linear regression model
more than one explanatory variable is presumed to have a linear relationship with the response variable: y = Bo + B1x1 + B2x2 + ... + Bkxk + e
Sample variance of the residual
numerical measure that gauges dispersion from the sample regression equation, denoted s^2e. Average squared differences between yi and yhat.
Ordinal Scale
reflects a stronger level of measurement. Able to both categorize and rank the data with respect to some characteristic or trait (satisfaction rating, economic status)
Weighted mean
relevant when some observations contribute more than others. For example: a student is often evaluated on the basis of the weighted mean since the score on the final exam is typically worth more than the score on the midterm
Nominal scale
represents the least sophisticated level of measurement. All we can do is group or categorize the data by name or label (religious affiliation, gender, home town)
Ratio Scale
represents the strongest level of measurement. Have all the characteristics of interval scale, as well as a true zero point. (weight, time, distance, sales, profits)
Goodness of fit
summarizes how well the sample regression equation fits the data. If each predicted value yhat equals its observed value y, then we have a perfect fit.
Random Variable
summarizes outcomes of an experiment with numerical values
a
the allowed probability of making a Type 1 error; also called the significance level. The smaller the a at which we can reject the null hypothesis, the stronger the evidence that the null hypothesis is false.
Range
the simplest measure of dispersion; it is the difference between the maximum value and the minimum value in a data set.
SSE
the sum of the squared differences between the observed values y and their predicted values yhat or, equivalently, the sum of the squared distances from the regression equation.
If sample correlation coefficient= 1
then a perfect positive linear relationship exists between x and y.
If rxy equals zero
then no linear relationship exists between x and y.
Simple linear regression model
uses one explanatory variable, denoted x, to explain the variation in the response variable, denoted y: y = Bo + B1x + e, where Bo is the unknown intercept and B1 is the unknown slope parameter.
Mode
value in the data set that occurs most frequently. The mode's usefulness as a measure of central location tends to diminish with data sets that have more than three modes. It is the only meaningful measure of central location for qualitative data.
Regression analysis
we change the emphasis from correlation to causation
Estimator and Estimate
when a statistic is used to estimate a parameter, it is referred to as an estimator. A particular value of the estimator is called an estimate.
Positive covariance
when one variable is above its mean, the other variable is also above its mean.
Negative covariance
when x is above its mean, y is below its mean.
Confidence interval for the population mean when sigma is known
xbar +- za/2 x (sigma/square root of n)
Confidence interval for the population mean when sigma is unknown
xbar +- ta/2,df x (s/square root of n)
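A sketch of both interval formulas above; xbar = 25, n = 36, a = 0.05, sigma = 6, and s = 6.5 are illustrative numbers:

```python
import numpy as np
from scipy.stats import norm, t

x_bar, n, alpha = 25.0, 36, 0.05

# Case 1: sigma known -> use za/2
sigma = 6.0
moe_z = norm.ppf(1 - alpha / 2) * sigma / np.sqrt(n)
print(x_bar - moe_z, x_bar + moe_z)

# Case 2: sigma unknown -> use ta/2 with df = n - 1 and the sample standard deviation s
s = 6.5
moe_t = t.ppf(1 - alpha / 2, df=n - 1) * s / np.sqrt(n)
print(x_bar - moe_t, x_bar + moe_t)
```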
Sample regression equation for the simple linear regression model
yhat = bo + b1x, where bo and b1 are estimates of Bo and B1. Provides a good fit when the dispersion of the residuals is relatively small.
Test statistic of p
z = (Pbar - po)/square root of po(1 - po)/n
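A short sketch of the proportion test statistic; Pbar = 0.56, po = 0.50, and n = 200 are illustrative, and the p value shown assumes a right-tailed alternative:

```python
import numpy as np
from scipy.stats import norm

p_bar, p0, n = 0.56, 0.50, 200
assert n * p0 >= 5 and n * (1 - p0) >= 5      # normal-approximation check under po

z = (p_bar - p0) / np.sqrt(p0 * (1 - p0) / n)
p_value = 1 - norm.cdf(z)                     # right-tailed test: H1: p > po
print(z, p_value)
```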
Standard deviation
statistic that tells you how tightly all the various examples are clustered around the mean in a set of data. Defined as the square root of the variance.
Coefficient of determination
The proportion of the sample variation in the response variable that is explained by the sample regression equation. Referred to as R^2. Easier to interpret than the standard error of the estimate.
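Continuing the illustrative OLS sketch from the Method of least squares entry, R^2 can be computed as 1 - SSE/SST:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)      # unexplained variation
sst = np.sum((y - y.mean()) ** 2)   # total variation
print(1 - sse / sst)                # R^2 close to 1 -> the line fits these data well
```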
One tailed versus two tailed hypothesis tests
In a one tailed test, we can reject the null hypothesis only on one side of the hypothesized value of the population parameter. In a two tailed test, we can reject the null hypothesis on either side of the hypothesized value of the population parameter.
confidence interval
A confidence interval, or interval estimate, provides a range of values that, with a certain level of confidence, contains the population parameter of interest.
Simple random sample
A sample of n observations that has the same probability of being selected from the population as any other sample of n observations. Most statistical methods presume simple random samples.
Symmetry
If one side of the histogram is a mirror image of the other side. If the distribution is symmetric, the mean, median, and mode are equal.
Confidence intervals and two tailed hypothesis tests
If the confidence interval does not contain the hypothesized value of the population mean Mo, then we reject the null hypothesis. If the confidence interval contains Mo, then we do not reject the null hypothesis.
Stochastic
If the relationship is not deterministic, then it is stochastic
Deterministic
If the value of the response variable is uniquely determined by the values of the explanatory variables
Hypothesis test for the population mean when sigma is unknown
In most business applications, sigma is not known and we have to replace sigma with the sample standard deviation s to estimate the standard error of Xbar
Stratified vs Cluster
In stratified, the sample consists of observations from each group, whereas in cluster, the sample consists of observations from the selected groups. Stratified used to increase precision, cluster used to save money.
Response variable
influenced or caused by other variables. Referred to as the dependent variable
Confidence interval for population proportion
pbar +- za/2 x square root of pbar(1 - pbar)/n
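A minimal sketch of the proportion interval; pbar = 0.42, n = 150, and a 95% confidence level are illustrative:

```python
import numpy as np
from scipy.stats import norm

p_bar, n, alpha = 0.42, 150, 0.05
moe = norm.ppf(1 - alpha / 2) * np.sqrt(p_bar * (1 - p_bar) / n)
print(p_bar - moe, p_bar + moe)
```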
Parameter
a descriptive measure of the population, such as the population mean M; computed from all N observations
Mean
primary measure of central location. The average. sample mean: x bar, population mean: M
Z table
provides areas (probabilities) under the z curve. The left hand page provides cumulative probabilities for z values less than or equal to zero. The right hand page shows cumulative probabilities for z values greater than or equal to zero.
dummy variable
a variable that takes on a value of 1 or 0; used to represent a qualitative explanatory variable with two categories.
Standard error of the estimate
residual e represents the difference between an observed value and the predicted value of the response variable, that is, e = y - yhat. If all the data points fell on the line, then each residual would be zero; in other words, there would be no dispersion between the observed and the predicted values. A useful goodness-of-fit measure when comparing various models; the model with the smaller se provides the better relative fit. It is less useful when assessing a single model.
Point estimator
a statistic, such as the sample mean, used to produce a single value (point estimate) of an unknown population parameter
Statistic
a descriptive measure of the sample, such as the sample mean; used to estimate the corresponding population parameter
P value approach
the decision rule is to reject the null hypothesis if the p value < a and not reject the null hypothesis if the p value >= a (commonly 5%)
P value
the likelihood of obtaining a sample mean that is at least as extreme as the one derived from the given sample, under the assumption that the null hypothesis is true as an equality. The observed probability of making a Type 1 error
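A last sketch tying the p value to the decision rule above, for a two-tailed z test; the test statistic z = 2.10 and a = 0.05 are illustrative:

```python
from scipy.stats import norm

z, alpha = 2.10, 0.05
p_value = 2 * (1 - norm.cdf(abs(z)))   # two-tailed p value for a z test
print(p_value, "reject H0" if p_value < alpha else "do not reject H0")
```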