BSTAT

Interval Scale

We can categorize and rank the data, and the differences between scale values are meaningful. However, the zero point is arbitrary, so ratios of values are not meaningful. (Fahrenheit scale, SAT score, credit score)

Explanatory variable

We use the information on the explanatory variables to predict and/or describe changes in the response variable. Also called independent variables, predictor variables, control variables, or regressors.

Normal probability distribution

Also known as a normal distribution and bell shaped distribution. Most extensively used probability distribution in statistical work. It closely approximates the probability distribution for a wide range of random variables of interest. (height and weight of newborn babies, scores on the SAT, cumulative debt of college graduates, advertising expenditure of firms, rate of return on an investment). The normal distribution is symmetric around the mean. The mean, median, and mode are all equal for a normally distributed random variable. The normal distribution is completely described by two parameters, the population mean and the population variance. The normal distribution is asymptotic in the sense that the tails get closer and closer to the horizontal axis but never touch it.

Stratified random sampling

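The population is divided into mutually exclusive and collectively exhaustive groups, called strata, and a random sample is drawn from each stratum. It is an attempt to ensure that each area of the country, each ethnic group, each religious group, and so forth, is appropriately represented in the sample.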

Mean Absolute Deviation (MAD)

An average of the absolute differences between the observations and the mean. Sample MAD: $\frac{\sum |x_i - \bar{x}|}{n}$. Population MAD: $\frac{\sum |x_i - \mu|}{N}$.
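As a minimal sketch of the definition above, the MAD can be computed by averaging the absolute distances from the mean. The function name and the data values are made up for illustration.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of each observation from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

# Hypothetical observations; their mean is 6.
data = [2, 4, 6, 8, 10]
mad = mean_absolute_deviation(data)
print(mad)
```

The absolute value is what keeps positive and negative deviations from cancelling out.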

Quantitative Variable

Either discrete or continuous.

The required sample size when estimating the population mean

For a desired margin of error $E$, the minimum sample size $n$ required to estimate a $100(1-\alpha)\%$ confidence interval for the population mean $\mu$ is $n = \left(\frac{z_{\alpha/2}\,\hat{\sigma}}{E}\right)^2$, where $\hat{\sigma}$ is a reasonable estimate of $\sigma$ in the planning stage.

The required sample size when estimating the population proportion

For a desired margin of error $E$, the minimum sample size $n$ required to estimate a $100(1-\alpha)\%$ confidence interval for the population proportion $p$ is $n = \left(\frac{z_{\alpha/2}}{E}\right)^2 \hat{p}(1-\hat{p})$, where $\hat{p}$ is a reasonable estimate of $p$ in the planning stage.
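The two sample size formulas above can be sketched as follows. This is only an illustration: the function names and the planning values (a standard deviation estimate of 10, a proportion estimate of 0.5) are assumptions, and the result is rounded up because a sample size must be a whole number.

```python
import math
from statistics import NormalDist

def z_critical(confidence):
    """Critical value z_{alpha/2} for a two-sided interval."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def sample_size_mean(sigma_hat, margin, confidence=0.95):
    """Minimum n for estimating a population mean within +/- margin."""
    z = z_critical(confidence)
    return math.ceil((z * sigma_hat / margin) ** 2)

def sample_size_proportion(p_hat, margin, confidence=0.95):
    """Minimum n for estimating a population proportion within +/- margin."""
    z = z_critical(confidence)
    return math.ceil((z / margin) ** 2 * p_hat * (1 - p_hat))

# Hypothetical planning values.
print(sample_size_mean(sigma_hat=10, margin=2))          # 97
print(sample_size_proportion(p_hat=0.5, margin=0.05))    # 385
```

Using 0.5 as the planning value for the proportion gives the most conservative (largest) sample size.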

Residual

For a given value of x, the observed and the predicted values of the response variable are likely to be different, since many factors besides x influence y. The difference between the observed and the predicted values of y, that is $e = y - \hat{y}$, is called the residual.

Chebyshev's Theorem

For any data set, the proportion of observations that lie within $k$ standard deviations from the mean is at least $1 - \frac{1}{k^2}$, where $k$ is any number greater than 1. Applies to all data sets. Used to estimate the minimum percentage of values in a distribution that lie within $k$ standard deviations of the mean.
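A small sketch of the bound stated above, assuming a hypothetical helper name. It only evaluates the formula for a given number of standard deviations.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k standard deviations."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

for k in (2, 3):
    print(k, chebyshev_lower_bound(k))
```

For k = 2 the bound is 0.75, so at least 75% of the observations in any data set lie within two standard deviations of the mean.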

Cluster sampling

Formed by dividing the population into groups (clusters), such as geographic areas, and then selecting a sample of the groups for the analysis.

T distribution

If a random sample of size $n$ is taken from a normal population with a finite variance, then the statistic $T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$ follows the t distribution with $n-1$ degrees of freedom. The t distribution is a family of distributions that are similar to the z distribution except that they have broader tails. Each is bell-shaped and symmetric around 0.

Big data

Massive volume of both structured and unstructured data that are extremely difficult to manage using traditional data processing tools.

Selection Bias

Occurs when portions of the population are underrepresented in the sample.

Nonresponse Bias

Occurs when those responding to a survey or poll differ systematically from the nonrespondents.

Social desirability bias

Occurs when voters provide incorrect answers to a survey or poll because they think that others will look unfavorably on their ultimate choices

The decision to reject or not reject the null hypothesis

On the basis of sample information, we either "reject the null hypothesis", or "do not reject the null hypothesis"

Variance

One of the most widely used measures of dispersion. The average of the squared differences between the observations and the mean. Squaring the differences from the mean emphasizes larger differences more than smaller ones, whereas the MAD weighs large and small differences equally. Expressed in squared units and never negative. Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$. Population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$.
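To make the two divisors concrete, here is a minimal sketch that computes the variance both ways. The function names and data are illustrative; dividing by n - 1 is the sample version and dividing by N is the population version.

```python
def sum_of_squared_deviations(values):
    """Sum of squared differences between each observation and the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

def population_variance(values):
    return sum_of_squared_deviations(values) / len(values)

def sample_variance(values):
    return sum_of_squared_deviations(values) / (len(values) - 1)

data = [2, 4, 6, 8, 10]  # hypothetical observations
print(population_variance(data))  # 8.0
print(sample_variance(data))      # 10.0
```

Python's standard `statistics` module offers `pvariance` and `variance` for the same two quantities.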

Limitations of correlation analysis

Captures only a linear relationship; may not be reliable when outliers are present in one or both of the variables; correlation does not imply causation.

Critical value for a right tailed test

$P(Z \ge z_\alpha) = \alpha$. The resulting rejection region includes values greater than $z_\alpha$.
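As a sketch, the right-tail critical value can be looked up with the standard library instead of a z table. The function name is an assumption made for this example.

```python
from statistics import NormalDist

def right_tail_critical_value(alpha):
    """Return the z value with area alpha to its right under the z curve."""
    return NormalDist().inv_cdf(1 - alpha)

print(round(right_tail_critical_value(0.05), 3))  # 1.645
```

With a 5% significance level, the null hypothesis is rejected when the test statistic exceeds about 1.645.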

Sampling distribution

Probability distribution of the sample mean, xbar

Inferential statistics

Refers to drawing conclusions about a large set of data, called a population, based on a smaller set of sample data. Includes hypothesis testing and confidence intervals, which assess whether results from the sample generalize to the population. Makes inferences about the data.

Descriptive Statistics

Refers to the summary of important aspects of a data set. Includes collecting data, organizing the data, and then presenting the data in the form of charts and tables. Measures of centrality, dispersion, skewness. Describes data

Central Limit Theorem

States that the sum or the average of a large number of independent observations from the same underlying distribution has an approximate normal distribution. The normal distribution approximation is justified when n>=30.
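A quick simulation can illustrate the theorem. This sketch assumes an Exponential(1) population (strongly skewed, with mean 1 and standard deviation 1), a sample size of 30, and a fixed seed; none of these choices come from the card itself.

```python
import math
import random

def simulate_sample_means(n, reps, seed=0):
    """Draw `reps` samples of size n from an Exponential(1) population
    and return the mean of each sample."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n
            for _ in range(reps)]

means = simulate_sample_means(n=30, reps=2000)
center = sum(means) / len(means)
spread = math.sqrt(sum((m - center) ** 2 for m in means) / (len(means) - 1))

# Compare with the values implied by theory: mean 1 and 1 / sqrt(30).
print("simulated:", round(center, 3), round(spread, 3))
print("theory:   ", 1.0, round(1 / math.sqrt(30), 3))
```

Even though the population is skewed, the sample means cluster around the population mean, with a spread close to the population standard deviation divided by the square root of n.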

Interpreting the 95% confidence interval

Technically, the 95% confidence interval for the population mean $\mu$ implies that for 95% of the samples, the procedure (formula) produces an interval that contains $\mu$.

Unstructured data (unmodeled data)

Tends to be textual (written reports, email messages, open-ended survey results). Does not conform to a row-column model.

Empirical Rule

For approximately bell-shaped (normal) data, gives the approximate percentage of observations that fall within 1, 2, or 3 standard deviations of the mean. 1 standard deviation: 68%, 2 standard deviations: 95%, 3 standard deviations: almost 100% (99.7%).
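The percentages in the rule can be checked against the standard normal distribution. This is a small sketch using only the standard library; the helper name is made up.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

def share_within(k):
    """Probability that a normal observation lies within k standard deviations."""
    return Z.cdf(k) - Z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The results are roughly 0.68, 0.95, and 0.997, which is where the rounded percentages in the rule come from.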

General format of the confidence interval for $\mu$ and $p$

The confidence interval for the population mean and the population proportion is constructed as point estimate ± margin of error.

Outliers

Extremely small or large values in a data set. In their presence, the mean can give a misleading description of the center of the distribution.

Degrees of freedom

The number of independent pieces of information that go into the calculation of a given statistic. Many probability distributions are identified by the degrees of freedom.

The standard error of the sample mean

The standard deviation of the sample mean, $\bar{X}$. It equals the population standard deviation divided by the square root of the sample size: $se(\bar{X}) = \frac{\sigma}{\sqrt{n}}$.

The standard error of the sample proportion

The standard deviation of the sample proportion: $se(\hat{P}) = \sqrt{\frac{p(1-p)}{n}}$.

Test statistic for $\mu$ when $\sigma$ is known

The value of the test statistic for the hypothesis test of the population mean $\mu$ when the population standard deviation $\sigma$ is known is computed as $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$, where $\mu_0$ is the hypothesized value of the population mean. Only valid if $\bar{X}$ follows a normal distribution.

Test statistic for $\mu$ when $\sigma$ is unknown

The value of the test statistic for the hypothesis test of the population mean $\mu$ when the population standard deviation $\sigma$ is unknown is computed as $t_{df} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$, where $df = n - 1$.
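The calculation above can be sketched as follows, assuming a small made-up sample and a hypothesized mean of 5. The function name is illustrative; it returns the statistic together with its degrees of freedom.

```python
import math
from statistics import mean, stdev

def t_statistic(sample, mu0):
    """Return (t, df) for a test of the population mean with sigma unknown."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)  # s / sqrt(n)
    return (mean(sample) - mu0) / standard_error, n - 1

sample = [2, 4, 6, 8, 10]  # hypothetical observations
t, df = t_statistic(sample, mu0=5)
print(round(t, 4), df)  # 0.7071 4
```

The resulting value would then be compared with a critical value from the t distribution with the same degrees of freedom.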

Hypothesis test for the population proportion

The variable of interest is qualitative rather than quantitative. The population proportion p is the essential descriptive measure. The parameter p represents the proportion of observations with a particular attribute.

Z score

Use the z score to find the relative position of a sample value within the data set by dividing the deviation of the sample value from the mean by the standard deviation. Measures the distance of a given sample value from the mean in terms of standard deviations. Converting sample data into z scores is also called standardizing the data. Almost all observations fall within three standard deviations of the mean; a value more than three standard deviations away could be considered an outlier. $z = \frac{x - \bar{x}}{s}$
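Standardizing a data set can be sketched in a few lines. The data and the function name are assumptions for illustration; the sample standard deviation is used as the divisor.

```python
from statistics import mean, stdev

def z_scores(data):
    """Convert each observation to its distance from the mean in standard deviations."""
    center, spread = mean(data), stdev(data)
    return [(x - center) / spread for x in data]

data = [2, 4, 6, 8, 10]  # hypothetical observations
scores = z_scores(data)
print([round(z, 3) for z in scores])

# Flag values more than three standard deviations from the mean.
outliers = [x for x, z in zip(data, scores) if abs(z) > 3]
print(outliers)
```

An observation equal to the mean gets a z score of zero, and the z scores of a data set always sum to zero.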

Skew

When the mean is greater than the median, the set is positively skewed. When the mean is less than the median, the set is negatively skewed

Variable

A characteristic of interest that differs in kind or degree among various observations.

If sample correlation coefficient= -1

a perfect negative linear relationship exists if rxy equals -1.

If the minimum and maximum values of the population are available

a rough approximation for the population standard deviation is given by $\hat{\sigma} = \text{range}/4$.

Standard normal distribution

a special case of the normal distribution with a mean equal to zero and a standard deviation (or variance) equal to one.

margin of error

a value that accounts for the standard error of the estimator and the desired confidence level of the interval

Method of least squares

Also referred to as ordinary least squares (OLS). We use OLS to estimate the parameters $\beta_0$ and $\beta_1$. Chooses the line whereby the error sum of squares (SSE) is minimized. Produces the straight line that is closest to the data.
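For a single explanatory variable, the least squares estimates have a closed form: the slope is the sum of cross-deviations divided by the sum of squared x deviations, and the intercept makes the line pass through the point of means. The sketch below assumes made-up data and a hypothetical function name.

```python
def ols_fit(x, y):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = ols_fit(x, y)
print(round(b0, 2), round(b1, 2))  # 2.2 0.6
```

Any other straight line through these points would give a larger sum of squared residuals.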

Mean and standard error of the sample proportion Pbar

are given by $E(\bar{P}) = p$ and $se(\bar{P}) = \sqrt{\frac{p(1-p)}{n}}$, respectively.

Discrete random variable

assumes a countable number of distinct values such as x1,x2, x3, and so on.

Test of individual significance

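Determines whether an individual explanatory variable has a statistically significant influence on the response variable, typically by testing whether its slope coefficient differs from zero. Can be implemented in the context of the simple and multiple regression models.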

Regression Analysis

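Captures the relationship between two or more variables, in which a response variable is presumed to be influenced by one or more explanatory variables. With a single explanatory variable, it is referred to as the simple linear regression model. One of the most widely used statistical methodologies in business, engineering, and the social sciences.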

Continuous

characterized by uncountable values within an interval. (Weight, height, time, investment return)

Type 2 error

committed when we fail to reject the null hypothesis when the null hypothesis is actually false.

Type 1 error

committed when we reject the null hypothesis when the null hypothesis is actually true

Alternative hypothesis

Contradicts the default state or status quo. We use the alternative hypothesis as a vehicle to establish something new, that is, to contest the status quo. In most applications, the null hypothesis regarding a particular population parameter of interest is specified with one of the following signs: =, ≤, ≥. The alternative hypothesis is then specified with the corresponding opposite sign: ≠, >, <.

Null hypothesis

Corresponds to a presumed default state of nature or status quo.

Discrete

countable number of values. (number of children in a family, number of points scored in a basketball game)

Cross sectional data

data collected by recording a characteristic of many subjects at the same point in time, or without regard to differences in time. (current price of gasoline in different states across the country)

Time Series Data

data collected over several time periods focusing on certain groups of people, specific events, or objects. (hourly body temperature)

Sample proportion Pbar

Estimates the population proportion $p$. The normal approximation for its sampling distribution is considered valid when $np \ge 5$ and $n(1-p) \ge 5$. Since $p$ is not known, we typically test the sample size requirement under the hypothesized value of the population proportion, $p_0$. In most applications, the sample size is large and the normal distribution approximation is justified. However, when the sample size is not deemed large enough, the statistical methods suggested here for inference regarding the population proportion are no longer valid.

central limit theorem for the sample proportion

For any population proportion $p$, the sampling distribution of $\hat{P}$ is approximately normal if the sample size $n$ is sufficiently large. The normal distribution approximation is justified when $np \ge 5$ and $n(1-p) \ge 5$.

Sample correlation coefficient

gauges the direction and the strength of the linear relationship between two variables x and y. We should only comment on the direction of the relationship if the correlation coefficient is found to be statistically significant

Structured data

Generally refers to data that has a well-defined length and format. Information that is not open to interpretation (numbers, dates, groups of words).

Scatter Plot

graphically shows the relationship between two variables

Qualitative data

labels or names to identify the distinguishing characteristics of each observation

Critical value approach

makes the comparison directly in terms of the value of the test statistic. Specifies a region of values, such that if the value of the test statistic falls into this region, then we reject the null hypothesis. The critical value is a point that separates the rejection region from the non-rejection region.

Median

Measure of central location; the middle value of the data set. If the mean and median differ significantly, it is likely that outliers are present. If there is an outlier, then the median reflects the center more accurately than the mean does.

Sample covariance

measures the direction of the linear relationship between two variables, x and y. Cannot comment on the strength of the linear relationships
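The sample covariance and the sample correlation coefficient can be sketched side by side. The correlation is the covariance divided by the product of the two sample standard deviations, which is why it can speak to strength while the covariance cannot. The data and function names are made up for illustration.

```python
import math

def sample_covariance(x, y):
    """Sum of cross-deviations from the means, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """Covariance scaled by both standard deviations; always between -1 and 1."""
    s_x = math.sqrt(sample_covariance(x, x))  # covariance of x with itself is its variance
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

# Hypothetical observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(sample_covariance(x, y))               # 1.5
print(round(sample_correlation(x, y), 4))    # 0.7746
```

The covariance changes if the units of x or y change, while the correlation coefficient is unit-free.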

Multiple linear regression model

More than one explanatory variable is presumed to have a linear relationship with the response variable: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$.

Sample variance of the residual

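Numerical measure that gauges dispersion from the sample regression equation, denoted $s_e^2$. It is an average of the squared differences between $y_i$ and $\hat{y}_i$, computed as $s_e^2 = \frac{SSE}{n-k-1}$, where $k$ is the number of explanatory variables.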

Ordinal Scale

Reflects a stronger level of measurement than the nominal scale. Able to both categorize and rank the data with respect to some characteristic or trait (satisfaction rating, economic status).

Weighted mean

relevant when some observations contribute more than others. For example: a student is often evaluated on the basis of the weighted mean since the score on the final exam is typically worth more than the score on the midterm
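The exam example above can be sketched as follows. The scores and weights (midterm worth 40%, final worth 60%) are invented for illustration.

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

scores = [80, 90]      # hypothetical midterm and final exam scores
weights = [0.4, 0.6]   # hypothetical weights: the final counts more
print(weighted_mean(scores, weights))
```

Here the weighted mean is 86, closer to the final exam score than the simple average of 85, because the final carries more weight.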

Nominal scale

Represents the least sophisticated level of measurement. All we can do is group or categorize the data, that is, name or label it (religious affiliation, gender, home town).

Ratio Scale

Represents the strongest level of measurement. Has all the characteristics of the interval scale, as well as a true zero point. (weight, time, distance, sales, profits)

Goodness of fit

Goodness-of-fit measures summarize how well the sample regression equation fits the data. If each predicted value $\hat{y}$ is equal to its observed value $y$, then we have a perfect fit.

Random Variable

summarizes outcomes of an experiment with numerical values

α (significance level)

The allowed probability of making a Type 1 error. The smaller the α at which we reject the null, the stronger the evidence that the null hypothesis is false.

Range

the simplest measure of dispersion; it is the difference between the maximum value and the minimum value in a data set.

SSE

The sum of the squared differences between the observed values $y$ and their predicted values $\hat{y}$ or, equivalently, the sum of the squared distances from the regression equation: $SSE = \sum (y_i - \hat{y}_i)^2$.

If sample correlation coefficient= 1

then a perfect positive linear relationship exists between x and y.

If rxy equals zero

then no linear relationship exists between x and y.

Simple linear regression model

Uses one explanatory variable, denoted $x$, to explain the variation in the response variable, denoted $y$: $y = \beta_0 + \beta_1 x + \varepsilon$, where $\beta_0$ is the unknown intercept, $\beta_1$ is the unknown slope parameter, and $\varepsilon$ is the random error term.

Mode

The value in the data set that occurs most frequently. The mode's usefulness as a measure of central location tends to diminish with data sets that have more than three modes. For qualitative data, it is the only meaningful measure of central location.

Regression analysis

we change the emphasis from correlation to causation

Estimator and Estimate

when a statistic is used to estimate a parameter, it is referred to as an estimator. A particular value of the estimator is called an estimate.

Positive covariance

when one variable is above its mean, the other variable is also above its mean.

Negative covariance

when x is above its mean, y is below its mean.

Confidence interval for the population mean when $\sigma$ is known

$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$

Confidence interval for the population mean when $\sigma$ is unknown

$\bar{x} \pm t_{\alpha/2,df} \frac{s}{\sqrt{n}}$, where $df = n - 1$
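Both intervals for the population mean follow the pattern point estimate ± margin of error, and can be sketched as below. The critical value comes from the z distribution when the population standard deviation is known and from the t distribution when it is replaced by the sample standard deviation. Python's standard library has no t distribution, so this sketch assumes the t critical value is read from a t table and passed in. All numbers and function names are hypothetical.

```python
import math
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence=0.95):
    """Interval for the mean when the population standard deviation is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin

def t_interval(xbar, s, n, t_crit):
    """Interval for the mean when sigma is unknown; t_crit is taken
    from a t table with n - 1 degrees of freedom."""
    margin = t_crit * s / math.sqrt(n)
    return xbar - margin, xbar + margin

# Hypothetical sample: mean 50, standard deviation 10, n = 25.
print(z_interval(50, 10, 25))
# 2.064 is the table value for a 95% interval with 24 degrees of freedom.
print(t_interval(50, 10, 25, t_crit=2.064))
```

The t interval is slightly wider than the z interval for the same data, reflecting the extra uncertainty from estimating the standard deviation.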

Sample regression equation for the simple linear regression model

$\hat{y} = b_0 + b_1 x$, where $b_0$ and $b_1$ are estimates of $\beta_0$ and $\beta_1$. Provides a good fit when the dispersion of the residuals is relatively small.

Test statistic of p

$z = \frac{\bar{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$
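A minimal sketch of this test statistic, with a check of the sample size requirement under the hypothesized proportion. The sample values (a sample proportion of 0.56 from 100 observations, hypothesized proportion 0.5) and the function name are invented.

```python
import math

def proportion_z_statistic(p_bar, p0, n):
    """z statistic for a test of the population proportion."""
    if n * p0 < 5 or n * (1 - p0) < 5:
        raise ValueError("sample too small for the normal approximation")
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    return (p_bar - p0) / standard_error

z = proportion_z_statistic(p_bar=0.56, p0=0.5, n=100)
print(round(z, 2))  # 1.2
```

Note that the standard error uses the hypothesized proportion, not the sample proportion, because the test is carried out assuming the null hypothesis is true.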

Standard deviation

A statistic that tells you how tightly the observations are clustered around the mean in a set of data. Defined as the square root of the variance.

Coefficient of determination

The proportion of the sample variation in the response variable that is explained by the sample regression equation. Referred to as $R^2$. Easier to interpret than the standard error of the estimate.
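One common way to compute this measure is one minus the ratio of the unexplained variation (SSE) to the total variation in the response (SST). The sketch below fits a simple regression by least squares and then applies that formula. The data and function names are hypothetical.

```python
def ols_fit(x, y):
    """Least squares intercept and slope for one explanatory variable."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1

def r_squared(x, y):
    """Share of the variation in y explained by the fitted line."""
    b0, b1 = ols_fit(x, y)
    y_bar = sum(y) / len(y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
    sst = sum((yi - y_bar) ** 2 for yi in y)                       # total
    return 1 - sse / sst

# Hypothetical observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(r_squared(x, y), 2))  # 0.6
```

A value of 0.6 would be read as 60% of the sample variation in y being explained by the sample regression equation.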

One tailed versus two tailed hypothesis tests

In a one tailed test, we can reject the null hypothesis only on one side of the hypothesized value of the population parameter. In a two tailed test, we can reject the null hypothesis on either side of the hypothesized value of the population parameter.

confidence interval

A confidence interval, or interval estimate, provides a range of values that, with a certain level of confidence, contains the population parameter of interest.

Simple random sample

A sample of n observations that has the same probability of being selected from the population as any other sample of n observations. Most statistical methods presume simple random samples.

Symmetry

A distribution is symmetric if one side of the histogram is a mirror image of the other side. If symmetric, the mean and the median are equal; for a symmetric distribution with a single peak, the mode equals them as well.

Confidence intervals and two tailed hypothesis tests

If the confidence interval does not contain the hypothesized value of the population mean $\mu_0$, then we reject the null hypothesis. If the confidence interval contains $\mu_0$, then we do not reject the null hypothesis.

Stochastic

If the relationship is not deterministic, then it is stochastic

Deterministic

If the value of the response variable is uniquely determined by the values of the explanatory variables

Hypothesis test for the population mean when $\sigma$ is unknown

In most business applications, $\sigma$ is not known and we have to replace $\sigma$ with the sample standard deviation $s$ to estimate the standard error of $\bar{X}$.

Stratified vs Cluster

In stratified, the sample consists of observations from each group, whereas in cluster, the sample consists of observations from the selected groups. Stratified used to increase precision, cluster used to save money.

Response variable

influenced or caused by other variables. Referred to as the dependent variable

Confidence interval for population proportion

$\bar{p} \pm z_{\alpha/2} \sqrt{\frac{\bar{p}(1-\bar{p})}{n}}$

Parameter

A numerical measure that describes a population, such as the population mean, which uses all N observations in the population.

Mean

Primary measure of central location. The average. Sample mean: $\bar{x}$; population mean: $\mu$.

Z table

provides areas (probabilities) under the z curve. The left hand page provides cumulative probabilities for z values less than or equal to zero. The right hand page shows cumulative probabilities for z values greater than or equal to zero.

dummy variable

Defined as a variable that takes on values of 1 or 0. Used to represent a qualitative explanatory variable with two categories.

Standard error of the estimate

The standard deviation of the residual, denoted $s_e$. The residual $e$ represents the difference between an observed value and the predicted value of the response variable, that is, $e = y - \hat{y}$. If all the data points had fallen on the line, then each residual would be zero; in other words, there would be no dispersion between the observed and the predicted values. It is a useful goodness-of-fit measure when we are comparing various models: the model with the smaller $s_e$ provides the better relative fit. It proves less useful when we are assessing a single model.

Point estimator

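A function of the sample that produces a single value to estimate a population parameter. For example, the sample mean is a point estimator of the population mean.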

Statistic

A numerical measure computed from a sample, such as the sample mean, which is used to estimate the population mean.

P value approach

The decision rule is to reject the null hypothesis if the p value < α and not reject the null hypothesis if the p value ≥ α, where α is the significance level (commonly 5%).
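The decision rule can be sketched for a z test as follows. The p value depends on whether the test is left tailed, right tailed, or two tailed. The function names, the tail labels, and the test statistic values are assumptions made for this example.

```python
from statistics import NormalDist

def p_value(z, tail):
    """p value of a z test statistic; tail is 'left', 'right', or 'two'."""
    cdf = NormalDist().cdf
    if tail == "left":
        return cdf(z)
    if tail == "right":
        return 1 - cdf(z)
    if tail == "two":
        return 2 * (1 - cdf(abs(z)))
    raise ValueError("tail must be 'left', 'right', or 'two'")

def decide(p, alpha=0.05):
    """Apply the decision rule: reject only when p is below alpha."""
    return "reject the null hypothesis" if p < alpha else "do not reject the null hypothesis"

# Hypothetical test statistics.
p_two = p_value(2.1, "two")
p_right = p_value(1.2, "right")
print(round(p_two, 4), decide(p_two))
print(round(p_right, 4), decide(p_right))
```

With a 5% significance level, the first hypothetical test rejects the null hypothesis and the second does not.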

P value

the likelihood of obtaining a sample mean that is at least as extreme as the one derived from the given sample, under the assumption that the null hypothesis is true as an equality. The observed probability of making a Type 1 error

