HBX Core Business Analytics
explanatory power of a regression analysis
The extent to which changes in the observed values of the independent variable(s) in a regression analysis explain the changes in the observed values of the dependent variable. Measures of explanatory power include multiple R, R-squared and adjusted R-squared.
intercept
The intersection of a line or curve with an axis on a graph. The y-intercept of the regression line y = a + bx is a, the value of y when x=0. A straight line is completely described by its y-intercept and its slope.
regression line
The line that describes the best linear relationship between a dependent variable and an independent variable. The regression equation defines the regression line: it is the line that minimizes the sum of squared errors between each observed value of the dependent variable in the sample and the corresponding predicted value of the dependent variable on the regression line. A regression line is typically used to forecast the behavior of the dependent variable and/or to better understand the nature of the relationship between the dependent and the independent variable. Also called the "best fit line."
mean square error
The mean square error is an average of the squared errors. For a regression, the mean square error equals the residual sum of squares (that is, the sum of squared errors) divided by the residual degrees of freedom (that is, the number of observations minus the number of independent variables minus 1). The square root of the mean square error is the standard error of the regression.
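As a minimal illustration, the Python sketch below uses made-up residuals and an assumed k = 2 independent variables to compute the mean square error and the standard error of the regression from the definition above.

```python
# A sketch (hypothetical numbers): mean square error for a regression with
# n = 10 observations and an assumed k = 2 independent variables.
residuals = [1.2, -0.8, 0.5, -1.1, 0.9, -0.3, 0.7, -0.6, 0.4, -0.9]

n = len(residuals)                    # number of observations
k = 2                                 # number of independent variables (assumed)
rss = sum(e ** 2 for e in residuals)  # residual sum of squares
mse = rss / (n - k - 1)               # divide by the residual degrees of freedom
standard_error = mse ** 0.5           # standard error of the regression

print(rss, mse, standard_error)
```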
population mean
The mean, or average, value of a variable in a population. The population mean is denoted by the Greek letter µ. Because we typically do not know the true value of the population mean, we usually estimate it using a sample mean, denoted as x-bar.
sample mean
The mean, or average, value of a variable in a sample. The sample mean is denoted by x-bar. For a given sample, the sample mean is the best estimate of the true population mean, provided that the sample is randomly selected. The sample mean varies for different samples drawn from a population. For a given population, the accuracy of a sample mean generally increases as the sample size increases. In general, the lower the variability in a population, the more accurate the sample mean is as an estimate of the population mean.
distribution of sample means
The probability distribution of the means of all randomly-selected samples of the same size that could be taken from a population. The Central Limit Theorem states that for sufficiently large randomly-selected samples, the distribution of sample means approximates a normal distribution. The standard deviation of the distribution of sample means is equal to the standard deviation of the population divided by the square root of the sample size. If we do not know the standard deviation of the population, we can estimate it using the sample standard deviation.
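A small simulation sketch, with a hypothetical uniform population and an arbitrary sample size, makes the claim concrete: the standard deviation of the sample means comes out close to the population standard deviation divided by the square root of the sample size.

```python
# A simulation sketch: sample means drawn from a non-normal (uniform) population.
# The population and sample size are hypothetical.
import random
import statistics

random.seed(1)
population = [random.uniform(0, 100) for _ in range(10_000)]
n = 50  # sample size

sample_means = [statistics.mean(random.sample(population, n)) for _ in range(2_000)]

# Per the Central Limit Theorem, the sample means are roughly normal, and their
# standard deviation is close to the population standard deviation / sqrt(n).
print(statistics.stdev(sample_means))
print(statistics.pstdev(population) / n ** 0.5)
```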
data visualization
The process of putting quantitative or qualitative data into a visual context so it is easier to understand. Examples of data visualization include graphs, word clouds, dashboards, and financial statements. Data visualization is especially helpful for highlighting data patterns or trends.
response rate
The proportion of people who were invited to participate in research (e.g., a survey) who actually participate. The response rate is usually expressed as a percentage.
gross relationship
The relationship between a single independent variable and a dependent variable. The gross relationship is affected by any variables that are related to the independent and/or dependent variable but are not included in the model.
residual plot
The residual plot is a scatter plot with residuals on the y-axis and the independent variable on the x-axis. The plot graphically represents the residual (the difference between the observed value and predicted value of the dependent variable) for each observation. Examining residual plots can provide significant insight into the relationships among variables and the validity of the assumptions underlying regression models.
bias
The tendency of a measurement process to over- or under-estimate the value of a population parameter. Although a sample statistic will almost always differ from the population parameter, for an unbiased sample, the difference will be random. In contrast, for a biased sample, the statistic will differ in a systematic way (e.g., tend to be too high). Some common reasons for bias include non-random sampling methods and non-neutral question phrasing.
significance level
The threshold for deciding whether to reject the null hypothesis. The most commonly used significance level is .05 (corresponding to a confidence level of 95%), which means we would reject the null hypothesis when the p-value < .05. The significance level is represented by the Greek letter α (alpha) and is equal to 1-confidence level.
percentile
The value of a variable for which a certain percentage of the data set falls below. For example, if 87% of students taking the GMAT exam earn scores below 670, the 87th percentile for the GMAT exam is 670 points.
nonlinearity
A characteristic of a relationship between two variables that cannot be described by a linear equation, but can be described by a nonlinear equation.
skew
A characteristic of an asymmetric distribution. A skewed distribution is sometimes characterized by the behavior of the distribution's tails. For example, a right skewed distribution may have a longer or fatter "tail" of observations extending to the right than to the left and a left skewed distribution may have a longer or fatter "tail" to the left than to the right. Although often useful, this approach can be uninformative in assessing certain distributions.
heteroskedasticity
A characteristic of the distribution of the residuals (error terms) in a regression. The error terms are heteroskedastic if the size of the error terms depends systematically upon the value(s) of the independent variable(s). Examining residual plots for patterns is useful for identifying heteroskedasticity (for example, if the error terms grow larger as the value of the independent variable grows larger, a classic funnel shape may be visible in the residual plot). Inferences drawn from a regression analysis with heteroskedastic error terms are suspect.
graph
A chart, diagram, or visual representation of data used to illustrate relationships or trends among numeric values, categories or variables.
histogram
A common graphical representation of statistical data, used to represent the distribution of values of a single variable in a data set. The full range of the variable's values is drawn on the x-axis and divided into non-overlapping intervals called bins. A vertical bar is constructed for each bin. The height of the bar corresponding to a bin is equal to the number of data points in the bin (that is, the number of data points with a value within the range of the interval). In an Excel histogram, each bin is labeled by the value of the upper boundary of the bin's range. For example, in a histogram with three bins (each of width 1), labeled 1, 2, and 3, the bin labeled 2 contains all observations greater than 1 and less than or equal to 2. See bin.
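The sketch below illustrates the Excel-style binning convention described above; the data values and the three bins of width 1 are assumptions for illustration only.

```python
# Hypothetical data, three bins of width 1 labeled by their upper boundaries.
# A value belongs to bin b if it is greater than b - 1 and less than or equal to b.
data = [0.4, 1.0, 1.3, 2.0, 2.7, 3.0]
bins = {1: 0, 2: 0, 3: 0}

for x in data:
    for upper in sorted(bins):
        if x <= upper:
            bins[upper] += 1
            break

print(bins)  # {1: 2, 2: 2, 3: 2}
```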
conditional mean
A conditional mean is the mean (average) of a subset of data. We apply a condition and calculate the mean for values that meet that condition. For example, in a data set that contains data on both males and females, a conditional mean might be the mean of the data pertaining to only the females in the data set.
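A minimal sketch of a conditional mean, using a hypothetical data set of sexes and heights:

```python
# Hypothetical (sex, height) observations; the conditional mean is the mean
# height computed only over the observations that meet the condition sex == "F".
data = [("F", 64), ("M", 70), ("F", 66), ("M", 72), ("F", 63)]

female_heights = [height for sex, height in data if sex == "F"]
conditional_mean = sum(female_heights) / len(female_heights)
print(conditional_mean)
```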
range of likely sample means
A confidence interval around the population mean assumed under the null hypothesis. The width of the range is determined by the standard error of the mean (the sample standard deviation divided by the square root of the sample size) and the desired confidence level. When a sample mean falls outside the range of likely sample means, we reject the null hypothesis at the stated confidence level.
outlier
A data point in a data set that is atypical in that it lies far outside of the range of the other points in the data set. Technically, an outlier is more than 1.5 times the interquartile range greater than the upper quartile or more than 1.5 times the interquartile range less than the lower quartile. For example, if the lower quartile, Q1, is 500 and the upper quartile, Q3, is 700, then the interquartile range is 700 - 500 = 200. So any observation with a value less than 500 - (1.5*200) = 200 or greater than 700 + (1.5*200) = 1,000 would be considered an outlier. See quartile.
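The 1.5-times-IQR rule from the example above can be expressed directly in a few lines; the values passed to the function below are hypothetical.

```python
# The 1.5 * IQR rule with the quartiles from the example above (Q1 = 500, Q3 = 700).
q1, q3 = 500, 700
iqr = q3 - q1                    # interquartile range = 200

lower_fence = q1 - 1.5 * iqr     # 500 - 300 = 200
upper_fence = q3 + 1.5 * iqr     # 700 + 300 = 1000

def is_outlier(x):
    return x < lower_fence or x > upper_fence

print(is_outlier(150), is_outlier(650), is_outlier(1200))  # True False True
```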
binomial distribution
A distribution of the possible successful outcomes in a given number of trials, where there are only two possible outcomes for each trial, and each trial has the same probability of success (e.g., flipping a coin). For example, the binomial distribution for the number of "heads" that result from flipping a coin 50 times specifies the probability for each possible outcome, from observing 0 "heads" to observing 50 "heads". The binomial distribution is used to create confidence intervals for proportions.
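A short sketch of the coin-flip example above, computing the probability of each possible number of heads in 50 fair flips:

```python
# Binomial probabilities for 50 fair coin flips (p = 0.5 for "heads" on each flip).
from math import comb

n, p = 50, 0.5
probs = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

print(probs[25])   # probability of exactly 25 heads
print(sum(probs))  # the probabilities of all possible outcomes sum to 1
```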
multi-modal distribution
A distribution with more than one clearly discernible peak. The peaks may be of the same frequency, or one may be the true mode while the others have very high (but not the highest) frequencies.
scatter plot
A graph showing the relationship between two variables. One variable (generally the independent variable) is measured along the x-axis, and the other (generally the dependent variable) is measured along the y-axis. A single marker is placed for each observation in the data set, allowing for easy visualization of the relationship between the variables.
sample
A group of observations selected from a population. We generally compute statistics based on a random sample to help us estimate the parameters of a population.
one-sided hypothesis test
A hypothesis test that tests for a difference in a parameter in only one direction (e.g., if the mean of one group is greater than the mean of another group). This test should be used only if the researcher has strong convictions about the direction of the change, for example, that the mean of Group A cannot be less than the mean of Group B. In such a case, the null hypothesis might be that the mean of Group A is less than or equal to the mean of Group B, and the alternative hypothesis is that the mean of Group A is greater than the mean of Group B. The rejection region for a one-sided hypothesis test appears in only one tail of the distribution.
coefficient of variation (CV)
A measure of a data set's variability relative to its mean. The coefficient of variation (CV) is particularly helpful when comparing the variability of two data sets with different means. Calculated as the standard deviation divided by the mean, the CV is typically expressed as a percentage. For example, the CV of a data set with mean = 100 hours and standard deviation = 15 hours is 15 hours/100 hours = 15%.
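The worked example above, expressed as a few lines of Python:

```python
# CV for a data set with mean = 100 hours and standard deviation = 15 hours.
mean = 100
std_dev = 15
cv = std_dev / mean
print(f"{cv:.0%}")  # 15%
```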
adjusted R-squared
A measure of the explanatory power of a regression analysis. Adjusted R-squared is equal to R-squared multiplied by an adjustment factor that decreases slightly as each independent variable is added to a regression model. Unlike R-squared, which can never decrease when a new independent variable is added to a regression model, Adjusted R-squared drops when an independent variable is added that does not improve the model's true explanatory power. Adjusted R-squared should always be used when comparing the explanatory power of regression models that have different numbers of independent variables.
R-squared
A measure of the explanatory power of a regression analysis. R-squared measures how much of the variation in a dependent variable can be "explained" by the variation in the independent variable(s). Specifically, R-squared compares the vertical dispersion of the dependent variable's observed values about the regression line to the dispersion of those values about their mean. To calculate R-squared, divide the regression sum of squares by the total sum of squares.
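A minimal sketch of this calculation, using hypothetical observed and predicted values; the regression sum of squares is obtained by subtracting the residual sum of squares from the total sum of squares, as in the entries defined later in this glossary.

```python
# Hypothetical observed values and predicted values from some fitted regression line.
observed = [3.0, 4.5, 5.1, 6.8, 8.2]
predicted = [3.2, 4.1, 5.5, 6.6, 8.0]
mean_y = sum(observed) / len(observed)

total_ss = sum((y - mean_y) ** 2 for y in observed)                        # total sum of squares
residual_ss = sum((y - yhat) ** 2 for y, yhat in zip(observed, predicted)) # residual sum of squares
regression_ss = total_ss - residual_ss                                     # regression sum of squares

r_squared = regression_ss / total_ss
print(r_squared)
```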
kurtosis
A measure of the flatness or sharpness of a distribution. A flat distribution with thick tails has a low or negative kurtosis; a very sharp distribution, with thin tails and a sharp rise to the peak, has a large, positive kurtosis.
standard deviation
A measure of the spread of a data set's values around its mean value. The standard deviation is the square root of the variance. The standard deviation is measured in the same units (such as dollars or hours) as the observations in the data.
sample standard deviation
A measure of the spread of the values of a variable in a sample around the sample mean. The sample standard deviation is denoted by the Latin letter s. For a given sample, the sample standard deviation is the best estimate of the true population standard deviation, provided that the sample is randomly selected. The sample standard deviation varies for different samples drawn from a population. The sample standard deviation is calculated with a bias correction adjustment for sample size (dividing by the sample size minus 1), whereas the population standard deviation is calculated by dividing by the size of the population. The adjustment is small for large sample sizes, and is much larger for small sample sizes.
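A quick sketch of the bias-correction point above, using Python's statistics module and a hypothetical sample: stdev divides by n - 1, while pstdev divides by n.

```python
import statistics

sample = [4, 8, 6, 5, 3]
print(statistics.stdev(sample))   # sample formula: divides by n - 1
print(statistics.pstdev(sample))  # population formula: divides by n
```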
population standard deviation
A measure of the spread of values around the mean of a variable in a population. The population standard deviation is denoted by the Greek letter σ. Because we typically do not know the true value of the population standard deviation, we usually estimate it using a sample standard deviation, denoted by the Latin letter s.
correlation coefficient
A measure of the strength of a linear relationship between two variables. The correlation coefficient can range from -1 to +1. A correlation coefficient of -1 indicates a perfect negative linear relationship between two variables, whereas a correlation coefficient of +1 indicates a perfect positive linear relationship. A correlation coefficient of 0 indicates that no linear relationship exists between two variables, though it is possible that a non-linear relationship exists between the two variables.
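A minimal sketch that computes a correlation coefficient by hand for two hypothetical variables with a strong positive linear relationship:

```python
# Two hypothetical variables with a strong positive linear relationship.
import math

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
r = cov / math.sqrt(sum((a - mean_x) ** 2 for a in x) *
                    sum((b - mean_y) ** 2 for b in y))
print(r)  # close to +1
```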
bimodal distribution
A multi-modal distribution with two clearly discernible peaks. The two peaks may be of the same height (that is, have equal frequency), or one may be the true mode while the other has a very high (but not the highest) frequency.
standard normal distribution
A normal distribution with a mean of 0 and a standard deviation of 1.
null hypothesis
A null hypothesis is a statement about a topic of interest, typically based on historical information or conventional wisdom. We start a hypothesis test by assuming that the null hypothesis is true and then test to see if we can nullify it, which is why it's called the "null" hypothesis. The null hypothesis is the opposite of the hypothesis we are trying to substantiate (the alternative hypothesis).
parameter
A numerical property (such as the mean or variance) of a population. Because we typically do not know the true value of a population parameter, we usually estimate the parameter using a statistic computed from a sample of the population.
p-value
A p-value can be interpreted as the probability, assuming the null hypothesis is true, of obtaining an outcome that is equal to or more extreme than the result obtained from a data sample. The lower the p-value, the greater the strength of statistical evidence against the null hypothesis.
prediction interval
A prediction interval is a range of values constructed around a point forecast. The center of the prediction interval is the point forecast, that is, the expected value of y for a specified value of x. The range of the interval extends above and below the point forecast. The width of the interval is based on the standard error of the regression and the desired level of confidence in the prediction.
asymmetric distribution
A probability distribution that is not symmetric around the mean.
statistic
A quantity measured or calculated from sample data, used to describe or quantify a characteristic of a sample (e.g. sample mean, sample standard deviation). We generally use sample statistics to help us understand the nature of population parameters that cannot be directly observed or measured.
confidence interval for a population mean
A range constructed around a sample mean that estimates the true population mean. The confidence level of a confidence interval indicates how confident we are that the range contains the true population mean. For example, we are 95% confident that a 95% confidence interval contains the true population mean. The confidence level is equal to 1 - significance level.
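A sketch of a 95% confidence interval for a population mean, using hypothetical sample data and the normal approximation (reasonable for a fairly large random sample; a t-value would be used for small samples):

```python
# Hypothetical sample; 95% confidence interval using the normal approximation.
import statistics

sample = [98, 102, 95, 110, 105, 99, 101, 97, 104, 100,
          103, 96, 108, 99, 102, 94, 107, 101, 98, 105]

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)                # sample standard deviation
standard_error = s / n ** 0.5

z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% confidence level
low, high = x_bar - z * standard_error, x_bar + z * standard_error
print(low, high)
```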
bin
A range of values used to categorize data. In a histogram, observations are divided into a set of non-overlapping bins, each corresponding to a range of values. The bins are constructed to ensure that the set of bins contains all observations in the data set. The height of the bar corresponding to a bin is equal to the number of observations in the data set that fall within that bin's range. Typically, all bins in a given histogram are the same width (i.e., the difference between the largest value and the smallest value is the same for each bin). In an Excel histogram, each bin is labeled by the value of the upper boundary of the bin's range. For example, in a histogram with three bins (each of width 1), labeled 1, 2, and 3, the bin labeled 2 contains all observations greater than 1 and less than or equal to 2. See histogram.
multiple regression
A regression analysis with two or more independent variables.
net relationship
A relationship between an independent variable and a dependent variable that controls for other independent variables in a multiple regression. Because we can never include every variable that is related to the independent and dependent variables, we generally consider the relationship between the independent and dependent variables to be net with regard to the other independent variables in the model, and gross with regard to variables that are not included.
observational study
A research method in which researchers collect data without manipulating the observations or the variables under investigation. An observational study provides insight into relationships, but does not establish causality.
experiment
A research method used to test whether changing the values of one or more independent variables changes the value of the dependent variable. An experiment involves randomly dividing observations into two or more groups; each group is treated the same except for the independent variable(s), which is (are) systematically varied across the groups. The dependent variable is then measured, and the differences across the groups are tested for statistical significance.
random sample
A sample (a subset of items) chosen from a population (the full set of items) such that every item in the population has an equal likelihood of being chosen. All statistical inferences about a population based on sample data should be based on random samples.
biased sample
A sample that is not representative of the population from which it is collected. Sampling practices that can introduce bias include poorly phrased survey questions and non-random sampling.
point estimate
A single point used to estimate a population parameter. For example, the sample mean, x-bar, is often used as a point estimate of a population mean.
multicollinearity
A situation that occurs when two independent variables are so highly correlated that it is difficult for the regression model to separate the relationship between each variable and the dependent variable. Multicollinearity can obscure the results of a regression analysis. If adding a new independent variable decreases the significance of another independent variable in the model that was previously significant, multicollinearity may well be the culprit. Another symptom of multicollinearity is when the R-squared of a regression is high but none of the independent variables are significant.
observation
A specific case or data point in a data set.
linear regression
A specific form of regression analysis that examines the linear relationship between a dependent variable and one or more independent variables. Linear regression analysis identifies the "best fit line," the line that minimizes the sum of squared error terms between the observed values in the sample and the predicted values that lie on the regression line. This best-fit line is called the regression line.
F-statistic
A statistic used to test the hypothesis that the true slope of every independent variable in a regression model is zero, or equivalently, that the true coefficient of every independent variable in the model is zero. The F-statistic is reported in the Analysis of Variance (ANOVA) table. If the value of F exceeds a specific criterion value, we reject the null hypothesis that the slope of each independent variable is 0. Significance F is a p-value for the F-test; it tells us the likelihood of obtaining an F-statistic as high as the observed F-statistic if the null hypothesis is true. Note that the F-test tests a null hypothesis about all of the independent variables simultaneously, whereas the t-test for a coefficient tests a null hypothesis about only one independent variable in the context of the regression.
regression analysis
A statistical technique used to identify a mathematical relationship between a dependent variable and one or more independent variables. Regression analysis can be used to forecast the behavior of the dependent variable and/or to better understand the nature of the relationship between the dependent and the independent variable(s).
single-population hypothesis test
A test in which a single population is sampled to test whether a parameter's value is different from a specific value (often a historical average).
Central Limit Theorem
A theorem stating that if we take sufficiently large randomly-selected samples from a population, the means of these samples will be normally distributed regardless of the shape of the underlying population. (Technically, the underlying population must have a finite variance.)
lagged variable
A type of independent variable often used in a regression analysis. When data are collected as a time series, a regression analysis is often performed by analyzing values of the dependent variable together with values of the independent variables from the same time period. However, if researchers hypothesize that there is a relationship between the dependent variable and the value of an independent variable from a previous time period, they may include a "lagged variable," that is, an independent variable based on data from a previous time period.
quantitative variable
A variable that can be counted and/or measured and takes on meaningful numerical values.
hidden variable
A variable that is correlated with two different variables that are not directly related to each other. The two variables may appear to be unrelated, but are mathematically correlated because each of them is correlated with a third, the hidden variable that drives the observed correlation. A hidden variable makes its presence known through its relationship with each of the two variables that are being observed.
independent variable
A variable that is presumed to be related to a dependent variable. In a regression model, independent variables can be used to help predict the value of the dependent variable. A regression model seeks to find the best-fit linear relationship between a dependent variable and one or more independent variables.
dependent variable
A variable that is presumed to be related to one or more independent variables. In a regression model, the dependent variable is the value we are trying to understand or predict. A regression model seeks to find the best-fit linear relationship between a dependent variable and one or more independent variables.
proxy variable
A variable that is presumed to be strongly correlated with a variable of interest. A proxy variable may be used when data on it are more readily available than data on the variable of interest.
dummy variable
A variable that takes on one of two values: 0 or 1. Dummy variables are used to transform categorical variables into quantitative variables. A categorical variable with only two categories (e.g. "heads" or "tails") can be transformed into a quantitative variable using a single dummy variable that takes on the value 1 when a data point falls into one category (e.g. "heads") and 0 when a data point falls into the other category (e.g. "tails"). For categorical variables with more than two categories, multiple dummy variables are required. Specifically, the number of dummy variables must be the total number of categories minus one.
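A small sketch of dummy coding for a hypothetical three-category variable, using two dummy variables (categories minus one) with "Small" treated as the base case:

```python
# Hypothetical three-category variable; "Small" is the base case, so only two
# dummy variables are needed (number of categories minus one).
sizes = ["Small", "Medium", "Large", "Medium", "Small"]

rows = [
    {"is_medium": 1 if s == "Medium" else 0,
     "is_large": 1 if s == "Large" else 0}   # both 0 implies the base case, "Small"
    for s in sizes
]
print(rows)
```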
qualitative variable
Also called categorical variable. A variable that can be sorted or grouped into categories. Qualitative variables must be transformed into dummy variables (see dummy variables) before they can be included in a regression analysis.
descriptive statistics
Also known as summary statistics, these are numbers that provide a quick overview of the properties of a data set. Typically, descriptive statistics include the data set's mean, median, mode, standard deviation, sample size, minimum, maximum, and range.
alternative hypothesis
An alternative hypothesis is the theory or claim we are trying to substantiate, and is stated as the opposite of a null hypothesis. When our data allow us to nullify the null hypothesis, we substantiate the alternative hypothesis.
A/B test
An experiment that compares the value of a specified dependent variable (such as the likelihood that a web site visitor purchases an item) across two different groups (usually a control group and a treatment group). The members of each group must be randomly selected to ensure that the only difference between the groups is the "manipulated" independent variable (for example, the size of the font on two otherwise-identical web sites). An A/B test is a hypothesis test that tests whether the means of the dependent variable are the same across the two groups. (An A/B test can also be used to test whether another parameter, such as a standard deviation, is the same across two groups.)
cross-sectional data
Data that provide a measure of an attribute across multiple different subjects (e.g. people, organizations, or countries) at a given moment in time or during a given time period.
standard error
An estimate of how close the mean of a sample is likely to be to the mean of the overall population. The standard error is calculated by dividing the standard deviation of the sample by the square root of the sample size (the total number of data points in the sample).
point forecast
For a given regression model, a point forecast is the predicted value of y for a specified value of x. The point forecast is a single value, equal to the expected value of y for a specified value of x. For example, for the regression line y = 10 + 25x, the point forecast when x = 2 is y-hat = 10 + 25(2) = 60. See expected value.
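The point forecast from the example above, expressed as a tiny function:

```python
# Point forecast for the example regression line y = 10 + 25x.
def point_forecast(x, intercept=10, slope=25):
    return intercept + slope * x

print(point_forecast(2))  # 60
```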
slope
For a line defined by the equation y = a + bx, the slope b specifies how much the value of y changes for each unit of change in x (that is, when x increases by 1 unit). In a single variable linear regression, the slope is equal to the regression coefficient of the independent variable. Thus the regression line's slope indicates the expected change in the dependent variable for each unit change in the independent variable x. A straight line is completely described by its y-intercept and its slope.
residual
For a specified value of an independent variable x, the residual in a regression model is the vertical distance between the observed value of the dependent variable y corresponding to that x-value and the expected value of y for that x-value. To calculate the residual for a given x-value, subtract the expected value of y from the observed value of y. See error term.
distribution
Observed or theoretical frequencies or probabilities of a variable across a range of values.
control group
One of two or more groups (typically a control group and one or more treatment groups) in an experiment. The control group either is not manipulated in any way or is treated in the way the population has historically been treated (e.g., exposed to traditional advertising rather than proposed advertising), whereas the treatment group(s) is (are) manipulated. Ideally, participants should be randomly assigned to groups so that there are no systematic differences between the members of the control and the treatment groups.
quartile
If the ordered observations in a data set are divided into four equal sections, the upper boundary of each section is a quartile. One-fourth (25%) of the data fall below Q1, also called the lower quartile. One-half (50%) of the data fall below Q2, also called the median. Three-fourths (75%) of the data fall below Q3, also called the upper quartile. For example, if a data set consists of the values 5, 8, 10, 11, 12, 25, 30, 73, 800, 1500, 2000, 2001, 2003, then Q1 would be 11 (three values are less than 11, nine values are greater, and one value is exactly equal to 11); Q2 would be 30 (six values are less than 30, six values are greater, and one value is exactly equal to 30); and Q3 would be 1500 (nine values are less than 1500, three values are greater, and one value is exactly equal to 1500). If there is not an observation in the data set that divides the set evenly, the quartile is the average of the two points that are closest to dividing it evenly. For example, if the data set consists of the values 5, 8, 10, 11, 12, 25, 30, 73, 800, 1500, 2000, 2001, then Q1 would be 10.5 (the average of 10 and 11; three values are less than 10.5 and nine values are greater), Q2 would be 27.5 (the average of 25 and 30; six values are less than 27.5 and six values are greater), and Q3 would be 1150 (the average of 800 and 1500; nine values are less than 1150 and three values are greater). See median.
rejection region
In a hypothesis test, the rejection region is the region outside the range of likely sample means. If the sample statistic falls in the rejection region, there is sufficient evidence to reject the null hypothesis at the confidence level used to create the range of likely sample means.
over-fitting
Increasing the apparent explanatory power of a regression analysis simply by increasing the number of independent variables. We account for over-fitting in regression analysis by looking at the adjusted R-squared, which adjusts for the number of independent variables.
regression sum of squares
The amount of variation that is explained by the regression line. To calculate the regression sum of squares, subtract the residual sum of squares from the total sum of squares.
residual sum of squares
The amount of variation that is not explained by the regression line. The residual sum of squares is equal to the sum of the squared residuals, that is, the sum of the squared differences between the observed values of the dependent variable and the predicted values of the dependent variable. To calculate the residual sum of squares, subtract the regression sum of squares from the total sum of squares.
expected value
The best estimate for the value of the dependent variable y for a specified value of x. The expected value is the predicted value of y (that is, the y-value that lies on the regression line) corresponding to a specified value of x. For example, for the regression line y = 10 + 25x, the expected value of y when x = 2 is y = 10 + 25(2) = 60. See point forecast. In general, the expected value of a variable is the mean of that variable as defined in Module 1.
base case
The category of a categorical variable for which a dummy variable is NOT included in a regression model. A regression model with a categorical variable that has n categories should have n-1 dummy variables. The coefficients of the dummy variables included in the regression model are interpreted in relation to the base case. The analyst can select any category to be excluded from the regression model; however, different base cases lead to different interpretations of the dummy variables' coefficients. For example, suppose we are trying to determine the average difference in height between men and women in a sample, and suppose that on average men are 5 inches taller than women in the sample. If we use Female as the base case then the coefficient for the dummy variable for Male would be +5. If we use Male as the base case, the coefficient for the dummy variable for Female would be -5.
population
The complete set of individuals or items in which an analyst or researcher is interested. When it is difficult to learn about every member of a population, random samples are often drawn from a population and analyzed in order to draw inferences about the population.
interquartile range
The difference between the upper quartile (the 75th percentile) and lower quartile (the 25th percentile). The interquartile range contains the middle 50% of the observations in a given data set.
range
The distance between the smallest and greatest values in a data set. Range is the simplest measure of the variability of a data set.
regression equation
The equation, calculated by performing regression analysis, that defines the regression line or regression plane. The general form of the regression equation is ŷ = a + b₁x₁ + b₂x₂ + … + bₖxₖ, where y is the dependent variable; x₁, x₂, …, xₖ are the independent variables; a is the intercept; and b₁, b₂, …, bₖ are the coefficients. The predicted value ŷ is read "y-hat."
error term
The error term, ε, is the difference between the actual observed value of the dependent variable y for a specified value of x and the expected value of y for that value of x. An error term in a regression analysis is also called a residual.
median
The median identifies the middle of a data set such that the same number of data points have values below the median as have values above the median. The median is found by arranging the values of the data points in order of magnitude. If the total number of data points is odd, the median is the value that lies exactly in the middle. For example, if a data set has values 5, 8, 10, 12, 25, 30, 73, 800, 1500, 2000, 2001, the median would be 30, because five values are less than 30 and five values are greater than 30. If the total number of data points is even, the median is the average of the two values in the middle. For example, if a data set has values 5, 8, 10, 12, 25, 30, 73, 800, 1500, 2000, the median would be 27.5, the average of 25 and 30. Five of the values are less than 27.5 and five of the values are greater than 27.5.
mode
The mode is the value that occurs most frequently in a data set.
average
The most common statistic used to describe the center of the values in a data set. The mean is also known as the average. For a distribution that has discrete values, the mean is equal to the sum of the values of all the data points in the set, divided by the number of data points.
mean
The most common statistic used to describe the center of the values in a data set. The mean is also known as the average. For a distribution that has discrete values, the mean is equal to the sum of the values of all the data points in the set, divided by the number of data points.
normal distribution
The normal distribution is a symmetric, bell-shaped continuous distribution, with a peak at the mean. A normal distribution is completely determined by two parameters, its mean and standard deviation. Approximately 68% of a normal distribution's outcomes fall within one standard deviation of the mean and approximately 95% of its outcomes fall within two standard deviations of the mean. The mean, median, and mode of a normal distribution are equal.
population proportion
The number of data points in a population with a certain characteristic divided by the total number of data points in the population. We generally cannot measure the full population proportion directly; we estimate it from the sample proportion. Population proportions are often expressed as fractions or percentages.
sample proportion
The number of data points in a sample with a certain characteristic divided by the sample size (the total number of data points in the sample). The sample proportion is used to estimate the population proportion. To ensure that we obtain a good estimate of the population proportion, the sample should be large enough that n*p ≥5 and n*(1-p) ≥5, where n denotes the sample size and p denotes the sample proportion. This means there are at least 5 cases in our sample WITH the characteristic, and at least 5 cases in our sample WITHOUT the characteristic.
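A short sketch of the sample-size check described above, with hypothetical counts:

```python
# Hypothetical sample of n = 40 with 9 data points having the characteristic.
n = 40
successes = 9
p = successes / n  # sample proportion

large_enough = n * p >= 5 and n * (1 - p) >= 5
print(p, large_enough)
```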
proportion
The number of data points in a set with a certain characteristic divided by the total number of data points in the set. Proportions are often expressed as fractions or percentages.
degrees of freedom (df)
The number of independent pieces of information available for a statistical test or estimation procedure. Degrees of freedom are needed to specify a t-distribution from the family of t-distributions. When calculating a confidence interval with a t-value, the number of degrees of freedom is equal to the sample size minus 1. For example, if the sample size is 20, the number of degrees of freedom would be 20-1=19. Thus, our t-distribution would have 19 df.
sample size
The number of observations in a sample. In general, all else being equal, the larger the sample size of a randomly-drawn sample, the more closely a statistic based on that sample approximates the population parameter it estimates.
confidence level
The percentage of all possible samples that can be expected to include the true population parameter. For example, for a 95% confidence level, the intervals should be constructed so that, on average, for 95 out of 100 samples, the confidence interval will contain the true population mean. Note that this does not mean that for any given sample, there is a 95% chance that the population mean is in the interval; each confidence interval either contains the true mean or it does not.
