Statistics

Skewed data

By Jim Frost. Skewed data are not equally distributed on both sides of the distribution; the distribution is not symmetrical. Use a histogram to easily see whether your data are skewed. When you refer to skewed data, you can describe them as either right skewed or left skewed. Data are skewed right when most of the data are on the left side of the graph and the long skinny tail extends to the right. Data are skewed left when most of the data are on the right side of the graph and the long skinny tail extends to the left. If your data are skewed, the mean can be misleading because the most common values in the distribution might not be near the mean. Additionally, skewed data can affect which types of analyses are valid to perform.
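As a rough numeric companion to the histogram, the sketch below computes a moment-based skewness coefficient for a small data set. The data values and the function name `sample_skewness` are hypothetical, chosen only for illustration; a positive result suggests a tail to the right and a negative result a tail to the left.

```python
import statistics

def sample_skewness(values):
    """Moment-based skewness: mean cubed deviation over the cubed (population) standard deviation."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical right-skewed data: most values are small, one is large.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
# Mirror image of the same data, so the long tail is on the left.
left_skewed = [-v for v in right_skewed]

skew_right = sample_skewness(right_skewed)   # positive
skew_left = sample_skewness(left_skewed)     # negative
mean_right = statistics.fmean(right_skewed)  # pulled toward the tail
median_right = statistics.median(right_skewed)

print(skew_right, skew_left, mean_right, median_right)
```

In the right-skewed sample the mean lands above the median, which is the pattern the definition warns about.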

Continuous variables

Continuous variables can take on almost any numeric value and can be meaningfully divided into smaller increments, including fractional and decimal values. You often measure a continuous variable on a scale. For example, when you measure height, weight, and temperature, you have continuous data. With continuous variables, you can calculate and assess the mean, median, standard deviation, or variance.

Factors

Factors are the variables that experimenters control during an experiment in order to determine their effect on the response variable. A factor can take on only a small number of values, which are known as factor levels. A factor can be a categorical variable, or it can be based on a continuous variable that uses only a limited number of values chosen by the experimenters.

Ordinary least squares [OLS]

Ordinary least squares, or linear least squares, estimates the parameters in a regression model by minimizing the sum of the squared residuals. This method draws a line through the data points that minimizes the sum of the squared differences between the observed values and the corresponding fitted values.
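For a single predictor, the least squares line has a well-known closed form. The sketch below applies it to hypothetical data; the function names and the x and y values are assumptions made for illustration. It then checks that nudging the slope or intercept away from the fitted values makes the sum of squared residuals larger.

```python
def fit_ols_line(x, y):
    """Return (slope, intercept) of the simple least squares line."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

def sum_squared_residuals(x, y, slope, intercept):
    """Sum of squared differences between observed and fitted values."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical observations.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_ols_line(x, y)
best = sum_squared_residuals(x, y, slope, intercept)

# Any other line gives a larger sum of squared residuals.
worse_slope = sum_squared_residuals(x, y, slope + 0.1, intercept)
worse_intercept = sum_squared_residuals(x, y, slope, intercept - 0.1)
print(slope, intercept, best, worse_slope, worse_intercept)
```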

Regression analysis

Regression analysis models the relationships between a response variable and one or more predictor variables. Use a regression model to understand how changes in the predictor values are associated with changes in the response mean. You can also use regression to make predictions based on the values of the predictors. You choose among a variety of regression methodologies based on the type of response variable, the type of model that is required to provide an adequate fit to the data, and the estimation method.

Mean

The mean describes an entire sample with a single number that represents the center of the data. The mean is the arithmetic average. You calculate the mean by adding up all of the observations and then dividing the total by the number of observations. For example, if the weights of five apples are 5, 5, 6, 7, and 8, the average apple weight is 6.2: (5 + 5 + 6 + 7 + 8) / 5 = 6.2. The mean is sensitive to skewed data and extreme values. For data sets with these properties, the mean gets pulled away from the center of the data. In these cases, the mean can be misleading because the most common values in the distribution might not be near the mean.
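To illustrate the sensitivity to extreme values, the sketch below computes the mean of a hypothetical sample before and after one extreme observation is added. The numbers are invented for illustration only.

```python
def mean(values):
    """Arithmetic average: total of the observations divided by their count."""
    return sum(values) / len(values)

# Hypothetical observations clustered around 5.
typical = [4, 5, 5, 6, 5]
# The same observations plus one extreme value.
with_extreme = typical + [35]

mean_typical = mean(typical)            # 25 / 5 = 5.0
mean_with_extreme = mean(with_extreme)  # 60 / 6 = 10.0
print(mean_typical, mean_with_extreme)
```

After the extreme value is added, every one of the typical observations sits below the mean, so the mean no longer describes the common values well.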

Mode

The mode is the value that occurs most frequently in a set of observations. You can find the mode simply by counting the number of times each value occurs in a data set. For example, if the weights of five apples are 5, 5, 6, 7, and 8, the apple weight mode is 5 because it is the most frequent value. Identifying the mode can help you understand your distribution.
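The counting procedure can be sketched with `collections.Counter` from the Python standard library. The helper name `modes` is hypothetical; it returns a list because a data set can have more than one most frequent value.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

apple_weights = [5, 5, 6, 7, 8]
apple_modes = modes(apple_weights)  # 5 occurs twice, every other value once

# A hypothetical data set with two equally frequent values.
two_modes = modes([1, 1, 2, 2, 3])
print(apple_modes, two_modes)
```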

Categorical variables

A categorical variable has values that you can put into a countable number of distinct groups based on a characteristic. For a categorical variable, you can assign categories but the categories have no natural order. If the variable has a natural order, it is an ordinal variable. Categorical variables are also called qualitative variables or attribute variables. For example, college major is a categorical variable that can have values such as psychology, political science, engineering, biology, etc.

Correlation

A correlation between variables indicates that as one variable changes in value, the other variable tends to change in a specific direction. A correlation coefficient measures both the direction and the strength of this tendency to vary together. A positive correlation indicates that as one variable increases the other variable tends to increase. A correlation near zero indicates that as one variable increases, there is no tendency in the other variable to either increase or decrease. A negative correlation indicates that as one variable increases the other variable tends to decrease.

The correlation coefficient can range from -1 to 1. The extreme values of -1 and 1 indicate a perfectly linear relationship where a change in one variable is accompanied by a perfectly consistent change in the other. In practice, you won't see either type of perfect relationship. The two most common types of correlation coefficients are Pearson's product moment correlation and the Spearman rank-order correlation.

Pearson product moment correlation: The Pearson correlation evaluates the linear relationship between two continuous variables. A relationship is linear when a change in one variable is associated with a proportional change in the other variable.

Spearman rank-order correlation: Also called Spearman's rho, the Spearman correlation evaluates the monotonic relationship between two continuous or ordinal variables. In a monotonic relationship, the variables tend to change together, but not necessarily at a constant rate. The Spearman correlation coefficient is based on the ranked values for each variable rather than the raw data.
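The difference between the two coefficients shows up when a relationship is monotonic but not linear. The sketch below implements both from their definitions using only the standard library; the function names and the data (y equal to the cube of x) are assumptions chosen for illustration. Spearman's coefficient is computed here as the Pearson correlation of the ranks, with tied values given their average rank.

```python
import math

def pearson(x, y):
    """Pearson product moment correlation of two equal-length sequences."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    sxx = sum((a - x_mean) ** 2 for a in x)
    syy = sum((b - y_mean) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical monotonic but nonlinear relationship.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

r_pearson = pearson(x, y)    # below 1, because the relationship is not linear
r_spearman = spearman(x, y)  # 1 (up to rounding), because y always rises with x
print(r_pearson, r_spearman)
```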

Fixed and Random factors

In ANOVA, factors are either fixed or random. In general, if the investigator controls the levels of a factor, the factor is fixed. The investigator gathers data for all factor levels she is interested in. On the other hand, if the investigator randomly sampled the levels of a factor from a population, the factor is random. A random factor has many possible levels and the investigator is interested in all of them. However, she can only collect a random sample of some factor levels. Suppose you have a factor called "operator," and it has ten levels. If you intentionally select these ten operators and want your results to apply to just these operators, then the factor is fixed. However, if you randomly sample ten operators from a larger number of operators, and you want your results to apply to all operators, then the factor is random. These two types of factors require different types of analyses. The conclusions that you draw from an analysis can be incorrect if you specify the type of factor incorrectly.

Residuals

In statistical models, a residual is the difference between the observed value and the mean value that the model predicts for that observation. Residual values are especially useful in regression and ANOVA procedures because they indicate the extent to which a model accounts for the variation in the observed data.
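As a minimal sketch of the definition, the code below computes residuals as observed minus fitted values. The observations are hypothetical, and the fitted values come from an assumed model, y = 3x + 5, used here purely for illustration.

```python
def residuals(observed, fitted):
    """Residual = observed value minus the value the model predicts."""
    return [o - f for o, f in zip(observed, fitted)]

# Hypothetical predictor values and observations.
x = [1, 2, 3, 4]
observed = [9, 10, 15, 16]

# Fitted values from an assumed model y = 3x + 5.
fitted = [3 * xi + 5 for xi in x]  # [8, 11, 14, 17]

res = residuals(observed, fitted)                # [1, -1, 1, -1]
sum_squared_residuals = sum(r ** 2 for r in res)  # 4
print(res, sum_squared_residuals)
```

Small residuals relative to the spread of the observations suggest that the model accounts for most of the variation in the data.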

Regression coefficients

Regression coefficients are estimates of the unknown population parameters and describe the relationship between a predictor variable and the response. In linear regression, coefficients are the values that multiply the predictor values. Suppose you have the following regression equation: y = 3X + 5. In this equation, +3 is the coefficient, X is the predictor, and +5 is the constant. The sign of each coefficient indicates the direction of the relationship between a predictor variable and the response variable. A positive sign indicates that as the predictor variable increases, the mean of the response variable also increases. A negative sign indicates that as the predictor variable increases, the mean of the response variable decreases. The coefficient value represents the mean change in the response given a one-unit increase in the predictor. For example, if a coefficient is +3, the mean response value increases by 3 for every one-unit increase in the predictor.
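The interpretation of the coefficient can be checked directly on the example equation y = 3X + 5. In the sketch below, the function name `predict_mean_response` and the predictor values 10 and 11 are hypothetical; the second call uses an assumed negative coefficient to show the opposite direction.

```python
def predict_mean_response(x, coefficient=3, constant=5):
    """Mean response from a one-predictor linear equation."""
    return coefficient * x + constant

# A one-unit increase in the predictor changes the mean response by the coefficient.
change_positive = predict_mean_response(11) - predict_mean_response(10)  # 3

# With an assumed negative coefficient, the mean response decreases instead.
change_negative = (predict_mean_response(11, coefficient=-2)
                   - predict_mean_response(10, coefficient=-2))  # -2
print(change_positive, change_negative)
```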

Median

The median is the middle of the data. Half of the observations are less than or equal to it and half of the observations are greater than or equal to it. The median is equivalent to the second quartile or the 50th percentile. For example, if the weights of five apples are 5, 5, 6, 7, and 8, the median apple weight is 6 because it is the middle value. If there is an even number of observations, you take the average of the two middle values. The median is less sensitive than the mean to skewed data and extreme values. For data sets with these properties, the mean gets pulled away from the center of the data. In these cases, the mean can be misleading because the most common values in the distribution might not be near the mean. For example, the mean might not be a good statistic for describing annual income. A few extremely wealthy individuals can increase the overall average, giving a misleading view of annual incomes. In this case, the median is more informative.
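The sketch below uses `statistics.median` from the Python standard library on the apple weights, on a hypothetical data set with an even number of observations, and on hypothetical annual incomes (in thousands) that include one very large value.

```python
import statistics

apple_weights = [5, 5, 6, 7, 8]
median_odd = statistics.median(apple_weights)  # middle value: 6

# Even number of observations: average of the two middle values, 6 and 7.
median_even = statistics.median([5, 5, 6, 7, 8, 9])  # 6.5

# Hypothetical annual incomes in thousands, with one extreme value.
incomes = [30, 35, 40, 45, 1000]
income_median = statistics.median(incomes)  # 40
income_mean = statistics.fmean(incomes)     # 1150 / 5 = 230.0
print(median_odd, median_even, income_median, income_mean)
```

Here the mean is larger than four of the five incomes, while the median stays at a value typical of the group.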

