269 chapter 9
4. Measures of Variation
Why use a measure of variation? We also want to know how clustered or spread out the data are around the center. A summary of distributions based only on their central tendency can be very incomplete, even misleading.
2. Univariate Graph
Your absolute best friend when sharing frequency data! Even for the uninitiated, graphs can be very easy to read. And good, professional-looking graphs can now be produced relatively easily with software available for personal computers.
Variance
The average squared deviation of each case from the mean. It takes into account the amount by which each case differs from the mean. Class example: [1,2,3,8,10,13,13,14] Variance = 200/8 = 25
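A minimal Python sketch of the variance calculation for the class example. It assumes the "average" divides by the number of cases N (the population variance), which is what the definition on this card describes.

```python
# Variance as the average squared deviation of each case from the mean.
data = [1, 2, 3, 8, 10, 13, 13, 14]

mean = sum(data) / len(data)                           # 64 / 8 = 8.0
squared_deviations = [(x - mean) ** 2 for x in data]   # 49, 36, 25, 0, 4, 25, 25, 36
variance = sum(squared_deviations) / len(data)         # 200 / 8

print(variance)  # 25.0
```

Many software packages report the sample variance instead, which divides by N - 1 and would give a slightly larger number for the same data.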
Univariate Statistics (These are all descriptive statistics)
1. Frequency distributions. 2. Univariate graphs: bar charts (for nominal & categorical variables), histograms (for continuous variables). 3. Measures of central tendency: mean (aka average), median, mode. 4. Measures of variation: range and interquartile range, variance and standard deviation, skewness.
Correlations
Correlation: a standardized number (coefficient) ranging from -1 to 1. It indicates the direction (+/-) and strength of the linear relationship between two variables. The most commonly used measure of association. Correlations are bivariate. Correlations larger in magnitude (i.e. closer to -1 or 1) indicate a stronger association. Negative (-) correlation: as one variable increases, the other variable tends to decrease. Positive (+) correlation: as one variable increases, the other variable also tends to increase. Correlations are reported with an italicized letter r, e.g. r = -.48.
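A small sketch of how a correlation coefficient can be computed by hand, assuming the coefficient meant here is Pearson's r. The function name `pearson_r` and the two lists of values are hypothetical and only for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y, scaled to fall between -1 and 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    co_deviation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)   # sum of squared deviations of x
    ss_y = sum((b - mean_y) ** 2 for b in y)   # sum of squared deviations of y
    return co_deviation / math.sqrt(ss_x * ss_y)

# Hypothetical data: hours of TV per day and exam score for five students.
hours_tv = [1, 2, 3, 4, 5]
exam_score = [90, 85, 70, 65, 60]

r = pearson_r(hours_tv, exam_score)
print(round(r, 2))  # -0.98
```

The sign is negative because scores tend to fall as hours rise, and the magnitude is close to 1 because the points lie near a straight line.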
Looking for Patterns in a Crosstabulation
Existence; Strength; Direction (quantitative variables); Pattern (quantitative variables)
What happens after data collection
Analyze it using statistical procedures
Research process
Asking research question; Formulating research process; Collecting data; Data analysis; Evaluate the hypotheses
Graphs
Bar chart: used to display categorical data and show the differences in frequencies of the different categories. Histogram: used to display continuous data.
What graphs reveal
Central tendency: The most common value (for variables measured at the nominal level) or the value around which cases tend to center (for a quantitative variable). Variability: The extent to which cases are spread out through the distribution or clustered in just one location. Skewness: The extent to which cases are clustered more at one or the other end of the distribution of a quantitative variable rather than in a symmetric pattern around its center.
Measures of Central Tendency
Central tendency is usually summarized with one of three statistics: the mode, the median, or the mean. For any particular application, one of these statistics may be preferable, but each has a role to play in data analysis. To choose an appropriate measure of central tendency, the analyst must consider a variable's level of measurement, the skewness of a quantitative variable's distribution, and the purpose for which the statistic is used.
1. Frequencies (Frequency Distributions)
Counts how many cases fall into each category. Your absolute best friend when cleaning data! Use them to diagnose problems and check your work every time you recode, rescale, collapse categories, reverse code, or otherwise change or convert variables. The only exception is for variables that have many response categories (like income) and cannot be easily visualized because the list is so long.
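As a rough illustration of using frequencies to catch data problems, the sketch below counts cases per category with Python's `collections.Counter`. The survey responses are hypothetical, and the misspelled "agre" is planted deliberately to show what a frequency check can surface.

```python
from collections import Counter

# Hypothetical survey responses; "agre" is a deliberate data-entry error.
responses = ["agree", "disagree", "agree", "neutral", "agree", "disagree", "agre"]

frequencies = Counter(responses)

# List categories from most to least frequent.
for category, count in frequencies.most_common():
    print(category, count)
```

A category with a single unexpected case, like "agre" here, is the kind of problem a frequency table makes visible right after recoding.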
Crosstabulation (Bivariate distribution)
Crosstabulation (or contingency table): displays the distribution of one variable for each category of another variable. It can also be termed a bivariate distribution [i.e. it is a chart mapping overlapping frequencies in a matrix]. It provides a simple tool for statistically controlling one or more variables while examining the associations among others.
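A minimal sketch of building a crosstabulation by counting how many cases fall into each combination of two variables. The variables (gender and a yes/no answer) and all of the cases are hypothetical.

```python
from collections import Counter

# Hypothetical cases: (gender, answer) for six respondents.
cases = [
    ("female", "yes"), ("female", "yes"), ("female", "no"),
    ("male", "yes"), ("male", "no"), ("male", "no"),
]

cells = Counter(cases)                        # frequency of each (row, column) pair
rows = sorted({gender for gender, _ in cases})
cols = sorted({answer for _, answer in cases})

# Print one row per gender, one count per answer category.
print("       ", cols)
for row in rows:
    print(row.ljust(7), [cells[(row, col)] for col in cols])
```

Each printed row is the distribution of the answer variable within one category of gender, which is what the crosstab is meant to display.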
Important Bivariate and Multivariate Statistics
Crosstabulations. Correlations: visualized with scatter plots; inferences can be made with p-values. Regressions: inferences can be made with p-values.
Types of Statistics
Descriptive statistics: procedures used to describe a set of data, e.g. means. Inferential statistics: procedures used to make inferences (statements) about the study population based on our sample, e.g. p-values. "Thus, we use inferential statistics to make inferences from our data to more general conditions; we use descriptive statistics simply to describe what's going on in our data"
Measures of Variation, cont.
Four popular measures of variation: range (span), interquartile range (IQR; middle half span), variance, standard deviation. The variable must be at the ordinal, interval or ratio level.
Correlation Does Not Imply Causation!
It is sometimes unclear which variable is cause and which is effect. A third unmeasured variable may be responsible for the relationship. Take home message: one cannot draw causal conclusions from correlation.
Mean (Average)
Mathematical average of the data. The mean is computed by adding up the value of all the cases and dividing by the total number of cases, thereby taking into account the value of each case in the distribution: Mean = Sum of value of cases / Number of cases. Class example: [1,2,3,8,10,13,13,14] Mean = 64/8 = 8
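The same arithmetic as a short Python sketch, using the class example.

```python
# Mean = sum of the values of all cases / number of cases.
data = [1, 2, 3, 8, 10, 13, 13, 14]

total = sum(data)        # 64
n_cases = len(data)      # 8
mean = total / n_cases

print(mean)  # 8.0
```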
Interquartile Range (IQR)
Quartiles: the dividing points between the four quarters in the distribution. IQR = Q3 - Q1. Not influenced by extreme observations, but it only uses the middle half of the data. Class example: [1,2,3,8,10,13,13,14], with Q1 = 3 and Q3 = 13: IQR = 13 - 3 = 10
Mode
One problem with the mode occurs when a distribution is bimodal rather than unimodal, e.g. a class of mostly freshmen and seniors. A bimodal (or trimodal, and so on) distribution has two or more categories with about the same number of cases and with more cases than any of the other categories.
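A brief sketch of detecting a bimodal distribution with `statistics.multimode` from Python's standard library (available in Python 3.8 and later). The class roster is hypothetical.

```python
from statistics import multimode

# Hypothetical class: mostly freshmen and seniors.
class_years = (["freshman"] * 4 + ["sophomore"] + ["junior"] + ["senior"] * 4)

modes = multimode(class_years)   # every value tied for the highest count
print(modes)                     # ['freshman', 'senior']
print(len(modes) > 1)            # True: more than one mode
```

Note that `multimode` only reports exact ties, whereas the definition above also counts categories with "about the same" number of cases, so near-ties still need to be spotted in a frequency table.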
Visualizing Data Using Scatter Plots
Scatter plot: A visual representation of two variables in a dataset. Think-Share-Pair: Suppose you were given only the ages of ten individuals and asked to guess the weight of each person. Q: Would you expect your guesses to be more or less accurate if all ten people were children instead of adults? How might the scatter plots from a sample of children look different from a sample of adults?
Standard Deviation (SD)
The average distance of an observation from the mean (SD = √variance). The bigger the SD, the more spread out the data. A very useful measure of variation because each individual score's distance from the mean of the distribution is factored into its computation. Class example: What is the SD? SD = √25 = 5
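A short sketch of the SD for the class example [1,2,3,8,10,13,13,14], taking the square root of the variance. As an assumption, the variance here divides by N (the number of cases), which reproduces the class answer of 5.

```python
import math

data = [1, 2, 3, 8, 10, 13, 13, 14]

mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)   # divide by N
sd = math.sqrt(variance)                                     # SD = square root of variance

print(sd)  # 5.0
```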
Regression Analysis and P-values
The betas (β) tell which independent variables make the greatest contributions in predicting the level of the dependent variable. We have confidence in these estimates based on the p-values: *p < .05 (95%), **p < .01 (99%), ***p < .001 (99.9%). Variables work as a system: as more important predictors enter the equation, less important variables fade in significance. Every variable can make its own contribution relative to the others.
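The star notation can be read as a simple lookup, sketched below. The function name `significance_stars` and the example p-values are hypothetical; the cutoffs are the ones listed on this card.

```python
def significance_stars(p_value):
    """Return the star label used in regression tables for a given p-value."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""   # not significant at the .05 level

# Hypothetical p-values for four predictors.
for p in [0.2, 0.03, 0.005, 0.0005]:
    print(p, significance_stars(p))
```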
Range
The difference between the highest and the lowest value: Range = Max - Min. The possible range is different from the actual range, e.g. on a scale from 1-10, there might not be anyone reporting a 1 or 10. Range spans all the data but is heavily influenced by outliers/extreme observations. Class example: [1,2,3,8,10,13,13,14] Range = 14 - 1 = 13
Mode
The most frequent value of the data. Is used much less often than the other two measures of central tendency. Appropriate for variables measured at the nominal level. Class example: [1,2,3,8,10,13,13,14] Mode = 13
Some Terminology
Univariate: Description of one variable. How happy is the grandchild? Bivariate: Description of two variables. Is the amount of time the grandparent spends with the grandchild associated with happiness? Multivariate: Description of three or more variables. Whose time with the grandchild is more influential in promoting happiness: the grandmother's or the grandfather's?
Median
Value of the data such that 50% of the observations are smaller than this value. Steps to calculate the median: (1) Put the scores in order from lowest to highest. (2) If N is odd, the median is the middle score of the list. (3) If N is even, the median is the mean of the two middle scores on the ordered list. Class example: [1,2,3,8,10,13,13,14] Median = (8+10)/2 = 9
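The steps for the median can be sketched in Python as follows. The function name `median` is chosen for illustration (Python's standard library offers `statistics.median` for the same purpose).

```python
def median(scores):
    """Median by the three steps: sort, then take the middle score or the mean of the two middle scores."""
    ordered = sorted(scores)          # step 1: lowest to highest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # step 2: N is odd
        return ordered[middle]
    # step 3: N is even, average the two middle scores
    return (ordered[middle - 1] + ordered[middle]) / 2

class_example = [1, 2, 3, 8, 10, 13, 13, 14]
print(median(class_example))  # 9.0
```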
Median vs. Mean
We can use either the mean or the median to summarize continuous data. One problem with the mean: it is heavily influenced by extremely small and large values! The median is often a better descriptive statistic in skewed distributions. We can compare the mean and median to gain information on the skewness of a distribution: Mean = Median (symmetric); Mean > Median (positively skewed); Mean < Median (negatively skewed).
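A small sketch of the mean-versus-median comparison, using Python's `statistics` module. The function name `skew_direction` and the income figures are hypothetical; the comparison is a rule of thumb and not a formal skewness statistic.

```python
from statistics import mean, median

def skew_direction(values):
    """Compare the mean and the median to suggest the direction of skew."""
    m, md = mean(values), median(values)
    if m > md:
        return "positively skewed"
    if m < md:
        return "negatively skewed"
    return "symmetric"

# Hypothetical incomes (in thousands) with one extreme value.
incomes = [30, 35, 40, 45, 500]
print(skew_direction(incomes))        # positively skewed (mean is 130, median is 40)

# Class example: mean is 8, median is 9.
class_example = [1, 2, 3, 8, 10, 13, 13, 14]
print(skew_direction(class_example))  # negatively skewed
```

The single income of 500 pulls the mean far above the median, which is why the median is often the better summary for skewed data.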
Regression Analysis
A statistical method that fits a line through a set of data so that the distance between the data points and the line is minimized. Used in the vast majority of statistical reports and social science journal articles. Other more complex methods (e.g. structural equation models) are based on regression analysis.
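A minimal sketch of the idea with one predictor, assuming the method meant is ordinary least squares, which minimizes the sum of squared vertical distances between the points and the fitted line. The function name `fit_line` and the data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = co-variation of x and y divided by the variation in x.
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x   # the line passes through (mean_x, mean_y)
    return intercept, slope

# Hypothetical data: five (x, y) observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

intercept, slope = fit_line(x, y)
print(round(intercept, 2), round(slope, 2))  # 2.2 0.6
```

The slope says that each one-unit increase in x goes with a predicted increase of about 0.6 in y. Regression with several predictors extends the same idea to more than one slope.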