Chapter 9: Interpreting Quantitative Data
What are the 2 types of quantitative analytical techniques?
1. Descriptive statistics 2. Inferential statistics
Inferential statistics
uncover relationships that are often causal and make inferences from the sample about the population. - can be used to conduct bivariate analyses (two variables) and multivariate analyses, which investigate the relationship b/t three or more variables.
Positive Linear relationship
when an increase in the value of one variable is accompanied by an increase in the value of the other - on a scattergram, the closer the dots are to the line, the stronger the relationship e.g., height and weight: a taller person tends to be heavier
Histogram
(continuous variables) vertical bars along the horizontal axis that represent a continuous variable's frequency distribution.
Limitations of cross-tabulations
- can only show correlation b/t 2 variables not causation - can't see other variables impacting results - not good for continuous variables
What are the ways we can measure a bivariate relationship?
- cross-tabulation - scattergram - correlation coefficient
What happens after data collection?
1. Data entry: putting data into a usable format for analysis 2. Coding: analyzing data in terms of variables e.g., anyone in the sample b/t ages 40 and 65 is coded as "middle-aged" 3. Cleansing: checking the quality of data entry and coding
What are the two types of errors that can occur when calculating and interpreting statistical significance?
1. Type I or alpha error 2. Type II or beta error
What are the 2 types of descriptive statistics?
1. Univariate analysis 2. Bivariate analysis
Can we ever truly know the answer to statistical significance or directly test the null hypothesis?
No, because it applies to the population and not the sample. Thus, we make inferences from the sample.
Pie Chart
a circular graphic that visually represents a variable's frequency distributions, using percentages of the total; ideal visual representation for highlighting an important point - not appropriate when categories are relatively equal
Cross-tabulation
a contingency table (the values of the dependent variable contingent on the independent variable) that displays the frequencies and percentages for specific combinations of at least two categorical variables e.g., mental illness and being under the influence when coming into contact with law enforcement
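A minimal sketch of how such a table could be tallied, using only the standard library. The records are made up for illustration and the variable names are assumptions; percentages are computed within each category of the independent variable (here, mental illness).

```python
from collections import Counter

# Hypothetical (made-up) records: (mental_illness, under_influence) per contact.
contacts = [
    ("yes", "yes"), ("yes", "yes"), ("yes", "no"),
    ("no", "yes"), ("no", "no"), ("no", "no"), ("no", "no"),
]

# Frequency of each combination of the two categorical variables.
cell_counts = Counter(contacts)
# Number of cases in each category of the independent variable.
row_totals = Counter(illness for illness, _ in contacts)


def row_percent(row, col):
    """Percentage of cases in category `row` that fall in category `col`."""
    return 100 * cell_counts[(row, col)] / row_totals[row]


for row in ("yes", "no"):
    for col in ("yes", "no"):
        print(f"illness={row:<3} influence={col:<3} "
              f"n={cell_counts[(row, col)]} ({row_percent(row, col):.1f}%)")
```

Comparing percentages across the rows (rather than raw counts) is what lets the two groups be compared even though they differ in size.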
Codebook
a document containing the coding rules, definitions and numbers assigned to each variable attribute; used by researchers for data entry and analysis.
Frequency polygon or line graph
a graph with a continuous line used to present the frequency distribution of interval or ratio variables visually; the line connects the midpoints of the bars within a histogram; better option than histogram when the highest and lowest values are far apart.
What does it mean when the values of the mean, median and mode are not the same?
a skewed distribution occurs; we strive for a symmetrical distribution with both sides virtually identical
Bivariate Analysis
a statistical analysis of two variables that assesses the empirical relationship.
Chi-square
a test of statistical significance determining the degree of confidence that an association between two categorical variables did not occur by chance. The greater the value of the chi-square, the more likely it did not occur by chance. e.g., if males and females respond differently on a yes/no question, the chi-square can tell you the probability that this occurred by chance.
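A rough sketch of the calculation for the yes/no example, assuming a made-up 2x2 table of counts. Expected counts are (row total x column total) / N, and the statistic sums (observed - expected)^2 / expected over all cells. The closed-form p-value used here holds only for 1 degree of freedom (a 2x2 table); larger tables need a chi-square distribution table or a statistics package.

```python
import math

# Made-up observed counts: rows = sex, columns = answer to a yes/no question.
observed = {
    ("male", "yes"): 30, ("male", "no"): 20,
    ("female", "yes"): 20, ("female", "no"): 30,
}

rows = ("male", "female")
cols = ("yes", "no")
n = sum(observed.values())
row_totals = {r: sum(observed[(r, c)] for c in cols) for r in rows}
col_totals = {c: sum(observed[(r, c)] for r in rows) for c in cols}

chi_square = 0.0
for r in rows:
    for c in cols:
        # Count we would expect in this cell if the variables were unrelated.
        expected = row_totals[r] * col_totals[c] / n
        chi_square += (observed[(r, c)] - expected) ** 2 / expected

# With 1 degree of freedom, P(chi-square >= x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))
print(f"chi-square = {chi_square:.2f}, p = {p_value:.4f}")
```

With these counts every expected cell is 25, so the statistic is 4.0 and the p-value falls just under 0.05.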
T-test
a test of statistical significance determining whether the means for two groups differ from one another beyond what would occur by random chance. e.g., suppose you are sick one time, take homeopathic medicine and recover in 2 days; the next time you are sick you take a prescription and get better in a week. A few of your friends report the same thing. A t-test compares the means of the two groups and tells you the probability of those results happening by chance.
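A sketch of the t statistic for the recovery-time example. The recovery days are invented for illustration, and the formula shown is Welch's version of the two-sample t-test (it does not assume equal variances), which is one common choice and not necessarily the one used in the chapter. Turning the statistic into a probability needs the t distribution, which is not computed here.

```python
import math
import statistics

# Made-up recovery times in days for two groups of people.
homeopathic_days = [2, 3, 2, 4, 3]
prescription_days = [7, 6, 8, 7, 5]


def welch_t(group_a, group_b):
    """Difference in means divided by its estimated standard error."""
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)
    standard_error = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return mean_diff / standard_error


t_stat = welch_t(homeopathic_days, prescription_days)
print(f"t = {t_stat:.2f}")
```

The further t is from zero, the less plausible it is that the two group means differ only by random chance.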
Bar Graph
a visual depiction using bars to represent nominal or ordinal variable frequency distributions.
Box-and-whisker plot or boxplot
a way of representing dispersion and central tendency based on the median and the interquartile range
Mean
average score in a frequency distribution; calculated by adding the scores and then dividing the total by the number of cases. - can be used with continuous data only or it won't make sense
Range
basic measure of dispersion; the highest score minus the lowest score in a frequency distribution. - shows us the spread of data but cannot tell us how dispersed the scores are within the distribution - very susceptible to outliers (extremely low or high scores) and therefore isn't always an accurate measure.
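A small sketch of the range and its sensitivity to outliers, using made-up scores; the helper name score_range is an assumption.

```python
def score_range(values):
    """Highest score minus lowest score."""
    return max(values) - min(values)


scores = [10, 12, 11, 13, 12]          # made-up, tightly clustered scores
with_outlier = scores + [95]           # same scores plus one extreme value

print(score_range(scores))             # 3
print(score_range(with_outlier))       # 85
```

A single extreme score moves the range from 3 to 85 even though the other five scores are unchanged, which is why the range alone can mislead.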
Data coding
coding the data involves transforming them from a raw state into a format that can be analyzed. e.g., a nominal variable for sex is straightforward and can be coded as 0 for F and 1 for M. - always enter data at the highest level of measurement possible b/c you can always collapse interval/ratio data into a categorical variable later but cannot do the opposite. - in open-ended surveys, you need to identify patterns in order to create coding categories e.g., creating an "other" category
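A sketch of both ideas: coding a nominal variable as numbers, and collapsing ratio-level age into categories after entry. The 0/1 codes and the 40 to 65 "middle-aged" band come from these notes; the other two age labels and the raw records are assumptions made for illustration.

```python
SEX_CODES = {"F": 0, "M": 1}


def code_age_group(age):
    """Collapse age in years into a categorical variable."""
    if age < 40:
        return "under 40"       # hypothetical label
    if age <= 65:
        return "middle-aged"    # ages 40 to 65
    return "over 65"            # hypothetical label


# Made-up raw records: (sex, age in years). Age is entered as a number,
# the highest level of measurement, and only collapsed afterwards.
raw = [("F", 23), ("M", 40), ("F", 65), ("M", 71)]
coded = [(SEX_CODES[sex], age, code_age_group(age)) for sex, age in raw]

for row in coded:
    print(row)
```

Because the exact age is kept alongside the category, the bands can be redrawn later; had only "middle-aged" been entered, the original ages could not be recovered.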
Data cleansing
data cleansing is an important process b/c data entry errors occur and lead to problems of validity. Techniques: - recode 10% of the data; if any mistakes are found, the entire dataset needs to be checked for invalid codes - create a list of all values assigned to cases for that variable and look for entries that don't belong - check for impossible categories by comparing the coding for two variables (cross-tabulation) e.g., an impossible code for sex and pregnancy would be a pregnant man
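A sketch of two of these checks on a made-up dataset: flagging values that are not valid codes, and flagging impossible combinations across two variables. The field names and the 0/1 pregnancy code are assumptions; sex is coded 0 for F and 1 for M.

```python
VALID_SEX_CODES = {0, 1}  # 0 = F, 1 = M

# Made-up records; "pregnant" uses a hypothetical 0 = no, 1 = yes code.
records = [
    {"id": 1, "sex": 0, "pregnant": 1},
    {"id": 2, "sex": 1, "pregnant": 0},
    {"id": 3, "sex": 7, "pregnant": 0},  # 7 is not a valid code for sex
    {"id": 4, "sex": 1, "pregnant": 1},  # impossible: pregnant man
]

# Check 1: entries that don't belong among the values allowed for a variable.
invalid_ids = [r["id"] for r in records if r["sex"] not in VALID_SEX_CODES]

# Check 2: impossible categories found by comparing two variables.
impossible_ids = [
    r["id"] for r in records if r["sex"] == 1 and r["pregnant"] == 1
]

print("invalid sex codes:", invalid_ids)
print("impossible combinations:", impossible_ids)
```

Flagged cases would then be compared against the original questionnaires or source records before being corrected.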
Negative or left skew
distributions with a long tail on the left side and a large number of high scores; from lowest to highest, the order is mean, median, mode.
Positive or right skew
distributions with a long tail on the right side and a large number of low scores; from lowest to highest, the order is mode, median, mean.
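A quick check of that ordering on a made-up right-skewed set of scores (many low values, one very high value), using Python's standard statistics module.

```python
import statistics

# Made-up scores with a long tail on the right.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 10]

mode = statistics.mode(right_skewed)      # most frequent value: 2
median = statistics.median(right_skewed)  # middle of the sorted values: 2.5
mean = statistics.mean(right_skewed)      # pulled up by the 10: 3.375

print(mode, median, mean)
```

The single high score drags the mean upward while the mode and median barely move, giving mode < median < mean.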
Scattergram or scatterplot
graphical representation of scores for two continuous variables distributed across the range of all possible values; used to indicate a relationship's strength and direction
A symmetrical distribution
has the most scores in the middle, leading to an equivalent length of tails on both sides. the mean, median and mode are the same value.
Negative (or inverse) linear relationship
an increase in one variable results in a decrease in the other e.g., the more cigarettes you smoke, the worse your health becomes OR the more you drive, the less gas you have
Frequency distribution
initial step of univariate analysis; summarizes the number and percentage of cases that fall within variable categories in tables
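A minimal sketch of a frequency distribution table built with the standard library; the offence categories and cases are made up for illustration.

```python
from collections import Counter

# Made-up sample: one offence type recorded per case.
offences = ["theft", "assault", "theft", "fraud", "theft"]

counts = Counter(offences)
total = len(offences)

print(f"{'category':<10}{'n':>3}{'%':>8}")
for category, count in counts.most_common():
    print(f"{category:<10}{count:>3}{100 * count / total:>8.1f}")
```

Each row gives the number and the percentage of cases in a category, which is the summary a frequency distribution is meant to provide.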
Measure of dispersion
measures of central tendency provide information on the typical case, whereas measures of dispersion give us information on the variability within the distribution - the extent to which scores vary
What are the 3 measures of central tendency?
measures of central tendency are statistics that summarize where the numbers in a distribution are clustered. 1. mode 2. median 3. mean
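All three measures can be computed with Python's standard statistics module. The scores below are made up; with an even number of cases, the median is the average of the two middle scores.

```python
import statistics

scores = [3, 5, 5, 6, 8, 9]  # made-up scores, already sorted

mode = statistics.mode(scores)      # most frequent value
median = statistics.median(scores)  # average of the two middle scores, 5 and 6
mean = statistics.mean(scores)      # sum of scores / number of cases

print(f"mode={mode} median={median} mean={mean}")
```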
What measure of central tendency do you use with skewed distributions?
median or the results will be misleading
Standard deviation
most commonly used measure of dispersion for continuous variables; an estimate of how widely the scores are spread around the mean in a frequency distribution. - low values of standard deviation mean that the scores are close to the mean - high levels of standard deviation mean the scores are more spread out - using the Standard Deviation we have a "standard" way of knowing what is normal, and what is extra large or extra small.
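A sketch comparing two made-up sets of scores that share the same mean but differ in spread. statistics.pstdev treats the scores as the whole population; statistics.stdev (dividing by n - 1) would be the usual choice when the scores are a sample.

```python
import statistics

# Two made-up distributions, both with a mean of 5.
spread_out = [2, 4, 4, 4, 5, 5, 7, 9]
clustered = [4, 5, 5, 5, 5, 5, 5, 6]

sd_spread = statistics.pstdev(spread_out)
sd_clustered = statistics.pstdev(clustered)

print(f"spread out: mean={statistics.mean(spread_out)} sd={sd_spread}")
print(f"clustered:  mean={statistics.mean(clustered)} sd={sd_clustered}")
```

The mean alone cannot tell these two distributions apart; the standard deviation (2.0 versus 0.5) can.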
Mode
most frequent value found in a frequency distribution - only one that can be used with nominal variables e.g., sex. - most imprecise measure of central tendency (think what is in fashion, most ppl wear)
Rate
number of incidents divided by size of population (or other similar denominator); a standardized value assigned to a frequency to allow comparisons between groups on the particular measure. e.g., City A has 1,000 assaults in a population of 50,000 and City B has 1,200 in a population of 100,000. Problem is that we cannot compare the raw counts b/c the cities don't have the same population. Divide: 1,000 / 50,000 = 0.02 or 2% and 1,200 / 100,000 = 0.012 or 1.2%. Therefore City A has more assaults per capita than City B.
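The same comparison as a short sketch, using the city figures from the example. Expressing the rate per 100,000 people is an assumption here (a common convention for crime rates), and the helper name rate_per is hypothetical.

```python
def rate_per(incidents, population, per=100_000):
    """Incidents per `per` people, so groups of different sizes can be compared."""
    return incidents * per / population


city_a = rate_per(1_000, 50_000)    # 2000.0 assaults per 100,000
city_b = rate_per(1_200, 100_000)   # 1200.0 assaults per 100,000

print(city_a, city_b)
```

City B has more assaults in raw numbers, but City A has the higher rate once population is taken into account.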
Non-linear or curvilinear relationship
occurs when there is a positive linear relationship that becomes negative at a certain threshold or vice versa e.g., age-crime curve: as age increases, so does the amount of criminal offending. However, beyond the age of 25, the amount of criminal offending decreases.
Type II or beta error
occurs when we assert that there is no relationship between two variables but one actually exists; the acceptance of a null hypothesis when it is false. This is like finding a guilty person not guilty. - probability of committing a type II error decreases as the number of cases increases
Type I or alpha error
occurs when we incorrectly conclude that two variables are related when the association actually occurred by random chance; the rejection of a null hypothesis when it is true. This is like finding an innocent person guilty.
Interquartile range (IQR)
removes the top and bottom quarters; summarizes the data using the middle 50% of the scores.
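A sketch of the IQR on made-up scores using statistics.quantiles from the standard library (Python 3.8+). Several conventions exist for locating quartiles, so other tools may report slightly different values for the same data; the default method of statistics.quantiles is used here.

```python
import statistics

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # made-up scores

# n=4 returns the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

print(f"Q1={q1} median={q2} Q3={q3} IQR={iqr}")
```

Because the top and bottom quarters are ignored, replacing the 11 with an extreme value such as 500 would leave the IQR unchanged, unlike the range.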
What is the difference b/t statistical and substantive significance?
statistical significance can tell us whether the observed values didn't occur by chance, but it can't tell us whether it is something worth worrying about in the first place. - substantive significance asks how big the effect is - no objective test for substantive significance; it is judged using common sense, previous research and theory e.g., two drugs each have a statistically significant relationship with lifespan, but the substantive significance differs when one drug adds one hour to your lifespan and the other adds five years
Descriptive statistics
statistics that tell us something about the characteristics of some sample or the relationship b/t variables - summarize a set of observations
Univariate analysis
summarize data from one variable, using tables, graphs, frequency distributions and statistical measures of central tendency, dispersion and association. - every quantitative analysis starts with univariate analysis to look at one variable at a time - information on the distribution of data is then used to inform how data are used in bivariate and multivariate analyses.
Statistical significance
tells us the chances that a relationship found in the sample occurred by chance and does not exist in the population - stated as a level of confidence that the results are not caused by chance. - no absolute certainty about the true difference b/t sample statistics and population parameters i.e., we can only be 95% certain that the results did not occur by chance - to do so we test the null hypothesis, which says that there is no relationship b/t two variables
Regression line
the line running through points on a scattergram that shows the direction of the relationship; used to understand the relationship between an independent and dependent variable
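A sketch of how a regression line and a correlation coefficient could be computed by hand for made-up (x, y) pairs. It assumes an ordinary least-squares line and Pearson's r, which are the usual choices but are not named in these notes.

```python
import math

# Made-up paired scores: x = independent variable, y = dependent variable.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Sums of cross-products and squared deviations from the means.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

slope = sxy / sxx                     # direction and steepness of the line
intercept = mean_y - slope * mean_x   # value of y where the line crosses x = 0
r = sxy / math.sqrt(sxx * syy)        # strength of the linear relationship

print(f"y = {intercept:.2f} + {slope:.2f}x, r = {r:.2f}")
```

A positive slope corresponds to a positive linear relationship, and the closer r is to 1 or -1, the closer the dots sit to the line.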
Median
the middle point found in the distribution when the values are listed from lowest to highest. - breaks distribution into two equal halves - if there is an even number of scores, the median is the sum of the two middle scores divided by 2
Scattergram of no linear relationship
x's randomly scattered, indicating that two variables don't co-vary in any predictable way.