Chi Square
Layered Cross-tabulation: A simple form of multivariate analysis
Adding a third variable to a cross-tabulation is easy enough to do, provided that:
• There are enough cases in the sample to sustain the extra cells created by the resulting table (sample size affects your decision here)
• No category of the variable used to create the new layers has too few cases
Using more than one layering variable will usually result in too few cases for an analysis.
Assumptions
Inferential tests have at least a few (if not more) assumptions.
• Assumptions defined: a set of criteria the data must meet for the test to work properly
• Some tests have more assumptions than others
• Some tests' assumptions are more difficult to meet than others'
• We must "screen" the data to make sure assumptions are met in order to determine the reliability of the results, i.e., execute a pre-test checklist (when possible)
Example of a layered table: Bivariate one-tailed hypothesis
H1: Gang affiliation results in greater parole failure. Check the table below: is this true?
                   Failure   Success
Gang affiliation   57%       43%
No affiliation     70.8%     29.2%
Yes, it is true.
Okay, the hypothesis is supported...
...BUT gang-affiliated parolees are likely to be younger. Are these results actually a product of age and not gang affiliation?
Let's layer the table using age (which another analysis shows is significantly related to both parole outcome and gang membership).
• It appears that age does have an effect on the relationship between gang affiliation and parole outcome
• The relationship "disappears" (is not significant) for parolees over 40
• All other parolees retain the gang/outcome relationship; thus gang affiliation results in greater parole failure except for older parolees
So, apart from parolees over 40, our original gang affiliation/parole failure hypothesis holds up even when we control for age.
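The layering idea above can be sketched in Python: the same gang-by-outcome table is built once per age group, and each layer gets its own chi-square. The counts, the age labels, and the helper name `chi_square` are invented for illustration (they are not the figures from the class example), and this is only a rough sketch, not what SPSS does internally.

```python
def chi_square(table):
    """Return the Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count = row marginal x column marginal / N
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = gang / no gang, columns = failure / success.
layers = {
    "under 40": [[30, 10], [20, 20]],
    "over 40": [[10, 10], [10, 10]],
}

# One chi-square per layer of the third (layering) variable.
layer_results = {age: chi_square(table) for age, table in layers.items()}
for age, stat in layer_results.items():
    print(f"{age}: chi-square = {stat:.2f}")
```

In these made-up data the statistic is about 5.33 for the younger layer and 0 for the older layer, which is the kind of pattern described as the relationship "disappearing" in one layer.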
Expected Cell Counts: Expected count = (row marginal × column marginal) ÷ N
Steps in Interpreting a Chi-Square:
1. Propose a hypothesis (either one- or two-tailed)
2. Examine the table
• Determine the apparent direction of the relationship (this is important for a one-tailed hypothesis)
• Does the direction match your hypothesis? If not, stop and reject your hypothesis. If yes, continue by examining the probability.
3. Look at the probability level associated with the chi-square value
• If it is equal to, or smaller than, your alpha level, then accept your research hypothesis and reject the null hypothesis
• If the p value is larger than your alpha level, then reject your research hypothesis and accept (fail to reject) the null hypothesis
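The decision steps above can be written as a small Python function. The function name `interpret_chi_square`, its parameters, and the returned wording are hypothetical choices for this sketch; the default alpha of .05 is simply the conventional level.

```python
def interpret_chi_square(p_value, alpha=0.05, direction_matches=True):
    """Apply the decision steps for a chi-square result.

    direction_matches only matters for a one-tailed hypothesis; leave it
    as True for a two-tailed hypothesis.
    """
    if not direction_matches:
        # The table runs opposite to the prediction: stop here.
        return "reject research hypothesis (direction does not match)"
    if p_value <= alpha:
        return "reject null hypothesis"
    return "retain null hypothesis"

print(interpret_chi_square(0.830))  # p larger than alpha
print(interpret_chi_square(0.004))  # p smaller than alpha
```

Note that a p value exactly equal to alpha counts as significant here, matching the "equal to, or smaller than" rule.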
Basic Assumptions of χ2
• Random sampling has been used (no random sampling, no use of χ2)
• Observations on all variables are independent: no related variables (pre-test/post-test), matched samples, or time series data
• Large samples have been used
• The independent variable marginals are equal: don't use χ2 if one of the 2x2 marginals has more than 80% of the data (common in comparing men and women in criminal justice populations)
• No cell has an expected value less than 5
If these assumptions aren't met, error is present in the results.
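Two of these assumptions (no expected count below 5, no marginal holding more than 80% of the cases) can be screened directly from the table of counts. The sketch below is a minimal pre-test check in Python; the function name `screen_table`, the example counts, and the choice to treat the rows as the independent variable are all assumptions made for illustration. Random sampling and independence of observations cannot be checked this way.

```python
def screen_table(table, min_expected=5, max_share=0.80):
    """Screen a table of observed counts against two chi-square assumptions.

    Assumes (for this sketch) that the rows hold the independent variable.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count = row marginal x column marginal / N
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    low_cells = sum(1 for row in expected for e in row if e < min_expected)
    largest_share = max(row_totals) / n
    return {
        "min_expected": min(e for row in expected for e in row),
        "low_cells": low_cells,
        "largest_row_share": largest_share,
        "marginals_ok": largest_share <= max_share,
    }

# Hypothetical 2x2 table in which one row holds most of the cases.
report = screen_table([[90, 10], [8, 2]])
print(report)
```

For this invented table, one cell has an expected count of about 1.09 and the larger row holds about 91% of the cases, so both checks fail.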
Why Do Multivariate Analysis?
Relationships between two variables may not be real (spuriousness)
○ Both variables may be caused by a third variable
○ A third variable may mediate the relationship between the two
○ A third variable may also cause the dependent variable
Multivariate analysis is designed to help detect these situations
degrees of freedom
The previous table had a column, just prior to the probability, labeled "df"; that stands for "degrees of freedom."
Degrees of freedom means the number of scores that are free to vary before the remaining scores are fixed
• Put another way, the df is how many scores you have to know before you can guess the remaining ones
• So, df is actually a way to control for the effect of the number of scores on the size of the statistic value
The Chi-Square Test (χ2)
• An inferential test used primarily to analyze categorical data via a cross-tabulation of two variables
• A nominal-level test for significance of difference OR relationship
• As such, it can be used for all levels of measurement
• Most appropriate for categorical levels of measurement
Two approaches for χ2 use
• Homogeneity - univariate examination of whether categories have the same number of cases or different ones (a.k.a. goodness of fit or single-sample)
• Test for independence - bivariate test examining whether a relationship exists between two variables OR there is a difference in the distribution of one variable across the categories of another variable
Violating χ2 Assumptions?
When χ2 assumptions are violated, the reported probability (random error) values are smaller than they really are.
Result: You might think something is real when it is not.
• This is called a Type I error, and is considered to be egregious
• Type I errors are "false positives," suggesting that you have a significant difference when there isn't one... so you reject the null hypothesis when it should be accepted
• Type II errors are "false negatives," where you accept the null hypothesis when it should be rejected
This crosstabulation is a 2 x 2 table. When dealing with a 2 x 2 crosstab, Fisher's Exact Test is the best option for testing the significance of a relationship and/or testing the null hypothesis. To answer the following question you will need to run a layered crosstab with chi-square tests in SPSS. Use the Crime Survey - short SPSS file that was used above. Set up the crosstabulation so that "v64" (ARE YOU INVOLVED IN CRIME WATCH PROGRAM) is the row variable and "police" (Rated police performance - 3 category) is the column variable. Assign "edu3" (Education levels - 3 category) as the layering variable.
At the bivariate level, there is a relationship between being involved in crime watch and ratings of police performance. Among those with medium levels of education, crime watch is related to ratings of police performance. While there is a general relationship between being involved in crime watch and ratings of police performance, the relationship varies among individuals with different education levels.
Example in class: race and whether crime watch helped relations with police
H1 (two-tailed hypothesis): Race affects perceptions of crime watch helping relations with police.
H0: Race does not affect perceptions of crime watch helping relations with police.
Chi-Square Tests
                               Value   df   Asymptotic Significance (2-sided)
Pearson Chi-Square             .879a   3    .830
Likelihood Ratio               .846    3    .838
Linear-by-Linear Association   .213    1    .645
N of Valid Cases               329
a. 1 cell (12.5%) has an expected count less than 5. The minimum expected count is 2.07.
You would accept the null hypothesis, since the Pearson probability (.830) is far larger than an alpha of .05. However, one cell has an expected count less than 5, which violates one of the assumptions of the chi-square test. There is also a marginal violation (unequal marginals) in this table. With two violations, we would not be confident in the Pearson chi-square result.
The following are examples of situations in which a chi-square contingency table analysis would be appropriate.
• Criminal behavior and alcohol drinking. A study compares types of crime and whether the criminal is a drinker or abstainer.
• Is there a gender preference? An analysis is undertaken to determine whether there is a gender preference between candidates running for state governor.
• Job training and dropout rates. Reviewers want to know whether worker dropout rates are different for participants in two different job-training programs.
• Analysis of questionnaire response rates. A marketing research company wants to know whether there is a difference in response rates among small, medium, and large companies that were sent a questionnaire.
What If χ2 Can't Be Used? Picking the Appropriate χ2 Test
If there are problems with the use of traditional χ2 (Pearson's χ2), use these alternatives:
1. If the table is 2x2, use Fisher's Exact Test
• Provides a direct probability, with both one- and two-tailed probabilities
• SPSS automatically generates this when using a 2x2 cross-tabulation... it is up to you to report the correct result
2. If the table is larger than 2x2, use a Maximum Likelihood χ2 (the Likelihood Ratio row in SPSS output)
• It doesn't use the marginals, nor require large samples
• If SPSS produces this result, use it
• It is likely produced when marginals are unequal or there is some other issue that reduces confidence in the Pearson χ2 result
• It is more desirable than Pearson's χ2, but far less often reported in research studies
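To show what a "direct probability" means for a 2x2 table, here is a minimal Python sketch of Fisher's Exact Test built from hypergeometric probabilities. The function name `fisher_exact_2x2` and the example counts are invented for illustration. The two-tailed value uses one common definition (add up every possible table that is no more likely than the observed one), which may not match every package's output in every case.

```python
from math import comb

def fisher_exact_2x2(table):
    """Return (one_tailed_p, two_tailed_p) for a 2x2 table of counts.

    The one-tailed value is the probability of a top-left count at least as
    large as the one observed, with all marginals held fixed.
    """
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x cases in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    one_tailed = sum(prob(x) for x in range(a, highest + 1))
    # Two-tailed: add up every table no more likely than the observed one
    # (the small tolerance guards against floating-point ties).
    two_tailed = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return one_tailed, two_tailed

# Hypothetical table: 3 of 3 in the first row, 0 of 3 in the second.
one_p, two_p = fisher_exact_2x2([[3, 0], [0, 3]])
print(round(one_p, 3), round(two_p, 3))
```

For this tiny made-up table the one-tailed probability is .05 and the two-tailed probability is .10, which illustrates why it matters which of the two you report.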
statement when analyzing results from chi-square
In this instance, the null hypothesis was accepted. An association between being forced to have sex and having a weapon used against them was not significant, χ² (2, N = 1,629) = .333, p = .847.
Narrative for the Methods Section
"A chi-square test was performed to test the hypothesis of no association between exposure and reaction."
Narrative for the Results Section
"A higher proportion of the exposed group showed a reaction to the reagent, χ2 (1, N = 40) = 8.29, p = 0.004."
Or, to be more complete,
"A higher proportion of the exposed group (65% or 13 of 20) showed a reaction to the reagent than did the nonexposed group (20% or 4 of 20), χ2 (1, N = 40) = 8.29, p = 0.004."
Narrative for the Methods Section
"A chi-square test was performed to test the null hypothesis of no association between type of crime and incidence of drinking."
Narrative for the Results Section
"An association between drinking preference and type of crime committed was found, χ2 (5, N = 1,426) = 49.7, p < 0.001."
Or, to be more complete,
"An association between drinking preference and type of crime committed was found, χ2 (5, N = 1,426) = 49.7, p < 0.001. Examination of the cell frequencies showed that about 70% (144 out of 207) of the criminals convicted of fraud were abstainers, while the percentage of abstainers in all of the other crime categories was less than 50%."
Two separate sampling strategies lead to the chi-square contingency table analysis discussed here.
Test of independence. A single random sample of observations is selected from the population of interest, and the data are categorized on the basis of the two variables of interest. For example, in the marketing research example above, this sampling strategy would indicate that a single random sample of companies is selected, and each selected company is categorized by size (small, medium, or large) and whether that company returned the survey. In this case, you have two variables and are interested in testing whether there is an association between the two variables. Specifically, the hypotheses to be tested are the following:
H0: There is no association between the two variables.
Ha: The two variables are associated.
Test for homogeneity. Separate random samples are taken from each of two or more populations to determine whether the responses related to a single categorical variable are consistent across populations. In the marketing research example above, this sampling strategy would consider there to be three populations of companies (based on size), and you would select a sample from each of these populations. You then test to determine whether the response rates differ among the three company types. In this setting, you have a categorical variable collected separately from two or more populations. The hypotheses are as follows:
H0: The distribution of the categorical variable is the same across the populations.
Ha: The distribution of the categorical variable differs across the populations.
Formula for Tabular df
The formula for a tabular df for chi-square is: (r - 1)(c - 1)
r = number of rows
c = number of columns
So, the df of a 2x2 table is always (2 - 1)(2 - 1) = 1
But the good news is that, unless you have to calculate χ2 and use a back-of-the-book table, you still only need to look at the probability
Chi Square: Chi-square tests compare the actual number of cases in a category or cell to the number that is expected.
• Example: If we expect a uniform distribution in a variable with 4 values and N = 100, then the number of cases in each category should be:
Value 1 = 25
Value 2 = 25
Value 3 = 25
Value 4 = 25
Compare expected counts to actual counts
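The uniform example can be carried one step further in Python by comparing the expected 25 per category with a set of observed counts. The observed counts and the function name `goodness_of_fit` below are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def goodness_of_fit(observed, expected=None):
    """Return (chi_square, df) for a single-sample chi-square.

    If expected is omitted, a uniform distribution is assumed.
    """
    n = sum(observed)
    k = len(observed)
    if expected is None:
        expected = [n / k] * k
    # Sum of (O - E)^2 / E over the categories.
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, k - 1

# Hypothetical observed counts for 4 values, N = 100 (expected 25 each).
gof_stat, gof_df = goodness_of_fit([30, 20, 28, 22])
print(f"chi-square = {gof_stat:.2f}, df = {gof_df}")
```

Here the differences from 25 are 5, -5, 3, and -3, so the statistic is (25 + 25 + 9 + 9) / 25 = 2.72 with k - 1 = 3 degrees of freedom.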
Using χ² (spelled "chi-square" and pronounced "ki skwar") is an extension of the bivariate crosstabulation. Chi-square compares the actual, or observed, raw count of each cell to the count that is expected to be found in that cell. The larger the differences between the observed and expected counts across the cells, the greater the likelihood that there will be a significant relationship.
χ2 formula: χ² = Σ (O - E)² / E
The quantity (O - E)² / E is calculated for every cell, and the results are then added together to give χ².
Where:
O = the observed frequency of a cell
E = the expected frequency of a cell
So, you are comparing what was observed (the collected data) to what is expected. The greater the differences between O and E, the more likely it is that the χ2 test result will be significant.
With: degrees of freedom (df) = (r - 1)(c - 1)
df represents a control for the number of cells in the table, or the number of categories in a single sample (k - 1)
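As a worked check of the formula, the Python sketch below recomputes the exposure/reaction result quoted earlier in these notes (13 of 20 exposed and 4 of 20 nonexposed showed a reaction). The function name `chi_square_test` is hypothetical, and the p value shortcut used here is valid only when df = 1; for larger tables you would read the probability from software or a table.

```python
from math import erfc, sqrt

def chi_square_test(table):
    """Return (chi_square, df, expected_counts) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count = row marginal x column marginal / N
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected

# Rows: exposed, nonexposed. Columns: reaction, no reaction.
observed = [[13, 7], [4, 16]]
stat, df, expected = chi_square_test(observed)

# For df = 1 only, the p value has a closed form via the normal distribution.
p_value = erfc(sqrt(stat / 2))
print(f"chi-square({df}, N = 40) = {stat:.2f}, p = {p_value:.3f}")
```

The expected counts come out to 8.5 and 11.5 in each row, and the result agrees with the reported χ2 (1, N = 40) = 8.29, p = 0.004.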
Chi-square test: The asymptotic significance (2-sided) for 3-category household income by 3-category education was reported as .000. The finding is significant, since it is lower than our usual alpha level of .05.
A reported significance of .000 is stated as the probability being less than .001 (p < .001). Looking at race and household income: percentage-wise, white respondents are evenly distributed across low, medium, and high household income, whereas Black and Hispanic respondents are mostly in the low household income category. However, the number of white respondents is 845, whereas there are 116 Black respondents and 179 Hispanic respondents. This is called unequal marginals; it violates an assumption of the chi-square test and means we don't have a lot of confidence in the Pearson chi-square result. So we would go to the likelihood ratio and look at its asymptotic significance.