Soc 200 Ch 12


Introductory overview: the process of quantitative analysis

pg 335 (figure): the process as a whole is most applicable to the analysis of survey and existing (quantitative) data. Just like the other steps of research, there are variations depending on the data collection approach; moreover, there are substeps within each step.

1. The step of preparing data for computerized analysis = "data processing" - overlaps with data collection. To conduct quantitative analyses, the information gathered in a survey, for example, must be quantified and transcribed to a computer-readable data file. In addition, the data file should be as error-free as possible.

- Researchers analyzing existing data (whether from a survey, such as the GSS, or another source) begin their analysis at the second step: data inspection and data modification.
- *The goal of inspection* is to get a clear picture of the data in order to determine appropriate statistical analyses and necessary data modifications. Reasons for data modification are many: ex. a researcher may want to combine the responses to several items in order to create an index or scale (see Chapter 11), change one or more of the values for a variable, or collapse categories for purposes of analysis. As described below, Singleton combined answers to two survey questions to create a measure of alcohol consumption.
- The analysis then turns to empirical testing of the hypothesized relationships. For simple two-variable (bivariate) hypotheses, the analyst determines whether the association between the independent and dependent variables confirms theoretical expectations. (ex. Singleton found that alcohol consumption was negatively associated with GPA, as predicted. In a *true experiment*, assessing the relationship between the independent and dependent variables is often the final analysis step because an adequate design effectively controls extraneous variables.)
- In *nonexperimental designs*, extraneous variables may pose serious rival explanations that require statistical control, thereby leading to another analysis step: conduct multivariate testing. If preliminary hypothesis testing supports theoretical expectations, the analyst formulates multivariate models to rule out, to the extent possible, that the initial results were a spurious consequence of uncontrolled antecedent variables. Singleton had to examine the possibility, for example, that students' academic aptitude creates a spurious association between drinking and grades. Conversely, if hypothesized relationships are not supported in preliminary testing, the researcher designs multivariate models to determine if uncontrolled extraneous variables are blocking or distorting the initial results.
- *Quantitative data analysis = deductive logic of inquiry* - it may also follow the inductive logic of inquiry [ex. the preliminary testing step may reveal unanticipated (serendipitous) findings that suggest alternative multivariate models].
- *The process of quantitative analysis as a whole involves transforming raw quantitative data into statistics.* It is at the second step, data inspection and modification, that statistics enters in. (Researchers draw upon 2 broad types of statistics: descriptive & inferential.)
- *Descriptive stats*: organize and summarize data to make them more intelligible (the high, low, and average scores on an exam are descriptive stats that readily summarize a class's performance). Descriptive stats are essential for data inspection.
- *Inferential stats*: are used to estimate population characteristics from sample data and to test hypotheses. Both descriptive and inferential stats are used in the last two steps.
- Overall ex.: Singleton needed a descriptive statistic that would summarize the degree of association b/w alcohol consumption and academic achievement in his sample. He also needed inferential stats to test his hypothesis and to determine whether the association based on sample data applied to ALL students at the college.

nominal- and ordinal-scale variables (bivariate data)

When the *variables analyzed have only a few categories, as in most nominal- and ordinal-scale measurement*, bivariate data are presented in tables. The tables constructed are known as cross-tabulations or contingency tables. A *cross-tabulation* requires a table with rows representing the categories of one variable and columns representing the categories of another. When a dependent variable can be identified, it is customary to make this the row variable and to treat the independent variable as the column variable.

Let us first consider the cross-tabulation of two nominal-scale variables from the 2012 GSS shown in Table 12.4. The row variable consists of "attitude toward gun control" or, more precisely, whether the respondent favors or opposes a law that would require a person to obtain a police permit before he or she could buy a gun. The column variable is "sex." Sex is clearly the independent variable in this relationship.

What sort of information does this table convey? First, notice that the last column and the bottom row, each labeled "Total," show the total number of respondents with each single characteristic, for example, 569 males. Because these four numbers (944, 339, 569, 714) are along the right side and the bottom margin of the table, they are called *marginal frequencies*, or marginals [the row marginals (944, 339) are the univariate frequency distribution for the variable "attitude toward gun control"; the column marginals (569, 714) are the univariate frequency distribution for the variable "sex"; ALSO, the # at the lower right-hand corner (1283) is N, the total sample size excluding missing cases]. N equals either the sum of the row or column marginals or the sum of the four numbers (377, 567, 192, 147) in the body of the table.

pg. 370 - body of the table: where the categories of the 2 variables intersect; it contains the bivariate frequency distribution (each intersection is called a "cell", and the # in each cell is called the *cell frequency*). Cell frequencies in a bivariate table indicate the # of cases w/ each possible combination of the two characteristics - for ex., there were 377 males who favored gun control [b/c Table 12.4 has two rows and two columns, it is referred to as a 2 x 2 table].

Now that we know the meaning of the numbers in a cross-tabulation, how do we analyze these numbers to assess the relationship between the variables? With sex as the independent variable in Table 12.4, we can assess the relationship by examining whether males or females are more likely to favor (or oppose) gun control. To determine the "likelihood" that a male or female either supports or opposes gun control, we need to create separate percentage distributions for males and females [doing this converts each column total to 100 percent, so that the cell values are based on the same total] = bivariate percentage distribution, presented as Table 12.5 - now we can better compare responses across sex (pg. 371).

A bivariate percentage distribution enables one to compare the distribution of one variable across the categories of the other [in Table 12.5, we created such a distribution by percentaging down so that the column totals, corresponding to the categories of the independent variable, equaled 100 percent; the rule we followed in deriving this table is to compute percentages in the direction of the independent variable - based on the categories of the ind. variable]. To interpret the relationship in Table 12.5, we compared percentages by reading ACROSS the table - in doing so, we followed a 2nd rule: make comparisons in the opposite direction from the way percentages are run. Cross-tabulations can be easily misinterpreted.

As you read across Table 12.5, you see that there is a diff. of 13.1 percent between the females and the males who favor gun control. This "percentage diff." indicates that a relationship exists for these data; if there were no diff. b/w the percentages, we would conclude that no relationship exists. REMEMBER, however, that these are sample data - the important question is not whether a relationship exists in these data; rather, do the observed cell frequencies reveal a true relationship between the variables in the population, or are they simply the result of sampling and other random error? To answer this question, you need to understand the logic of tests of statistical significance.
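The two percentaging rules above can be sketched in pure Python using the Table 12.4 cell frequencies (377, 567, 192, 147); this is an illustrative computation, not part of the chapter:

```python
# Table 12.4 cell frequencies: rows = attitude (favor/oppose), columns = sex.
table = {
    "favor":  {"male": 377, "female": 567},
    "oppose": {"male": 192, "female": 147},
}

# Rule 1: compute percentages in the direction of the independent variable.
# Sex is the column (independent) variable, so percentage DOWN each column.
col_totals = {sex: sum(row[sex] for row in table.values())
              for sex in ("male", "female")}                    # {'male': 569, 'female': 714}
pct = {attitude: {sex: 100 * count / col_totals[sex] for sex, count in row.items()}
       for attitude, row in table.items()}

# Rule 2: compare in the opposite direction from the way percentages are run,
# i.e. read ACROSS the "favor" row to get the percentage difference.
pct_diff = round(pct["favor"]["female"], 1) - round(pct["favor"]["male"], 1)
print(round(pct["favor"]["male"], 1))    # 66.3
print(round(pct["favor"]["female"], 1))  # 79.4
print(round(pct_diff, 1))                # 13.1
```

The 13.1-point gap reproduces the percentage difference reported for Table 12.5.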

Intro Ex. of survey data analysis

"Drinking and Grades" Singleton's (2007) quantitative analysis of data from the four campus surveys showed that the amount of alcohol that students reported consuming on a typical weekend night was negatively associated with their cumulative GPA, even after statistically controlling for key background variables. In other words, the more drinks a student consumed, the lower his or her grades. Below we describe in detail how Singleton analyzed the data to reach this conclusion. But first we provide an overview of the steps involved in quantitative analysis.

scatterplot

A graph plotting the values of two variables for each observation.

multiple regression

Multiple regression is simply an extension of bivariate regression to include two or more independent variables. Like the partial tables of elaboration analysis, it provides information on the impact of an independent variable on the dependent variable while controlling for the effects of other independent variables. Unlike partial tables, control is not limited to a few variables, nor is information lost from collapsing variables into fewer categories.

formula (pg. 385): Ŷ = a + b1X1 + b2X2 + ... + bkXk

In this equation, Ŷ is the predicted value of the dependent variable, each X represents an independent variable, the b values are *partial regression coefficients or partial slopes*, and a is the Y-intercept (the predicted value of the dependent variable when all of the X values are equal to zero). *The slopes in this equation differ from those in bivariate regression in that each represents the impact of the independent variable when all other variables in the equation are held constant.* This tells us the effect of a given variable beyond the effects of all the other variables included in the analysis.

- When researchers report the results of multiple regression, they present the intercept and the regression coefficient for each variable in a table. Table 12.12 shows the outcome for the campus survey of the regression of cumulative GPA on selected independent variables. Model 1 includes seven independent variables; Model 2 adds an eighth: number of drinks consumed, which is the independent variable of theoretical interest. To interpret the results of this analysis, we need to explain several features of the variables and the statistics presented in the table.
- Three of the variables ("male," "white," and "intercollegiate athlete") represent dummy variables. A *dummy variable* is a variable that is recoded into two categories that are assigned the values of 1 and 0. Dummy coding enables the researcher to manipulate the variable numerically, as in multiple regression. Thus, for the campus survey data, 1 = male and 0 = female; 1 = white and 0 = nonwhite; 1 = intercollegiate athlete and 0 = not an intercollegiate athlete.
- The last row of the table contains a statistic, *R^2*, which indicates how well the data fit the model, or how well we can predict the dependent variable based on this set of independent variables. R^2 may vary from 0 to 1. It is particularly useful for comparing models containing the same dependent variable but different sets of independent variables, such as Models 1 and 2 in Table 12.12.
- Regression coefficients (or slopes) indicate the amount of change in Y that is associated with each change of one unit in X. For example, the unstandardized regression coefficient of -.025 for number of drinks means that cumulative GPA decreases by .025 with each drink that is consumed. The problem with unstandardized regression coefficients, however, is that the units differ for each independent variable. For SAT total score, the unit is a point; for income, the unit is dollars; for amount of alcohol consumed, it is a single drink; and so on. Because the units differ, you cannot compare the slopes to determine their relative impact on the dependent variable. After all, what sense does it make to compare the impact of a single point on the SAT test w/ a single drink of alcohol?
- To compare slopes, they should be "standardized" in a way that expresses the effects in terms of a common unit; *a standard deviation is a unit common to both the ind. variables and the dep. variable.* (Table 12.13 - the statistics opposite each independent variable in Table 12.13 are standardized regression coefficients. Notice that the Y-intercept is not included in this table.) = *when standardizing the coefficients, the Y-intercept is set to 0.* Notice also that the standardized regression coefficient for number of drinks in Model 2 is -.233. This means that for every increase of one standard deviation in number of drinks, there is a decrease of .233 standard deviation in cumulative GPA. With each variable now standardized, we can compare coefficients to see which has the biggest impact. The variable with the biggest impact is SAT total score, followed by number of drinks.
- But *there are many methods of multivariate analysis other than elaboration and multiple regression*. For example, there are special techniques for modeling different types of dependent variables (e.g., dichotomous, nominal, and ordinal measures), for straightening or transforming nonlinear relationships into a form suitable for linear modeling, for modeling mutual causation, for processes occurring over time, and so forth.
- All forms of regression analysis are subject to what is called a *specification error*. All regression analysis involves the specification of a model containing a set of variables. A specification error occurs when the equation leaves out important variables. For example, if an equation fails to contain an antecedent variable that is a common cause of a bivariate relationship, the results of a bivariate analysis may be misleading because the relationship is spurious. To the extent that a model is misspecified, a regression analysis will produce biased estimates of coefficients.
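The link between the unstandardized slope for number of drinks (-.025) and its standardized counterpart (-.233) can be sketched as beta = b × (sd_x / sd_y). The drinks standard deviation (3.68) is reported in the chapter; the GPA standard deviation used here is an ASSUMED value chosen for illustration so the numbers line up, since the chapter does not report it:

```python
# Standardizing a regression coefficient: beta = b * (sd_x / sd_y).
b = -0.025      # unstandardized slope: change in GPA per drink (from Table 12.12)
sd_x = 3.68     # standard deviation of number of drinks (reported in the chapter)
sd_y = 0.395    # ASSUMED standard deviation of cumulative GPA (illustrative only)

beta = b * (sd_x / sd_y)   # effect expressed in standard-deviation units
print(round(beta, 3))      # -0.233
```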

data processing

the preparation of data for analysis

interval and ratio-scale variables

The analysis of the relationship between two interval/ratio variables begins by plotting the values of each variable in a graphic coordinate system. A regression analysis is then used to determine the mathematical equation that most closely describes the data [through this equation, researchers identify statistics that show the effect of one variable on another].

ex. we begin with a scatterplot (pg. 376). Each plot in the graph represents the values of one of the 710 students for whom we have data on both these variables {with the vertical axis as our reference, we can read the value of the dependent variable (cumulative GPA); with the horizontal as our reference, we can read the value of the independent variable (semester GPA)}. The scatterplot gives the researcher a rough sense of the form of the relationship: whether it is best characterized with a straight or a curved line and whether it is positive or negative. This is crucial information because regression analysis assumes that the data have a particular form. If a straight line provides the best fit with the data, we should do linear regression; if a curve provides the best fit, we should use special techniques for fitting curvilinear relationships (which are beyond the scope of this book). - figure 12.5, pg. 376

Having decided to fit a straight line to the data, and therefore to do linear regression analysis, we need to know two things: (1) the mathematical equation for a straight line and (2) the criterion for selecting a line to represent the data.

equation for the straight line relating cumulative GPA and semester GPA (pg. 376): Ŷ = a + bX. The value a, called the Y-intercept, is the point where the line crosses the vertical axis (where semester GPA = 0). The value b, called the slope or regression coefficient, indicates how much Ŷ increases (or decreases) for every change of one unit in X - in other words, how much increase (or decrease) occurs in cumulative GPA for every change of 1 grade point in semester GPA.

- Regression analysis uses the method of least squares as the criterion for selecting the line that best describes the data. According to this method, the best-fitting line minimizes the sum of the squared vertical distances from the data points to the line. We have drawn the *regression line*, also called the "least squares line," on the scatterplot. Now imagine a dashed line showing the vertical distance, as measured by cumulative GPA, between a specific data point, say ID 150, and the regression line. The regression line represents the equation for predicting Y from X; the vertical distances between data points and this line represent prediction errors (also called *residuals*).
- pg. 378: the strength of the association between two interval/ratio variables is frequently measured by the *correlation coefficient* (symbolized as r), which may vary between -1 and +1. The sign of the coefficient, which is always the same as the sign of the regression coefficient, indicates the direction of the relationship. The magnitude of its value depends on two factors: (1) the steepness of the regression line and (2) the variation or scatter of the data points around this line. If the line is not very steep, so that it is nearly parallel to the X-axis, then we might as well predict the same value of Y for every value of X, as there is very little change in our prediction (as indicated by b in the equation) for every unit change in the independent variable. *By the same token, the greater the spread of values about the regression line (regardless of the steepness of the slope), the less accurate are predictions based on the linear regression.*
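A minimal least-squares sketch, using made-up (x, y) pairs rather than the campus survey data, shows how the slope b, the intercept a, the correlation r, and the residuals are computed:

```python
import math

# Least-squares line and correlation for two interval/ratio variables.
# The (x, y) pairs are hypothetical illustrative data, not the campus survey.
x = [2.0, 2.5, 3.0, 3.5, 4.0]   # semester GPA (made up)
y = [2.2, 2.4, 3.1, 3.3, 3.9]   # cumulative GPA (made up)

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))   # covariation of x and y
sxx = sum((xi - mx) ** 2 for xi in x)                      # variation of x
syy = sum((yi - my) ** 2 for yi in y)                      # variation of y

b = sxy / sxx                    # slope: change in Y-hat per unit change in X
a = my - b * mx                  # Y-intercept: predicted Y when X = 0
r = sxy / math.sqrt(sxx * syy)   # correlation coefficient, between -1 and +1

# Prediction errors (residuals): vertical distances from points to the line.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(round(b, 2), round(a, 2), round(r, 3))   # 0.86 0.4 0.984
```

With the intercept included, the residuals sum to zero, which is one way to see that the line "balances" the prediction errors.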

summary - pg. 390

Two methods of model testing are used to examine the effects of an independent variable on a dependent variable while controlling for other relevant independent variables. In elaboration of contingency tables, we begin with a two-variable relationship and then systematically reassess this relationship when controls are introduced for a third variable (and sometimes additional ones). Variables are controlled, or held constant, by computing partial tables, which examine the original two-variable relationship separately for each category of the control variable. Of the numerous possible outcomes, one is of particular interest: if the model specifies the control variable as causally antecedent to the other two variables and the original relationship disappears in each partial table, then the original relationship is spurious.

A better technique for analyzing the simultaneous effects of several independent variables on a dependent variable is multiple regression. The partial-regression coefficients in a multiple-regression equation show the effects of each independent variable on the dependent variable when all other variables in the equation are held constant. *Comparison of partial-regression coefficients may be facilitated by standardizing them to the same metric of standard deviation units.*

- Data inspection is facilitated by creating percentage distributions and, for interval/ratio variables, by calculating statistics that describe the central tendency, variation, and shape of a distribution.
- Bivariate relationships of nominal/ordinal variables are depicted in contingency tables.
- Examining the relationship of two interval/ratio variables often involves regression analysis: plotting the variables in a graph and then determining the best-fitting line.
- Various statistics are available to test for significance (e.g., chi-square for contingency tables) and to measure degree of association (e.g., the correlation coefficient).
- Elaboration of contingency tables systematically controls for third variables by creating partial tables in which categories of the third variable are held constant.
- Multiple regression can examine the effect of an independent variable by controlling simultaneously for several other variables.

histogram

A graphic display in which the height of a vertical bar represents the frequency or percentage of cases in each category of an interval/ratio variable

slope/regression coefficient

A bivariate regression statistic indicating how much the dependent variable increases (or decreases) for every unit change in the independent variable; the slope of a regression line.

wild-code checking

A data-cleaning procedure involving checking for out-of-range and other "illegal" codes among the values recorded for each variable.

consistency checking

A data-cleaning procedure involving checking for unreasonable patterns of responses, such as a 12-year-old who voted in the last U.S. presidential election.
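Both cleaning procedures can be sketched on hypothetical records; the field names, legal code sets, and codings below (e.g., voted: 1 = yes, 2 = no) are invented for illustration:

```python
# Wild-code and consistency checking on hypothetical survey records.
# Field names and legal code sets are ASSUMED for illustration.
LEGAL = {"sex": {1, 2}, "voted": {1, 2}, "age": set(range(0, 100))}

records = [
    {"id": 1, "sex": 1, "age": 20, "voted": 2},
    {"id": 2, "sex": 7, "age": 21, "voted": 1},   # wild code: sex = 7 is "illegal"
    {"id": 3, "sex": 2, "age": 12, "voted": 1},   # inconsistent: a 12-year-old voter
]

# Wild-code checking: flag out-of-range ("illegal") codes for each variable.
wild = [(r["id"], var) for r in records
        for var, legal in LEGAL.items() if r[var] not in legal]

# Consistency checking: flag unreasonable patterns of responses across variables
# (voted: 1 = yes, 2 = no is an assumed coding).
inconsistent = [r["id"] for r in records if r["voted"] == 1 and r["age"] < 18]

print(wild, inconsistent)   # [(2, 'sex')] [3]
```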

regression line

A geometric representation of a bivariate regression equation that provides the best linear fit to the observed data by virtue of minimizing the sum of the squared deviations from the line; also called the "least squares line."

R^2

A measure of fit in multiple regression that indicates approximately the proportion of the variation in the dependent variable predicted or "explained" by the independent variables.

correlation coefficient

A measure of the strength and direction of a linear relationship between two variables; it may vary from −1 to 0 to +1.

standard deviation

A measure of variability or dispersion that indicates the average "spread" of observations about the mean.

regression analysis

A statistical method for analyzing bivariate (simple regression) and multivariate (multiple regression) relationships among interval- or ratio-scale variables

multiple regression (def.)

A statistical method for determining the simultaneous effects of several independent variables on a dependent variable.

Box 12.1 Codebook Documentation

A survey codebook is like a dictionary in that it defines the meaning of the numerical codes for each named variable, such as the codes for sex in the campus survey. Codebooks also may contain question wording, interviewer directions, and coding and editing decision rules. - Examining a codebook and other study documentation, such as information about the sample and data collection procedures, should help you decide if a given data set will be useful for your research.

partial table

A table in elaboration analysis which displays the original two-variable relationship for a single category of the control variable, thereby holding the control variable constant. - pg 383
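Computing partial tables amounts to splitting the cases by the control variable and re-tabulating the original two-variable relationship within each category; a sketch with hypothetical cases:

```python
# Elaboration sketch: one partial table per category of the control variable.
# Each case is (independent, dependent, control); all values are hypothetical.
cases = [
    ("male", "favor", "urban"), ("male", "oppose", "urban"),
    ("female", "favor", "urban"), ("female", "favor", "rural"),
    ("male", "oppose", "rural"), ("female", "oppose", "rural"),
]

def crosstab(subset):
    """Cell frequencies for the original two-variable relationship."""
    counts = {}
    for iv, dv, _ in subset:
        counts[(dv, iv)] = counts.get((dv, iv), 0) + 1
    return counts

# Holding the control variable constant: re-tabulate within each of its categories.
partials = {cat: crosstab([c for c in cases if c[2] == cat])
            for cat in {c[2] for c in cases}}
```

If the original association vanishes within every partial table and the control variable is causally antecedent, the original relationship is judged spurious.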

frequency distribution

A tabulation of the number of cases falling into each category of a variable

chi-square test for independence (X^2)

A test of statistical significance used to assess the likelihood that an observed association between two variables could have occurred by chance
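Using the Table 12.4 frequencies, the chi-square statistic is the sum over cells of (observed − expected)² / expected, where each expected frequency under independence is (row total × column total) / N:

```python
# Chi-square test of independence for the 2 x 2 gun-control table (Table 12.4).
observed = [[377, 567],   # favor:  male, female
            [192, 147]]   # oppose: male, female

row_totals = [sum(row) for row in observed]        # [944, 339]
col_totals = [sum(col) for col in zip(*observed)]  # [569, 714]
n = sum(row_totals)                                # 1283

# Expected cell frequency under independence: (row total * column total) / N.
expected = [[row_totals[i] * col_totals[j] / n for j in range(2)] for i in range(2)]

chi_square = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
                 for i in range(2) for j in range(2))
print(round(chi_square, 1))   # 28.2, well above 3.84, the .05 critical value at 1 df
```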

dummy variables

A variable or set of variable categories recoded to have values of 0 and 1. Dummy coding may be applied to nominal- or ordinal- scale variables for the purpose of regression or other numerical analysis.
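In code, dummy coding is a simple recode; the values below match the campus-survey convention of 1 = male and 0 = female described in the chapter:

```python
# Dummy coding a nominal variable for regression or other numerical analysis.
sex = ["male", "female", "female", "male"]   # hypothetical responses
male = [1 if s == "male" else 0 for s in sex]
print(male)   # [1, 0, 0, 1]
```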

Nominal- & Ordinal-Scale Variables

At first, Singleton performed univariate analyses to observe the amount of variation in the variables to be analyzed. It is generally a good idea, for example, to see if there is sufficient variation in responses to warrant including the variable in the analysis. As a rule, the less variation, the more difficult it is to detect how differences in one variable are related to differences in another variable. (To take an extreme example, if almost all of the students in the campus survey identified their race as white, it would be impossible to determine how differences in race were related to differences in alcohol consumption or in academic achievement.)

One means of data inspection is to organize responses into a table called a *frequency distribution*. A frequency distribution is created by adding up the number of cases that occur for each coded category. When we used SPSS to do this for the race/ethnicity question, our output looked like that in Table 12.1 (ex. the number of "whites" in the combined sample is 642). This is impt. info; however, this # by itself is meaningless unless we provide a standard or reference point with which to interpret it. To provide an explicit comparative framework for interpreting distributions, researchers often create *percentage distributions*, which show the size of a category relative to the size of the sample [to create a percentage distribution, you divide the number of cases in each category by the total number of cases and multiply by 100].

In Table 12.2, the percentages are based on the total number of responses, excluding missing data - those in the "no answer" category. One student ("no answer") either was not asked the question (interviewer error) or did not respond. Since this is not a meaningful variable category, it would be misleading to include it in the percentage distribution. The total number of missing responses is important information. If this information is not placed in the main body of a table, then it at least should be reported in a footnote to the relevant table or in the text of the research report. Also notice that the base for computing percentages, 753, is given in parentheses below the percentage total of 100 percent. It is customary to indicate in tables the total number of observations from which the statistics are computed. This information may be found elsewhere - at the end of the table title or in a headnote or footnote to the table; often it is signified with the *letter N* = an abbreviation representing the number of observations on which a statistic is based (e.g., N = 753).

Univariate analysis is seldom an end in itself. One important function mentioned earlier is to determine how to collapse or recode categories for further analysis. Collapsing decisions may be based on theoretical criteria and/or may hinge on the empirical variation in responses. Thus, years of education might be collapsed into "theoretically" meaningful categories (grade 8 or lower, some high school, high school graduate, some college, college graduate) on the basis of the years of schooling deemed appropriate over time in the United States for leaving school and qualifying for certain occupations. Alternatively, one might collapse categories according to how many respondents fall into each category. If the sample contains only a handful of respondents with less than a college education, these respondents may be placed in one category for purposes of analysis.

One problem with the race/ethnicity data is that there are too few respondents in several categories to provide reliable bases of comparison. To resolve this problem, Singleton applied both theoretical and practical criteria. Prior research and theory indicate that there is an association between being white and heavy alcohol use among U.S. college students. In addition, no racial identity other than "white" had more than 29 respondents. Therefore, Singleton created a "new" variable by collapsing all of the categories other than "white" into a single "nonwhite" category. This produced a two-category variable for race/ethnicity: white and nonwhite.
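The frequency-to-percentage-to-collapse sequence can be sketched as follows. Only the 642 whites and the N of 753 come from the chapter; the other category counts are invented so the totals match:

```python
from collections import Counter

# Hypothetical category counts, chosen so N = 753 and whites = 642 as in the text.
responses = (["white"] * 642 + ["black"] * 29 + ["asian"] * 25
             + ["latino"] * 22 + ["other"] * 35)

freq = Counter(responses)                  # frequency distribution (Table 12.1 style)
n = sum(freq.values())                     # 753, the base for percentaging
pct = {cat: 100 * count / n for cat, count in freq.items()}   # percentage distribution

# Collapse sparse categories: everything other than "white" becomes "nonwhite".
collapsed = Counter("white" if r == "white" else "nonwhite" for r in responses)
print(collapsed["white"], collapsed["nonwhite"])   # 642 111
```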

summary pg 380

Bivariate analysis examines the relationship between two variables. For relationships involving exclusively nominal- or ordinal-scale variables, such analysis begins with the construction of cross-tabulations. For relationships involving interval- or ratio-scale variables, the data are plotted in a scatterplot and characterized in terms of a mathematical equation.

partial regression coefficient/ partial slope

Coefficients in a multiple-regression equation that estimate the effects of each independent variable on the dependent variable when all other variables in the equation are held constant.

Interval- and Ratio-Scale Variables

Creating frequency or percentage distributions is about as far as the univariate analysis of nominal- and ordinal-scale variables usually goes. On the other hand, data on interval and ratio variables may be summarized not only in tables or graphs, but also in terms of various statistics.

Consider question 36 from the campus survey, which asks respondents, "On a typical weekend night when you choose to drink, about how many drinks do you consume?" (see Figure 12.1). Since respondents' answers were recorded in number of drinks, this variable may be considered a ratio-scale measure. We could get a picture of the number of drinks consumed, as we did with the race/ethnicity variable, by generating a distribution of the responses. Table 12.3 presents a computer-like output for the number of drinks students reported consuming. Notice that Table 12.3 presents two kinds of distributions: frequency and percentage. Notice also that abstainers (who were not asked this question) were coded as "97," for "not applicable."

We also can get a picture of a distribution by looking at its various statistical properties. Three properties may be examined. The first consists of measures of central tendency - the mean, median, and mode. These indicate various "averages" or points of concentration in a set of values.

- *mean* - the arithmetical avg., calculated by adding up all of the responses and dividing by the total number of respondents [it is the "balancing" point in a distribution b/c the sum of the differences of all values from the mean is exactly equal to zero]
- *median* - the midpoint in a distribution - the value of the middle response; half of the responses are above it and half are below [in a data set with an even number of cases (e.g., N = 4), the median would be the avg. of the second and third ordered values]
- *mode* - the value or category with the highest frequency

A second property that we can summarize statistically is the degree of variability or dispersion among a set of values (the simplest measure of dispersion is the *range*). Statistically, this is the diff. b/w the lowest and the highest values, but it is usually reported by identifying these end points, such as "the number of drinks consumed ranged from 0 to 25." Of the several other measures of dispersion, the most commonly reported is the *standard deviation* - a measure of the "average" spread of observations around the mean. With respect to the variable of number of drinks consumed, the standard deviation could be used to compare the degree of variability in drinks consumed among different subsamples or in samples from different populations. The standard deviation of number of drinks consumed for the campus survey was 3.68. Among male respondents, the standard deviation was 3.97, revealing more variability among men than women, for whom the standard deviation was 2.23.

A third statistical property of univariate distributions is their shape. This property is most readily apparent from a graphic representation called a *histogram* (pg. 367, figure 12.4). Superimposed on the histogram is a "bell-shaped" distribution, so called because it has the general shape of a bell. In a bell-shaped distribution, the three measures of central tendency are identical, whereas in a positively skewed distribution like Figure 12.4 the mean has a higher value than the mode and median. One particular type of bell-shaped distribution is the normal distribution, which we described in Chapter 6. The normal distribution describes the shape of many variables and statistics, such as the sampling distribution of the mean.

Collectively, these three statistical properties - central tendency, dispersion, and shape - provide a good picture of quantitative data.
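The three properties can be computed with the standard library on a small hypothetical "number of drinks" distribution (not the actual survey data):

```python
import statistics

# Central tendency, dispersion, and range for made-up illustrative data.
drinks = [0, 0, 1, 2, 2, 2, 3, 4, 5, 9]

mean = statistics.mean(drinks)        # balancing point: deviations sum to zero
median = statistics.median(drinks)    # midpoint; avg. of the 5th and 6th ordered values
mode = statistics.mode(drinks)        # most frequent value
spread = (min(drinks), max(drinks))   # range, reported by its end points
sd = statistics.pstdev(drinks)        # standard deviation: "average" spread about the mean

print(mean, median, mode, spread)     # 2.8 2.0 2 (0, 9)
# The mean is the balancing point: deviations from it sum to (essentially) zero.
assert abs(sum(d - mean for d in drinks)) < 1e-9
```

Note the mean (2.8) exceeds the median and mode (2), as in a positively skewed distribution like Figure 12.4.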
Many investigators, in fact, describe their data in terms of a mean or median, an index of dispersion, and occasionally the overall form (for which there are also statistical indices). [Inspecting the frequency distribution also enables you to spot extreme values or *outliers* that can adversely affect some statistical procedures.] Data inspection can also reveal the prevalence of missing values. The simplest way to handle cases with missing values, which we did in percentaging Table 12.2, is to remove them from statistical calculations = a method called *listwise deletion* - often used when there are relatively FEW missing cases - Excluding cases with missing data on any of the variables in a planned multivariate analysis, however, can lead to a much smaller, biased sample that is unrepresentative of the target population. (pg. 368) - For almost all the variables in the campus survey, there were few missing values. A major exception was parents' income: 148 respondents either refused to report their parents' income or, more commonly, did not know. Because eliminating these 148 cases would produce a smaller and possibly less representative sample, Singleton did not apply listwise deletion. Instead, he used one of the formal statistical solutions, called *imputation*, that have been devised to replace missing values with a typical value calculated from the available ("nonmissing") data [ex. one procedure would be to assign the mean income of nonmissing values to the missing values. pg. 368] - another procedure predicts missing values from known values of other variables - pg. 368 [ex. missing values for income were predicted based on the regression of income on race and parents' education for respondents with nonmissing values on income] - another impt. function of data modification is to reduce data complexity by combining variables into indexes, scales, or other composite measures - Singleton combined answers to two questions for the final operational definition of number of drinks consumed. Because the research question asks about all students, not just nonabstainers, those who reported "abstain" in question 34 (ALCCONS) were recoded as "0" on "number of drinks consumed" (NUMDRNKS).
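A minimal sketch of these two modifications — mean imputation and the abstainer recode — assuming invented records and a made-up missing-data code (the actual survey used its own codes):

```python
# Hypothetical sketch of two data modifications: mean imputation for missing
# parental income, and recoding abstainers to 0 drinks. Values and the
# MISSING code are invented for illustration.
MISSING = 999999  # placeholder code for "don't know"/refused

incomes = [40000, 85000, MISSING, 60000, MISSING, 55000]

# Mean imputation: assign the mean of nonmissing values to the missing values.
valid = [x for x in incomes if x != MISSING]
mean_income = sum(valid) / len(valid)
imputed = [x if x != MISSING else mean_income for x in incomes]

# Recoding: abstainers (ALCCONS == 1) were not asked NUMDRNKS (coded 97),
# so they are recoded to 0 drinks for the "all students" operational definition.
records = [{"ALCCONS": 1, "NUMDRNKS": 97},
           {"ALCCONS": 2, "NUMDRNKS": 4}]
for r in records:
    if r["ALCCONS"] == 1:
        r["NUMDRNKS"] = 0
```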

measures of association (def)

Descriptive statistics used to measure the strength and direction of a bivariate relationship.

carry out preliminary hypothesis testing

Having collected, processed, inspected, and modified the data, a researcher is finally in a position to carry out preliminary hypothesis testing. For novice researchers, this can be an exciting—but also potentially disappointing—stage in the research process. This is because whenever we formulate a hypothesis, it is possible that the hypothesis is "wrong" (and it's hard not to take this personally!). The object of bivariate analysis is to assess the relationship between two variables, such as between the independent and dependent variable in a hypothesis. As in the previous section, we begin by showing how tables and figures can be used to depict the joint distribution of two variables. "Eyeballing" the data helps, but there are more precise ways of assessing bivariate relationships. And so, we also introduce two types of statistics for determining whether one variable is associated with the other: 1. tests of statistical significance 2. measures of association. In general, this amounts to determining, first, whether the relationship is likely to exist (or whether it might be a product of random error) and, second, the strength of the relationship between the variables. Finally, as with univariate analysis, the way in which bivariate analysis is done depends on the level of measurement.

summary

Having entered and cleaned the data, the researcher is ready to inspect and modify the data for the planned analysis. The goal of inspection is to get a clear picture of the data by examining each variable singly (univariate analysis). At first, the categories or values of each variable are organized into frequency and percentage distributions. If the data constitute interval-level measurement, the researcher will also compute statistics that define various properties of the distribution. Statistical measures of central tendency include the most typical value (mode), the middle value (median), and the average (mean). Common measures of dispersion are the difference between the lowest and highest values (range) and an index of the spread of the values around the mean (standard deviation). Distributions also may be described in terms of their shape. Data modifications include changing one or more variable codes, collapsing variable categories, imputing estimated values for missing data, and adding together the codes for several variables to create an index or scale.

coding (data processing)

In quantitative research, coding consists of transforming data into numbers, if it is not already in this form. most surveys are precoded; that is, prior to data collection, each response category of closed-ended questions is assigned a number, which may be specified directly on the interview schedule. Figure 12.1 shows the numerical codes assigned in questions 34 and 35. In conducting the survey, interviewers either circle the numbered response or, in computer-assisted interviewing, enter the numbered response directly into a data file. Coding answers to closed-ended survey questions is straightforward: there are relatively few categories, and you simply assign a different code to each category. However, the coding of responses to open-ended survey questions with large numbers of unique responses, and of other textual data, as in some forms of content analysis (see Chapter 10), is much more complicated (= "qualitative analysis")

entering the data

Once data are coded and edited, they need to be entered into a computer data file. As with editing, data entry into a computer file occurs automatically in computer-assisted interviews. For some paper-and-pencil surveys, data may be entered using software programmed to detect some kinds of erroneous entries; this is called computer-assisted data entry (CADE). Data from the campus surveys were entered by hand into a statistical software package without the aid of a data-entry program that checks for errors. The software, originally named Statistical Package for the Social Sciences and known today by the acronym SPSS, is widely used in the social sciences. When data are entered, they are stored in a *data matrix* or spreadsheet, with observations as rows and variables as columns - pg. 358 - example figure - the rows represent respondents, who are identified by unique ID codes (first column). The remaining columns contain the coded responses to each question or variable. data matrix: - Notice that the columns are headed by abbreviated variable names. To facilitate data analysis, the campus survey used mnemonic labels. For example, the labels FREQALC and NUMDRNKS, which represent questions 35 and 36, respectively, stand for "frequency of alcohol consumption" and "number of drinks consumed." - Numerical codes in the matrix cells identify question responses or variable categories. For respondent's sex (sixth column), a code of 1 was used for males and 2 for females. Thus the four listed respondents are females, and the next two are males. - Distinct codes are used to identify respondents to whom the question does not apply or *missing data*. The term missing data refers to the absence of substantive information on a variable for a respondent. (In the campus survey, "don't know" responses and refusals to answer the question were treated as missing data.)
Thus, the code "9" in column 3 (FREQALC) and the code "97" in column 4 (NUMDRNKS) opposite ID 005 indicate that these questions were skipped and did not apply to this respondent (because a code of "1" [for "abstain"] was entered in column 2 [ALCCONS]). The code "99" in column 8 (CUMGPA) for IDs 005 and 007 indicates that these data are missing (because these respondents did not grant permission to have their GPAs obtained from the Registrar). In addition to the data file, *most researchers create a codebook*. Like a codebook for content analysis, described in Chapter 10, a survey codebook serves as a guide for coding and data entry and as a detailed record of the electronic layout of the data. Codebooks are essential to researchers who are analyzing available survey data such as the GSS. - pg. 359 reference box
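The row-by-column layout can be sketched as plain Python lists. The IDs and values below are invented, but the coding scheme follows the one described above (sex: 1 = male, 2 = female; 99 = missing CUMGPA):

```python
# A minimal data-matrix sketch: rows are respondents (cases), columns are
# variables, headed by mnemonic names. All values are invented.
COLUMNS = ["ID", "ALCCONS", "FREQALC", "NUMDRNKS", "SEX", "CUMGPA"]
matrix = [
    [1, 2, 3, 4, 2, 3.4],
    [2, 1, 9, 97, 1, 99.0],  # abstainer: FREQALC/NUMDRNKS not applicable
    [3, 2, 2, 2, 2, 2.9],
]

def column(name):
    """Return one variable (a column) across all cases (rows)."""
    i = COLUMNS.index(name)
    return [row[i] for row in matrix]

# GPA with the missing-data code (99) screened out before analysis:
gpas = [g for g in column("CUMGPA") if g != 99.0]
```

A codebook plays the role of `COLUMNS` here on a much larger scale: it records which column holds which variable and what each numeric code means.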

conduct multivariate testing

Once you've conducted bivariate analyses on nonexperimental data, all is not said and done. If you've found a statistically significant relationship, there is reason to hold your excitement in abeyance. The regression analysis in the previous section shows that there is a bivariate association between cumulative GPA and number of drinks consumed. However, if the goal is to test the causal hypothesis that drinking lowers grades, our analysis cannot end here. As we have emphasized repeatedly, causal inferences are based not only on association but also on theoretical assumptions and empirical evidence about direction of influence and nonspuriousness. It is important to realize that statistical analyses by themselves do not provide a basis for inferring causal relationships. Instead, a researcher starts with a theoretical model of the causal process linking X and Y and then determines if the data are consistent with the theory. - In this section, we use data from the campus survey to test model A, leaving tests of alternative directional models to future research. Another shortcoming of model A is that there are many other, extraneous variables that may be a cause of drinking or grades or both. Of the many possible alternative models, we will briefly describe three. To test for spuriousness (model B) as well as independent effects on GPA (model D), we now introduce two of the several strategies for multivariate analysis: elaboration of contingency tables and multiple regression. Furthermore, to show how these strategies are based on the same logic of analysis, we use the same data, albeit in different forms.

descriptive statistics

Procedures for organizing and summarizing data.

Inspect and modify data

Starting with a cleaned data set, the next analysis step is to inspect the data to decide on subsequent data modifications and statistical analyses. *goal of inspection is to get a clear picture of the data by examining one variable at a time*. The data "pictures" generated by univariate analysis come in various forms—tables, graphs, charts, and statistical measures. the nature of the techniques depends on whether the level of measurement of the variables you are analyzing is nominal/ordinal or interval/ratio. - RECALL: we cannot add, subtract, multiply, or divide the numbers assigned to the categories of nominal and ordinal variables, whereas we can perform basic mathematical operations on the values of interval and ratio variables. - Consequently, different forms of analysis and statistics are applied to variables measured at different levels. Following data inspection, the researcher may want to change one or more variable codes, rearrange the numerical order of variable codes, collapse variable categories, impute estimated values for missing data, add together the codes for several variables to create an index or scale, and otherwise modify the data for analysis.

marginal frequencies

Row and column totals in a contingency table (cross-tabulation) that represent the univariate frequency distributions for the row and column variables

tests of statistical significance

To determine whether a relationship is due to chance factors, researchers use tests of *statistical significance*. The way that such tests work is that we first assume what the data would look like if there were no relationship between the variables—that is, if the distribution were completely random. - The assumption of no relationship or complete randomness is called the *null hypothesis*. The null hypothesis in Singleton's research is that there is no relationship between alcohol consumption and academic performance (i.e., Singleton is wrong). - Based on the null hypothesis, we calculate the likelihood that the observed data could have occurred at random (this is the test of statistical significance). If the relationship is unlikely to have occurred randomly, we reject the null hypothesis of no relationship b/w the variables - in such cases, researchers generally interpret this as supportive of the hypothesis that there IS a relationship. for cross-tabulations, the most commonly used statistic is the chi-square (χ²) test for independence - the chi-square test is based on a comparison of the observed cell frequencies with the cell frequencies one would expect if there were no relationship b/w the variables - table 12.6 [notice that the cell percentages - reading across - are the same as the marginals; this indicates that knowing whether a respondent is male or female is of no help in predicting attitude toward gun control, precisely the meaning of the null hypothesis of "no relationship" b/w the variables] ------- the larger the diff.'s b/w the actual cell frequencies and those expected assuming no relationship, the larger the value of chi-square, the less likely the relationship occurred randomly, and the more likely that it exists in the population. *remember that the lower case p stands for "probability"; p < .05 means that the probability is less than .05, or 5 in 100, that the association could have occurred at random, assuming there is no relationship in the larger population from which the sample was drawn* - with odds this low, we can be confident that the result would NOT have occurred by chance - therefore, we can conclude that in the American adult population, females are more likely than males to favor gun control - Knowing that this relationship is likely to exist in the population, however, does not tell us the strength of the relationship between the independent variable and the dependent variable. It is possible for a relationship to exist when changes in one variable correspond only slightly to changes in the other. The degree of this correspondence, or association, is a second measurable property of bivariate distributions.
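The observed-versus-expected logic of the chi-square test can be sketched for a 2×2 table; the cell counts below are invented, not the GSS data:

```python
# Chi-square sketch for a 2x2 cross-tabulation (cell counts invented).
# Expected frequency per cell = (row total * column total) / grand total;
# chi-square sums (observed - expected)^2 / expected over all cells.
observed = [[40, 60],   # e.g., males: oppose / favor
            [60, 40]]   # e.g., females: oppose / favor

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (obs - expected) ** 2 / expected
```

Here every expected frequency is 50, so chi-square = 8.0, which exceeds 3.84 (the .05 critical value for 1 degree of freedom), so the null hypothesis of no relationship would be rejected.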

Outliers

Unusual or suspicious values that are far removed from the preponderance of observations for a variable

listwise deletion

a common procedure for handling missing values in multivariate analysis that excludes cases which have missing values on any of the variables in the analysis

percentage distribution

a norming operation that facilitates interpreting and comparing frequency distributions by transforming each frequency to a common yardstick of 100 units (percentage points) in length; the number of cases in each category is divided by the total and multiplied by 100
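The norming operation is just frequency / total × 100 applied to each category; the frequencies below are invented:

```python
# Percentage distribution sketch: invented frequencies normed to a common
# yardstick of 100 units.
freqs = {"White": 120, "Black": 40, "Hispanic": 30, "Other": 10}
total = sum(freqs.values())
percents = {cat: 100 * f / total for cat, f in freqs.items()}
```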

imputation

a procedure for handling missing data in which missing values are assigned based on other info., such as the sample mean or known values of other variables

test of statistical significance

a statistical procedure used to assess the likelihood that the results of a study could have occurred by chance.

cleaning (data processing)

after the data have been entered into a computer file, the researcher "cleans" the data. - *data cleaning*: refers to detecting and resolving errors in coding and in transmitting the data to the computer - Data entry can introduce errors when entry operators misread codes, transpose numbers, skip over or repeat responses to survey questions, and so on. first step: check for these kinds of errors by verifying data entries whenever feasible (one procedure is to have two persons independently enter the info. into separate computer files and then use a software program to compare the two files for noncomparable entries) (another procedure, which was used in the campus surveys and which we recommend for small-scale student projects, is to have one person enter the information and then have another person compare on-screen data entries with the completed survey.) beyond verification, two cleaning techniques generally are applied; these techniques check for the same kinds of errors that could occur during data collection, except that *they screen data entries in a computer file* rather than responses recorded on a questionnaire or interview schedule. 1. wild-code checking: consists of examining the values entered for each item to see whether there are any illegitimate codes. For example, any code other than 1 ("male") or 2 ("female") for the variable "sex" is not legitimate. 2. consistency checking: used in most large-scale surveys: the idea here is to see whether responses to certain q's are related in reasonable ways to responses to particular other q's (it thus req's comparisons across variables, such as comparing data entries for ALCCONS and FREQALC to see if respondents who "abstain" from drinking are correctly coded as "9" (for "not applicable") on the frequency of consumption item) - once you have entered and cleaned the data, you are ready to inspect, modify, and analyze them
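The two cleaning techniques can be sketched against a few invented records, using the coding scheme described above (sex: 1/2; ALCCONS 1 = abstain; FREQALC 9 = not applicable):

```python
# Wild-code and consistency checking, sketched on invented records.
records = [
    {"ID": 1, "SEX": 2, "ALCCONS": 2, "FREQALC": 3},
    {"ID": 2, "SEX": 7, "ALCCONS": 1, "FREQALC": 9},  # wild code for SEX
    {"ID": 3, "SEX": 1, "ALCCONS": 1, "FREQALC": 2},  # abstainer with a
                                                      # frequency: inconsistent
]

# 1. Wild-code checking: flag values outside the legitimate code list.
wild = [r["ID"] for r in records if r["SEX"] not in (1, 2)]

# 2. Consistency checking: abstainers (ALCCONS == 1) should be coded 9
#    ("not applicable") on FREQALC.
inconsistent = [r["ID"] for r in records
                if r["ALCCONS"] == 1 and r["FREQALC"] != 9]
```

Flagged IDs would then be checked against the original questionnaires and corrected.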

standardized regression coefficients

coefficients obtained from a norming operation that puts the various partial-regression coefficients on common footing by converting them to the same metric of standard deviation units

measures of association

in a 2×2 table, the percentage diff. provides one indicator, albeit a poor one, of the strength of the relationship: the larger the diff., the stronger the relationship - however, researchers prefer to use one of several other statistics to measure relationship strength [these measures of association are standardized to vary b/w 0 (no association) and plus or minus 1.0 (perfect association)] - one such measure, which can be used for 2×2 tables, is Cramer's phi coefficient; this equals .15 for the data in table 12.4 - although the choice of labels is somewhat arbitrary, this magnitude suggests a "low" association b/w sex and attitude toward gun control - phi varies from 0 to 1; the sign + or - does NOT reveal anything meaningful about the relationship - however, when both variables have at least ordinal-level measurement, the sign indicates the direction of the relationship - statistically, "direction" refers to the tendency for increases in the values of one variable to be associated with systematic increases or decreases in the values of another variable [both variables may change in the same direction = positive relationship, or in the opposite direction = negative relationship] positive relationship: lower values of one variable tend to be associated with lower values of the other variable, and higher values of one variable tend to go along with higher values of the other. ex. table 12.5 - pg. 374 - The percentage of students who describe themselves as "light" drinkers (first row) falls with increasing frequency of consumption: 82.6 to 31.3 to 7.0 percent. Similarly, the percentage describing themselves as "heavy" drinkers (third row) consistently rises as frequency of consumption increases: 0.0 to 2.9 to 33.1 percent. It is this sort of pattern (self-described heavier drinking associated with more frequent consumption) that suggests a clearly positive relationship.
negative (inverse) relationship: there is a tendency for lower values of one variable to be associated with higher values of the other variable. (table 12.8, pg. 375) - Table 12.8, based on GSS data, reveals such a relationship between education and the number of hours of television watched on an average day: as education increases, television-viewing time decreases. *variables with a relatively large number of categories either constitute or tend to approximate interval-scale measurement* - With interval-scale variables, we can use a more precise and more powerful form of statistical analysis known as "correlation" and "regression."
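For a 2×2 table, the phi coefficient can be computed directly from the four cell counts. The standard formula is phi = (ad − bc) / √(r1·r2·c1·c2); the counts below are invented, and the negative sign here only carries directional meaning when both variables are at least ordinal:

```python
# Phi-coefficient sketch for a 2x2 table (cell counts invented).
import math

a, b = 40, 60   # row 1 cell counts
c, d = 60, 40   # row 2 cell counts

r1, r2 = a + b, c + d          # row totals
c1, c2 = a + c, b + d          # column totals
phi = (a * d - b * c) / math.sqrt(r1 * r2 * c1 * c2)

# |phi| is the strength of association, on the 0-to-1 scale described above.
strength = abs(phi)
```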

editing (data processing)

involves checking for errors and inconsistencies in the data. An error in the campus survey, for example, would be the recording of a student's birth date as 4/11/1899; an inconsistency would have occurred if a respondent who reportedly "abstains" from drinking (question 34) also drinks alcoholic beverages "almost every day" (question 35). most editing is programmed into computer-assisted and online surveys; for instance, if an interviewer entered "1" (abstain) on q. 34, the program would prompt the interviewer to skip questions 35 and 36. For the campus survey, interviewers checked over each completed form for errors and omissions soon after each interview was conducted. Respondents were recontacted if necessary, or corrections were made from memory. - in addition, the instructor (or survey supervisor) went over each completed survey to check for omitted q's and to verify that answers were recorded legibly, the correct forms were used, and so forth

data analysis

is part of a cycle of inquiry that takes place whenever theory and data are compared (this comparison occurs continually in qualitative research when an investigator struggles to bring order, or to make sense of, his or her observations and interviews)

quantitative data

observations that have been transformed into counts or numbers. These are the data most typically generated in experiments, surveys, and some forms of research using existing data. quantitative analysis is synonymous with statistical analysis. *statistic*: a summary statement about a set of data; statistics as a discipline provides techniques for organizing and analyzing data. Because much of quantitative research follows the deductive model of inquiry, we focus on the process by which investigators perform statistical tests of hypotheses. The most elaborate forms of analysis are done with survey data and with some forms of existing (quantitative) data. Given the widespread use of surveys in social research, we concentrate on survey data in this chapter.

summary - pg. 361

preparing data for quantitative analysis entails 4 steps: 1. coding 2. editing 3. data entry 4. cleaning. Coding consists of assigning numbers to the categories of each variable. Editing is designed to ensure that the data to be entered into the computer are as complete, error-free, and readable as possible. (When data are entered into the computer and stored in a data file, they are organized as a matrix or spreadsheet, with observations as rows and variables as columns.) After entry, the data are cleaned for errors in coding and transmission to the computer [this is a multistep process usually beginning with a verification procedure and continuing with checks for "illegal" codes - wild-code checking - and inconsistent patterns - consistency checking]

inferential statistics

procedures for determining the extent to which one may generalize beyond the data at hand

missing data

refers to the absence of information on a variable for a given case

mean

the average value of a data set, calculated by adding up the individual values and dividing by the total number of cases.

data cleaning

the detection and correction of errors in a computer data file that may have occurred during data collection, coding, and/or data entry

residuals

the difference between observed values of the dependent variable and those predicted by a regression equation

range

the difference between the lowest and highest values in a distribution, which is usually reported by identifying these two extreme values.

data matrix

the form of a computer data file, with rows as cases, and columns as variables; each cell represents the value of a particular variable (column) for a particular case (row)

null hypothesis

the hypothesis, associated with tests of statistical significance, that an observed relationship IS due to chance; a test that is significant rejects the null hypothesis at a specified level of probability.

median

the midpoint in a distribution of interval- or ratio-scale data; indicates the point below and above which 50 percent of the values fall.

elaboration of contingency tables

the multivariate analysis of contingency tables introduces a third control variable (and sometimes additional variables) into the analysis to enhance or "elaborate" our understanding of a bivariate relationship. To illustrate the logic of this elaboration, we created a contingency table of the bivariate relationship between cumulative GPA and number of drinks by dichotomizing each of these variables. That is, for each variable, we collapsed all the values into two categories with an approximately equal proportion of cases in each category. For cumulative GPA, the split occurred at less than 3.2 and 3.2 or greater; for number of drinks, the split occurred between 5 or fewer drinks and 6 or more drinks. The objective of elaboration is to examine the impact of additional "third" variables on a bivariate relationship. We are especially interested in understanding the impact of antecedent variables that might create a spurious relationship. (For example, researchers have speculated that the association between drinking and GPA might be spurious due to precollege factors such as academic aptitude. That is, students with a relatively low aptitude might be likely to drink more and to do less well academically than students with a relatively high aptitude. To explore this possibility, we introduce a measure of respondents' aptitude—SAT total score—into the analysis by holding it constant. --- in a contingency table elaboration, third variables are held constant by means of a subgroup classification) Notice that we now have two "tables," one for each category of the variable "SAT total score." These are called *partial tables* or partials, because each shows the association between alcohol consumption and GPA for part of the total number of observations. - SAT total score is held constant, because in each partial table all respondents are alike with respect to their SAT scores.
Reading across each partial table, we find no association between alcohol consumption and GPA (e.g., in the first partial, 62.3 − 62.3 = 0). Thus, the original relationship (shown in Table 12.9), which indicated that heavier drinkers had lower GPAs than lighter drinkers, has disappeared when SAT score is controlled. - pg. 383 criticisms that pertain to much analysis of contingency tables: (1) Collapsing variables such as amount of alcohol consumed, GPA, and SAT scores into two categories may eliminate important information and distort the results; (2) several other variables (e.g., gender, race, parents' income) might produce a spurious association between alcohol consumption and GPA; and (3) controlling for one variable at a time ignores the possibility that spuriousness may be created by the simultaneous action of two or more extraneous variables.
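The subgroup-classification logic behind partial tables can be sketched as follows; all records and category labels are invented toy data, not the campus-survey results:

```python
# Partial-table sketch: hold a dichotomized third variable (SAT) constant by
# splitting cases into subgroups, then cross-tabulate drinking x GPA within
# each subgroup. All records are invented for illustration.
records = [
    {"sat": "low",  "drinks": "6+",  "gpa": "<3.2"},
    {"sat": "low",  "drinks": "0-5", "gpa": "<3.2"},
    {"sat": "high", "drinks": "6+",  "gpa": "3.2+"},
    {"sat": "high", "drinks": "0-5", "gpa": "3.2+"},
]

def partial_table(control_value):
    """Cross-tab of drinks x GPA for one category of the control variable."""
    table = {}
    for r in records:
        if r["sat"] == control_value:
            key = (r["drinks"], r["gpa"])
            table[key] = table.get(key, 0) + 1
    return table

low_partial = partial_table("low")
high_partial = partial_table("high")
```

In this toy data, GPA depends on SAT alone: within each partial, heavy and light drinkers have identical GPAs, mimicking how a bivariate association can vanish once an antecedent variable is controlled.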

cell frequency

the number of cases in a cell of a cross-tabulation (contingency table)

y-intercept

the predicted value of the dependent variable in regression when the independent variable or variables have a value of zero; graphically, the point at which the regression line crosses the Y-axis. (value 'a' in the equation)

Prepare data for computerized analysis: data processing

the quality of the data rests largely on data processing: many errors may be introduced into the data, and many checks and safeguards should be incorporated to avoid such errors. To make this step more manageable, *data processing can be broken down into four smaller steps*: coding, editing, entering data into a data file, and checking data for errors (cleaning). The accomplishment of each data-processing task depends on the type of data and how they were collected.

univariate analysis

the statistical analysis of one variable at a time

bivariate analysis

the statistical analysis of the relationship between two variables

mode

the value or category of a frequency distribution having the highest frequency; the most typical value.

box 12.2 - the meaning of statistical significance and strength of association

two types of statistics can be used to assess bivariate relationships: tests of statistical significance and measures of strength of association. - let's consider each statistic as it relates to tests of the hypothesis that drinking is associated with cumulative GPA. Q. 1 - At which college is the relationship between number of drinks consumed and cumulative GPA strongest? - To answer the first question, we use the correlation coefficient r. This is a descriptive statistic that indicates strength of association, or how well we can predict one variable from knowledge of another. The absolute value of r (ignoring the plus or minus sign) is the measure of strength. The coefficient is largest, and therefore the relationship is strongest, for College A. Q. 2 - Is the association statistically significant at all three colleges? - Using the traditional level of significance of *.05*, the association is significant at College B and College C, but not at College A. Q. 3 - What do these data tell us about the theoretical and practical importance of the association between drinking and grades? - pg. 380 - *p-values depend not only on the magnitude of the correlation coefficient but also on the size of the sample. If the sample is big enough, even very weak correlations are unlikely to occur by chance and, therefore, are likely to differ from 0 in the population from which the sample was drawn* Just as statistical significance tells us nothing about strength of association, or the magnitude of a result, it also reveals nothing about its theoretical or practical importance. In fact, neither of these judgments can be based on statistics alone. For example, statistical significance is insufficient to establish causality, which often determines whether a finding is theoretically important. Further, practical importance depends on the magnitude of the result as well as an assessment of human values and costs

