CJP - Chapter 5: Psychological Measurement
four steps in the measurement process
(a) conceptually defining the construct, (b) operationally defining the construct, (c) implementing the measure, and (d) evaluating the measure.
Summary of Levels of Measurements
*NOMINAL*: category labels
*ORDINAL*: category labels, rank order
*INTERVAL*: category labels, rank order, equal intervals
*RATIO*: category labels, rank order, equal intervals, true zero
importance of validity
- When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever. - As an absurd example, imagine someone who believes that people's index finger length reflects their self-esteem and therefore tries to measure self-esteem by holding a ruler up to people's index fingers. Although this measure would have extremely good test-retest reliability, it would have absolutely no validity. The fact that one person's index finger is a centimetre longer than another's would indicate nothing about which one had higher self-esteem.
Variable measures generally fall into one of three broad categories.
1. *Self-report measures* are those in which participants report on their own thoughts, feelings, and actions, as with the Rosenberg Self-Esteem Scale. 2. *Behavioural measures* are those in which some other aspect of participants' behaviour is observed and recorded. This is an extremely broad category that includes the observation of people's behaviour both in highly structured laboratory tasks and in more natural settings. 3. *Physiological measures* are those that involve recording any of a wide variety of physiological processes, including heart rate and blood pressure, galvanic skin response, hormone levels, and electrical activity and blood flow in the brain.
precautions you can take to minimize kinds of reactivity
1. One is to make the procedure as clear and brief as possible so that participants are not tempted to vent their frustrations on your results. 2. Another is to guarantee participants' anonymity and make clear to them that you are doing so. - If you are testing them in groups, be sure that they are seated far enough apart that they cannot see each other's responses. Give them all the same type of writing implement so that they cannot be identified by, for example, the pink glitter pen that they used. You can even allow them to seal completed questionnaires into individual envelopes or put them into a drop box where they immediately become mixed with others' questionnaires.
ratio level of measurement
The ratio level of measurement involves assigning scores in such a way that the numbers represent fixed measuring units and there is a true zero point representing the complete absence of the quantity being measured. - Height measured in meters and weight measured in kilograms are good examples. So are counts of discrete objects or events such as the number of siblings one has or the number of questions a student answers correctly on an exam. - You can think of a ratio scale as the three earlier scales rolled into one. Like a nominal scale, it provides a name or category for each object (the numbers serve as labels). Like an ordinal scale, the objects are ordered (in terms of the ordering of the numbers). Like an interval scale, the same difference at two places on the scale has the same meaning.
Test-retest reliability
A method for determining the reliability of a test by comparing a test taker's scores on the same test taken on separate occasions. - For example, intelligence is generally thought to be consistent across time. A person who is highly intelligent today will be highly intelligent next week. This means that any good measure of intelligence should produce roughly the same scores for this individual next week as it does today.
Stress Operationally Defined
A rough conceptual definition is that stress is an adaptive response to a perceived danger or threat that involves physiological, cognitive, affective, and behavioural components. But researchers have operationally defined it in several ways. - The Social Readjustment Rating Scale is a self-report questionnaire on which people identify stressful events that they have experienced in the past year, with points assigned for each one depending on its severity. For example, a man who has been divorced (73 points), changed jobs (36 points), and had a change in sleeping habits (16 points) in the past year would have a total score of 125. - The Daily Hassles and Uplifts Scale is similar but focuses on everyday stressors like misplacing things and being concerned about one's weight. - The Perceived Stress Scale is another self-report measure that focuses on people's feelings of stress (e.g., "How often have you felt nervous and stressed?"). - Researchers have also operationally defined stress in terms of several physiological variables including blood pressure and levels of the stress hormone cortisol.
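A minimal sketch of how an SRRS total like the one above could be computed. Only the three events from the example are included, and the event labels and function name are my own; a real administration would use the full published scale.

```python
# Hypothetical sketch: sum the point values of the stressful events a
# respondent endorses (Social Readjustment Rating Scale style scoring).
SRRS_POINTS = {
    "divorce": 73,
    "change to a different line of work": 36,
    "change in sleeping habits": 16,
}

def srrs_total(endorsed_events):
    """Sum the point values of the events a person reports experiencing."""
    return sum(SRRS_POINTS[event] for event in endorsed_events)

# The example respondent from the text: 73 + 36 + 16 = 125.
print(srrs_total(["divorce",
                  "change to a different line of work",
                  "change in sleeping habits"]))  # 125
```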
Cronbach's α
Conceptually, α is the mean of all possible split-half correlations for a set of items. For example, there are 252 ways to split a set of 10 items into two sets of five. Cronbach's α would be the mean of the 252 split-half correlations. Note that this is not how α is actually computed, but it is a correct way of interpreting the meaning of this statistic. Again, a value of +.80 or greater is generally taken to indicate good internal consistency. - the most common measure of internal consistency used by researchers in psychology
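A minimal sketch of how α is usually computed in practice, using the standard variance-based formula rather than literally averaging all possible split-half correlations. The item responses below are hypothetical.

```python
# Cronbach's alpha via the standard formula:
# alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores)

def variance(values):
    """Sample variance (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def cronbach_alpha(data):
    """data: one list of k item scores per participant."""
    k = len(data[0])                                  # number of items
    items = list(zip(*data))                          # one column of scores per item
    item_vars = sum(variance(col) for col in items)   # sum of item variances
    total_var = variance([sum(row) for row in data])  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical responses from four participants on three items:
responses = [[3, 3, 2], [2, 2, 1], [4, 3, 4], [1, 2, 1]]
print(round(cronbach_alpha(responses), 2))  # 0.9 for these made-up data
```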
convergent validity
Criteria can also include other measures of the same construct. For example, one would expect new measures of test anxiety or physical risk taking to be positively correlated with existing measures of the same constructs. - Assessing convergent validity requires collecting data using the measure.
Big Five Facets
Each of the Big Five personality traits in the five-factor model contains several facets, each of which is measured with a separate scale (p. 91).
*Openness to Experience*: Intellect, Imagination-Creativity, Perceptiveness
*Conscientiousness*: Orderliness, Decisiveness-Consistency, Reliability, Industriousness
*Extraversion*: Sociability, Unrestraint, Assertiveness, Activity-Adventurousness
*Agreeableness*: Warmth-Affection, Gentleness, Generosity, Modesty-Humility
*Neuroticism*: Irritability, Insecurity, Emotionality
Stevens's levels of measurement are important for at least two reasons
First, they emphasize the generality of the concept of measurement. Although people do not normally think of categorizing or ranking individuals as measurement, in fact they are forms of measurement as long as the categories or ranks are assigned so that they represent some characteristic of the individuals. Second, the levels of measurement can serve as a rough guide to the statistical procedures that can be used with the data and the conclusions that can be drawn from them.
socially desirable responding
responding to items in ways that participants believe are socially appropriate or that make them look good, rather than in ways that accurately reflect their true thoughts and feelings. - For example, people with low self-esteem might agree that they feel they are a person of worth not because they really feel this way but because they believe this is the socially appropriate response and do not want to look bad in the eyes of the researcher.
Conceptually Defining the Construct
Having a clear and complete conceptual definition of a construct is a prerequisite for good measurement. For one thing, it allows you to make sound decisions about exactly how to measure the construct. - If you had only a vague idea that you wanted to measure people's "memory," for example, you would have no way to choose whether you should have them remember a list of vocabulary words, a set of photographs, a newly learned skill, or an experience from long ago. Because psychologists now conceptualize memory as a set of semi-independent systems, you would have to be more precise about what you mean by "memory."
Creating Your Own Measure
Instead of using an existing measure, you might want to create your own. Perhaps there is no existing measure of the construct you are interested in, or existing ones are too difficult or time-consuming to use. Or perhaps you want to use a new measure specifically to see whether it works in the same way as existing measures—that is, to evaluate convergent validity (e.g., a new variation of the Stroop task). - When you create a new measure, you should strive for simplicity. Remember that your participants are not as interested in your research as you are and that they will vary widely in their ability to understand and carry out whatever task you give them.
face validity
Is the extent to which a measurement method appears "on its face" to measure the construct of interest. - Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. So a questionnaire that included these kinds of items would have good face validity.
proprietary
Many existing measures—especially those that have applications in clinical psychology—are proprietary. This means that a publisher owns the rights to them and that you would have to purchase them. - These include many standard intelligence tests, the Beck Depression Inventory, and the Minnesota Multiphasic Personality Inventory (MMPI).
Evaluating the Measure
Once you have used your measure on a sample of people and have a set of scores, you are in a position to evaluate it more thoroughly in terms of reliability and validity. Even if the measure has been used extensively by other researchers and has already shown evidence of reliability and validity, you should not assume that it worked as expected for your particular sample and under your particular testing conditions. Regardless, you now have additional evidence bearing on the reliability and validity of the measure, and it would make sense to add that evidence to the research literature.
multiple-item measure
Structuring items in a way that allows them to be combined into a single overall score by summing or averaging - To measure "financial responsibility," a student might ask people about their annual income, obtain their credit score, and have them rate how "thrifty" they are—but there is no obvious way to combine these responses into an overall score. To create a true multiple-item measure, the student might instead ask people to rate the degree to which 10 statements about financial responsibility describe them on the same five-point scale.
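A minimal sketch of combining one participant's ratings of the 10 statements into an overall score by averaging; the ratings are hypothetical.

```python
# One participant's ratings of 10 statements about financial responsibility,
# each on the same 1-5 scale, combined into a single overall score.
ratings = [4, 5, 3, 4, 4, 5, 2, 4, 3, 4]

overall_score = sum(ratings) / len(ratings)  # averaging; summing would also work
print(overall_score)  # 3.8
```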
The Big Five
The Big Five is a set of five broad dimensions that capture much of the variation in human personality. *OCEAN* openness, conscientiousness, extraversion, agreeableness, neuroticism
It is also the case that many established measures in psychology work quite well despite lacking face validity.
The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) measures many personality characteristics and disorders by having people decide whether each of 567 different statements applies to them—where many of the statements do not have any obvious relationship to the construct that they measure. For example, the items "I enjoy detective or mystery stories" and "The sight of blood doesn't frighten me or make me sick" both measure the suppression of aggression. - In this case, it is not the participants' literal answers to these questions that are of interest, but rather whether the pattern of the participants' responses to a series of questions matches those of individuals who tend to suppress their aggression.
How do researchers know that the scores actually represent the characteristic, especially when it is a construct like intelligence, self-esteem, depression, or working memory capacity?
The answer is that they conduct research using the measure to confirm that the scores make sense based on their understanding of the construct being measured. Psychologists do not simply assume that their measures work. Instead, they collect data to demonstrate that they work. If their research does not demonstrate that a measure works, they stop using it.
split-half correlation
The correlation between scores based on one half of the items on a multiple-item measure and scores based on the other half of the items. 1. Administer the test to a large group of students (ideally, over about 30). 2. Randomly divide the test questions into two parts. For example, separate even questions from odd questions. 3. Score each half of the test for each student. 4. Find the correlation coefficient for the two halves. - A split-half correlation of +.80 or greater is generally considered to indicate good internal consistency.
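A minimal sketch of these four steps in Python. The answer data are hypothetical and far smaller than a real sample would be.

```python
import numpy as np

# Each row is one student's answers to a 6-item test (1 = correct, 0 = incorrect).
answers = np.array([
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 0, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
])

odd_half = answers[:, 0::2].sum(axis=1)    # each student's score on items 1, 3, 5
even_half = answers[:, 1::2].sum(axis=1)   # each student's score on items 2, 4, 6
split_half_r = np.corrcoef(odd_half, even_half)[0, 1]
print(round(split_half_r, 2))  # +.80 or greater suggests good internal consistency
```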
ordinal level of measurement
The ordinal level of measurement involves assigning scores so that they represent the rank order of the individuals. Ranks communicate not only whether any two individuals are the same or different in terms of the variable being measured but also whether one individual is higher or lower on that variable. - For example, a researcher wishing to measure consumers' satisfaction with their microwave ovens might ask them to specify their feelings as either "very dissatisfied," "somewhat dissatisfied," "somewhat satisfied," or "very satisfied." The items in this scale are ordered, ranging from least to most satisfied. - However, the difference between "very dissatisfied" and "somewhat dissatisfied" may not be the same as the difference between "somewhat dissatisfied" and "somewhat satisfied." Nothing in our measurement procedure allows us to determine whether these two differences reflect the same difference in psychological satisfaction. Statisticians express this point by saying that the differences between adjacent scale values do not necessarily represent equal intervals on the underlying scale giving rise to the measurements. (In our case, the underlying scale is the true feeling of satisfaction, which we are trying to measure.)
The psychologist S. S. Stevens and four different levels of measurement
The psychologist S. S. Stevens suggested that scores can be assigned to individuals in a way that communicates more or less quantitative information about the variable of interest. Stevens actually suggested four different levels of measurement (which he called "scales of measurement") that correspond to four different levels of quantitative information that can be communicated by a set of scores. 1. The nominal level of measurement 2. The ordinal level of measurement 3. The interval level of measurement 4. The ratio level of measurement
But what if your newly collected data cast doubt on the reliability or validity of your measure?
The short answer is that you have to ask why. It could be that there is something wrong with your measure or how you administered it. It could be that there is something wrong with your conceptual definition. It could be that your experimental manipulation failed. For example, if a mood measure showed no difference between people whom you instructed to think positive versus negative thoughts, maybe it is because the participants did not actually think the thoughts they were supposed to or that the thoughts did not actually affect their moods. In short, it is "back to the drawing board" to revise the measure, revise the conceptual definition, or try a new manipulation.
converging operations
When psychologists use multiple operational definitions of the same construct—either within a study or across studies—they are using converging operations. The idea is that the various operational definitions are "converging" or coming together on the same construct. - When scores based on several different operational definitions are closely related to each other and produce similar patterns of results, this constitutes good evidence that the construct is being measured effectively and that it is useful. The various measures of stress, for example, are all correlated with each other and have all been shown to be correlated with other variables such as immune system functioning (also measured in a variety of ways) (Segerstrom & Miller, 2004). This is what allows researchers eventually to draw useful general conclusions, such as "stress is negatively correlated with immune system functioning," as opposed to more specific and less useful ones, such as "people's scores on the Perceived Stress Scale are negatively correlated with their white blood counts."
concurrent validity and predictive validity
When the criterion is measured at the same time as the construct, criterion validity is referred to as *concurrent validity*; however, when the criterion is measured at some point in the future (after the construct has been measured), it is referred to as *predictive validity* (because scores on the measure have "predicted" a future outcome).
Implementing the Measure
You will want to implement any measure in a way that maximizes its reliability and validity. In most cases, it is best to test everyone under similar conditions that, ideally, are quiet and free of distractions. - Testing participants in groups is often done because it is efficient, but be aware that it can create distractions that reduce the reliability and validity of the measure. As always, it is good to use previous research as a guide. If others have successfully tested people in groups using a particular measure, then you should consider doing it too.
Reliability
consistency of a measure. Psychologists consider three types of consistency: 1) over time (test-retest reliability), 2) across items (internal consistency), and 3) across different researchers (inter-rater reliability).
conceptual definition of a psychological construct
describes the behaviours and internal processes that make up that construct, along with how it relates to other variables
three basic kinds of validity (of a measure)
face validity, content validity, and criterion validity.
constructs
internal attributes or characteristics that cannot be directly observed but are useful for describing and explaining behavior - include personality traits (e.g., extraversion), emotional states (e.g., fear), attitudes (e.g., toward taxes), and abilities (e.g., athleticism).
interval level of measurement
involves assigning scores using numerical scales in which intervals have the same interpretation throughout. As an example, consider either the Fahrenheit or Celsius temperature scales. The difference between 30 degrees and 40 degrees represents the same temperature difference as the difference between 80 degrees and 90 degrees. This is because each 10-degree interval has the same physical meaning (in terms of the kinetic energy of molecules). - Interval scales are not perfect, however. In particular, they do not have a true zero point even if one of the scaled values happens to carry the name "zero." The Fahrenheit scale illustrates the issue. Zero degrees Fahrenheit does not represent the complete absence of temperature (the absence of any molecular kinetic energy). - the intelligence quotient (IQ) is often considered to be measured at the interval level.
operational definition
is a definition of a variable in terms of precisely how it is to be measured. - For any given variable or construct, there will be multiple operational definitions
Rosenberg Self-Esteem Scale
is one of the most common measures of self-esteem - Participants respond to each of the 10 items that follow with a rating on a 4-point scale: Strongly Agree, Agree, Disagree, Strongly Disagree. Score Items 1, 2, 4, 6, and 7 by assigning 3 points for each Strongly Agree response, 2 for each Agree, 1 for each Disagree, and 0 for each Strongly Disagree. Reverse the scoring for Items 3, 5, 8, 9, and 10 by assigning 0 points for each Strongly Agree, 1 point for each Agree, and so on. The overall score is the total number of points.
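A minimal sketch in Python of the scoring procedure described above, assuming responses are coded as "SA", "A", "D", and "SD"; the participant's answers below are hypothetical.

```python
# Rosenberg Self-Esteem Scale scoring: forward-score Items 1, 2, 4, 6, 7 and
# reverse-score Items 3, 5, 8, 9, 10, then sum the points.
FORWARD = {"SA": 3, "A": 2, "D": 1, "SD": 0}
REVERSED = {"SA": 0, "A": 1, "D": 2, "SD": 3}
REVERSE_ITEMS = {3, 5, 8, 9, 10}

def score_rosenberg(responses):
    """responses: dict mapping item number (1-10) to 'SA', 'A', 'D', or 'SD'."""
    total = 0
    for item, answer in responses.items():
        key = REVERSED if item in REVERSE_ITEMS else FORWARD
        total += key[answer]
    return total  # ranges from 0 (lowest self-esteem) to 30 (highest)

# One hypothetical participant's answers to all 10 items:
answers = {1: "A", 2: "SA", 3: "D", 4: "A", 5: "SD",
           6: "A", 7: "SA", 8: "D", 9: "D", 10: "SD"}
print(score_rosenberg(answers))  # 24
```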
Content validity
is the extent to which a measure "covers" the construct of interest. - For example, if a researcher conceptually defines test anxiety as involving both sympathetic nervous system activation (leading to nervous feelings) and negative thoughts, then his measure of test anxiety should include items about both nervous feelings and negative thoughts. - Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something. By this conceptual definition, a person has a positive attitude toward exercise to the extent that he or she thinks positive thoughts about exercising, feels good about exercising, and actually exercises. So to have good content validity, a measure of people's attitudes toward exercise would have to reflect all three of these aspects. Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct.
Inter-rater reliability
is the extent to which different observers are consistent in their judgments. - For example, if you were interested in measuring university students' social skills, you could make video recordings of them as they interacted with another student whom they are meeting for the first time. Two or more observers could then watch the videos and rate each student's level of social skills; to the extent that the different observers' ratings are highly correlated with each other, the measure has good inter-rater reliability.
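A minimal sketch of checking inter-rater reliability for quantitative ratings by correlating two observers' scores. The ratings below are hypothetical; categorical judgments would typically be assessed with a statistic such as Cohen's kappa instead.

```python
import numpy as np

# Two observers each rated the same ten students' social skills on a 1-7 scale.
rater_a = np.array([5, 6, 3, 4, 7, 2, 5, 6, 4, 3])
rater_b = np.array([4, 6, 3, 5, 7, 2, 4, 6, 5, 3])

inter_rater_r = np.corrcoef(rater_a, rater_b)[0, 1]
print(round(inter_rater_r, 2))  # high values indicate consistent judgments
```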
Criterion validity
is the extent to which people's scores on a measure are correlated with other variables (known as criteria) that one would expect them to be correlated with. - For example, one would expect test anxiety scores to be negatively correlated with exam performance and course grades and positively correlated with general anxiety and with blood pressure during an exam.
Discriminant validity
is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct. - For example, self-esteem is a general attitude toward the self that is fairly stable over time. It is not the same as mood, which is how good or bad one happens to be feeling right now. So people's scores on a new measure of self-esteem should not be very highly correlated with their moods. If the new measure of self-esteem were highly correlated with a measure of mood, it could be argued that the new measure is not really measuring self-esteem; it is measuring mood instead.
Validity
is the extent to which the scores from a measure represent the variable they are intended to measure.
Deciding on an Operational Definition
It is usually a good idea to use an existing measure that has been used successfully in previous research. Among the advantages are that: (a) you save the time and trouble of creating your own, (b) there is already some evidence that the measure is valid (if it has been used successfully), and (c) your results can more easily be compared with and combined with previous results. - In fact, if there already exists a reliable and valid measure of a construct, other researchers might expect you to use it unless you have a good and clearly stated reason for not doing so.
participant reactivity
participants act differently or unnaturally because they know someone is watching them
Demand Characteristics
subtle cues that reveal how the researcher expects participants to behave. - For example, a participant whose attitude toward exercise is measured immediately after she is asked to read a passage about the dangers of heart disease might reasonably conclude that the passage was meant to improve her attitude. As a result, she might respond more favourably because she believes the researcher expects her to.
psychological constructs
such as intelligence, self-esteem, and depression are variables that are not directly observable because they represent behavioural tendencies or complex patterns of behaviour and internal processes. *An important goal of scientific research is to conceptually define psychological constructs in ways that accurately describe them.* - For example, to say that a particular university student is highly extraverted does not necessarily mean that she is behaving in an extraverted way right now. In fact, she might be sitting quietly by herself, reading a book. Instead, it means that she has a general tendency to behave in extraverted ways (talking, laughing, etc.) across a variety of situations. - Another reason psychological constructs cannot be observed directly is that they often involve internal processes. Fear, for example, involves the activation of certain central and peripheral nervous system structures, along with certain kinds of thoughts, feelings, and behaviours—none of which is necessarily obvious to an outside observer. Notice also that neither extraversion nor fear "reduces to" any particular thought, feeling, act, or physiological structure or process. Instead, each is a kind of summary of a complex set of behaviours and internal processes.
Measurement
the assignment of scores to individuals so that the scores represent some characteristic of the individuals - weighing oneself by stepping onto a bathroom scale, or checking the internal temperature of a roasting turkey by inserting a meat thermometer
internal consistency
the consistency of people's responses across the items on a multiple-item measure - On the Rosenberg Self-Esteem Scale, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. If people's responses to the different items are not correlated with each other, then it would no longer make sense to claim that they are all measuring the same underlying construct.
test-retest correlation
the correlation (Pearson's r) between individuals' scores on the same measure administered at two different times. - A test-retest correlation of +.80 or greater is generally considered to indicate good reliability.
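A minimal sketch of computing a test-retest correlation from two administrations of the same measure; the scores are hypothetical.

```python
import numpy as np

# The same six people measured at two different times (e.g., a week apart).
time1 = np.array([102, 115, 98, 130, 110, 95])
time2 = np.array([100, 118, 97, 128, 112, 99])

test_retest_r = np.corrcoef(time1, time2)[0, 1]
print(round(test_retest_r, 2))  # +.80 or greater indicates good test-retest reliability
```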
psychometrics
the scientific study of the measurement of human abilities, attitudes, and traits
nominal level of measurement
used for categorical variables and involves assigning scores that are category labels. Category labels communicate whether any two individuals are the same or different in terms of the variable being measured. For example, if you look at your research participants as they enter the room, decide whether each one is male or female, and type this information into a spreadsheet, you are engaged in nominal-level measurement.
conceptual definition of neuroticism
would be: "it is people's tendency to experience negative emotions such as anxiety, anger, and sadness across a variety of situations"