PsycMeasurement Exam 1
minimum competency exam for graduation from high school. norm-referenced or criterion referenced?
criterion referenced based on objectives and goals of education
state licensing. norm-referenced or criterion referenced?
criterion referenced, there is a certain level of information that all licensed doctors should absolutely know
state certification as a barber. norm-referenced or criterion referenced?
criterion referenced. there is a certain level of barbering knowledge that all certified barbers need to demonstrate
Explain the relationship between content validity and internal consistency reliability, using an example to illustrate your answer.
Content validity demonstrates the correspondence between test items and the variable of interest. Internal consistency measures whether each question included in the test is measuring the same thing. In a test of driving ability, internal consistency (consistently high scores across the test items: starting the engine, merging onto the highway, braking, parking) would reflect good driving ability; because those items sample the full range of driving skills, the test would also have high content validity.
parallel forms reliability
estimates how similar two forms of the same test are. Although the items differ, the forms should give relatively the same scores, test the same domains of the construct, and mean the same thing overall. Alternate forms are different versions of a test that have been constructed to measure essentially the same construct. Parallel forms reliability is useful in an achievement or skill setting where the test user wants to minimize the effect of memory for the content of a previously administered form of the test, like a makeup history test. same people, different forms of the test
Test-retest reliability
estimates reliability by comparing scores of the same participants on the same test, administered at different times. It should be used when evaluating something relatively stable over time. same people, same test, different times
Standard deviation
how much scores vary, or spread, from the mean value of the sample. It indicates the average difference between the scores and the mean score
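A quick sketch with made-up test scores, using Python's standard library:

```python
import statistics

# Hypothetical test scores for a sample of five people
scores = [70, 75, 80, 85, 90]

mean = statistics.mean(scores)  # center of the distribution
sd = statistics.pstdev(scores)  # population SD: spread around the mean
print(mean, round(sd, 2))
```

Here every score sits 0, 5, or 10 points from the mean of 80, and the standard deviation works out to about 7.07.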
example of content validity
i. Example: a final exam for an introductory statistics course has content validity if its items sample every topic covered during the semester (descriptive statistics, correlation, probability, hypothesis testing) in proportion to the emphasis each received in class
Random assignment
equating groups by randomly assigning participants to a treatment or a control group, so that participant characteristics are spread evenly across conditions and cannot confound the treatment effect
Compare and contrast psychological testing and psychological assessment.
a. A psychological test is only part of the process. It refers to the device or procedure itself, often used to test a large sample of a population and generalize from the results. It can rarely be used on its own. b. Assessment may involve many different types of tests, for the purpose of evaluating an individual. It may be comprehensive: even though you're trying to assess depression, you might use many different tests to help you, like self-report measures of other constructs (such as anxiety) along with observations, interviews, and case studies, to put together an overall view of the individual.
Discuss the implications of McClelland et al.'s article for the use of a multitrait multimethod matrix to evaluate the construct validity of measures of need for achievement and need for affiliation.
a. Different measures/methods potentially get at different parts of a construct i. E.g. need for affiliation, need for achievement ii. Implicit motives represent a more primitive motivational system, whereas self-attributed motives are based on more cognitively elaborated constructs
Explain what is meant by "process dissociation" as applied to psychological measurement
a. Disentangling implicit and explicit psychological processes (B, 2002, 51) i. Converging behavioral predictors ii. Modest positive intercorrelations iii. Differential effects of moderating variables
What characteristics of a research study do we take into account when using its findings to evaluate the validity of a test?
a. Do the items adequately sample the range of areas that must be sampled to adequately measure the construct? b. How do individual items contribute or detract from the test's validity? c. What do the scores really tell us about the targeted construct? d. How are high (or low) scores on the test related to testtaker's behavior? e. How do scores on this test relate to other tests that measure the same construct? f. How do scores on this test relate to other tests measuring DIFFERENT constructs?
Explain the historical development of standardized tests
a. Earliest records found in China, where those seeking government jobs had to complete examinations testing their knowledge of music, horsemanship, writing, agriculture, military strategy, etc. b. In the Western world, testtakers were often given open-ended essays. As the industrial revolution progressed, standardized tests emerged as an easy way to test large numbers of students quickly c. Alfred Binet began developing standardized tests of intelligence i. Adaptations of his work (the Army Alpha and Beta) were later used to screen U.S. Army recruits d. SAT created as a standardized method of testing applicants for college admission
What is the "heteromethod convergence problem", and how can it be solved?
a. Even when self-report and projective measures of a __ test predict theoretically related features of behavior, there are __ modest correlations between them. B, 2002, 47 b. Solution: realize they are measuring different things
Compare and contrast implicit approaches to measuring human motives with explicit (structured self-report, or self-attributed) approaches. Include both the characteristics of the measures themselves and the uses for which they are most valid.
a. Implicit - measures like the RIT, where the testtaker is unsure of the construct being measured, which decreases biases like social desirability b. Explicit - measures like structured self-report, where the construct is transparent, which increases biases like social desirability
Explain why it is important to ensure that test administration procedures are uniform (followed consistently every time), and give three examples of how this is done.
a. Test needs to be administered correctly, controlling for as many outside variables as possible; if not: i. data could be essentially useless ii. wrong diagnosis could be made, resulting in harmful treatment b. Done through: i. Test manuals detailing how to correctly administer the test ii. Journal articles that provide practical examples of how an instrument is used in research or applied contexts iii. Test catalogues, designed to sell the test; usually not containing critical reviews and not very detailed
Explain two reasons why we control the use of psychological tests, and two possible consequences if they were not controlled.
a. Test needs to be administered correctly, controlling for as many outside variables as possible; if not: i. data could be essentially useless ii. wrong diagnosis could be made, resulting in harmful treatment b. Test needs to be administered by a professional who understands its implications and potential pitfalls; if not: i. the wrong test could be administered for the variable of interest ii. the test could be incorrectly interpreted
Give three different meanings of the term "objective" as applied to psychological tests.
a. The goal of administering the test b. Not influenced by the personal feelings or opinions of the test administrator c. Not influenced by the environmental advantages provided to the participant—e.g. an intelligence test that gives the same score regardless of age/education level/SES
What do we mean when we say, "These two measures are confounded"?
a. Two measures are confounded when they vary together in a way that makes their separate effects impossible to distinguish, so we cannot tell which construct is responsible for the scores. i. Recognize: look at convergent and discriminant validity - factor analysis or correlational analysis ii. Solution: throw out the garbage (remove or revise the confounded items or measures)
content validity
measures how well the test covers what it was designed to measure. It estimates how well a measure represents every element of a construct. • It is based on evaluation of the subjects, topics, or content covered by the items in the test. • Often seen as a prerequisite for criterion validity, because it indicates whether the desired trait was measured. • Qualitative in nature; asks whether a specific element enhances or detracts from the test. b. A content-valid measure tells us that the test adequately represents the dimensionality present in the content itself. A test with low content validity contains items irrelevant to the main construct; if test items measure something other than the construct of interest, they can introduce bias.
internal consistency
measures whether each question included in the test is measuring the same thing. A test of a single construct should measure nothing but that construct. One way to assess this is split-half reliability: randomly split the test items into two halves and correlate testtakers' scores on the two halves. different questions, same construct
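A minimal sketch of the split-half approach with made-up half-scores (the Spearman-Brown step estimates what the reliability of the full-length test would be, since each half is only half as long as the real test):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores of five testtakers on the two random halves of a test
half_a = [10, 12, 14, 16, 18]
half_b = [11, 12, 15, 15, 19]

r_half = pearson_r(half_a, half_b)
# Spearman-Brown correction: estimates full-length reliability
# from the correlation between the two half-length tests
r_full = (2 * r_half) / (1 + r_half)
```

The corrected coefficient is always at least as large as the raw half-test correlation, reflecting the fact that longer tests are more reliable.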
Random errors
naturally occur when gathering any data or taking any measure. They're impossible to predict because they can be caused by any number of variables
convergent and discriminant validity
are both types of construct validity that evaluate how similar (or different) two measures are. Both require the researcher to define exactly what trait they are measuring, and both involve a comprehensive analysis of how scores on the test relate to other test scores and measures.
Describe two major scientific questions for which psychological tests have been developed in the past century, and for each, name two researchers who were working to meet that need and what they are best known for.
a. How can we screen potential candidates for the military (or another profession)? i. Alfred Binet, Binet measure of intelligence ii. David Wechsler, Wechsler Adult Intelligence Scale (WAIS) b. How can we diagnose and classify psychological disorders? i. Hathaway and McKinley, Minnesota Multiphasic Personality Inventory (MMPI) ii. Hermann Rorschach, Rorschach inkblot test
Describe two major needs of society for which psychological tests or assessment approaches have been developed in the past century, and for each, name two tests designed to meet that need.
a. Standardized measurement of academic achievement i. TACS ii. STARS b. Measurement and diagnosis of psychological disorders i. Beck Depression Inventory ii. Depression Anxiety and Stress Scale
Consider the 7 present-day assumptions about psychological testing and assessment listed in Chapter 4. Which of these might not apply a) in ancient China (see Chapter 2)? b) in the U.S. in 1900? c) in the U.S. in 1950?
A. 7 Assumptions: a. Psychological traits + states exist i. Trait—any distinguishable, relatively enduring way in which one individual varies from another ii. State—distinguishes one person from another but is relatively less enduring b. Traits and states can be quantified and measured • If a personality test yields a score purporting to provide information about how aggressive a testtaker is, a first step in understanding the meaning of that score is understanding how aggressive was defined c. Test-related behavior predicts non-test-related behavior d. Tests + other measurement techniques have strengths + weaknesses e. Various sources of error are part of the assessment process f. Testing can be conducted in a fair and unbiased manner i. CHINA—long tradition of societal position being determined solely by the family into which one was born; people were excited about the chance to improve their lives by scoring high on an examination • In reality, passing required knowledge that came from long hours of study or work with a tutor, barring many applicants ii. 1950s UNITED STATES • The 1949 WISC asked "if your mother sends you to the store for a loaf of bread..." • Many Hispanic children were sent for tortillas and were not familiar with a loaf of bread g. Testing + assessment benefit society i. 1900s UNITED STATES • Goddard used test results to classify people as "morons," attributing many of society's problems (crime, unemployment, poverty) to low intelligence • Davenport was a strong believer in eugenics, the science of improving the qualities of a breed through intervention; held that "feebleminded" individuals should be segregated or institutionalized and not permitted to reproduce
Compare and contrast "speed tests" and "power tests". Include in your answer an explanation of how to evaluate the reliability of each type of test.
A. TIME—In a power test, the time limit is long enough to allow testtakers to attempt all items. A speed test has an established time limit, allowing few, if any, testtakers to complete all the test items correctly. B. DIFFICULTY—Power tests contain some items that are so difficult that no testtaker is able to obtain a perfect score. Speed tests generally contain items of the same level of difficulty (usually low) so that all testtakers should be able to complete all the test items correctly, if given enough time. C. SCORING—Power tests are based on percentage of items correctly completed. Scoring of a speed test is based on performance speed, because most of the items the participant has a chance to complete will be correct. D. RELIABILITY—Reliability on a speed test should reflect consistency of response speed, and not be calculated from a single testing session. Split-half reliability would be a great measure of reliability for a power test, and would allow the experimenter to assess the consistency of items throughout the test. However, because most of the items a speed-tester has a chance to answer will be correct, split-half reliability would not be a good indicator of consistency.
Explain three reasons why we use psychological tests.
• Assess a construct before and after treatment to see whether the treatment is effective • Examine spelling ability among a group of third graders • Generalize the level of optimism of a sample of people to a larger population
Explain the relationship (if any) between criterion-related validity and criterion-referenced tests.
Criterion-related validity assesses whether test scores relate to an external standard or outcome. Criterion-referenced tests are designed to indicate where a testtaker stands with respect to some variable or criterion, like a cutoff score used to make a diagnosis or treatment decision. Beyond sharing the word "criterion," the two are largely independent: one concerns evidence of a test's validity, the other concerns how scores are interpreted (against a set standard rather than against other testtakers).
Describe the strengths and weaknesses of face validity as a means of evaluating a test.
I. STRENGTHS: 1. Generally clear what the test is intended to measure, although that doesn't mean it reliably does so 2. Face validity inspires confidence: a lack of it could reduce confidence in the perceived effectiveness of the test, which may affect cooperation or motivation 3. A lack of face validity may also make it difficult to get parents/administrators to 'buy in' to use of the test because it is not perceived as relevant or useful II. WEAKNESSES: 1. Relates more to what the test appears to measure than what the test actually measures 2. Not very objective 3. Weak measure overall
Explain the role of culture in the determination of test validity
If not accounted for, various aspects of culture can affect test results, undermining overall validity. For example, the 1949 WISC used a question that asked what a participant would do if they went to the store to get bread for their mother, but the store was out. Validity was affected when the test was used, unchanged or merely translated into Spanish, because of cultural differences: Hispanic children were routinely sent to the store for tortillas, not bread, and consistently got the question "wrong." If culture is not taken into account, cultural factors can covertly affect results.
Explain why reliability is an important aspect of evaluating a test
In order to meaningfully interpret test scores and make useful employment or career-related decisions, you need a reliable test that consistently measures what it claims to measure. A person taking the same test, under similar conditions, should get relatively the same score as they did the first time. Reliable assessment tools produce dependable, repeatable, consistent information about participants.
Error variance - random error
Random errors naturally occur when gathering any data or taking any measure. They're impossible to predict because they can be caused by any number of variables, so they are impossible to control for or eliminate completely. Examples: noisy construction outside the window during the test, participant has a cold on test day, fly won't leave participant alone
construct validity
Refers to whether the operational definition of the construct the measure is based upon truly reflects the theoretical meaning of a concept. • Can be characterized as the "umbrella validity," because every other variety of validity falls under it
How to evaluate the validity and reliability of a behavioral observation?
Reliability—is behavior consistent in different situations or environments? Is it different at different times of day? Are measures being taken to control extraneous variables? Was interrater reliability calculated? Validity—does the behavior of interest occur? Does the behavior correlate with other measures of themes or patterns recognized in behavior?
Error variance- systematic error
Systematic error causes the measured value to be off by the same amount each time, because it is caused by a consistent error you haven't controlled for yet. Determining the source of the problem is possible, and once it is eliminated, the overall error in the results is reduced and scores move closer to the true score. Does not affect score consistency—the relative standings of all participants will stay the same. Examples: ruler is ¼ inch too short, scale is five pounds off
Compare and contrast three approaches to evaluating a test's reliability.
Test-retest reliability (same people, same test, different times; appropriate for constructs that are stable over time), parallel forms reliability (same people, different forms of the test; minimizes memory for item content), and internal consistency (a single administration; asks whether the items all measure the same construct, e.g. via split-half reliability). All three estimate consistency of measurement but target different sources of error: time, test form, and item sampling, respectively.
example of construct validity
The experimenter compares scores on a paper-and-pencil measure of introversion/extraversion with behavior exhibited across different settings, interviews of family and friends, and an open-ended written response about how the participant thinks their friends would describe them. These measures are more highly correlated with each other than with measures of related but distinctly different constructs (e.g. shyness, social anxiety)
concurrent validity
a measure of how well scores on a particular test correlate with scores on a previously validated measure, with both measures given at the same time. A fairly weak type of validity, rarely accepted on its own, but a useful guide for new testing procedures. Ideally, researchers test concurrent validity first and then follow up with a predictive-validity study. • Tells us that the measurement is highly correlated with previously validated measures of the same construct • e.g. scores on a new measure of introversion and extroversion correlate well with the Extroversion Introversion Test, a previously validated measure of the same construct
Standard error of measurement
a statistic used to estimate the amount of error in a measurement device. It can be used to build a confidence interval around an observed score, indicating how sure we are of it. Intuitively: if one participant took the same test many, many times, the standard deviation of the resulting scores would approximate the standard error of measurement.
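A sketch using the classical-test-theory shortcut SEM = SD × √(1 − reliability), with hypothetical values for an IQ-style scale:

```python
import math

# Hypothetical values: an IQ-style scale with SD = 15 and reliability .91
sd = 15.0
reliability = 0.91

# Classical test theory: SEM = SD * sqrt(1 - reliability)
sem = sd * math.sqrt(1 - reliability)  # about 4.5 points

# Roughly 68% of observed scores fall within +/- 1 SEM of the true score,
# so an observed score of 100 gives this confidence band:
observed = 100
low, high = observed - sem, observed + sem  # about 95.5 to 104.5
```

Note that the more reliable the test, the smaller the SEM and the tighter the confidence band around any observed score.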
Give three examples of situations in which psychological tests are used. For each, describe three possible negative consequences of bad decisions that might result from use of a faulty test, explaining who would suffer and why.
a. When determining whether or not to hire someone i. Could hire them even though they aren't qualified (demand characteristics) ii. Could NOT hire them even though they ARE qualified because confounding variables interfered with the test process iii. Could waste time + money administering an unnecessary test b. When determining whether or not someone should have accommodations made i. May not reach cutoff scores and be denied assistance they need ii. Test may give unfair weight to certain variables (type of disability or length of time impaired) and deny some assistance unfairly while others who don't really need it receive it c. Measurement of intelligence i. All individuals may not begin on equal footing, which may unfairly give some very low scores ii. May be further denied access to education or other resources iii. May even be sterilized/lobotomized as a result (eugenics movement)
Assume you have taken the GRE and applied to the nation's most prestigious graduate program in psychology. With which reference group would you prefer your GRE score to be compared, and why?
a. Options include a fixed reference group, all those who took the test when you did, local norms for that school, or national norms for all GRE examinees
Define "criterion contamination", then create and explain an example.
a. (p. 190) Criterion contamination: when a criterion measure has itself been based on, or influenced by, the predictor measure, artificially inflating the validity coefficient. Example: a researcher validates the MMPI against clinicians' diagnoses, but the clinicians saw patients' MMPI profiles before diagnosing; the criterion (the diagnosis) is contaminated by the predictor (MMPI scores).
standard error of the difference
an estimate of how large the difference between two scores must be before it should be considered statistically significant rather than a product of measurement error. For example, if the standard error of the difference between two scores is 2, an observed difference of about 4 points (two standard errors) or more suggests a genuine difference.
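A sketch with hypothetical SEM values: the standard error of the difference combines the standard errors of measurement of the two scores being compared:

```python
import math

# Hypothetical standard errors of measurement for two scores
sem_1 = 3.0
sem_2 = 4.0

# Standard error of the difference between the two scores
se_diff = math.sqrt(sem_1 ** 2 + sem_2 ** 2)  # 5.0

# Rule of thumb: a difference of about 2 * se_diff or more is
# unlikely to be due to measurement error alone
threshold = 2 * se_diff  # 10.0
```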
Define "a psychological test".
an objective and standardized measure of a sample of behavior. May be used to assess a construct before and after treatment, examine an ability within a group, or generalize from a sample to a larger population
criterion validity
assesses whether a test reflects a certain set of abilities. To measure the criterion validity of a test, researchers compare it to a known standard or external outcome. It is based on evaluating the relationship of scores obtained on the test to scores on other tests or measures. • Derives quantitative correlations from test scores • Both predictive and concurrent validity are criterion measures. The main difference between them is the time elapsed between the test and the criterion measure.
discriminant validity
assesses whether constructs that are believed to be unrelated are, in fact, unrelated • shows little to no relationship with variables that it should not theoretically be correlated with • example: since extraversion and introversion are typically viewed as a continuum, a high introversion score should be correlated with a low extraversion score, and vice versa. Additionally, a test of introversion should not be highly correlated with measures of shyness or social anxiety, which are similar but distinctly different constructs.
convergent validity
assesses whether constructs that should be related are related • evidence may come from correlations with tests purporting to measure both identical and related constructs • example: a high score on a paper-and-pencil measure of extraversion correlates with increased use of words like "energetic," "talkative," "chatty," "social," and "outgoing" (compared with introverts) in a self-description the participant writes
predictive validity
involves testing a group of subjects on a certain construct, then comparing the results with outcomes obtained at some point in the future. Useful for educational and employment tests meant to predict future performance. Regarded as a very strong form of validity evidence, though it requires waiting for the criterion data to become available. • Indicates that the measurement is highly predictive of future performance, useful in education/employment settings
Standard error of the estimate
a measure of the accuracy of predictions made from a regression equation: the standard deviation of actual criterion scores around the predicted scores.
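A sketch with made-up predictor and criterion scores, fitting the least-squares line by hand and taking the spread of the residuals (with the conventional n − 2 degrees of freedom):

```python
import math

# Hypothetical predictor (x) and criterion (y) scores for five people
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Least-squares regression line predicting y from x
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx

# Standard error of the estimate: spread of actual criterion
# scores around the regression line's predictions
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
see = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
print(round(see, 3))  # 0.894
```

The smaller the standard error of the estimate, the more accurate the predictions made from the regression line.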
your grade in this course
norm referenced, based on the relative performance of the class. Although the instructor may make every effort to teach the class the same way each time, so that the test is a reliable indicator of class progress, there will always be both random and systematic errors affecting the manner and depth in which material was covered (time, random days the school is closed, mean age/education level of the class, number of statistics courses completed, practice effects because some students took the class before...)
admission to unt honors program. norm-referenced or criterion referenced?
norm referenced. Law changes or other variables may affect the number of freshmen with high GPAs admitted to UNT, and the resources and space available in the honors college are limited. Admitting everyone, regardless of available spots, would influence the programs and classes the honors college is able to provide. Additionally, criterion-referenced measurement would unnecessarily bar some students from the honors college when spots are available and waiting.
Random selection
participants are randomly chosen from the population in an attempt to get the most representative sample possible.
Describe four ways that examiners might influence test scores, explaining whether each way might help or hurt the validity of the test and why.
• Examiner may use a test intended for adults to assess a child. Though it may be a valid measure for adults, that validity is compromised when the test is used with a population it wasn't intended for. • May influence responses of the participant unintentionally (head nodding for correct answers, etc.) • Examiner's appearance—the ethnicity of the examiner vs. the patient has been shown to affect disclosure rates; may decrease validity through assumptions made about the examiner • Administers a test battery in an inconsistent order—could cause some tests to influence performance on others, hurting validity
example of predictive validity
• Example: scores on a measure of introversion and extroversion correlate well with the number of social events an individual attends over the next five years
Explain the importance of understanding the theoretical properties of the construct before deciding how to measure it and how to evaluate its validity and reliability.
• Helps you understand what has and hasn't been studied • May help you identify potential confounding variables you need to control for (e.g. cultural influence on a measure that has not yet been studied within a given population) • Helps you conceptualize the different dimensions of a given construct • Points out related concepts that may help put together a larger picture of the construct (or help you identify threats to discriminant validity)
Describe four ways that test-taker characteristics that are not construct-related might influence test scores, and the effect of each one on the test's validity.
• practice effects—the testtaker has taken the test before; hurts validity + generalizability of results • physical discomfort—hurts reliability (and therefore validity) by increasing error variance • inattention—the testtaker accidentally skips a question on a standardized test, throwing off all subsequent answers and making the data useless • low motivation or test anxiety—the testtaker does not perform to ability, so scores underestimate the construct and validity suffers