Tests and Measurements Study Guide
Samuel has participants answer the items on his measure using the following scale: 1: Not at all true 2: A little true 3: Somewhat true 4: Very true Which of the following is true about this response scale? It is continuous It is unbalanced It is unipolar It is bipolar
It is unipolar
Which of the following is not true about construct validity? It requires making predictions or hypotheses about associations between constructs It involves assembling evidence about what a test really measures It requires expert judges or raters It involves looking at multiple different variables and correlations
It requires expert judges or raters
What does a negative discrimination index indicate to test developers? More people in the high-scoring group (compared to the low-scoring group) got the item correct More people in the low-scoring group (compared to the high-scoring group) got the item correct The item is valid The difficulty index is close to 0
More people in the low-scoring group (compared to the high-scoring group) got the item correct
When an assessment compares the performance of a child against other children of the same age it is called a(n) ________________________. Achievement test Standardized test Norm-referenced test Criterion-referenced test
Norm-referenced test
The score that you actually record is which of the following? Error score False score True score Observed score
Observed score
Tests can apply to a wide variety of content areas. Which of the following is not a general content area that psychological tests cover? Ability level Personality attributes Vocational interests Physical features
Physical features
Dr. Love creates a scale measuring how well couples communicate. He finds that couples scoring low on his scale are much more likely to get divorced within the next five years. This is an example of what type of validity for Dr. Love's scale? Content validity Predictive validity Concurrent validity Construct validity
Predictive validity
In order to determine the potential for performance in a future setting, an aptitude test must have which of the following? Concurrent validity Discriminant validity Predictive validity High reliability
Predictive validity
If a scale is internally consistent, then there is evidence that Individuals perform similarly on the scale in the future and the past All of the items on the scale measure the same construct The scale can be used to measure an individual's true score on a construct All of the above
All of the items on the scale measure the same construct
Why are z scores so important to the field of measurement? Carries a positive connotation regarding performance Allows scores from different scales/distributions to be compared Easily understood by non-testing professionals Allows for positive and negative scores
Allows scores from different scales/distributions to be compared
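A minimal sketch of why z scores make scores from different distributions comparable. The function name `z_score` and all of the means, standard deviations, and raw scores below are hypothetical values chosen for illustration.

```python
def z_score(raw, mean, sd):
    """Express a raw score as the number of standard deviations from the mean."""
    return (raw - mean) / sd

# Hypothetical example: two tests reported on very different scales.
exam_a = z_score(raw=85, mean=75, sd=10)     # a test scored 0-100
exam_b = z_score(raw=600, mean=500, sd=100)  # a test scored 200-800

print(exam_a, exam_b)  # 1.0 1.0 -> same relative standing on both tests
```

Once both scores are standardized, the raw scales no longer matter: each person sits one standard deviation above the mean of their own distribution.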
The first step in scale construction is to _____. Develop an item pool Determine the structure and format of the response scale Determine the extent of the construct/content you want to measure Decide upon the number of items to include
Determine the extent of the construct/content you want to measure
What is the first step you should take when designing an aptitude test? Determine the content & distribution of items Pilot items in a representative sample Determine the skills that are important for success in the role Run item analysis to determine whether the test shows good item discrimination
Determine the skills that are important for success in the role
Poor test instructions are an example of which of the following? Trait error Testing error Method error Instrument error
Method error
The same variable (e.g., happiness) can be assessed using different levels of measurement False True
True
If you add together the squared bivariate correlations between each predictor variable (X1 and X2) and the outcome (Y), the result should be ______ the squared multiple correlation between both predictor variables and the outcome. In other words, r²_y1 + r²_y2 is _____ R²_Y.12 equal to greater than no way to tell less than
greater than
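One way to check this is the standard two-predictor formula for the squared multiple correlation, sketched below with hypothetical correlation values. The answer assumes the usual case where the predictors overlap (are positively correlated with each other): with uncorrelated predictors the two quantities are equal, and suppressor effects can reverse the inequality.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """Squared multiple correlation of Y with X1 and X2, computed from the
    three pairwise correlations (standard two-predictor formula)."""
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)

# Hypothetical correlations: each predictor with Y, and the predictors with each other.
r_y1, r_y2, r_12 = 0.50, 0.40, 0.30

sum_of_squares = r_y1**2 + r_y2**2
big_r_squared = r_squared_two_predictors(r_y1, r_y2, r_12)

print(round(sum_of_squares, 3), round(big_r_squared, 3))  # 0.41 0.319
```

The sum of the squared bivariate correlations counts the variance that X1 and X2 share with Y twice, which is why it exceeds the squared multiple correlation here.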
The variance of a variable is a measure of: how much variability exists among scores on the variable how well the variable measures what it is supposed to be measuring how consistently the variable is measured over time how internally consistent the items are measuring a variable
how much variability exists among scores on the variable
The squared correlation between two variables tells you what the value of one variable is given the score on another variable how much variance they share whether the variables are positively or negatively related whether two variables are on the same scale
how much variance they share
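As a rough illustration, the sketch below computes a Pearson correlation by hand and squares it to get the proportion of shared variance. The helper `pearson_r` and the two score lists are hypothetical and exist only for this example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for five people on two scales.
scale_a = [1, 2, 3, 4, 5]
scale_b = [2, 1, 4, 3, 5]

r = pearson_r(scale_a, scale_b)
shared_variance = r ** 2

print(round(r, 2), round(shared_variance, 2))  # 0.8 0.64
```

Here r = .80, so the two scales share 64% of their variance. The sign of r is lost when squaring, which is why the squared correlation cannot tell you the direction of the relationship.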
What is the biggest problem with this question? True or False: I do not vote Republican. It is not concrete It involves a negation/double-negative It is a biased/loaded question It is double-barreled
It involves a negation/double-negative
When writing true-false items, what does the test developer most need to consider? Need for precision Number of items Probability of guessing Plausible distractors
Need for precision
A factor's eigenvalue equals the PERCENT of total variance in the data accounted for by a factor the AMOUNT of total variance in the data accounted for by a factor the MEAN of all the factor loadings 1
the AMOUNT of total variance in the data accounted for by a factor
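To make the amount-versus-percent distinction concrete, here is a small sketch. It assumes an unrotated solution on standardized items, where each item contributes one unit of variance and a factor's eigenvalue equals the sum of its squared loadings. The loadings are hypothetical.

```python
def eigenvalue_from_loadings(loadings):
    """Eigenvalue of a factor as the sum of its squared item loadings
    (assumes an unrotated solution on standardized items)."""
    return sum(loading ** 2 for loading in loadings)

# Hypothetical loadings of four items on one factor.
loadings = [0.8, 0.7, 0.6, 0.5]

eigenvalue = eigenvalue_from_loadings(loadings)
proportion = eigenvalue / len(loadings)  # total variance = number of items

print(round(eigenvalue, 2), round(proportion, 3))  # 1.74 0.435
```

The eigenvalue (1.74) is the amount of variance the factor accounts for; dividing by the number of items gives the proportion (43.5%).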
A scree plot is a graph of _______ the factor eigenvalues the factor correlations the item loadings the item correlations
the factor eigenvalues
If a scale is completely unreliable, the correlation coefficient in any test of its reliability will be close to ___. -1.00 0.00 1.00 0.50
0.00
Validity is typically described in which of the following ways? A Continuum of weak to strong A Cronbach's Alpha A P value A correlation coefficient
A Continuum of weak to strong
What type of test measures current knowledge of a specific topic? Intelligence test Aptitude test Achievement test Projective test
Achievement test
A psychological construct is A behavioral tendency or complex pattern of behavior Not directly observable Something you must infer from a variety of questions or behaviors All of the above
All of the above
When administering an intelligence test, what is the term for the lowest point on the test where the test taker can pass two consecutive items of equal difficulty? Lower Limit Ceiling Age Upper Limit Basal Age
Basal Age
Which type of item best minimizes the effects of guessing on test scores? True/False Completion Multiple Choice Matching
Completion
Dr. Baylor is concerned that the items on his "psychological health" scale really do adequately represent all aspects of psychological health. Dr. Baylor's concern is one of ______________. Content validity Construct validity Concurrent validity Predictive validity
Content validity
What is the fundamental logic behind the multitrait-multimethod technique? If a measure has construct validity, then... Correlations should be highest among different traits (e.g., aggression, intelligence) all measured using the same method (e.g., self-report) Correlations should be lowest among different traits (e.g., aggression, intelligence) all measured using the same method (e.g., self-report) Correlations should be highest among different methods (e.g., observation, self-report) all measuring the same trait (e.g., aggression) Correlations should be lowest among different methods (e.g., observation, self-report) all measuring the same trait (e.g., aggression)
Correlations should be highest among different methods (e.g., observation, self-report) all measuring the same trait (e.g., aggression)
If you correlate scores from your test with a real-world indicator of the construct you're trying to test, what type of validity evidence are you collecting? Criterion validity Content validity Face validity Construct validity
Criterion validity
If you are evaluating students' scores based on a predefined level of performance, what type of scores are you using? Stanines Standard scores Criterion-referenced scores Percentile scores
Criterion-referenced scores
What type of test is used when you want to determine if a student has achieved a specific level of mastery in a particular content area? Researcher-made test Criterion-referenced test Norm-referenced test Standardized test
Criterion-referenced test
As part of a study, Sandra answers nine questions asking about depressive symptoms (e.g. "I cry a lot," "I often feel like things are hopeless"). She answers each of these questions on a 1 (not at all true) to 5 (very true) rating scale. Each one of these nine questions could be considered a(n) A) item B) scale C) variable D) both A and C
D) both A and C
The purpose of the scale development project is to Test a hypothesized relationship between scores on a self-report measure and other constructs Design a self-report measure and evaluate its psychometric properties (e.g., reliability, validity) Create a scale measuring a construct no one has ever examined before Describe and distinguish between various forms of assessment
Design a self-report measure and evaluate its psychometric properties (e.g., reliability, validity)
What is calculated when you examine the proportion of test takers who get an item correct? Item analysis Discrimination index Difficulty index Correct alternatives
Difficulty index
What is the computed number for how well an item distinguishes between people high and low in a construct? Item analysis Discrimination index Difficulty index Correct alternatives
Discrimination index
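The two item-analysis indices can be sketched as follows. The function names and data are hypothetical, and splitting test takers into a top half and a bottom half is a simplifying assumption; some texts compare the top and bottom 27% instead.

```python
def difficulty_index(item_correct):
    """Proportion of all test takers who answered the item correctly."""
    return sum(item_correct) / len(item_correct)

def discrimination_index(item_correct, total_scores, fraction=0.5):
    """Proportion correct in the high-scoring group minus the proportion
    correct in the low-scoring group (groups formed from total scores)."""
    ranked = sorted(zip(total_scores, item_correct), key=lambda pair: pair[0])
    k = max(1, round(len(ranked) * fraction))
    low = [correct for _, correct in ranked[:k]]
    high = [correct for _, correct in ranked[-k:]]
    return sum(high) / k - sum(low) / k

# Hypothetical data: ten test takers, total test scores and 1/0 item results.
totals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
item_good = [0, 0, 1, 0, 1, 1, 1, 1, 1, 1]      # high scorers do better
item_reversed = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0]  # low scorers do better

print(round(difficulty_index(item_good), 2))                 # 0.7
print(round(discrimination_index(item_good, totals), 2))     # 0.6
print(round(discrimination_index(item_reversed, totals), 2)) # -0.6
```

The negative value for the second item is the warning sign described earlier in this guide: more people in the low-scoring group than in the high-scoring group got it correct.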
Dr. McStuffins's patients think that her measure of health is valid because it involves the sorts of things they think of when they conceptualize being healthy (e.g., "not having boo-boos," "eating your vegetables"). This is an example of what type of validity? Content validity Construct validity Criterion validity Face validity
Face validity
When ordering items within a scale/survey, you should go from the most _____ to the most ______. General, Specific positive, negative objectionable, benign Specific, General
General, Specific
The optimal difficulty for a question is Close to 1 Close to 0 Halfway between 100% correct and chance level correct Halfway between 100% correct and 0% correct
Halfway between 100% correct and chance level correct
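A quick worked version of this rule, assuming chance level is one divided by the number of answer options. The function name `optimal_difficulty` is hypothetical.

```python
def optimal_difficulty(n_options):
    """Midpoint between 100% correct and the chance-level proportion correct."""
    chance = 1 / n_options
    return (1.0 + chance) / 2

print(optimal_difficulty(2))  # 0.75  (true/false item)
print(optimal_difficulty(4))  # 0.625 (four-option multiple choice item)
```

So a true/false item is ideally answered correctly by about 75% of test takers, and a four-option multiple choice item by about 62.5%.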
What is the key to establishing criterion validity? Having a criterion that can be measured simultaneously with the test Having a criterion that can be measured after the test takes place Assessing multiple criteria Having a criterion that is meaningfully related to the purpose of the test
Having a criterion that is meaningfully related to the purpose of the test
Which of the following is not ALWAYS recommended when writing items for a scale? Make items specific and concrete Include negatively keyed items Keep items simple, clear, & short Consider the knowledge and experiences of your audience
Include negatively keyed items
Which of the following often underlies individuals' achievement and aptitude in specific content areas? Interpersonal functioning Creativity Intelligence Mechanical ability
Intelligence
Two trained professionals observe the behavior of children in a classroom. They each rate observed behaviors using the same form, and the percent of items that were rated the same is calculated. This is an example of which type of reliability? Test-retest reliability Internal consistency Interrater reliability Parallel reliability
Interrater reliability
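The percent-agreement calculation described in this question can be sketched as below. The function name and the ratings are hypothetical. Simple percent agreement does not correct for agreement that would occur by chance.

```python
def percent_agreement(rater_a, rater_b):
    """Proportion of items on which two raters gave the same rating."""
    if len(rater_a) != len(rater_b):
        raise ValueError("Both raters must rate the same number of items.")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

# Hypothetical ratings of ten classroom behaviors by two observers.
observer_1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
observer_2 = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

print(percent_agreement(observer_1, observer_2))  # 0.8
```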
When we want to evaluate a person's performance relative to that of others within a particular group, which of the following would we want to use? Norm-referenced score Raw score Criterion-referenced score Performance indicator
Norm-referenced score
What is a common pitfall with completion items? The potential for more than one correct response Including statements that are only partly true Making the answer too easy to guess Being overly detailed
The potential for more than one correct response
Factor loadings are: always greater than 1 a sign of construct validity an indication of reliability correlations between the original items and factors
correlations between the original items and factors
An item whose highest factor loading and next highest factor loading are relatively similar (<.2 different) is considered a(n) _____________. extraction loading crossloader communality eigenvalue
crossloader
Item discrimination can be calculated by a) comparing the number of correct responses in the "high" versus "low" groups b) correlating scores on an item with scores on the total scale c) examining which items correlate most with multiple subscales d) both a and b
d) both a and b
When designing a scale, factor analysis should be used to: measure the validity of a scale measure the reliability of a scale determine whether the scale measures one construct or multiple constructs identify the best and worst items on the scale
determine whether the scale measures one construct or multiple constructs
The goal of a factor analysis is to retain a small number of factors that capture very little of the variance in the variables. retain a small number of factors that capture a lot of the variance in the variables. retain a large number of factors that capture very little of the variance in the variables retain a large number of factors that capture a lot of the variance in the variables
retain a small number of factors that capture a lot of the variance in the variables.
A measurement device or technique used to quantify behavior is termed a test construct observation self-report
test
The maximum number of factors resulting from a factor analysis is ______ the number of items in the factor analysis. greater than less than equal to inversely related to
equal to
Which kind of question is best for assessing higher order thinking and complex understanding? essay multiple choice matching completion
essay
If a test is being used to determine how scores on a construct relate to something else, then it is being used for hypothesis testing/prediction classification selection diagnosis
hypothesis testing/prediction
When sample size is large and the number of items in the analysis is small, chance results occur _________. never less often more often always
less often
If you want to know the meaning of a factor, you should look at the scree plot look at the factor's eigenvalue look at the communalities look at which items load highly on it
look at which items load highly on it
Factor rotation: makes some items more correlated and some items less correlated with each factor (than they initially were) is only necessary when you have reverse scored variables in the data set flips the meaning of the factors from positive to negative (i.e. reverse scores the factors) a and b only
makes some items more correlated and some items less correlated with each factor (than they initially were)
If a variable was measured in a way that provides magnitude but not equal intervals or an absolute zero, it was measured at the _______ level nominal ordinal ratio interval
ordinal
Dr. Chocula tests his students' understanding of the material in his math class by having them complete math problems. The format of this test is observation performance self-report inference
performance
If a test is being used to determine who should be admitted into graduate school, then it is being used for hypothesis testing/prediction classification selection diagnosis
selection
A difficulty index close to 1 indicates that individuals low in the construct are more likely to answer the question correctly that individuals high in the construct are more likely to answer the question correctly that the question is too hard that the question is too easy
that the question is too easy
Why is studying tests and measurement important within psychology? Tests may be used unfairly or inaccurately if the attributes of the test (e.g., purpose, populations it was designed for, reliability/validity) are not well understood Tests are used ubiquitously throughout psychological research and applied psychology settings (e.g., school psychology, counseling) Our understanding of human behavior is only as good as the tools we use to measure it All of the above
All of the above
Measures of anxiety and depression have a strong, negative correlation (e.g., r = -.70). Knowing this, which of the following is true? All of the above are true You will generally be able to predict someone's score on a depression measure well if you know their score on a measure of anxiety. You will generally be able to predict someone's score on an anxiety measure well if you know their score on a measure of depression. A scatterplot of individuals' scores on an anxiety measure and a depression measure will group relatively close to a line of best fit
All of the above are true
Why are correlation coefficients key to measuring reliability? Because they can test whether... Individuals who get high scores relative to the mean on one scale (or items within that scale) consistently get high scores relative to the mean on another scale (or items within another scale) An individual's scores on reverse-scored items are positively related to their scores on the other items on the scale. An individual's scores on a scale (or items on that scale) are consistently linked with outcomes An individual's scores on the same scale (or items within that scale) are consistently high or low relative to the mean of that scale
An individual's scores on the same scale (or items within that scale) are consistently high or low relative to the mean of that scale
According to Spearman's General Factor Theory, intelligence ("g") is The combination of componential, experiential, and contextual intelligence An underlying factor that explains individual differences in intellect Specific, independent abilities that vary among individuals The first "primary" mental ability
An underlying factor that explains individual differences in intellect
According to the lecture, which of the following is a problem with the below response scale? Very untrue Untrue Slightly untrue Neutral Slightly true True Very true Has too many options Contains a neutral option Response scale is not balanced Answers are not all mutually exclusive
Answers are not all mutually exclusive
Sonya's new measure of volunteerism did not explain a significant amount of additional variance in life satisfaction (change in R2) beyond that explained by an existing volunteerism scale. This means that: The new measure of volunteerism IS significantly correlated with life satisfaction (alone) Any of the above could be true; we cannot tell from the information provided The new measure of volunteerism is NOT significantly correlated with life satisfaction (alone) The existing volunteerism scale is significantly correlated with life satisfaction (alone)
Any of the above could be true; we cannot tell from the information provided
Assignment revisions in the Scale Development Project... Can earn you back all of the points you lost on the assignment Can be turned in any time after the assignment has been graded Are possible for the introduction and existing measures sections Are possible for all SDP assignments
Are possible for the introduction and existing measures sections
Item characteristic curves are helpful because they are easier to understand than the discrimination index are easier to understand than the difficulty index show the graphical relationship of difficulty and discrimination Can help identify who the item best discriminates among (e.g., low vs. moderately performing individuals)
Can help identify who the item best discriminates among (e.g., low vs. moderately performing individuals)
A Q-sort format is helpful because It assesses complex thinking skills (like synthesis and application) It forces people to discriminate among options when they normally would all select the highest or lowest values It allows you to easily score many items quickly It promotes free expression and creativity
It forces people to discriminate among options when they normally would all select the highest or lowest values
Dr. Testopherson creates a new scale measuring anxiety. Unfortunately, his scale has very poor reliability: items do not seem to all measure the same thing and individuals get different scores each time they take it. How should Dr. Testopherson go about establishing his scale's validity? By asking experts to verify that the scale items are good examples of various aspects of anxiety By correlating scores on his scale with related (e.g., depression) and unrelated (e.g., GPA) constructs By determining whether people who score high on his scale tend to have a diagnosed anxiety disorder according to the DSM He really can't. Until his scale is reliable, it cannot be a valid measure of anything.
He really can't. Until his scale is reliable, it cannot be a valid measure of anything.
Bryce has a 5-item scale and decides to run an item analysis. He correlates participants' scores on Item 1 with the mean score of all five items on the scale (including Item 1). He finds an item-total correlation of .42. This correlation... shows that the scale is reliable shows that Item 1 is reliable shows that Item 1 is valid Is an OVERestimate; Bryce actually needs to run a corrected item-total correlation (excluding Item 1 from the scale mean)
Is an OVERestimate; Bryce actually needs to run a corrected item-total correlation (excluding Item 1 from the scale mean)
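The difference between the uncorrected and corrected item-total correlation can be sketched as follows. The helper names and the six respondents' answers are hypothetical. The uncorrected version is inflated because the item is correlated with a scale mean that contains the item itself.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def item_total_correlations(rows, item):
    """Return (uncorrected, corrected) item-total correlations for one item.

    rows: one list of item responses per respondent.
    item: zero-based index of the item being analyzed.
    """
    item_scores = [row[item] for row in rows]
    mean_all = [sum(row) / len(row) for row in rows]
    # Corrected version: leave the item out of the scale mean.
    mean_rest = [(sum(row) - row[item]) / (len(row) - 1) for row in rows]
    return pearson_r(item_scores, mean_all), pearson_r(item_scores, mean_rest)

# Hypothetical responses: six respondents, five items rated 1-5.
responses = [
    [4, 5, 4, 3, 5],
    [2, 1, 2, 3, 1],
    [3, 3, 4, 2, 3],
    [5, 4, 5, 4, 4],
    [1, 2, 1, 2, 2],
    [3, 4, 2, 4, 3],
]

uncorrected, corrected = item_total_correlations(responses, item=0)
print(round(uncorrected, 2), round(corrected, 2))
```

With only five items, Item 1 makes up a fifth of the scale mean, so the uncorrected value comes out higher than the corrected one.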
An advantage of the continuous rating scale (compared to a category or Likert scale) is: It is easier to score It only requires that the endpoints be rated It allows fine-grained distinctions without overly taxing participants It requires fine motor control
It allows fine-grained distinctions without overly taxing participants
How is guessing reduced with matching items? Repeating answer options Providing a great number of premises Providing more response options Using two columns when formatting
Providing more response options
Which of the following levels of measurement provides the most information about a variable Nominal Ratio Ordinal Interval
Ratio
Jennifer is creating a scale to measure introversion. In her experience, people who are introverted prefer to be alone, so she creates an item on her scale that reads: "I prefer to be alone" (agree-disagree). This is an example of what type of strategy for choosing scale content? Criterion-group strategy Factor analysis strategy Rational-content strategy Theory-based strategy
Rational-content strategy
Andrew answers 5 questions about self-esteem (each rated 1-5) and earns a total score of 19 on this self-esteem measure. 19 is an example of what kind of score? Raw score Norm-referenced score Criterion-referenced score Z-score
Raw score
If a measure is said to be consistent, you might conclude that the measure is _______________________. Standard Reliable Concurrent Valid
Reliable
What does a percentile (rank) score of 85 tell us about an individual? Scored higher than 85% of others taking the test Has mastered 85% of the material Tells us nothing about the individual Scored higher than 15% of others taking the test
Scored higher than 85% of others taking the test
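A minimal sketch of a percentile rank, using the "percent of scores below" definition given in this answer. Other definitions (for example, counting half of any tied scores) exist. The function name and the scores are hypothetical.

```python
def percentile_rank(score, all_scores):
    """Percent of scores in the group that fall below the given score."""
    below = sum(s < score for s in all_scores)
    return 100 * below / len(all_scores)

# Hypothetical group of 20 test takers with scores 1 through 20.
group_scores = list(range(1, 21))

print(percentile_rank(18, group_scores))  # 85.0
```

A score of 18 beats 17 of the 20 scores, so its percentile rank is 85. That says nothing about how much of the material was mastered.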
Which of the following is required of the test taker when answering multiple choice, true-false, and matching questions? Selecting information Supplying information Applying knowledge Synthesizing material
Selecting information
Which of the following is required of the test taker when answering essay, short answer, and completion items? Supplying information Selecting information Applying knowledge Synthesizing material
Supplying information
The grid that served as a guide when constructing an achievement test is called _______. Table of Outcomes Objectives Table Table of Specifications Specifications Grid
Table of Specifications
A measure of how stable scores on a test are over time is an example of which of the following? Internal consistency Parallel forms reliability Test-retest reliability Interrater reliability
Test-retest reliability
Sarah now has her regression output. To determine whether the number of Disney videos watched within the last month (DisNum) accounts for additional unique variance in children's enjoyment (Enjoy) of a Disney World trip, over and above the child's age (Age), Sarah should look at The change in R squared in Model 2 The overall R squared value for Model 1 The overall R squared value for Model 2 The change in R squared in Model 1
The change in R squared in Model 2
Incremental validity assesses the extent to which a scale _______. measures what it is designed to measure is correlated with an expected outcome measures a construct consistently accounts for variance in an outcome, beyond that which can be explained by other measure(s)
accounts for variance in an outcome, beyond that which can be explained by other measure(s)
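Incremental validity is usually examined through the change in R squared when the new scale is added to a model that already contains the existing measure. The sketch below uses the standard two-predictor formula and hypothetical correlations. It assumes just one existing measure and one new measure.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """Squared multiple correlation of Y with two predictors, computed from
    the three pairwise correlations (standard two-predictor formula)."""
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)

# Hypothetical correlations.
r_outcome_existing = 0.50  # existing scale with the outcome
r_outcome_new = 0.45       # new scale with the outcome
r_existing_new = 0.80      # the two scales with each other

r2_model_1 = r_outcome_existing ** 2  # block 1: existing scale only
r2_model_2 = r_squared_two_predictors(
    r_outcome_existing, r_outcome_new, r_existing_new
)                                      # block 2: existing + new scale
delta_r2 = r2_model_2 - r2_model_1     # incremental validity of the new scale

print(round(r2_model_1, 3), round(r2_model_2, 3), round(delta_r2, 3))
```

In this example the new scale correlates .45 with the outcome on its own, yet it adds less than one percentage point of explained variance once the existing scale is in the model. A small change in R squared therefore does not tell you whether the new scale is correlated with the outcome by itself.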
When a psychological test takes the form of self-reported answers to questions, it may also be referred to as a scale survey questionnaire all of the above
all of the above
Sally finds a corrected item-total correlation of .42 for an item on her scale. Sally should keep the item - it certainly measures the same construct as the other items keep the item - it shows that the scale is valid (measures what it says it's measuring) think about revising or discarding the item - it doesn't correlate that well with the rest of the construct automatically discard the item -- it's just too poor of a fit with the rest of the scale
think about revising or discarding the item - it doesn't correlate that well with the rest of the construct
The change in the value of the multiple correlation when a new variable is added to a regression model can be used to provide evidence about The validity of a new scale The reliability of a new scale The novelty or usefulness of a new scale The creativity of a new scale
The novelty or usefulness of a new scale
To determine how much variance the combination of number of Disney videos watched within the last month (DisNum) and child's age (Age) accounts for in children's enjoyment (Enjoy) of a Disney World trip, Sarah should look at The overall R squared value for Model 1 The overall R squared value for Model 2 The change in R squared in Model 1 The change in R squared in Model 2
The overall R squared value for Model 2
When dividing scores into quartiles, each quarter should have The same NUMBER of scores The same STANDARD DEVIATION The same RANGE of scores The same NORM
The same NUMBER of scores
Cronbach's alpha essentially averages all possible split-half reliability coefficients to estimate reliability. This is an improvement over a single split-half reliability estimate because It corrects for items within a measure being on different scales The number of items being correlated is smaller when you split your test in half None of the above The split-half reliability estimate depends on how you split your scale (e.g. odd-even vs. 1st half - 2nd half)
The split-half reliability estimate depends on how you split your scale (e.g. odd-even vs. 1st half - 2nd half)
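For reference, Cronbach's alpha is usually computed from the item variances and the variance of the total score rather than by literally averaging split halves. A minimal sketch with hypothetical data follows; the function name `cronbach_alpha` is an assumption for this example.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha from a list of respondents' item responses.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(rows[0])                      # number of items
    items = list(zip(*rows))              # one tuple of scores per item
    item_variance_sum = sum(variance(scores) for scores in items)
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)

# Hypothetical responses: six respondents, five items rated 1-5.
responses = [
    [4, 5, 4, 3, 5],
    [2, 1, 2, 3, 1],
    [3, 3, 4, 2, 3],
    [5, 4, 5, 4, 4],
    [1, 2, 1, 2, 2],
    [3, 4, 2, 4, 3],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

Because every item enters the formula symmetrically, the result does not depend on how the scale might be split in half.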
The difficulty index and the discrimination index are Inversely related to one another The two main components making up item analysis Only calculable when there is a single right answer Both rated on a 0 to 1 scale
The two main components making up item analysis
Jennifer next reads that the psychological definition of introversion involves individuals who gain energy from reflection and lose energy during social interaction. So she also includes the item "spending time interacting with others tends to drain my energy" (agree-disagree). This is an example of what type of strategy for choosing scale content? Theory-based strategy Rational-content strategy Criterion-group strategy Factor analysis strategy
Theory-based strategy
What is NOT a reason that multiple choice questions are often preferred for assessing learning/achievement? They allow for creative and unique responses They are easy to score They can be used to measure learning outcomes at almost any level They lend themselves to item analysis
They allow for creative and unique responses
A pitfall with matching items is that _____ They are only practical when you can generate a large number of options They are hard to administer to a large number of people The questions are not independent They deemphasize writing ability
They are only practical when you can generate a large number of options
What benefit do True/False, Multiple Choice, Matching, and Completion items all share? They are easy to write They mainly assess basic knowledge and memorization skills You can fit many on a test which can increase content validity They are objective and easy to score
They mainly assess basic knowledge and memorization skills
When we calculate reliability, we know the observed score. What are the two unknown components of the reliability equation that we can only estimate (not directly measure)? Method and error scores Test-retest and interrater scores True and error scores Means and standard deviations
True and error scores
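In classical test theory, each observed score is a true score plus an error score, and reliability is the share of observed-score variance that is true-score variance. The simulation below is only a sketch: in real data the true and error scores cannot be seen, and the distributions and sample size here are hypothetical.

```python
import random
from statistics import pvariance

random.seed(0)  # fixed seed so the sketch is repeatable

n_people = 5000
# Hypothetical true scores (SD = 10) and random errors (SD = 5).
true_scores = [random.gauss(50, 10) for _ in range(n_people)]
error_scores = [random.gauss(0, 5) for _ in range(n_people)]

# Observed score = true score + error score.
observed_scores = [t + e for t, e in zip(true_scores, error_scores)]

# Reliability = true-score variance / observed-score variance.
reliability = pvariance(true_scores) / pvariance(observed_scores)
print(round(reliability, 2))  # should land near 100 / (100 + 25) = 0.80
```

Only `observed_scores` would be available to a real test developer, which is why reliability has to be estimated indirectly (for example, through test-retest or internal consistency methods).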
Sarah wants to know whether the number of Disney videos watched within the last month (DisNum) accounts for additional unique variance in children's enjoyment (Enjoy) of a Disney World trip, over and above the child's age (Age). When running a multiple regression to examine this question, Sarah should: *DV = Dependent Variable *IV = Independent Variable Use Enjoy as the DV; enter DisNum as the first IV (block1) and Age as the second IV (block 2) Use Enjoy as the DV; enter Age as the first IV (block1) and DisNum as the second IV (block 2) Use DisNum as the DV; enter Enjoy as the first IV (block 1) and Age as the second IV (block 2) Use Enjoy as the DV; enter Age and DisNum together in a single block (block 1)
Use Enjoy as the DV; enter Age as the first IV (block1) and DisNum as the second IV (block 2)
Because "intelligence" is often inferred from behavior on a test, its measurement must be Well-grounded in theory Quantitative Assessed using factor analysis Group administered
Well-grounded in theory