Psych 155 test 2
Three different types of reliability
Test-retest reliability; internal consistency; scorer reliability and agreement
written test
a paper and pencil test in which a test taker must answer a series of questions
Cohen's kappa
an index of agreement for two sets of scores or ratings
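A minimal sketch of how Cohen's kappa can be computed for two raters (the ratings below are hypothetical):

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Compute Cohen's kappa for two equal-length lists of ratings."""
    assert len(rater1) == len(rater2)
    n = len(rater1)
    # Observed agreement: proportion of cases where the raters match.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Expected chance agreement: product of each rater's marginal
    # proportions, summed over all categories used.
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum((c1[k] / n) * (c2[k] / n) for k in set(rater1) | set(rater2))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings: two scorers classifying ten test protocols.
r1 = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
r2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]
print(round(cohens_kappa(r1, r2), 2))  # agreement corrected for chance
```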
measurement error
variations or inconsistencies in the measurements yielded by a test or survey
generalizable
when a test can be expected to produce similar results even though it has been administered in different locations
practice effects
when test takers benefit from taking a test the first time (practice) because they are able to solve problems more quickly and correctly the second time they take the same test
item response theory (IRT)
a theory that relates the performance of each item to a statistical estimate of the test taker's ability on the construct being measured
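As an illustration, a sketch of the one-parameter logistic (Rasch) model, one common IRT model; the ability (theta) and difficulty (b) values below are made up:

```python
import math

def rasch_probability(theta, b):
    """Probability that a test taker with ability `theta` answers an item
    of difficulty `b` correctly under the 1-parameter logistic (Rasch) model."""
    return 1 / (1 + math.exp(-(theta - b)))

# Hypothetical values: an average-ability test taker (theta = 0.0)
# attempting an easy item (b = -1.0) and a hard item (b = 2.0).
print(rasch_probability(0.0, -1.0))  # ~0.73
print(rasch_probability(0.0, 2.0))   # ~0.12
```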
item difficulty
the percentage of test takers who answer a question correctly
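Because item difficulty is simply the proportion of test takers answering correctly, it is straightforward to compute; a sketch with a hypothetical response matrix:

```python
# Each row is a test taker, each column an item; 1 = correct, 0 = incorrect.
# The response matrix below is hypothetical.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
]

n_takers = len(responses)
for item in range(len(responses[0])):
    p = sum(row[item] for row in responses) / n_takers
    print(f"item {item + 1}: difficulty (p) = {p:.2f}")
```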
face validity
the perception of the test taker that the test measures what it is supposed to measure
test specifications
the plan prepared before test development that specifies what the test will measure and how it will be structured
test format
the type of questions on a test
differential validity
when a test yields significantly different validity coefficients for subgroups
intrascorer reliability
whether each scorer is consistent in the way he or she assigns scores from test to test
concurrent evidence of validity (concurrent method)
a method for establishing evidence of validity based on a test's relationships with other variables in which test administration and criterion measurement happen at roughly the same time
generalizability theory
a proposed method for systematically analyzing the many causes of inconsistency or random error in test scores, seeking to find systematic error that can then be eliminated
heterogeneous test
a test that measures more than one trait or characteristic
testing universe
the body of knowledge or behaviors that a test represents
reliable test
a test that consistently yields the same measurements for the same phenomena
random responding
responding to items in a random fashion by marking answers without reading or considering the items
focus group
a method that involves bringing together people who are similar to the target respondents in order to discuss issues related to the survey
pilot test
a scientific investigation of a new test's reliability and validity for its specified purpose
test item
a stimulus or test question
practical test
a test in which a test taker must actively demonstrate skills in specific situations
discriminant evidence of validity
one of two strategies for demonstrating construct validity, showing that constructs that theoretically should not be related are indeed not related; evidence that test scores do not correlate with measures of unrelated constructs
response sets
patterns of responding to a test or survey that result in false or misleading information
test of significance
the process of determining what the probability is that a study would have yielded the observed results simply by chance
surveys
instruments used for gathering information from a sample of the individuals of interest
single-group validity
when a test is valid for one group but not for another group, such as valid for whites but not for blacks
qualitative analysis
when test developers ask test takers to complete a questionnaire about how they viewed the test and how they answered the questions
true/false
a test item that asks, "Is this statement true or false?"
field test
an administration of a survey or test to a larger representative group of individuals to identify problems with administration, item interpretation, and so on
convergent evidence of validity
one of two strategies for demonstrating construct validity showing that constructs that theoretically should be related are indeed related; evidence that the scores on a test correlate strongly with scores on other tests that measure the same construct
operational definition
specific behaviors that define or represent a construct
projective tests
tests that are unstructured and require test takers to respond to ambiguous stimuli
testing environment
the circumstances under which a test is administered
cumulative model of scoring
the more the test taker responds in a particular fashion (either with "correct" answers or ones that are consistent with a particular attribute), the more the test taker exhibits the attribute being measured (e.g., multiple-choice questions)
experimental research techniques
research designs that provide evidence for cause and effect
individually administered surveys
surveys administered by a facilitator in person for respondents to complete in the presence of the facilitator
face-to-face surveys
surveys in which an interviewer asks a series of questions in a respondent's home, a public place, or the researcher's office
descriptive research techniques
techniques that help us describe a situation or phenomenon
reliability
the consistency with which an instrument yields measurements
subjective test format
a test format that does not have a response that is designated as "correct"; interpretation of the response as correct or providing evidence of a specific construct is left to the judgment of the person who administers, scores, or interprets the test taker's response
projective techniques
a type of psychological test in which the response requirements are unclear so as to encourage test takers to create responses that describe the thoughts and emotions they are experiencing; three projective techniques are projective storytelling, projective drawing, and sentence completion
interrater agreement
the consistency with which scorers rate or make yes/no decisions
alternate forms
two forms of a test that are alike in every way except for the questions; used to overcome problems such as practice effects; also referred to as parallel forms
classical test theory
No instrument is perfectly reliable or consistent; all test scores contain some error: X = T + E (observed score = true score + error). Factors that affect reliability include:
• test length
• homogeneity
• test-retest interval
• effective test administration
• careful scoring
• guessing or faking
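A small simulation can illustrate X = T + E; assuming normally distributed true scores and error (all numbers here are made up), the ratio of true-score variance to observed-score variance estimates reliability:

```python
import random
import statistics

random.seed(1)

# Hypothetical simulation: 1,000 test takers with normally distributed
# true scores (T) plus random error (E) gives observed scores X = T + E.
true_scores = [random.gauss(100, 15) for _ in range(1000)]
errors = [random.gauss(0, 5) for _ in range(1000)]
observed = [t + e for t, e in zip(true_scores, errors)]

var_t = statistics.variance(true_scores)
var_x = statistics.variance(observed)
# In classical test theory, reliability can be viewed as true-score
# variance over observed-score variance (about 15^2 / (15^2 + 5^2) = 0.90 here).
print(f"estimated reliability: {var_t / var_x:.2f}")
```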
quantitative item analysis
a statistical analysis of the responses that test takers gave to individual test questions
correlation
a statistical procedure that provides an index of the strength and direction of the linear relationship between two variables
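A minimal sketch of the Pearson correlation (the score pairs are hypothetical); the same computation underlies test-retest reliability coefficients and validity coefficients:

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the
    product of their standard deviations."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

# Hypothetical scores from two administrations of the same test.
time1 = [85, 90, 78, 92, 88, 73, 95]
time2 = [82, 93, 75, 90, 85, 78, 96]
print(f"r = {pearson_r(time1, time2):.2f}")
```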
objective test format
a test format that has one response that is designated as "correct" or that provides evidence of a specific construct, such as multiple-choice questions
construct validity
an accumulation of evidence that a test is based on sound psychological theory and therefore measures what it is supposed to measure; evidence that a test relates to other tests and behaviors as predicted by a theory
construct
an attribute, trait, or characteristic that is abstracted from observable behaviors
content validity ratio
an index that describes how essential each test item is to measuring the attribute or construct that the item is supposed to measure
item nonresponse rate
how often an item or question was not answered
homogeneity of the population
how similar the people in a population are to one another
intrarater agreement
how well a scorer makes consistent judgments across all tests
experts
individuals who are knowledgeable about a topic or who will be affected by the outcome of something
categorical model of scoring
places test takers in a particular group or class (e.g., displays a pattern of responses that indicates a clinical diagnosis of a certain psychological disorder)
ipsative model of scoring
requires the test taker to choose among the constructs the test measures (e.g., forced choice)
empirically based tests
tests in which the decision to place an individual in a category is based solely on the quantitative relationship between the predictor and the criterion
interscorer agreement
the consistency with which scorers rate or make decisions
validity coefficient
the correlation coefficient obtained when test scores are correlated with a performance criterion; it represents the amount or strength of the evidence of validity for the test
scorer reliability
the degree of agreement between or among persons scoring a test or rating an individual; also known as interrater reliability
content validity
the extent to which the questions on a test are representative of the material that should be covered by the test
parallel forms
two forms of a test that are alike in every way except questions; used to overcome problems such as practice effects; also referred to as alternate forms
multiple choice
an objective test format that consists of a question or partial sentence, called a stem, followed by a number of responses, only one of which is correct
content validity ratio
CVR = (E - N/2) / (N/2), where E is the number of experts who rate the item essential and N is the total number of experts. As an example, say you assembled a panel of 10 experts, seven of whom rated the item essential:
CVR = (7 - 10/2) / (10/2) = (7 - 5) / 5 = 2/5 = 0.40
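The same arithmetic as a short sketch in code (the panel size and ratings are the hypothetical values from the example above):

```python
def content_validity_ratio(n_essential, n_experts):
    """CVR = (E - N/2) / (N/2), where E is the number of experts
    rating the item essential and N is the panel size."""
    half = n_experts / 2
    return (n_essential - half) / half

# The worked example above: 7 of 10 experts rate the item essential.
print(content_validity_ratio(7, 10))  # 0.4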
test manual
Provides: the rationale for constructing the test, the history of the development process, and the results of validation studies.
Describes: the appropriate target audience and instructions for administering and scoring the test.
Contains: norms and information on interpreting individual scores.
7 methods of testing reliability
Test-retest reliability: the same test administered to the same people at two points in time.
Alternate forms (parallel forms): two forms of the test administered to the same people.
Internal consistency (split-half): give the test in one administration, then split the test into two halves for scoring.
Internal consistency (all possible split halves): give the test in one administration, then compare all possible split halves.
Interrater reliability: give the test once, and have it scored (interval/ratio level) by two scorers or two methods.
Interrater agreement: give a rating instrument and have it completed by two or more judges.
Intrarater agreement: calculate the consistency of scores for one scorer across multiple tests.
cut scores
decision points for dividing test scores into pass/fail groupings
validity
evidence that the interpretations that are being made from the scores on a test are appropriate for their intended purpose
confidence interval
a range of scores that the test user can feel confident includes the true score
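One common way to build such an interval, assuming classical test theory, uses the standard error of measurement, SEM = SD * sqrt(1 - rxx); a sketch with hypothetical numbers:

```python
import math

def score_confidence_interval(observed, sd, reliability, z=1.96):
    """95% confidence interval around an observed score using the
    standard error of measurement: SEM = SD * sqrt(1 - rxx)."""
    sem = sd * math.sqrt(1 - reliability)
    return observed - z * sem, observed + z * sem

# Hypothetical: observed score 100, test SD 15, reliability .90.
low, high = score_confidence_interval(100, 15, 0.90)
print(f"95% CI: {low:.1f} to {high:.1f}")
```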
two methods for demonstrating evidence of validity
The predictive method: used when it is important to show a relationship between test scores and a future behavior; appropriate for validating employment tests.
The concurrent method: used when test administration and criterion measurement happen at the same time; appropriate for validating clinical tests that diagnose behavioral, emotional, or mental disorders, as well as selection tests; often used for selection tests because employers do not want to hire applicants with low test scores or wait a long time to get criterion data.
criterion-related validity (evidence of validity based on test-criteria relationships)
evidence that test scores correlate with or predict independent behaviors, attitudes, or events; the extent to which the scores on a test correlate with scores on a measure of performance or behavior
norms
a group of scores that indicates the average performance of a group and the distribution of scores above and below this average
validity - past vs present
Whether there is evidence supporting the interpretation of the resulting test scores for their intended purpose.
• In the past, there were three types of validity: content, criterion-related, and construct.
• Validity is now viewed as a single concept: "evaluating the interpretation of test scores and accumulating evidence to provide a sound scientific basis for the proposed score interpretations."
test-retest method
a method for estimating test reliability in which a test developer gives the same test to the same group of test takers on two different occasions and correlates the scores from the first and second administrations
test plan
a plan for developing a new test that specifies the characteristics of the test, including a definition of the construct and the content to be measured (the testing universe), the format for the questions, and how the test will be administered and scored
order effects
changes in test scores resulting from the order in which tests or questions on tests were administered
item bias
differences in responses to test questions that are related to differences in culture, gender, or experiences of the test takers
item analysis
the process of evaluating the performance of each item on a test
survey objectives
the purpose of a survey, including a definition of what it will measure
random error
the unexplained difference between a test taker's true score and the obtained score; error that is nonsystematic and unpredictable, resulting from an unknown cause
factors that influence reliability
1. The test itself: poorly designed trick questions, ambiguous questions, poorly written questions, a reading level higher than the reading level of the target population.
2. Test administration: not following administration instructions, disturbances during the test period, answering test takers' questions inappropriately, extremely cold or hot room temperature.
3. Test scoring: not scoring according to instructions, inaccurate scoring, errors in judgment, errors in calculating test scores.
4. Test takers: fatigue, illness, exposure to test questions before the test, not providing truthful and honest answers.
predictive evidence of validity (predictive method)
a method for establishing evidence of validity based on a test's relationships with other variables that shows a relationship between test scores obtained at one point in time and a criterion measured at a later point in time
homogeneous test
a test that measures only one trait or characteristic
content areas
the knowledge, skills, and/or attributes that a test assesses
criterion
the measure of performance that we expect to correlate with test scores
predictions using validity information
When a relationship can be established between test scores and a criterion, we can use test scores from other individuals to predict how well those individuals will perform on the criterion measure
evidence of validity based on relationship with external criteria
When test scores correlate with independent behaviors, attitudes, or events
instructional objectives
a list of what individuals should be able to do as a result of taking a course of instruction
subjective criterion
a measurement that is based on judgment, such as supervisor or peer ratings
population
all members of the target audience
reliability coefficient
• Correlation provides an index of the strength and direction of the linear relationship between two variables.
• Correlation coefficient: rxy
• Reliability coefficient: rxx
• Reliable tests will have reliability coefficients with positive signs.
5 sources of evidence of validity
• Evidence based on test content
• Evidence based on response processes
• Evidence based on internal structure
• Evidence based on relations with other variables
• Evidence based on the consequences of testing
internal consistency
the internal reliability of a measurement instrument; the extent to which each test question measures the same attribute as the rest of the test
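A split-half sketch of internal consistency: correlate two halves of the test (odd vs. even items) and apply the Spearman-Brown correction, 2r / (1 + r), to estimate full-length reliability. The response matrix below is hypothetical:

```python
import statistics

def pearson_r(x, y):
    # Pearson correlation: covariance over the product of standard deviations.
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

# Hypothetical 6-item test; rows = test takers, 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 1],
]

# Split into halves by odd/even item position and total each half.
odd = [sum(row[0::2]) for row in responses]
even = [sum(row[1::2]) for row in responses]

r_half = pearson_r(odd, even)
# Spearman-Brown correction estimates reliability of the full-length test.
r_full = 2 * r_half / (1 + r_half)
print(f"split-half r = {r_half:.2f}, corrected = {r_full:.2f}")
```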
reliability
A reliable test is one we can trust to measure each person in approximately the same way every time it is used