Chapter 2 Test: Psychological Assessments

A testtaker earns placement in a particular group with other testtakers whose pattern of responses is similar in some way.

Class scoring

Testtakers must supply or create the correct answer.

constructed-response format

A source of error attributable to variations in the test taker's feelings, moods, or mental state over time. a. Transient Error b. Odd-even Reliability c. Inter-item Consistency d. Measurement Error

Transient Error

(Coefficient Alpha) The higher the correlation between the two halves, the more likely the two halves are measuring the same thing. T/F:

True

Alternate forms (also known as parallel forms or equivalent forms): the items do not have to be identical; the same content is tapped differently but equally, and people's scores on both versions are correlated. Alternate forms are mostly needed in achievement tests and classroom tests. Alternate forms are more difficult to construct than people think. T/F:

True

A particular test situation or universe is described in terms of its _________ which include things like the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration.

facets

In a psychometric context, _____ refers to the extent to which a test is used in an impartial, just, and equitable way. a. halo effect b. fairness c. bias d. central tendency

fairness

The _____ is based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation. a. content sampling theory b. classical test theory c. item response theory d. generalizability theory

generalizability theory

A(n) _____ is a logical result or deduction. a. construct b. hit rate c. inference d. criterion

inference

The degree of agreement or consistency between two or more raters with regard to a particular measure is known as _____. a. homogeneity b. heterogeneity c. split-half reliability d. inter-scorer reliability

inter-scorer reliability

The process by which a measuring device is designed and calibrated and by which scale values are assigned to different amounts of the trait, attribute, or characteristic being measured is known as _________.

scaling

Identify the types of item formats. (Check all that apply.) a. selected-response format b. constructed-response format c. activity format d. experiment format

selected-response format; constructed-response format

Identify an estimate of reliability that is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once. a. parallel forms reliability b. split-half reliability c. test-retest reliability d. alternate forms reliability

split-half reliability

A company wants to determine the practical value of a personality development training module prior to its inclusion in an HR training program. In this scenario, the company is trying to understand the _____ of the module. a. authenticity b. capacity c. utility d. validity

utility

An index of _____ gives information on the practical value of the information derived from scores on the test. a. utility b. dependability c. reliability d. validity

utility

As applied to a test, _____ is a judgment or estimate of how well a test measures what it purports to measure in a particular context. a. probability b. validity c. universality d. intentionality

validity

As applied to a test, _____ is a judgment or estimate of how well a test measures what it purports to measure in a particular context. a. validity b. universality c. probability d. intentionality

validity

As error goes up, reliability goes down. As error goes down, reliability goes up. T/F:

True

How to determine if a test is good?

Ask if it is reliable, if it is valid, and if it is practical (time/money spent vs. information gained).

An estimate is about predicting scores. T/F:

True

Under classical test theory, the true score refers to your actual value on this particular measure if there were no error. T/F:

True

Identify the component of an observed test score that has nothing to do with a testtaker's ability. a. efficiency b. error c. true variance d. true score

error

Testtakers must choose a response from a set of alternative responses.

selected-response format

According to the generalizability theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained, which is known as the _____. a. true score b. universe score c. true variance d. error variance

universe score

In the generalizability theory, a(n) _____ replaces that of a true score. a. error variance b. transient score c. true variance d. universe score

universe score

Low r in depth:

weak relationship between the measure (test score) and what you are predicting: people are spread out, so if you predict a score on the regression line you are more likely to be off, but predictions are still better than chance.

Rank the five stages in the process of developing a test in order, placing the first step in the top position and the last step in the bottom position.

1. Test conceptualization 2. Test construction 3. Test tryout 4. Item analysis 5. Test revision

_____ is a judgment of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest. a. Intercept bias b. Face validity c. Validity coefficient d. Criterion-related validity

Criterion-related validity

Someone takes a personality test and, when the results are revealed, finds that they fit most of the descriptions. This is an example of?

Forer/Barnum Effect

Different Ways to measure reliability?

Test-Retest (gold standard), Alternate Forms, Interscorer

True or false: Item analysis tends to be regarded as a quantitative endeavor, even though it may also be qualitative. a. True b. False

True

How do you determine validity?

You look for accurate and meaningful information. Be careful: the Forer effect.

The percentage of people hired under an existing system for a particular position is known as the _____. a. base rate b. selection ratio c. productivity ratio d. utility rate

base rate

For psychometricians, a factor inherent in a test that systematically prevents accurate, impartial measurement is known as ________

bias

The term _____ refers to the inherent uncertainty associated with any measurement, even after care has been taken to minimize preventable mistakes. a. utility b. reliability c. true variance d. measurement error

measurement error

Which of the following is a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice? a. split-half reliability b. test-retest reliability c. alternate forms reliability d. parallel forms reliability

split-half reliability

What are the three elements of a multiple-choice format? (Check all that apply.) a. a correct alternative or option b. binary-choice items c. several incorrect alternatives or options d. a stem e. premises and responses

(1) a stem, (2) a correct alternative or option, and (3) several incorrect alternatives or options

An IT solutions company hires 30 software testers and 22 of them are considered successful. In this case, the base rate is _____. a. 1.36 b. 0.0073 c. 0.73 d. 0.0136

0.73

An organization has 75 job positions available and 100 applicants. In this scenario, the selection ratio is _____. a. 0.075 b. 13.33 c. 0.75 d. 1.33

0.75
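
A minimal sketch in Python of how these two indices are computed, using the numbers from the two items above:

# Base rate: proportion of those hired under the existing system who are
# considered successful (22 successful out of 30 hired).
successful, hired = 22, 30
base_rate = successful / hired            # 0.73

# Selection ratio: number of people to be hired relative to the number
# available to be hired (75 positions, 100 applicants).
positions, applicants = 75, 100
selection_ratio = positions / applicants  # 0.75

print(round(base_rate, 2), selection_ratio)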

Identify a true statement about an index of inter-item consistency. a. A measure of inter-item consistency is calculated from a single administration of a single form of a test. b. It is a sufficient tool to measure multifaceted psychological variables such as intelligence or personality. c. A measure of inter-item consistency is calculated from multiple administrations of a single form of a test. d. It is directly related to test items that measure more than one personality trait.

A measure of inter-item consistency is calculated from a single administration of a single form of a test.

Reliability Coefficient

A statistic that quantifies reliability. It ranges from 0 (not reliable/no consistency) to 1 (reliable)

Which informal rule of thumb is followed regarding the number of people on whom a test should be tried out? a. There should be a maximum of 20 subjects for each item on the test. b. There should be a minimum of 2 subjects for each item on the test. c. There should be a maximum of 50 subjects for each item on the test. d. There should be no fewer than 5 subjects for each item on the test.

An informal rule of thumb is that there should be no fewer than 5 subjects and preferably as many as 10 for each item on the test.
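
For example, under this rule a 30-item test would be tried out on at least 150, and preferably around 300, subjects.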

Standard Error of Estimate

Average distance that individuals tend to be from the regression line. The regression line is known as the line of best fit.
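
A minimal Python sketch of the standard formula for the standard error of estimate, SEE = SDy × √(1 − r²); the SD and correlation below are hypothetical:

import math

sd_y = 4.0  # standard deviation of the criterion scores (hypothetical)
r = 0.6     # correlation between test and criterion (hypothetical)
see = sd_y * math.sqrt(1 - r ** 2)
print(round(see, 2))  # 3.2: average distance of scores from the regression line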

Tests are due for revision when:
1. The stimulus materials look dated and current testtakers cannot relate to them.
2. The verbal content of the test, including the administration instructions and the test items, contains dated vocabulary that is not readily understood by current testtakers.
3. As popular culture changes and words take on new meanings, certain words or expressions in the test items or directions may be perceived as inappropriate or even offensive to a particular group and must therefore be changed.
4. The test norms are no longer adequate as a result of group membership changes in the population of potential testtakers.
5. The test norms are no longer adequate as a result of age-related shifts in the abilities measured over time, and so an age extension of the norms (upward, downward, or in both directions) is necessary.
6. The reliability or the validity of the test, as well as the effectiveness of individual test items, can be significantly improved by a revision.
7. The theory on which the test was originally based has been improved significantly, and these changes should be reflected in the design and content of the test.

Boop

How do you determine reliability? What do you look for?

Consistency: Is it consistent? Are the scores repeatable? Do people pick the same answers? Assuming personality stays the same, the timeframe may matter.

Anxiety is a construct that may be invoked to describe why a psychiatric patient paces the floor. What does this best describe? a. Construct, b. Construct Validity c. Utility d. Concurrent Validity

Construct

Intelligence is a construct that may be invoked to describe why a student performs well in school. What does this best describe? a. Construct, b. Construct Validity c. Utility d. Concurrent Validity

Construct

The theoretical, intangible way people vary describes? a. Construct, b. Construct Validity c. Utility d. Concurrent Validity

Construct

_____ is a judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable such as a trait or a state. a. Incremental validity b. Face validity c. Base validity d. Construct validity

Construct validity

_____ describes a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample. a. Face validity b. Content validity c. Predictive validity d. Incremental validity

Content validity

_____ describes a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample. a. Face validity b. Predictive validity c. Incremental validity d. Content validity

Content validity

_____ relates more to what a test appears to measure to the person being tested than to what the test actually measures. a. Concurrent validity b. Face validity c. Predictive validity d. Content validity

Face validity

_____ refers to how uniform a test is in measuring a single concept. a. Homogeneity b. Face validity c. Halo effect d. Inference

Homogeneity

____________ contain items that are functionally uniform throughout, while ________ contain items that are functionally different.

Homogeneous tests, Heterogeneous tests.

If a test has a higher degree of internal consistency it is _________ in items. If a test has a lower degree of internal consistency it is _______ in items.

Homogeneous, Heterogeneous

Which of the following is a limitation of the Taylor-Russell tables? a. The tables cannot interpret the variables of selection ratio and base rate. b. Identification of a criterion score that separates successful from unsuccessful employees can potentially be difficult. c. It is impossible to estimate the extent to which inclusion of a particular test in a selection system will improve selection. d. The relationship between the predictor (the test) and the criterion (rating of performance on the job) must be nonlinear.

Identification of a criterion score that separates successful from unsuccessful employees can potentially be difficult.

Construct

Informed scientific idea developed or hypothesized to describe or explain behavior.

What are some examples of constructs?

Intelligence, Anxiety, Job Satisfaction, Personality, Bigotry, Clerical Aptitude, Depression, Motivation, Self-esteem, Emotional Adjustment, Potential Dangerousness, Executive Potential, Creativity, Mechanical Comprehension.

A testtaker's score on one scale within a test is compared to another scale within that same test.

Ipsative Scoring

Identify an accurate statement about the concept of "bias." a. In the context of psychometrics, it always implies prejudice and preferential treatment. b. It has the same meaning in every context that it is used. c. It can be detected using a variety of statistical procedures. d. It is not helped by active prevention during test development.

It can be detected using a variety of statistical procedures.

Identify an accurate statement about convergent evidence for validity of a test. a. It only comes from correlations with tests measuring related constructs. b. It comes from correlations with tests measuring identical or related constructs. c. It only comes from correlations with tests measuring identical constructs. d. It comes from correlations with tests measuring unrelated constructs.

It comes from correlations with tests measuring identical or related constructs.

Identify a true statement about validity as applied to a test. a. It is a judgment of how dependable a test is to measure the same construct consistently. b. It is a judgment of how consistently a given test measures a particular construct. c. It is a judgment that serves as a stimulus to the creation of a new standardized test. d. It is a judgment based on evidence about the appropriateness of inferences drawn from test scores.

It is a judgment based on evidence about the appropriateness of inferences drawn from test scores.

In the context of setting cut scores, identify a true statement about the method of predictive yield. a. It is a norm-referenced method for setting cut scores. b. It employs a family of statistical techniques and is also called discriminant analysis. c. It uses a bookmark to separate test items and set a cut score. d. It sheds light on the relationship between identified variables and two naturally occurring groups.

It is a norm-referenced method for setting cut scores.

Identify steps that should be taken in the test revision stage. (Check all that apply.) a. All items in the test must be rewritten if one item is not valid. b. The testing methodology must be changed if the items are weak. c. Items that are too easy or difficult should be revised. d. Items with many weaknesses should be deleted.

Items that are too easy or difficult should be revised. Items with many weaknesses should be deleted.

Where does error come from?

Measurement error, which has two sources: random error and systematic error.

The potential problems of the Taylor-Russell tables were avoided by an alternative set of tables known as the _____. a. Brogden-Cronbach-Gleser tables b. Naylor-Shine tables c. protocol tables d. predictive yield tables

Naylor-Shine tables

No r in depth

No relationship between the measure (test score) and what you are predicting; a random guess is as good as any.

A _____ is a statistic that quantifies reliability, ranging from 0 (not at all reliable) to 1 (perfectly reliable). a. Utility Coefficient b. Reliability Factorial c. Validity Coefficient d. Reliability Coefficient

Reliability Coefficient

If you believe that you are measuring a characteristic that is essentially unitary, then you may think that your test items should be homogeneous. A heterogeneous test may tap, say, both the physical and cognitive aspects of depression; because it has different facets, you may get lower internal-consistency values.

Review

In the split-half method, you divide the test into two equally hard halves and correlate the scores. It is useful if cost or practice effects would impact test-retest (especially if they impact some testtakers more than others). The coefficient tends to be lower because each half has fewer items, though the Spearman-Brown formula can correct for that; there are still issues with how to pick your halves. This led to splitting the test in half in a thoughtful way, which helps us if we do not have an alternate version. A downside is that when you split the test in half, you have fewer items. Restricted range tells us that the smaller the range of scores, the harder it is to get a meaningful pattern in the data (the Spearman-Brown formula helps). Internal consistency came to be measured through coefficient alpha, which is the mean of all possible split halves.

Review
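
A minimal Python sketch of the Spearman-Brown correction mentioned above, which estimates full-length reliability from the correlation between two half-tests; the half-test correlation is hypothetical:

def spearman_brown(r_half, n=2):
    # n is the factor by which the test is lengthened (2 restores full length)
    return n * r_half / (1 + (n - 1) * r_half)

print(round(spearman_brown(0.70), 2))  # 0.82: corrected full-test reliability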

Internal consistency: are different parts of the test working in the same way? This gives us information about whether we are measuring a single concept, a single entity, or two things at once. If we are measuring two or more things, the reliability coefficient will be lower, because the parts will not always match up the way we expect when different things are being tested.

Review

Is the characteristic or trait being measured presumed to be dynamic or static? That will impact what reliability you expect after, say, 6 months. The test-retest coefficient is referred to as the coefficient of stability.

Review

More ways to measure reliability: internal consistency and its subsets, which are split-half, coefficient alpha, and inter-item consistency.

Review

Reliability coefficient: between 0 (weak) and +1 (strong). More error means less reliable (less error, more reliable). We identified multiple possible sources of error and started looking at different ways to measure reliability.

Review

The standards are reliability coefficients of .9 or .95, the goal for most tests. Values as low as .7 are sometimes accepted (depending on what level of error tolerance is accepted), and research sometimes accepts/uses tests with even lower values. You should find these statistics in the test manuals.

Review

You may then ask whether the range of test scores is restricted or not; you must expect a lower coefficient if it is restricted.

Review

_____ is defined as the process of setting rules for assigning numbers in measurement. a. Piloting b. Anchoring c. Scaling d. Scoring

Scaling

If X is used to represent an observed score, T to represent a true score, and E to represent error, then X=_____. a. T/E b. T-E c. T(E) d. T+E

T+E
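
For example (hypothetical numbers): if a testtaker's true score T is 75 and situational error E adds 3 points, the observed score is X = 75 + 3 = 78.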

Forer Effect/Barnum Effect

The tendency for people to rate descriptions of their personality as highly accurate when the descriptions are supposedly tailored specifically to them but are in fact vague and general enough to apply to a wide range of people.

Identify a true statement about pilot work. a. A test developer employs pilot work to determine how to conceive the idea of a test. b. Pilot study only includes physiological monitoring of the research subjects. c. Pilot study only includes open-ended interviews with research subjects. d. Test items are piloted to evaluate whether they should be included in the instrument.

Test items are piloted to evaluate whether they should be included in the instrument.

Identify a true statement about the item response theory-based methods for setting cut scores in a test. a. Cut scores are typically set based on testtakers' performance across selected items on the test. b. The setting of cut scores is independent of test items c. Testtakers must answer items that are deemed to be above some minimum level of difficulty. d. The setting of cut scores is independent of any expert opinion.

Testtakers must answer items that are deemed to be above some minimum level of difficulty.

Identify an accurate statement about the concept of fairness. a. The issues of fairness are mostly rooted in values. b. It can be answered with mathematical precision and finality. c. It can be perfectly measured and evaluated through statistical procedures. d. The issue of fairness mostly crops up in technically complex, statistical problems.

The issues of fairness are mostly rooted in values.

Identify a limitation of the Taylor-Russell tables. a. They cannot interpret the variables of selection ratio and base rate. b. They cannot estimate the extent to which inclusion of a particular test in a selection system will improve selection. c. The relationship between the predictor and the criterion must be linear. d. The relationship between the test and the rating of performance on the job must be cyclic.

The relationship between the predictor and the criterion must be linear.

Identify a condition that deems a test to be due for revision. a. The test contains vocabulary and stimuli that are easily understood by current testtakers. b. The stimulus materials and verbal content look dated, and current testtakers cannot relate to them. c. The size of the population of potential testtakers increases. d. Current testtakers are able to score high in the test.

The stimulus materials and verbal content look dated, and current testtakers cannot relate to them.

Which of the following must be done after a test has been constructed? (Check all that apply.) a. The item-difficulty index must be calculated using the item-score standard deviation. b. The test must be tried out on people who are similar in critical respects to the people for whom the test was designed. c. The item-discrimination index must be calculated using the correlation between the item score and the criterion score. d. The test must be tried out under conditions identical to those under which the standardized test will be administered.

The test must be tried out on people who are similar in critical respects to the people for whom the test was designed. The test must be tried out under conditions identical to those under which the standardized test will be administered.

Identify an accurate fact about the concept of criterion. a. A criterion is similar to a hit rate. b. There are no specific rules as to what constitutes a criterion. c. Time can never be used as a criterion. d. A criterion can be contaminated in nature.

There are no specific rules as to what constitutes a criterion.

In the context of testing, identify the disadvantages of using the Taylor-Russell tables in a utility analysis. (Check all that apply.) a. They do not indicate the likely average increase in performance with the use of a particular test. b. They unrealistically dichotomize performance into successful versus unsuccessful. c. They require the relationship between the predictor and the criterion to be nonlinear. d. They do not show the relationship between selection ratio and existing base rate.

They do not indicate the likely average increase in performance with the use of a particular test. They unrealistically dichotomize performance into successful versus unsuccessful.

In the context of utility analysis, which of the following is true of the Taylor-Russell tables a. They entail obtaining the difference between the means of selected and unselected groups to know what a test is adding to established procedures. b. They assist in judging the utility of a particular test by determining the increase in average score on some criterion measure. c. They provide an estimate of the extent to which inclusion of a particular test in the selection system will improve selection. d. They provide an indication of the difference in average criterion scores for a selected group as compared with an original group.

They provide an estimate of the extent to which inclusion of a particular test in the selection system will improve selection.

Identify a true statement about test-retest reliability estimates. a. This measure is appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time. b. This measure is used when it is impractical or undesirable to assess reliability with two tests or to administer a test twice. c. This estimate is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once. d. This estimate allows a test user to estimate internal consistency reliability from a correlation of two halves of a test.

This measure is appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time.

Constructs are unobservable, presupposed (underlying) traits that a test developer may invoke to describe test behavior or criterion performance. T/F:

True

Error scores could be positive or negative. There is no way to be able to know someone's true score, nor a way to fully remove error variance. T/F:

True

Item selection is a testing issue where items do not tap the concept well, meaning they do not measure it properly: items are vague, and people interpret and respond to them differently. T/F:

True

In systematic error, you are consistently off in the same direction. It shifts everyone's score in the same direction (e.g., your thermometer keeps reading too high or too low). It does not impact reliability; it impacts validity, because you are measuring something else as part of the total. T/F:

True

In the language of psychometrics, reliability refers to consistency in measurement. T/F:

True

In the real world, you do not know the error variance versus the true variance. So reliability will be estimated using correlation coefficients. T/F:

True

Interscorer reliability is another way to measure reliability; it is also known as interjudge reliability. The same people take the test, but it is given by different administrators, and the scores are correlated: does individual A get a similar score whether Tester A or Tester B gives it to them? You want to test whether people get the same score when the test is provided by different people. You give the same exact test to see if the score remains constant. This helps you establish reliability; it lets you know that the administrators are well trained and that your study materials/test manual were good. T/F:

True

Measurement error is the uncertainty associated with measurement, even after care has been taken to minimize preventable mistakes; it is what is left over. T/F:

True

The overall variance in our observed scores is equal to the variance of the true scores plus the variance of our error. So the more error variance we can get rid of, the more likely our observed variance is to capture the true-score variance: σ²(X) = σ²(T) + σ²(E). T/F:

True

Possible sources of random error are item selection (a test issue), test administration, and test scoring. T/F:

True

Random error consists of unpredictable fluctuations of other variables in the measurement process (i.e., noise). Sometimes random error helps people's scores, and sometimes it hurts their scores. There is no consistent pattern in this error, so it will be partly negative and partly positive, which lowers your overall reliability. It does not go in any particular direction; it just spreads your scores high and low. The errors cancel each other out over time (there is no consistent increase or decrease in scores), but the larger the random error, the lower the reliability. T/F:

True

Reliability math example, using R = V / (V + error variance) with true variance V = 10:
R = 10/(10 + 1) = 10/11 = .91 (good test)
R = 10/(10 + 10) = 10/20 = .50 (error variability increased)
R = 10/(10 + 20) = 10/30 = .33 (reliability got worse)

True

Reliability is the proportion of total variance attributed to true variance. T/F:

True

Reliability used to be based on the classical test theory created by Spearman in 1904; it was used for most of the 20th century. T/F:

True

Test scoring errors can be objective errors (e.g., the teacher accidentally marks the correct answer wrong) or judgment errors (e.g., the teacher does not think your explanation is good enough). Standardization helps reduce these errors. T/F:

True

Test administration is a random source of error. Some sources are environmental (temperature, lighting, distractions); others are examinee state (illness/pain, meds, no sleep, internal distractions), examinee error (carelessness, temporarily blanking), and administration error (wrong instructions, coaching, rapport issues). T/F:

True

Test-retest is the gold standard for measuring reliability; in test-retest, a correlation coefficient is run. You give people a test one day, then give the same group of people the test at a different time, usually close together (e.g., 1 week later), to see if the scores remain consistent or change. Issues with test-retest include test security possibly being compromised and changes in mental state. T/F:

True

Classical test theory is the simplest way to understand reliability and, statistically, the easiest way to calculate it. It gives good information about whether our test is consistent and whether we are measuring something meaningful. T/F:

True

The alternate-forms and parallel-forms reliability of a test are an indication of whether two forms of a test are really equivalent. T/F:

True

The smaller your error variance, the larger the proportion of your total variance that represents your true variance, causing reliability to go up. Reliability is the variability of true scores divided by the variability of observed scores (true + error):

reliability = variance of T / (variance of T + error variance)

By definition, reliability falls between 0 and 1 (there is never no error). As error variance decreases, the denominator gets closer to the variance of T, the numerator and denominator approach each other, the fraction gets closer to one, and reliability increases, and vice versa.

True
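
A minimal Python simulation of this definition, assuming hypothetical normally distributed true scores and errors:

import random
import statistics

random.seed(1)
true_scores = [random.gauss(100, 10) for _ in range(10_000)]  # var(T) ~ 100
errors = [random.gauss(0, 5) for _ in range(10_000)]          # var(E) ~ 25
observed = [t + e for t, e in zip(true_scores, errors)]

# reliability = variance of T / variance of observed scores (T + error)
print(round(statistics.variance(true_scores) / statistics.variance(observed), 2))
# prints ~0.80, i.e., 100 / (100 + 25)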

The construct score is difficult to capture. T/F:

True

The construct score is theoretical, which means we do not have a perfect measure, so we will never get the construct score. The true score would be the actual score on that particular test if you had no error. T/F:

True

The error is the difference between your true (actual) score and the observed score. T/F:

True

Any given observed score for an individual is their true score plus error. When individuals are combined into a population group, you can look at variance, which is a measure of spread. T/F:

True

The more error variance you get rid of, the closer your coefficient will get to 1. T/F:

True

The more error you have, the bigger your denominator, the smaller the overall fraction, and the lower your reliability. T/F:

True

There are two sources of measurement error, these are random error and systematic error. T/F:

True

To be valid, a test needs descriptions that provide meaningful information about differences between people and their out-of-test behavior. T/F:

True

True or false: In the context of the Angoff method, a strategy for setting cut scores that is driven more by data and less by subjective judgments is required in situations where there is major disagreement regarding how certain populations of testtakers should respond to items. a. True b. False

True

True or false: Utility estimates based on the assumption that all people selected will actually accept offers of employment tend to overestimate the utility of the measurement tool. a. True b. False

True

True or false: Writing large item pools is helpful because approximately half of the items in the item pool will be eliminated from a test's final version after revision. a. True b. False

True

Variance equals true variance plus error variance. T/F:

True

Where you land on a particular measure is the true score. T/F:

True

With the standard error of measurement, you are trying to figure out how close you are to the actual score. T/F:

True

Identify an accurate statement about predictive validity of a test. a. Usually some intervening event takes place before the criterion measure is obtained. b. It is another term for concurrent validity. c. The test scores are obtained after the criterion measures have been obtained. d. The intervening event is always a training period.

Usually some intervening event takes place before the criterion measure is obtained.

Inter-Item Consistency

The degree of correlation among all of the items on a scale. It looks at whether the correlations among the items are high or not; if they are high, it suggests that you are tapping a unitary concept.
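
A minimal Python sketch of coefficient alpha computed from inter-item data, using the standard formula alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores); the item scores are hypothetical toy data:

import statistics

items = [
    [4, 5, 3, 4, 5, 2, 4, 3],  # item 1 scores for 8 testtakers (hypothetical)
    [3, 5, 2, 4, 4, 2, 5, 3],  # item 2
    [4, 4, 3, 5, 5, 1, 4, 2],  # item 3
]
k = len(items)
totals = [sum(scores) for scores in zip(*items)]  # each testtaker's total score
item_var_sum = sum(statistics.variance(item) for item in items)
alpha = k / (k - 1) * (1 - item_var_sum / statistics.variance(totals))
print(round(alpha, 2))  # ~0.91: high inter-item consistency for this toy data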

In classical test theory, there is an understanding that all measures have some error. So X is our observed score, which reflects your actual true score/value plus some measurement error. X = observed score, T = true value/actual score, e = error.

X = T + e
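
A minimal Python illustration of X = T + e: one person's hypothetical true score is fixed, and each simulated administration adds random error, so the observed scores vary around the true score:

import random
import statistics

random.seed(7)
T = 100  # hypothetical true score
observed = [T + random.gauss(0, 5) for _ in range(1000)]
print(round(statistics.mean(observed), 1))   # close to 100: errors cancel out
print(round(statistics.stdev(observed), 1))  # ~5: the spread is pure error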

An organization is experimenting with a new personnel test for its employees. The organization is trying to gauge if the test is working as it should and if it should be instituted on a permanent basis. Which of the following is most likely to aid the organization in making this decision? a. an achievement test b. a locator test c. a discriminant analysis d. an expectancy table

an expectancy table

While eliminating or rewriting items in a test during the test revision stage, a test developer must _____. a. balance the strengths and weaknesses across items b. retain the easy items and delete the difficult ones c. make sure that the items being eliminated are invalid d. write a small item pool in the domain in which the test should be sampled

balance the strengths and weaknesses across items

The _____ approach is used by some diagnostic systems wherein individuals must exhibit a certain number of symptoms to qualify for a specific diagnosis. a. class scoring b. cumulative scoring c. ipsative scoring d. reliability scoring

class scoring

The _____ is also referred to as the true score model of measurement. a. generalizability theory b. item response theory c. domain sampling theory d. classical test theory

classical test theory

The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability known as the _____. a. test-retest reliability b. split-half reliability c. coefficient of equivalence d. coefficient of stability

coefficient of equivalence

The simplest way to determine the degree of consistency among scorers in the scoring of a test is by calculating a coefficient of correlation known as _____. a. coefficient of equivalence b. coefficient of inter-scorer reliability c. coefficient of generalizability d. coefficient of split-half reliability

coefficient of inter-scorer reliability

Face validity is a judgment _____. a. of how a test score can be used to infer a test user's standing on some measure b. about how a test measures the intelligence of the test user c. of how consistently a test samples behavior d. concerning how relevant test items appear to be

concerning how relevant test items appear to be

If test scores are obtained at about the same time as the criterion measures are obtained, measures of the relationship between the test scores and the criterion provide evidence of _____. a. nonsimultaneous validity b. concurrent validity c. predictive validity d. asynchronous validity

concurrent validity

David uses an intelligence test that measures individuals on a certain set of characteristics. However, the high scorers and low scorers on the test do not behave as predicted by the theory on which the test was based. Here, David needs to investigate the _____ of the test. a. ipsative validity b. construct validity c. summative validity d. anchor validity

construct validity

Evidence that test scores change as a result of some experience between a pretest and a posttest can be evidence of _____. a. base validity b. concurrent validity c. face validity d. construct validity

construct validity

Evidence that test scores change as a result of some experience between a pretest and a posttest can be evidence of _____. a. construct validity b. base validity c. face validity d. concurrent validity

construct validity

If scores on a test undergoing construct validation tend to correlate highly in the predicted direction with scores on older, more established, and already validated tests designed to measure the same (or a similar) construct, it would be an example of _____. a. severity error b. convergent evidence c. discriminant evidence d. leniency error

convergent evidence

The rationale of the method of contrasted groups is that if a test is a valid measure of a particular construct, then test scores from groups of people who would be presumed to differ with respect to that construct should have _____. a. little variation between high scorers and low scorers in a group b. correspondingly different test scores c. extreme variation across high scorers in all groups d. similar test scores across all groups

correspondingly different test scores

In the context of concurrent and predictive validity, a(n) _____ is defined as the standard against which a test or a test score is evaluated. a. slope bias b. construct c. criterion d. intercept bias

criterion

In which of the following approaches should a learner demonstrate mastery of particular material before the learner moves on to advanced material that conceptually builds on the existing base of knowledge? a. prototype testing b. criterion-referenced testing c. pilot testing d. norm-referenced testing

criterion-referenced testing

A professor attempts to judge the validity of a test that tests the time management skill of a testtaker by using the score of the test to measure the individual's ability to handle time. The professor is judging the _____ of the test. a. criterion-related validity b. face validity c. content validity ratio d. central tendency error

criterion-related validity

In contrast to techniques and principles applicable to the development of norm-referenced tests, the development of criterion-referenced instruments _____. a. entails exploratory work with at least five groups of testtakers b. depends on the test scores of low scorers c. is independent of the objective of the test d. derives from a conceptualization of the knowledge or skills to be mastered

derives from a conceptualization of the knowledge or skills to be mastered

In the context of testing, a disadvantage of using the Taylor-Russell tables in a utility analysis is that they _____. a. require the relationship between predictor and criterion to be nonlinear b. do not consider the cost of testing in comparison to benefits c. do not dichotomize criterion performance d. overestimate utility unless top-down selection is used

do not consider the cost of testing in comparison to benefits

In the context of a test, a feature of the item response theory (IRT) framework is that _____. a. cut scores are set based on testtakers' performance across all the items on a test b. each item is associated with a particular level of difficulty c. the setting of cut scores is independent of expert opinion d. the difficulty level of an item is independent of its cut score

each item is associated with a particular level of difficulty

One way a test developer can improve the homogeneity of a test containing items that are scored dichotomously is by _____. a. eliminating items that do not show significant correlation coefficients with total test scores b. using multiple samples in testing to increase validity shrinkage c. assuring testtakers that improving homogeneity is the end-all of proving construct validity d. using a single sample in testing to increase validity shrinkage

eliminating items that do not show significant correlation coefficients with total test scores

Statements of concurrent validity indicate the extent to which test scores may be used to _____. a. determine the efficiency level of the test users b. estimate an individual's perception of the validity of the test c. estimate an individual's present standing on a criterion d. determine the consistency of the test

estimate an individual's present standing on a criterion

Which of the following tables can provide an indication of the likelihood that a testtaker will score within some interval of scores on a criterion measure? a. Naylor-Shine tables b. expectancy tables c. protocol tables d. Taylor-Russell tables

expectancy tables

Identify an experience that is most likely to result in maximum change in test scores between a pretest and a posttest. a. a random walk b. commute to the workplace c. watching a sitcom d. formal education

formal education

Michael encountered a difficult question in his LSAT exam. By using his knowledge of law terminology, he was able to logically arrive at the answer. In this scenario, Michael used the _____ method. a. convergent evidence b. hit rate c. predictive validity d. inference

inference

The degree of correlation among all the items on a scale is known as _____. a. inter-scorer reliability b. inter-item consistency c. coefficient of equivalence d. coefficient of stability

inter-item consistency

The different types of statistical scrutiny that test data can potentially undergo are referred to collectively as _____. a. item-pool scoring b. item branching c. item analysis d. item-characteristic bias

item analysis

This stage involves statistical procedures that assist in making judgments about test items. a. test conceptualization b. test construction c. test tryout d. item analysis e. test revision

item analysis

Variables such as the form, plan, structure, arrangement, and layout of individual test items are collectively referred to as _____. a. index b. item bank c. scale d. item format

item format

An _____ is the reservoir or well from which items will or will not be drawn for the final version of a test. a. item pool b. item index c. item branch d. item format

item pool

Which of the following can help test developers evaluate how well a test or an individual item is working to measure different levels of a construct? a. item response theory (IRT) information curves b. classical test theory (CTT) information curves c. the item-validity index d. the item-discrimination index

item response theory (IRT) information curves

In the context of methods for setting cut scores, the method that entails the collection of data on the predictor of interest from groups known to possess, and not to possess, a trait, attribute, or ability of interest is known as the _____ method. a. bookmark b. predictive yield c. known groups d. item-mapping

known groups

A cut score is set based on a test that best discriminates the test performance of two groups in the _____. a. bookmark method b. item-mapping method c. predictive yield method d. known groups method

known groups method

From the perspective of a test creator, a challenge in test development is to _____. a. maximize the proportion of the total variance that is error variance and to minimize the proportion of the total variance that is true variance b. minimize the true variance and maximize the error variance c. maximize the proportion of the total variance that is true variance and to minimize the proportion of the total variance that is error variance d. neutralize the true variance and error variance

maximize the proportion of the total variance that is true variance and to minimize the proportion of the total variance that is error variance

A university administered a geometry test to a new batch of exchange students. All of the students received failing grades because they were unfamiliar with the language, English, that was used to administer the test. The test was designed to evaluate the students' knowledge of geometry but it reflected their knowledge and proficiency in English instead. This scenario is an example of _____. a. transient error b. odd-even reliability c. inter-item consistency d. measurement error

measurement error

The _____ provides evidence for the validity of a test by demonstrating that scores on the test vary in a predictable way as a function of membership in some group. a. method of paired comparisons b. method of contrasted groups c. membership-characteristic curve d. item-characteristic curve

method of contrasted groups

The _____ provides evidence for the validity of a test by demonstrating that scores on the test vary in a predictable way as a function of membership in some group. a. method of paired comparisons b. membership-characteristic curve c. method of contrasted groups d. item-characteristic curve

method of contrasted groups

Identify an example of a selected-response item format. a. essay format b. completion item format c. short answer format d. multiple-choice format

multiple-choice format

Identify an example of a selected-response item format. a. essay format b. multiple-choice format c. completion item format d. short answer format

multiple-choice format

Criterion-related validity is difficult to establish on many classroom tests by professors because _____. a. every criterion reflects the level of the students' knowledge of a particular material b. all classroom tests are informal tests c. all classroom tests are multiple-choice tests d. no obvious criterion reflects the level of the students' knowledge of a specific material

no obvious criterion reflects the level of the students' knowledge of a specific material

Identify the formula for the standard error of the difference between two scores when the squared standard error of measurement for test 1 (σ²meas1) and the squared standard error of measurement for test 2 (σ²meas2) are known. Page 172 in Edition 9 Psychological Assessments

σdiff = √(σ²meas1 + σ²meas2)
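
For example (hypothetical values): if σmeas1 = 3 and σmeas2 = 4, then σdiff = √(3² + 4²) = √25 = 5.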

Many utility models are constructed on the assumption that the _____. a. people who do not score well on a personnel test are unsuitable for the job profile b. people selected by a personnel test will accept the position that they are offered c. top scorers of a personnel test will easily acclimate to the work environment d. top scorers of a personnel test will be a perfect fit for the job profile

people selected by a personnel test will accept the position that they are offered

The term _____ refers to the preliminary research surrounding the creation of a prototype of a test. a. scaling b. planning c. pilot work d. priority work

pilot work

Measures of the relationship between test scores and a criterion measure obtained at a future time provide an indication of the _____ of a test. a. concurrent validity b. face validity c. predictive validity d. content validity

predictive validity

Concerns about content validity of classroom tests are _____. a. addressed by test developers by conducting a test analysis b. routinely addressed, usually informally, by professors in the test development process c. avoided by test developers as the process of addressing them is complicated d. considered difficult by professors to address in the test development process

routinely addressed, usually informally, by professors in the test development process

A numerical value that reflects the relationship between the number of people to be hired and the number of people available to be hired is known as the _____. a. selection ratio b. productivity gain c. base rate d. utility gain

selection ratio

Obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once. It is a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice (because of factors such as time or expense). This is known as a. test-retest reliability b. split-half reliability c. reliability coefficient d. Utility

split-half reliability

The _____ is the tool used to estimate or infer the extent to which an observed score deviates from a true score. a. average proportional distance b. standard error of measurement c. coefficient alpha d. error variance

standard error of measurement

The standard deviation of a theoretically normal distribution of test scores obtained by one person on equivalent tests is known as the _____. a. standard error of the difference b. coefficient of stability c. standard error of measurement d. test-retest reliability

standard error of measurement
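
A minimal Python sketch of the standard formula SEM = SD × √(1 − r), where SD is the standard deviation of the test scores and r is the reliability coefficient; the values are hypothetical:

import math

sd, r = 15.0, 0.91  # hypothetical test SD and reliability coefficient
sem = sd * math.sqrt(1 - r)
print(round(sem, 1))  # 4.5: typical deviation of observed scores from true scores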

A statistical measure that can aid a test user in determining how large a difference should be before it is considered statistically significant is called the _____. a. coefficient of stability b. coefficient of generalizability c. standard error of the difference d. reliability difference

standard error of the difference

High r in depth:

strong relationship between the measure (test score) and what you are predicting: nobody is that far off, so if you predict a score on the regression line you will probably be close. Ex: ACT score and GPA. In assessment, when there is a strong relationship, nobody is that far off from the regression line.

Unlike items in a selected-response format, items in a constructed-response format require testtakers to _____. a. conduct an experiment b. perform a skilled activity unrelated to the test c. select a response from a set of alternative responses d. supply or create the correct answer

supply or create the correct answer

In the context of a test, identify the uses of item response theory (IRT) information curves. (Check all that apply.) a. tailoring an instrument to provide high information or precision b. presenting test items on the basis of responses to previous items c. recognizing nonpurposive or inconsistent responding d. weeding out uninformative questions or eliminating redundant items

tailoring an instrument to provide high information or precision; weeding out uninformative questions or eliminating redundant items

This stage involves conceiving the idea for a test? a. test conceptualization b. test construction c. test tryout d. item analysis e. test revision

test conceptualization

Item sampling and content sampling are sources of variance during _____. a. test construction b. test interpretation c. test scoring d. test administration

test construction

This stage involves writing and formatting test items and setting scoring rules? a. test conceptualization b. test construction c. test tryout d. item analysis e. test revision

test construction

In the interest of ensuring content validity of a test, _____. a. incremental validity must be identified b. irrelevant content should be used to further understanding of the construct c. test developers have a fuzzy vision of the construct being measured in the test d. test developers include key components of the construct being measured

test developers include key components of the construct being measured

In the interest of ensuring content validity of a test, _____. a. test developers have a fuzzy vision of the construct being measured in the test b. test developers include key components of the construct being measured c. incremental validity must be identified d. irrelevant content should be used to further understanding of the construct

test developers include key components of the construct being measured

This stage involves action taken to modify a test's content or format for the purpose of improving its effectiveness? a. test conceptualization b. test construction c. test tryout d. item analysis e. test revision

test revision

This stage involves administering a preliminary form of a test to a representative sample of testtakers? a. test conceptualization b. test construction c. test tryout d. item analysis e. test revision

test tryout

An estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test is known as _____. a. parallel-forms reliability b. split-half reliability c. test-retest reliability d. alternate-forms reliability

test-retest reliability

Which of the following can be used to set fixed cut scores that can be applied to personnel selection tasks as well as to questions regarding the presence or absence of a particular trait, attribute, or ability? a. the Brogden-Cronbach-Gleser formula b. the problem-solving model c. the response-to-intervention model d. the Angoff method

the Angoff method

Identify the tables that are used to obtain the difference between the means of the selected and unselected groups in order to derive an index of what a test is adding to already established procedures. a. the Taylor-Russell tables b. protocol tables c. the Naylor-Shine tables d. expectancy tables

the Naylor-Shine tables

Which of the following is a technique for setting cut scores that is most likely to take into account the number of positions to be filled, projections regarding the likelihood of offer acceptance, and the distribution of applicant scores? a. the method of predictive yield b. the item-mapping method c. the method of contrasting groups d. the known groups method

the method of predictive yield

The Taylor-Russell tables provide an estimate of the percentage of employees hired by the use of a particular test who will be successful at their jobs. Identify the variables that the tables use to provide the estimate. (Check all that apply.) a. the average score on some criterion measure b. the selection ratio used c. the cut score d. the test's validity

the selection ratio used; the test's validity

A condition that deems tests to be due for revision is that the _____. a. reliability as well as the effectiveness of individual test items has remained constant b. test developer has come up with new and improved items to be tested c. test tryout has been successful d. theory on which the test was originally based has been improved significantly

theory on which the test was originally based has been improved significantly

According to classical test theory, _____ is a value that genuinely reflects an individual's ability (or trait) level as measured by a particular test. a. true variance b. true score c. coefficient alpha d. error variance

true score

In the context of testing and assessment, the term _____ refers to the usefulness or practical value of testing to improve efficiency. a. authenticity b. rigidity c. reliability d. utility

utility

Test scores are said to have _____ if their use in a particular situation helps a person make better decisions—better, that is, in the sense of being more cost-effective. a. validity b. authenticity c. reliability d. utility

utility

A _____ can be defined as a family of techniques that entail a cost-benefit analysis designed to yield information relevant to a decision about the practical value of a tool of assessment. a. validity analysis b. psychometric soundness analysis c. reliability analysis d. utility analysis

utility analysis

When trying to determine whether the benefits of using a test outweigh the costs, a test developer must conduct a _____. a. validity analysis b. productivity gain analysis c. reliability analysis d. utility analysis

utility analysis
