Tests and Measurements (615): Test One Application


Katie and Kathy are roommates who share the same bathroom scale. Neither Katie nor Kathy is on a special diet to lose or gain weight. Each morning, they both weigh themselves. From day to day, it seems that each gains or loses two to three pounds. Some days Katie gains three pounds and Kathy loses two pounds. Other days Katie loses two pounds and Kathy gains three pounds. Every day their weights are different from their weights the previous day, and they cannot distinguish a pattern. Katie and Kathy decide to start weighing themselves on a scale at the wellness center. To their surprise, their weights neither rise nor fall from day to day when they use the scale at the wellness center. Which one of the following best explains this situation? A. Their home scale has systematic error, and the wellness center scale is more accurate. B. Their home scale has random error, and the wellness center scale is more accurate. C. The scale at the wellness center has systematic error, and their home scale is accurate. D. The scale at the wellness center has random error, and their home scale is accurate.

B. Their home scale has random error, and the wellness center scale is more accurate.

The trait-irrelevant variability that can enter into test scores as a result of fortuitous factors related to the content of the specific items included in the test is called what? • Which types of tests are prone to this error source?

Content sampling error. • Prone to this error source: tests for which consistency of results across different sets of items is desired. • Methods used to estimate it: alternate-forms reliability and split-half reliability.

What type of evidence of validity exists if you took an algebra test that required you to perform a representative sample of algebraic calculations? a. Validity based on its content b. Validity based on its relationship with a criterion c. Validity based on its relationship with a construct d. Face validity

Correct Answer: a Explanation: Because the items on the test (i.e., the test content) are representative of the algebraic calculations, we would say that the test has evidence of validity based on its content. While content-valid tests are often face valid, remember that face validity refers to whether the test "looks valid" and does not refer to what the test actually measures or its content. Thus, in this case, the best answer is evidence based on its content.

If test takers perceive a test as appropriate, they are referencing evidence of what? a. Face validity b. Reliability c. Validity based on content d. Validity based on relationship with a construct

Correct Answer: a Explanation: Face validity refers to whether test takers think that the test is measuring what it is supposed to be measuring. While face validity is often an important consideration when choosing a test, it is not a type of psychometric validity. There are many tests that may have significant evidence of validity for their intended use, but do not appear face valid to the test takers.

Which one of the following types of attribute is most difficult to describe in terms of behaviors? a. Abstract b. Concrete c. Nonspecific d. Specific

Correct Answer: a Explanation: The textbook defines two types of attributes—concrete and abstract. Concrete attributes are easier to observe. Abstract attributes are much more difficult to describe or measure because it might not be clear exactly what behaviors are most important to be measured. For example, what is leadership? What are the specific behaviors that demonstrate leadership and how will we measure them? Obviously leadership is a vague and abstract concept. Because of this, it is much harder to collect validity evidence for such an abstract attribute.

What type of evidence of validity exists if a test developer finds that scores on a new employment test, designed to predict success on the job, correlate with employees' performance appraisal ratings? a. Validity based on the test's relationship with a criterion b. Validity based on the test's relationship with a construct c. Validity based on the test's content d. Face validity

Correct Answer: a Explanation: When a test score predicts a future outcome (often called a criterion), we say the test demonstrates evidence of validity based on its relationship with a criterion. This question describes a specific type of evidence called predictive evidence of validity because the test scores are significantly related to an important future outcome of employee performance, thereby making the test a valid selection tool.

If we demonstrate that a test allows us to identify individuals who are likely to become depressed, we have demonstrated evidence of validity based on the test's a. content. b. relationship with a criterion. c. relationship with a construct. d. appearance to the test taker.

Correct Answer: b Explanation: A criterion is an outcome that a test user is interested in predicting. For example, companies are interested in predicting the future performance of applicants, thus the criterion in this example is job performance. Therefore, criterion-related validity evidence seeks to determine the relationship between the test and job performance.

Evidence of validity based on a test's content is easiest for tests such as mathematical achievement tests that measure _______ attributes and more difficult for tests such as personality tests that measure _______ attributes. a. abstract; concrete b. concrete; abstract c. nonspecific; specific d. specific; nonspecific

Correct Answer: b Explanation: Both mathematics and achievement are well understood and are easier-to-define attributes. Therefore, they are concrete in nature. In contrast, personality is much more ambiguous and vague and therefore is best described as an abstract attribute.

Demonstrating evidence of validity is often logical rather than statistical for which one of the following? a. Face validity and validity evidence based on a test's relationships with a criterion b. Face validity and validity evidence based on a test's content c. Construct validity and validity evidence based on a test's relationships with a criterion d. Validity evidence based on a test's content and based on a test's relationships with a criterion

Correct Answer: b Explanation: Face validity concerns whether test takers and others view the test as measuring what it is supposed to measure. This is a subjective process and focuses on whether the test "looks" valid to the test taker or others. Also, evidence of validity based on test content involves logically and systematically showing that the test's content is representative of the construct being measured. As a result, neither form of validity generally involves statistical analyses. However, criterion and construct evidence of validity almost always involve such analyses.

What is the first step to ensuring that a test demonstrates evidence of validity based on its content? a. Develop test specifications b. Define the testing universe c. Determine the content areas d. Determine the instructional objectives

Correct Answer: b Explanation: The first step in gathering evidence based on test content is to carefully define the body of knowledge or behaviors that a test represents, which is the testing universe. Defining the testing universe often involves reviewing other instruments that measure the same construct, interviewing experts who are familiar with the construct, and researching the construct by locating theoretical or empirical research on the construct. The purpose is to ensure that the test developer clearly understands and can clearly define the construct that he or she will be measuring. Evidence of validity based on test content requires that the test cover all major aspects of the testing universe (the construct(s) being measured) in the correct proportion.

What type of evidence of validity exists if a test developer finds that the scores on a new test of mathematical achievement correlate with the scores on another test of mathematical achievement? a. Validity based on the test's relationship with a criterion b. Validity based on the test's relationship with a construct c. Validity based on the test's content d. Face validity

Correct Answer: b Explanation: This is an example of a specific type of evidence based on the test's relationship with a construct. In this case, because the test correlates with a test measuring a similar or the same construct, we would say that this is convergent evidence of validity.

The content validity ratio for a test item can range from what to what? a. −1.00 to 0 b. 0 to 1.00 c. −1.00 to 1.00 d. 1.00 to 10.00

Correct Answer: c Explanation: Content validity ratios are sometimes used to demonstrate content-based evidence of validity and are based on a survey of subject matter experts. They can range from −1.00 to 1.00, where a value of 0.00 means that 50% of the experts believed a test item to be essential.
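As a quick illustration, here is a minimal sketch of Lawshe's CVR computation (the function name and example figures are hypothetical):

```python
def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """Lawshe's content validity ratio for one item:
    CVR = (n_e - N/2) / (N/2), where n_e is the number of subject
    matter experts rating the item "essential" and N is the total
    number of experts surveyed."""
    half = n_experts / 2
    return (n_essential - half) / half

print(content_validity_ratio(10, 10))  # 1.0: all experts say "essential"
print(content_validity_ratio(5, 10))   # 0.0: exactly half say "essential"
print(content_validity_ratio(0, 10))   # -1.0: no expert says "essential"
```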

The current Standards (AERA, APA, & NCME, 2014) include discussion of five sources of evidence of validity. Which of the following is one of those sources? a. Construct validity b. Criterion-related validity c. Test content d. Face validity

Correct Answer: c Explanation: The Standards describe five sources of evidence for validity:
• Evidence based on test content
• Evidence based on response processes
• Evidence based on internal structure
• Evidence based on relations with other variables
• Evidence based on the consequences of testing

What evidence exists for a writing test that requires a test taker to perform a representative sample of writing activities (for example, writing a poem, writing an essay, writing a term paper)? a. Validity based on the test's relationship with a construct b. Validity based on the test's relationship with a criterion c. Validity based on the test's content d. Face validity

Correct Answer: c Explanation: Evidence of validity based on test content requires that the test cover all major aspects of the testing universe (of the construct) in the correct proportion. Thus, if the test taker is performing representative samples of writing activities, then the test content does sample from the construct of writing.

What type of attribute does a test measure if the attribute can be described in terms of specific behaviors? a. Abstract b. Nonspecific c. Concrete d. Specific

Correct Answer: c Explanation: Some tests measure attributes that are relatively easy to observe and are highly specific. In these cases, we say the test is measuring a concrete attribute. These types of attributes are easier to measure and it is easier to collect evidence for their validity.

A valid test a. consistently measures whatever it measures. b. consistently measures multiple constructs. c. measures only one construct. d. allows one to make correct inferences about the meaning of the scores.

Correct Answer: d Explanation: A test in and of itself is neither valid nor invalid. Instead, validity concerns the inferences that are made from test scores. Hence, validity is not a property of the test. Rather, validity refers to the consequences and interpretation of test scores for their intended purpose. Furthermore, to establish validity we have to collect different types of evidence to show that our inferences are appropriate.

Which one of the following is NOT considered a traditional type of validity? a. Content b. Criterion related c. Construct d. Alternate forms

Correct Answer: d Explanation: Before the 1999 Standards, validity was viewed in three distinct categories: content validity, construct validity, and criterion (related) validity. In contrast, alternative forms is a method for assessing a test's reliability and is not directly related to validity.

What two approaches can we use to demonstrate evidence of validity based on a test's relationship with criteria? a. Proactive and retroactive b. Predictive and nonpredictive c. Content and construct d. Predictive and concurrent

Correct Answer: d Explanation: The textbook includes discussion of two specific types of evidence based on relationships with other variables (criterion). The first is predictive evidence of validity, which refers to when test scores are significantly correlated with an important future outcome. The second type is concurrent evidence of validity, which examines if the test scores are related to a criterion that is collected at the same time. The key difference between the two is the time when the criterion information is collected.

Error in scores that results from fluctuations in items across an entire test is called what?

Inter-item inconsistency, in contrast to content sampling error, which emanates from the particular configuration of items included in the test as a whole.

•Test-retest reliability is .89 •Interpretation? •What's next?

Interpretation: the test-retest estimate is good; 11% of the variance is due to error/time sampling. Next: the scores can be treated as stable over time.

Which type of test would yield a higher value for internal consistency: scores from a very easy achievement test or scores from a test of medium difficulty?

The test of medium difficulty will have higher reliability because there would be more variability in the scores.

Would it be appropriate to obtain a test-retest reliability coefficient over a 1 hour period for scores on an achievement test?

No. Practice and carryover effects would be likely over such a short interval, and a second administration within an hour could also tire test takers; fatigue would lower reliability.

Error that may enter into scores whenever the element of subjectivity plays a part in scoring a test is called what? • Appropriate measure used to estimate this error?

Source of error: interscorer differences. Method used to estimate it: inter-rater reliability.

Which one of the following formulas do test developers who wish to increase the reliability of a test use to estimate how many homogeneous test questions should be added to a test to raise its reliability to the desired level? Coefficient alpha KR-20 Spearman-Brown Pearson product-moment correlation

Spearman-Brown
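A minimal sketch of the Spearman-Brown prophecy logic, assuming the added questions are homogeneous with the existing ones (the function name and example figures are hypothetical):

```python
def length_factor(r_current: float, r_target: float) -> float:
    """Spearman-Brown prophecy: the factor n by which a test must be
    lengthened to reach a target reliability, derived by solving
    r_new = n*r / (1 + (n - 1)*r) for n."""
    return (r_target * (1 - r_current)) / (r_current * (1 - r_target))

# A 20-item test with reliability .70; how many items to reach .90?
n = length_factor(0.70, 0.90)   # ~3.86 times the current length
print(round(20 * n))            # ~77 items in total
```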

Which type of test would yield more reliable scores- a classroom test or a standardized achievement test?

Standardized achievement test

A promotion test for firefighters requires them to be rated by two experts on their knowledge, use, and maintenance of safety equipment. How can their scores be checked for reliability?

To check their scores for reliability, the interrater/interscorer reliability method should be used, because the test requires the firefighters to be rated by two experts, and interrater reliability assesses the consistency and agreement between the two raters. It can be measured through percent agreement or kappa. Calculating the reliability between the experts is important because human observers are inconsistent: they can misinterpret, get tired, become distracted, and so on.
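Since the answer mentions kappa, here is a minimal sketch of Cohen's kappa for two raters (the ratings and function name are hypothetical):

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is observed
    agreement and p_e is chance agreement from the raters' marginal
    category proportions."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical pass/fail ratings from the two safety-equipment experts
expert_1 = ["pass", "pass", "fail", "pass", "fail", "pass"]
expert_2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(expert_1, expert_2), 2))  # 0.67
```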

Define X, T, and E

X = observed score, T = true score, E = error score. Classical test theory assumes X = T + E.

Can a test have two different reliability coefficients? What does that mean?

Yes. Different reliability coefficients estimate different types of error, so they are not the same. What does that mean? The four methods estimate four different sources of error.

If a person's true score is 110 on a test with a standard error of measurement of 3.7 and a mean of 100, we would expect 95% of the person's test scores to fall within a) 102.75 - 117.25 b) 92.75 - 107.25 c) 90.75 - 120.25 d) 100-110

a) 102.75 - 117.25
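The arithmetic behind the answer, as a minimal sketch (z = 1.96 for a 95% interval; the function name is hypothetical):

```python
def confidence_interval(score: float, sem: float, z: float = 1.96):
    """Interval of score +/- z * SEM around a true score."""
    margin = z * sem
    return score - margin, score + margin

low, high = confidence_interval(110, 3.7)
print(round(low, 2), round(high, 2))  # 102.75 117.25
```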

When you are interested in the long-term stability of a measure, you should use the ______ method for estimating reliability. a) test-retest b) alternate forms c) split-half d) internal consistency

a) test-retest

Kate received a z score of 1 on a reading test. What do we know about Kate's performance, assuming that the reading test scores are distributed normally? a. She scored better than 84% of other students. b. She scored better than only 2/3 of the other students. c. She scored worse than only 2/3 of other students. d. She scored worse than 84% of other students.

a. She scored better than 84% of other students.

Given that a T- score = (z-score x 10) + 50, a z-score of 1.50 could be expressed as a T-score of a) 63 b) 65 c) 61 d) 51.5

b) 65

If the mean of a distribution is 7 and the standard deviation is 2, what is the z score that is equivalent to a raw score of 3? a. 2 b. -2 c. 3 d. 6

b. -2
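A minimal sketch tying together the standard-score items above (raw-to-z, z-to-T, and the IQ-to-T conversion that appears later in the set; function names are hypothetical):

```python
def z_score(raw: float, mean: float, sd: float) -> float:
    """Standard score: how many SDs a raw score lies from the mean."""
    return (raw - mean) / sd

def t_score(z: float) -> float:
    """T score: mean 50, SD 10."""
    return z * 10 + 50

print(z_score(3, 7, 2))                 # -2.0 (raw 3, mean 7, SD 2)
print(t_score(1.50))                    # 65.0 (z = 1.50)
print(t_score(z_score(115, 100, 15)))   # 60.0 (IQ 115 -> z = 1 -> T = 60)
```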

Which best conveys the meaning of an inter-scorer reliability estimate of .90? a. Ninety percent of the scores obtained are reliable. b. Ninety percent of the variance in the scores assigned by the scorers was attributed to true differences and 10% to error. c. Ten percent of the variance in the scores assigned by the scorers was attributed to true differences and 90% to error. d. Ten percent of the test's items are in need of revision according to the majority of the test's users.

b. Ninety percent of the variance in the scores assigned by the scorers was attributed to true differences and 10% to error.

John received a z score of 0.5 on an exam. Peter received a T score of 60 on that same exam. What can be said about their relative performance on the exam? a. There is not enough information to compare John's and Peter's exam scores. b. Peter received a higher raw score than John on the exam. c. John received a higher raw score than Peter on the exam. d. The two test-takers actually received the same score on the exam.

b. Peter received a higher raw score than John on the exam.

If an item analysis suggests that an item in a test is poor, and if that item is removed from the test, the reliability of the shorter test is likely to be a) lower than the original reliability coefficient b) reliability would not change, but validity would increase c) higher than the original reliability coefficient d) none of the above

c) higher than the original reliability coefficient

The standard error of measurement is a function of which two factors? a) reliability of the test and range of test scores b) variability of test scores and range of test scores c) reliability of the test and variability of test scores d) variability of test scores and sample size

c) reliability of the test and variability of test scores
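A minimal sketch of the underlying formula, SEM = SD × sqrt(1 − r_xx), with hypothetical figures:

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM as a function of score variability (SD) and reliability."""
    return sd * math.sqrt(1 - reliability)

print(round(standard_error_of_measurement(15, 0.94), 2))  # 3.67
print(round(standard_error_of_measurement(15, 0.50), 2))  # 10.61 - lower reliability, larger SEM
```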

From which type of test administered to members of the general population would you LEAST expect the resulting data to be distributed normally? a. A test to measure the strength of one's hand grip b. A test to measure general intelligence and fund of knowledge c. A test to measure knowledge of psychometric principles d. A test to measure self-esteem

c. A test to measure knowledge of psychometric principles

A coefficient alpha over .9 may indicate that: a. the items in the test are too dissimilar. b. the test is not reliable. c. the items in the test are redundant. d. the test is biased against low-ability individuals.

c. the items in the test are redundant.

Interitem inconsistency results from _____________

content heterogeneity

Error variance for measures of internal consistency comes from: a. fatigue. b. motivation. c. a test-taker practice effect. d. heterogeneity of the content.

d. heterogeneity of the content.

Which source of error variance affects parallel- or alternate-form reliability estimates but does not affect test-retest estimates? a. fatigue b. learning c. practice d. item sampling

d. item sampling

Alternate forms reliability is .55 •Interpretation? •What's next?

Interpretation: the forms are about as similar as they are dissimilar (only 55% of the variance reflects true scores; 45% is error). Next: these forms would not be used interchangeably.

Internal consistency is .33 •Interpretation? •What's next?

Interpretation: the test is not homogeneous; a coefficient of .33 is very low, and the items are not highly interrelated. Next: revise or remove the items that do not measure the same construct.

what types of tests are prone to the error source of time sampling error?

Tests of relatively stable traits or behaviors are prone to this error source; use test-retest reliability to estimate it. This hinges on two assumptions: • whatever construct is being measured is liable to fluctuate over time; • some constructs assessed through tests are either less subject to change or change at a much slower pace than others.

When test reliability is high, the standard error of measurement is _______. As test reliability decreases, the standard error of measurement _________. •high; decreases •low; decreases •high; increases •low; increases

•low; increases

Which one of the following is important for both interpreting individual test scores and calculating confidence intervals? •Standard error of measurement •Pearson product-moment correlation •Test Variance •Spearman-Brown formula

- Standard error of measurement

Which one of the following is the appropriate method for estimating reliability for tests with homogeneous questions that have more than two possible responses? -Coefficient alpha -Pearson product-moment correlation -Spearman-Brown formula -KR-20

-Coefficient alpha

Content heterogeneity: what is it, and how is it estimated?

The inclusion of items or sets of items that tap knowledge or psychological functions that differ from those tapped by other items on the test. Estimated with internal consistency measures (coefficient alpha and KR-20).

Jon developed a math test for fourth graders, but he was not able to administer the test twice. What method can Jon use to estimate the reliability/precision of the math test? -Criterion related -Construct -Internal consistency -Test-retest

-Internal consistency

Marsha, a student teacher, wanted to check the reliability of a math test that she developed for her fourth graders. She gave the test to students on Monday morning and then again on Tuesday morning. On the first administration of her test, there was a wide variety of scores, but on the second administration, nearly all of the children made A's on the test. Marsha wondered, "Why did all the students make A's on Tuesday, but not on Monday?" Which one of the following would most likely account for this outcome? -Order effects -Practice effects -Measurement error -Scorer error

-Practice effects

Researchers conducted two studies on the reliability of the Wisconsin Card Sorting Test (WCST) using adult psychiatric inpatients. In these studies, more than one person scored the WCST independently. What kind of reliability/precision were the researchers interested in establishing? -Test-retest reliability -Scorer reliability -Split-half reliability -Internal consistency

-Scorer reliability

When using the split-half method, an adjustment must be made to compensate for splitting the test into halves. Which one of the following would we use to make this adjustment? -Coefficient alpha -Pearson product-moment correlation -Spearman-Brown formula -KR-20

-Spearman-Brown formula
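A minimal sketch of the adjustment (the Spearman-Brown formula with a lengthening factor of 2; the half-test correlation is hypothetical):

```python
def spearman_brown_full(r_half: float) -> float:
    """Step a half-test correlation up to a full-test estimate:
    r_full = 2 * r_half / (1 + r_half)."""
    return 2 * r_half / (1 + r_half)

# Halves correlating .60 imply full-test reliability of about .75
print(round(spearman_brown_full(0.60), 2))  # 0.75
```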

Which one of the following methods do we use to examine the performance of a test over time and provide an estimate of a test's stability? -Test-retest reliability -Split-half reliability -Score reliability -Alternative forms reliability

-Test-retest reliability

While ________ describes the degree to which questions on a test or subscale are interrelated, ________ refers to whether the questions measure the same trait or dimension. -homogeneity; coefficient alpha -coefficient alpha; homogeneity -heterogeneity; coefficient alpha -homogeneity; test-retest reliability

-coefficient alpha; homogeneity

As a rule, adding more questions that measure the same trait or attribute can _______ a test's reliability. -increase -decrease -overestimate -underestimate

-increase

When we talk about how each inch on a yardstick is the same length, we are talking about the yardstick's -reliability/precision. -internal consistency. -order effects -score reliability.

-reliability/precision.

Researchers administered the Personality Assessment Inventory (PAI) to two samples of individuals. First, they administered the PAI twice to 75 adults, with the second administration following the first by an average of 24 days. They also administered the PAI to 80 college students who took the test twice, with an interval of 28 days. In each case, the researchers were conducting studies to measure the PAI's -internal consistency (1 sample) -score reliability -split-half reliability -test-retest reliability

-test-retest reliability

If a test is perfectly reliable, - What should the correlation between T and X equal? - Which should be larger? The variance of X or the variance of T?

The correlation between T and X should equal 1. The variances of X and T should be equal if the test is perfectly reliable (no measurement error); otherwise, the variance of X is always larger, since Var(X) = Var(T) + Var(E).

A researcher wants to assess attitudes about quality of work life. She wants to be sure her instrument is reliable. Her instrument contains 20 statements that respondents rate from 1 to 5. She has designed her instrument to be homogeneous. What method(s) should she use?

1. She can use Cronbach's alpha to measure internal consistency reliability. She will administer the survey to a sample of participants, collect their responses, and then use Cronbach's alpha to assess the internal consistency of the instrument. Cronbach's alpha measures the extent to which all of the items in the instrument correlate with one another, which indicates how well they measure the same underlying construct (attitudes about quality of work life). An alpha value above .70 suggests the items reliably measure the intended construct. 2. She can also use test-retest reliability. She should administer the instrument to a group of participants and then, after a set time interval, administer the same survey to the same group. She will then calculate the correlation between the responses from the two administrations. This assesses the consistency of responses over time: if the instrument is reliable, participants will provide similar responses when presented with the same items on different occasions.
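A minimal sketch of the coefficient alpha computation she would run (the ratings matrix is hypothetical and far smaller than a real sample):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for a respondents-by-items matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical 1-5 ratings: 4 respondents x 4 of the 20 statements
ratings = np.array([[4, 5, 4, 4],
                    [2, 3, 2, 3],
                    [5, 5, 4, 5],
                    [3, 3, 3, 2]])
print(round(cronbach_alpha(ratings), 2))  # 0.95
```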

Instead of correlating observed and true scores, reliability can be assessed by correlating 2 observed scores. What is the logic behind this?

The two observed scores are each made up of true score plus error. Error is random, so it cannot correlate with anything. If the observed scores correlate, it must be because the true scores correlate.

What is the equivalent T score for an IQ score of 115?

60

Which of the following is a correct interpretation of a reliability coefficient of .80? (Choose all that are true) - 80% of the variance in the observed scores is due to true score differences - A person's true score is equal to their observed score 80% of the time - 20% of the differences in observed scores are due to error

A and C. Not B, because a person's true score may equal their observed score less than 80% of the time due to measurement error. A reliability coefficient does not indicate how often a person's observed score accurately represents their true score; it reflects the proportion of variance in observed scores that can be attributed to true-score differences.

A test developer is constructing a measure of critical thinking. The instrument consists of a number of anagrams and riddles—problems whose answers are not readily apparent until solved. The test score depends on the length of time required to solve each problem. How should the test developer estimate reliability?

Alternate forms. To estimate reliability, the test developer should use alternate/parallel-forms reliability. Two different forms of the test should be developed that are identical in purpose but differ in content (both should contain a representative mix of anagrams and riddles). These forms are administered to the same group of test takers, and the scores on the two forms are correlated. This eliminates problems found in test-retest reliability (such as remembering the answers to previously given riddles). The correlation can be computed with the Pearson product-moment correlation.

A testing company has developed two versions of a math placement test. They want to investigate the extent to which students' scores are similar across the two tests. What type of reliability coefficient is most appropriate in this situation? How should the data be collected?

Alternate/parallel-forms reliability, which estimates content sampling error. Collect the data by giving the two forms to one group concurrently (at the same time) and use Pearson's r to analyze the correlation between the two sets of scores. KNOW THIS FOR TEST QUESTION
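A minimal sketch of the analysis step (the placement scores are hypothetical):

```python
import numpy as np

# One group of students takes both forms concurrently
form_a = np.array([78, 85, 62, 90, 71, 88, 66, 94])
form_b = np.array([75, 88, 60, 87, 74, 85, 70, 91])

# Pearson product-moment correlation between the two forms
r = np.corrcoef(form_a, form_b)[0, 1]
print(round(r, 2))  # a high r suggests the forms can be used interchangeably
```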

An instructor has designed a comprehensive math exam for students entering community college. The exam is made up of multiple-choice questions that measure each of the following dimensions: reading formulas, carrying out math calculations, and solving word problems. He can only give his exam once, but he needs to know how reliable the test scores are. What should he do?

Answer: Internal consistency, through the split-half method. The instructor should use split-half reliability to estimate the reliability of the math exam, because he can only give his exam once and the questions are all multiple choice (one correct answer each). To do this, the instructor will divide the exam into two equal halves, ensuring each half represents all dimensions (reading formulas, carrying out math calculations, and solving word problems); administer the exam to the students; calculate each student's total score for each half; and use the Spearman-Brown formula to estimate the reliability of the full test from the correlation between the two halves. He could also use the KR-20 formula, since the questions are scored right or wrong (computing it three times, once for each dimension).
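A minimal sketch of the KR-20 option mentioned in the answer (the 0/1 response matrix is hypothetical):

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    """Kuder-Richardson 20 for dichotomously scored (0/1) items:
    KR-20 = k/(k-1) * (1 - sum(p*q) / variance of total scores),
    where p is the proportion passing each item and q = 1 - p."""
    k = items.shape[1]
    p = items.mean(axis=0)
    pq = (p * (1 - p)).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - pq / total_var)

# Hypothetical right/wrong responses: 5 students x 6 items
scores = np.array([[1, 1, 1, 0, 1, 1],
                   [1, 0, 1, 0, 0, 1],
                   [0, 0, 1, 0, 0, 0],
                   [1, 1, 1, 1, 1, 1],
                   [1, 0, 0, 0, 1, 0]])
print(round(kr20(scores), 2))  # 0.87
```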

