Quiz #4


Summary of Rating Errors - Text

*Minimize rating error by defining the behavior in as specific, concrete, and objective terms as possible!

Estimating Reliability (3)

- reliability cannot be calculated directly; it is estimated through: 1. Stability 2. Equivalence 3. Internal Consistency

Standard Error of Measurement - Text

*Although a test never yields a person's true score, test scores should be considered as being in the range containing the true score.

Standard Error of Measurement

- *review notes and slide - symbol: "Sem" - estimates the amount of variability in a test score due to measurement error - the higher the reliability (rxx), the lower the standard error; the lower the reliability, the higher the standard error

Alpha Coefficient Formula

- *review slide and notes for Alpha Coefficient example - rxx = [K / (K − 1)] × (1 − Σs²trials / s²total) - "K" = number of trials
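As an illustration of how this formula plays out, here is a minimal Python sketch using a small made-up subjects-by-trials matrix (the scores are hypothetical, not the example from the slides; sample variances with an n − 1 denominator are assumed, and the ratio is unchanged as long as the same convention is used for both trials and total):

```python
import numpy as np

# Hypothetical data: rows = subjects, columns = trials (made-up scores).
scores = np.array([
    [15, 16, 14],
    [12, 11, 13],
    [18, 19, 17],
    [10, 12, 11],
    [14, 15, 16],
], dtype=float)

k = scores.shape[1]                          # K = number of trials
trial_vars = scores.var(axis=0, ddof=1)      # s² of each trial
total_var = scores.sum(axis=1).var(ddof=1)   # s² of the summed (total) scores

alpha = (k / (k - 1)) * (1 - trial_vars.sum() / total_var)
print(f"alpha = {alpha:.3f}")
```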

Split Half Reliability Estimate

- *review slide and notes for Split Half Reliability example - roe = ("o" = odd; "e" = even) - lower correlation between odd and even half, lower reliability estimates - higher correlation between odd and even half, higher reliability estimate

Split Half vs. Alpha Coefficient

- *split half value will be higher than alpha coefficient value - however, alpha coefficient is used more - ex: .92 > .82 (refer to example in notes) - generally, split half > alpha coefficient

Construct Validity - Known Groups - Notes

- Administer basketball skills test to: → Beginning basketball class → Intermediate basketball class → UNT Intramural champs → UNT Varsity team - ANOVA will show significant differences between the groups in a logical fashion - *review previous slide for example diagram → "if this is true, then this should happen"

Criterion Related Validity - Establishing the Relationship with a Criterion - Notes

- Criterion is a universally/generally accepted measure of an attribute → "the gold standard of measurements" → Ex: VO2max is the criterion measure of cardiovascular endurance → Ex: Death certificate is the criterion measure of cause of mortality - *review diagram on next slide → the further out the criterion, the harder to determine predictive validity

Construct Validity - Indirect Measurement - Notes

- Hypothesized structure of abilities or traits to measure the construct - Ex: used in psychological and intelligence test → Attitude Scales → Personality Inventories - Series of studies and a variety of statistical analyses → "if this is true, then this should happen"

Validity - Notes

-*review diagram on slide - summary of establishing validity → objectivity = interrater reliability → internal consistency = ANOVA, alpha coefficient, split half method, KR20/KR21 *"If it is reliable enough and relevant enough, it has sufficient validity."

Stability - Definition - Text

- a coefficient of reliability measured by the test retest method on different days - used frequently with fitness and motor performance measures but less often with pencil and paper tests - one of the most severe tests of consistency because the errors associated with measurement are likely to be more pronounced when the two test administrations are separated by a day or more

Interobserver Agreement (IOA) - Definition - Text

- a common way of estimating reliability among coders by using a formula that divides the number of agreements in behavior coding by the sum of the agreements and disagreements
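A minimal sketch of this calculation (the agreement/disagreement counts are made up for illustration):

```python
# Hypothetical coding comparison between two observers across 20 intervals.
agreements = 17
disagreements = 3

ioa = agreements / (agreements + disagreements)
print(f"IOA = {ioa:.2f}")   # 0.85
```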

Rating Scales - Definition - Text

- a measure of behavior that involves a subjective evaluation based on a checklist of criteria - used in research to evaluate performance

Test Retest Method - Definition - Text

- a method of determining stability in which a test is given one day and then administered exactly as before a day or so later

Same Day Test Retest Method - Definition - Text

- a method of establishing reliability in which a test is given twice to the same participants on the same day - used almost exclusively with physical performance tests because practice effects and recall tend to produce spuriously high correlations when this technique is used with written tests - intraclass correlation is used to analyze trial to trial (internal) consistency

Alternate Forms Method - Definition - Text

- a method of establishing reliability involving the construction of two tests that both supposedly sample the same material - also called: "parallel form method" or "equivalence method" - the two tests are given to the same people - some time elapses between the two administrations - the scores on the two tests are then correlated to obtain a reliability coefficient - this method is widely used with standardized tests, such as those of achievement and scholastic aptitude - the method is rarely used with physical performance tests, probably because constructing two sets of good physical test items is more difficult than writing two sets of questions - the preferred method for determining reliability - the degree of relationship between two samples should yield the best estimate of reliability

Split Half Technique - Definition - Text

- a method of testing reliability in which the test is divided in two, usually making the odd numbered items one half and the even numbered items the other half - the two halves are then correlated - widely used for written tests and occasionally in performance tests that require numerous trials - downside is people may tire near the end of the test or easier questions may be placed in the first half - correct odd numbered questions are compared with correct even numbered questions

Known Group Difference Method - Definition - Text

- a method used for establishing construct validity in which the test scores of groups that should differ on a trait or ability are compared - ex: testing anaerobic power by comparing test scores of sprinters/jumpers and distance runners; sprinting and jumping require greater anaerobic power than distance running does, therefore, the tester could determine whether the test differentiates between two kinds of track performers; if the sprinters/jumpers score significantly better than the distance runners do, this finding would provide some evidence that the test measures anaerobic power

Flanagan Method - Definition - Text

- a process for estimating reliability in which the test is split into two halves, and the variances of the halves of the test are analyzed in relation to the total variance of the test

Semantic Differential Scale - Definition - Text

- a scale used to measure affective behavior in which the respondent is asked to make judgements about certain concepts by choosing one of seven intervals between bipolar adjectives - the scale is based on the importance of language in reflecting a person's feelings

Rating of Perceived Exertion (RPE) - Definition - Text

- a self rating scale developed by Borg (1962) to measure a person's perceived effort during exercise - combines and integrates the many physiological indicators of exertion into a whole, or gestalt, of subjective feeling of physical effort - numbers range from 6 to 20, reflecting a range of exertion from "very, very light" to "very, very hard"

Cross Validation - Definition - Text

- a technique to assess the accuracy of a prediction formula in which the formula is applied to a sample not used in developing the formula - helps estimate shrinkage
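A minimal sketch of the idea with simulated data (the sample sizes, number of predictors, and effect size are all assumptions chosen to make the shrinkage visible): a multiple-regression prediction formula is fit on a small development sample and then applied to a new sample, and the validity coefficient drops.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pred = 8   # number of predictor variables (made up)

def make_sample(n):
    # Hypothetical data: only the first predictor is truly related to the criterion.
    preds = rng.normal(size=(n, n_pred))
    y = 0.4 * preds[:, 0] + rng.normal(size=n)
    X = np.column_stack([np.ones(n), preds])   # add an intercept column
    return X, y

X_dev, y_dev = make_sample(20)    # small development sample
X_new, y_new = make_sample(500)   # new sample used for cross-validation

# Fit the multiple-regression prediction formula on the development sample.
b, *_ = np.linalg.lstsq(X_dev, y_dev, rcond=None)

r_dev = np.corrcoef(X_dev @ b, y_dev)[0, 1]   # multiple R in the development sample
r_new = np.corrcoef(X_new @ b, y_new)[0, 1]   # validity coefficient on the new sample

print(f"R (development) = {r_dev:.2f}")
print(f"r (cross-validation) = {r_new:.2f}   shrinkage = {r_dev - r_new:.2f}")
```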

Observer Expectation Error - Definition - Text

- an error that results when a rater sees evidence of certain expected behaviors and interprets observations in the expected direction - often stems from the halo effect and observer bias - observer expectations can contaminate the ratings because a person who expects certain behaviors is inclined to see evidence of those behaviors and interpret observations in the expected direction - in a research setting, observer expectation errors are likely when the observer knows what the experimental hypotheses are and is thus inclined to watch for these outcomes - the double blind experimental technique is ideal for controlling expectation errors (the observers do not know which participants received which treatments, they also should not know which performances occurred during the pretest and which occurred during the posttest)

Coefficient of Alpha Technique - Definition - Text

- a technique used for establishing the reliability of multiple trial tests - also called: "Cronbach alpha coefficient" - more versatile than other methods - can be used with items that have various point values, such as essay tests and attitude scales that have as possible answers: "strongly agree," "agree," and so on - involves calculating variances of the parts of a test - the parts can be items, test halves, trials or a series of tests such as quizzes - when the items are dichotomous (either right or wrong), the coefficient alpha results in the same reliability estimate as KR20 - when the parts are halves of the test, the results are the same as the Flanagan split halves method - when the parts are trials or tests, the results are the same as intraclass correlation - coefficient alpha is probably the most commonly used method of estimating reliability for standardized tests

Halo Effect - Definition - Text

- a threat to internal validity wherein raters allow previous impressions or knowledge about certain people to influence all ratings of those people's behaviors - ex: knowing that a person excels in one or more activities, a judge may rate that person highly on all other activities - negative impressions of a person tend to lead to lower ratings in subsequent performances

Expectancy Table - Definition - Text

- a two way grid that predicts whether individuals with a particular assessment score will attain some criterion score - used to predict the probability of some performance

Concurrent Validity - Definition - Text

- a type of criterion validity in which a measuring instrument is correlated with some criterion that is administered concurrently or at about the same time - usually employed when the researcher wishes to substitute a shorter, more easily administered test for a criterion that is difficult to measure - choosing the criterion is critical in the concurrent validity method - if the criterion is inadequate, then the concurrent validity coefficient is of little consequence

T-Scale - Definition - Text

- a type of standard score that sets the mean at 50 and the standard deviation at 10 to remove the decimals found in z scores and to make all scores positive

Likert-type Scale - Definition - Text

- a type of closed question that requires the participant to respond by choosing one of several scaled responses - the intervals between items are assumed to be equal - usually 5 to 7 point scale - used to assess the degree of agreement or disagreement with statements and is widely used in attitude inventories

Predictive Validity - Definition - Text

- a type of criterion validity in which scores of predictor variables can accurately predict criterion scores - in trying to predict a certain behavior, a researcher should try to ascertain whether there is a known base rate for that behavior - if the base rate is very low or very high, a predictive measure may have little practical value because the increase in predictability is negligible - multiple regression is often used because several predictors are likely to have a greater validity coefficient than the correlation between any one test and the criterion - one limitation is that the validity tends to decrease when the prediction formula is used with a new sample (shrinkage)

Content Validity - Determining Relevance Logically - Notes

- also known as: "face validity" or "logical validity" - Logical analysis that a test is relevant to the objectives of the test - Requires: → Clear understanding of objectives - Beauty is in the eye of the beholder: → Panel of expert judges → Not acceptable to have only one person - First step is establishing validity for research purposes

Index of Discrimination - Definition - Text

- also known as: "item discrimination" - the degree to which a test item discriminates between people who did well on the entire test and those who did poorly - simplest way is to divide the completed tests into a high scoring group and a low scoring group and then use the formula

Intraclass Correlation - Definition - Text

- an ANOVA technique used for estimating the reliability of a measure - provides estimates of systematic and error variance - systematic differences among trials can be examined - through ANOVA, a tester can examine test performance from trial to trial and then select the most reliable testing schedule - components (3): 1. subjects 2. trials 3. residual sums of squares
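A minimal sketch of how those ANOVA components can be obtained from a subjects-by-trials score matrix; the data are made up, and the specific ICC form shown (consistency of the mean of the k trials, which matches coefficient alpha) is an assumption, so check the course's preferred version:

```python
import numpy as np

# Hypothetical data: rows = subjects, columns = trials.
scores = np.array([
    [15, 16, 14],
    [12, 11, 13],
    [18, 19, 17],
    [10, 12, 11],
    [14, 15, 16],
], dtype=float)

n, k = scores.shape
grand = scores.mean()

# Sums of squares for the three components named above: subjects, trials, residual.
ss_subjects = k * ((scores.mean(axis=1) - grand) ** 2).sum()
ss_trials   = n * ((scores.mean(axis=0) - grand) ** 2).sum()
ss_total    = ((scores - grand) ** 2).sum()
ss_residual = ss_total - ss_subjects - ss_trials

ms_subjects = ss_subjects / (n - 1)
ms_residual = ss_residual / ((n - 1) * (k - 1))

# One common form: reliability of the mean of the k trials.
icc = (ms_subjects - ms_residual) / ms_subjects
print(f"ICC = {icc:.3f}")
```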

Spearman Brown Prophecy Formula - Definition - Text

- an equation developed to estimate the reliability for the entire test when the split half technique is used to test reliability

Proximity Errors - Definition - Text

- an error that results when a rater considers behaviors to be more nearly the same when they are listed close together on a scale than when they are separated by some distance - often the result of overly detailed rating scales, insufficient familiarity with the rating criteria, or both - ex: if the qualities "active" and "friendly" are listed side by side on the scale, proximity errors result if raters evaluate performers as more similar on those characteristics than if the two qualities were listed several lines apart on the rating scale - if the rater does not have adequate knowledge about all facets of the behavior, they may not be able to distinguish between behaviors that logically should be placed close together on the scale - thus, the different phases of behavior are rated the same

Kuder Richardson Method of Rational Equivalence - Definition - Text

- formulas developed for estimating the reliability of a test from a single test administration - the KR20 involves the proportions of students who answered each item correctly and incorrectly in relation to the total score variance - KR21 is a simplified, less accurate version of the KR20

Observer Bias Error - Definition - Text

- an error that results when raters are influenced by their own characteristics and prejudices - ex: a person who has a low regard for movement education may tend to rate students from such a program too low - ex: racial, sexual and philosophical biases are potential sources of rating errors - observer bias errors are directional because the ratings are consistently too high or too low

Central Tendency Errors - Definition - Text

- an error that results when the rater gives an inordinate number of ratings in the middle of the scale, avoiding the extremes of the scale - several reasons are suggested as the cause - sometimes these errors are due to the observer's wanting to leave room for better future performances

Internal Consistency - Definition - Text

- an estimate of the reliability that represents the consistency of scores within a test - some common methods are: (1) same day test retest, (2) the split half method, (3) the Kuder Richardson method of rational equivalence and (4) the coefficient alpha technique

Stability

- demonstrates consistency across trials on different days - refer to example in notes - *recommended for physical performance tests

Equivalence

- demonstrates consistency between equivalent or parallel forms - refer to example in notes - *recommended for psychological and knowledge tests

Internal Consistency

- demonstrates consistency between parts of a test - used with multiple trial or item tests where the parts are summed to provide the total or criterion scores → ex: movement time tests; psychological tests - statistics → intraclass correlation (ANOVA) = analysis of variance → alpha coefficient → split half -*Review example of Split Half and Alpha Coefficient Methods in notes and on slide

Observed Score - Definition - Text

- in classical test theory, an obtained score that comprises a person's true score and error score - it is not known whether this is a true assessment of a person's ability or performance - measurement error can occur because of the test directions, the instrumentation, the scoring, or the person's emotional or physical state - in terms of score variance, the observed score variance consists of true score variance plus error variance - the coefficient of reliability is the ratio of true score variance to observed score variance

Error Score - Definition - Text

- in classical test theory, the part of the observed score that is attributed to measurement error - the reliability coefficient reflects the degree to which the measure is free of error variance

True Score - Definition - Text

- in classical test theory, the part of the observed score that represents the person's real score and does not contain measurement error - the goal of the tester is to remove error to yield the true score - because true score variance is never known, it is estimated by subtracting error variance from observed score variance

Reliability - Definition - Text

- pertains to the consistency, or repeatability, of a measure - scores from a test can be reliable yet not valid, but a test can never be valid when it is not reliable - ex: weighing yourself repeatedly on a broken scale would give reliable results but not valid ones - test reliability is discussed in terms of: (1) observed score, (2) true score and (3) error score *A valid test is necessarily reliable: it yields the same results on successive trials.

Reliability Goals

- physical performance tests: ≥ .80 - written tests: ≥ .70

Reliability

- symbol: "rxx" - reliability theory: → observed score = true score + error score → true score (accurate/reliable); error score (measurement error) - review notes for formula example

Item Difficulty - Definition - Text

- the analysis of the difficulty of each test item in a knowledge test - determined by dividing the number of people who correctly answered the item by the number of people who responded to the item

Z-Score - Definition - Text

- the basic standard score that converts raw scores to units of standard deviation in which the mean is zero and standard deviation is 1.0

Expressing Reliability through Correlation - Text

- the degree of reliability is expressed by a correlation coefficient - ranges from 0.00 to 1.00 - the closer the coefficient is to 1.00, the less error variance it reflects and the more the true score is assessed

Logical Validity - Definition - Text

- the degree to which a measure obviously involves the performance being measured - also known as: "face validity" and "content validity" - the test is valid by definition - ex: a speed of movement test where a person is timed while running a specified distance has logical validity - occasionally used in research studies

Content Validity - Definition - Text

- the degree to which a test (usually in educational settings) adequately samples what was covered in the course - no statistical evidence supplied - test maker should prepare a table of specifications ("test blueprint") before making up the test - a second form of content validity occurs with attitude instruments - often a researcher wants evidence of independent verifications that the items represent the categories for which they were written - experts are asked to assign each statement to one of the instrument categories - these categorizations are tallied across all experts and the percentage of experts who agreed with the original categorization is reported - typically, an agreement of 80% to 85% would indicate the statement represents content validity

Construct Validity - Definition - Text

- the degree to which a test measures a hypothetical construct - usually established by relating the test results to some behavior - ex: anxiety, intelligence, sporting behavior, creativity and attitude (not directly observable traits, therefore measurement is a problem) - for an indication of construct validity, a test maker could compare the number of times a person scoring high on a test of sporting behavior complimented the opponent with the number of times a person scoring lower on the test did so - experimental approach occasionally used and correlation is used in establishing construct validity

Validity - Definition - Text

- the degree to which a test or instrument measures what it purports (is supposed) to measure - can be categorized as: logical, content, criterion or construct validity - refers to the soundness of the interpretation of scores from a test, the most important consideration in measurement

Objectivity - Definition - Text

- the degree to which different testers can achieve the same scores on the same subjects - also known as: "interrater reliability" - the degree of objectivity can be established by having more than one tester gather data

Interrater Reliability - Definition - Text

- the degree to which different testers can obtain the same scores on the same participants - also called: "objectivity"

Criterion Related Validity - Definition - Text

- the degree to which scores on a test are related to some recognized standard or criterion - measurements are validated against some criterion - two main types: (1) concurrent validity and (2) predictive validity

Shrinkage - Definition - Text

- the loss of predictive power when statistics are calculated on a different sample - is more likely when a small sample is used in the original study, particularly when the number of predictors is large - if the number of predictor variables is the same as "n," you can achieve perfect prediction - the problem is that the correlations are unique to the sample and when the prediction formula is applied to another sample, the relationship does not hold (the validity coefficient decreases substantially)

Interclass Correlation - Definition - Text

- the most commonly used method of computing correlation between two variables - also called: "Pearson r" or "Pearson product moment coefficient of correlation" - it is a bivariate statistic, meaning that it is used to correlate two variables, such as when determining validity by correlating judges' ratings with scores on a skill test - not appropriate for establishing reliability because two values for the same variable are being correlated - when a test is given twice, the scores on the first test are correlated with the scores on the second test to determine their degree of consistency, however, the two test scores are for the same variable, so interclass correlation should not be used - three main weaknesses: 1. the Pearson r is a bivariate statistic, whereas reliability involves univariate measures 2. the computations of Pearson r are limited to only two scores, X and Y; it cannot correlate multiple trials, for example 3. it does not provide a thorough examination of different sources of variability on multiple trials
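For illustration, a tiny Python sketch of the appropriate use, correlating two different variables (hypothetical judges' ratings and skill-test scores):

```python
import numpy as np

# Hypothetical paired data: judges' ratings and skill-test scores for 6 people.
ratings = np.array([7, 5, 9, 4, 6, 8])
skill   = np.array([21, 17, 25, 15, 19, 24])

r = np.corrcoef(ratings, skill)[0, 1]   # Pearson r between the two variables
print(f"Pearson r = {r:.2f}")
```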

The Quality of Measurements - Part 1 - Notes

- the need for reliable and valid measurements in the research process

Item Analysis - Definition - Text

- the process in analyzing knowledge tests in which the suitability of test items and their ability to discriminate are evaluated - two important facets of item analysis are: (1) determining the difficulty of the test items and (2) determining their power to discriminate among levels of achievement
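A minimal sketch of both facets for a single item, assuming equal high- and low-scoring groups of 10 papers each (the counts and the grouping rule are made-up assumptions):

```python
# Hypothetical results for one item after splitting the papers into a
# high-scoring and a low-scoring group of 10 people each.
n_high_correct = 8    # high scorers who answered the item correctly
n_low_correct  = 3    # low scorers who answered the item correctly
n_per_group    = 10

# Item difficulty: proportion of these respondents answering correctly.
difficulty = (n_high_correct + n_low_correct) / (2 * n_per_group)

# Index of discrimination: (nH - nL) / n, per the formula later in this set.
discrimination = (n_high_correct - n_low_correct) / n_per_group

print(f"difficulty = {difficulty:.2f}, discrimination = {discrimination:.2f}")
```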

Leniency - Definition - Text

- the tendency for observers to be overly generous in rating - less likely to occur in research than in evaluating peers - thorough training of raters is the best means of reducing leniency

Validity - Notes

- truthfulness = does the variable measure what it is supposed to measure? - is it relevant enough?

Three Perspectives (3)

1. All research studies need reliable and valid measurements of the independent and dependent variables in the study. 2. Researchers may need to demonstrate the reliability and validity of their measurements in their specific study = pilot work. 3. The goal of the research study is to develop reliable and valid variables OR demonstrate the variables can be measured in different situations or with different populations in a reliable and valid manner.

Criterion Related Validity - Applying the Correlation - Notes

1. Concurrent - Ex: Correlation between BMI (measure to validate) and % body fat - DEXA (criterion) 2. Predictive - Ex: Correlation between BMI at age 40 and total years of life - Ex: Multiple correlation of BMI, SBP, DBP, and Cholesterol at age 40 with longevity → multiple measures

Types of Validity Estimation - Determining Relevance (3) - Notes

1. Content - logic 2. Criterion Related - is this test related to some criteria? - Concurrent = same time frame - Predictive = future; ex: SAT test, GRE test 3. Construct - has a variety - used a lot in psychology

Construct Validity - Types of Evidence (3) - Notes

1. Convergent evidence = "this measure is related to this other measure" - Similar to concurrent - Related to similar variables → SCAT test moderately related to trait anxiety (example of specific test to measure anxiety in sport) 2. Discriminant evidence = "this measure is not related to this other measure" - Unrelated to different variables → SCAT should be unrelated to motivation (example) → would support null hypothesis = "X NOT related to Y" 3. Known groups → Discriminate between groups of subjects known to be different on the trait

Construct Validity - Types of Statistics and Analyses (5) - Notes

1. Correlation and multiple correlation for relationships 2. ANOVA or t-tests for group differences 3. Discriminant analysis for group differences 4. Factor Analysis - determine underlying structure - breaking down many techniques into underlying themes 5. Path Analysis - explanatory model - combination of factor analysis and correlations - "A→B→C→D" - Ex: structural equation modeling - *review next slide example → variables to measure: shooting, dribbling and passing → shooting: free throws test, layups test and perimeter test → dribbling: times course test 1st hand, timed course test 2nd hand → passing: wall accuracy test, speed test

Types of Validity (4) - Text

1. Logical 2. Content 3. Criterion 4. Construct

Steps to Ensure Reliability (6)

1. Maximize individual differences - increases the s² (variance) 2. Accuracy - numerical values - ex: 40 yard dash time - tenth of a second (better) vs. one second (worse) 3. Difficulty - too easy or too hard - ex: too easy = everyone does well with high scores and little difference - ex: too hard = everyone does poorly with low scores and little difference 4. Reduce measurement error 5. Preparation of subjects and test administrators - instructions - demonstrations - attention to details - practice trials 6. Pilot study

Quality of Measurements (4)

1. Objectivity - degree of agreement between two or more scorers or judges on the value of a measurement (interrater reliability) 2. Reliability - degree of accuracy, precision, dependability, consistency and stability of measurement - assumes sufficient objectivity 3. Relevance - degree of relationship to stated objectives of the measurements - really does measure the objective 4. Validity - degree of measurements for intended purpose - measuring what it is supposed to measure *There is no measurement that is PERFECTLY objective, reliable, relevant or valid. *Review diagram in notes.

Types of Coefficients of Reliability (3)

1. Stability 2. Alternate Forms/Equivalence 3. Internal Consistency

Sources of Measurement Error (3)

1. Subject - inconsistent - unmotivated 2. Test Administrator - inconsistent - poor instructions or demonstrations 3. Test and Protocol - poor questions - uncalibrated equipment

Sources of Measurement Error (4) - Text

1. the participant - includes mood, motivation, fatigue, health, fluctuations in memory and performance, previous practice, specific knowledge, and familiarity with the test items 2. the testing - how clear and complete the directions are, how rigidly the instructions are followed, whether supplementary directions or motivation is applied 3. the scoring - the competence, experience, and dedication of the scorers and to the nature of the scoring itself - the extent to which the scorer is familiar with the behavior being tested and the test items can greatly affect scoring accuracy - carelessness and inattention to detail 4. the instrumentation - inaccuracy and the lack of calibration of mechanical and electronic equipment - the inadequacy of a test to discriminate between abilities and to the difficulty of scoring some tests

Chapter 11 - Part 1

Part 1

Chapter 11 - Part 2

Part 2

The Quality of Measurements - Part 2 - Notes

Part 2

Index of Discrimination/ Item Discrimination Formula - Text

Index of Discrimination = (nH - nL) / n

KR20 Formula

KR20: rxx = [K / (K − 1)] × (1 − Σpq / s²total) - *Review slide and notes for KR20 example

Internal Consistency Reliability for Achievement Tests with Right and Wrong Answers

KR20 Formula: rxx = [K / (K − 1)] × (1 − Σpq / s²total) - criterion score is the number of items correct for each subject - item analysis = "p" and "q" → K = number of items → p = proportion of correct answers on 1 item → q = proportion of incorrect answers on 1 item → s²total = variance of the total number correct across subjects
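A minimal sketch of this computation on a made-up right/wrong matrix (the answers are hypothetical; the N-denominator variance is used for the total so that it matches p × q, which is the population variance of a dichotomous item - textbook conventions may differ):

```python
import numpy as np

# Hypothetical answers: rows = students, columns = items (1 = correct, 0 = incorrect).
answers = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [1, 1, 1, 1, 0],
], dtype=float)

k = answers.shape[1]                           # K = number of items
p = answers.mean(axis=0)                       # proportion correct on each item
q = 1 - p                                      # proportion incorrect on each item
total_var = answers.sum(axis=1).var(ddof=0)    # variance of the total-correct scores

kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(f"KR20 = {kr20:.3f}")
```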

KR21 Formula

KR21: rxx = [K / (K − 1)] × [1 − x̄(K − x̄) / (K × s²)] - KR20 can be estimated using only K, the mean (x̄) and the variance (s²) of the total scores - KR21 usually underestimates KR20
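For illustration, the same idea with hypothetical summary values (only K, the mean, and the variance of the total scores are needed):

```python
# Hypothetical values: a 20-item test with mean total score 14 and variance 9.
K, mean_total, var_total = 20, 14.0, 9.0

kr21 = (K / (K - 1)) * (1 - (mean_total * (K - mean_total)) / (K * var_total))
print(f"KR21 = {kr21:.3f}")   # about 0.56
```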

"Maxicon Principle"

Max = maximize systematic variance; Min = minimize error variance; Con = control extraneous variance

Standard Error of Measurement Formula

Sem = Sx × √(1 − rxx)
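A quick numeric sketch of this formula (the standard deviation, reliability, and observed score are made-up values):

```python
import math

s_x = 6.0     # hypothetical standard deviation of the test scores
r_xx = 0.90   # hypothetical reliability coefficient

sem = s_x * math.sqrt(1 - r_xx)
print(f"SEM = {sem:.2f}")   # about 1.90

# Rough band expected to contain the true score (about 68%, i.e., +/- 1 SEM).
observed = 75
print(f"true score likely between {observed - sem:.1f} and {observed + sem:.1f}")
```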

Item or Trial Analysis

rxx < .80 or .70 - how do you identify the trials or items that are major sources of measurement error? → item = total or trial - total correlation *example on next slide - trial 3 appears to be inconsistent = r3total = .54 (item total correlation) → source of unreliability

Split Half Formula

rxx = 2(roe) / (1 + roe)
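A minimal sketch combining the odd-even split-half correlation with this Spearman-Brown step-up, using a made-up matrix of right/wrong item scores:

```python
import numpy as np

# Hypothetical item scores: rows = people, columns = items (1 = correct, 0 = incorrect).
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 1, 1, 1],
])

odd_half  = items[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7
even_half = items[:, 1::2].sum(axis=1)   # items 2, 4, 6, 8

r_oe = np.corrcoef(odd_half, even_half)[0, 1]   # correlation between the halves
r_xx = (2 * r_oe) / (1 + r_oe)                  # Spearman-Brown step-up to full test

print(f"r_oe = {r_oe:.2f}, full-test reliability = {r_xx:.2f}")
```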

Formula for Reliability

rxx = s²true / s²total - rxx ranges from 0 to 1 (want it closer to 1)

Z-Score Formula

z = (X − M) / s, where M = mean and s = standard deviation
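A tiny numeric sketch tying the z-score to the T-scale described earlier (the raw score, mean, and standard deviation are hypothetical):

```python
# Hypothetical raw score with mean M = 50 and standard deviation s = 8.
X, M, s = 62, 50, 8

z = (X - M) / s        # z-score: mean 0, standard deviation 1
T = 50 + 10 * z        # T-scale: mean 50, standard deviation 10

print(f"z = {z:.2f}, T = {T:.0f}")   # z = 1.50, T = 65
```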

