Tests & Measurements - Exam 1
A group test:
Can be given to multiple people by one examiner
Reliability refers to:
Consistency
When the error variance is equal to 0, the reliability is equal to:
a. -1.0 b. 0 *c. 1.0
Professors Study Guide: Norm (scaled) scores vs Raw scores
- Raw Score (also called an obtained score): an unaltered measurement (e.g., 85 correct out of 100 questions)
- Scaled (norm) Score: the result of some transformation(s) applied to the raw score
Professors Study Guide: Kappa Statistic
A measure of agreement between two judges who each rate a set of objects using nominal scales.
- Values of kappa may vary between 1 (perfect agreement) and -1 (less agreement than can be expected on the basis of chance alone)
- A value greater than .75 generally indicates "excellent" agreement, a value between .40 and .75 indicates "fair to good" ("satisfactory") agreement, and a value less than .40 indicates "poor" agreement
- Cohen's kappa coefficient (κ) is a statistic which measures inter-rater agreement for qualitative (categorical) items.
- It is generally thought to be a more robust measure than a simple percent agreement calculation, as κ takes into account the possibility of the agreement occurring by chance.
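A minimal sketch of the computation, using invented ratings from two judges; nothing here comes from the text except the standard formula kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement:

```python
# Cohen's kappa for two judges rating the same objects on a nominal scale.
# Ratings are hypothetical; kappa = (p_o - p_e) / (1 - p_e).
from collections import Counter

judge_a = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
judge_b = ["yes", "no",  "no", "yes", "no", "yes", "yes", "no"]
n = len(judge_a)

# Observed agreement: proportion of objects the judges rated identically.
p_o = sum(a == b for a, b in zip(judge_a, judge_b)) / n

# Chance agreement: sum over categories of the product of the two judges'
# marginal proportions for that category.
counts_a, counts_b = Counter(judge_a), Counter(judge_b)
p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in set(judge_a) | set(judge_b))

kappa = (p_o - p_e) / (1 - p_e)
print(f"p_o = {p_o:.2f}, p_e = {p_e:.2f}, kappa = {kappa:.2f}")  # .75, .50, .50
```

Note how the raw percent agreement (.75) drops to a kappa of .50 ("fair to good") once chance agreement is removed.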
The first group tests of human abilities were developed for:
selecting soldiers to fight for the United States in World War I
In a normal distribution, the best measure of dispersion is the:
standard deviation
This is the only form of validity in which a panel of judges determines the validity:
Content validity
The value of face validity is:
When it is high it increases cooperation
The normal distribution (bell curve) is an example of the:
frequency distribution polygon
Regression analysis shows:
how change in one variable is related to change in another variable
Professors Study Guide: Types of Construct Validity (& definition of Construct Validity)
how the scores relate to a theoretical construct. The two types of construct validity evidence covered in this course are convergent evidence (the test correlates with other measures of the same construct) and discriminant evidence (the test does not correlate with measures of unrelated constructs).
Professors Study Guide, Interval Scale:
interval: a scale that has properties of magnitude and equal intervals, but NOT an absolute zero. interval example: the Celsius scale of temperature
Professors Study Guide: Issue of Homogeneity
method of homogeneity: homogeneity is examined using the internal consistency of a test, with the measures we covered under reliability: coefficient alpha and KR20. The correlations between subtest scores and the total test score also give us a measure of internal consistency.
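A minimal sketch of a coefficient alpha calculation; the item scores are invented, and the only formula assumed is alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores):

```python
# Coefficient alpha for a made-up 4-item test; rows are examinees.
# alpha = (k / (k - 1)) * (1 - sum(item variances) / variance(total scores))
import statistics

scores = [
    [2, 3, 3, 2],
    [4, 4, 5, 4],
    [3, 3, 4, 3],
    [1, 2, 2, 1],
    [5, 4, 5, 5],
]
k = len(scores[0])
items = list(zip(*scores))             # one tuple of scores per item
totals = [sum(row) for row in scores]  # each examinee's total score

sum_item_vars = sum(statistics.pvariance(item) for item in items)
alpha = (k / (k - 1)) * (1 - sum_item_vars / statistics.pvariance(totals))
print(f"coefficient alpha = {alpha:.2f}")  # about .97 for these invented scores
```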
Professors Study Guide, Nominal:
nominal scales are really not scales at all; their only purpose is to name objects. nominal example: football players uniform numbers
A __________ is a picture of the relationship between 2 variables.
scatter diagram
From The Text: Standard Error of Measurement
"The standard error of measurement tells us, on the average, how much a score varies from the true score. In practice, the standard deviation of the observed score and the reliability of the test are used to estimate the standard error of measurement."
In finding the best-fitting line, what is the residual?
the difference between the observed & predicted score
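A small sketch with invented (X, Y) pairs: fit the least-squares line, then report each residual as observed minus predicted:

```python
# Best-fitting (least-squares) line for hypothetical data; the residual is
# the difference between the observed and predicted Y for each point.
import statistics

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
mx, my = statistics.mean(x), statistics.mean(y)

slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
intercept = my - slope * mx  # line: predicted Y = intercept + slope * X

for xi, yi in zip(x, y):
    predicted = intercept + slope * xi
    print(f"observed {yi}, predicted {predicted:.2f}, residual {yi - predicted:+.2f}")
```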
In test construction, it is important to obtain a standardization sample:
to provide a reference sample to which the results of the new test can be compared
From The Text: Norms
Norms refer to the performances by defined groups on particular tests... used to give information about performance relative to what has been observed in a standardization sample. The norms provide a baseline by which to measure the results of the tests. Correlations are relevant to show whether there is a link between changes in an independent variable and changes in a dependent variable, whether positive, negative, or null. If there is no correlation whatsoever between the two variables, then the independent variable can be dismissed as an influence on the dependent variable. If there is a correlation, then, depending on the regression analysis, one can make accurate predictions about how the use of a test will predict future outcomes, or how a particular psychotropic drug will affect behavior. Knowing this can assist one in predicting whether the use of a particular psychological test will be valid when applied to a test sample or individual.
Suppose your score is in the 87th percentile on a test. This means:
That 87% of the students got a lower score than yours
Projective Personality Tests:
either the stimulus (test materials) or the required response - or both - are ambiguous.
From The Text: Representative Sample
"A representative sample is one that comprises individuals similar to those for whom the test is to be used. When the test is used for the general population, a representative sample must reflect all segments of the population in proportion to their actual numbers."
From The Text: Reliability & Sample Size
"As the sample gets larger, it represents the domain more and more accurately. As a result, the greater the number of items, the higher the reliability. A later section of this chapter shows how a larger number of items increases test reliability."
From The Text: Reliability Analysis
"Our task in reliability analysis is to estimate how much error we would make by using the score from the shorter test as an estimate of your true ability."
From The Text: Practice Effects
"Practice effects are one important type of carryover effect. Some skills improve with practice. When a test is given a second time, test takers score better because they have sharpened their skills by having taken the test the first time."
Content validity is most frequently used with:
Achievement tests
Reliability coefficients range from:
a. -1.0 to 1.0 b. -10.0 to 10.0 *c. 0 to 1.0 d. 0 to 10.0
From The Text: Conceptualization and Assessment of Measurement Error
"Discrepancies between true ability and measurement of ability constitute errors of measurement. In psychological testing, the word error does not imply that a mistake has been made. Rather than having a negative connotation, error implies that there will always be some inaccuracy in our measurements. Our task is to find the magnitude of such errors and to develop ways to minimize them. This chapter discusses the conceptualization and assessment of measurement error. Tests that are relatively free of measurement error are deemed to be reliable, hence the name of this chapter. Tests that have "too much" measurement error are considered unreliable."
All other things being equal, which test-retest reliability coefficient will have the lowest number?
a. Test-retest with a 2 week interval b. Test-retest with a 1 year interval c. Test-retest with a 5 year interval *d. Test-retest with a 10 year interval
Professors Study Guide, Ordinal:
ordinal: has properties of magnitude, but NOT equal intervals or an absolute zero. Allows you to rank individuals or objects but not to say anything about the meaning of the differences between the ranks. ordinal example: to rank members of a sample by height (1 is tallest, 5 is shortest, says nothing about the actual height differences between 1 and 5)
Professors Study Guide, Ratio Scale:
ratio: a scale that has all three properties: magnitude, equal intervals, and an absolute zero ratio example: speed of travel
Which technique is used to make predictions about scores on one variable from knowing scores on another variable?
regression
The first intelligence test was used to:
screen children who were low in intelligence out of schools
Professors Study Guide: Types of standardized scores
- Standard scores are used in norm-referenced assessment to compare one student's performance on a test to the performance of other students her age. Standard scores estimate whether a student's scores are above average, average, or below average compared to peers. They also enable comparison of a student's scores on different types of tests, as in diagnosing learning disabilities. - Standard scores: Test developers calculate the statistical average based on the performance of students tested in the norming process of test development. That score is assigned a value. Different performance levels are calculated based on the differences in student scores from the statistical average and are expressed as standard deviations. These standard deviations are used to determine what scores fall within the above average, average, and below average ranges. Standard scores and standard deviations are different for different tests. Many of the commonly used tests, such as the Wechsler Intelligence Scales, have an average score of 100 and a standard deviation of 15.
According to the text, structured personality tests:
- provide a statement, usually of the "self-report" variety, and require the subject to choose between two or more alternative responses such as "True" or "False"
If you have a T score of 40, then you have a score that is:
1 standard deviation below the mean
In the following distribution: 5,9,8,1,8, what is the mode?
8
Students who took Test X went home and studied the questions and then came back the next day and took Test X again. Studying the questions would most likely lead to:
A decrease in the test-retest reliability for Test X
Which of the following is true?
A test can be reliable but not valid.
Professors Study Guide: Section 9 of the APA Code
9.01 Bases for Assessments (a) Psychologists base the opinions contained in their recommendations, reports, and diagnostic or evaluative statements, including forensic testimony, on information and techniques sufficient to substantiate their findings. (See also Standard 2.04, Bases for Scientific and Professional Judgments.) (b) Except as noted in 9.01c, psychologists provide opinions of the psychological characteristics of individuals only after they have conducted an examination of the individuals adequate to support their statements or conclusions. When, despite reasonable efforts, such an examination is not practical, psychologists document the efforts they made and the result of those efforts, clarify the probable impact of their limited information on the reliability and validity of their opinions, and appropriately limit the nature and extent of their conclusions or recommendations. (See also Standards 2.01, Boundaries of Competence, and 9.06, Interpreting Assessment Results.) (c) When psychologists conduct a record review or provide consultation or supervision and an individual examination is not warranted or necessary for the opinion, psychologists explain this and the sources of information on which they based their conclusions and recommendations.
9.02 Use of Assessments (a) Psychologists administer, adapt, score, interpret, or use assessment techniques, interviews, tests, or instruments in a manner and for purposes that are appropriate in light of the research on or evidence of the usefulness and proper application of the techniques. (b) Psychologists use assessment instruments whose validity and reliability have been established for use with members of the population tested. When such validity or reliability has not been established, psychologists describe the strengths and limitations of test results and interpretation. (c) Psychologists use assessment methods that are appropriate to an individual's language preference and competence, unless the use of an alternative language is relevant to the assessment issues.
9.03 Informed Consent in Assessments (a) Psychologists obtain informed consent for assessments, evaluations, or diagnostic services, as described in Standard 3.10, Informed Consent, except when (1) testing is mandated by law or governmental regulations; (2) informed consent is implied because testing is conducted as a routine educational, institutional, or organizational activity (e.g., when participants voluntarily agree to assessment when applying for a job); or (3) one purpose of the testing is to evaluate decisional capacity. Informed consent includes an explanation of the nature and purpose of the assessment, fees, involvement of third parties, and limits of confidentiality and sufficient opportunity for the client/patient to ask questions and receive answers. (b) Psychologists inform persons with questionable capacity to consent or for whom testing is mandated by law or governmental regulations about the nature and purpose of the proposed assessment services, using language that is reasonably understandable to the person being assessed. (c) Psychologists using the services of an interpreter obtain informed consent from the client/patient to use that interpreter, ensure that confidentiality of test results and test security are maintained, and include in their recommendations, reports, and diagnostic or evaluative statements, including forensic testimony, discussion of any limitations on the data obtained. (See also Standards 2.05, Delegation of Work to Others; 4.01, Maintaining Confidentiality; 9.01, Bases for Assessments; 9.06, Interpreting Assessment Results; and 9.07, Assessment by Unqualified Persons.)
9.04 Release of Test Data (a) The term test data refers to raw and scaled scores, client/patient responses to test questions or stimuli, and psychologists' notes and recordings concerning client/patient statements and behavior during an examination. Those portions of test materials that include client/patient responses are included in the definition of test data. Pursuant to a client/patient release, psychologists provide test data to the client/patient or other persons identified in the release. Psychologists may refrain from releasing test data to protect a client/patient or others from substantial harm or misuse or misrepresentation of the data or the test, recognizing that in many instances release of confidential information under these circumstances is regulated by law. (See also Standard 9.11, Maintaining Test Security.) (b) In the absence of a client/patient release, psychologists provide test data only as required by law or court order.
9.05 Test Construction Psychologists who develop tests and other assessment techniques use appropriate psychometric procedures and current scientific or professional knowledge for test design, standardization, validation, reduction or elimination of bias, and recommendations for use.
9.06 Interpreting Assessment Results When interpreting assessment results, including automated interpretations, psychologists take into account the purpose of the assessment as well as the various test factors, test-taking abilities, and other characteristics of the person being assessed, such as situational, personal, linguistic, and cultural differences, that might affect psychologists' judgments or reduce the accuracy of their interpretations. They indicate any significant limitations of their interpretations. (See also Standards 2.01b and c, Boundaries of Competence, and 3.01, Unfair Discrimination.)
9.07 Assessment by Unqualified Persons Psychologists do not promote the use of psychological assessment techniques by unqualified persons, except when such use is conducted for training purposes with appropriate supervision. (See also Standard 2.05, Delegation of Work to Others.)
9.08 Obsolete Tests and Outdated Test Results (a) Psychologists do not base their assessment or intervention decisions or recommendations on data or test results that are outdated for the current purpose. (b) Psychologists do not base such decisions or recommendations on tests and measures that are obsolete and not useful for the current purpose.
9.09 Test Scoring and Interpretation Services (a) Psychologists who offer assessment or scoring services to other professionals accurately describe the purpose, norms, validity, reliability, and applications of the procedures and any special qualifications applicable to their use. (b) Psychologists select scoring and interpretation services (including automated services) on the basis of evidence of the validity of the program and procedures as well as on other appropriate considerations. (See also Standard 2.01b and c, Boundaries of Competence.) (c) Psychologists retain responsibility for the appropriate application, interpretation, and use of assessment instruments, whether they score and interpret such tests themselves or use automated or other services.
9.10 Explaining Assessment Results Regardless of whether the scoring and interpretation are done by psychologists, by employees or assistants, or by automated or other outside services, psychologists take reasonable steps to ensure that explanations of results are given to the individual or designated representative unless the nature of the relationship precludes provision of an explanation of results (such as in some organizational consulting, preemployment or security screenings, and forensic evaluations), and this fact has been clearly explained to the person being assessed in advance.
9.11 Maintaining Test Security The term test materials refers to manuals, instruments, protocols, and test questions or stimuli and does not include test data as defined in Standard 9.04, Release of Test Data. Psychologists make reasonable efforts to maintain the integrity and security of test materials and other assessment techniques consistent with law and contractual obligations, and in a manner that permits adherence to this Ethics Code.
Professors Study Guide: Types of sampling
Domain Sampling: the domain is essentially all of the possible questions that could test for or measure some concept, and a test samples from that domain.
Models of Reliability:
- Time Sampling (test-retest method): this way of measuring reliability can really only be done with characteristics that don't change over time, like personality traits or intelligence. Only stable traits can be examined with the time sampling method. It's pretty simple: you administer the same test twice and find the correlation between the two scores for each person (see the sketch below).
- Item Sampling (parallel forms method): "the most rigorous assessment of reliability." If a person has gone to the trouble of creating two parallel forms, it's probably advisable to administer them at the same time, because otherwise you're adding extra error due to time sampling.
- Split-Half Method (KR20 Formula) (Coefficient Alpha) (Reliability of a Difference Score)
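A minimal sketch of the time sampling (test-retest) method with invented scores; the reliability estimate is simply the Pearson correlation between the two administrations:

```python
# Test-retest reliability: the same six people take the same test twice;
# the correlation between the two sets of scores estimates reliability.
import statistics

time1 = [88, 92, 75, 60, 95, 70]
time2 = [85, 94, 78, 58, 91, 72]

m1, m2 = statistics.mean(time1), statistics.mean(time2)
num = sum((a - m1) * (b - m2) for a, b in zip(time1, time2))
den = (sum((a - m1) ** 2 for a in time1) * sum((b - m2) ** 2 for b in time2)) ** 0.5
print(f"test-retest reliability = {num / den:.2f}")
```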
All concepts are constructs.
False
Professors Study Guide: Types of distributions/dispersions
Frequency Distribution: there's the normal distribution, called the bell-shaped curve.
The Altruism Test was not significantly correlated with the Social Desirability Test. This evidence indicates that the Altruism Test has:
Good discriminant construct validity
In a negative correlation:
High scores on the X variable are associated with low scores on the Y variable and vice versa.
Application of the Spearman-Brown formula to the split-half reliability will:
Increase the overall reliability
The Fahrenheit temperature scale is an example of what type of scale of measurement?
Interval
The main disadvantage of the discriminant evidence method for determining construct validity is that:
It can determine the construct the test is not measuring but not determine the construct the test is measuring
__________________ considers the relationships among combinations of three or more variables.
Multivariate analysis
How many units are there on the stanine scale?
Nine
Parallel form test-retest reliability involves testing:
One sample with one version of the test at time one and the same sample with another version of the test at time two
You want to correlate the relationship between heart rate and blood pressure in 50 adults. What would be the most appropriate correlation coefficient?
Pearson Product Moment correlation coefficient
The more caffeine consumed the more anxiety felt is an example of a:
Positive correlation
The Rorschach Inkblot Test presents ambiguous stimuli to an individual who then provides their own personal perception. This is an example of what kind of test?
Projective
According to the text, structured personality tests:
Require the test taker to choose between two or more alternative responses
Two judges are rank ordering wines based on taste. What would be the most appropriate correlation to use to see how well they agree with each other?
Spearman Rho
You want to correlate the relationship between height and weight for 20 children. What would be the most appropriate correlation coefficient if their data is rank-ordered?
Spearman Rho
Professors Study Guide: Interpretation of standard deviations
Standard deviation is a number used to tell how measurements for a group are spread out from the average (mean), or expected value. A low standard deviation means that most of the numbers are very close to the average. A high standard deviation means that the numbers are spread out.
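A small illustration with two invented score sets that share the same mean but differ in spread:

```python
# Same mean, different standard deviations: the first group clusters near
# the mean (low SD); the second is spread out (high SD).
import statistics

tight  = [78, 79, 80, 81, 82]
spread = [60, 70, 80, 90, 100]

print(statistics.mean(tight),  round(statistics.pstdev(tight), 2))   # 80 1.41
print(statistics.mean(spread), round(statistics.pstdev(spread), 2))  # 80 14.14
```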
Professors Study Guide: Standard Error of Measurement & how to apply it to a problem
The Standard Error of Measurement tells us, on the average, how much a score varies from the true score. - In practice, the Standard Deviation of the observed score and the reliability of the test are used to estimate the standard error of measurement.
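A minimal sketch applying that idea to hypothetical numbers, using the usual formula SEM = SD * sqrt(1 - reliability) and a 95% band around an observed score:

```python
# Standard error of measurement from the observed-score SD and the test's
# reliability; all values here are hypothetical.
import math

sd, reliability = 15, 0.89
sem = sd * math.sqrt(1 - reliability)        # about 4.97

observed = 110
low, high = observed - 1.96 * sem, observed + 1.96 * sem
print(f"SEM = {sem:.2f}; 95% band around {observed}: {low:.1f} to {high:.1f}")
```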
Professors Study Guide: Coefficient of Determination
The coefficient of determination is the correlation coefficient squared and it tells the proportion of the total variation in Y that is explained by X. - So let's say, for instance, we have two variables, X and Y, and the correlation of them comes out at .6. So their coefficient of determination would be .6 squared which would be .36 and what that means is that 36% of the variation in the Y variable is explained by the X. Important to know.
Professors Study Guide: Measures of variability
The most common measures of variability are the: range, the interquartile range (IQR), variance, and standard deviation.
The correlation between X and Y will always be the same as the correlation between Y and X.
True.
According to the text, previous learning can best be described as:
Achievement
Validity refers to:
Accuracy
According to the text, the origins of testing can be traced to:
China
These two methods for determining validity have the most in common:
Concurrent criterion-related validity and the convergent evidence construct validity method
Professors Study Guide: Spearman-Brown Formula (know what it measures)
a formula relating psychometric reliability to test length and used by psychometricians to predict the reliability of a test after changing the test length
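A small sketch of the formula with hypothetical values, where r is the current reliability and n is the factor by which the test length changes (n = 2 doubles the test, n = 0.5 halves it):

```python
# Spearman-Brown prophecy formula: predicted reliability after changing
# test length by a factor of n, given current reliability r.
def spearman_brown(r, n):
    return (n * r) / (1 + (n - 1) * r)

print(round(spearman_brown(0.70, 2), 2))    # doubling: 0.82
print(round(spearman_brown(0.70, 0.5), 2))  # halving: 0.54
```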
Of the following correlations, which indicates the strongest relationship between X and Y?
a. -.25 b. .50 *c. -.75 d. .60
If the correlation between high blood pressure and stroke is .4, it means that:
a. 40% of those with high blood pressure will eventually have a stroke b. 40% of the cause of stroke is due to high blood pressure *c. 16% of the variance of stroke can be accounted for by high blood pressure d. 9% of the variance of stroke can be accounted for by high blood pressure
Error variance as measured in the reliability coefficient is:
a. Patterned error *b. Random error
The standard error of the difference between two scores is 8 for IQ Test X and Achievement Test Y. In which of the scores below would the difference be statistically significant?
a. Test X = 123 and Test Y = 116 *b. Test X = 89 and Test Y = 98 c. Test X = 95 and Test Y = 102 d. All of the above
Professors Study Guide: Different types of validity & those associated with different types of tests (i.e. Intelligence, Achievement, etc.)
- Face Validity: as in, it looks to be about right
- Content Validity: the content of the test items is examined; mostly used with ACHIEVEMENT tests
- Criterion Related Validity: how the scores relate to some criterion; most often used to validate APTITUDE tests
- Construct Validity: how the scores relate to a theoretical construct
Professors Study Guide: Types of Criterion-related Validity (& definition of Criterion Validity)
- This looks at how well the test scores correlate or correspond with criterion measures like other tests or behavioral measures. There are two types of criterion related validity: CONCURRENT and PREDICTIVE.
- Criterion validity evidence for a test is the relationship between a test score and some well-defined criterion. For example, the association between a test of job aptitude and the criterion of actual performance on the job is an example of criterion validity evidence.
- Criterion validity evidence requires one to predict some criterion score on the basis of a predictor or test score.
From The Text: Parallel Forms Reliability
"Parallel forms reliability compares two equivalent forms of a test that measure the same attribute. The two forms use different items; however, the rules used to select items of a particular difficulty level are the same."
From The Text: Carryover Effects
"When there are carryover effects, the test-retest correlation usually overestimates the true reliability."
From The Text: Test-retest Correlation
"When you find a test-retest correlation in a test manual, you should pay careful attention to the interval between the two testing sessions. A well-evaluated test will have many retest correlations associated with different time intervals between testing sessions. Most often, you want to be assured that the test is reliable over the time interval of your own study. You also should consider what events occurred between the original testing and the retest. For example, activities such as reading a book, participating in a course of study, or watching a TV documentary can alter the test-retest reliability estimate."
From The Text: Comparisons
"it is not appropriate to compare an individual with a group that does not have the same characteristics as the individual."
Professors Study Guide: Origins of Test Measures (the test listed in Module 1)
"the earliest known testing, so to speak, that they can find was being done 4,000 years ago in China and those were tests that were given as measures for hiring and promotion of civil servants"
A correlation coefficient can take any value from:
*a. -1.0 to 1.0 b. 0 to 1.0 c. -1.0 to 0 d. 1.0 to 10.0
The more error variance a test has the:
*a. Less reliability b. More reliability c. Error variance is not related to reliability
From The Text: Reliability Estimates
"These reliability estimates have various names, including interrater, interscorer, interobserver, or interjudge reliability. All of the terms consider the consistency among different judges who are evaluating the same behavior. There are at least three different ways to do this. The most common method is to record the percentage of times that two or more observers agree. Unfortunately, this method is not the best one, for at least two reasons. First, this percentage does not consider the level of agreement that would be expected by chance alone. For instance, if two observers are recording whether a particular behavior either occurred or did not occur, then they would have a 50% likelihood of agreeing by chance alone. A method for assessing such reliability should include an adjustment for chance agreement. Second, percentages should not be mathematically manipulated. For example, it is not technically appropriate to average percentages. Indexes such as Z scores are manipulable and thus better suited to the task of reliability assessment. The kappa statistic is the best method for assessing the level of agreement among several observers. The kappa statistic was introduced by J. Cohen (1960) as a measure of agreement between two judges who each rate a set of objects using nominal scales. Fleiss (1971) extended the method to consider the agreement between any number of observers. Kappa indicates the actual agreement as a proportion of the potential agreement following correction for chance agreement. Values of kappa may vary between 1 (perfect agreement) and 21 (less agreement than can be expected on the basis of chance alone). A value greater than .75 generally indicates "excellent" agreement, a value between .40 and .75 indicates "fair to good" ("satisfactory") agreement, and a value less than .40 indicates "poor" agreement (Fleiss, 1981). The calculation of kappa is beyond the scope of this presentation, but interested readers can find more details in some good summary references (Warrens, 2015). An example of a study using behavioral assessment is discussed in Psychological Testing in Everyday Life 4.2. Behavioral observation is difficult and expensive. In the future, we expect that more of the observation will be done with new technologies and smart software. This is discussed in Psychological Testing in Everyday Life 4.3."
From the Lecture: Content Related Validity
So content related validity, it's done through a logical process, not a statistical one. It involves a systematic examination of the content in a test, so all of the items, in order to determine if the test covers a representative enough sample of the domain being measured. How this is done is typically there's a panel of experts or judges who get together and decide whether the test items, each individual one, are a fair sample of all of the potential content; they also look at the wording and whether it's written at an appropriate reading level for the people it's meant to test. Their job is to rate or match the items' relevance. Content related validity is most frequently used with achievement tests, and compared to some of the other types of validity we'll be talking about, it doesn't involve collecting data from test takers. Content validity is actually more qualitative than quantitative, but it is more objective than face validity, which is more subjective.
When using split-half test-retest reliability for Test X:
Test X is given, then split in half and scored separately. The scores from the 2 halves are then correlated with each other
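A minimal sketch of that procedure with invented right/wrong (1/0) item scores: score the odd-numbered and even-numbered items separately, correlate the two half-test scores, then step the correlation up to full length with the Spearman-Brown correction:

```python
# Split-half reliability: split one administration into odd and even items,
# correlate the halves, then apply Spearman-Brown (n = 2). Data are made up.
import statistics

item_scores = [  # hypothetical 0/1 scores for 6 examinees on a 6-item test
    [1, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1, 1],
]
odd  = [sum(row[0::2]) for row in item_scores]  # half-test score on odd items
even = [sum(row[1::2]) for row in item_scores]  # half-test score on even items

mx, my = statistics.mean(odd), statistics.mean(even)
num = sum((a - mx) * (b - my) for a, b in zip(odd, even))
den = (sum((a - mx) ** 2 for a in odd) * sum((b - my) ** 2 for b in even)) ** 0.5
half_r = num / den
full_r = (2 * half_r) / (1 + half_r)  # Spearman-Brown correction to full length
print(f"half-test r = {half_r:.2f}, corrected full-test r = {full_r:.2f}")
```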
In which of the following will you have the highest reliability for your newly created depression test?
a. Your sample consists only of those who are clinically depressed b. Your sample consists only of those who are moderately depressed c. Your sample consists of those who are a mixture of moderately and clinically depressed *d. Your sample consists of those who span the full range from non-depressed to clinically depressed
Which of the following is not an issue or problem when it comes to correlation and regression?
a. shrinkage b. restricted range c. the third variable explanation d. the correlation-causation problem *e. All of the above. They are all possible problems or issues.
The correlation coefficient squared is known as the:
coefficient of determination
The examiner reads everyone's palms to see how well they can spell. It turns out that the examiner is very accurate. The face validity of this test is:
Low. (The test works, but reading palms does not look like a measure of spelling ability, so its face validity is low.)
Professors Study Guide: Measures of central tendency
The mean, the median, and the mode. - A central tendency can be calculated for either a finite set of values or for a theoretical distribution, such as the normal distribution.
From The Lecture: validity coefficient
- So what is a validity coefficient? Simply, the correlation between the test score and the criterion score. A validity coefficient larger than .60 is rarely seen, and .30 to .40 is actually considered quite high.
- The validity coefficient squared (a squared correlation, i.e., the coefficient of determination) equals the percentage of variation in the criterion that the test score is able to tell us about. The rest, 100 minus the coefficient of determination, is the percentage that remains unexplained by knowing the test score.
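A tiny worked example assuming a hypothetical validity coefficient of .40:

```python
# Squaring the validity coefficient gives the percentage of criterion
# variation explained by the test score; the remainder is unexplained.
validity = 0.40
explained = validity ** 2 * 100   # 16.0
print(f"explained: {explained:.0f}%, unexplained: {100 - explained:.0f}%")
```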
Professors Study Guide: Types of standardized scores
- Standard scores - Z scores - T scores (McCalls T) - Stanines - Percentiles (percentile rank is a measure of relative performance, not absolute performance)
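A small sketch converting one invented raw score into these metrics; the raw score, mean, and SD are hypothetical, the stanine is approximated as round(2z + 5) clamped to 1-9, and the percentile rank comes from the normal curve:

```python
# One raw score expressed as a z score, a T score (McCall's T), a stanine,
# and a percentile rank. All input values are hypothetical.
from statistics import NormalDist

raw, mean, sd = 62, 50, 8
z = (raw - mean) / sd                       # 1.5 SDs above the mean
t_score = 50 + 10 * z                       # T = 65
stanine = min(9, max(1, round(2 * z + 5)))  # 8
percentile = NormalDist().cdf(z) * 100      # ~93rd percentile

print(z, t_score, stanine, round(percentile))
```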
Professors Study Guide: Restrictions of tests/Pearson User Qualifications Requirements
The phrase test user qualifications refers to the combination of knowledge, skills, abilities, training, experience, and, where appropriate, credentials that the APA considers optimal for psychological test use...
- QUALIFICATION LEVEL A: There are no special qualifications to purchase these products.
- QUALIFICATION LEVEL B: Tests may be purchased by individuals with: A master's degree in psychology, education, speech language pathology, occupational therapy, social work, counseling, or in a field closely related to the intended use of the assessment, and formal training in the ethical administration, scoring, and interpretation of clinical assessments. OR Certification by or full active membership in a professional organization (such as ASHA, AOTA, AERA, ACA, AMA, CEC, AEA, AAA, EAA, NAEYC, NBCC) that requires training and experience in the relevant area of assessment. OR A degree or license to practice in the healthcare or allied healthcare field. OR Formal, supervised mental health, speech/language, occupational therapy, social work, counseling, and/or educational training specific to assessing children, or in infant and child development, and formal training in the ethical administration, scoring, and interpretation of clinical assessments.
- QUALIFICATION LEVEL C: Tests with a C qualification require a high level of expertise in test interpretation, and can be purchased by individuals with: A doctorate degree in psychology, education, or closely related field with formal training in the ethical administration, scoring, and interpretation of clinical assessments related to the intended use of the assessment. OR Licensure or certification to practice in your state in a field related to the purchase. OR Certification by or full active membership in a professional organization (such as APA, NASP, NAN, INS) that requires training and experience in the relevant area of assessment.
Criterion Related Validity, Two Types:
1. CONCURRENT: concurrent criterion related validity is the relationship between the test scores of the new test we're looking at and the scores on the criterion, obtained simultaneously, so the scores are available at the same time.
2. PREDICTIVE: predictive criterion related validity is the relationship between the test scores and future scores on the criterion.
- For example, take some test and use GPA as its criterion. If we're doing concurrent validity, we'd have the person take the test and look at their GPA at the same time, seeing how highly correlated they are. If we were doing predictive criterion related validity, we'd test them at one point and then look at their GPA down the road, seeing whether the test score predicts their later GPA.