NCE - Appraisal & Assessment


Observer bias may be reduced by:

- RATIONALE - Observer bias is the tendency that human beings have to see what they want to see, hear what they want to hear, and remember what they want to remember. The best way to control such tendencies is to carefully structure the interview process, use audiotape recordings, and reduce the number of inferences required of the observer.

Reliability Coefficient

- RATIONALE - The reliability coefficient represents the proportion of observed variability in a set of scores that is due to true score variability. Any remaining variability is due to measurement error (e.g., with a coefficient of .90, the remaining 10% is error).

VALIDITY - Types of Validity in TEST CONSTRUCTION

1. CONSTRUCT VALIDITY
   a. Convergent
   b. Discriminant
2. CRITERION VALIDITY
   a. Predictive
   b. Concurrent
3. CONTENT VALIDITY (non-statistical)
4. FACE VALIDITY (non-statistical)

Scales

>Thurstone Method - unidimensional; yields interval data (the distances between values are meaningful). Items are rated by a group of judges; the items the judges agree on most (low SD) go on the test, and the examinee then rates those items.
>Semantic Differential - examinees report where they fall on a range between two affective polar opposites.
>Absolute Scaling - for absolute item difficulty at different ages; often used in achievement and aptitude tests.
>Empirical Scale - based upon how a certain criterion group responds in comparison to the normal sample (e.g., counselors vs. all occupations).
>Rational Scale - all items are related to theory; all items correlate with one another and with the total score (e.g., IQ tests).

Z-score

Most common standard score. Mean = 0; SD = 1. A z-score allows one to compare people's scores on the basis of standard deviation units; each z "point" is one standard deviation. A z-score quantifies the original score in terms of the number of standard deviations that score is from the mean of the distribution. The formula for converting from an original or "raw" score to a z-score is: z = (score - mean) / SD. The range from the mean to a z-score of plus one (one SD) includes about 34% of the cases in a normal population.
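
As a quick check on the formula, here is a minimal Python sketch (the raw scores are made up for illustration):

    import statistics

    scores = [72, 85, 90, 64, 79]           # illustrative raw scores
    mean = statistics.mean(scores)          # 78.0
    sd = statistics.pstdev(scores)          # population standard deviation

    # z = (score - mean) / SD
    z_scores = [round((x - mean) / sd, 2) for x in scores]
    print(z_scores)                         # [-0.65, 0.76, 1.3, -1.52, 0.11]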

Normative VS Ipsative

A Normative interpretation is one in which the individual's score is evaluated by comparing it to others who took the same test. Also, each item is independent of all other items. Ipsative measures compare traits within the same individual; they do not compare a person to other persons who took the instrument. The ipsative test allows the person being tested to compare items. It does not reveal absolute strengths. It points out the highs and lows that exist within a single individual.

Demand Characteristics

A term that refers to cues in the experimental setting that inform the study subjects of the purpose of the study or suggest what behaviors are expected of them. Demand characteristics affect the external validity of a study by potentially changing the behavior of the study subjects.

Appraisal TERMS

APPRAISAL - implies going beyond measurement to making judgments about human attributes and behaviors; the process of assessing or estimating attributes; used interchangeably with evaluation.
ASSESSMENT - processes and procedures for collecting information about human behavior. Assessment tools include tests, inventories, rating scales, observation, interview data, and other techniques.
MEASUREMENT - the general process of determining the dimension of an attribute or trait.
INTERPRETATION - making a statement about the meaning or usefulness of measurement data according to the professional counselor's knowledge and judgment.
TEST - the systematic measurement of a sample of behavior.

Correlation

An expression of a relationship between variables, viewed on a scattergram and expressed as a number between -1.00 and +1.00. A correlation is BIVARIATE when 2 variables are being compared and MULTIVARIATE when more than 2 variables are involved. A perfect relationship is a linear relationship. The Pearson Product-Moment correlation r is used when both variables are continuous (interval or ratio data), while the Spearman rho correlation is used when one or more variables are ordinal data.

Order Effects

ORDER EFFECTS are a threat to external validity in studies with a repeated measures design, or studies in which the same subjects are exposed to more than one treatment. Order effects occur when the sequence in which subjects are exposed to the different levels of the independent variable confounds the results of the study.

You are using scores on the GRE, MAT, and GMAT to predict GPA in an MBA program. What statistical technique are you using?

In multiple regression, the values of two or more predictor variables (in this case, GRE, GMAT, and MAT) are used to predict outcome on one criterion variable (in this case, GPA).

Kuder-Richardson

Kuder and Richardson developed the KR-20, or rational equivalence, formula, designed to measure a test's inter-item consistency (how all items relate to all other items on a test). The KR-20 is appropriate for standardized tests with dichotomously scored items and is considered one of the best estimates of test reliability.

Non-Parametric Tests

Mann-Whitney U-test
Wilcoxon Signed-Rank Test (for matched pairs)
Kruskal-Wallis H-test
Friedman Test

Reliability- Forms of Reliability

When the same test is administered to the same group of examinees on two different occasions, the correlation of the scores is known as the COEFFICIENT OF STABILITY (AKA Test-Retest).
A COEFFICIENT OF EQUIVALENCE is an alternate-form reliability: two equivalent forms of a test are administered to the same group of examinees at about the same time and their scores are correlated (AKA Parallel Form Measure).
COEFFICIENT ALPHA is a type of internal consistency reliability coefficient. Determining a test's internal consistency reliability by the coefficient alpha involves giving a test once to a single group of examinees. A special formula is used to determine the degree of inter-item consistency.
** A reliability coefficient of 1.00 indicates perfect reliability, with no error.

Reliability - Error Variance

X = T ± e (e can be positive or negative). Observed score (X) = True score (T) plus/minus error (e). ERROR is the difference between a person's true score and that person's observed score.
SYSTEMATIC errors are constant and predictable (e.g., a test question contains a typographical error and everyone who takes the test has the same error).
UNSYSTEMATIC errors are presumed to be random and do fluctuate (e.g., a typo on just one person's test; reading instructions incorrectly to students; fatigue).
Reliability does not measure systematic errors because they are constant and do not fluctuate.

Performance Test

measure interests, attitudes, and other noncognitive attributes of personality. Examples are projective tests and personality inventories.

Program Evaluation - A major purpose in program evaluation is:

to monitor and enhance accountability

Symbols

σ = standard deviation; σ² = variance; µ = mean; ∑ = the sum of

Reactivity

REACTIVITY occurs when research subjects respond to an independent variable in a particular way simply because they know their behavior is being observed.

A behavioral researcher interested in the effectiveness of a self-control procedure on the inattention of hyperactive children would use a "reversal" single-subject design to control for which of the following? A. History effects B. Reactivity effects C. Order effects D. Placebo effects

Single-subject designs, such as the reversal, or withdrawal, designs, involve analyzing the effects of treatments for one subject over time. They tend to be used in behavioral and medical research. The ABAB design, a type of reversal design, is one in which a treatment is applied, withdrawn, and applied again. A. In a reversal design, if the behavior being measured changes in the same way after both applications of the treatment, it is assumed that the treatment, rather than HISTORY (external, extraneous factors) is responsible for those changes.

Reactivity

The general category for any phenomena that affect the behavior of study subjects merely because they are participating in a study and know they are being observed. Some of these phenomena in this category are the Hawthorne effect, demand characteristics, and experimenter expectancy. Reactivity affects the external validity of a study

Stanine

The normal curve is divided into 9 (Not Equal) parts. Range=1-9; Mean= 5 ; s.d.= 2 The term stanine is a contraction of "standard nines." Stanines provide a single-digit scoring metric with a range from 1 to 9, a mean of 5, and a standard deviation of 2. Each stanine score represents a specific range of percentile scores in the normal curve. Stanines are useful when a researcher is interested in providing a "band" interpretation rather than a single score cutoff.
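
A minimal Python sketch of the stanine metric (this uses the linear "2z + 5, clipped to 1-9" approximation; published tables assign stanines from fixed percentile bands):

    def z_to_stanine(z: float) -> int:
        """Approximate stanine: rescale z to mean 5, SD 2, clip to 1-9."""
        return max(1, min(9, round(2 * z + 5)))

    for z in (-2.5, -1.0, 0.0, 1.0, 2.5):
        print(z, "->", z_to_stanine(z))     # 1, 3, 5, 7, 9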

The terms "appraisal," "testing," and "assessment" can best be described as:

The terms "appraisal," "testing," and "assessment" are General terms that refer to the same thing. TESTING is only one of several methods of APPRAISAL that utilizes standardized methods of administration, scoring, and interpretation to ASSESS the evidence of normative data. APPRAISAL refers to the use of any procedure for the meaningful evaluation of human beings regarding any psychological characteristic or attribute. They are Not interchangeable terms and can be used with either individuals or groups.

Non-Cognitive Tests

There are no right/wrong answers. Faking may be a concern. Error types: halo, criterion, leniency, acquiescence, social desirability, etc.
Acquiescence - when a client always agrees with something.
Deviation (in reference to testing) - when an individual purposely, or when in doubt, gives unusual responses.
Social Desirability - when the person gives the answer he or she feels is socially acceptable.

Reliability - Standard Error of Measurement

This deviation provides an indication of what an individual's true score would be if they took the instrument repeated times. The SEM is the standard deviation of an individual's repeated test scores when administered the same instrument multiple times. The SEM is inversely related to reliability in that the larger the SEM, the lower the reliability of a test. If the reliability coefficient is 1.00, the SEM = 0. The SEM is often reported in terms of confidence intervals, which define the range of scores within which the true score is thought to lie. Counselors can use the SEM to determine the range within which a person's score would be expected to fall 68%, 95%, or 99.5% of the time.

TRIN VRIN

True Response Inconsistency Scale (TRIN) Measures a client's pattern of inconsistent responding to pairs of items of opposite content. Variable Response Inconsistency Scale (VRIN) Measures a client's pattern of inconsistent responding to pairs of items nearly identical in content. Inconsistent responding may mean the client is not paying attention, not taking the task seriously, or doesn't comprehend the item meanings.

Validity Scales

True Response Inconsistency Scale - Distinguish between opposite answers Variable Response Inconsistency Scale - Measures pattern of inconsistent answers on similar questions

Validity (test construction) - CONTENT VALIDITY

Universe or domain. No correlation r to report. Evidence is based on expert judgment of the degree to which the items adequately represent the construct domain of interest. A taxonomy often determines the weight of an item. Item Analysis: item easiness (i.e., spiral items) and item discrimination. Ask: Do the items appear to represent the thing you are trying to measure? Does the set of items underrepresent the construct's content (i.e., have you excluded any important content areas or topics)? Do any items represent something other than what you are trying to measure (i.e., have you included any irrelevant items)?

Norm-referenced Tests VS Criterion-referenced Tests

Norm-referenced tests are used simply to rank-order examinees along some achievement continuum and to compare a student's performance to an established norm (standardized tests). Norm-referenced scores (scaled scores) compare the scores of one student or group of students to another, or the same student or group at different points in time. Scores are reported with a confidence band around the institutional mean scores.
Criterion-Referenced Tests make test scores meaningful without indicating the test taker's relative position in a group. On a criterion-referenced test, each individual test taker's score is compared with a fixed standard, rather than with the performance of the other test takers. Sometimes called domain referenced.

Confidence Interval (Reliability)

Used to estimate the degree to which an obtained test score differs from a true test score. The SEm estimates how repeated measures of a person on the same instrument tend to be distributed around their "true" score. The true score is always unknown. The SEm is inversely related to the reliability of a test: the larger the SEm, the lower the reliability. Using the 68% confidence level, for example, if a child receives an intelligence test score of 115 with a SEm of three (3) points, there is a 68% probability that the child's true score falls within the range of 112 to 118. It would not be appropriate to select the highest or lowest numbers within that range as the best estimate of the child's true score.

Vertical Test VS Horizontal Test

A VERTICAL TEST has versions for various age brackets or levels of education (e.g., a math achievement test with one version for preschoolers and another for middle-school children). A HORIZONTAL TEST measures various factors (e.g., math and science) during the same testing procedure.

Validity

Validity refers to how accurately an instrument measures a given construct. Validity is concerned with what an instrument measures, how well it does so, and the extent to which meaningful inferences can be made from the instrument's results.

Variance

Variance - the average squared deviation of scores from the mean. The variance is the standard deviation squared; the standard deviation is the square root of the variance. The greater the SD, the greater the spread. If everyone scored the same on the test, the SD would be zero.

Test Battery

When a number of specific tests are used together to predict a single criterion. A test battery is considered a horizontal test because several measures are used to produce results that could be more accurate than those derived from merely using a single source.

Leniency Error

When a person is rated higher on an item than they should be.

Sten

A 10-division of the normal curve ("standard ten"). Range = 1-10; Mean = 5.5; s.d. = 2.

Reliability - To maximize the reliability of a test, you would want test items that are _________ in terms of the domain being assessed by the test, and subjects who are _________ in terms of their true scores on the test.

homogeneous; heterogeneous. The more homogeneous the scores of the examinees, the lower the correlation coefficient, meaning the lower the degree of association between two or more variables. Therefore, the test should have subjects who are heterogeneous in terms of their true scores.

Base Rate

- RATIONALE - The term "base rate" refers to the percentage of individuals who are considered successful on the criterion before a predictor test is used. In an employment setting, it would refer to the percentage of individuals who are successful on a job before a predictor test is used.

Reliability (testing) - 4 ways to measure reliability

1. Test-Retest Reliability (Stability): same group, same test, 2 administrations. This is the BEST kind; 2 weeks is a good interval between tests.
2. Equivalent Forms Reliability: 2 forms of the same test.
3. Internal Consistency Reliability: the test is administered once and comparisons of items are conducted within the test. a) Split-half reliability - the test is divided into 2 halves and the correlation between the 2 halves is calculated. Because you reduce the length of the test, the reliability is reduced. You may apply the Spearman-Brown formula to estimate how reliable the test would be had you not split it. b) Inter-item consistency - the more homogeneous the items, the more reliable the test. The Kuder-Richardson formulas are used if the test contains dichotomous items (true/false, yes/no). If the instrument contains non-dichotomous items (multiple choice, essay), the Cronbach Alpha Coefficient is applied.
4. Inter-scorer Reliability: the consistency or degree of agreement between two or more scorers.

Shared Variance (Variance Accounted For) Two different tests are administered to 50 students. When the scores on the tests are correlated, a coefficient of .49 is obtained. This means that approximately ___% of the variability in the two tests is shared in common.

25%. To find the percentage of shared variance between 2 variables, simply square the correlation coefficient: the square of .49 (.49 X .49, or, rounding off, about .5 X .5) is approximately .25. The VARIANCE ACCOUNTED FOR is the square of the correlation coefficient. The unaccounted variance is the variance accounted for subtracted from 100%. The COEFFICIENT OF DETERMINATION is another term for Variance Accounted For.
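
The same arithmetic as a Python sketch (the .49 coefficient comes from the question above):

    r = 0.49                      # correlation between the two tests
    print(round(r ** 2, 4))       # 0.2401 -- ~24%, or ~25% when r is rounded to .5
    print(round(1 - r ** 2, 4))   # 0.7599 -- variance left unaccounted for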

Purposes of Testing

4 Purposes of testing: 1. Screening/Assessing the client's problems - Beck Anxiety Inventory. 2. Diagnosis/Defining the client's problems - MMPI-2-RF; Rorschach. 3. Treatment Planning - therapeutic goals. 4. Progress Evaluation - over the last month or last 6 months; review progress over time.

Percentile Rank In a normally-shaped distribution, the percentile rank equivalent of a z-score of +1 is approximately:

84% A percentile rank is a transformed score that indicates the percentage of scores falling below the corresponding raw test score. The mean represents the 50th percentile. A z-score of +1 will correspond to one standard deviation (about 34%) above the mean, or the 84th percentile.
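
A one-line check using the normal cumulative distribution (Python sketch; assumes scipy is available):

    from scipy.stats import norm

    # Percentile rank = proportion of the normal curve falling below the z-score
    print(f"{norm.cdf(1.0):.1%}")   # 84.1% -- roughly the 84th percentile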

Correlation - Types of Correlation

>PEARSON r: when BOTH variables are continuous (measured on ratio or interval scales). The relationship between the 2 variables must be linear. Most stable measurement of correlation; frequently used to determine the reliability and validity of certain traits such as personality. Most common correlation coefficient. Example: scores on a pretest and posttest.
>SPEARMAN RHO: used to calculate a correlation coefficient when 1 or more variables are measured on an ordinal (ranking) scale. Example: using GPA (continuous scale) to rank students (ordinal scale).
>POINT BISERIAL: when 1 variable is dichotomous (nominal) and 1 variable is continuous (interval or ratio). Example: gender (male/female; nominal-dichotomous) and GRE score (continuous).
>PHI COEFFICIENT: both variables are nominal dichotomies. Example: gender (male/female) and smoking status (smoker/nonsmoker).
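
A Python sketch showing where each coefficient applies (all data are simulated, so the printed values are illustrative only):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(size=50)                   # continuous (interval/ratio)
    y = x + rng.normal(scale=0.5, size=50)    # second continuous variable
    sex = rng.integers(0, 2, size=50)         # dichotomous (0/1) variable
    smoker = rng.integers(0, 2, size=50)      # second dichotomous variable

    r, _ = stats.pearsonr(x, y)               # Pearson r: both continuous
    rho, _ = stats.spearmanr(x, y)            # Spearman rho: rank/ordinal data
    rpb, _ = stats.pointbiserialr(sex, x)     # point biserial: one dichotomy
    phi, _ = stats.pearsonr(sex, smoker)      # phi: Pearson r on two dichotomies
    print(round(r, 2), round(rho, 2), round(rpb, 2), round(phi, 2))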

Scales

>Unidimensional Scale - measures only one thing; you either have it or you don't; numbers assigned to only 1 dimension (e.g., BDI).
>Multidimensional Scale - multiple dimensions (e.g., doing well in English but not in math; is anxiety caused by depression or a chemical imbalance?).
>Likert Scale - offers a range; no comparing.
>Rating Scale - ranking a particular item (e.g., a scale of 1-10, 10 being best).
>Comparative Scale - this one or that one, worst to best; rank, sort.
>Guttman Scale - unidimensional; an ordinal scaling method arranged as a hierarchy; you agree with everything below your answer (if you answer yes to #4, then you also answered yes to #1-3).

Cognitive Ability Tests

APTITUDE TESTS: predict a person's capacity to perform some skill or task in the future; "intended to predict success in some occupation or training course" (e.g., GRE).
ACHIEVEMENT TESTS: measure what you know (e.g., NCE).
INTELLIGENCE TESTS: measure the ability to learn, solve problems, and understand increasingly complex information (e.g., WAIS, Stanford-Binet).

Validity - Threats to Internal Validity (design type error)

ATTRITION - when subjects drop out during the course of the study.
SELECTION - group differences exist due to a lack of random assignment.
EXPERIMENTER EFFECTS - bias of the investigator influences participant responses; the administrator's behavior or appearance affects the subjects' performance (Halo effect, Hawthorne effect, Reactive effect). (USE MORE THAN ONE EXPERIMENTER FOR INTER-RATER RELIABILITY)
SUBJECT EFFECTS - participants change their behaviors or attitudes based on their understanding of their role as participants. Participants will pick up cues (demand characteristics).

Percentile Rank

An individual's score can be compared to a group (norm group) already examined. The individual's percentile rank indicates what percentage of individuals in that group scored at or below this individual. Graphically, it is the percentage of area under the curve (or in a histogram) that lies to the left of the score. This score type is easily confused and unfortunately is widely misused, despite its popularity. Percentiles are an ordinal or rank-order scale of measurement, rather than an equal-interval scale. That means one cannot subtract or average percentile scores in order to represent growth or change.

Validity (test construction) - CONCURRENT

CONCURRENT VALIDITY (Criterion Validity) The results of the test are compared with other tests' results or behaviors (criteria) at or about the same time. Extent to which test may be used to estimate an individual's present standing on the criterion (status quo). For example: Scores of an art aptitude test may be compared to grades already assigned to students in an art class.

Validity (test construction) - CONSTRUCT Validity

CONSTRUCT VALIDITY. Correlation r. The extent to which a test measures an abstract trait, psychological notion, or theoretical construct such as intelligence, self-esteem, feelings, or artistic talent. Used for personality theory & non-cognitive tests. A construct is any trait you cannot "directly" measure or observe.

Variables - Types of Variables

CONTINUOUS Variable: measured on a scale that changes gradually, as though there are divisions between the steps. Examples are temperature, distance, and scores on the GRE. If one were to compare two individuals in height, even if they are very, very close, it is always possible to find someone in between the two.
DISCRETE Variables: are of a finite value and can assume only certain values. The number of puppies in a litter represents a discrete variable for that litter.
DICHOTOMOUS Variables: have two levels, such as yes/no, in/out, religious/areligious, etc. Discrete and dichotomous variables are usually described as whole numbers, while continuous variables have divisions between the whole numbers.

Reliability > You can understand and assess reliability in the following ways:

CORRELATION - the degree of agreement between two sets of scores. It reveals how closely two things vary together and how well one predicts the other. Interpreted in terms of variance.
STANDARD ERROR OF MEASUREMENT - the range of fluctuation of one's scores as a result of chance errors; how much a person's score would vary if tested again with the same test. This is used to develop confidence intervals around the test taker's obtained score.

Coefficient of Determination (Variance accounted for) > Reliability

CORRELATIONAL PROCEDURES The Coefficient of Determination tells us how much of the variance in the scores of one variable can be understood or explained by the scores on a second variable. When 2 variables are related or correlated with each other, there is a certain amount of shared variance between them. The stronger the correlation, the greater the amount of shared variance, and the higher the coefficient of determination. The Coefficient of Determination (R squared) is the square of the correlation coefficient. If we find a correlation of .80 between graduate school GPA and undergraduate GPA, then 64% (.80 squared) of the variance in graduate school GPA is associated with variance in undergraduate GPA. The remaining 36% is attributed to other factors such as motivation, study habits, course load, etc.

Regression Analysis Multiple Regression

CORRELATIONAL PROCEDURES When correlational analysis indicates some degree of relationship between 2 variables, we can use the information about one of them to make predictions about the other. Example: using GRE scores to predict success in graduate school. The GRE scores are the predictor variable (X) and the GPA in graduate school (Y) is the criterion (the variable to be predicted). The potential for error is taken into account, and the standard error of estimate is used in the process. Multiple Regression can be used to predict a criterion score from 2 or more predictor variables (IVs). An example would be if a graduate program used several variables, such as undergraduate GPA, GRE scores, ratings on recommendations, and scores on interviews, to predict success in graduate school. Another example would be insurance companies that establish rates based on age, sex, and driving record.

Validity (test construction) - CRITERION VALIDITY

CRITERION VALIDITY. Evidence based on relations to other variables: relating test scores to relevant criteria, i.e., the degree to which scores from a test can be used to predict performance on some criterion, such as another test or future performance. 2 types: Predictive and Concurrent.

Internal Consistency

CRONBACH'S COEFFICIENT ALPHA - used when scoring multi-point responses (not dichotomous). Takes into consideration the variance of each item. Alternative to the Split-Half Method.
SPEARMAN-BROWN FORMULA - compensates mathematically for the shorter length created by split-half reliability. The formula is used to estimate the impact that lengthening or shortening a test will have on a test's reliability coefficient.
KUDER-RICHARDSON FORMULAS - used when test items are dichotomous. Alternative to the Split-Half Method. KR-20: heterogeneous; measures multiple domains (an estimate of all possible split-half reliabilities). KR-21: homogeneous; measures one domain.
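
A minimal sketch of coefficient alpha (Python/numpy, simulated data; KR-20 is this same computation when every item is scored 0/1):

    import numpy as np

    def coefficient_alpha(items: np.ndarray) -> float:
        """Cronbach's alpha for an examinees-by-items matrix of item scores."""
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)       # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    rng = np.random.default_rng(1)
    ability = rng.normal(size=(200, 1))             # shared trait
    items = ability + rng.normal(size=(200, 6))     # 6 inter-correlated items
    print(round(coefficient_alpha(items), 2))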

Confidence Band (Reliability)

The Confidence Band refers to the closeness between the true score and the degree of error; the Pearson r is used to determine it. Confidence Bands are reported on a test profile to point out the differential validity of the scores, and they reflect the error variance for any single testing or test.

Validity (test construction) - CONVERGENT & DISCRIMINANT (Construct Validity)

Convergent Validation occurs when there is high correlation between the construct under investigation and others. The idea is that you want your test (the one you are trying to validate) to strongly correlate with other measures of the same thing. Discriminant Validation occurs when there is no significant correlation between the construct under investigation and others: evidence that the scores on your test are not highly related to the scores from other tests that are designed to measure theoretically different constructs. You want your test to correlate with other measures of the same construct (convergent evidence), but you also want it NOT to correlate strongly with measures of other things (divergent evidence).

Standard Score

Descriptive statistics allows for the development of standard scores, which are described and interpreted by using the appropriate mean and s.d. in relation to the normal curve. Any set of scores in which the mean and standard deviation are known. Types: T-score, Z-score, Stanine, Sten.

Error Variance (Reliability)

ERROR VARIANCE is closely associated with determining RELIABILITY. Different types of error influence the consistency or stability of a test score. The 2 most common types of errors are Systematic and Random/Unsystematic. Systematic - constant, predictable, proportionate to the true value; systematic errors are often due to a problem which persists throughout the entire experiment. Example: the cloth tape measure that you use to measure the length of an object had been stretched out from years of use. Random/Unsystematic Error - variable and unpredictable; e.g., in test administration, the examinee is tired, sick, or has low motivation; the examiner gave extra help; test scores are higher if a different question is asked. These types refer to the test itself, the administration, and the examinee, all of which affect reliability. Upon any testing, the subject achieves an OBSERVED SCORE, and this score is made up of a TRUE SCORE and RANDOM/UNSYSTEMATIC ERROR.

SEM Example

GRE Verbal Score = 430; Mean = 500; SD = 100; Reliability Coefficient = .91
SEM = 100 × √(1 - .91) = 100 × √.09 = 100 × .30 = 30
68% of the time the score would fall between 400 (-30) and 460 (+30).
95% of the time the score would fall between 370 (-60) and 490 (+60)... 2 standard errors of measurement (2 x 30 = 60).
99.5% of the time the score would fall between 340 (-90) and 520 (+90)... 3 SEM (3 x 30 = 90).
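
The same computation as a Python sketch (all numbers are taken from the card above; the 99.5% label follows this deck's convention for ±3 SEM):

    import math

    score, sd, r = 430, 100, 0.91
    sem = sd * math.sqrt(1 - r)                  # 100 * sqrt(.09) = 30

    for mult, band in ((1, "68%"), (2, "95%"), (3, "99.5%")):
        lo, hi = score - mult * sem, score + mult * sem
        print(f"{band}: {lo:.0f} to {hi:.0f}")   # 400-460, 370-490, 340-520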

You want to assess the effectiveness of a community intervention initiative aimed at educating young people about the dangers of drugs. Your assessment will most likely be:

Generally, the effectiveness of community-based programs is best assessed by performing EVALUATION RESEARCH (i.e., a program evaluation). Program evaluations can include aspects of the other three choices (single subject, experimental, longitudinal) - for example, you may incorporate a longitudinal research design to study the impact of a community program; however, you would do this in the context of a larger program evaluation.

Validity - Threats to Internal Validity (design type error)

HISTORY - unexpected events which occur during the experiment that affect the DV (extraneous effects).
MATURATION - biological and psychological changes in the subject, such as fatigue or hunger (temporal effects).
TESTING - familiarity with the test content from having taken the test before (occurs in pretest/posttest studies).
STATISTICAL REGRESSION - extreme scores tend to regress toward the mean on retesting.
INSTRUMENTATION - changes in the instrument can affect results (paper-pencil vs. computerized).

Tests - Types

INTELLIGENCE TESTS: measure one's ability for learning. Intelligence is the ability to think in abstract terms; to learn. Examples: Stanford-Binet, Wechsler Adult Intelligence Scale (WAIS-III), Wechsler Intelligence Scale for Children (WISC-IV), Cognitive Abilities Test, ACT, SAT, GRE, Miller Analogies Test (MAT).
ACHIEVEMENT TESTS: measure the effects of learning or a set of experiences. Examples: Stanford Achievement Test, GED, College-Level Examination Program (CLEP).
APTITUDE TESTS: measure the effects of general learning and are used to predict future performance; also called ability tests. Examples: Differential Aptitude Test (DAT), O*NET Ability Profiler (formerly General Aptitude Test Battery), Career Ability Placement Survey (CAPS).

Item Analysis > Item Difficulty AKA Item Easiness AKA p-value > Item Discrimination

ITEM DIFFICULTY - calculated by dividing the number of people tested who answered the item correctly by the total number of people tested. It is a measure of the proportion of examinees who answered the item correctly; for this reason it is frequently called the p-value. In most tests, the item difficulty level is set at .5 (i.e., 50% of the examinees will answer correctly while 50% will not). Item difficulty ranges from 0.0 to 1.0.
ITEM DISCRIMINATION - after the students are arranged with the highest overall scores at the top, count the number of students in the upper and lower groups who got each item correct. Then determine the Difficulty Index by dividing the number who got it correct by the total number of students. Then determine the Discrimination Index, which is calculated by subtracting the number of students in the lower group who got the item correct from the number of students in the upper group who got the item correct, then dividing by the number of students in each group (see the sketch below). The possible range of the discrimination index is -1.0 to 1.0; however, a discrimination below 0.0 suggests a problem: when an item discriminates negatively, the most knowledgeable examinees are, overall, getting the item wrong and the least knowledgeable examinees are getting it right.
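
A sketch of both indices with made-up 0/1 responses (Python; upper and lower groups are assumed equal-sized, as in the procedure above):

    # 1 = answered the item correctly, 0 = answered incorrectly
    upper = [1, 1, 1, 0, 1]   # item responses from the top-scoring group
    lower = [0, 1, 0, 0, 1]   # item responses from the bottom-scoring group

    n = len(upper)                                     # examinees per group
    difficulty = (sum(upper) + sum(lower)) / (2 * n)   # p-value: 0.60
    discrimination = (sum(upper) - sum(lower)) / n     # D index: 0.40
    print(difficulty, discrimination)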

A counselor conducts a study to determine the effectiveness of ongoing group academic counseling on the GPA of a group of high school students. The study begins in September and ends in June. To which threat to the validity of an experiment is this study particularly susceptible: A. Regression B. Demand characteristics C. History D. Reactivity

In research, HISTORY effects refer to external factors that cause changes in the dependent variable. During the nine months, the high school students will experience a variety of other events (e.g., attendance in class, pressure from parents, etc.) besides academic counseling. These events, rather than the academic counseling, could be the cause of any observed changes in the students' GPA.

Validity - Incremental Validity Synthetic Validity

Incremental Validity - used to describe the process by which a test is refined and becomes more valid as contradictory items are dropped. Incremental Validity also refers to a test's ability to improve predictions when compared to existing measures that purport to facilitate selection in business or educational settings. A test's incremental validity is the improvement in predictive accuracy that results when a new predictor test is used.
Synthetic Validity - popularized by industrial/organizational psychologists, especially for smaller firms that do not hire a large number of workers. The helper or researcher looks for tests that have been shown to predict each job element or component (e.g., typing, filing, etc.). Tests that predict each component (criterion) can then be combined to improve the selection process.

Kurtosis Mesokurtic Leptokurtic Platykurtic

KURTOSIS- refers to the peakedness or flatness of a distribution MESOKURTIC- normal distribution/ normal curve LEPTOKURTIC- distribution is taller, skinnier, and has a greater peak than the normal curve. PLATYKURTIC- distributions are flatter and more spread out. The number of persons scoring very high, very low, and in the average range would be similar.

Normal Curve

The baseline of the curve is referred to as the z-line; it has a mean of 0 and an SD of 1. Note: ±1 SD = 68% of the curve (34 + 34); ±2 SD = 95% of the curve (34 + 34 + 13.5 + 13.5); ±3 SD ≈ 99.7% of the curve (95 + 2.35 + 2.35). The normal curve can be established for any set of data when the mean and SD are known.

Measures of Central Tendency

MODE - the most frequently occurring score - nominal data - unstable and rarely used.
MEDIAN - the middle score - ordinal, interval, and ratio data - least affected by skew - used to report the average score in a skewed distribution.
MEAN - the arithmetic average - interval and ratio data - most affected by skew - always pulled toward the skew (the tail).

Validity - Threats to External Validity (the generalizability or representativeness of the experimental findings)

MULTIPLE TREATMENT - simultaneous application of multiple interventions (administering more than one treatment consecutively to the same subject). (ASSIGN EACH SUBJECT ONLY ONE TREATMENT)
REACTIVITY - subjects alter their behavior because they are aware that they are being observed.
HAWTHORNE EFFECT - if subjects know they are part of an experiment, their performance sometimes improves; having knowledge of participation alters the subject's response. (HAVE AN IRRELEVANT PLACEBO CONTROL GROUP)
EXPERIMENTER EFFECT - the administrator's behavior or appearance affects the subject's performance. (USE MORE THAN ONE EXPERIMENTER FOR INTER-RATER RELIABILITY)

Assessment may be:

Norm referenced: comparing individuals to others who have taken the test before. Norms may be national, state, or local. How you compare with others is more important than what you know.
Criterion referenced: comparing an individual's performance to some predetermined criterion which has been established as important; sometimes called domain referenced. The NCE cut-off score is an example. For the CPCE, university programs are allowed to determine the criterion (cut-off score).
Ipsatively interpreted: comparing the results on the test within the individual, for example, looking at an individual's highs and lows on an aptitude battery, or comparing an individual's score on a second test to the score on the first test.
A maximal performance test may generate a person's best performance on an aptitude or achievement test, while typical performance may occur on an interest or personality test.

Tests - Types

PERSONALITY TESTS: personality is the dynamic product of genetic factors, environmental experiences, and learning, including traits and characteristics.
PROJECTIVES - these tests present a relatively unstructured task or stimulus: Rorschach, Thematic Apperception Test (TAT), Draw-A-Person Test.
INVENTORIES - MMPI, NEO Personality Inventory, Beck Depression Inventory, Myers-Briggs Type Indicator.
INTEREST TESTS: the preferences, likes, and dislikes of an individual, and more broadly, values. Examples: Strong Interest Inventory, Self-Directed Search, Career Assessment Inventory, O*NET Interest Profiler.

Validity (test construction) - PREDICTIVE

PREDICTIVE VALIDITY (Criterion Validity). Based on the relationship between test scores collected at one point in time and criterion scores obtained at a later time; the extent to which a future level of performance can be predicted from knowledge of prior test performance. Also known as "empirical validity." For example: scores on the GRE predict later grade point average.

Skewed distributions

Positively Skewed (right): Mode < Median < Mean. Negatively Skewed (left): Mean < Median < Mode.
Positively Skewed - the majority of scores are at the left or low side of the distribution. Graphically, the tail of the distribution points to the right or positive side. A hard test on which most scored poorly would be positively skewed.
Negatively Skewed - the majority of scores are at the right or higher end of the distribution. Graphically, the tail points to the left. A test that was too easy, on which most scored highly, would be negatively skewed.
The MODE is the top of the curve; the MEDIAN is the middle score; the MEAN is pulled in the direction of the extreme scores represented by the tail.
The x-axis is used to plot the IV scores and is also known as the ABSCISSA (e.g., income, year, price range, length of stay, type of advertising, method of teaching, size, duration, category, age group). The y-axis is used to plot the frequency of the DV and is also known as the ORDINATE (e.g., number of people, probability, number of observations).

Quartile

Quartile is a useful concept in statistics and is conceptually similar to the median. The first quartile is at the 25th percentile: 25% of the data is smaller than the first quartile and 75% of the data is larger. Similarly, in the case of the third quartile, 25% of the data is larger than it while 75% is smaller. The second quartile, Q2, is nothing but the median: 50% or half of the data is smaller while half of the data is larger than this value. The Quartile Deviation is based on the range of the middle 50 percent of the scores. The middle 50 percent of a set of scores is called the inter-quartile range, and the quartile deviation is simply half of this range. The inter-quartile range (middle 50%) is bounded by the 75th percentile and the 25th percentile; these points are called quartiles and are indicated by Q3 and Q1 respectively.
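
A short Python/numpy sketch of the quartiles and the quartile deviation (the score list is invented):

    import numpy as np

    scores = [55, 60, 62, 65, 70, 72, 75, 80, 85, 90]
    q1, q2, q3 = np.percentile(scores, [25, 50, 75])

    iqr = q3 - q1     # inter-quartile range: the middle 50% of scores
    qd = iqr / 2      # quartile deviation: half the inter-quartile range
    print(q1, q2, q3, qd)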

Measures of Variability

RANGE (Span) - the highest score minus the lowest score; the Inclusive Range is the highest minus the lowest, plus 1.
STANDARD DEVIATION - most often used; describes how scores vary around the mean; used to describe the variability within a distribution of scores.
VARIANCE - simply the square of the SD.
TO CALCULATE VARIANCE: 1. Find the mean. 2. Subtract the mean from each individual score and square each difference. 3. Sum all the squared products. 4. Divide the sum of the squares by the number of scores. (See the sketch below.)
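
The four steps as a Python sketch (illustrative scores):

    scores = [4, 8, 6, 5, 7]

    mean = sum(scores) / len(scores)                # step 1: mean = 6.0
    squared = [(x - mean) ** 2 for x in scores]     # step 2: squared deviations
    variance = sum(squared) / len(scores)           # steps 3-4: 10 / 5 = 2.0
    sd = variance ** 0.5                            # SD = sqrt(variance) ≈ 1.41
    print(variance, sd)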

Measure of Variability

RANGE, VARIANCE, STANDARD DEVIATION. Example: Variance = 25; SD = √25 = 5. The variance is the average squared deviation from the mean, in "squared units." The standard deviation is the square root of the variance (i.e., it brings the "squared units" back to regular units). The standard deviation tells you (approximately) how far the numbers tend to vary from the mean. (If the standard deviation is 7, then the numbers tend to be about 7 units from the mean.)

Reliability question If internal consistency is of concern, what reliability coefficient will most likely be used?

RATIONALE- COEFFICIENT ALPHA is a type of internal consistency reliability coefficient. Determining a test's internal consistency reliability by the coefficient alpha involves giving a test once to a single group of examinees. A special formula is used to determine the degree of inter-item consistency.

Validity - Threats to External Validity (the generalizability or representativeness of the experimental findings)

ROSENTHAL EFFECT (Experimenter's expectancy)- the experimenter's beliefs about the individual may cause the individual to be treated in a special way so that the individual begins to fulfill the experimenter's expectation. HALO EFFECT - occurs when a trait which is not being evaluated (e.g. attractiveness) influences a researcher's rating on another trait (counseling skill). (NOTE BIASES) NOVELTY EFFECT - a new treatment produces positive results just because it is novel to participants; gains diminish over time. (EXTEND TREATMENT OVER TIME)

RUST-3

RUST-3 (Responsibilities of Users of Standardized Tests - Third Edition) -Addresses the issues of test user qualifications, technical knowledge, test selection, test administration, test scoring, interpreting test results, and communicating test results. -Purpose is to assist counselors and other educators in implementing responsible testing practices.

Approaches to Test Construction

Rational/Logical Construction (aka Theoretical Approach) - relies on reason and logic, rather than data and statistical analysis, to create items. Called the Theoretical Approach because test developers theorize that the items are related to the constructs they are attempting to measure. If we believe depression is related to being aggressive, then all questions will be about aggression; there is no theoretical or rational reason to ask about anything else.
Empirical Construction (Criterion Construction) - items are developed atheoretically (even randomly); the developer relies on data collection to identify which items to keep.
Bootstrap Approach/Sequential Method - a combination of the rational and empirical approaches; items are written based on theory (instead of randomly), and then empirical procedures are used to verify that the items measure the construct they are theorized to measure.

Reliability causes the ________ limits of validity.

upper. Reliability sets the upper limit of validity because a validity score cannot exceed the reliability score. Reliability is required to make statements about validity. In order to make causal assessments in a research situation, one must first have reliable measures (i.e., stable and/or repeatable measures). If the random error variation in a measurement is so large that there is almost no stability in the measurements, one cannot explain any cause. A researcher wants measures to be reliable and valid so that statements about causality will be appropriate and the findings can be generalized.

Standard Error of Measurement Standard Error of Estimate

Reliability is to the Standard Error of Measurement as validity is to the Standard Error of Estimate. Standard Error of Estimate - derived from examining the difference between our predicted value of the criterion and the person's actual score on the criterion. The difference is called the prediction error, or residual. A large SEest indicates that we are typically not very accurate in our predictions of a person's eventual performance on the criterion measure.

Measurement

Results in answers to quantitative questions. The test score (a quantity) is used to identify a numeric answer (usually expressed as a raw score or percentage correct). Qualitative interpretation infers how well the person did in comparison to a group.

SEM Example: A test of computer skill has a reliability coefficient of .75, a mean of 100, and a variance of 16. What is the test's standard error of measurement?

SEM = SD × √(1 - r). The square root of the variance (16) gives SD = 4; substituting the SD and the reliability coefficient (.75) produces: SEM = 4 × √(1 - .75) = 4 × √.25 = 4 × 0.5 = 2

SEM Formula

SEM = SD × √(1 - r), i.e., the SD multiplied by the square root of (1 minus the reliability coefficient).

Scorer or Inter-rater Reliability

Scorer or inter-rater reliability compares the degree of difference in scoring or judgments given by more than one observer and is an important measure when interpretations are involved. A reliability index of at least .80 is desired.

Correlation Coefficient

Shows the relationship between 2 sets of numbers. A correlation coefficient tells you nothing about cause and effect, only the degree of the relationship. The Pearson Product-Moment Correlation Coefficient (r) is frequently used. In general, with a larger number of observations, smaller correlation coefficients can reach statistical significance; with fewer observations, a larger coefficient is needed. To test whether a correlation coefficient is statistically significant, the researcher starts with the null hypothesis that there is no relationship between the 2 variables, i.e., that the correlation coefficient equals zero. The alternative hypothesis is that there is a statistical relationship between the 2 variables and that the correlation coefficient is not equal to zero. So you are testing whether the correlation coefficient is statistically different from 0.
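
A minimal significance-test sketch (Python with scipy; the data are simulated, so r and p are illustrative):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    x = rng.normal(size=30)
    y = 0.5 * x + rng.normal(size=30)

    r, p = stats.pearsonr(x, y)   # r: strength/direction; p: test of H0 that r = 0
    print(f"r = {r:.2f}, p = {p:.4f}")   # reject H0 (no relationship) if p < .05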

Spearman Brown formula

Spearman-Brown formula — a statistical formula used for correcting the split-half reliability coefficient (because of the shortened test length created by splitting the full-length test into two equivalent halves). It is used to estimate the impact that lengthening or shortening a test will have on a test's reliability.
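
A sketch of the formula (Python; k is the factor by which the test length changes, so k = 2 corrects a split-half coefficient back to full length):

    def spearman_brown(r: float, k: float) -> float:
        """Predicted reliability when test length changes by a factor of k."""
        return (k * r) / (1 + (k - 1) * r)

    print(round(spearman_brown(0.70, 2), 2))     # split-half .70 -> full-length 0.82
    print(round(spearman_brown(0.60, 0.5), 2))   # halving a .60 test -> 0.43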

SEM- Confidence Interval - What is the confidence interval and how does the SEM relate to it?

Statements about an examinee's obtained score (the actual score that is received on a test) are couched in terms of a CONFIDENCE INTERVAL — a band, interval, or range of scores that has a high probability of including the examinee's "true" score. Depending on the level of confidence one may want to have about where the "true" score may lie, the confidence band may be small or large. Most typical confidence intervals are 68%, 90%, or 95%. Respectively, these bands may be interpreted as the range within which a person's "true" score can be found 68%, 90%, or 95% of the time. It is not possible to construct a confidence interval within which an examinee's true score is absolutely certain to lie. A person scores a 92 on a test. The test's SEM=5.0. Chances are about 2 in 3 (68%) that the person's score falls between 87 and 97. And 95% of the time the person's score would fall within the range of 82 and 102.

Testing VS Assessment VS Appraisal

TESTING is the process of measuring variables by means of devices or procedures designed to obtain a sample of behavior. A TEST can be defined as a systematic method of measuring a sample of behavior. Test format refers to the manner in which test items are presented. The term "MEASURE" merely connotes that a number or score has been assigned to the person's attribute or performance. ASSESSMENT is the gathering and integration of data for the purpose of making an educational evaluation, accomplished through the use of tools such as tests, interviews, case studies, behavioral observation, and specially designed apparatus and measurement procedures. APPRAISAL can be defined as the process of assessing or estimating attributes.

Pearson r

The Pearson r is often used to estimate reliability, which sets the upper limit of validity (a validity score cannot exceed the reliability score). The Pearson r investigates relationships among variables; it measures only correlation, not cause-and-effect relationships. The Pearson r is used for interval and ratio measures; the Spearman rho is used for ordinal data.

PRCS - When conducting a behavioral assessment, a counselor would use the PRCS to:

The Psychological Response Classification System (PRCS) is an alternative to the DSM. It is designed to classify responses instead of people. Responses on the PRCS are divided into five broad categories: motor, biological, cognitive, emotional, and social.

SEM

The SEM is another measure of reliability, also referred to as the Confidence Band. It indicates how much a person's score would vary if tested again with the same test; it is used when reliability is considered for a single administration to an examinee, and it is used to develop confidence intervals around an individual's obtained score. Every test has its own unique value of the SEM, which is calculated in advance. The SEM represents the spread of an individual's observed scores around the true score, revealing how far an individual's score may vary from the true score over repeated testings. A test with a low SEM has higher reliability, and a test with a high SEM has lower reliability.

Validity - Internal Validity refers to:

The certainty of a cause and effect relationship between the independent and dependent variable

Item Discrimination

The degree to which an item differentiates correctly among test takers in the behavior that the test is designed to measure; it determines whether the item can discriminate the learner from the non-learner. The item analysis results can be utilized to arrange the test in a spiral form. A Positively Discriminating Item is answered correctly most often by individuals who perform well on the test; a Negatively Discriminating item is answered correctly by those who perform poorly on the test.

Square Roots List

√.01 = .1 √.04 = .2 √.09 = .3 √.16 = .4 √.25 = .5 √.36 = .6 √.49 = .7 √.64 = .8 √.81 = .9 √1 = 1 √ 4 = 2 √ 9 = 3 √ 16 = 4 √ 25 = 5

