Assessment Exam 2 Class Notes


types of validity

face, content, criterion-related, construct

distribution of test scores

Look at the scores visually to see whether the distribution is skewed, which tells you whether easy or hard questions need to be added.

strength of relationship as it exists in nature

Many relationships simply exist in nature as natural phenomena; IQ scores and reading scores, for example, generally go together. The stronger the relationship, the higher the r.

observed score=

score you got on test

true score=

score you should have gotten on the test

purpose of screening=

see who needs further testing for diagnosis

sensitivity=formula

successes (AKA true positives) / (successes + false negatives)

criterion-related validity

the efficiency of a test in predicting behavior under specified circumstances

sensitivity=

the percentage of actual positives (successes) that the test correctly identifies

indices of item discrimination

Computer-generated statistics; over 50 indices exist.

power tests and reliability

Measures competence with no effect of speed; item difficulty is graded so some items are easy and some very difficult. All of the standard types of reliability apply to power tests.

for reliability what correlation number do you want to be higher than for screening

.80

when selecting cut scores, what percentage of specificity and sensitivity do you want

80%

whole interval recording

An observer marks an interval on the recording sheet only when the behavior occurs throughout the entire interval.

auditory perception

Auditory discrimination Auditory figure-ground

examples of behavioral inventories

Conners scales for ADHD (adult and child versions); ASRS: Adult ADHD Self-Report Scale; SNAP-IV: teacher and parent rating scale for ADHD, ODD, and conduct d/o; Wender Utah Rating Scale for ADHD

specific-group norm approach

Tests that break out norms for specific groups: adult norms, norms for women, norms for different ages or ethnicities. Norms can be broken out in any way needed if statistics show there are between-group differences. (By contrast, the popular approach develops a test within one culture and administers it to groups from different cultures.)

Various definitions of intelligence have emphasized at least one of the following components:

Origin Structure Function

group administered intelligence tests

Otis-Lennon School Ability Test Multidimensional Aptitude Battery—II Wonderlic Personnel Test

Cattell

Raymond Cattell proposed that intellectual abilities could be divided into two broad categories: 1. Fluid abilities 2. Crystallized abilities

individual screening intelligence tests

Slosson Intelligence Test (SIT-4) Wechsler Abbreviated Scale of Intelligence

formula for reliability r=

TSV/(TSV + EV), where TSV = true score variance and EV = error variance

interaction of reliability and validity

Test scores cannot be more valid than reliable

Multiple Factor Models of Intelligence which people

Thurstone, Cattell

visual perception assessments cover

Visual discrimination Visual figure-ground Visual closure Visual-motor integration

what is most popular intelligence test

the Wechsler scales

scatterplots

can show a linear relationship between the measures on X and Y

use standard error of measurement to calculate

confidence intervals, which show the range within which the true score likely falls

reliability=

consistency

Horn-Cattell Fluid/Crystallized Theory of Intelligence

crystallized vs. fluid intelligence, broad visualization vs. broad speediness

2 key factors impacting Pearson r

curvilinear relationship and restricted range

how can get high reliability and validity with tests

Do an item analysis. High reliability and validity can be built into a test in advance through item analysis; it improves the quality of an item pool, but it is sample dependent. It works because the items are "behaving themselves."

2 cautions of indirect behavioral assessments

halo effect and central tendency error

standard error of measurement is how

How accurately you can be confident in the consistency of a score (reliability); a way of addressing the problem that the score you got may not be your true score. No test is perfect, so no score you get is perfect. The more perfect (reliable) the test, the closer your observed score is to your true score; the observed score equals the true score only when reliability is 1, which doesn't really happen.

broad speediness

how quickly can solve problems

selection definition

Individuals are either accepted or rejected from a pool of applicants. Most decisions around admissions use lots of criteria, not just a one-shot deal.

multiple cutoff method

Means that the counselor must establish a minimally acceptable score on each measure under consideration, then analyze the scores of a given client or student and determine whether each of the scores meets the given criterion. There are multiple predictors, but the person must reach the minimal score on every one to be selected in. For example, consider what qualifications someone would need to be picked to go to Mars: you want people who are high on all characteristics (a doctor, physically and mentally healthy), and combining these requirements yields a more accurate prediction. The question is what minimal cutoff score must be met on all the tests.

unified construct model of validity

More holistic: display evidence that shows the scores are valid; validity becomes an omnibus term.

methods of direct behavioral assessment

narrative recording, interval recording, self-monitoring, behavioral interview

fluid intelligence

nonverbal problem solving

Professional counselors can use trait approaches in 6 primary ways

1. Understanding the client 2. Making differential diagnoses 3. Establishing empathy and rapport 4. Giving feedback and insight 5. Anticipating the course of therapy 6. Matching treatments to clients

functional behavioral assessment (FBA)

When behavioral assessment is used to identify the function of a behavior. Required in schools for children who have a special education ruling.

easier to replicate interrater reliability for what type of rating

Objective rather than subjective ratings: for example, have two people grade a multiple-choice test vs. an essay test and see whether they get the same results/scores; agreement is easier to replicate for objective ratings.

relationship between reliability and validity in choosing tests

Tests can generate scores no more valid than they are reliable/consistent; choose instruments with higher reliability, because they can be more accurate too.

internal consistency estimates based on

the average correlation among items within a test or scale; there are various ways to obtain internal consistency

definition of construct validity

The extent to which a test may be said to measure a theoretical construct or trait. Some theoretical constructs are easier to measure than others: decades have been spent trying to measure intelligence, which is much easier to establish construct validity for than distractibility.

when to use Cronbach's alpha method

Used for multi-scaled item response formats; this is usually how internal consistency is computed for such scales.

when to use Kuder-Richardson Formula 20 method

used for dichotomous item response formats

% correct classifications formula=

(true positives + true negatives) / total

4 quadrants of true positives, false negatives

upper left=false negatives, upper right=true positives, bottom left=true negatives, bottom right=false positives
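These decision metrics can be pulled together in a short sketch. A minimal Python example, with hypothetical counts (not from the course):

```python
# Minimal sketch (Python): decision-accuracy metrics from the four
# quadrants. All counts below are hypothetical.
true_pos, false_neg = 40, 10   # clinical cases: caught vs. missed
true_neg, false_pos = 45, 5    # normal cases: cleared vs. flagged

total = true_pos + false_neg + true_neg + false_pos

# Sensitivity: proportion of actual positives the test catches.
sensitivity = true_pos / (true_pos + false_neg)

# Specificity: proportion of actual negatives the test clears.
specificity = true_neg / (true_neg + false_pos)

# Percent correct classifications: (TP + TN) / total.
pct_correct = (true_pos + true_neg) / total

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, "
      f"correct={pct_correct:.0%}")
# sensitivity=0.80, specificity=0.90, correct=85%
```

With these hypothetical counts, both sensitivity (.80) and specificity (.90) clear the 80% guideline noted above for selecting cut scores.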

how to make tests better

use item analysis

validity=

usefulness

3 types of interval recording

whole interval, partial interval, and momentary time sampling

positive slope

• "uphill" dots; a positive linear relationship (positive correlation) between the two variables X and Y

effect size estimates for Pearson r

• .10 = small • .30 = medium • .50 = large

negative slope

• "downhill" dots; a negative linear relationship (negative correlation) between the two variables X and Y

overall with assessment we are assessing constantly

"The expert is constantly in the process of assessing: Such assessment occurs naturally, almost without conscious reflection, in the course of working" (Gardner) We are always assessing automatically to see how client is responding to what I am saying-want to make sure getting right feedback from client to make things better-we do assessment throughout the counseling process-use that info to make sure that your next responses or direction you go will improve things

negative skew with test scores

Negative skew (skewed left): not a sufficient number of difficult items to discriminate at the ceiling, so scores pile up at the higher end. Add more difficult questions to pull the distribution toward normal. Some tests are meant to look like this because the grading scale fits the shape.

positive skew with test scores

Positive skew (skewed right): not a sufficient number of easy items to discriminate at the floor, so scores pile up at the lower end. There are too many hard questions, so add some easier items to make the distribution more normal.

variability of scores

If everyone gets the same score (e.g., ten 0s), r = 0: with low or no variability, the measure can't predict anything. Lots of variance in scores makes for better prediction. IQ and reading usually correlate .60 or higher, but if you test only the upper 2% of the population, there is so little variability that the truncated scores won't predict reading scores well; truncating scores drops the correlation substantially. Look at this when evaluating the reliability and validity of scores.

SD measure for SEM

±1 SEM = 68% chance (about 2:1). Accurate in 68 of 100 decisions, wrong about a third of the time; you don't want to be at this level, it's unacceptable.
±2 SEM = 95% chance (19:1). Usually where we want to be, because the range is not that huge; accurate in 19 of 20 decisions, the probability range we are usually interested in.
±2.58 SEM = 99% chance (99:1). A very high degree of probability, but it also means the range of scores is huge.
±3 SEM = 99.7% chance (997:3).

standard error of measurement is a way we can do what?

- Another measure of reliability; but more accurately the confidence we have in an individual's score

what is internal consistency influenced by

Content sampling. Heterogeneity of the behavior domain sampled: homogeneous items (e.g., a scale with only depression items) hang together and measure consistently. How many items you have: the more items, the more consistency you can get.

differences for test-retest reliability are due to...

Differences are due to: 1. Uncontrolled testing conditions (e.g., distractions). 2. Changes in the condition of the examinee (e.g., some students had a tutor and others did not, so differences show up even though that is not necessarily fair). Administer the test, wait a designated amount of time, then give it again; the longer the elapsed time, the lower the correlation tends to be, so the amount of time elapsed has to be designated.

factor analysis

FA is a refined statistical technique for analyzing the interrelationships of behavioral data. Each item is treated as a different "test," and items are statistically clustered according to their relationships with each other; the clusters are called factors. FA uses statistics to determine the degree to which the items contained in two separate instruments group together along factors that mathematically indicate similarity, and thus a common meaning.

if combo of speed and power tests

Most tests rely on a combination of speed and power, e.g., a power test that has a time limit. If a test combines the two, treat it like a power test.

practice effects and test-retest reliability

Practice effects can confound rtt: you may remember what the items are the second time ("I remember this question") and replicate your responses. 1. Examinees may recall items and respond exactly as before; the same pattern of right and wrong answers can lead to a spuriously high correlation. 2. Once the examinee grasps the concept, the solution can be repeated more quickly and with greater ease the second time.

definition of alternate-form reliability

Same person, two equivalent forms of the same test, over two occasions. There must be two equivalent forms of the same test: a short form and a long form are not equivalent. You don't see this much, but it is used in academic achievement testing where a second version of the test exists. Item content is not identical, but the forms measure the same thing in the same way at the same level of difficulty (parallel forms); scores on Form A are correlated with scores on Form B.

considerations with item difficulty

Several methods are available for determining item difficulty. The problem with teacher-made tests is that teachers don't see how the items performed and just end up curving: Did students get the instruction needed to adequately answer the questions? Were the test questions legitimately drawn from the domain of behavior, or were they poor questions that did not clarify what was being asked? Figure out which items need to be revised or thrown out; as you revise the test, reliability and validity scores get higher and higher, so revision makes the test better.

appropriate reliability of speeded tests may involve:

Test-retest. Alternate forms (if two forms exist). Split-half with modifications based on time rather than items: check whether the examinee answered the same number of questions in successive intervals of time from beginning to end, rather than answering a couple in one interval and then, distracted, not answering very many in the next.

what the constant does in multiple regression

The constant helps adjust the equation to its new scale (i.e., stanine, standard score, etc.); it is the y-intercept that brings the right and left sides of the equation onto the same scale. The prediction is a linear combination of the predictor variables plus the constant, so that in the future you don't have to collect the criterion and can just plug scores into the formula.

item discrimination involves

The degree to which an item differentiates (discriminates) correctly among respondents on the behavior the test is designed to measure: high-performing students should get the item correct in greater proportion than lower-performing students. Item discrimination is most helpful in item pool selection when developing the test; it helps you see how items are behaving and whether they need to be thrown out or revised. The item/total score correlation is often an easy and available method.

general rule with percentage passing

The general rule is to choose items with a moderate spread of difficulty (usually .15 or .20 to .70 or .80, with a couple of .90s as early items for confidence builders) and an average (or median) difficulty level of .50. A special case involves multiple-choice response formats: it is desirable to set the item difficulty coefficient higher (easier items) than .50, e.g., .69 for a 5-choice format and .75 for a 4-choice format; this is what you usually want for a standardized test. On a teacher-made test, you don't want the average p value lower than .80, because you want people to score about 80 and do well; still, include a variation of items so you can tell who knows the information and who does not.

how to do split-half reliability

The responses are split into comparable halves (usually by odd/even or matched items of difficulty); the half scores are summed and correlated. Because the two split tests are each one-half the size of the original test, which can drive the correlation coefficient down, the Spearman-Brown prophecy formula is applied to predict the results when the test is returned to its full size.
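A minimal Python sketch of this step-up; the .70 half-test correlation is the example value worked in the split-half card later in these notes:

```python
# Minimal sketch (Python): Spearman-Brown prophecy formula for
# stepping a half-test correlation up to full length.

def spearman_brown(r_part: float, n: float = 2.0) -> float:
    """Predicted reliability when the test is lengthened by factor n
    (n = 2 restores a split half to full length)."""
    return (n * r_part) / (1 + (n - 1) * r_part)

# Half-test r of .70: 2(.70) / (1 + .70) = 0.82.
print(round(spearman_brown(0.70), 2))  # 0.82
```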

clinical judgment

This is the model we use with the BPS (biopsychosocial) approach: integrating a wide range of information. Almost everyone does it this way, but it is often the least accurate of the three methods because it does not use the power of statistics. You can build in higher degrees of accuracy by using a statistical decision-making model, which allows you to make more accurate decisions than just going with your gut; when you have the opportunity to use statistics to help you make decisions, do it. Rigid adherence to cut scores also leads to more accurate decisions; when you switch sides on decisions, you are more likely to make inaccurate ones.

more on clinical judgment

Clinical judgment is not a statistical system per se. The evaluator relies on judgment, past experiences, and theoretical frames to interpret and integrate findings from different tests; you put all the information together and make inferences based on it. [We will cover this next week during the case example.]

want differences (D) of what or higher on MC tests

.40

for reliability what correlation number do you want to be higher than for diagnosis

.90

best D value can get is if

if only one person misses the item and that person is in the low group; in the class example, D = .20

3 major types of aptitude tests

1. Aptitude tests designed for scholastic admission decisions 2. Multi-aptitude batteries 3. Measures of special abilities

doing item analysis

1. Item difficulty: want a spread in the difficulty levels of items; if items are too easy there is no variability, and if they are too hard and no one gets them right, what is the point of them? 2. You can include a really hard question to separate out the gifted, so that only the highest-achieving people get it right. 3. If testing for mastery, you can include a super-easy item: if someone misses it, something is really wrong, and if everyone gets it right except one person, you know that one person needs help; it shows the rest got the most critical information and understand it.
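A minimal Python sketch combining both parts of item analysis, difficulty (p) and extreme-groups discrimination (D); the examinee data are hypothetical:

```python
# Minimal sketch (Python): item difficulty (p) and extreme-groups
# discrimination (D) for one item. Hypothetical data: each tuple is
# (total test score, 1 if this item was answered correctly else 0).
examinees = [(95, 1), (90, 1), (88, 1), (85, 1), (80, 1), (76, 0),
             (70, 1), (65, 0), (60, 1), (55, 0), (50, 0), (40, 0)]

examinees.sort(key=lambda e: e[0], reverse=True)
n_extreme = round(len(examinees) * 0.27)        # upper/lower 27%
high, low = examinees[:n_extreme], examinees[-n_extreme:]

# Difficulty: proportion of all examinees passing the item.
p_value = sum(item for _, item in examinees) / len(examinees)

# Discrimination: passing rate of the high group minus the low group.
d_value = (sum(item for _, item in high) / n_extreme
           - sum(item for _, item in low) / n_extreme)

print(f"p = {p_value:.2f}, D = {d_value:.2f}")  # p = 0.58, D = 1.00
```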

Salovey's Five Domains of Emotional Intelligence

1. Knowing one's emotions 2. Managing emotions 3. Motivating oneself 4. Recognizing emotions in others 5. Handling relationships

with diagnosis info should be gathered from 4 sources

1. Life outcomes (L) 2. Observer ratings (O) 3. Self-report ratings (S) 4. Test data (T)

criterion-related validity can involve any measure/test that is helpful for the test to be compared with, for example:

1. achievement or other tests 2. job/task performance (job interviews, internship evaluations) 3. psychiatric diagnosis 4. ratings (by teachers or trained observers) 5. previously available tests that also measure the same thing

5 types of projective personality assessments

1. association techniques (Rorschach Inkblot Test) 2. picture-story construction techniques (Thematic Apperception Test, TAT) 3. verbal completion tests (sentence completion) 4. choice arrangement techniques (letting a child choose which toys to play with) 5. production-expression techniques (House-Tree-Person)

3 groups of historical perspectives for multiple intelligences

1. classicists 2. revisionists 3. radicals

3 primary methods of using multiple tests to make decisions

1. clinical judgment 2. multiple regression equation 3. multiple cutoff scores

2 forms of criterion-related validity

1. concurrent criterion related validity 2. predictive criterion-related validity

criterion-related validity issues

1. criterion contamination 2. selecting validation criteria 3. validity generalization

7 ways to gather evidence for construct validity

1. developmental changes 2. internal consistency 3. correlations with other tests measuring exactly the same construct 4. factor analysis 5. experimental interventions 6. convergent validity evidence 7. discriminant validity evidence

2 methods of accomplishing item analysis-2 parts to it

1. difficulty 2. discrimination

2 cautions with clinical assessment

1. hypothesis confirmation bias 2. self-fulfilling prophecy

2 forms of classification consistency (decision reliability)

1. mastery vs. nonmastery 2. Cohen's kappa (κ): the proportion of nonrandom consistent classifications

cross cultural testing issues of item fairness

1. popular approach 2. specific-group norm approach 3. specific-culture approach

what 3 things influence a correlation coefficient

1. sample size/number of scores 2. strength of relationship as it exists in nature 3. variability of scores

3 ways to obtain internal consistency

1. split-half method 2. Cronbach's coefficient alpha 3. Kuder-Richardson Formula 20

6 types of bias in assessment

1. test bias 2. examiner bias 3. interpretive bias 4. response bias 5. situational bias 6. ecological bias

5 types of reliability

1. test-retest 2. Alternate-form 3. Internal consistency 4. Scorer reliability (interscorer reliability) 5. Interrater reliability-not a true form of reliability

optimal % of people in upper and lower range for extreme groups method

27%

achievement tests with special needs students

A diagnosis of LD can be made if the student demonstrates a significant discrepancy between ability (normally assessed using individualized intelligence tests) and achievement (normally assessed using individualized achievement tests) in one or more of the following seven academic areas: Oral expression Listening comprehension Basic reading skill Reading comprehension Written expression Mathematics calculation Mathematics reasoning

example of content validity

A diagnostic math test: concepts, calculation, and problem-solving.

informal career assessment measures

A forced-choice activity asks individuals to choose one of two quite different options or to rank three or more activities. In a card sort, an individual is given a stack of cards, all related to career choice (i.e., work value, work task, skill). The person can either rank the cards (depending on the number and time available) or sort them into three categories. These categories are usually things that are very important, somewhat important, and not important. It is essential to make sure that the entire range of choices individuals are likely to select is included. A structured interview consists of a professional counselor's asking the client questions aligned with a theory.

assessing career development and maturity

A number of inventories have been developed to assess the maturity of a client's thoughts and career development progression. Most of these instruments strive to help clients clarify and analyze beliefs and values so that career goals can be developed, agreed to, and pursued.

interpreting reliability coefficients

A reliability coefficient is directly interpreted as a % of variance (all other instances involve the squaring of correlations).

diagnosis definition

AKA classification; always involves two or more criteria and matches the individual with the best alternative. You go into the DSM and look for the categories that best match the symptom display. Diagnosis is basically the same thing as classification; by definition it uses more than one type of assessment or criterion to make the match, and it is more spectrum oriented now.

radical of multiple intelligences

About how you are smart. Radicals deny that g exists and believe that "a human intellectual competence must entail a set of skills of problem solving enabling the individual to resolve genuine problems or difficulties...create an effective product and...(find or create) problems thereby laying the groundwork for the acquisition of new knowledge" (Gardner, 1993). Intelligence is a set of problem-solving skills, differing across individuals, that helps us find what the problems are and solve them; the purpose of education is teaching individuals to solve problems according to their individual strengths and weaknesses. Where several factors might combine to form g (global intelligence), radicals say get rid of g and just talk about the factors themselves; once g is gone, intelligence is not a single construct but something people can show in several ways.

why assess achievement

Achievement tests are the most commonly used assessment devices in schools. Achievement tests measure an individual's learning acquired through structured, education-based experiences. Until a student is taught something (exposure), it cannot be assumed the student is incapable of learning it (mastery). Thus, great effort is expended to ensure that achievement test scores have content validity. Test items must align with classroom exposure and instruction to arrive at valid achievement scores.

advantage of multiple cutoff method

Advantage: it leads to high-quality decisions, because the standard cutoff is a minimal acceptable criterion and the cutoff must be met in all areas.

all evidence for construct validity is dependent on what

All evidence is sample dependent: when you use different groups of individuals you get different results. Robust results across many studies heighten confidence that the test is actually measuring what you want it to measure.

momentary time sampling

An observer marks each interval whenever a behavior occurs at the beginning or end of the interval.

Differences among aptitude, intelligence, and achievement tests

Aptitude and intelligence tests are used to predict future performance, while achievement tests are used to measure what a student or client has already learned. Intelligence tests tend to have more global predictive applications, while aptitude tests are designed to predict a narrow range of skills and abilities.

assessing intellectual developmental disorder

Assessment involves obtaining information in three primary areas: 1. Sociocultural history 2. Intellectual functioning 3. Adaptive functioning Essential features include significant impairment in intellectual and adaptive functioning Significant impairment has traditionally been defined as a performance falling at least two standard deviations below the mean on an individualized, standardized, culturally appropriate, and psychometrically sound diagnostic test

bivariate correlation

Bivariate correlation is a measure of relationship between two variables. • This can be graphically represented in a scatterplot, a diagram with the predictor variable on the horizontal axis and criterion variable on the vertical axis.

Holland's Theory of career assessments

Career interests are believed to be largely an expression of an individual's personality type Six types: realistic, investigative, artistic, social, enterprising, conventional People search for environments that will allow them to use their skills, express their values, and implement their problem-solving styles effectively Personality and environment interact to produce work-related behavior, such as vocational choice, job tenure, satisfaction, and achievement

confidence intervals allow us to...

Confidence intervals allow us to report an individual score in the context of the score's reliability. For example, it is not accurate to state that a participant's IQ is a standard score of 110, because all measurements have error. Based upon the normal distribution of score errors and knowing the reliability of a set of test scores, it is possible to report a range of scores within which the true score probably lies, at a certain level of probability; the statistic used to build this range is the standard error of measurement (SEM).

importance of validity

Construct validity is important in clinical decision making. Constructs can never be known 100% (we can never really know what depression is), so we provide as much evidence as possible about scores derived from instrumentation in order to have confidence in those scores. Validity is the most important construct in measurement, because we want to make accurate decisions about the lives of the individuals we work with. It is all about the score: validity belongs to the scores themselves, not to the test.

details of construct validity

Construct validity is more big picture: it uses other sources of information that are more statistically oriented, not necessarily expert judgment. You are never finished establishing construct validity, only estimating it; there is never 100% overlap, and you can never get past the sources of error, so you keep providing more and more evidence.

controversies and justice issues related to validity

Controversies in Assessment: "Teaching to the Test" Inflates Scores • "Teaching to the test" means that the focus of instruction becomes so prescribed that only content that is sure to appear on an exam is addressed in instruction. If this occurs, test scores should rise. • Whether test scores are inflated in this instance is a matter of content mastery. • Test publishers, state education departments, and local educators must work collaboratively to develop test items that adequately sample the broad content domain and standards.

specific-culture approach

Develop and norm the test for use with only one individual culture; for example, adapting the general-information questions of an IQ test for use in other countries.

difference between observed and true score is

The difference between the observed and true score is an estimate of the error occurring; the standard error of measurement is computed to quantify this error.

common multi-aptitude batteries

Differential Aptitude Test O*NET Ability Profiler Armed Services Vocational Aptitude Battery

direct behavioral assessment

Direct observation and recording of the target behavior as it occurs

examples of construct validity

Distractibility: the evidence is gathered by less direct means, such as teacher and parent ratings. Locus of control: how in control you feel of the circumstances in your life (you are in control vs. external events are in control); easy to explain but difficult to measure, and among the hardest constructs to establish construct validity for. Examples: intelligence, distractibility, locus of control.

major problem with use of test batteries

Each method uses a series of tests, each of which is supposed to add predictive validity toward the criterion. The major problem with the use of test batteries lies in determining an equitable way to combine the scores (weighting), along with examiner bias.

understanding elements of percentage passing table

Error = how many got it wrong out of all the people in the class
P value = passing rate (how many got it right)
H = high group
L = low group
D = difference (H - L)

example of doing multiple regression

Ex: Diagnosis of AD/HD using rating scales completed by the teacher (RS-T), mother (RS-M), and teenager (RS-C). AD/HD Diagnosis = (A weight)(RS-M) + (B weight)(RS-T) + (C weight)(RS-C) + (constant) = (.33)(19) + (.21)(16) + (.41)(26) + (2.99) = 23.28 The examiner concocts a decision range: <15 = Not AD/HD; 15-18 = borderline/rescreen; >18 = AD/HD Thus, the 23.28 becomes the predicted criterion score (which in this case is in the "AD/HD" range).
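A minimal Python sketch reproducing the arithmetic of this example; the weights, ratings, constant, and decision range all come from the card above:

```python
# Minimal sketch (Python): the AD/HD multiple regression prediction,
# with the weights, ratings, and constant from the example above.
weights = {"mother": 0.33, "teacher": 0.21, "child": 0.41}
ratings = {"mother": 19, "teacher": 16, "child": 26}
constant = 2.99

# Predicted criterion score = weighted sum of predictors + constant.
predicted = sum(weights[k] * ratings[k] for k in weights) + constant

print(round(predicted, 2))  # 23.28, in the ">18 = AD/HD" range
```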

examples of reliability correlations

Ex: rtt = .90 = 90% of true score variance (TSV) is accounted for by the trait and 10% due to error variance (EV) as defined by the type of reliability. Ex: for other correlations (validity coefficients or simple Pearson rs) .90 indicates 81% of true variance accounted for by the two variables and 19% due to error and other factors.

examiner bias

The examiner's beliefs and behaviors can influence scores in ways that advantage some and disadvantage others. Example: giving a test to an ELL student and talking louder and slower creates a different administration procedure; the examinee's reaction ("this person is treating me disrespectfully") can change the results, putting them at a disadvantage because of how they were treated.

interpretive bias

The examiner's interpretation of test results can provide unfair advantages to some and disadvantages to others. Giving people the benefit of the doubt (awarding points because you think they know it) can artificially raise their scores. Make sure to counterbalance: if you advantage someone once, don't do it again, so it works out to an average in the end; don't keep giving a higher score because you think they know it, or you will artificially raise their score.

who proposed multiple intelligences

Gardner

Gardner's 8 intelligences

Gardner rejected the existence of g and identified eight distinct intelligences that aid in an individual's adaptation to the environment: 1. Verbal-linguistic 2. Logical-mathematical 3. Spatial 4. Musical 5. Bodily-kinesthetic 6. Interpersonal 7. Intrapersonal 8. Naturalist

assessing giftedness

Gifted children display consistently superior intellectual, leadership, mechanical, figural, visual, or creative abilities. Strict adherence to IQ tests for identification of the gifted was problematic because many brilliant people have strengths and weaknesses in their cognitive profile and these do not always average out to a very superior IQ.

ecological bias

Global systems; in career assessment, cultural values may differ (e.g., some clients will not share much personal information in career counseling), and clients run into contextual problems.

commonly used achievement tests

Group-administered multi-skill achievement test batteries or surveys: TerraNova-3 Iowa Test of Basic Skills, Form C Individual achievement multi-skill test batteries: Woodcock-Johnson Tests of Achievement—Fourth Edition Wechsler Individual Achievement Test—Third Edition Individual and group-administered single-skill achievement tests for reading: Nelson-Denny Reading Test, Forms G and H Slosson Oral Reading Test—Revised Individual and group-administered single-skill achievement tests for mathematics: KeyMath-3 Diagnostic Assessment Slosson-Diagnostic Math Screener Individual and group-administered single-skill achievement tests for written expression: Test of Written Language—Fourth Edition Slosson Written Expression Test Tests of English Language Proficiency: Secondary Level English Proficiency Test Test of English as a Foreign Language Michigan English Language Assessment Battery

examples of depression inventories

HAM-D, CESD-revised, Zung

intrapersonal intelligence

Having an understanding of yourself, of knowing who you are, what you can do, what you want to do, how you react to things, which things to avoid, and which things to gravitate toward (Gardner, 1997) "We are drawn to people who have a good understanding of themselves b/c those people tend not to screw up because know what can and cannot do and how to get help when need it"

considerations with split-half reliability

If you get a correlation less than .80, the Spearman-Brown formula tells you how many items you would need to add to the scale to get over the .80 criterion; for a half-test r of .70, the full-length estimate is 2(.70) / (1 + .70) = .82. Split-half is not used much anymore because it gives the experimenter too much control over splitting items: choosing a certain split can make the correlation look higher.

Bias and Justice issues related to reliability

Importance of Reliability • Reliability is a necessary, but not sufficient, condition in the validation process • Unreliability shrinks the observed effect size through attenuation Researchers and Test Users Can Reduce Measurement Error and Improve Reliability By • Writing items clearly • Providing complete and understandable test instructions • Administering the instrument under prescribed conditions • Reducing subjectivity in scoring • Training raters and providing them with clear scoring instructions • Using heterogeneous respondent samples to increase the variance of observed scores • Increasing the length of the test by adding items that are ideally parallel to those that are already in the test • The general principle behind improving reliability is to maximize the variance of relevant individual differences and minimize the error variance

historical conceptualizations of intelligence

In the late nineteenth century, Sir Francis Galton and James McKeen Cattell believed in the importance of sensory acuities and capabilities as indications of intellectual prowess. Alfred Binet believed distinct thinking abilities were integrated into a general ability that was called on when solving problems. David Wechsler adapted subtests from the Army Alpha and Beta tests to develop a test to measure the intelligence of individual adults. Jean Piaget believed that learning was a consequence of an individual's interacting with the environment and encountering dilemmas that required mastery through a reorganization of thought. Charles Spearman proposed the general factor theory (g), which states that a general factor stands at the center of one's cognitive capacity, and specific factors are related to the general factor and help explain nuances and specialized characteristics observed in individuals.

verbal/nonverbal IQ for Input

Input: how we take in information; helps identify learning strengths and weaknesses. Visual, auditory, kinesthetic, tactile.

revisionists

Intelligence as information processing; where most people are today. Revisionists hold that the emphasis may be more on process than structure, on what individuals are doing when they exercise their intelligence. They accept that a g factor has to be incorporated into any theory of the structure of intelligence (Herrnstein & Murray, 1994). G does exist, but there are lots of different ways to get to g: a bunch of things make up your intellect, and you have strengths and weaknesses among them; most people are up and down rather than uniformly average. G is made up of many components, and your g may be configured very differently from other people's; everyone has a different configuration of the facets.

classicists

Intelligence as a structure; the beginning of the IQ testing movement. Classicists seek to identify the components of intelligence. For practical purposes, they accept that g lies at the center of the structure in a dominating position (Herrnstein & Murray, 1994). Bell curve. They believed in g, a global intellectual capability: that was your intelligence, what you are born and die with, your score, and individuals were classified that way.

why assess intelligence

Intelligence testing is undertaken to estimate a client's ability: To comprehend and express verbal information To solve problems through verbal or nonverbal means To learn and remember information To assess information processing and efficiency

Aptitude tests used for admission decisions

Intelligence tests are often used as a measure of scholastic aptitude. Scholastic aptitude assessments such as the SAT-I or ACT allow direct comparability across students in a manner that levels the playing field.

assessing interests

Interest inventories can help clients explore new academic or career possibilities, differentiate among various alternatives, and/or confirm a choice that has already been made Interests are often categorized as: Expressed Manifested Inventoried

guidelines for assessing interests

Interest inventories measure likes and dislikes, not abilities. They may suggest what a client will find satisfying in a career or work setting, but they do not indicate how successful the client will be in that setting. Interests tend to become progressively more stable with age, but they are never permanently fixed. Interest inventories are subject to response set and faking. General interest inventories have limited value for clients who are trying to make fine distinctions within a broad career category, such as medicine Interest inventories may not be appropriate for clients with certain emotional problems, including depression, as the clients may be more likely to respond negatively to items. Interest inventories may be susceptible to bias, particularly sex bias. Interest inventories provide just one piece of information about a client's interests.

test-retest reliability involves

Involves repeating the identical test on a second occasion: the coefficient is simply the correlation between scores on the first and second administrations. This is usually fairly high; the lower it goes, the more error there is due to inconsistency over time.

more on multiple cutoff method

Involves the establishment of a minimum cutoff score for each test in the battery. Any individual with a single score that falls below the minimum score on a single test is eliminated from consideration; only those individuals who exceed the minimum score in all areas are accepted. Because no test is perfect, multiple cutoffs are often not rigidly adhered to.

interrater reliability

Is NOT a true form of "reliability"; it is more a form of concurrent validity, because one rater is being used as the criterion for the other. Scores from two sets of raters (e.g., mothers, teachers) are correlated and reported as r: how consistently do teacher 1 and teacher 2 rate the same individual? Two different people rate the same target participant (two parents/guardians, two teachers); this concerns the behavior being rated, not the scoring of the test.

Thurstone

Louis L. Thurstone proposed that a collection of mostly independent primary abilities underlay intelligence, rather than the global general factor and multitude of specific factors proposed by Spearman. Thurstone identified seven primary mental abilities: Verbal comprehension Number Word fluency Spatial Reasoning Memory Perceptual speed

example of decision theory with 1 test

Making decisions using a single test is always a screening procedure. Example: Teacher Rating Scale (selection test) versus clinical diagnosis (criterion) for AD/HD. Conduct a study of 100 normal and clinical children by administering a behavior rating scale and then making a clinical diagnosis; plot the results on a graph. Each child contributes two data points: the teacher rating scale (selection test) and the clinical diagnosis (criterion).

controversies with aptitude assessments

Many parents and students debate whether or not it would be beneficial to retake the SATs to attempt to obtain a higher score. In order to make this decision, we need to know: If any kind of intervention has occurred since the previous administration of the SAT-I that may have helped the student to expand his understanding or mastery of the domain of knowledge being tested or application of those skills What the student's score on the first administration of the SAT-I was: statistical regression predicts that the farther you score from the mean, the greater the probability that a second examination will yield a score closer to the mean Essentially, the high scoring student has little to lose and the low scoring student has everything to gain by retaking the SAT-I Multiple-choice questions have been accused of punishing intelligent, creative thinkers, trivializing the complexities of the learning process, and rewarding good guessers. The data do not support the criticism that good guessers can achieve significantly higher scores on a multiple-choice test, because they would have to guess correctly on several to perhaps dozens of questions to make a difference in their scores.

measures of special abilities

Measures of special abilities arose from the philosophy that an accurate match between employee ability and specific employment task demands would lead to greater productivity for the company and greater satisfaction for the employee. Special ability tests cover wide-ranging areas of interest, including mechanical, clerical, musical, and artistic aptitudes.

where will see content validity

Most common with academic achievement tests, where there is little disagreement that a math question is a math question; agreement is much greater, and items can be matched with learning objectives.

Multi-aptitude batteries

Multi-aptitude batteries provide professional counselors with the tools to identify student and client aptitudes Differential prediction relates to how well a group of people's test scores help indicate who will be more or less successful at certain academic or occupational tasks in the future; aptitude batteries are not very good at achieving this very desirable result

how to calculate number of people in upper and lower range for extreme groups

Multiply the number of people who took the test by 0.27. The optimum level (to preserve reliability) uses the upper and lower 27%: the top 27% of performers and the bottom 27%, so take 27% of N to get the number of students in each of the high and low groups.

when interpreting direct reliability do you square the correlation

NO: do not square the correlation.

neurocognitive d/os

Neurocognitive disorders include Alzheimer's disease, Parkinson's disease, vascular disease, traumatic brain injury, Huntington's disease, and HIV infection.

neurodevelopmental d/os

Neurodevelopmental disorders include intellectual developmental disorder, language impairment, speech sound disorder, autism spectrum disorder, attention-deficit/hyperactivity disorder, dyslexia, dyscalculia, disorder of written expression, developmental coordination disorder, and chronic motor or vocal tic disorder.

neuropsychological assessment

Neuropsychological assessment involves the behavioral evaluation of brain dysfunction. A comprehensive neuropsychological evaluation involves assessment of orientation, attention, visual perception, auditory perception, tactile perception, visual memory, auditory memory, complex memory, verbal expression, academic skills, visual-motor skills, concept formation and reasoning, executive functioning, and motor performance.

benefit of internal consistency

Only requires a single administration, therefore no error due to occasion.

examples of correlation coefficient

Example: a peer sociogram assesses group dynamics ("list the top 3 people you want to play with at recess"); you can see how many times each person was selected, and compare how I view myself in social relationships with how I actually do with the group. The correlation is positive when the slope is positive (uphill), and the magnitude is large when the points sit close to the regression line (tight-knit). We assume the relationship is linear (a line of best fit can be drawn) rather than curvilinear. The computation (SPSS and Excel will do it): r = (sum of xy) / (n * SDx * SDy), where x and y are deviation scores.
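A minimal Python sketch of that computation with hypothetical paired scores:

```python
# Minimal sketch (Python): Pearson r from deviation scores, matching
# r = sum(xy) / (n * SDx * SDy) with population SDs. Scores are
# hypothetical.
import statistics as stats

x = [2, 4, 5, 7, 9]
y = [1, 3, 6, 8, 9]

n = len(x)
mx, my = stats.fmean(x), stats.fmean(y)
sdx, sdy = stats.pstdev(x), stats.pstdev(y)

# Sum of cross-products of deviation scores.
sum_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

r = sum_xy / (n * sdx * sdy)
print(round(r, 2))  # ~0.97: a strong positive linear relationship
```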

professional counselors must be familiar with and understand...to be effective advocates

Professional Counselors must be familiar with Specific tests used What each test measures How to administer, score, and interpret the test results To be an effective advocate, professional counselors must understand Educational and civil rights laws Psychometrics The domain of behavior assessed by a test What scores on achievement and other tests may imply regarding student eligibility

reliability helps to determine what

Reliability is the similarity in scores; therefore, it helps determine the error variance (any condition irrelevant to the test). Interpret as true score variance vs. error variance.

relationship between reliability coefficients and validity

Reliability sets the upper bound for validity: if you choose scores with low reliability, it will be hard to see validity rise above the reliability coefficient's value.

formula for standard error of measurement (SEM)=

SEM = SD * sqrt(1 - rtt), where rtt is the test-retest reliability coefficient
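A minimal Python sketch applying this formula and the ±2 SEM band described earlier; the IQ-style SD of 15 and reliability of .91 are assumed example values, not from the notes:

```python
# Minimal sketch (Python): SEM = SD * sqrt(1 - rtt), then the +/- 2 SEM
# band from the notes. SD = 15 (IQ metric) and rtt = .91 are assumed
# example values.
import math

sd, r_tt = 15, 0.91
observed = 110

sem = sd * math.sqrt(1 - r_tt)        # 15 * sqrt(.09) = 4.5

# +/- 2 SEM gives roughly a 95% chance the band contains the true score.
low, high = observed - 2 * sem, observed + 2 * sem
print(f"SEM = {sem:.1f}; 95% band = {low:.0f} to {high:.0f}")  # 101 to 119
```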

Commonly used admission tests

Scholastic Assessment Test Preliminary Scholastic Assessment Test/National Merit Scholarship Qualifying Test American College Testing Assessment Miller Analogies Test Graduate Record Examination

emotional intelligence

Self-motivation Persistence in the face of frustration Impulse-control Delay of gratification Mood regulation Prevention of distress from swamping the ability to think (not being overwhelmed) Empathy Ability to hope. Kids with ADHD struggle with this, as do kids with trauma. It is also developmentally linked: with practice, people get better at it as they age.

Achievement tests can also be categorized as:

Single-level or multilevel Norm-referenced or criterion-referenced Group administered or individually administered Screening or diagnostic instruments

social justice and multicultural issues and reliability

Social Justice and Multicultural Issues Test authors should determine whether the reliabilities of scores from different groups vary substantially, and report those variations for each population for which the test has been recommended

squaring correlation coefficient

Squaring the correlation coefficient yields the coefficient of determination (the variance accounted for). Because the correlation coefficient is ordinal rather than interval or ratio (the distances between correlational values are not equal), you must square r; think of Venn diagram overlap. Always square the correlation coefficient to find the overlap, i.e., what we know about a relationship. Correlation is NOT causation: a mediating variable may be at play. The Pearson product-moment correlation coefficient (r) assumes interval or ratio data.

standard error of estimate (SEE)

Standard error of estimate (SEE) is derived from examining the difference between our predicted value of the criterion and the person's actual score on the criterion. How well does our test predict the criterion? If the correlation is very high, the standard error of estimate is low; there is an inverse relationship between them. Similar to what the SEM is for reliability. Shows the margin of error expected in the predicted criterion score as a result of the test's predictive validity (high or low).
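The notes describe the SEE without giving its formula; the textbook formula is SEE = SDy * sqrt(1 - r^2). A minimal Python sketch with assumed example values:

```python
# Minimal sketch (Python): standard error of estimate,
# SEE = SDy * sqrt(1 - r**2). The criterion SD and validity r are
# assumed example values (the notes give no numbers).
import math

sd_criterion = 10      # SD of the criterion measure
r_validity = 0.80      # test-criterion (validity) correlation

see = sd_criterion * math.sqrt(1 - r_validity ** 2)
print(round(see, 2))   # 6.0: higher r shrinks the prediction error
```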

Evaluation Standards

Standards published by AARC: • Competencies in Assessment and Evaluation for School Counselors • Standards for Multicultural Assessment • Standards for Assessment in Mental Health Counseling • Standards for Assessment in Substance Abuse Counseling • Standards for Career Assessment • Assessment Standards for Couples, Marriage, and Family Counselors Standards published by ACA: • Standards for Qualifications of Test Users • ACA Position Statement on High Stakes Testing

good kind of bias people don't mention

Statistical bias: how tests help us make determinations about who is lower or higher on a construct. It is supposed to be built into every test; this is the good kind of bias.

tests measuring interests

Strong Interest Inventory Skills Confidence Inventory Self-Directed Search Kuder Interest Inventories Abilities Explorer Campbell Interest and Skill Survey Career Assessment Inventory: Vocational Version and Enhanced Version Career Occupational Preference System Jackson Vocational Interest Survey Kuder Skills Assessment

limitations of alternate-forms reliability

Subject to practice effects. Changes in the nature of the task may be necessary because of repetition; from Form A to Form B, some tasks are difficult to replicate.

review of correlation coefficient

Super important in assessment because reliability and validity are reported in terms of correlation coefficients. Ranges from +1.00 to -1.00. + = positive: the two variables move in the same direction. - = negative: inversely proportional. The decimal = magnitude: how strong the relationship is.

Whiston's four categories: of achievement tests

Survey achievement batteries Individual diagnostic achievement tests Criterion-referenced and minimum-level skills tests Subject area tests

applying the multiple regression equation

Takes several test scores (usually from hundreds of examinees) into account and weights them, based on previous and current data, to yield a predicted criterion score. You are usually looking at a larger group of individuals, with scores weighted by how well each predicts the criterion. The computer gives you the weights, deciding how important each variable (the teacher's, mother's, and self-report scores) is in predicting the criterion score; it takes the number of items into account, too.

considerations with using parents and teachers as raters for reliability measures

Teachers see the student in the same context, so two teachers are usually on the same page (around .70). Parents also see the child in the same context, so they agree strongly too (around .80). Parent compared to teacher: around .50, because the two contexts differ, so there is less agreement. Parents of an only child may not be able to see whether a behavior is deviant compared to normal, because they have no other children for comparison; teachers can compare against a classroom of other kids.

controversies in achievement assessment

Test Developers Dictate What Students Must Know or Learn Developers of achievement tests use several methods to select items that measure the domain of knowledge being assessed: Curriculum and textbook reviews Reviews of previously available tests Consultation and evaluation of experts in the given content area Good curriculum evaluation starts with well-defined standards, which are then implemented through an effective curriculum (including benchmarks, instructional objectives, and instructional activities) and appropriately assessed The key is for the test or assessment program to align perfectly with the curriculum, and for the curriculum to align perfectly with the standards. Test items now align more precisely with state standards, and the burden is on school systems and individual teachers to develop and implement an effective curriculum.

Similarities among aptitude, intelligence, and achievement tests

Tests that measure aptitudes, intelligence, and achievement all assess verbal abilities, numerical and quantitative skills, and reasoning

range of values of Pearson r and what the values mean

The Pearson r can take on values from -1.00 to +1.00. • A positive r indicates a positive linear relationship (i.e., directly related; as scores on X get higher, so do scores on Y ). • A negative r indicates a negative linear relationship between X and Y (i.e., inversely related; as scores on X get higher, scores on Y decrease). • The decimal indicates the magnitude or strength of the relationship. • The closer the absolute value of r is to 1.0, the stronger the correlation. • On a scatterplot, the closer the dots are to the straight line (called a line of regression), the stronger the relationship. • When r = 0, there is no linear relationship between X and Y. • The scatterplot usually looks circular or without a linear pattern. • The extreme positive value r = 1.0 indicates that there is a perfect positive correlation between X and Y. • All scatterplot dots fall on a straight line with a positive slope. • The extreme negative value r = -1.0 indicates that there is a perfect negative correlation between X and Y. • All scatterplot dots fall on a straight line with a negative slope.

tests measuring values and life role salience

The Values Scale—Second Edition Minnesota Importance Questionnaire O*NET Work Importance Profiler Life Values Inventory Salience Inventory

spatial intelligence

The ability to represent the spatial world internally in your mind...(sailor, airplane pilot, chess player, sculptor)...If you are spatially intelligent and oriented toward the arts, you are more likely to become a painter (sculptor, architect). Similarly certain sciences...(anatomy, topology)...emphasize spatial intelligence (Gardner, 1997). Chess players think six moves ahead; surgeons and those with an artistic perspective use it too.

musical intelligence

The capacity to think in music, to be able to hear patterns, recognize them, remember them, and perhaps manipulate them. People who have strong musical intelligence don't just remember music easily—they can't get it out of their minds, it's so omnipresent (Gardner, 1997). Do you think musically? Mozart had a great understanding of how music works: he couldn't turn it off and heard everything simultaneously; the notes are like words popping into your brain. Among professional musicians, some need sheet music and some have it all in their heads.

linguistic intelligence

The capacity to use language, your native language, and perhaps other languages, to express what's on your mind and to understand other people. Poets really specialize in linguistic intelligence, but any kind of writer, orator, speaker, lawyer, or a person for whom language is an important stock in trade (Gardner, 1997). Greater facility for learning languages (my mom). Expressing what is on your mind so that others can understand what you are saying and experiencing: connecting with others and making your points.

body kinesthetic intelligence

The capacity to use your whole body or parts of your body—your hand, your fingers, your arms—to solve a problem, make something, or put on some kind of production. The most evident examples are people in athletics or the performing arts, particularly dance or acting (Gardner, 1997). Athletes, actors, dancers, musicians with fine motor skills

Three Decision-Making Models Generally Used to Determine a Discrepancy

The grade-equivalent discrepancy model (outdated) The standard score point discrepancy model (common) The standard score point discrepancy with statistical regression model (becoming more common) Currently, a fourth model is emerging that features curriculum-based measurements (CBMs) or curriculum-based assessments (CBAs) more tightly aligned with local curricula to determine learning objectives that a student has and has not mastered; this information is then used to determine degree of deficiency and plan for educational interventions to remediate those deficiencies

naturalist intelligence

The human ability to discriminate among living things (plants, animals) as well as sensitivity to other features of the natural world (clouds, rock configurations) (Gardner, 1997). In our consumer society, this may involve discrimination among cars, sneakers, kinds of make-up, etc. Categorizing: how easily people make sense of categories, such as telling a shoe's brand just by looking at it. Chefs are good at this (what goes well with what), as are hair designers and makeup artists, knowing what goes with what to make it perfect.

purposes of career assessment

The primary purpose of assessment in career counseling is to facilitate self-exploration and self-understanding. To aid clients with career planning, professional counselors use formal and informal assessments to help clients learn about their interests, abilities, values, and personality. Assessment tools also are used to help evaluate where clients are in the career development or decision-making process.

individual diagnostic tests of intelligence

The various Wechsler tests (WAIS-IV, WISC-V, WPPSI-IV) Stanford-Binet Intelligence Scales—Fifth Edition

classical theory explanation of standard error of measurement

This is classical test theory. Almost everybody uses coefficient alpha instead of r_tt because alpha is usually higher than r_tt, which makes us think scores are more accurate than they actually are. If reliability is 1.00, there is no error and SEM = 0; you never really see this, so there is always error. If reliability is 0.00, then SEM = SD, which would be a lot of error. We want reliability to be high to shrink the error and the SEM, giving us a smaller range within which the true score lies. This lets us predict, with a certain probability, where scores fall.
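A minimal sketch of the arithmetic in Python (hypothetical numbers; assumes the classical formula SEM = SD × √(1 − reliability)):

    import math

    def sem(sd, reliability):
        # Classical test theory: the SEM shrinks as reliability rises;
        # SEM = 0 when reliability = 1.00, and SEM = SD when reliability = 0.00.
        return sd * math.sqrt(1 - reliability)

    def confidence_interval(observed, sd, reliability, z=1.96):
        # 95% band (z = 1.96) within which the true score is expected to fall.
        error = z * sem(sd, reliability)
        return observed - error, observed + error

    # Hypothetical example: IQ-style scale (SD = 15) with reliability .91.
    print(sem(15, 0.91))                       # 4.5
    print(confidence_interval(100, 15, 0.91))  # (91.18, 108.82)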

behavioral interview

This is conducted to identify a target behavior, to analyze environmental variables affecting the behavior, and to plan, implement, and evaluate an intervention

difference between content and construct validity

Content validity is the more specific and systematic of the two: it examines the specific content of the test, item by item, rather than the overall construct itself.

purposes of achievement assessment

• To make high-stakes decisions about students
• To assess the effectiveness of a school's curriculum
• To assess the progress or academic development of individual students
• To identify learning problems in individual students
• To make placement or program eligibility decisions

to obtain the variance accounted for by a reliability coefficient you...

To obtain the variance accounted for by a reliability coefficient, you do not square the coefficient. This is the only exception to that rule. For a reliability coefficient of .90, the proportion of true score variance is 90% and the error variance is 10%.

challenges with construct validity

We are trying to make a global determination about how well the test measures what it purports to measure as a construct. Items are drawn down from a domain of behavior, the construct we are trying to measure. How well do those items measure, say, the construct of depression? The construct is always an unknown: you never know exactly what the construct is, because there are lots of different theories about it, and each test measures the construct in a slightly different way.

nature and theories of intelligence

Typically intelligence tests measure verbal abilities, abstract visual reasoning, and quantitative skills. There is also general agreement that speed and efficiency of problem-solving capacities are characteristic of individuals with higher levels of intelligence.

interpersonal intelligence

Understanding other people: It's an ability we all need, but it is at a premium if you are a teacher, clinician, salesperson, or politician. Anybody who deals with other people has to be skilled in the interpersonal sphere (Gardner, 1997). Employers care about this far more than grades: they want employees who can work in teams and get along with others.

logical-mathematical intelligence

Understanding the underlying principles of some kind of a causal system, the way a scientist or a logician does; or the ability to manipulate numbers, quantities, and operations, the way a mathematician does (Gardner, 1997). It is a logical, if/then system for solving problems: once you figure out the system, you can solve any problem within it (hypothesis testing).

validity information tells us...

Validity information tells us what the test measures and how well the test measures it. It provides indirect or direct evidence about how accurate test scores are. Validity indicates the degree to which test scores measure what the test claims to measure.

controversies in intellectual assessment

Various patterns or combinations of cognitive skills, while perhaps resulting in the same estimate of overall intelligence, may lead to very different results in terms of how problems are solved. Psychometricians and statisticians continue to debate whether intelligence can be meaningfully represented by a single global score or is global with multidimensional refinements. The most commonly used tests of intelligence, the Wechsler scales, put less emphasis on theoretical considerations and focus more attention on clinical usefulness and applications.

how the weights help with multiple regression

Weights help to adjust scores according to the number of points (items) and predictive validity: how predictive each measure is and how many items are in the test.

special applications for screening tests

When administering a test, sometimes you want it to be skewed based on the purpose of the test: are you trying to see who is in the at-risk population, who is in the lowest 25% vs. not in the lowest 25%? If so, design the test questions that way. When a test is specifically designed for screening and identification purposes, the developer should utilize items whose difficulty level values come closer to the selection ratio.

problems with clinical judgment

When you have the opportunity to use statistics to help you make decisions, do it. Clinical judgment is the least accurate of these three approaches because it is not based on statistics; adding rigid adherence to cut scores leads to more accurate decisions. Switching sides in decisions is more likely to lead to inaccurate decisions. The problem is that people can make different decisions based on the same scores, so applying criterion cutoffs standardizes the decisions people make.

response bias

When individuals lean toward a certain type of response regardless of item content, for example, acquiescence (tending to agree) or rating oneself in the middle rather than at the extremes.

situational bias

When testing conditions, based on cultural elements, impact performance. For example, if examinees do not view timeliness the way Western culture does, or come from a place where the concept of time is more fluid and without that hurriedness, that can influence test scores.

convergent validity evidence

When you correlate your test with another test that measures a similar construct, you should get robust correlations; if you do, that is evidence of convergent validity, since the tests are converging on a similar construct. They do not necessarily have to measure the same construct. A correlation of about .50 is expected for convergent validity.

test bias

Whenever properties of a test cause individuals or groups to score lower or higher than the average scores for the total population because of some kind of advantage or disadvantage built into the test. Example: a bobsled item would disadvantage someone from Jamaica who has never seen snow. These kinds of biases make their way into society and advantage or disadvantage people.

assessing values and life role salience

Whereas interests refer to what a person likes to do, values define what an individual thinks is important. Values have been found to correlate more highly with work satisfaction than interests do. Individuals define values differently, so it is essential for professional counselors to encourage clients to clarify the meanings they assign to particular values.

internal consistency

a measure of score reliability (see reliability). The criterion is the total score of the test itself: each item is correlated with the total score. It reflects homogeneity of item content, that is, how well the items measure each other. A necessary, but not sufficient, condition.

Thematic Apperception Test (TAT)

a projective test in which people express their inner feelings and interests through the stories they make up about ambiguous scenes. There is a children's version using animals instead of people. Clients make up a story about the pictures, with a beginning, details, a lesson, and an ending, and responses are coded for conflict. When story after story shows themes that cut across, such as seeing problems and conflict everywhere (perhaps describing a drunken rage and a mother comforting her son), that reveals what was on the client's mind at the time.

screening definition

a rapid, rough selection process, often not followed by further procedures; the decision is typically worded pass/refer (fail is not in vogue anymore). Most are simply pass/fail: if clients don't pass, they get screened again and then, if needed, receive deeper diagnostic testing. Example: a hearing test at school scored pass/fail.

why is alternate-form reliability a useful form of reliability

a. Temporal stability (similar to r_tt). b. Consistency of response to different item samples: you should get the same score from one administration to the next. This is easier to achieve with objectively derived scores than with subjective ones. If the alternate forms are administered in immediate succession, the result shows reliability across forms only, not occasions (temporal stability is taken out of the mix, leaving just content differences). If not administered in immediate succession, the results indicate variance due to occasion (just like test-retest) plus content differences.

placement definition

all are accepted, but subsequently assigned to appropriate "treatments" to maximize effectiveness of outcomes; the decision is based on a single score, and people are put into different categories based on the results of the test

information processing model

all such models describe how information gets to us and how we are able to give out responses: start with input, then integrate the information, then output

speed tests and reliability

all items are easy; differences in scores are due solely to how quickly the individual works the items. Different people respond to different numbers of items: a person who does well usually doesn't miss items and gets more items done than a person who does poorly ("how many did I get done in 2 minutes?").

reliability coefficients reported as

all reported as r; r is generally less than 1.00, and the lower it goes, the greater the amount of error in the measurement of those scores

multiple regression

allows several variables to be weighted in order to predict some criterion score: multiple predictors of one criterion, each of which is more or less accurate and weighted more or less heavily
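A hedged sketch of the idea in Python (made-up data; scikit-learn's LinearRegression estimates the weights from the sample):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical data: two predictors (say, an aptitude score and an
    # interview rating) predicting one criterion (say, job performance).
    X = np.array([[55, 3], [62, 4], [70, 4], [48, 2], [66, 5], [59, 3]])
    y = np.array([2.9, 3.4, 3.8, 2.5, 3.9, 3.1])

    model = LinearRegression().fit(X, y)
    print(model.coef_)               # the weight for each predictor
    print(model.intercept_)          # the constant
    print(model.predict([[60, 4]]))  # predicted criterion score for a new case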

reliability of criterion-referenced tests involves

classification consistency (decision reliability)

2 most common measures of reliability

coefficient alpha, test-retest

integration involves

how we can solve problems and integrate info to solve them, ranging from lower-level to higher-level processes: Recall (lower level), Translation, Apprehension, Extrapolation, Application (higher level), Analysis, Synthesis, Evaluation

output involves

how we express ourselves: motor (writing something out or doing something to solve problems) and verbal

standard error of estimate is what

how well you can predict criterion for validity

how to use split-half method

i. Split the test with the odd-even method or the matched random subsets method. ii. Correlate the two half-scores, then apply the Spearman-Brown prophecy formula to correct for the shortened test length.
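A sketch of the odd-even method in Python (hypothetical 0/1 item scores; the Spearman-Brown correction is applied to the half-test correlation):

    import numpy as np

    # Hypothetical matrix: rows = examinees, columns = items scored 0/1.
    scores = np.array([
        [1, 1, 0, 1, 1, 0, 1, 1],
        [0, 1, 0, 0, 1, 0, 0, 1],
        [1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 1, 0, 0, 1, 0],
        [1, 0, 1, 1, 1, 0, 1, 1],
    ])

    odd_half = scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7
    even_half = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6, 8
    r12 = np.corrcoef(odd_half, even_half)[0, 1]

    # Spearman-Brown prophecy formula corrects for halving the test length.
    rxx = 2 * r12 / (1 + r12)
    print(r12, rxx)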

"Gold standard"

a label imparted on some tests to suggest they are the "best" test, but really they are not; in reality, a true gold standard would overlap 100% with the construct itself, and no test gets there

how difference (D) is related to percentage passing

an inverse relationship: when the percentage passing goes down, the D value goes up, and when the percentage passing goes up, D goes down

definition of test-retest reliability

involves repeating the identical test on a second occasion. The coefficient is simply the correlation between scores on the 1st and 2nd administrations.

intake interview

is done to collect relevant information about a client's history and background in order to quickly ascertain the effects past events may have on the client's current situation

percentage passing

item difficulty is determined by simply calculating the percentage of persons who responded correctly: what proportion of the people passed the item. You generally want a moderate spread, though that depends on what the test is trying to measure. If the difficulty level is .50, then 50% got it right and 50% got it wrong; p = .80 means 80% answered correctly; p = .10 means 10% answered correctly.
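A quick sketch in Python (hypothetical right/wrong responses; p for each item is just the column mean):

    import numpy as np

    # Hypothetical responses: rows = examinees, columns = items (1 = correct).
    responses = np.array([
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 0, 0],
        [1, 0, 1, 1],
        [0, 1, 0, 1],
    ])

    p = responses.mean(axis=0)  # proportion passing each item
    print(p)                    # first item: p = .80, so 80% answered correctly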

criterion contamination

knowledge of the preliminary test result may influence measurement of the criterion. If the person collecting the criterion knows the pre-test score, that knowledge can contaminate the criterion by swaying their decision, so hold them blind to that information. For example, if the person evaluating you year after year knows what last year's evaluation was, that can sway this year's evaluation. People collecting the criterion should have no knowledge of the pre-test.

when to use alternate-form reliability

as when you have two versions of a test in academic achievement testing

when to use coefficient (Cronbach's) alpha

for multi-response scales with many response choices, such as Likert scales
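A minimal sketch of coefficient alpha in Python (hypothetical Likert data; uses the standard formula alpha = k/(k − 1) × (1 − Σ item variances / total-score variance)):

    import numpy as np

    def cronbach_alpha(items):
        # items: rows = respondents, columns = scale items (e.g., 1-5 Likert).
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)
        total_var = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    data = [[4, 5, 4, 4], [2, 2, 3, 2], [5, 4, 5, 5], [3, 3, 2, 3], [4, 4, 4, 5]]
    print(cronbach_alpha(data))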

sample size/number of scores

more people in a study will tend to suppress the correlation coefficient: with only 20-30 participants, the correlation will often end up higher than it would with more people, because small samples give unstable estimates. Run correlations with at least a moderate number of people, about 30.

rationale for family assessment

o Assessment can provide a rich source of information that can be used to develop initial hypotheses about the nature of the problem, the causes of the problem, family members' varying perceptions of the problem, and potential areas of strength. o Assessment also provides clinicians with baseline data by which progress in counseling can be measured. o The process can help families to view the presenting issue from a systems perspective rather than as an individual family member's problem. o Including formal and informal assessment methods in family counseling helps clinicians to avoid bias.

what is behavioral assessment

o Behavioral assessment is defined as the identification of meaningful response units and their controlling variables for the purposes of understanding and altering behavior

strengths of trait approaches

o Easy to administer, score, and interpret o Robust predictive validity o Most trait inventories are norm referenced, allowing comparisons o They focus on normal, healthy personality functioning, allowing us to understand strengths and protective factors in addition to weaknesses

what is personality

o Experts disagree on a definition of personality, what comprises it, or how best to measure it o Personality is an intrinsic, adaptive organizational structure that is consistent across situations and stable over time

family systems theory

o Families are interacting systems composed of interdependent members o Problems are viewed as relationship issues associated with the system itself rather than with specific individuals o The family, not the individual, is the unit of change o Family counselors focus on assessing the family system, with a focus on four primary areas: § Family relationships § Patterns § Structure § Level of functioning

what is assessed in family assessment

o In general, relationships, interactions, and family dynamics o The presenting issue o Family composition o Family process o Family affect o Family organization o Strengths and resources o Goals for change o May also assess family members' personality characteristics, coping and adaptation strategies, values, stressors, life cycle stages, and daily routines

informal assessment methods for families

o Interviews (the most common method) o Drawing and art assessments o Family Circle o Kinetic Family Drawing o Joint family drawings o Mapping activities o Sculpting activities

trait approaches to personality assessment

o It is helpful to consider traits and states as two ends of the same continuum. o Traits are enduring, statistically derived dimensions used to explain personality characteristics, while states are generally more transient or situation-dependent facets of personal adjustment. o Most structured personality assessment deals with the identification of the more enduring personality traits to understand and predict human behavior.

issues that make family assessment challenging

o It is likely that family members will view issues differently and will bring differing perspectives to the table. o Some family practitioners view empirically based, structured methods of assessment as static measures that do not capture the dynamic process of family interactions. Many of the variables assessed, such as communication styles, family roles, and levels of cohesion, are fluid and likely to fluctuate. o There is no unified theory of family functioning, no consensus about the definition of healthy or dysfunctional family relationships, and no agreement about the key processes that need to be assessed. o Many of the formal assessment measures were developed for research rather than for clinical practice and many of the measures are based on inadequate norming samples that may not be clinically relevant and that were based on samples of predominantly Caucasian participants.

weaknesses of trait approaches

o Little explanation has been offered as to why traits exist, how they develop and become differentiated over time, or the degree to which each is genetically determined or environmentally influenced. o Trait approaches are sometimes criticized for being redundant in nature. o Different models predict different numbers of primary traits.

characteristics of qualitative methods that make them useful

o More informal than standardized assessment, allowing for greater professional counselor and client flexibility o Actively involve the clients and lead readily into counseling interactions o Interpretations tend to be open-ended and divergent o May be modified, both in content and in interpretation, to meet the needs of the family being assessed o Emphasize the concepts of learning about oneself within a developmental framework o Can serve as interventions, thereby reducing the distinction between assessment and counseling

Some Commonly Used Structured Personality Assessment Inventories

o NEO Personality Inventory—Third Edition o 16 Personality Factors Questionnaire o Myers-Briggs Type Indicator—Form M o Jackson Personality Inventory—Revised o Piers-Harris Children's Self-Concept Scale—Second Edition

qualitative family assessments

o Nonstandardized, nonquantitative approaches to evaluating families: examples include structured exercises, creative activities, genograms, timelines, card sorts, and a host of other open-ended activities o Not meant to diagnose or categorize families, but to help professional counselors learn from families and help families increase their understanding of themselves

methods of assessing families

o Observational methods (live or taped simulated situational examinations of family interactions) o Graphic representations of relationships (genograms, kinetic drawings) o Measures of temperament, character, or personality (personality inventories and checklists, such as the Myers-Briggs Type Indicator) o Techniques to assess levels of marital satisfaction, quality, or happiness o Family adaptation measures o Stress and coping appraisals (life events, expectations, disruptions) o Parenting and family skills o Other areas (e.g., sexual functioning)

commonly used couples assessments

o PREPARE/ENRICH inventories o Marital Satisfaction Inventory—Revised o Dyadic Adjustment Scale o Myers-Briggs Type Indicator o Facilitating Open Couple Communication, Understanding and Study—Third Edition o Premarriage Awareness Inventory o RELATionship Evaluation o Taylor-Johnson Temperament Analysis

indirect behavioral assessment methods

o Self-report o Informant report o Behavioral checklists and rating scales

main focus for behavioral assessments

o The counselor focuses on the function of particular behaviors that are within the client's voluntary control rather than a diagnosis o The professional counselor must consider environmental variables affecting the behavior o Antecedents and consequences and characteristics of behavior, such as function, magnitude, frequency, rate, duration, and latency are often measured in behavioral assessment

purpose of personality assessment

o To help the professional counselor and client understand the client's various attitudes, characteristics, interpersonal needs, and intrinsic motivations in order to gain insight into current events, activities, and conflicts and also to generalize this understanding to new situations clients will encounter on their own, both now and in the future. o Personality assessment has the same purposes as most other types of assessment, including screening, diagnosis, treatment planning, and outcomes evaluation.

How to do Kuder-Richardson Formula 20

only applicable to tests whose items are scored as right or wrong or some other all-or-none system, that is, dichotomously scored items where T/F or right/wrong are the only choices. Multiple-choice items are actually dichotomously scored too, because each response is either right or wrong.
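A sketch of KR-20 in Python (hypothetical data; KR-20 is coefficient alpha's special case for dichotomous items, with p × q as each item's variance):

    import numpy as np

    def kr20(items):
        # items: rows = examinees, columns = items scored 1 (right) or 0 (wrong).
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        p = items.mean(axis=0)               # proportion passing each item
        q = 1 - p
        total_var = items.sum(axis=1).var()  # population variance, to match p*q
        return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

    data = [[1, 1, 0, 1, 1], [0, 1, 0, 0, 1], [1, 1, 1, 1, 1], [0, 0, 0, 1, 0]]
    print(kr20(data))  # 0.75 for this toy data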

extreme groups

partition scores into highest, middle, and lowest groups. Establish upper- and lower-performing groups, then contrast the performance of these groups on the individual items or against a criterion score. Helpful in determining deficiencies in test items (particularly in teacher-made multiple-choice tests).

special cases of reliability

power vs. speed tests (both disallow perfect scores)

how do we know it's in the range of scores with SEM

put a normal curve around that score; how big or small the curve is depends on the reliability coefficient. It gives the probability that the true score is a little lower or higher than the score you got.

symbol for alternate-form reliability

r_AB

coefficient of determination

r² • The Pearson r is a measure of the linear relationship between two variables, but it is also used to determine the degree to which the individual differences in one variable can be associated with the individual differences in another variable. • The square of the correlation coefficient, referred to as the coefficient of determination (r²), indicates what proportion of the variance in one of the variables is associated with the variance in the other variable. • Graphical illustration of the coefficient of determination (r²) is provided with Venn diagrams, where each circle represents the variance of a variable. • Let the focus in interpreting r² be on the degree to which the variance in Y is associated with the variance in X (i.e., Y is a dependent criterion variable and X an independent predictor variable). • The larger the overlap between two circles, the higher the proportion of the variance in Y associated with the variance in X.
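A small Python sketch (made-up scores) showing r and r² side by side:

    import numpy as np

    x = np.array([100, 110, 95, 120, 105, 90])    # hypothetical IQ scores
    y = np.array([2.8, 3.2, 2.5, 3.7, 3.0, 2.4])  # hypothetical GPAs

    r = np.corrcoef(x, y)[0, 1]
    print(r, r ** 2)  # if r = .90, r² = .81: 81% of the variance in Y
                      # is associated with the variance in X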

symbol for test-retest reliability

r_tt

formula for calculating reliability coefficient

r=TSV/(TSV+EV); ratio of true score (similarity) to total variance (similarity + error); TSV=true score variance; EV=error variance

Bell-Curve study

race and intelligence (Herrnstein & Murray, 1994); very political and controversial. It reported between-group differences for racial groups within the U.S.: White students above average on measured intellectual ability, Asian students a bit higher than that, Hispanic students about a standard deviation lower, and Black students lower than Hispanic students. These are not racial differences but SES differences: Black students are overrepresented in lower SES groups, and when you compare Black middle-class students to White middle-class students, the differences disappear.

standard error of measurement is for

reliability

Spearman-Brown Prophecy Formula

r_xx = 2r_12 / (1 + r_12), where r_12 is the correlation between the two half-tests

reliability always refers to the

scores, not the test itself; reliability refers to the scores obtained with a test, not to the instrument itself

classification consistency (decision reliability)

shows the consistency with which classifications are made, either by the same test administered on two occasions or by alternate test forms;

broad visualization

spatial reasoning

inter-scorer and inter-rater reliability are influenced by

subjectivity of scoring. The higher the correlation, the lower the error variance due to scorer differences, and the higher the inter-rater agreement

content validity

systematic examination of test content to determine its representativeness of the behavior domain. You won't see this as frequently. Do the items measure what they say they measure? Experts in the topic are called in to determine whether each item measures the intended area; this is expert-related validity, asking experts whether the items match the concept the test was designed to measure. It gets cloudier with personality or career-related constructs. In essence, it is asking people what they think about the items.

central tendency error

tendency to respond with moderate descriptions rather than toward the extremes of a rating scale

using 1 test to make decisions

use a screening procedure to predict the criterion score a client will get in the future, maximizing hits and minimizing misses; linear regression; setting a cutoff score

standard error of estimate is for

validity

scorer reliability (interscorer reliability)=

variance due to scoring errors. Do the scorers on the test agree, that is, does each examinee get the same score across multiple scorers? Agreement goes down with subjective essay questions; the more objective the test, the higher the interscorer reliability. These studies are conducted by simply having two or more independent scorers score a set of protocols; results are correlated and reported as r. For example, have everyone fill out the GAD-7, then take a sample of the papers and have two or three other scorers score them all to see whether they get the same raw scores you did.

crystallized intelligence

verbal comprehension skills

item discrimination

you want the item to be able to discriminate between people who are high and low on the construct, separating respondents into groups according to their standing on it

popular approach

the way it is usually done: use items common to many cultures and validate the test against criteria from each specific culture. Develop a single test that is used across cultures, making it usable with people from different cultural backgrounds.

classical test theory of validity

the types and sources of validity that our examinations will be aligned with

intelligence testing is synonymous with

the terms cognitive ability testing and mental ability testing

short-term acquisition and retrieval (SAR)

working memory (auditory and visual)

Four main variables that maintain or reinforce the performance of target behaviors:

§ 1. Attention § 2. Tangible § 3. Escape § 4. Sensory stimulation

cautions regarding direct behavioral assessment

§ An observer may be biased. § An observer may unintentionally change the operational definition or criterion of a behavior due to habituation. § Clients may change a behavior if they know they are being observed.

advantages of indirect behavioral assessments

§ Easy § Inexpensive § Not time consuming § Practical assessment tools can provide valuable and accurate insight into client behaviors from naturalistic settings

types of family mapping activities

§ Genogram § Family mapping § Ecomap

features of behavioral goals and objectives

§ Measurable § Observable § Positive § Doable

pseudoscience approaches to personality assessment

§ Physiognomy § Phrenology

projective approaches to personality assessment

§ Projective assessments present clients with unstructured, ambiguous stimuli and allow a virtually unlimited range of potential responses, based on the assumption that essential information about a client's personality characteristics, needs, conflicts, and motivations will be transferred onto the ambiguous stimuli § Based on the psychoanalytic notion of the unconscious and many Freudian concepts: unconscious things coming out, what was on their mind at that time § Projective techniques are better used as clinical tools rather than as tests, per se § If someone is being open, honest, and straightforward with you, you don't need projective tests § You need projectives when you don't think someone is being totally honest with you: you don't want them to know what the "right" answer is, so their responses come straight out of their experiences and give you more information about them

some commonly used projective techniques

§ Thematic Apperception Test § Children's Apperception Test § Roberts Apperception Test for Children—Second Edition § House-Tree-Person § Kinetic Drawing System for Family and School § Forer Structured Sentence Completion Test

weaknesses of projective techniques

§ They are expensive to administer, score, and interpret. § Subjective scoring and interpretive procedures make results difficult to replicate. § Scorer reliability, test-retest, and internal consistency coefficients tend to be unacceptably low, as is projective score validity. § Most projective tests have either absent or inadequate norms. § Projective techniques are susceptible to outside influences, such as examiner characteristics, examiner bias, and variations in administration directions. § It is impossible to study psychoanalytic theory, given its emphasis on unconscious psychological processes.

Strengths of projective techniques

§ They are great icebreakers and rapport builders because they are perceived as nonthreatening. § Clients are not limited in the number or type of responses they can make. § Responses are more difficult to fake than for structured tests. § They may have valuable cross-cultural applications. § Complex, multidimensional themes may emerge and provide valuable insights into the client's personality § Just something to use for hypothesis testing

5 factor model

§ Traits defined as dimensions of individual differences in tendencies to show consistent patterns of thoughts, feelings, and actions § Neuroticism, extraversion, openness, agreeableness, and conscientiousness

halo effect

§ the tendency for a general impression to color specific ratings, for example, rating a high-performing student as well-behaved regardless of the actual behaviors observed

partial interval recording

· An observer marks each interval whenever a behavior occurs at least once anytime in the interval.

self-monitoring

· Clients observe and record their own behavior. · This is an effective way to monitor infrequent behaviors and internalizing problems.

narrative recording

· The professional counselor records what is observed anecdotally: o Antecedents o Behavior o Consequences o Function

Rorschach Inkblot Test

· The Rorschach is often interpreted by a seat-of-the-pants method, forming an impression from responses, which is not a very scientific approach; it has a bad reputation around reliability and validity, and it takes a long time to learn to score and interpret. It comes from the psychoanalytic paradigm: from your responses we infer what is in your unconscious mind. o "What do you see, and what parts of it make it look like that to you?" (asking for more detail) o The pattern of responses helps you make diagnostic assessments

4 ways to reduce bias in assessment

• 1. Choose assessments that are appropriate to use with multicultural populations; don't use assessments that have been shown to be biased against one cultural group or another. • 2. Use instruments that provide norms for the specific client population that is being assessed; if you must use an instrument with an inherent bias, provide contextual information when interpreting. For example, the Beck Depression Inventory uses the same cut score even though men usually score lower than women, so why use the same cut score? • 3. Provide assessment instruments that use the most clear and understandable language for the client population: easy to read and understand, with no confusion about what test-takers are supposed to do. • 4. Consider how age, color, culture, disability, ethnic group, gender, language, religion, sexual orientation, and socioeconomic status affect test administration and interpretation.

guideline for interpreting size of Pearson r

• A guideline for interpreting the size of the Pearson r is based on its absolute value (sign ignored) as follows: • .90 to 1.00 = very high correlation • .70 to .89 = high correlation • .50 to .69 = moderate correlation • .30 to .49 = low correlation • .00 to .29 = very low (if any) correlation
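The guideline is easy to encode; a sketch in Python (the function is just illustrative):

    def interpret_r(r):
        # Based on the absolute value of Pearson r (sign ignored).
        a = abs(r)
        if a >= .90: return "very high correlation"
        if a >= .70: return "high correlation"
        if a >= .50: return "moderate correlation"
        if a >= .30: return "low correlation"
        return "very low (if any) correlation"

    print(interpret_r(-.62))  # "moderate correlation"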

limitations of interviewing

• A limitation of interviewing is that it has lower levels of reliability and validity than more standardized inventories

classical model of reliability

• A person's true score is the score the individual would have received if the test and testing conditions were free from error, an error-free score; but no test is perfect. • Systematic error remains constant from one measurement to another and leads to consistency; it is "good" variance, actually a positive thing, because it helps maintain consistency. Random error also exists and impacts reliability.

how does alternate forms reliability counteract practice effects

• Alternate forms reliability counteracts the practice effects that occur in test-retest reliability by measuring the consistency of scores on alternate test forms administered to the same group of individuals. Examinees can't simply remember their responses from the first administration, because they receive a completely different set of items. • Symbolized r_AB

nature and interpretation of Pearson r

• Although a scatterplot of the relationship between two variables, X and Y, is useful, more accuracy is needed to determine the presence (or absence) of a linear relationship between X and Y, its direction (positive, negative), and strength. • For variables that are interval or ratio in nature, such information is provided by the Pearson product-moment correlation coefficient (Pearson r). • Pearson r summarizes the relationship between two variables as a single number.
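A minimal sketch in Python (hypothetical interval-level data; scipy's pearsonr also returns the p value used in significance testing):

    from scipy.stats import pearsonr

    x = [100, 112, 97, 120, 104, 93, 108]  # hypothetical variable X
    y = [15, 20, 14, 24, 18, 13, 19]       # hypothetical variable Y

    r, p_value = pearsonr(x, y)
    print(r)        # one number summarizing direction and strength
    print(p_value)  # used when testing r for statistical significance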

things described in a mental status examination (MSE)

• Appearance, attitude, and behavior • Cognitive capabilities • Speech and language • Thought content and process • Emotional status • Insight and judgment

confidence intervals interpretations

• At the 95% level of confidence we say, "We have confidence that if the student took 100 alternate forms of this test, his/her true score performance has a 95% probability of falling within the calculated confidence interval." This provides a range, with percentiles and interpretive ranges. • Alternatively: "We have confidence that if the subject took 100 alternate forms of this test, his/her performance would fall within the calculated confidence interval 95% of the time." (New phrasing to add to test interpretations for Exam 2.)

Considerations to Take Into Account When Using a Particular Test

• Become familiar with estimates of test score reliability. • Consider the size and makeup of the samples used in reliability and validity studies. • The norming samples should be representative of the clients with whom you plan to use the test. • Examine any test you use for biased items. • Use caution in applying scores from tests that place a client at a disadvantage due to linguistic differences. • Ethnicity can be a source of response variation in testing. • Do not assume that the name of the test or scale accurately reflects the actual meaning of the test score. • Use more than one test or scale to increase the accuracy of assessment.

what is clinical assessment

• Clinical assessment is defined as the measurement of clinical symptoms and pathology in the human condition. • Clinical assessment is probably not necessary when a client seeks counseling for self-growth or a personal or interpersonal problem not amenable to clinical diagnosis, but personality tests can be helpful in deepening understanding.

more description on factor analysis

• Factor analysis works like a coin-sorting machine: through many inter-item correlations, it identifies items that correlate with each other and drops them onto a latent trait/variable/factor, doing the same with dissimilar items. Items tapping anxiety, depression, and schizophrenia will separate into different factors, which helps us develop factors and name subscales. • It simplifies how we make sense of an individual score, because items are grouped into subscales that are more internally consistent: more reliability, and then more validity, because related items are grouped together. • The items across these factors form a model, and the model helps us interpret scores into the future: add up these 7 items and that becomes the anxiety subscale or depression subscale, and use that subscale score to predict a criterion in the future. • Knowing the factors that go with the test can also help with treatment planning: if a client is high on the somatic but not the cognitive side of anxiety, that can shape what direction treatment takes.
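A hedged sketch of the "coin sort" idea in Python (simulated hypothetical data; scikit-learn's FactorAnalysis drops correlated items onto latent factors):

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(0)
    # Simulate two latent traits (say, anxiety and depression), each driving
    # three of six observed items, plus noise.
    traits = rng.normal(size=(200, 2))
    loadings = np.array([[1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
                         [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])
    items = traits @ loadings + rng.normal(scale=0.5, size=(200, 6))

    fa = FactorAnalysis(n_components=2).fit(items)
    print(fa.components_.round(2))  # loadings show which items group on which factor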

concurrent criterion-related validity

• Comparison of the test and criterion at the same point in time. This yields a higher correlation than predictive validity because no time elapses in between, so nothing can happen over time. • Example: Is John learning disabled right now, at this point in time?

predictive criterion-related validity

• Comparison of the test today with the criterion at some future point in time: comparing test scores today with a criterion collected in the future. • Example: Is John likely to become learning disabled at some point in the future?

behavioral rating scales and inventories

• Conners 3 • Attention Deficit Disorders Evaluation Scale—Third Edition • Behavior Assessment System for Children—Third Edition • ACTeRS

Clinical and Personality Test Content and Interpretation Is Developed Using

• Content validation • Theory • Empirical-criterion keying • Factor analysis

how do you get criterion-related validity evidence

• Criterion-related validity is derived from comparing scores on the test to scores on a selected criterion; it is the efficiency of a test in predicting behavior under a special circumstance • Generally provided as evidence based on a criterion • You compare scores on the test you are developing to scores on another measure, the criterion 1. You can collect the criterion at the same time or at some point in the future 2. The difference between the two is the time that elapses between the test itself and collection of the criterion

Phase 2: Treatment Phase B

• Data relative to the intervention is gathered during this phase. • Professional counselors attempt to institute the least intrusive intervention that can be of reasonable benefit to the client. In some cases, it is important to obtain therapeutic gains as fast as possible and to be slightly more intrusive (i.e., client has self-injurious behavior). • More intrusive interventions may also be appropriate when it is difficult to correctly identify aspects of treatment that are critical to success. • Consider what would be in the best interest of the client. If no treatment effects are evident, the professional counselor should revise the protocol or try another approach. • In the case of ineffective treatments in a within-series design, the failed intervention becomes another baseline phase. The design is then changed to an A-B-C, where phase C is the additional treatment phase. • If symptoms deteriorate during a treatment phase, treatment should be stopped to determine if symptoms improve, stabilize, or continue to deteriorate. • If the client shows improvement during the treatment phase, the professional counselor should focus on evaluation of the treatment components and attempt to replicate the results with the identified client across time, across settings, and with other individuals who display similar behaviors.

making decisions using a single test

• Decision theory involves the collection of a screening test score and a criterion score, either at the same point in time or at some point in the future; you use the screening test to predict the criterion score the client will get in the future

info to gather in intake interview

• Demographic information • Referral reasons • Current situation • Previous assessments and counseling experiences • Birth and developmental history • Family history • Medical history • Educational and work background

Assessing Suicidal intent risk factors

• Depression and other mental disorders, or a substance-abuse disorder • Prior suicide attempt • Family history of mental disorder or substance abuse • Family history of suicide • Family violence, including physical or sexual abuse • Firearms in the home • Incarceration • Exposure to the suicidal behavior of others, such as family members, peers, or media figures • PIMP • SLAP • SAD PERSONS

Screening

• Determining whether there is a problem • Detection • Case identification • Response = yes or no • If yes, proceed to assessment • If no, stop! No need for intervention or further services at this time

how to establish evidence for construct validity

• Evidence for construct validity is established by defining the construct being measured and by gradually collecting information over time to demonstrate or confirm what the test measures

face validity

• Face validity is not a real type of validity, but what the test superficially appears to measure (i.e., "looks valid"). • Face validity is derived from the obvious appearance of the measure itself and its test items, but it is not an empirically demonstrated type of validity. • It is almost like an opinion that you may have. • Self-report tests with high face validity can face problems when the trait or behavior in question is one that many people will not want to reveal about themselves: if a test obviously measures depression and someone wants to hide their level of depression, they may not be honest. Some people will try to shade it; substance abuse items are a common example, since participants can see through the items and be less honest. • Include a social desirability scale when doing self-reports to see how respondents are trying to make themselves look, which gives more information about them.

random error affects reliability due to....

• Generally due to the way measurement is occurring, not the content itself; for example, being more distractible at the first administration, or getting less sleep before the retake (environmental issues) • Fluctuations in the mood or alertness of persons taking the test due to fatigue, illness, or other recent experiences • Incidental variation in the measurement conditions due, for example, to outside noise or inconsistency in the administration of the instrument • Differences in scoring due to factors, such as scoring errors, subjectivity, or clerical errors • Random guessing on response alternatives in tests or questionnaire items

get reliability with confidence intervals from

• Get reliability from group testing and apply to individual results

effective interviewing with clinical interviews

• Identify client problems early • Obtain necessary information related to the problems (e.g., antecedent, consequence) • Assess client functioning, intellectual level, and psychosocial development • Examine the effects of an intervention during and after the intervention • Require counselors to establish rapport and have effective facilitative skills

Establishing the First Phase: Baseline A

• If a professional counselor is not objective in the problem identification phase, then evaluation of change can only be subjective. • Professional counselors should visually inspect the level and trend in their baseline data to ensure stability prior to implementing interventions. • Regardless of phase, trend should be level or opposite of treatment effect.

being purposeful in planning withdrawal

• If behaviors in the withdrawal phase resemble those of the baseline phase, treatment can be considered successful. Through a return to baseline in the second A phase, professional counselors can infer that the changes are due to the removal of treatment. • The treatment phase is reintroduced as a second phase B. • Counselors using an A-B-A-B design can increase internal validity by demonstrating an opposite pattern in trend and variability between the A and B phases.

strengths of interviewing

• In-depth analysis of issues • Flexibility in how the information is garnered • Instantaneous clarification of ambiguous information

elements within discrimination

• Indices of item discrimination • Extreme groups

factors impacting Pearson r

• It is important to keep in mind that the Pearson r is an index of the linear relationship between two variables. • Therefore, r = 0.00 indicates that there is no linear relationship between the two variables, but there still might be some kind of (nonlinear) relationship between them. • In some cases, r = 0.00 because there is a curvilinear relationship between two variables (e.g., age and physical strength). • Other times, r = 0.00 when calculated over a restricted range of variable values, although there is a linear relationship between the two variables over a larger range of their measures. • There may not be a linear relationship between two variables for a sample of persons (r = 0.00), but there might be a linear relationship between the variables for some subgroups of persons from the total sample. • For example, there may be a positive linear relationship between X and Y for one subgroup (e.g., females) and, conversely, a negative linear relationship between X and Y for another subgroup (e.g., males), even though there is no linear relationship (r = 0.00) for the total sample. • In this case, correlation coefficients by separate subgroups are more useful than r = 0.00 for the entire sample.
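A small simulated demonstration of the restricted-range point in Python (made-up data; exact values vary with the random seed):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=2000)
    y = x + rng.normal(size=2000)  # a built-in linear relationship

    r_full = np.corrcoef(x, y)[0, 1]

    keep = x > 1.0                 # restrict the sample to a narrow slice of X
    r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]

    print(r_full, r_restricted)    # the restricted r comes out noticeably lower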

Testing Pearson r for statistical significance

• It is important to use the Pearson r as an inferential statistic to determine whether a linear relationship between X and Y exists in the entire population to which the sample belongs.

Linear transformation and Pearson r

• Linear transformations on variables X and/or Y do not affect the size of the Pearson r. • Linear transformations commonly occur when a client's raw score on a norm-referenced psychological or educational test is transformed into a standard score, such as a deviation IQ score (M = 100; SD = 15), T score (M = 50; SD = 10), or z-score (M = 0; SD = 1). • The correlation coefficient does not change when the values of X and Y are transformed into standard (z-) scores ( μ = 0; σ = 1) or other scales such as the T score ( μ = 50; σ = 10) and norm curve equivalent (NCE) scale ( μ = 50; σ = 21).
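A quick check in Python (hypothetical raw scores; converting X and Y to z-scores or T scores leaves r unchanged):

    import numpy as np

    x = np.array([12, 15, 9, 20, 14, 11], dtype=float)
    y = np.array([30, 34, 25, 41, 33, 28], dtype=float)

    def to_z(v):
        return (v - v.mean()) / v.std(ddof=1)

    r_raw = np.corrcoef(x, y)[0, 1]
    r_z = np.corrcoef(to_z(x), to_z(y))[0, 1]
    r_t = np.corrcoef(50 + 10 * to_z(x), 50 + 10 * to_z(y))[0, 1]
    print(r_raw, r_z, r_t)  # all three values are identical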

discriminant validity evidence

• Correlate a depression test with a measure of a theoretically unrelated construct; the correlation should be low, close to 0, to show the two are not related, and that is evidence the test measures what it is supposed to measure. Common method variance is why the correlation is not exactly 0: if you collect evidence on the depression scale and the other scale using the same method (e.g., self-report), there will be a small correlation anyway. So measure the constructs with two different methods of data collection to show they really are unrelated. • A correlation of about .20 is wanted for this type of validity.

Some Commonly Used Clinical Assessment Inventories

• Minnesota Multiphasic Personality Inventory—Second Edition • Millon Clinical Multiaxial Inventory—III • Achenbach System of Empirically Based Assessment • Beck Depression Inventory—Second Edition • Beck Anxiety Inventory • Substance Abuse Subtle Screening Inventory—4 • Eating Disorder Inventory—3

correlation between 2 variables

• Most people have an intuitive understanding of a correlation between two variables. • For example, when asked about the correlation between IQ and GPA, people usually say something like, "The higher the IQ, the higher the GPA of the students."

Factors That Influence Student Test Performance and Item Responses

• Motivation • Anxiety • Coaching • Test sophistication • Acquiescence • Response format • Reactive effects • Response bias • Physical or psychological • Social desirability • Environmental variables • Cultural bias • Examiner-examinee variables • Previous testing experiences

Diagnostic/Clinical Assessment

• Next step after "positive screen" • Intended to clarify nature of problem and severity of problem: • What is the problem? • To what extent is it a problem? • Helps counselor understand client's unique situation • Individualizing client case

experimental interventions (distinct groups)

• Not used nearly enough. You should be able to use a depression inventory as the pre-test and post-test in a study with depression as the DV; in a clinical trial using therapy to decrease depression, scores should go lower (and there should be a control group too). • If scores get lower over time, that is evidence the scale measures depression. Compare pretest and post-test scores while using an intervention program. Example: if the intervention is effective and the test measures the construct, scores on the posttest should be significantly better than scores on the pretest (e.g., depression, anxiety).

SSRD Simple Phase Change A-B Designs

• Often denoted as an A-B design, a simple phase change (SPC) is used to answer two simple questions: "Is treatment effective?" and "Does one treatment work more convincingly than another?" • A denotes the baseline phase. • B denotes the treatment or intervention phase. • Professional counselors will hope to see a change in level, trend, or variability due to the introduction of the treatment. • A professional counselor can determine the significance of data changes in SPC by determining which changes (a) have the greatest magnitude, (b) are closely associated in time with that of the phase change, and (c) show consistency throughout the phase. • Professional counselors can use the A-B-A-B design to generate confidence in their assessment procedures, treatment procedures, and subsequent outcomes.

elements within difficulty

• Percentage passing • Distribution of test scores

clinical vs. personality assessment

• Personality assessment is the measurement of client traits, needs, motivations, attitudes, or other facets that describe how the client interacts with the external environment, others within that environment, and within the client's internal world. • Understanding the personality characteristics of a client can be helpful, but is not always essential for effective treatment. • Clinical and personality assessment are not mutually exclusive.

ethical ramifications of removing treatment

• Professional counselors are often concerned about the ethical ramifications of removing a treatment that is showing positive effects in the first B phase, but it must be remembered that counseling is time-limited and treatment must be concluded at some point; removal of treatment in the second A (withdrawal) phase can be a short experiment and serves to promote the independence of the client

reversal design

• Professional counselors can be purposeful in planning a withdrawal phase (e.g., a reversal design, or A-B-A-B) • In a reversal design, baseline data is collected in the initial phase A, followed by the treatment phase B; in the second A phase, instead of collecting baseline data, the treatment is removed and changes relative to the removal of the treatment are recorded

Benefits of Within Series Designs

• Professional counselors can draw causal inferences between an intervention and changes in behavioral data. • Professional counselors can provide evidence of the effectiveness of an intervention with minimal changes in a treatment phase. • In cases where behaviors are known to be rather resistant to change without treatment, professional counselors can draw causal inferences by collecting lengthy and consistent baseline data.

Combined Elements

• Professional counselors often need to evaluate the effectiveness of an intervention on multiple targets (i.e., dependent variables) at different points in time. • Multiple Baseline (MBL) • A multiple baseline design (MBL) is simply two or more replicated simple phase changes, duplicated across two or more series categorized by time, setting, subject, or any combination thereof. Phase changes occur at different points in real time and follow first phases of different lengths. Behavior change is seen in the interrupted series before phase changes occur in the noninterrupted series. • Beneficial when a researcher wants to provide for an internal validity check on a simple phase change. • Counselors can answer the following questions: • Does an intervention work? • Does one intervention work better than another given that both are effective to some degree? • Considering two interventions, does either work and does one work better than the other? • Which components of an intervention make it effective? • What is the optimal level of intervention?

Professional counselors must assess for risk factors and lethality

• Professional counselors should also conduct a comprehensive interview to understand the client demographics, psychosocial development, psychiatric disorders, history of suicide attempts and symptoms, and resiliency and support factors • Suicidal ideation screening instruments: • Columbia Suicidality Severity Rating Scale • Suicide Probability Scale • Beck Scale for Suicidal Ideation

reliability=proportion of what

• Reliability indicates what proportion of the observed score variance is true score variance.

Purposes of Evaluation

• Screening: "Is there a reason for further evaluation?" • Diagnosis: "Are diagnostic criteria met?" • Assessment: "To what extent and in what ways are there problems?" • Motivation: "How ready is the person for change?" • Treatment planning: "What services are needed?" • *Follow-up: "What has changed? What is still needed?"

clinical judgment vs. statistical models

• Statistical models are at least as accurate as, and usually superior to, clinical judgment. When clinical judgment disagrees with the statistical model, the experienced clinician usually realizes that it is best to collect more information to arrive at a more reasoned decision that one can endorse with greater confidence

caution with statistical regression with confidence intervals and SEM

• Statistical regression pulls scores toward the mean: the higher and more extreme the score, the more it is pulled toward the mean. Set this aside for now, but know that a more extreme score will influence this confidence-interval range more.

3 types of clinical interviews

• Structured • Semi-structured • Unstructured

test-retest reliability AKA

• Test-retest reliability, also known as temporal stability, is the extent to which the same persons consistently respond to the same test administered on different occasions.

other types of correlation coefficients

• The Pearson r is used to determine the relationship between two variables derived from interval or ratio scales. • Many other coefficients have been developed to analyze the relationships between variables from various combinations of scaling methods (e.g., nominal, ordinal, interval, ratio).

Spearman Rho

• The Spearman rho (rank correlation coefficient; rho)—only takes the subject's position into account (i.e., ordinal scales)
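A sketch in Python (hypothetical rank orders; scipy's spearmanr works on ordinal data):

    from scipy.stats import spearmanr

    judge_a = [1, 2, 3, 4, 5, 6]  # hypothetical rankings from two judges
    judge_b = [2, 1, 3, 5, 4, 6]

    rho, p_value = spearmanr(judge_a, judge_b)
    print(rho)  # agreement between the two rank orders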

classical assumption of reliability=

• The classical assumption is that any observed score consists of both the true score and error of measurement. We usually think "I got a lower score than I deserved," but error goes in both directions.

correlation and causation

• The coefficient of determination (r2) indicates what proportion of the variance in Y is associated with the variance in X, but this does not necessarily mean that individual differences in Y are caused by individual differences in X. • An easier way to say this is "correlation does not necessarily mean causation." • High (positive or negative) correlation between X and Y indicates that scores on Y can be accurately predicted from scores on X, but this does not imply that changes in X cause changes in Y.

Using SPSS to compute Pearson r

• The correlation coefficients (for each pair of variables) are summarized in a correlation matrix obtained through the use of SPSS. • The correlation matrix in the SPSS printout provides the correlation coefficients, their p values, and the sample size. • The p value indicates the statistical significance of each correlation, not its size.
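For readers without SPSS, a rough pandas analogue (hypothetical data; note that pandas' corr() reports the coefficients but not their p values):

    import pandas as pd

    df = pd.DataFrame({
        "iq":  [100, 112, 97, 120, 104, 93],
        "gpa": [2.8, 3.3, 2.5, 3.8, 3.0, 2.4],
        "sat": [1050, 1190, 990, 1290, 1100, 960],
    })

    print(df.corr())  # correlation matrix for each pair of variables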

2 additional assumptions of classical test theory

• The distribution of observed scores that a person may obtain under repeated independent testing with the same test is normal; you should get about the same score when you take the test again • The standard deviation of this normal distribution, referred to as the standard error of measurement (SEM), is the same for all persons of a given group taking the test. We apply this to the individual score just obtained on a test to decide whether it reflects the true score and what the error band is. • You only know the true score if there is error-free measurement; you always know the observed score, so whether the observed score equals the true score depends on how much consistency there is in measurement.

main focus and overview of content validity

• The main focus is on how the instrument was constructed and how well the test items reflect the domain of the material being tested. • This type of validity is widely used in educational testing and in tests of aptitude or achievement. • Determining the content validity of a test requires a systematic evaluation of the test items to determine whether adequate coverage of a representative sample of the content domain was measured; this is done in a subjective rather than truly objective way.

major problem for test-retest reliability and when to not use it

• The major problem is the potential for carryover effects between the two administrations. It is often not an essential type of reliability to report, but you want to report it to know what can happen. • Thus, it is most appropriate for measurements of traits that are stable across time (e.g., medication monitoring).

Counseling, Diagnosis, and the DSM-5

• The more one consciously integrates assessment procedures and outcomes research into one's practice, the more objective and informed one's practice becomes. • The DSM-5 provides specific criteria through which reliable diagnoses can be made, as well as a common language for mental health professionals to use when communicating. • The specificity of the criteria allows clinicians to reliably determine whether the disorder applies to a given client. • It is common for a client to obtain multiple diagnoses, which is referred to as comorbidity.

Interpreting Pearson r

• The reliability of the observed measures for X and Y influences the size of their correlation coefficient, r. • The reliability of measures indicates the degree to which these measures are free of error. • For example, a reliability coefficient of .85 indicates that 15 percent of the variance in the observed scores is due to measurement error or, equivalently, 85 percent of the observed score variance is true score variance. • The lower the reliability for X and/or Y, the lower the Pearson r gets compared to its "true" size.
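
The notes do not give the formula, but the standard way to estimate how much low reliability has shrunk an observed r is Spearman's correction for attenuation; a minimal sketch with hypothetical numbers:

    import math

    r_observed = 0.50   # Pearson r actually obtained
    r_xx = 0.80         # reliability of measure X
    r_yy = 0.70         # reliability of measure Y

    # Estimated correlation between error-free (true) scores
    r_corrected = r_observed / math.sqrt(r_xx * r_yy)
    print(f"estimated error-free r = {r_corrected:.2f}")  # about 0.67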

calculating variance accounted for by reliability coefficient

• To obtain the variance accounted for by a reliability coefficient, you do not square the coefficient; reliability is the one exception to the usual rule of squaring a correlation to get variance accounted for. • Reliability is the similarity in scores; it therefore determines the error variance (variance due to any condition irrelevant to the test's purpose).
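
A tiny worked example of this (the reliability value is hypothetical):

    # A reliability coefficient is already a proportion of variance,
    # so it is not squared the way an ordinary correlation is.
    reliability = 0.85
    true_variance_pct  = reliability * 100        # 85% true score variance
    error_variance_pct = (1 - reliability) * 100  # 15% error variance
    print(f"{true_variance_pct:.0f}% true score variance, "
          f"{error_variance_pct:.0f}% error variance")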

other ways to reduce test bias in assessment

• Understand the client's worldview to determine appropriate assessment methods. • Consider a client's level of acculturation (i.e., number of generations in a new culture, language preferred, extent to which the client socializes or comes in contact with those outside his or her own cultural group). Be knowledgeable about the culture of the client being assessed. • Avoid relying solely on cultural stereotypes; there are greater differences among individuals within a culture than there are between cultures.

times that reversal design may not be appropriate

• When an intervention is eliciting positive results and reducing harm to an individual (e.g., when working successfully with people who are demonstrating self-injurious behaviors) • When addressing academic issues because the information cannot be unlearned: it would be difficult to demonstrate a return to baseline once an intervention is removed

whole purpose of decision theory using single test

• The whole purpose is to set the cut score, because the right cut score leads to the greatest proportion of accurate decisions; terms like sensitivity and specificity come from decision theory and relate to diagnostic validity. • You want to maximize hits (valid acceptances and valid rejections) and minimize misses (false rejections and false positives). • This generally involves one test score, but when making diagnostic decisions many more test scores are needed to arrive at a diagnosis. • Administer the screening test, collect the criterion, and then infer decisions based on that.
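
A rough illustration of counting hits and misses from a screening decision table; the counts are hypothetical, and sensitivity and specificity here follow the conventional true-positive-rate and true-negative-rate definitions:

    # Hypothetical outcomes of a screening test checked against a criterion
    true_positives  = 40   # valid acceptances
    true_negatives  = 40   # valid rejections
    false_positives = 10
    false_negatives = 10   # false rejections

    total = true_positives + true_negatives + false_positives + false_negatives

    hit_rate    = (true_positives + true_negatives) / total            # accurate decisions
    sensitivity = true_positives / (true_positives + false_negatives)  # true-positive rate
    specificity = true_negatives / (true_negatives + false_positives)  # true-negative rate

    print(f"hits = {hit_rate:.0%}, sensitivity = {sensitivity:.0%}, "
          f"specificity = {specificity:.0%}")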

Withdrawal

• Withdrawal: when a client drops out of a treatment; can occur naturally (e.g., the client may move away, decide not to participate in treatment any longer due to deleterious effects, or may discontinue due to positive effects of treatment)

self-fulfilling prophecy

• clients may actually change thoughts, feelings, or actions to align with the perceived expectations of the interviewer. • Both clinician beliefs and test results can be biased.

validity generalization

• criterion-related validity is a good estimate of the usefulness of test scores only under the conditions, and with samples like those, from which the validation scores were collected

dots

• each dot on the scatterplot represents one person's score on the X and Y variables

problems with ACTeRS

• The ACTeRS (ADD Comprehensive Teacher Rating Scale) is good for use in schools, but it has several problems: • Anchors 2, 3, and 4 are not labeled. • Items are grouped together and labeled by category, so raters know what each section is focused on; the labels effectively tell a teacher what to endorse if they want a child medicated. • Items are very subjective and not operationalized. • Face validity: the items seem misaligned and go a bit overboard. • It never says the material in parentheses is only an example, so a rater could treat it as the sole instance that counts. • Items are all worded directionally: categories are either all positive or all negative (items run in one direction, then the next five all in the opposite direction; in the bottom box, high scores are bad and low scores are good). • Positively and negatively worded items should be interspersed so the scale does not forecast to test takers what to mark if they want a specific result. • The scale is popular but not necessarily psychometrically sound.

correlations with other tests measuring the exact same construct

• The comparison tests have to be measuring exactly the same construct. • Administer the new scale along with other currently available scales that measure the same thing; it should show a moderate to large correlation with those tests. • This is treated as necessary and sufficient evidence of measuring the same (or a different) concept, though correlation is not causation and the overlap is never 100%, so it is evidence rather than definitive proof. • Moderate, but not high, correlations are desirable. • Tests should not depend on reading comprehension.

developmental changes

• If the construct purports to change over time as someone gets older (e.g., math scores should rise as a student moves from 1st grade to 12th grade), then the average score should increase with each year of age. • Age (grade) differentiation refers to an increase or decrease in scores with increasing age. • This is a necessary but not sufficient condition for validating IQ and achievement tests: if the theory proposes that scores should get higher over time and they do, that supports the construct; if they do not, the test developer must explain what went wrong. • The pattern has to be there, but its presence alone is not direct evidence of the construct itself.

mental status exam (MSE)

• is a brief, organized interview that screens or assesses a client's emotional, intellectual, and neurological functioning; using any one of the three types of clinical interviews, the clinician assesses and describes each of these areas of the client's functioning

SEM formula application

• In terms of an individual's score of 110, if we assume a score reliability of .90 (so 1 SEM = 4.74 standard score points on an IQ-style scale), this means that: • 1 SEM is 4.74 standard score points at the 68% LOC, or 110 ±4.74 = 105.26−114.74 (i.e., the individual's true score probably lies within the IQ range of 105.26−114.74 on 68 out of 100 alternate-form administrations of the test) • 2 SEM is 9.48 standard score points at the 95% LOC, or 110 ±9.48 = 100.52−119.48 (i.e., the true score probably lies within 100.52−119.48 on 95 out of 100 administrations) • 2.58 SEM is 12.23 standard score points at the 99% LOC, or 110 ±12.23 = 97.77−122.23 (i.e., the true score probably lies within 97.77−122.23 on 99 out of 100 administrations) • 3 SEM is 14.22 standard score points at the 99.7% LOC, or 110 ±14.22 = 95.78−124.22 (i.e., the true score probably lies within 95.78−124.22 on 99.7 out of 100 administrations)
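
These bands can be reproduced directly; a minimal sketch assuming an IQ-style scale with SD = 15 (output matches the notes' ranges to within rounding):

    import math

    sd, reliability, observed = 15, 0.90, 110
    sem = sd * math.sqrt(1 - reliability)   # about 4.74 standard score points

    # z-multipliers and levels of confidence used in the notes
    for z, loc in [(1, "68%"), (2, "95%"), (2.58, "99%"), (3, "99.7%")]:
        low, high = observed - z * sem, observed + z * sem
        print(f"{loc} LOC: {low:.2f} to {high:.2f}")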

hypothesis confirmation bias

• occurs when interviewers develop hypotheses to explain the concerns being presented by a client and then proceed to ask questions and elicit responses that confirm those hypotheses.

no correlation

• there is no correlation between two variables when neither a positive nor a negative linear relationship exists (circular distribution of scores)

SEM as related to confidence intervals %s

• ±1 SEM forms a 68% level of confidence (LOC) • ± 2 SEM forms a 95% level of confidence (LOC)-this is our default • ± 2.58 SEM forms a 99% level of confidence (LOC) • ± 3 SEM forms a 99.7% level of confidence (LOC)

