NCE: Assessment


Army Alpha & Army Beta

- Army Alpha: a verbal group intelligence test developed during World War I to screen literate, English-speaking recruits.
- Army Beta: a language-free (nonverbal) test designed for individuals who could not read or were foreign born.

Evaluation Procedures Used in Counseling

- Clinical interviewing: structured, semi-structured, unstructured
- Informal assessment: observation of behavior, rating scales, classification techniques, records, personal documents
- Personality assessment: standardized tests (e.g., MMPI), projective tests (e.g., TAT), interest inventories (e.g., Strong Interest Inventory)
- Ability assessment: achievement tests (e.g., WRAT), aptitude tests (e.g., SAT), intelligence tests (e.g., WISC)

Whiston (2013) provides professional counselors with a five-step process for evaluating counseling outcomes:

1. Defining the evaluation study focus. Professional counselors must first determine exactly what they want to evaluate. The focus of the evaluation study may be a specific service or services, a particular treatment or counseling intervention, or a program.

2. Determination of the evaluation design. Once the focus of the evaluation study is solidified, professional counselors must decide how they are going to evaluate outcomes by selecting an evaluation design. One of the most common evaluation designs is to administer a pretest, provide a counseling intervention (e.g., program, specific treatment), and then administer a posttest. The professional counselor can then compare the results of the pretest to the results of the posttest to determine whether the counseling intervention was effective. Counselors may also employ a qualitative evaluation design that involves interviewing participants about their experiences regarding a particular phenomenon.

3. Selection of participants. Next, professional counselors need to select which clients will participate in their evaluation study. Professional counselors may invite all clients to participate, use a random sample of clients, or use a subsection of the client population (e.g., women, adolescents, Latinos). Regardless of the type of participants selected, Whiston (2013) recommends that professional counselors involve as many participants as feasible to obtain a wide variation of viewpoints and increase the study's experimental validity.

4. Selection of assessments. In addition to selecting participants, professional counselors should decide which assessments they will use to measure a particular counseling outcome. Evaluation studies often use existing assessments that have strong validity and reliability, are symptom based, and examine patterns in client change over time. In some cases, professional counselors create a study-specific survey to assess outcomes.

5. Analysis of data. Once information regarding counseling outcomes has been gathered from participants, professional counselors must analyze the data to determine the effectiveness of counseling. For quantitative data, where the results involve numbers, counselors can analyze whether a particular service or intervention produced statistically significant change and was therefore effective. Qualitative data, on the other hand, involve participant narratives and transcripts that must be coded and analyzed for themes in order to evaluate the effectiveness of a particular service or intervention.

As a professional counselor, it is important to remember that assessment and evaluation procedures focus on client strengths, wellness characteristics, and areas of growth.

Scale

A scale refers to a collection of items or questions that combine to form a composite score on a single variable. Scales can measure discrete or continuous variables and can describe the data quantitatively or qualitatively. Quantitative data are numerical, whereas data presented qualitatively use forms other than numbers (e.g., "Check Yes or No").

The Relationship Between Validity and Reliability

Although scores produced by instruments must have both reliability and validity to be considered credible, test scores can be reliable but not valid. However, valid test scores are always reliable.

Bias in Assessment

Bias in assessment is a broad term that refers to an individual or group being deprived of the opportunity to demonstrate their true skills, knowledge, abilities, and personalities on a given assessment. Bias can result from the test itself, the examinee, the examiner, the testing context, or global systems that affect the examinee.

• Test bias occurs when the properties of a test cause an individual or particular group of individuals to score lower (negative bias) or higher (positive bias) on the test than the average score for the total population.
• Examiner bias occurs when the examiner's beliefs or behavior influence test administration. For example, a professional counselor may believe that an international student does not understand English very well and, as a result, unintentionally alter the standardized administration procedures.
• Interpretive bias occurs when the examiner's interpretation of the test results provides an unfair advantage or disadvantage to the client. For example, a professional counselor who is aware that international students often experience distress while adjusting to American culture may immediately look for signs of emotional distress when interpreting the student's results.
• Response bias occurs when clients use a response set (e.g., all yes or all no) to answer test questions. For example, an international student may be embarrassed that she does not understand certain items on an assessment due to differences in culture. Therefore, she decides to answer yes to all of the questions she does not know.
• Situational bias occurs when testing conditions or situations differentially affect the performance of individuals from a particular group. For example, persons who are not from Western cultures may not rely on a numerical concept of time. Therefore, an international student from West Africa may perform differently on a timed assessment than an American student would.
• Ecological bias occurs when global systems prevent members of a particular group of individuals from demonstrating their true skills, knowledge, abilities, and personalities on a given assessment. For example, ecological bias may occur if a college counseling center mandates that all students take the same Western-based career assessments. Because these assessments are written in English and endorse Western career theories and notions, the scores of an international student from West Africa may not be a true reflection of her abilities and personality.

Computer-adaptive Testing

COMPUTER-ADAPTIVE TESTING: Some computer-based tests have the ability to adapt the test structure and items to the examinee's ability level. This is known as computer-adaptive testing. The GRE is an example of a commonly used computer-adaptive test. Computer-adaptive tests provide precise scores and quickly assess the examinee's ability level. As a result, test administration time is reduced without sacrificing score accuracy.
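
The core adaptive idea can be illustrated with a toy loop. This is a minimal sketch, not the GRE's actual algorithm: operational CATs select items using item response theory models, and the item pool structure, step size, and stopping rule below are invented purely for illustration.

```python
# Toy computer-adaptive testing loop (illustrative only): harder items
# follow correct answers, easier items follow incorrect ones.

def run_cat(item_pool, answer_fn, n_items=10):
    """Administer up to n_items adaptively and return an ability estimate."""
    ability = 0.0   # running ability estimate on an arbitrary scale
    step = 1.0      # how far the estimate moves after each response
    for _ in range(min(n_items, len(item_pool))):
        # Pick the unused item whose difficulty is closest to the
        # current ability estimate.
        item = min(item_pool, key=lambda i: abs(i["difficulty"] - ability))
        item_pool.remove(item)
        correct = answer_fn(item)
        # Nudge the estimate up or down, shrinking the step so it converges.
        ability += step if correct else -step
        step *= 0.7
    return ability
```

Because each response narrows the range of plausible ability levels, fewer items are needed than in a fixed-form test, which is why administration time drops without sacrificing score precision.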

Computer-Based Testing

Computer-based testing (CBT), also known as computer-based assessment (CBA), refers to a method for administering, analyzing, and interpreting tests through the use of computer technology, software programs, or Internet sites. Computer-based testing is available for a variety of personality, intelligence, ability, and career development assessments. As with pencil-and-paper tests, computer-based testing has many benefits and some disadvantages.

Benefits Associated with Computer-Based Testing:
• Administration time and cost are reduced.
• Scoring accuracy is greater.
• Feedback concerning client performance is quick, sometimes immediate.
• Standardization of test administration procedures is enhanced.
• Clients prefer tests administered via the computer when responding to sensitive topics.
• Reports are computer-generated.

Disadvantages Associated with Computer-Based Testing:
• Electronic equipment needed to administer the test can be expensive.
• Widely used assessments may not be compatible with computer-based testing.
• A lack of standards exists for obtaining and administering computer-based tests.
• It minimizes human contact and involvement.
• Computer-based tests may not have appropriate normative data.

Criterion Validity

Criterion validity indicates the effectiveness of an instrument in predicting an individual's performance on a specific criterion. Criterion validity is empirically established by examining the relationship between data collected from the instrument and the criterion. The two types of criterion validity are concurrent and predictive validity.

• Concurrent validity is concerned with the relationship between an instrument's results and another currently obtainable criterion. To determine concurrent validity, instrument results and criterion performance scores must be collected at the same time.
• Predictive validity examines the relationship between an instrument's results collected now and a criterion collected in the future. By establishing predictive validity, a test developer uses an instrument to try to predict performance on a future criterion measure. Therefore, a client's criterion performance scores are collected some time after the instrument results. For example, to establish predictive validity for a depression instrument, client scores on the depression instrument would be compared to the number of times the client was hospitalized for suicidal ideation in a 6-month period occurring 2 years after taking the assessment. If the relationship between client scores and the number of future hospitalizations is positive, one could say that the instrument predicts the future occurrence of hospitalization.
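
As an illustration of how predictive validity might be checked empirically, the sketch below correlates instrument scores with a later criterion. The Pearson correlation is a standard choice, though the source does not prescribe a specific statistic, and all data values are invented.

```python
# Hedged sketch: predictive validity as the correlation between
# depression scores collected now and hospitalization counts collected
# two years later. All values below are invented for illustration.
from statistics import mean, stdev

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

depression_scores = [12, 25, 31, 8, 19, 27]   # instrument results now
hospitalizations  = [0, 2, 3, 0, 1, 2]        # criterion 2 years later
print(round(pearson_r(depression_scores, hospitalizations), 2))
```

A strongly positive coefficient would be evidence that the instrument forecasts the future criterion; a coefficient near zero would undercut its predictive validity claim.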

Norm-Referenced Assessment

Derived scores are frequently used with norm-referenced assessments. Norms refer to the typical score/performance against which all other test scores are evaluated. In a norm-referenced assessment, an individual's score is compared to the average score (i.e., the mean) of the test-taking group. Knowing the relative position of a person's score in comparison to his or her norm group provides us with information regarding how that individual has performed. Examples of norm-referenced assessments include:

- College admissions exams: GRE, SAT, ACT, MCAT, GMAT
- Intelligence tests: Stanford-Binet, Wechsler scales
- Personality inventories: MBTI, CPI

• Although the rest of this subsection focuses on norm-referenced assessments, there are other important ways to give raw scores meaning. One is to use a criterion-referenced assessment. Criterion-referenced assessments provide information about an individual's score by comparing it to a predetermined standard or set criterion. For example, if Ivan's instructor decided that 90 to 100 was an A, 80 to 89 was a B, 70 to 79 was a C, and 60 to 69 was a D, then Ivan, with his score of 67, would receive a D on the math exam. Examples of criterion-referenced assessments include driver's licensing exams, professional licensure testing (such as the NCE), high school graduation examinations, and exit exams (such as the CPCE).
• An individual's test score also can be compared against a previous test score. This type of comparison is referred to as an ipsative assessment. Whereas norm- and criterion-referenced assessments use an external frame of reference, ipsative assessments are self-referenced and use an internal frame of reference. Ipsative assessments are commonly used in physical education classes or in computer games.

Developmental Scores

Developmental scores place an individual's raw score along a developmental continuum to derive meaning from the score. Unlike standard scores, which transform raw scores into scores with a new mean and standard deviation, developmental scores describe an individual's location on a developmental continuum. In doing so, developmental scores can directly evaluate an individual's score against the scores of those of the same age or grade level. Developmental scores are typically used when assessing children and young adolescents.

• Age-equivalent scores are a type of developmental score that compares an individual's score with the average score of those of the same age. Age equivalents are reported in chronological years and months. Thus, we could say that a child aged 7 years 5 months with an age-equivalent score of 8.2 in height is the average height of a child aged 8 years 2 months.
• Grade-equivalent scores are a type of developmental score that compares an individual's score with the average score of those at the same grade level. Grade equivalents are reported as a decimal number that describes performance in terms of grade level and months in grade. Thus, a grade-equivalent score of 5.6 means the individual scored the average score of a student who has completed 6 months of the fifth-grade year.
• If a first-grader who has completed 2 months of first grade scored a grade-equivalent score of 1.2 in reading, what can we say about her performance? If you said she was performing at the mean for her grade-mates, you understand how to interpret grade-equivalent scores!
• Although grade equivalents are somewhat useful in measuring individual growth from year to year, they do not indicate that an individual is ready for a higher grade or should be moved back to a lower grade. A seventh-grader who obtains a grade equivalent of 10.2 on a math test should not be moved into 10th-grade math. Grade-equivalent scores simply identify where an individual's score falls on the distribution of scores for individuals at the same grade level; they are not an analysis of skills.

The Normal Distribution

In a normal distribution, nearly all scores fall close to average and very few scores fall toward either extreme of the distribution. Normal distributions are a product of the laws of nature and probability. As a result, most psychological and physical measurements (e.g., height and intelligence) are approximately normally distributed. Normal distributions have important characteristics that are useful to the field of assessment.

- A normal distribution forms a bell-shaped curve when graphed. This graph is commonly referred to as a normal curve (bell curve). (See Figure 7.1.) The normal curve is symmetrical, with the highest point occurring at the graph's center and the lowest points lying in the tails on either side. The curve is also asymptotic, meaning that each tail approaches the horizontal axis without ever touching it. Normal distributions are also characterized by their measures of central tendency and variability.
- The relationship between the normal distribution and test scores: normal distributions are the foundation on which derived scores are built. The mathematical relationships found in a normal distribution permit comparisons to be made between clients' scores on the same test or between the same client's scores on multiple tests. As a result, many derived scores (e.g., percentiles, normal curve equivalents, stanines, z-scores, and T scores) originate from the unique characteristics of a normal distribution.

Standardized Scores

In assessment, standardization relates to the conversion of raw scores to standard scores. Specifically, standardization refers to the process of finding the typical score attained by a group of test-takers. The typical score then acts as a standard reference point for future test results. Therefore, once a test is standardized, a score can be compared to the scores of the standard group. When comparing scores, it is important that the standard group reflects the test-takers who will be taking the test in the future. For example, you would want to compare a third-grader to a standard group of third-graders rather than a standard group of fifth-graders. (A computational sketch of all four conversions follows this list.)

- Z-SCORES: The z-score is the most basic type of standard score. A z-score distribution has a mean of 0 and a standard deviation of 1. It simply represents the number of standard deviation units above or below the mean at which a given score falls. Z-scores are derived by subtracting the sample mean from the individual's raw score and then dividing the difference by the sample standard deviation.
- T SCORES: A T score is a type of standard score that has an adjusted mean of 50 and a standard deviation of 10. These scores are commonly used when reporting the results of personality, interest, and aptitude measures. T scores are easily derived from z-scores by multiplying an individual's z-score by the T score standard deviation (i.e., 10) and adding the product to the T score mean (i.e., 50).
- DEVIATION IQ OR STANDARD SCORE: Deviation IQ scores are used in intelligence testing. Although deviation IQs are a type of standardized score, they are often referred to simply as standard scores (SS), because they are commonly used to interpret scores from achievement and aptitude tests, and it makes little sense to say "Johnny's reading score was a deviation IQ of. . . ." Deviation IQs have a mean of 100 and a standard deviation of 15 and are derived by multiplying an individual's z-score by the deviation IQ standard deviation (15) and adding the product to the deviation IQ mean (100).
- NORMAL CURVE EQUIVALENTS: The normal curve equivalent (NCE) was developed for the U.S. Department of Education and is used by the educational community to measure student achievement. NCEs are similar to percentile ranks in that the range is from 1 to 99, and they indicate how an individual ranked in relationship to peers. Unlike percentile ranks, NCEs divide the normal curve into 100 equal parts (see Figure 7.1). NCEs have a mean of 50 and a standard deviation of 21.06. They can be converted from a z-score by multiplying the NCE standard deviation (SD = 21.06) by an individual's z-score and adding the NCE mean (M = 50).
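
The four conversions above can be collected into one small helper. This sketch uses the means and standard deviations stated in the text; the example raw score, group mean, and SD reuse Ivan's test data from later in this section.

```python
# Converting a raw score to the standard scores described above,
# using their published means and standard deviations
# (T: M=50, SD=10; deviation IQ: M=100, SD=15; NCE: M=50, SD=21.06).

def standard_scores(raw, group_mean, group_sd):
    z = (raw - group_mean) / group_sd  # SD units above/below the mean
    return {
        "z": round(z, 2),
        "T": round(10 * z + 50, 2),
        "deviation_IQ": round(15 * z + 100, 2),
        "NCE": round(21.06 * z + 50, 2),
    }

# A raw score of 67 in a group with mean 63 and SD 4 (Ivan's data):
print(standard_scores(67, 63, 4))
# {'z': 1.0, 'T': 60.0, 'deviation_IQ': 115.0, 'NCE': 71.06}
```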

Reliability Coefficient

In test reports and manuals, reliability is expressed as a correlation, which is referred to as a reliability coefficient . The closer a reliability coefficient is to 1.00, the more reliable the scores generated by the instrument. Any reliability coefficient less than 1.00 indicates the presence of error in test scores. Reliability coefficients typically range from .80 to .95; however, an acceptable reliability coefficient depends on the purpose of the test. For example, nationally normed achievement and aptitude tests, such as the GRE, are expected to have reliability coefficients above .90, whereas reliability coefficients for personality inventories can be lower than .90, and the instrument is still considered to yield reliable scores.

Informal Assessments

Informal assessments refer to subjective assessment techniques that are developed for specific needs. Unlike formal, standardized tests, the intention of an informal assessment is not to provide a comparison to a broader group; rather, informal assessments seek to identify the strengths and needs of clients. Types of informal assessment include observation, clinical interviewing, rating scales, and classification systems.

OBSERVATION: Observation is a broad term that refers to the systematic observation and recording of an individual's overt behaviors. Because the behavior is believed to serve a function in a particular environment, the antecedents and consequences are also recorded to ascertain the "why" behind the behavior. Behavioral assessments can be conducted by gathering data from direct or indirect observation.
• Direct observation assesses an individual's behavior in real time and usually occurs in a naturalistic setting.
• Indirect observation assesses an individual's behavior through self-report or the use of informants such as family, friends, or teachers. Indirect observation methods include the use of behavioral interviewing, checklists, and rating scales.

CLINICAL INTERVIEWING: Clinical interviewing is the most commonly used assessment technique in counseling. Interviewing refers to the process by which a professional counselor uses clinical skills to obtain client information that will facilitate the course of counseling. Typically, clinical interviews are used to gather information concerning a client's demographic characteristics, presenting problems, current life situation, family, educational status, occupational background, physical health, and mental health history (Erford, 2013). Many different types of interviews exist, and they can all be classified as structured, semi-structured, or unstructured.
• Structured interviews use a series of pre-established questions that the professional counselor presents in the same order during each interview. The structured interview tends to be detailed and exhaustive, as it covers a broad array of topics. Because the questions are predetermined and asked in a sequential manner, structured interviews provide consistency across different clients, counselors, and time periods. However, they do not provide the flexibility to ask follow-up questions or explore client issues in more depth.
• Semi-structured interviews use pre-established questions and topic areas to be addressed; however, the professional counselor can customize the interview by modifying questions, altering the interview sequence, or adding follow-up questions. Although semi-structured interviews allow for more flexibility, they are more prone to interviewer error and bias. Therefore, they are considered less reliable than structured interviews.
• Unstructured interviews do not use pre-established questions and tend to rely on the client's lead to determine a focus for the interview. Typically, professional counselors rely on open-ended questioning and reflective skills when conducting an unstructured interview. This type of interview provides the most flexibility and adaptability but is the least reliable and most subject to interviewer error.

RATING SCALES: Rating scales typically evaluate the quantity of an attribute.
• Rating scales are somewhat subjective, as they often rely on the rater's perception of the behavior. Nonetheless, rating scales provide an efficient way to measure a client's functioning in terms of behavior, personality, social skills, and emotions.
• Rating scales vary in type and scope of information assessed. Some assess a broad array of behaviors; others measure specific behaviors and the conditions in which they occur. Broad-band behavioral rating scales evaluate a broad range of behavioral domains. Narrow-band behavioral rating scales evaluate a specific dimension of the targeted behaviors. Often, a broad-band behavioral rating scale will be administered to identify the problematic behavior, and a narrow-band behavioral rating scale will be subsequently administered to provide detailed information about that behavior.

CLASSIFICATION SYSTEMS: Classification systems are used to assess the presence or absence of an attribute. Three commonly used classification systems are:
• Behavior and feeling word checklists. Allow the professional counselor or client to identify the words that best describe the client's feelings or behaviors.
• Sociometric instruments. Assess the social dynamics within a group, organization, or institution.
• Situational tests. Involve asking the client to role-play a situation to determine how he or she may respond in real life.

Other Types of Assessments

MENTAL STATUS EXAM (MSE): The Mental Status Exam (MSE) is used by professional counselors to obtain a snapshot of a client's mental symptoms and psychological state. When information yielded from the MSE is combined with client biographical and historical information, the professional counselor can formulate an accurate diagnosis and treatment plan. The MSE addresses several key areas:
• Appearance. Refers to the physical aspects of a client, such as apparent age, height, weight, and manner of dress and grooming. Bizarre dress, odors, and body modifications are noted.
• Attitude. Refers to how the client interacts with the professional counselor during the interview.
• Movement and behavior. Observation of the client's abnormal movements, level of activity, eye contact, facial expressions, gestures, and gait.
• Mood and affect. Mood refers to the way a client feels most of the time (e.g., depressed, angry). Affect refers to the external expression of a client's mood (e.g., flat, silly) and can change frequently. The professional counselor notes the quality and intensity of affect and its congruence with mood.
• Thought content. Refers to abnormalities in thought content such as suicidal/homicidal ideation, delusions, obsessions, paranoia, and thought broadcasting/insertion.
• Perceptions. Refers to any sensory experience. Three types of perceptual disturbance are hallucinations, derealization, and illusions.
• Thought processes. Refers to the connections between client thoughts and how they relate to the current conversation. Thought can be logical or illogical, coherent or incoherent, or marked by flight of ideas or loose associations.
• Judgment and insight. Judgment pertains to the client's ability to make decisions, and insight refers to the client's understanding of his or her current situation.
• Intellectual functioning and memory. Assesses the client's level of intellect, current knowledge, and ability to perform calculations and think abstractly.

PERFORMANCE ASSESSMENT: Performance assessments are a nonverbal form of assessment that entails minimal verbal communication to measure broad attributes. The client is required to perform a task rather than answer questions using pencil-and-paper methods. Performance assessments are advantageous when working with clients who speak a foreign language, have physical or hearing disabilities, or have limited verbal abilities. Examples of intelligence performance tests include the Porteus Maze, Draw-a-Man Test, Bayley Scales, Cattell Culture Fair Intelligence Tests, and Test of Non-verbal Intelligence (TONI). The Gesell developmental scale is an example of a developmental performance test.

SUICIDE ASSESSMENT: Suicide assessment refers to determining a client's potential for committing suicide. Specifically, the professional counselor must make a clinical judgment concerning the client's suicide lethality. Suicide lethality is defined as the likelihood that a client will die as a result of suicidal thoughts and behaviors.
• Professional counselors often use the clinical interview to systematically assess for lethality. A thorough suicide assessment includes gathering information related to client demographics (e.g., gender, age, relationship status, ethnicity), psychosocial situation, history, existing psychiatric diagnosis, suicidality and symptoms (e.g., intent, plans, ideation), and individual strengths and weaknesses. Several assessment acronyms exist that can help professional counselors structure their suicide assessment.
Two of the most common include PIMP (plan, intent, means, and prior attempts) and SAD PERSONS (sex, age, depression, previous attempt, ethanol abuse, rational thought loss, social supports lacking, organized plan, no spouse, sickness).
• When using the clinical interview to assess suicide lethality, you should be familiar with the risk factors associated with committing suicide. In addition to the clinical interview, several standardized tests can be used to assess suicide lethality. Lethality is often described along the following continuum:
- Low lethality: The client is not suicidal at the time of the assessment.
- Low-moderate lethality: The client is somewhat suicidal but does not have any risk factors associated with suicide.
- Moderate lethality: The client is suicidal and has several risk factors.
- Moderate-high lethality: The client is determined to die and may commit suicide within the next 72 hours unless an intervention occurs.
- High lethality: The client is currently in the process of committing suicide (e.g., has already swallowed pills) and needs immediate hospitalization.

Assessment: Performance Test

Maximal and typical performance refers to the intention of the assessment. If a professional counselor would like information regarding the client's best attainable score/performance, then the counselor would use a maximal performance test. Achievement and aptitude tests are measures used to test maximal performance. A typical performance test, on the other hand, is concerned with one's characteristic or normal performance. For example, personality measurements assess a client's typical personality characteristics.

Measurement Instruments

Measurement, in counseling, is the process of defining and estimating the magnitude of human attributes and behavioral expressions. The act of measuring involves the use of a measurement instrument, such as a test, survey, or inventory.
• The terms assessment and test are commonly misused and mistakenly interchanged. Remember, assessment is a broad term that involves the systematic process of gathering and documenting client information. A test is a subset of assessment and is used to yield data regarding an examinee's responses to test items.
• Interpretation is the part of the assessment process wherein the professional counselor assigns meaning to the data yielded by evaluative procedures. Meaning can be derived from the data by comparing an individual to his or her peer group, using a predetermined standard or set criteria, or through a professional counselor's judgment.
• Evaluation refers to making a determination of worth or significance based on the result of a measurement. For example, a professional counselor can examine a client's monthly scores on the Beck Depression Inventory to evaluate the client's progress in counseling. Professional counselors use evaluation to assess client progress or to determine the effectiveness of interventions, programs, and services on client change.

Percentiles

Percentage scores are easily confused with percentiles. A percentage score is simply the raw score (i.e., the number of correct items) divided by the total number of test items.
• Percentile rank, which is also referred to as a percentile, is a commonly used calculation that allows a comparison to be made between a person's raw score and a norm group. A percentile rank indicates the percentage of scores falling at or below a given score. Percentile ranks range from less than 1 to greater than 99 and have a mean of 50. A percentile rank cannot be 0 or 100 because percentiles represent the percentage of scores below a given score. (Remember, a normal curve is asymptotic, so it extends to infinity in both directions.) Percentile ranks are not equal units of measurement, meaning that the scale tends to exaggerate differences in percentiles near the mean and minimize differences at the tails.
• Considering the test data found in Table 7.6, what would Ivan's percentile rank be? First, we must calculate the mean and standard deviation of the data set. Using the calculations found in Section 8.5, the mean is 63.0 and the standard deviation is 4. We then must determine how many standard deviations Ivan's score is from the mean. This can be determined by subtracting the mean from Ivan's score (i.e., 67 - 63 = 4). Because Ivan is 4 points above the mean and the standard deviation is 4, we know that he is one standard deviation above the mean. Ivan's percentile rank at +1 standard deviation (SD) can be determined by consulting a score transformation table or by adding the known areas underneath the curve (the area below the mean is 50%, and the area from the mean to +1 SD is 34%; 50 + 34 = 84). Both methods yield a percentile rank of 84.
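
The same answer can be reached computationally. A small sketch using Python's standard-library normal distribution; the exact CDF gives 84.13, which rounds to the 84th percentile found above with the rounded 50% + 34% areas.

```python
# Percentile rank from a z-score via the normal CDF, reproducing the
# worked example (mean 63, SD 4, Ivan's score 67).
from statistics import NormalDist

mean, sd, score = 63.0, 4.0, 67.0
z = (score - mean) / sd                 # 1.0 -> one SD above the mean
percentile = NormalDist().cdf(z) * 100  # percentage of scores at or below z
print(round(percentile))                # 84
```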

Assessment of Personality

Personality tests assess a person's affective realm. Specifically, personality tests describe the facets of a person's character that remain stable through adulthood (e.g., temperament, patterns in behavior). Several distinctive tests that emerged in the 20th century attempted to assess both individual personality characteristics and the personality as a whole. Today, these tests are classified as either objective or projective personality tests.

OBJECTIVE PERSONALITY TESTS: Objective personality tests are standardized, self-report instruments that often use multiple-choice or true/false formats to assess various aspects of personality. The aims of objective personality tests are to identify personality types, personality traits, personality states, and self-concept. In addition, these tests can identify psychopathology and assist in treatment planning. Table 7.15 presents an overview of the most commonly administered objective personality tests, including:
- Minnesota Multiphasic Personality Inventory-2 (MMPI-2), with validity scales such as Variable Response Inconsistency (VRIN) and True Response Inconsistency (TRIN)
- Millon Clinical Multiaxial Inventory, Third Edition
- Myers-Briggs Type Indicator
- California Psychological Inventory, Form 434 (CPI 434)
- The Sixteen Personality Factors Questionnaire (16PF)
- The NEO Personality Inventory, Third Edition (NEO PI-3)
- Coopersmith Self-Esteem Inventories (SEI)

PROJECTIVE PERSONALITY TESTS: Projective personality tests assess personality factors by interpreting a client's response to ambiguous stimuli. These personality tests are rooted in psychoanalytic psychology and propose that the ambiguity of the presented stimuli will tap into the unconscious attitudes and motivations of the client. Projective tests are used to identify psychopathology and assist in treatment planning. Table 7.16 presents an overview of the most commonly administered projective personality tests, including:
- Rorschach inkblot test
- Thematic Apperception Test (TAT)
- House-Tree-Person (HTP)
- Sentence completion tests

Reporting Validity and Decision Accuracy

REPORTING VALIDITY: In test reports and manuals, validity is expressed as a correlation coefficient (for more information regarding correlation coefficients, see Section 8.6.1). The validity coefficient is a correlation between a test score and the criterion measure. Validity also can be reported as a regression equation. A regression equation can be used to predict an individual's future score on a specific criterion based on his or her current test score. For example, college admission counselors may use a regression equation to predict an applicant's future GPA from a current SAT score. Unfortunately, some degree of error is always present in prediction, and we are never able to say that our predictions are 100% accurate. Consequently, the standard error of estimate must be calculated and reported when predicting criterion scores. The standard error of estimate is a statistic that indicates the expected margin of error in a predicted criterion score due to the imperfect validity of the test. It is calculated as SE_est = √( Σ(Y - Y′)² / N ), where SE_est is the standard error of the estimate, Y is an actual score, Y′ is a predicted score, and N is the number of pairs of scores. The quantity under the radical is the sum of squared differences between the actual scores and the predicted scores, divided by N. (A computational sketch follows this entry.)

DECISION ACCURACY: The work of a professional counselor involves making decisions regarding client diagnosis, treatment, and placement. Professional counselors often use psychological tests to enhance the accuracy of their decisions. For example, a professional counselor may notice that a client is demonstrating depressive symptoms and may consider administering a depression inventory to augment observations and the subsequent diagnosis. Before administering the depression inventory, the professional counselor will want to assess the instrument's decision accuracy. Decision accuracy assesses the accuracy of instruments in supporting counselor decisions. Table 7.5 provides an overview of the terms commonly associated with decision accuracy.
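
Here is a computational sketch of the standard error of estimate defined above. Only the formula comes from the text; the predicted and actual GPA values are invented for illustration.

```python
# SE_est = sqrt( sum((Y - Y')^2) / N ), where Y is an actual criterion
# score, Y' the score predicted by the regression equation, and N the
# number of score pairs.
import math

def standard_error_of_estimate(actual, predicted):
    n = len(actual)
    ss = sum((y - yp) ** 2 for y, yp in zip(actual, predicted))
    return math.sqrt(ss / n)

actual_gpa    = [3.2, 2.8, 3.9, 3.0]  # observed first-year GPAs (invented)
predicted_gpa = [3.0, 3.0, 3.6, 3.1]  # GPAs predicted from SAT (invented)
print(round(standard_error_of_estimate(actual_gpa, predicted_gpa), 2))  # 0.21
```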

Scales of Measurement

Scales of measurement provide a method for classifying or measuring specific qualities or characteristics. Four different scales of measurement exist: nominal, ordinal, interval, and ratio.
• A nominal scale is the simplest measurement scale, as it is only concerned with classifying data without respect to order or equal interval units. An example of a measure occurring on a nominal scale is gender. By assigning the label male or female, only a classification is being made. Magnitude cannot be assigned to gender (e.g., females are greater than males). Arbitrary labels, such as numbers, can be allocated to nominal data (e.g., males = 0, females = 1).
• An ordinal scale classifies and assigns rank-order to data. Likert-type scales, which often rank degrees of satisfaction toward a particular issue, are an example. Ordinal scales designate order, but the intervals between the numbers are not necessarily equal. For example, it can be said that student A, who rates satisfaction with a counseling course as 4, is more satisfied than student B, who gave a rating of 3.
• An interval scale includes all ordinal scale qualities and has equivalent intervals; that is, interval scale measures have an equal distance between each point on the scale. Therefore, the difference between 32 degrees and 31 degrees Fahrenheit is the same as the difference between 67 degrees and 66 degrees Fahrenheit. Interval scales do not have an absolute zero point. For example, 0 degrees Fahrenheit does not mean there is no temperature. As a result, it cannot be said that 60 degrees Fahrenheit is twice as warm as 30 degrees Fahrenheit. Educational and psychological testing data are usually interval scale measurements.
• A ratio scale is the most advanced scale of measurement, as it preserves the qualities of nominal, ordinal, and interval scales and has an absolute zero point. As a result, differences between values can be quantified in absolute terms. Height is an excellent example of a measure occurring on a ratio scale: it can be said that a 6-foot-tall person is twice as tall as a 3-foot-tall person. Physical measurements used in the natural sciences are often ratio-scaled data (e.g., length, weight, time).

Derived Scores: Raw Score Derived Score

Suppose a student, Ivan, scored a 67 on a test. Did he do well? If we assume the score is 67 out of 100, then the score might not be so good. But what if the highest possible score on the test is 70? Then the score is probably really good. The bottom line is that a raw score (original data that have not been converted into a derived score) lacks sufficient meaning and interpretive value. To interpret and understand Ivan's raw score, it must be converted to a derived score or compared to some criterion. A derived score is a converted raw score that gives meaning to a test score by comparing an individual's score with those of a norm group.

Standard Error of Measurement

Suppose you administered a math exam to a student on five separate occasions, and he scored 95, 91, 98, 86, and 89, respectively. How do you know which score best reflects his understanding of the material (i.e., which is his true score)? Recall from the previous discussion of reliability that all measurement instruments have some degree of error. Consequently, it is unlikely that repeated administrations of an instrument would yield the same scores for a given person. Since an individual's true score is always unknown, the standard error of measurement (SEM) is used to estimate how scores from repeated administrations of the same instrument to the same individual are distributed around the true score. The SEM is computed from the standard deviation (SD) and reliability coefficient (r) of the test instrument: SEM = SD × √(1 - r).
• Simply stated, the SEM is the standard deviation of an individual's repeated test scores when administered the same instrument multiple times. The SEM is inversely related to reliability in that the larger the SEM, the lower the reliability of a test; thus, if the reliability coefficient is 1.00, the SEM is 0. Therefore, the reliability of a test can be expressed in terms of the SEM. The SEM is often reported in terms of confidence intervals, which define the range of scores where the true score is thought to lie. For example, let's say you give a student the same math test 100 times. The distribution of her observed scores would form a curve when graphed, where the mean is defined as her true score and the standard deviation represents the SEM. Suppose that the mean was 93 and the SEM was 2. As 68% of all scores fall within ±1 SEM, we can conclude that 68% of the time her observed scores will fall between 91 and 95 when given the same math test 100 times. At the 95% level of confidence (±2 SEM), 95% of her observed scores would fall within the range of 89 to 97, given 100 administrations of the test.
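
The SEM arithmetic and the confidence bands from the example can be verified in a few lines. A sketch, assuming an SD of 10 and a reliability of .96, which are invented values chosen so the SEM works out to the 2 points used in the text:

```python
# SEM = SD * sqrt(1 - r); confidence bands add and subtract multiples
# of the SEM around the observed score.
import math

def sem(sd, reliability):
    return sd * math.sqrt(1 - reliability)

s = sem(sd=10, reliability=0.96)  # -> 2.0 (invented SD and r)
observed = 93                     # the mean score from the example
print(f"68% band: {observed - s:.0f} to {observed + s:.0f}")      # 91 to 95
print(f"95% band: {observed - 2*s:.0f} to {observed + 2*s:.0f}")  # 89 to 97
```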

The Purpose of Assessment in Counseling

Suppose you are working with a client who presents with depression symptoms. You decide to administer the Beck Depression Inventory (BDI-II). What will you do with the results? The use of client assessment in counseling has several primary functions:

• Diagnosis and treatment planning. As a professional counselor, you must make clinical and treatment decisions on the basis of the information you gather from a client. Increasingly, professional counselors are being required by managed care organizations to diagnose client problems. Accurately diagnosing client problems requires the counselor to engage in assessment. In particular, professional counselors must gather information from the client regarding the presence of psychological symptoms and the impact of these symptoms on client functioning. Professional counselors can use structured interviews and assessment instruments to assist in gathering this information. The BDI-II, for example, may assist counselors in diagnosing a mood disorder and recommending treatment based on the severity of the disorder. Professional counselors also use diagnostic systems to assist in diagnosing client problems. Diagnostic systems provide standardized terminology, or a common language, that allows mental health professionals to communicate with one another regarding client diagnosis and treatment planning. There are a variety of diagnostic systems that counselors can use, but the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5; American Psychiatric Association, 2013) is the most widely used.
• Placement services. Once a client is diagnosed, professional counselors may use additional assessment procedures to determine the type of program/service in which the client should be placed. For example, a school counselor on a child study team may use classroom behavioral records, observation, and individualized achievement test scores to determine whether a child should be placed in a mainstream classroom.
• Admission. Assessment procedures are often used to determine admission into an educational institution. Aptitude tests such as the GRE are considered for entrance into many postgraduate educational programs.
• Selection. Assessments are also used to select candidates for a special program or job position. For example, an auto mechanic may be asked to complete a battery of mechanical aptitude tests when applying for a mechanic position. The information yielded by these assessments will assist the employer in determining the candidate's suitability for the position.
• Monitoring client progress. Assessment plays a fundamental role in the counseling process; it does not involve just administering tests in the initial counseling session for diagnosis, treatment, and/or placement purposes. Professional counselors have a responsibility to monitor client progress throughout the course of counseling to determine whether the client is moving toward his or her goals. Client progress may be evaluated by administering a formal assessment such as the BDI-II. For example, a professional counselor may administer the BDI-II to a client during the initial intake and again during the third, fifth, and seventh sessions to determine if the severity of the client's depression is decreasing. Client progress can also be monitored through informal assessments and client self-report. Whiston (2013) suggests that professional counselors systematically ask clients at the beginning of each counseling session to rate, on a scale from 1 to 10, their current level of depression, anger, anxiety, etc. Counselors can record the client's self-report ratings each week to determine if the client is making progress toward counseling goals.
• Evaluate counseling outcomes. In addition to monitoring client progress, professional counselors need to evaluate whether counseling treatment is effective overall. Due to growing demands from managed care, professional counselors are increasingly being held accountable for outcomes. That is, they are being required to show documentation that counseling is an effective treatment and that clients do, in fact, get better.

Test Theory

Test theory holds that test constructs, to be considered empirical, must be measurable in both quality and quantity (Erford, 2013). Consequently, test theory strives to reduce test error and enhance construct reliability and validity. Professional counselors must be familiar with the following types of test theory.
• Classical test theory has been the most influential psychometric theory. It postulates that an individual's observed score is the sum of the true score and the amount of error present during test administration (a compact formal statement follows this list). The central aim of classical test theory is to increase the reliability of test scores.
• Item response theory, also referred to as modern test theory, refers to applying mathematical models to the data collected from assessments. Test developers use item response theory to evaluate how well individual test items and the test as a whole work. Specifically, item response theory can be used to detect item bias (e.g., whether an item behaves differently for males and females), to equate scores from two different tests, and to tailor test items to the individual test-taker.
• The construct-based validity model (Messick, 1989, 1995) proposed that validity is a holistic construct, not explainable as separate components in the way the classical model's three-in-one framework of content, criterion, and construct validity suggests. Specifically, Messick proposed the exploration of internal structural aspects and external aspects of validity to study score validity, all of which in combination describe score validity in a holistic manner.
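
Classical test theory's central postulate can be written compactly. The notation below is the conventional psychometric one, not drawn from this text:

```latex
% An observed score X is the sum of the true score T and error E:
X = T + E
% Reliability is then the share of observed-score variance that is
% true-score variance:
r_{XX} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}
```

Reducing error variance relative to observed-score variance is what "increasing the reliability of test scores" means in this framework.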

Test Translation and Test Adaptation

Test translation refers to a process by which test items are translated into the language spoken by examinees. Although test translation attempts to reduce cultural bias in testing, it has been heavily criticized for assuming equivalence in content and values across cultures. Because the translation of a test into a test-taker's native language is not enough to reduce test bias, test translation has been replaced by test adaptation. Test adaptation involves the process of altering a test for a population that differs significantly from the original test population in terms of cultural background and language. The process includes translating language as well as empirically evaluating the cultural equivalence of the adapted test. The goal of test adaptation, then, is to develop two or more versions of the test that will elicit the same responses from test-takers regardless of their cultural or linguistic backgrounds.

Assessment: Non-standardized tests

allow for variability and adaptation in test administration, scoring, and interpretation. These tests do not permit an individual's score to be compared to a norm group; consequently, the professional counselor must rely on his or her own judgment to interpret the data. Projective personality measures, such as the Rorschach inkblot test and the Thematic Apperception Test, can be considered nonstandardized tests, depending on the scoring and interpretive methods used.

Assessment: Group tests

are administered to two or more test-takers at a time. Group tests typically use objective scoring methods and have established norms. Group tests are economical and simplify test administration and scoring for the examiner. However, client responses are more restricted, and test administration lacks flexibility.

Achievement Tests

are designed to assess what one has learned at the time of testing. Specifically, they measure the knowledge and skills an individual has acquired in a particular area due to instruction or training experiences. Achievement tests are most frequently employed in educational settings and used for selection, classification, placement, or instruction purposes. Acceptable reliability coefficients for standardized achievement tests begin at .80 and range higher.

SURVEY BATTERIES: Survey batteries refer to a collection of tests that measure an individual's knowledge across broad content areas. These tests must cover material from multiple subject areas and, as a result, do not assess any one subject in great depth. Survey batteries are usually administered in school settings and are used to assess academic progress. Examples include:
- Stanford Achievement Test (SAT 10)
- Iowa Test of Basic Skills
- Metropolitan Achievement Test (MAT8)
- TerraNova, Third Edition Tests

DIAGNOSTIC TESTS: Diagnostic tests are designed to identify learning disabilities or specific learning difficulties in a given academic area. In contrast to survey batteries, diagnostic tests provide an in-depth analysis of student skill competency. As a result, these tests yield information regarding a student's specific strengths and weaknesses in an academic area. Examples include:
- Wide Range Achievement Test (WRAT4)
- KeyMath Diagnostic Test (KeyMath 3)
- Woodcock-Johnson III Tests of Achievement (WJ III ACH)
- Peabody Individual Achievement Test-Revised
- Test of Adult Basic Education (TABE)

READINESS TESTING: Readiness tests refer to a group of criterion-referenced achievement assessments that indicate the minimum level of skills needed to move from one grade level to the next. These achievement tests are frequently used in high stakes testing.

Assessment: Standardized tests

are designed to ensure the conditions for administration, test content, scoring procedures, and interpretations are consistent. They use predetermined administration instructions and scoring methods. Because standardized tests undergo rigorous empirical validation measures, they have some degree of validity and reliability. As a result, an individual's test score can be compared to a norm group. The Scholastic Aptitude Test (SAT) and Graduate Record Examination (GRE) are examples of standardized tests.

Intelligence Test

broadly assess an individual's cognitive abilities. Because intelligence testing is a type of aptitude testing, it measures what one is capable of doing. These tests characteristically yield a single summary score, commonly called an IQ (intelligence quotient), but also usually yield index scores derived from factor analytic studies. Intelligence tests are often used to detect giftedness and learning disabilities and to identify and classify intellectual developmental disabilities.

THEORIES OF INTELLIGENCE: What is intelligence? Is it related to a person's spatial and verbal aptitude? Or a person's musical ability? Intelligence is difficult to measure and define because it is not an overt construct. As a result, experts continue to argue over the definition of intelligence. Several theories of intelligence exist, and each one strives to define the construct.

INTELLIGENCE TESTS: Not surprisingly, the intelligence tests we use today are based on the preceding theories of intelligence. The majority of these tests are based on constructs such as the "g" and "s" factors and crystallized and fluid intelligence. Commonly used tests include:
- Stanford-Binet 5
- Wechsler scales:
  - Wechsler Adult Intelligence Scale (WAIS-IV)
  - Wechsler Intelligence Scale for Children (WISC-IV)
  - Wechsler Preschool and Primary Scale of Intelligence (WPPSI-IV)
- Kaufman Assessment Battery for Children (KABC-II)

Clinical Assessment

can be thought of as the "whole person assessment." It refers to the process of assessing clients through multiple methods such as personality testing, observation, interviewing, and performance. Clinical assessments may increase a client's self-awareness or assist the professional counselor in client conceptualization and treatment planning. In this section, we discuss how to assess a person's affective realm using both objective and projective personality tests; the different types of objective and projective personality tests; informal assessment techniques such as observation, clinical interviewing, rating scales, and classification systems; the differences between direct and indirect observations; the three different approaches to clinical interviewing; other types of assessments such as the Mental Status Exam and the Performance Assessment; and suicide assessment—the risk factors associated with suicide, and how to gauge suicide lethality.

Alternative form reliability, also referred to as parallel form reliability or equivalent form reliability

compares the consistency of scores from two alternative, but equivalent, forms of the same test. To establish reliability, two tests that measure the same content and are equal in difficulty are administered to the same group of individuals within a short period of time, and the scores from each test are correlated. Administering parallel forms of the same test addresses the memory and practice effects that arise when the identical test is given twice.

Test-retest reliability (temporal stability)

determines the relationship between the scores obtained from two different administrations of the same test. This type of reliability evaluates the consistency of scores across time. Participant memory and practice effects can influence the accuracy of the test-retest method. Also, the correlation between scores tends to decrease as the time interval between test administrations increases. The test-retest form of reliability is most effective when the instrument assesses stable characteristics (e.g., intelligence).

Face Validity

is not a type of validity, but it is often falsely referenced as one, thus deserving a brief caveat here. Face validity is a superficial measure that is concerned with whether an instrument looks valid or credible. Therefore, your depression test would have face validity if it simply "looked like" it measured client depression.

Construct Validity

is the extent to which an instrument measures a theoretical construct (i.e., idea or concept). It is especially important to establish construct validity when an instrument measures an abstract construct such as personality characteristics. Construct validity is determined by using experimental designs, factor analysis, convergence with other similar measures, and discrimination from dissimilar measures.
• Experimental design validity refers to the implementation of experimental design to show that an instrument measures a specific construct.
• Factor analysis is a statistical technique that analyzes the interrelationships of an instrument's items, thus revealing predicted latent (hidden) traits or dimensions called factors. To demonstrate construct validity, a factor analysis must show that the instrument's subscales are statistically related to each other and to the larger construct. Therefore, on a depression instrument, the items reflecting the subscales (e.g., sadness, loss of interest, irritability, suicidal thoughts, and worthlessness) must be related to the larger construct of depression. Also, the subscales must be somewhat related to each other, but not too closely related, as each subscale is supposed to measure a different facet of depression (for more information regarding factor analysis, see Section 8.6).
• Convergent validity is established when measures of constructs that theoretically should be related are actually observed to be related to each other. In other words, convergent validity shows that the assessment is related to what, theoretically, it should be. For example, if you correlate a new depression test with the Beck Depression Inventory II (BDI-II), a similar instrument that has already been established to measure depression, and find a significantly positive relationship, there is evidence of convergent validity.
• Discriminant validity is established when measures of constructs that are not theoretically related are observed to have no relationship. Therefore, to demonstrate discriminant validity, you would show that scores from the depression measure are not related to scores from an achievement instrument.

Content Validity

is the extent to which an instrument's content is appropriate to its intended purpose. To establish content validity, test items must reflect all major content areas covered by the domain (i.e., a clearly defined body of knowledge).

Inter-scorer reliability (Inter-rater reliability)

is used to calculate the degree of consistency of ratings between two or more persons observing the same behavior or assessing an individual through observational or interview methods. This type of reliability is important to establish when an assessment requires scorer judgment (e.g., subjective responses). For example, if a client were administered a test of open-ended questions regarding depression, you would need to establish inter-scorer reliability by having two or more clinicians independently score the test.
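
One common way to quantify inter-scorer agreement for categorical ratings is Cohen's kappa, which corrects raw agreement for chance. The source names no particular statistic, so this sketch, with invented ratings from two hypothetical clinicians, is illustrative only.

```python
# Cohen's kappa for two raters: (observed agreement - chance agreement)
# divided by (1 - chance agreement). Ratings below are invented.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's category proportions.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

a = ["yes", "no", "yes", "yes", "no", "yes"]  # clinician A's item ratings
b = ["yes", "no", "no",  "yes", "no", "yes"]  # clinician B's item ratings
print(round(cohens_kappa(a, b), 2))           # 0.67
```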

Internal consistency

measures the consistency of responses from one test item to the next during a single administration of the instrument. For example, if a depressed client agreed with the test items "I feel sad," "I feel sad and can't seem to get over it," and "I feel unhappy," the instrument would have good internal consistency.

• Split-half reliability is a type of internal consistency that correlates one half of a test against the other. Using split-half reliability can be challenging because tests can rarely be divided into comparable halves. Another drawback is that it effectively shortens the test by splitting it into two halves; all things being equal, a shorter test yields less reliable scores than a longer test. To compensate mathematically for the shorter length, the Spearman-Brown Prophecy Formula can be used to estimate the full-test reliability: r_full = 2 × r_half / (1 + r_half), where r_half is the correlation between the two halves.
• Inter-item consistency is a measure of internal consistency that compares individual test item responses with one another and with the total test score. Reliability is estimated through mathematical formulas that correlate all possible split-half combinations present in a test. One such formula is the Kuder-Richardson Formula 20 (KR-20), used when test items are dichotomous (e.g., scored as right or wrong, true or false, yes or no). Another formula, Cronbach's Coefficient Alpha, is used when test items have multipoint responses (e.g., Likert-type rating scales with a 4-point response format).
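
As an illustration of inter-item consistency, here is a minimal sketch of Cronbach's Coefficient Alpha computed over hypothetical 4-point Likert responses to the three depression items mentioned above:

```python
from statistics import pvariance  # population variance

def cronbach_alpha(item_scores):
    """item_scores: one list of scores per item, aligned by test-taker."""
    k = len(item_scores)                           # number of items
    totals = [sum(person) for person in zip(*item_scores)]
    item_var = sum(pvariance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_var / pvariance(totals))

# Hypothetical 4-point Likert responses: 3 items x 6 test-takers.
items = [
    [3, 2, 4, 1, 3, 4],   # "I feel sad"
    [3, 1, 4, 2, 3, 4],   # "I feel sad and can't seem to get over it"
    [2, 2, 4, 1, 4, 4],   # "I feel unhappy"
]
print(f"alpha = {cronbach_alpha(items):.2f}")  # high alpha: items cohere
```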

Assessment: Subjective tests

in contrast to objective tests, are sensitive to rater and examinee beliefs. They employ open-ended questions, which have more than one correct answer or more than one way of expressing the correct answer (e.g., essay questions).

Assessment: Objective tests

provide consistency in administration and scoring to ensure freedom from the examiner's own beliefs or biases. Objective tests include questions that have a correct answer (e.g., multiple-choice, true/false, matching).

Validity

refers to how accurately an instrument measures a given construct. Validity is concerned with what an instrument measures, how well it does so, and the extent to which meaningful inferences can be made from the instrument's results.
• Remember that establishing multiple types of validity provides stronger evidence for the credibility of an instrument.

Reliability

refers to the consistency of scores attained by the same person on different administrations of the same test. Ideally, we expect a person to receive the same score each time the test is administered.
• Reliability is concerned with the error found in instruments.

High Stakes Testing

refers to the use of standardized test outcomes to make major educational decisions concerning promotion, retention, educational placement, and entrance into college. The results of high stakes testing can have serious consequences for the students being tested. For example, a student who scores high may receive a college scholarship, whereas a low-scoring student may have his or her high school diploma withheld. Criterion-referenced assessments are typically used in high stakes testing. High stakes testing has received considerable criticism. Specifically, critics argue that a single test or test score is not representative of a student's abilities and, therefore, should not be the only factor considered when making an important educational decision. In addition, the standardized tests used in high stakes testing may fail to account for diverse factors influencing minority students' performance, which widens the achievement gap between higher and lower performing groups.

Assessment: Individual tests

require that a test be administered to one examinee at a time. Individual tests allow professional counselors to establish rapport with the examinee and closely monitor the factors that influence examinee performance (e.g., fatigue, anxiety). However, individual testing is time consuming for the practitioner and costly to the client.

Aptitude Tests

such as the GRE and the SAT, assess what a person is capable of learning. Unlike an achievement test, which measures an individual's current knowledge, an aptitude test attempts to predict how well the individual will perform in the future. Therefore, the GRE does not test your current knowledge but rather predicts your future performance in graduate school. Aptitude testing includes measures that assess intelligence, educational cognitive ability, and vocational aptitude.

COGNITIVE ABILITY TESTS make predictions about an individual's ability to perform in future grade levels, colleges, and graduate schools. Cognitive ability tests assess broad aptitude areas such as verbal, math, and analytical ability. Examples include:
- Cognitive Abilities Test (CogAT Form 6)
- Otis-Lennon School Ability Test (OLSAT 8)
- ACT Assessment
- SAT Reasoning Test
- GRE Revised General Test
- Miller Analogies Test (MAT)
- Law School Admission Test (LSAT)
- Medical College Admission Test (MCAT)

VOCATIONAL APTITUDE TESTING refers to a set of predictive tests designed to measure one's potential for occupational success. These tests are useful to both employers and potential employees: for the employer, test results assist in screening for competent, well-suited employees; for the potential employee, test results can offer career guidance.
• Multiple aptitude tests assess several distinct aspects of ability at one time and are used to predict success in several occupations.
• The Armed Services Vocational Aptitude Battery (ASVAB) is the most widely used multiple aptitude test in the world. Although it was originally developed for the military, the ASVAB now measures the abilities required for a variety of military and civilian jobs. It includes ten ability tests: general science, arithmetic reasoning, word knowledge, paragraph comprehension, mathematics knowledge, electronics information, automotive and shop information, mechanical comprehension, assembling objects, and verbal expression.
• The Differential Aptitude Test, Fifth Edition (DAT) is a multiple aptitude measure for students in Grades 7 through 12. Similar to the ASVAB, the DAT has eight separate tests that assess verbal reasoning, numerical reasoning, abstract reasoning, clerical speed and accuracy, mechanical reasoning, space relations, spelling, and language usage. The DAT also includes a Career Interest Inventory (CII) that assesses student vocational strengths and identifies careers that might interest the student.
• Special aptitude tests assess one homogeneous area of aptitude and are used to predict success in a specific vocational area. Table 7.11 lists several commonly used special aptitude tests.

Assessment: Speed test

use limited testing time to prevent perfect scores. Typically, these tests have easy questions but include too many items to answer in the allotted time. Speed tests assess how quickly the test-taker can understand the question and choose the right answer.

Assessment: Power test

limit perfect scores by including difficult test items that few individuals can answer correctly. These tests measure how well the test-taker can perform given items of varying difficulty, regardless of time or speed of response.

Types of Scales

• A Likert scale, sometimes called a Likert-type scale, is commonly used when developing instruments that assess attitudes or opinions. The item format employed by a Likert-type scale includes a statement regarding the concept in question followed by answer choices that range from Strongly (Agree, Satisfied, etc.) to Strongly (Disagree, Dissatisfied, etc.).
• Semantic differential, also referred to as self-anchored scales, refers to a scaling technique that is rooted in the belief that people think dichotomously. Although there are several varieties of this scale, the most common form involves the statement of an affective question followed by a scale that asks test-takers to place a mark between two dichotomous adjectives.
• A Thurstone scale measures multiple dimensions of an attitude by asking respondents to express their beliefs through agreeing or disagreeing with item statements. The Thurstone scale has equal-appearing, successive intervals and employs a paired comparison method.
• A Guttman scale measures the intensity of the variable being measured. Items are presented in a progressive order so that a respondent who agrees with an extreme test item will also agree with all previous, less extreme items.
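
The cumulative structure of a Guttman scale lends itself to a quick programmatic check: a valid response pattern is a run of agreements followed only by disagreements. A minimal sketch with hypothetical items:

```python
# A Guttman scale is cumulative: agreeing with an extreme item implies
# agreement with every less extreme item before it. This hypothetical
# sketch checks whether a response pattern fits that structure.

def is_guttman_pattern(responses):
    """responses: 1/0 agreements ordered from least to most extreme.
    A valid pattern is a run of 1s followed only by 0s (e.g., 1,1,1,0,0)."""
    return all(a >= b for a, b in zip(responses, responses[1:]))

print(is_guttman_pattern([1, 1, 1, 0, 0]))  # True: fits the cumulative order
print(is_guttman_pattern([1, 0, 1, 0, 0]))  # False: skipped a milder item
```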

Key Legal Issues in Assessment

• Civil Rights Act of 1964 and the 1972, 1978, and 1991 amendments. Assessments used to determine employability must relate strictly to the duties outlined in the job description and cannot discriminate based on race, color, religion, pregnancy, gender, or national origin.
• Family Educational Rights and Privacy Act of 1974 (FERPA). Ensures the confidentiality of student test records by restricting access to scores. At the same time, this law affirms the rights of both student and parent to view the student's records.
• Individuals with Disabilities Education Improvement Act of 2004 (IDEA). Confirms the right of students believed to have a disability to receive testing at the expense of the public school system. The act further mandates that students with disabilities receive an individualized education program (IEP) that specifies the accommodations the student will receive to optimize learning.
• The Vocational and Technical Education Act of 1984. Also known as the Carl D. Perkins Act, this law provides access to vocational assessment, counseling, and placement services for the economically disadvantaged, those with disabilities, individuals entering nontraditional occupations, adults in need of vocational training, single parents, those with limited English proficiency, and incarcerated individuals.
• Americans with Disabilities Act of 1990 (ADA). Employment testing must accurately measure a person's ability to perform pertinent job tasks without confounding the assessment results with a disability. In addition, the act ensures that persons with disabilities receive appropriate accommodations during test administration (e.g., more time to take the assessment, special equipment).
• Health Insurance Portability and Accountability Act (HIPAA) of 1996. Secures the privacy of client records by requiring agencies to obtain client consent before releasing records to others. HIPAA also grants clients access to their records.
• No Child Left Behind (NCLB) Act of 2001. The act aims to improve the quality of U.S. primary and secondary schools by increasing the accountability standards of states, school districts, and schools. As a result, NCLB requires states to develop and administer assessments in basic skills to all students.

Ethical Codes for Professional Counselors

• Competence to use and interpret assessment instruments. Professional counselors should use only the assessment instruments they have been trained in and are competent to administer and interpret.
• Informed consent. Professional counselors are obligated to inform the client of the nature and purposes of an assessment instrument prior to administration, as well as the intended use of the assessment results.
• Release of results to qualified professionals. Assessment results are released only to persons who are qualified to interpret the test data. To include identifying client information with the release of test results, a professional counselor must secure client consent.
• Instrument selection. When selecting assessment instruments, professional counselors should select only instruments that are current and should consider a measure's validity, reliability, psychometric limitations, and multicultural appropriateness.
• Conditions of assessment administration. All assessments should be administered under conditions that facilitate optimal results.
• Scoring and interpretation of assessments. When reporting results, professional counselors should indicate any concerns regarding the validity and reliability of the assessment results due to the testing conditions or the inappropriateness of the norms for the person tested. Professional counselors are also to document in the client record how the assessment results will be appropriately integrated into the counseling process.
• Obsolete assessments and outdated results. Professional counselors should refrain from using assessment instruments or results that are outdated for the present need.
• Assessment construction. Professional counselors are to use established scientific methodology and current professional knowledge when developing psychological and educational measures. Professional counselors who develop assessments are also to provide test users with the benefits and limitations of the assessment, as well as stress the importance of relying on multiple sources to make decisions in the counseling process.

Item Analysis

• Item analysis is a procedure that involves statistically examining test-taker responses to individual test items, with the intent of assessing the quality of the test items and the test as a whole. Item analysis is frequently used to eliminate confusing, overly easy, and overly difficult items from a test that will be used again.
• Item difficulty refers to the percentage of test-takers who answer a test item correctly. Test authors compute item difficulty to ensure that items provide an appropriate level of difficulty and to increase the variability of scores. Item difficulty is calculated by dividing the number of individuals who correctly answered the item by the total number of test-takers. The result is expressed as a decimal called the p value, which ranges from 0 to 1.0, with higher values indicating an easier test item. For example, a test item with a p value of .90 is interpreted as easy because 90% of the test-takers answered the question correctly. In general, test authors aim for items with an average p value of .50; because half of the test-takers will miss such an item, this p value yields the most variation in a test score distribution.
• Item discrimination is the degree to which a test item correctly differentiates test-takers who vary on the construct measured by the test. For example, items on a depression inventory need to discriminate between test-takers who are depressed and those who are not. Item discrimination is calculated by subtracting the proportion of the bottom quarter of total scorers who answered the item correctly from the proportion of the top quarter who did so. Most test developers consider a test item a good discriminator when more upper-group members than lower-group members answer the question correctly (i.e., positive item discrimination). Items with zero or negative item discrimination are generally considered poor.
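
Both indices reduce to simple proportions. A minimal sketch with hypothetical scores (the quarter split and all data are invented for illustration):

```python
# Hypothetical item analysis: each test-taker has a total test score
# and a 1/0 (correct/incorrect) response to one item.

def item_difficulty(item_responses):
    """p value: proportion of test-takers answering the item correctly."""
    return sum(item_responses) / len(item_responses)

def item_discrimination(totals, item_responses):
    """D = p(upper quarter) - p(lower quarter), split by total score."""
    ranked = sorted(zip(totals, item_responses), key=lambda t: t[0])
    quarter = len(ranked) // 4
    lower = [item for _, item in ranked[:quarter]]
    upper = [item for _, item in ranked[-quarter:]]
    return item_difficulty(upper) - item_difficulty(lower)

totals = [55, 72, 90, 64, 81, 47, 95, 68]   # total test scores
item   = [0,  1,  1,  0,  1,  0,  1,  1]    # responses to one item

print(f"p value: {item_difficulty(item):.2f}")                # 0.75: easy item
print(f"discrimination: {item_discrimination(totals, item):+.2f}")  # positive
```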

Sources of Information on Assessments

• Mental Measurements Yearbook (MMY). The best source for information regarding commercially available assessment instruments in the English language is the MMY, published by the Buros Institute of Mental Measurements every 2 to 8 years. Each entry in the MMY offers pertinent assessment information, including the test name, acronym, test author and publisher, copyright date, purpose, intended test population, administration time, forms, and prices. The MMY also contains information related to test reliability and validity, norming data, scoring and reporting services, and available foreign language versions. The majority of entries include test critiques by experts in the testing and assessment field; these reviews provide an additional measure of credibility to the assessment instruments.
• Tests in Print (TIP). Tests in Print is published by the Buros Institute of Mental Measurements every 3 to 13 years as a companion to the MMY. It offers a comprehensive listing of all published and commercially available tests in psychology and education. TIP provides information regarding the test title, intended population, publication date, acronym (if applicable), author, publisher, foreign adaptations, and references. Unlike the MMY, TIP does not provide critical reviews or psychometric information on the assessment instruments.
• Tests. Tests, published by PRO-ED, Inc., contains information on thousands of assessment instruments in the psychology, education, and business industries. Tests provides quick access to concise instrument descriptions that include the test title, author, publisher, intended test population, purpose, major features, administration time, scoring method, cost, and availability. This resource does not provide assessment critiques or information regarding test norms, validity, or reliability.
• Test Critiques. Test Critiques, also published by PRO-ED, is designed to be a companion text to Tests. Each entry in Test Critiques contains an overview of the assessment, practical applications (e.g., intended population, administration, scoring, and interpretation procedures), and information regarding the instrument's reliability and validity. In addition, Test Critiques offers comprehensive reviews of psychological assessments from testing experts; the reviews average eight pages in length. The text is designed to be a user-friendly resource written not only for professionals but also for persons unfamiliar with assessment jargon. Test Critiques is updated annually.

Factors that Influence Reliability

• Test length. Longer tests are generally more reliable than shorter tests.
• Homogeneity of test items. Lower reliability estimates are reported when test items vary greatly in content.
• Range restriction. The reliability of test scores is lowered by a restriction in the range of test scores.
• Heterogeneity of the test group. Test-takers who are heterogeneous on the characteristic being measured yield higher reliability estimates.
• Speed tests. These yield spuriously high reliability coefficients because nearly every test-taker answers nearly every attempted item correctly.
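
The test-length effect can be projected with the general form of the Spearman-Brown formula introduced under split-half reliability, with length multiplied by a factor n rather than doubled. A minimal sketch with a hypothetical starting reliability:

```python
# General Spearman-Brown projection: expected reliability when a test
# is lengthened (or shortened) by a factor n, assuming comparable items.

def spearman_brown(r, n):
    """Projected reliability when test length is multiplied by n."""
    return (n * r) / (1 + (n - 1) * r)

r = 0.70  # reliability of the current test (hypothetical)
for n in (0.5, 1, 2, 3):
    print(f"length x{n}: projected reliability = {spearman_brown(r, n):.2f}")
```

Running the loop shows reliability falling when the test is halved and rising toward 1.0 as comparable items are added, which is the pattern the first bullet describes.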

