Chapter 5

What do both types of error do?

Both types of error play a part in obscuring the true measure.

Variability

Variability refers to the extent to which scores in a distribution differ from one another, that is, how "spread out" a group of scores is. If a distribution is lacking in variability, we may say that it is homogeneous (note that the opposite would be heterogeneous).

For the purposes of predicting future behavior, scores, success, and performance, as well as for diagnostic purposes, reliability is generally more important.

False (validity)

For the purposes of estimating the consistency and stability of the scale, validity becomes the primary area of concern and seems to be more important.

False (reliability)

symmetrical distribution

a normal distribution or normal curve (you may also see the term bell curve). Visually, a normal curve has one peak (or mode) and has roughly the same amount of area on both sides of that peak (normal curve)

What are the two sources related to external criterion

concurrent and predictive evidence

Classical test theory subsumed three primary types of evidence that supported the validity of a measure:

construct validity, content validity, and criterion validity.

Multimodal (bimodal)

if a distribution has more than one peak

Memory effect (test-retest reliability)

if the time interval is short, people may be overly consistent because they remember some of the questions and their responses

What are the three frequently used measures in variability?

range, variance, and standard deviation.

Convergent evidence

refers to the degree to which scores on a test correlate highly and positively with scores on other tests that are designed to assess the same construct or trait.

Correlation values higher than _____ are considered satisfactory or good. When making clinical decisions, however, reliability scores of ____ or higher are preferred. Approaches commonly used to estimate reliability include inter-rater reliability, test-retest reliability, parallel-form reliability, and internal consistency reliability. Internal consistency reliability can furthermore be calculated in various ways, including split-half reliability, the Kuder-Richardson 20, and Cronbach's alpha.

0.80 0.90

What are the steps to compute variance?

1. Find the mean. 2. Find the difference between each observation and the mean. 3. Square each difference score. 4. Sum the squared differences. 5. Because the data is a sample, divide the sum of the squared differences (from step 4) by the number of observations minus one, that is, n - 1 (where n is equal to the number of observations in the data set).
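The five steps can be sketched in Python. This is a minimal illustration with made-up scores; the helper name `sample_variance` is hypothetical:

```python
def sample_variance(scores):
    """Sample variance via the five steps described above."""
    n = len(scores)
    mean = sum(scores) / n                       # Step 1: find the mean
    differences = [x - mean for x in scores]     # Step 2: difference from the mean
    squared = [d ** 2 for d in differences]      # Step 3: square each difference
    total = sum(squared)                         # Step 4: sum the squared differences
    return total / (n - 1)                       # Step 5: divide by n - 1 for a sample

# Made-up scores: mean = 6, squared deviations sum to 40, so variance = 10.0
print(sample_variance([2, 4, 6, 8, 10]))  # 10.0
```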

Evidence related to consequences of testing

A final, but essential, component related to validity is an examination of the possible consequences of using a particular assessment. This in essence ensures that no harm comes from using the assessment. For instance, if validity is how well an assessment measures what we think it should be measuring, using a test that is supposed to measure achievement to classify individuals as developmentally delayed may compromise the validity of the test because the assessment is now being used to measure aptitude and not achievement. It is important to keep in mind that assessments can frequently inform social policy and that assessment users are responsible for the ethical use and interpretation of the assessment scores.

Histogram (bar graph)

A histogram uses adjacent vertical bars to indicate the frequencies for each value of your variable, X, which are labeled along the horizontal x-axis. The height of each bar, as measured along the vertical y-axis, denotes the frequency for each value. Unlike regular bar graphs, the bars on the histogram are contiguous and serve as a visual reminder of the increasing magnitude of the variable. Histograms also can be used to represent percentages instead of, or in addition to, frequency. When the vertical y-axis is expressed in percentage, rather than frequency units, the figure becomes a percentage histogram.

Nominal scale

A nominal scale consists of numbers assigned to groups or categories. No inherent quantitative value of the information is conveyed with these numbers, and no ordering of the numbers is implied. Therefore, we consider nominal scales qualitative rather than quantitative. The numbers assigned to each condition of the variable serve merely as labels and are completely arbitrary. The only statistical analysis permitted on nominal scales is determining frequencies for each category.

Ratio

A ratio scale is like an interval scale, except that it has a true zero point. The true zero point allows researchers to meaningfully compare the magnitude of one case to another. Because of this absolute zero point, the ratio scale allows us to apply all the possible statistical procedures in data analysis and therefore is generally preferred by researchers when this approach is feasible for the variables being studied.

What are factors that can influence reliability?

Administration Errors = Not following administration guidelines can cause tests to be less reliable.
Test Length = Typically, as the number of questions increases, so does the reliability of the assessment.
Item Homogeneity = The more similar the test questions, the more reliable the test.
Item Difficulty = If a test is too hard or too easy, the less reliable it will be in its ability to discriminate levels of responses.
Interval Between Tests = Error effects
Objectivity = Assessment responses that are subjective and involve observations or scoring by raters introduce additional sources of error.
Testing Environment/Student Factors = Various factors related to the student (e.g., fatigue/illness) or the test-taking environment (e.g., temperature, distractions) can make assessments less reliable.

Variance

An important measure of variability that does take into consideration each data point in the distribution is variance. This measure of variability can be defined in terms of how close the scores in the distribution are to the mean. Variance is the average of the squared deviations from the mean. For a population, σ² = Σ(Xi − μ)² / N, where Xi is each data point (i simply denotes that there are various data points and you are selecting each individually), σ² is the variance, μ is the population mean, and N is the number of cases. The formula for the variance of a sample of scores, called "s-squared," is s² = Σ(Xi − X̄)² / (n − 1), where s² is the estimate of the variance and X̄ is the sample mean. You will notice that the heart of the variance is the deviation of the score from its mean (Xi − X̄). The deviation of the scores generates variance. If the deviations are small, then the scores are close to the mean. If the deviations are large, then the scores are spread out. However, the problem with deviations is that when you sum them, you always get 0. This is due to the nature of a mean value; the sum of the deviations above the mean equals the sum of the deviations below the mean.

What are other concerns related to split-half reliability?

Another concern with the split-half procedure is that the reliability estimate will vary as a function of how the test was split. Some splits may give a much higher correlation than other splits. It is also not appropriate on tests in which speed is a factor. Therefore, other procedures are recommended for these types of tests known as speeded tests. Despite these disadvantages, this reliability procedure has an advantage in terms of getting a measure of reliability from a single administration of one form of a test.

What is another source of evidence related to validity?

Another source of evidence of validity relates to how individuals respond to the test questions. If participants respond more from a place of wanting to be socially desirable, then that question is actually testing social desirability rather than the construct of interest. This type of evidence can also help testers to recognize how much differences in ways participants interpret the questions influence outcomes.

Graphing frequencies

Another way to look at frequencies is graphically. The most common methods of graphing a distribution are the histogram (bar graph) and the frequency (or percentage) polygon.

Types of validity

Content = The extent to which the measurement adequately samples the content domain
Construct = The extent to which the test is an accurate measure of a particular construct or variable. Subtypes:
1. Discriminant = Extent to which scores on a test do not correlate with or negatively correlate with scores on another test that was intended to measure a different construct
2. Convergent = Extent to which scores on a test correlate positively with scores on another test designed to assess the same construct
Criterion = The extent to which a test is related to some external criterion of the construct being measured (subtypes: concurrent and predictive)

Criterion validity

Criterion validity refers to the extent to which a test is related to some external criterion of the construct being measured. That is, criterion validity is a test that predicts an outcome based on information from other measurements. These other measurements are often represented as criteria.

Cronbach's Alpha

Cronbach's alpha (symbolized as α) is a modification of KR-20 that can be used with both binary and multiresponse measures and is the most widely reported reliability procedure for determining internal consistency. Cronbach's alpha ranges from α = .00 (indicating that the test is entirely in error) to α = +1.00 (indicating that the measure has no error). Again, generally speaking, the higher the alpha, the more reliable the test is considered to be. It is a common misconception that if the Cronbach's alpha is low, it must be a bad test. However, keep in mind that the test may measure several constructs rather than one construct, and if so, the Cronbach's alpha may be deflated. Therefore, determining only a single Cronbach's alpha for the overall score would be misleading. It is important to determine the dimensionality of an assessment (i.e., whether it contains a single overall scale or subscales) and thus how scores are reported (e.g., just the overall score, or subscale scores, or both), and to compute a Cronbach's alpha for each scale reported.
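As a sketch of the idea, alpha can be computed from the item variances and the variance of the total scores (the standard item-variance formula). The function name and responses below are made up for illustration:

```python
def cronbach_alpha(responses):
    """responses: one row per respondent, one column per item.
    alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores)."""
    k = len(responses[0])  # number of items

    def var(xs):  # sample variance helper
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = [var([row[i] for row in responses]) for i in range(k)]
    total_var = var([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Two perfectly consistent items yield alpha = 1.0 (no error)
print(cronbach_alpha([[1, 1], [2, 2], [3, 3]]))  # 1.0
```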

Descriptive statistics

Descriptive statistics are used to explain the basic characteristics of study data, including describing the numbers and what they show. Descriptive statistics help us to simplify large amounts of data by providing a summary that may enable comparisons across people or other units. Descriptive statistics often function as a bridge between measurement and understanding. Descriptive statistics can be a means of finding order and meaning in this apparent chaos of numbers. Usually the raw data can be reduced to one or two descriptive summaries such as the mean and standard deviation or illustrated visually through various graphical procedures such as histograms, frequency distributions, and scatter plots.

Modern approach to validity

Evidence of validity based on . . . the extent to which the test is an accurate measure of a particular construct or variable.
Content = Evidence of the measurement adequately sampling the content of the construct
Response processes = Analysis of responses to individual test items, rationale of responses, performance strategies, even possibly eye movement and response times
Internal structure = Evidence that the items on the measure relate to one another (e.g., one factor or multiple components of the construct) in a way that reflects the theoretical basis of the construct
Relation to external criterion = Evidence that the measure is appropriately related to an external criterion of the construct being measured (examples: concurrent and predictive evidence)
Relation to other variables:
Discriminant evidence = Extent to which scores on a test do not correlate with or negatively correlate with scores on another test that was intended to measure a different construct
Convergent evidence = Extent to which scores on a test correlate positively with scores on another test designed to assess the same construct
Consequences of testing = Evidence of the intentional and inadvertent consequences of using the measures

Outliers

Extreme values, which are on the tails of the distribution and create the skew.

Factor analysis (another way to measure convergent and discriminant evidence)

Factor analysis is a statistical procedure for analyzing the interrelationships among a set of variables to uncover the underlying dimensions or constructs that explain the relationships among observed variables. Through the factor analysis procedure, items that measure the same construct will be grouped together. These constructs are then named according to their characteristics allowing a researcher to break down information.

"you can have validity without having reliability, but you cannot have reliability in the absence of validity. This will be important to remember as we move forward."

False ("you can have reliability without having validity, but you cannot have validity in the absence of reliability. This will be important to remember as we move forward.")

Inter-rater reliability

For tests that have subjective responses and rely on observers or raters to provide scores, error can be introduced by the individuals scoring the assessment. Inter-rater reliability then assesses the consistency or agreement between two or more scorers when making independent ratings of a particular construct. There are multiple ways of calculating inter-rater reliability. Cohen's kappa and additional variations of kappa are used to assess ratings that consist of nominal data. For ordinal, interval, and ratio data, intra-class correlations (ICCs) are determined.
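For nominal ratings, Cohen's kappa compares observed agreement with the agreement expected by chance. A minimal sketch with made-up ratings from two hypothetical raters:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """kappa = (p_o - p_e) / (1 - p_e) for two raters' nominal ratings."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # observed agreement
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # chance agreement: probability both raters independently pick the same category
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in counts_a)
    return (p_o - p_e) / (1 - p_e)

ratings_a = ["yes", "yes", "no", "no"]
ratings_b = ["yes", "no", "no", "no"]
print(cohens_kappa(ratings_a, ratings_b))  # 0.5
```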

Kurtosis

the skinniness or flatness of a normal distribution, that is, how high the peak is around the mean. How skewed a distribution is, the number of modes it has, and its kurtosis can all influence statistical calculations.

Z-scores

If the mean and standard deviation are known, individual scores can be pictured relative to the entire set of scores in the distribution through standardization. The standard normal distribution is the normal distribution with a mean of 0 and a standard deviation of 1. Therefore, you can put all normal distributions into the same units: standard deviations from the mean μ. When you standardize a raw score to a z-score, it tells you how far a person is from the mean, in the metric of standard deviation units. A score that is at the mean would have a z-score of 0. As you can see, z-scores are decimal numbers that can be positive or negative. The formula is z = (X − μ) / σ, where X is a raw score to be standardized, σ is the standard deviation of the population, and μ is the mean of the population. Calculating z requires the population mean and the population standard deviation, not the sample mean or sample standard deviation.
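The standardization reduces to a one-line formula. The raw score and population parameters below are hypothetical:

```python
def z_score(x, mu, sigma):
    """z = (X - mu) / sigma, using the population mean and standard deviation."""
    return (x - mu) / sigma

# A raw score of 115 where the population mean is 100 and the SD is 15
print(z_score(115, 100, 15))  # 1.0
print(z_score(100, 100, 15))  # 0.0 (a score at the mean)
```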

Internal consistency

Internal consistency focuses on the degree to which the individual items are correlated. The test that has a high degree of internal consistency reliability has items that are homogeneous, measure a single construct, and correlate highly. Internal consistency reliability could be computed using Split-Half Reliability, the Kuder-Richardson formula 20 (KR-20) or Cronbach's alpha.

Interval

Interval scales take the notion of ranking items one step further. Unlike ordinal scales, interval scales have equivalent and meaningful distances between scale points. EX. Many of the standardized tests in the counseling profession use interval scales; scores from standardized intelligence tests are a good example of interval scale scores. However, interval scales do not have a "true zero" or meaningful zero point. This means that it is not possible to make statements about how many times higher one score is than another. The interval scale of measurement only permits use of the following statistical procedures: mean and standard deviation, correlation and regression, analysis of variance (ANOVA), and factor analysis.

Measures of central tendency

Measures of central tendency are intended to describe the most average or "typical" score in the distribution.

What are the four levels first proposed by Stanley Smith Stevens in his 1946 article, "On the Theory of Scales of Measurement"?

Nominal Ordinal Interval Ratio

What is a goal in the field of measurement?

One goal in the field of measurement and evaluation is to minimize these errors, limiting them to what is expected or appropriate for the purposes of the test.

Test-retest reliability

One way to estimate the reliability of an assessment is to administer the same assessment on two occasions and to correlate the paired scores. The closer the two results are, the greater the test-retest reliability of the assessment. The correlation coefficient between such two sets of responses is called a test-retest reliability coefficient. A test-retest coefficient assumes that the characteristic being measured by the test is stable over time. Because of this assumption, it may not be appropriate for measuring traits that fluctuate over time, such as emotions, or for assessments that aim to measure clinical improvement or decline over time. As such, the length of time in between testing must be considered. The appropriate length of the interval depends on the stability of the variables that are measured. On the other hand, if time between testing is too long, differential learning and maturation may be a problem. Also, respondents could learn to answer the same questions in the first test, and this could affect their responses in the next test.
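The test-retest coefficient is simply the Pearson correlation between the paired scores from the two administrations. A sketch with made-up time-1 and time-2 scores:

```python
def pearson_r(x, y):
    """Pearson correlation between two sets of paired scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Respondents keep the same ordering across administrations: high reliability
time1 = [10, 12, 14, 16]
time2 = [11, 13, 15, 17]
print(round(pearson_r(time1, time2), 4))  # 1.0
```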

Ordinal scale

Ordinal scales divide observations into categories and provide measurement by order and rank. Ordinal scale permits the measurement of degrees of difference or relative differences, but not the specific amount of difference. Measurements within ordinal scales are ordered in the sense that higher numbers represent higher values, but the intervals between the numbers are not necessarily equal. Ordinal scales are very common in counseling research. Any questions that ask the respondent to rate something are using ordinal scales. Ordinal scales allow for counselors to calculate the mode and median of a sample, but not the mean. The range and percentile ranking of a sample can also be calculated with ordinal scales, but not the standard deviation. EX. Likert scale Counseling researchers who classify Likert scales as ordinal have pointed out that the distance between scale points is unequal. Counseling researchers who classify Likert scales as roughly interval scales consider the distance between scale points to be "approximately equal intervals."

Parallel-forms reliability

Parallel forms reliability, which is also referred to as alternative- or equivalent-form reliability, produces a reliability coefficient based on the administration of two equivalent versions of the same test (Form A and Form B) to one group of people. The parallel forms are typically matched in terms of content and difficulty. The correlation of scores on pairs of parallel forms for the same respondents provides the parallel forms reliability coefficient. A high parallel form reliability coefficient indicates that the different forms of the test are very similar, and it makes virtually no difference which version of the test a person takes. On the other hand, a low parallel form reliability coefficient suggests that the different forms are probably not comparable; they may be measuring different things and therefore cannot be used interchangeably. Practice effects remain a concern with this form of reliability procedure. However, an advantage of this procedure is that it controls for test sensitization and yields a coefficient that reflects two aspects of test reliability: variation from one time to another as well as variation from one form of the test to another. Despite these advantages, this is the most demanding, expensive, and difficult procedure for determining the reliability of a test. In addition, even with the best test and item specifications, each test would contain slightly different content, and as with test-retest reliability, maturation and learning may confound the results.

What are reliability and validity used for?

Reliability and validity are used to describe the accuracy and consistency with which an assessment tool measures what it is supposed to measure and ensures that bias and distortion are not significantly impacting the results.

Reliability

Reliability refers to the extent to which assessments are consistent. A reliable measurement is free from error and provides consistent results. Reliability focuses only on the degree of nonsystematic or random error in an assessment. When random error is minimal, a measure is said to have a high degree of reliability, and thus scores from this test are expected to be more consistent from administration to administration.

Split-half reliability

Split-half reliability is based on the correlation between halves of the measure. That is, the split-half reliability coefficient is obtained by dividing a test into halves, correlating the scores on each half, and then correcting for length (longer tests tend to be more reliable). The split can be based on odd- versus even-numbered items, randomly selecting items, or manually balancing content and difficulty. The most common procedure is to correlate the scores on the odd-numbered items of the test with the scores on the even-numbered items. If each respondent maintains a very similar response on the two sections (odd items versus even items), the reliability coefficient would be high. A concern with the split-half procedure revolves around the shortening of a test. Generally speaking, the more items a test has (i.e., the longer it is), the more reliable it is. If everything else is equal, more items produce more variation in test scores, which increases reliability. For this reason, an adjustment to the split-half reliability is recommended.

T-scores

T-scores are a second category of standardized scores that are used widely to report performance on standardized tests and personality inventories. T-scores also create a common metric for comparing across samples. Instead of using a mean of 0 and a standard deviation of 1 as in z-scores, T-scores are based on a mean of 50 and a standard deviation of 10. Unlike z-scores, T-scores are whole numbers that are always positive. To calculate a T-score, first determine the z-score for the raw data; z-scores can then be easily converted to T-scores.
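Because a T-score just rescales a z-score to a mean of 50 and a standard deviation of 10, the conversion is T = 50 + 10z. Rounding to a whole number is assumed here, since T-scores are reported as whole numbers:

```python
def t_score(z):
    """Convert a z-score to a T-score: T = 50 + 10 * z, reported as a whole number."""
    return round(50 + 10 * z)

print(t_score(0.0))   # 50 (a score at the mean)
print(t_score(1.5))   # 65
print(t_score(-2.0))  # 30
```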

Types of Reliability Indices:

Test-Retest Reliability = Administer the same test twice and correlate the scores.
Parallel Form Reliability = Administer similar, but not identical, tests and correlate the scores.
Internal Consistency = Correlate the individual items of a test to each other.
Split-Half Reliability = Divide a test into halves (e.g., first v. second half, odd v. even questions) and correlate the scores on each half.
Kuder-Richardson Formula 20 = Test of internal consistency reliability for dichotomous response items; essentially an average of the reliability coefficients from all possible split-half combinations.
Cronbach's alpha = Test of internal consistency reliability for continuous response items; essentially an average of the reliability coefficients from all possible split-half combinations.
Inter-rater Reliability = Determine agreement in scores among two or more raters with subjective assessments.

Kuder-Richardson Formula 20

The KR-20 is a measure of homogeneity for dichotomous responses (i.e., yes-no, correct-incorrect), which functions under the assumption that all items on a test measure the same thing or are of the same difficulty level. The KR-20 avoids the arbitrariness of a single split by computing an average across all possible split-halves, yielding an overall reliability estimate. In addition, this type of split-half procedure is also a less expensive, although less direct, way of taking into account different samples of test items (for example, evaluating whether all items on a test are measuring the same construct). However, the KR-20 formula should not be used if the test has items that are not dichotomous.

Spearman Brown formula

The Spearman-Brown formula can be employed when estimating reliability using the split-half method: R = 2r / (1 + r), where R is the total test reliability and r is the correlation between the two halves of the test.
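For instance, a half-test correlation of .60 (a made-up value) corrects to a full-test reliability of .75:

```python
def spearman_brown(r):
    """Full-test reliability R = 2r / (1 + r), where r is the half-test correlation."""
    return (2 * r) / (1 + r)

print(round(spearman_brown(0.6), 2))  # 0.75
print(spearman_brown(1.0))            # 1.0
```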

mean vs median

The mean and median are both measures of central tendency. The mean and the median can both be meaningful for symmetric distributions. In a normal distribution, the mean and median will be equal. If we have a skewed distribution, however, this relationship changes. In general, the mean will be higher than the median for positively skewed distributions and lower than the median for negatively skewed distributions. In addition, the mean is more heavily influenced by extreme scores or outliers than is the median. Thus, the median rather than the mean would be a better measure of central tendency for extremely skewed distributions.
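A quick check of the skew relationship with a made-up, positively skewed set of scores (one extreme high value):

```python
import statistics

scores = [20, 22, 23, 25, 90]       # hypothetical data with a high outlier
mean = statistics.mean(scores)      # 36: pulled upward by the outlier
median = statistics.median(scores)  # 23: unaffected by the outlier
print(mean > median)  # True, as expected for a positively skewed distribution
```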

Mean

The mean is probably the most commonly used method of describing central tendency. This is the arithmetic average of all scores: to compute the mean, all you do is add up all the values and divide by the number of values. For a sample, X̄ = ΣX / N, where X̄ is the sample mean (μ denotes the population mean), the Greek letter sigma (Σ) is the summation sign, X denotes each individual observation, and N represents the total number of observations.

Median

The median is the value found at the exact middle of the distribution. One way to compute the median is to list all scores in numerical order and then locate the score in the center of the sample.

Mode

The mode is the easiest measure to understand because it is determined by inspection rather than by computation. The mode is simply the most frequent score in the distribution. On a frequency polygon, the mode is the score at the peak of the distribution. The mode is a useful descriptive statistic when studying nominal (categorical) variables such as gender and race to denote the most common category. However, the mode is often not as useful as an indicator of central tendency for numerical data, especially bimodal or multimodal data or distributions that are highly skewed.

Range

The range is the simplest measure of variability to calculate, and one you have probably encountered many times in your life. The range is calculated by subtracting the lowest score in the distribution from the highest score. The range gives you a very quick measure of variability, but its major disadvantage is that it does not include all the observations and therefore excludes a great deal of information.

Stanine

The stanine method of standardizing scores is also known as the standard nine scale. This standardized scale consists of whole numbers ranging from 1 to 9 with a mean of 5 and a standard deviation of 2. Under the normal curve, each stanine score represents a wide band of raw scores and percentile ranks. A normal distribution is divided into nine intervals, each of which has a width of one half of a standard deviation, excluding the first and last intervals. The mean lies in the center of the fifth interval. To convert raw scores to stanine scores, first rank the results from lowest to highest. To determine the nine intervals, place the lowest 4% of the scores into a stanine of 1, the next 7% in the second stanine, and so on.
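The percentage bands (4, 7, 12, 17, 20, 17, 12, 7, 4) give cumulative cut points of 4, 11, 23, 40, 60, 77, 89, and 96. A sketch that maps a percentile rank to a stanine under that convention (the exact handling of boundary values is an assumption):

```python
def stanine(percentile):
    """Map a percentile rank (0-100) to a stanine (1-9) using the
    standard 4-7-12-17-20-17-12-7-4 percentage bands."""
    cutoffs = [4, 11, 23, 40, 60, 77, 89, 96]  # cumulative upper bounds, stanines 1-8
    for s, cut in enumerate(cutoffs, start=1):
        if percentile <= cut:
            return s
    return 9  # top 4% of scores

print(stanine(3))   # 1: lowest 4% of scores
print(stanine(50))  # 5: middle band, centered on the mean
print(stanine(99))  # 9: top 4%
```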

Standard deviation

The variance and the standard deviation are essentially the same idea. They provide the same information in that one can always be obtained from the other. The standard deviation is the square root of the variance and converts the variance back in the same units as the raw data.

Evidence of internal structure

This form of evidence encapsulates how well the structure and relationship of the variables in the assessment correspond with the theoretical understanding of the construct. This can be determined by examining the relationships among assessment items and factors. Approaches such as item response theory, differential item functioning, and factor analysis can be helpful in determining evidence related to internal structure.

Because reliability and validity are interconnected in their use in research design, measurement, and instrumentation, they are both useful, and it is difficult to designate one as more important than the other.

True

Evidence of content validity is commonly determined by an expert or expert panel who can judge the representativeness of the items on a test.

True

In statistics, the level of measurement of a variable is a classification that describes the nature of information contained within the numbers that are assigned to that particular variable.

True

Many of our statistical theories, and thus the statistical tests that we use, have as an underlying assumption that we are working with a normal distribution.

True

The presence of error is especially true for educational and psychological tests in the counseling field. The error introduced in any tests can be either systematic or random.

True

When conducting research, the researcher needs to know which level of measurement applies to each research variable. Different statistical procedures are possible depending on the level at which a variable is measured.

True

One challenge in statistics surfaces when we want to compare points from two different samples. Even if each sample follows a normal curve, they will likely have different means and standard deviations.

True. Converting sample scores, or raw scores, to a standard metric helps us to compare across a sample and have a standard scale of reference. This process of standardizing scores also allows us to determine the probability of a particular score occurring within our sample. Creating standardized scores essentially means putting the score in terms of a mean and standard deviation. There are different types of standard scores, with z-scores, T-scores, and stanine scores being the most prevalent.

Construct validity

Validity and what is understood singularly as construct validity according to classical test theory refers to the extent to which the test is an accurate measure of a particular construct or variable. In essence, is the assessment measuring what we think it should be measuring? Therefore, determining how well these items or criteria represent the underlying construct is essential, not to mention an often complicated and indirect process. The various sources of evidence are used to substantiate the degree to which the measure accurately and adequately reflects the underlying construct.

Validity

Validity has been defined broadly as the extent to which a test measures the construct or variables that it purports to measure. Validity is also defined as the applicability, meaningfulness, and usefulness of the specific inferences made from scores. A test can be reliable without being valid.

What is important to know when answering this question: "What type of data do I need to collect to best understand this particular client?"

When attempting to answer this question, however, it is essential that any data collection procedure used (e.g., surveys, questionnaires, interviews, personality scales) accurately assesses the construct of interest and is free of bias and distortion. To do this, a counselor needs to assess the reliability and validity of the assessment tool.

Normal distribution rule

When our data represent a normal curve, we can easily calculate what percentage of the sample will fall within particular intervals under the curve relative to the mean and standard deviation.

frequency (percentage) polygon

Appropriate for quantitative data. Here the years of experience are plotted on the x-axis, and the frequency is on the y-axis. An interval having zero frequency is added below the interval containing the lowest value, and a second interval with zero frequency is added above the interval containing the highest value. With polygons, a point is located above the midpoint of each interval to denote the frequency of cases in that interval. Frequency polygons are especially useful for continuous data (such as age or height). Also, when comparing two or more distributions within the same figure, polygons can look cleaner than histograms, although clustered or stacked histograms can also be used.
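The polygon construction described above (midpoints plus zero-frequency end intervals) can be sketched as follows; the years-of-experience data are hypothetical:

```python
from collections import Counter

# Hypothetical years-of-experience data, grouped into width-1 intervals.
years = [1, 2, 2, 3, 3, 3, 4, 4, 5]
freq = Counter(years)

lo, hi = min(freq), max(freq)
# Add a zero-frequency interval below the lowest value and another above the
# highest value, so the polygon begins and ends on the x-axis.
xs = range(lo - 1, hi + 2)
# One point per interval: (midpoint of the interval, frequency in the interval).
points = [(x + 0.5, freq.get(x, 0)) for x in xs]
```

Plotting these points and connecting them with straight lines yields the frequency polygon.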

Systematic error

Associated with where the zero point is printed on the ruler. Systematic errors increase or decrease a response by a predictable amount with each measurement.

Random error

Associated with your eye's ability to read and extrapolate between the markings. Random errors result from chance alone and influence measures arbitrarily.

cumulative percentage

Calculated by taking the percentage of participants who fall at each year or below.

Why can it be helpful to graphically plot and visualize data?

can help to determine the pattern and shape of the distribution. In large data sets, the shape of a distribution generally fits one of three possible patterns: symmetrical, skewed, or multimodal.

Classical Test Theory

classical test theory suggests that any observed score (OS) consists of the true score (TS) plus some amount of error (OS = TS + error).
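The OS = TS + error model can be illustrated with a small simulation; the true score of 100 and the normally distributed error with standard deviation 5 are arbitrary assumptions:

```python
import random

random.seed(0)
TRUE_SCORE = 100  # the examinee's hypothetical true score (TS)

def observe():
    """One administration: OS = TS + random error (here, normal with SD 5)."""
    return TRUE_SCORE + random.gauss(0, 5)

# Because random error influences scores arbitrarily in both directions,
# the mean of many observed scores approaches the true score.
observed = [observe() for _ in range(10_000)]
estimate = sum(observed) / len(observed)
```

This is why repeated or longer measurements are more reliable: random error tends to cancel out, while the true score does not.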

What is the 68-95-99.7 rule?

Designates the percentages of the data under any normal curve that fall within one, two, and three standard deviations of the mean. According to this rule, 68% of the data will fall within one standard deviation of the mean (34% between the mean and one standard deviation below the mean, and 34% between the mean and one standard deviation above the mean). In addition, 95% of the data will be within two standard deviations, and a full 99.7% of the data falls within three standard deviations of the mean.
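The rule's percentages can be recovered from the standard normal distribution using only the math module; `within_k_sd` is a hypothetical helper name:

```python
from math import erf, sqrt

def within_k_sd(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

# within_k_sd(1) ~ 0.68, within_k_sd(2) ~ 0.95, within_k_sd(3) ~ 0.997
```

The exact values (68.27%, 95.45%, 99.73%) are usually rounded to 68-95-99.7 for convenience.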

Frequency distribution

examine the frequency with which each of your data points (or ranges of points) occur. This information can be placed in a table known as a frequency distribution. The frequency distribution is the most common way to describe a single variable and display the chaos of numbers in an organized manner. The first column in the table contains all the different values of the raw data (referred to as the values of "X"). The second column contains the frequency or number of participants with each number of years of experience. The third column in the frequency table contains the percentage of participants. The fourth column gives a cumulative frequency. This shows the total number of participants who have that many years of experience or less.
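The four columns described above can be built directly from raw data; the years-of-experience values are hypothetical:

```python
from collections import Counter

# Hypothetical years-of-experience data.
years = [1, 1, 2, 2, 2, 3, 4, 4, 5, 5]
n = len(years)
freq = Counter(years)

table = []
cumulative = 0
for x in sorted(freq):          # column 1: each value of X
    f = freq[x]                 # column 2: frequency
    pct = 100 * f / n           # column 3: percentage of participants
    cumulative += f             # column 4: cumulative frequency
    table.append((x, f, pct, cumulative))
```

The last row's cumulative frequency always equals the total number of participants.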

Approaches commonly used to estimate reliability include: __________________________________________________________________________________________________________________________________________ as the most common. Internal consistency reliability furthermore can be calculated in various ways, including: _________________________________________________________________________________

inter-rater reliability, test-retest reliability, parallel form reliability, and internal consistency reliability; split-half reliability, Kuder-Richardson 20, and Cronbach's alpha.
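Of the internal consistency approaches named here, Cronbach's alpha is the most common; a minimal sketch, assuming each item is scored on the same scale for the same respondents (the example data are invented):

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha; items is a list of columns, one list of scores per item."""
    k = len(items)
    sum_item_vars = sum(variance(col) for col in items)
    totals = [sum(scores) for scores in zip(*items)]   # each respondent's total
    return (k / (k - 1)) * (1 - sum_item_vars / variance(totals))

# Three items answered by five respondents (hypothetical, highly consistent data).
alpha = cronbach_alpha([[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [2, 2, 3, 4, 4]])
```

Items that rise and fall together across respondents yield an alpha near 1.00; unrelated items drive it toward 0.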

How can you assess convergent and discriminant evidence?

Through a multitrait-multimethod matrix (MTMM), developed by Campbell and Fiske (1959). The MTMM is simply a matrix or table of correlations arranged to facilitate the interpretation of the assessment of evidence for a test's validity related to other variables. The MTMM assumes that you measure each of several traits (e.g., depression, anxiety, and exhaustion) by each of several methods (e.g., self-report survey, direct observation, and personal interview). Correlations of measures of the same trait with different methods (i.e., convergent evidence) should be higher than correlations of measures of different traits by the same or different methods (i.e., discriminant evidence).

If the tail is being pulled toward the negative side of the x-axis by a few low scores:

it is negatively skewed

What are the common types of central tendency

mean, the median, and the mode. Each of these three can serve as an index to represent a group as a whole.

If the peak lies to the left, with the majority of grades closer to zero but a few stellar grades way out on the x-axis toward 100, we call the distribution:

Positively skewed (going to the right). If the tail is being pulled higher on the positive side of the x-axis by a few really high scores, it is positively skewed.

Discriminant evidence

refers to the degree to which scores on a test do not correlate highly with scores from other tests that are not designed to assess the same construct or trait.

Evidence of validity

refers to the extent to which a test-taker's responses to a given test reflect that test-taker's knowledge of the content area that is of interest.

Evidence of test content

refers to the extent to which the measurement adequately samples the content domain. This source of evidence is particularly appropriate to ability and achievement tests.

Reliability coefficient

The measure of the degree of reliability of an assessment. Although it can be determined in several different ways, the reliability coefficient is a number that ranges from 0.00 to 1.00, with higher numbers indicating greater reliability. The deviation from 1.00 represents the degree of random error associated with the test.

Percentile score

the proportion of people with scores less than or equal to a particular score. A percentile score is an intuitive way of summarizing a person's location in a larger set of scores.
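Computing a percentile score is a one-liner; the grade data are hypothetical:

```python
def percentile_score(x, scores):
    """Percentage of scores that are less than or equal to x."""
    return 100 * sum(s <= x for s in scores) / len(scores)

grades = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
# percentile_score(75, grades) -> 50.0 (5 of the 10 scores are <= 75)
```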

literature consistently and strongly suggests that in terms of the selection of an instrument, _________ is more important than ____________ because without validity, a test has no interpretable meaning. In fact, a test may be reliable without being valid, but a test cannot be valid without being reliable.

validity, reliability

Advantages of Z-scores

First, we can use standard scores to easily find percentile scores. Second, standard scores provide a way to standardize or equate different metrics: each score comes from a distribution with the same mean (0) and the same standard deviation (1).
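The first advantage, converting a z-score to a percentile, follows from the standard normal cumulative distribution; `z_to_percentile` is a hypothetical helper name:

```python
from math import erf, sqrt

def z_to_percentile(z):
    """Percentile rank of a z-score under the standard normal curve."""
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

# z_to_percentile(0) -> 50.0; z = 1 corresponds to roughly the 84th percentile
```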

Predictive evidence

Yields criterion scores at a time later than the test administration; the test scores are kept on record and compared with a criterion measure obtained sometime in the future.

Concurrent evidence

Yields criterion scores at the time of test administration; the test scores and criterion measures are obtained at roughly the same time.

