
Statistical Analyses Using Hypothesis Testing and P Values

Although descriptive statistics such as the mean and SD of a sample may be useful in comparing 2 different treatment groups or different time points for 1 group, such as pretreatment to posttreatment scores, clinicians also want to know whether observed sample differences represent true differences in the target population of patients. Therefore it is necessary to apply inferential statistical tests, such as the t test, ANOVA, or analysis of covariance (ANCOVA), to determine if between-group differences are statistically significant. These tests are examples of parametric tests, which are the more robust tests for identifying significant differences in group means. However, assumptions must be met to apply parametric tests, typically including normal distribution of data, equal variances across group data, and independence of data.69 Alternatively, when assumptions underlying parametric statistical tests are not met, nonparametric analogs of these tests should be used, although nonparametric tests are generally less powerful. For example, Hale et al41 decided in their study on postural control in patients with chronic ankle instability to use Kruskal-Wallis tests and Mann-Whitney U tests instead of ANOVAs and t tests, because their outcomes data were not normally distributed.

The traditional approach for making the decision about statistical significance is hypothesis testing. Taking a comparison of means as one example, hypothesis testing attempts to determine with statistical methods whether differences between or among means are due to chance or reflect a true difference in the target population. A central concept in hypothesis testing is the null hypothesis, one form of which states that there is no mean difference in the target population, thereby implying that any observed differences in sample means are due to chance. Therefore, if we reject the null hypothesis based on results of a statistical test, then we consider it unlikely that an observed difference is due to chance, and the difference is said to be statistically significant. However, statistical tests provide estimates of probability along a continuum, which is why researchers either express a specific threshold value or accept the default value (.05 or 5%) for statistical significance. This threshold probability is the alpha level, or α, which indicates the maximum level of risk tolerance for falsely rejecting the null hypothesis (a type I error).69 The alpha level, sometimes expressed as the "level of significance," is established by the researcher prior to data collection. When the alpha level is .05, P values less than .05 permit rejection of the null hypothesis, leading us to infer that true mean differences exist in the target population. When P values are greater than .05, we conclude that the risk of committing a type I error exceeds our predetermined threshold (the alpha level). Therefore, when P is greater than .05, we do not consider observed differences to be statistically significant and conclude that such differences between groups may be due to chance. However, the set point for alpha is somewhat arbitrary, and P values can be influenced by sample size. Therefore, while researchers may set a specific alpha to accept or reject the null hypothesis, the savvy reader should still examine the results, confidence intervals (CIs), and sample size to determine whether or not a P value greater than .05 may be potentially meaningful.
For example, in a recent clinical trial67 comparing 2 types of exercises for increasing strength in women with chronic neck pain, both groups increased strength from pretreatment to posttreatment (P < .01). In other words, within-group improvements were significant. However, comparisons of improvements between the 2 groups ("between-group differences") were not significantly different (P = .97). Based on the 2 P values for within-group and between-group differences, we can reject the null hypothesis for within-group improvements in the target population and conclude that each treatment group achieved statistically significant gains in strength from pretreatment to posttreatment assessment points. In contrast, we must conclude that the observed between-group difference in improvement was due to chance, attributable only to sampling error, and does not reflect a true difference in effectiveness of the exercise programs in the target population.
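To make this decision rule concrete, here is a minimal Python sketch of comparing two independent group means with a t test and judging significance against a prespecified alpha of .05. The strength-improvement data are invented for illustration and are not from the trial cited above.

```python
# Minimal sketch of hypothesis testing with a prespecified alpha level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05  # maximum tolerated risk of a type I error, set before data collection

# Hypothetical pretreatment-to-posttreatment strength improvements for two groups
group_a = rng.normal(loc=8.0, scale=4.0, size=25)
group_b = rng.normal(loc=7.5, scale=4.0, size=25)

t_stat, p_value = stats.ttest_ind(group_a, group_b)  # independent-samples t test

if p_value < alpha:
    print(f"P = {p_value:.3f} < {alpha}: reject the null hypothesis of equal means")
else:
    print(f"P = {p_value:.3f} >= {alpha}: difference may be due to chance")

# If the normality or equal-variance assumptions were not met, a nonparametric
# analog such as stats.mannwhitneyu(group_a, group_b) would be used instead.
```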

Applying the Evidence—Practicing Evidence-Based Practice

Although it may not be viewed in this manner by all therapists, the diagnostic process is essentially an exercise in probability revision (Fig. 2).96 Prior to performing a test, a therapist has some idea of the likelihood that the patient has the condition of interest. The likelihood may be most readily expressed in qualitative terms such as "highly likely," "very unlikely," and so forth. These terms, however, can be made more quantitative by speaking in terms of probabilities. For instance, if a condition is thought to be highly likely, this may translate in the therapist's mind to a probability of 75% or 80% certainty. The condition of interest may be a question of screening (Does the patient's problem involve a certain anatomical structure or region?) or of classification (Is the patient going to respond to a certain treatment?). The therapist can also have in mind a treatment threshold level of certainty at which he or she will be "sure enough" and ready to act.81 For example, a therapist may feel that he or she must be at least 80% certain that a patient has lumbar spinal stenosis before initiating a program of flexion exercises. Treatment thresholds may not be explicitly stated, but we believe that all therapists reach a point when the examination and evaluation process stops and intervention begins. This threshold should take into consideration the costs associated with being wrong versus the benefits of being correct.97,98 For example, a high threshold is required when ruling out metastatic disease as a source of LBP. Conversely, if the question concerned the application of a treatment with minimal cost and low potential for side effects, the threshold would be lower. For example, the application of patellar taping for a patient with patellofemoral joint pain is a low-cost intervention with few side effects. A therapist may feel it necessary to be only 50% certain that the treatment will be effective in order to initiate the treatment. The patient's values should also be considered in establishing treatment thresholds and determining when to implement an intervention.99 As an example, during the examination of a patient who had a stroke over 1 year previously, a therapist may test the modality of light touch by alternately touching both of the patient's hands and checking for any difference in feeling. If the light touch test is positive (ie, there is a difference in feeling), there is evidence to suggest that the patient has a higher probability of improving function of the hemiplegic upper extremity with an intervention involving forced-use therapy.100 This intervention, however, requires the patient to immobilize the healthy upper extremity for up to 12 hours per day and attend daily therapy sessions lasting for 6 hours.100 Some patients may not value the potential increased function of the extremity highly enough to tolerate the required treatment intensity unless the probability of improving function is very high. The amount of data required to move beyond the treatment threshold is partly determined by the pretest probability that the condition of interest is present. The pretest probability is an important consideration for examining the diagnostic process because it determines how much data will be required to reach a treatment threshold.
If the pretest probability that a condition is present is very high, perhaps 80%, one negative test result is unlikely to lower the probability sufficiently to permit its exclusion from further consideration, and additional testing will likely be required to reach a threshold at which the diagnosis would be sufficiently ruled out.101 Likewise, if the pretest probability is low, a single positive finding will probably not be adequate to elevate the probability beyond the threshold to rule in the condition. That is, if the therapist is fairly certain regarding a diagnosis and an unexpected finding occurs, further data are probably required before a treatment threshold can be reached. Pretest probabilities can come from a variety of sources, including epidemiological data on prevalence rates for certain conditions, information already obtained on the patient from the examination, and clinical experience with similar presentations.72 Regardless of the source, an often overlooked step in examining the diagnostic process is recognizing and quantifying the level of certainty in a diagnosis prior to the performance of a test. The information provided by the results of a diagnostic test will alter the pretest probability to some extent, resulting in a revised posttest probability that the condition of interest is present. The magnitude of the revision is based, as has been noted, on data derived from comparisons of the diagnostic test with a reference standard. Likelihood ratios quantify the direction and magnitude of change in the pretest probability based on the test result and, therefore, provide the best information needed to select the test or tests that will most efficiently move from the uncertainty associated with the pretest probability to the threshold for action.32,102 To illustrate the process, we will use an example of a question that may arise during the examination of a 67-year-old patient with symptoms in both the low back/buttock and anterior hip/groin that worsen when the patient is walking: Are the patient's symptoms coming from the lumbar spine (eg, lumbar spinal stenosis)? What is a reasonable pretest probability of lumbar spinal stenosis for this patient? Based on the patient's age and symptoms, epidemiological data103,104 as well as clinical experience suggest that the probability is fairly high, perhaps 50%. What test should be performed to rule in this diagnosis? Examining the results from several studies30,105,106 (Tab. 11), the best test appears to be asking the patient whether symptoms are absent when sitting (positive LR = 6.6). It is not uncommon that information from the history exceeds that obtained from the systems review or the tests and measurements with regard to diagnostic accuracy. If the test is positive, what should the posttest probability of lumbar spinal stenosis be? Two methods can be used to make this determination. The simpler, but less precise, method uses a nomogram (Fig. 3).107 A straightedge is anchored along the left column of the nomogram at the pretest probability and aligned with the LR value in the middle column; the posttest probability is read where the straightedge crosses the right column. With a pretest probability of 50% and a positive LR of 6.6, the posttest probability of lumbar spinal stenosis rises to approximately 87%. Knowledge of the positive LR values permitted the selection of the test that produced the greatest shift in probability favoring the condition. Had another test been selected with a smaller positive LR, the results would not have been as conclusive. For example, if the therapist had opted to assess pain with lumbar flexion and the test were positive (ie, no pain), the posttest probability would increase to only 58%.
Without knowledge of the relative unimportance of this finding, the therapist might overinterpret the value of the positive result. The importance of the pretest probability is also highlighted by this example. If the patient in question had the same symptoms but was younger, perhaps 45 years of age, the pretest probability of lumbar spinal stenosis would be lower. If the pretest probability was estimated at 20% (pretest odds = 0.25:1) and the question of the absence of pain when seated was positive, the posttest probability would increase to 62%. It is likely that, in the mind of the therapist, further confirmation would be needed to reach the action threshold for diagnosing the patient with lumbar spinal stenosis. Based on the data shown in Table 11, comparing walking tolerance with the spine flexed versus extended would be the best option (positive LR = 6.4). If this test were positive, the probability would increase from 62% to 91%, likely exceeding the action threshold. When the pretest probability is low, the therapist may instead seek information to rule out stenosis and then proceed with confirming an alternative hypothesis.4 In this circumstance, the test with the smallest negative LR would be desirable because a negative result would most effectively exclude the condition. Examining Table 11, it is again apparent that a question from the history will be more effective for this purpose than other factors. The patient is asked to rank sitting, standing, and walking from "best" to "worst" with regard to symptoms. If the test is negative (ie, pain during standing or walking is not ranked as "worst"), the negative LR associated with the finding is 0.33 and the probability of stenosis drops to 8%. Table 11 also illustrates the impact of the phrasing of the question. If the patient is asked simply whether or not symptoms become worse when walking, the result is useless, with positive and negative LR values of about 1.0. If the patient instead is asked about improvement in walking when holding on to a shopping cart, the specificity and positive LR increase, but the negative LR remains fairly low. If the goal is ruling out lumbar spinal stenosis, having the patient rank pain during sitting, standing, and walking has the potential to provide the strongest evidence. Likelihood ratios provide the most powerful tool for demonstrating the importance of a particular test within the diagnostic process in a quantified manner. Because LR values can be calculated for both positive and negative results, the importance of each can be examined. This is necessary because few tests provide useful information in both capacities, and understanding the relative strength of evidence provided by a negative or positive result helps to refine test interpretation. For these reasons and for other reasons discussed, researchers examining diagnostic tests should calculate, or provide sufficient data to permit the calculation of, LR values.83 Therapists should focus on LR values in determining which tests are most effective for ruling in or ruling out conditions of interest.
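The probability revision described above is simple to express in code. Below is a minimal Python sketch, assuming only the numbers quoted in this section (a 20% pretest probability, a positive LR of 6.6 for absence of symptoms when sitting, and a positive LR of 6.4 for walking tolerance flexed versus extended); the function name is our own.

```python
# Sketch of probability revision: probability -> odds, multiply by LR, odds -> probability.
def posttest_probability(pretest_prob: float, lr: float) -> float:
    """Revise a pretest probability given a likelihood ratio."""
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * lr
    return posttest_odds / (1 + posttest_odds)

# The 45-year-old example: pretest probability 20%; each posttest probability
# becomes the pretest probability for the next test in the sequence.
p1 = posttest_probability(0.20, 6.6)  # absence of symptoms when sitting is positive
p2 = posttest_probability(p1, 6.4)    # walking tolerance flexed vs extended is positive
print(f"{p1:.0%} -> {p2:.0%}")        # prints: 62% -> 91%, matching the text
```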

Ottawa ankle rule

Ankle injuries are one of the most common problems seen in the emergency department. Prior to the development of the Ottawa ankle rule, most patients routinely underwent radiography despite a relatively low fracture rate of less than 15%. Although ankle radiography is a relatively low-cost investigation, the annual cost of ankle radiography is well over 500 million dollars for the US and Canadian health care systems.29 The Ottawa ankle rules were prospectively derived (N = 750 patients), refined, and prospectively validated (N = 1485). They incorporate simple, well-defined historical and physical findings to determine whether patients require radiography of their ankle and/or foot following a traumatic injury (Fig. 2).27 Since their development they have been further studied to determine their impact on clinical practice in a variety of settings and in multiple types of studies.

Two implementation trials followed the validation study. The first was a non-randomised controlled trial with a before–after and concurrent control design. This study assessed the use of radiography for all 2342 adult patients seen in the two participating EDs during the 5-month before and 5-month after study periods.29 It demonstrated a 27% relative reduction in radiography rates at the intervention site versus a 2% increase at the control site. For foot injuries, the difference was a 14% relative reduction at the intervention site versus a 13% increase at the control site. The study also found that patients spent approximately 36 min less time in the ED without radiography and had similar levels of patient satisfaction. There were no missed fractures using the rule, and similar radiography rates remained at the intervention site 12 months after the study concluded. The second implementation study was a multi-centred study conducted in eight sites.30 This study assessed all 12,777 adult patients seen with acute ankle injuries 12 months prior to and 12 months following the active implementation of the ankle rules. This before-and-after study included both community and tertiary care institutions and found that the overall radiography rate was 83% prior to the implementation of the rules and 61% following the implementation. These findings represented a 26% relative reduction following the implementation of the rules (which consisted of a 1-h lecture, pocket cards and posters in the ED). The time spent in the ED was 33 min less for patients who did not undergo radiography.

The Ottawa ankle and foot rules were subsequently adapted for paediatric use and validated in children 2–16 years of age. This study found that the ankle rules were 100% sensitive (95% CI: 95–100) with a specificity of 24% (95% CI: 20–28) for 69 significant ankle fractures. The foot rules were also 100% sensitive for the 17 foot fractures (95% CI: 82–100). In addition, all children thought to have a Salter-Harris growth plate fracture (type I) were also identified by the rules. If the rules were applied strictly, this would have led to a relative reduction in radiography of 16% for ankle views and 29% for foot views.19 An economic analysis was conducted using a decision analytic approach to technology assessment to determine the incremental cost saving expected when the ankle rules are implemented.
This study took into consideration the potential litigation costs of missed fractures and increased time off work, and still found that the rules were cost-effective, with a total savings of US$614,226 to US$3,145,910 per 100,000 patients depending on the cost of radiography (i.e. Medicaid, Medicare or hospital rate).1 In 2001, an international survey of 3350 emergency physicians from the UK, France, Spain, USA and Canada was conducted to evaluate the international diffusion of the Ottawa ankle rules and physician attitudes towards them.11 This study had a response rate of 57% (ranging from a low of 49% for France and the USA to a high of 79% for Canada). More than 89% of Canadian and 73% of UK emergency physicians reported frequent use of the rules, while less than one-third of Spanish, French and US physicians reported frequent use of the rules.

Confidence Intervals

Confidence interval analysis is an essential skill for the evidence-based practitioner and will comprise an important part of almost every critical appraisal of evidence. Montori62 and others3,76 have argued that because P values are not helpful in providing clinicians with information about the magnitude of the treatment effect, other statistics should be used. In contrast to P values, CIs provide information on the magnitude of the treatment effect in a form that pertains directly to the process of deciding whether to administer a therapy to patients. Whereas a sample statistic is only a point estimate of the true population value, the CI is a range of values within which the population value is likely to be found at a given level of confidence.35 Sim and Reid76 have reported that because CIs focus attention on the magnitude and the probability of a treatment effect, they thereby assist in determining the clinical usefulness and importance (as well as the statistical significance) of the findings. Most often the 95% CI is used. This is commonly interpreted to represent the range of values within which we can be 95% certain that the true population value actually lies.3 For example, Gerber et al34 reported the mean visual analog scale (VAS) score for knee pain after 15 weeks of postoperative exercise training for the experimental treatment group: 0.77 cm (95% CI: 0.19 to 1.35 cm). At a 95% level of confidence we conclude that the true posttreatment population mean pain value for patients receiving this type of exercise training is no less than the lower limit of the CI (0.19 cm) and no greater than the upper limit of the CI (1.35 cm). Readers should note that not all values within the CI are considered equally likely to be the true population value. The point estimate from the sample (0.77 cm) is considered the single best estimate of the population parameter, with values becoming increasingly less likely when approaching either limit of the CI.33 The convention of using a 95% CI is arbitrary, similar to setting the alpha level to .05.

The level of precision or imprecision expressed by CI width is affected by the sample size and the variance in the distribution of scores. Small sample sizes and greater variance result in wider CIs.73 Wide CIs reflect imprecision in the data and uncertainty associated with the magnitude of the treatment effect.33,44 In contrast, the narrower the width of the CI around the point estimate of the treatment effect, the more confident one can be that the true effect and its point estimate are similar, allowing the clinician to make more confident decisions from the data.

Although journals are increasingly requiring authors to report CIs, readers will often find published evidence with no CIs around the point estimates of treatment effects. Even when authors do report CIs, they commonly fail to interpret them.29 Readers performing critical appraisals of evidence can often compute CIs themselves given published details. A helpful and easy-to-use spreadsheet for computation of CIs (PEDro Confidence Interval Calculator) is freely downloadable from the PEDro website.1 As an illustration, we can extract means, sample sizes, and SDs from a recent randomized controlled trial (RCT)21 wherein authors found significantly better improvements (P = .009) in an experimental treatment group compared to a control group.
Pretreatment to posttreatment improvements in shoulder internal rotation were 20° ± 12.9° in the experimental group (n = 15) compared to 5.9° ± 9.4° in the control group (n = 24). Although the authors did not report a 95% CI around the between-group difference, we can easily compute it using the PEDro Confidence Interval Calculator.1 FIGURE 1 shows results for this computation. From these results we see that the point estimate for the difference between mean group improvements was 14.1° in favor of the treatment group. The 95% CI does not include a zero difference, which is compatible with the statistically significant result (P = .009). Furthermore, we estimate the true population difference for mean improvement to be no less than 6.9° and no more than 21.3° favoring the treatment.
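As an illustration of the computation, the following Python sketch reproduces the reported values using a standard pooled-SD, t-based CI for the difference between two independent means; whether the PEDro spreadsheet uses exactly this formula is an assumption on our part.

```python
# Sketch: 95% CI around a between-group difference in mean improvements,
# using the group summaries from the trial quoted above.
import math
from scipy import stats

m1, sd1, n1 = 20.0, 12.9, 15   # experimental group improvement (degrees)
m2, sd2, n2 = 5.9, 9.4, 24     # control group improvement (degrees)

diff = m1 - m2                                   # point estimate of the effect
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_crit = stats.t.ppf(0.975, df)                  # two-tailed 95% critical value

lower, upper = diff - t_crit * se_diff, diff + t_crit * se_diff
print(f"difference = {diff:.1f} deg, 95% CI: {lower:.1f} to {upper:.1f}")
# prints: difference = 14.1 deg, 95% CI: 6.9 to 21.3 (the values reported above)
```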

Estimating the pre-test probability

Go (1998, p. 13) provides a helpful way of thinking about the contribution of test results to clinical decision-making (see Figure 2). "What we thought before" (the pre-test probability) is how likely a clinician thinks it is, before doing the test, that a person has the disorder. Pre-test probabilities can be obtained from either published data or the clinician's subjective impression or personal experience (Elstein and Schwarz 2002, Go 1998, Sox et al 1988). Here are some examples of published prevalence data: in children aged between 10 and 16 the prevalence of idiopathic scoliosis is 2-4% (Reamy and Slakey 2001); peripheral neuropathy affects 2.4% of the population, peaking at 8% in older people (Hughes 2002); the prevalence of persistent asthma in childhood is 9% (Woolcock et al 2001). More commonly, a clinician will, during the subjective assessment or patient interview, form a hypothesis that a person may have a particular disorder. Based on the information provided by the client, the clinician may think it very likely or unlikely a particular disorder is present, or may be uncertain. To use this information in the most efficient way, it is necessary to put a figure on the level of uncertainty. This establishes the pre-test probability that the person has the disorder ("what we thought before"), so that the test result can then be used to arrive at "what we think after". Note that the clinician nominates the pre-test probability for a particular condition after commencing the assessment. The clinician may have already ruled out rare but serious conditions by standard "screening" or "red flag" questions, and the patient's history has already provided some diagnostic information. Once the most likely diagnosis is identified, a conscious expression of the probability of that condition should be made (Sox et al 1988). For the knee injury in Case 1, you may have nominated somewhere between 60% and 90% as your pre-test probability. There is no absolutely correct answer. Now use the pre-test probability for the knee injury in Case 1 (say it was 70%) to see how the LRs for common clinical tests (Table 3) can help to confirm or reject the initial diagnosis of an ACL tear, and to rule out meniscal damage. The sensitivity and specificity of tests are taken from Solomon et al (2001). The likelihood ratio nomogram allows us to quickly estimate a post-test probability for a positive or negative test result (Figure 3). For the primary diagnosis of ACL tear, the best test is Lachman's test because it has a much higher positive LR than the anterior draw test. Drawing a line from a pre-test probability of 0.70 through a LR of 42 results in a post-test probability of 0.99. With one test you have moved from being 70% to 99% certain that the patient has an ACL tear. Meniscal tears often accompany ACL tears, and although the patient does not report any locking or catching sensations, you want to exclude a meniscal tear. Your pre-test probability is 0.3, and you select the test with the lowest negative LR (Table 3). A negative medial-lateral grind test moves your pre-test probability of 0.3 to a post-test probability of 0.15, or a 15% chance the patient has meniscal damage. You would like to be even more certain, so you select the next best test for ruling out meniscal tears, the McMurray test. The post-test probability for the first test now becomes the pre-test probability for the second test. A negative McMurray test moves your pre-test probability of 15% to 12%. This is a very small change because the LR was close to 1.
Points to remember:
• Tests with the highest positive LR provide the most information in the event of a positive test.
• Tests with the lowest negative LR provide the most information in the event of a negative test.
• When using a sequence of tests, the post-test probability of the first test becomes the pre-test probability for the next test.

An alternative method of calculating the post-test probability is to use simple maths (Go 1998):
1. Estimate the pre-test probability (say 70% or 0.70).
2. Convert this to pre-test odds by dividing probability by 1 − probability (odds = 0.70/(1 − 0.70) = 2.3).
3. Multiply the pre-test odds by the LR (let LR = 42, then 2.3 × 42 = 96.6).
4. Convert the post-test odds to post-test probability by dividing odds by odds + 1 (post-test probability = 96.6/(96.6 + 1) = 0.99 or 99%).

Key points for testing diagnostic hypotheses:
• For the primary diagnostic hypothesis, ie the single most likely explanation for the patient's presenting problem, choose the test(s) with the highest specificity and largest positive LR to confirm the diagnosis.
• For the secondary diagnostic hypothesis, ie a credible alternative explanation, choose the test(s) with the highest sensitivity and smallest negative LR to exclude this disorder.

Although the examples given show tests applied in series, some tests can be applied in parallel. For example, the Ottawa Knee Rules have a sensitivity of 100% (SnOUT) for knee fractures (Emparanza and Aginaga 2001). A knee x-ray is only indicated if a patient with a knee injury: is aged 55 or more; or has isolated tenderness of the patella; or has tenderness at the head of the fibula; or is unable to achieve 90 degrees of flexion; or is unable to bear weight immediately and on examination. If any of the five signs is positive the patient is referred for an x-ray. If all five clinical signs are negative there is 0% post-test probability that the patient has a knee fracture (see the code sketch at the end of this section).

Most tests provide only two possible results: condition present or absent; test result normal or abnormal. Multi-level tests have more than two possible results and LRs can be calculated for each level. For example, the result of a ventilation-perfusion test (lung scan) for suspected pulmonary embolism is reported in four categories: normal/near-normal, low, intermediate and high probability (Jaeschke et al 1994). The LRs for these categories are 0.1, 0.36, 1.2 and 18.3 respectively.

A word of warning about low and high pre-test probabilities: consider the case of a 32 year old male who presents with sudden onset low back pain, severe pain in both legs, difficulty initiating micturition and "saddle" numbness. What is the most likely diagnosis? Cauda equina syndrome caused by a large postero-central disc herniation is the immediate and strong suspect. The pre-test probability is extremely high (let's say 99%). Are any clinical tests required to confirm this diagnosis? Go to the nomogram and plot a post-test probability for a negative straight leg raise (SLR) test (-ve LR = 0.2). The chance of the person having a disc herniation has reduced from 99% to 95%. There is still a 95% chance the disorder is present. That means it is most likely the negative test result is a false negative. In practice, you would arrange an urgent medical referral and not bother to do any further testing.
You were already 99% certain of the diagnosis, and a positive SLR would only have made you 100% sure. Consider another case scenario. A 32 year old male presents with gradual onset low back pain, ie generalised deep pain in the low back that refers occasionally into the left buttock. As a disc herniation is very unlikely in the absence of leg pain (Deyo et al 1992), the probability that this patient has a disc herniation is very low, say 5%. A positive crossed SLR test (+ve LR = 2.9) moves the probability that the client has a disc herniation from 5% to 10%. There is still a 90% chance the person does not have the disorder. It is most likely the positive test result is a false positive. Conversely, a negative result would only have moved your certainty from 5% to 4%; a disc herniation was almost certainly not the cause of the problem, so why test for it? The key rules to remember are:
• When pre-test probability is very high, negative test results are usually false negatives.
• When pre-test probability is very low, positive test results are usually false positives.

Figure 4 provides a decision-making aid about whether or not to test based on the pre-test probability that the condition is present. The probability threshold below which a clinician decides not to do a particular test for a particular condition depends on a variety of factors including the seriousness of the condition, the cost, unpleasantness and risks of the test, and the patient's need for reassurance. An over-riding principle is that a test should be done only if the result could change treatment decisions (Sox et al 1988). Finally, it has to be said that clinicians require valid studies of diagnostic tests from which they can calculate and apply likelihood ratios. Recent systematic reviews suggest that the methodological quality of such studies is often poor (Massy-Westropp et al 2000, Solomon et al 2001). The key features of a valid study are the selection of a sufficient sample of consecutive patients who are suspected of having the target condition, and the comparison (for all subjects) of the test with a "gold standard" test using blinded testers. That is, the person or persons performing the gold standard test should be blind to the results of the diagnostic test and vice versa (Deeks 2001, Greenhalgh 1997, Sackett and Haynes 2002). Studies that do not have these features will usually over-estimate diagnostic accuracy (Deeks 2001).
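As promised above, here is a minimal Python sketch of the Ottawa Knee Rules applied as parallel tests, using the five criteria listed in this section; the parameter names are illustrative, not from any published implementation.

```python
# Sketch of the Ottawa Knee Rules as parallel testing: x-ray if ANY sign is positive.
def knee_xray_indicated(age: int,
                        isolated_patellar_tenderness: bool,
                        fibular_head_tenderness: bool,
                        cannot_flex_to_90: bool,
                        cannot_bear_weight: bool) -> bool:
    """Returns True if a knee x-ray is indicated. Because the rule is reported
    as 100% sensitive (SnOUT), a fully negative result rules a fracture out."""
    return (age >= 55
            or isolated_patellar_tenderness
            or fibular_head_tenderness
            or cannot_flex_to_90
            or cannot_bear_weight)

# All five signs negative: per the text, 0% post-test probability of fracture.
print(knee_xray_indicated(32, False, False, False, False))  # False -> no x-ray
```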

Canadian CT head rule

Head injuries are among the most common types of trauma seen in North American emergency departments. An estimated 800,000 cases of head injury are seen annually in US emergency departments.14 Although some of these patients die or suffer serious morbidity requiring months of hospitalisation and rehabilitation, many others are classified as having a "minimal" or "minor" head injury. "Minimal" head injury patients have not suffered loss of consciousness or amnesia and rarely require admission to hospital. "Minor" head injury is defined by a history of loss of consciousness, amnesia, or disorientation in a patient who is conscious and talking, i.e. a Glasgow Coma Scale (GCS)40 score of 13–15.25 A typical review of head injury patients admitted to a neurosurgical service found that 5% of cases were "severe" (GCS < 8), 11% were "moderate" (GCS 8–12) and 84% were "minor" (GCS 13–15).15

The use of CT for minor head injury has become increasingly common in North America. In 1992, an estimated 270,000 CT scans of the head were performed in American emergency departments for head injury.16 Typical hospital charges in the USA for unenhanced CT range from 500 to 800 US dollars, suggesting a national total cost of 135–216 million US dollars. The US yield of CT for intracranial lesions in minor head injury has been estimated to be from 0.7% to 3.7%.15,6,41,39,24 Conversely, 96.3–99.3% of CT scans performed in the USA for patients with minor head injury would be expected to be normal and therefore to not alter management.

The prospective derivation study for the Canadian CT head rule (Fig. 5) was conducted in 10 large Canadian hospitals (N = 3121 minor head injury patients).36 The primary outcome was the need for neurological intervention and the secondary outcome was clinically important brain injury (CIBI) on CT. The rule stratifies patients into high-, medium-, and low-risk groups and mandates CT for the high-risk group. The medium-risk group may either undergo CT or be observed depending upon the availability of CT, and the low-risk group does not require CT. The prospective validation study assessed the accuracy, reliability, and clinical sensibility of the Canadian CT head rule in a new set of minor head injury patients. This study enrolled 2707 patients from nine Canadian tertiary care hospitals.38 All brain injuries on CT were considered clinically important unless the patient was neurologically intact and had one of these lesions on CT: (a) solitary contusion < 5 mm in diameter, (b) localized subarachnoid blood < 1 mm thick, (c) smear subdural hematoma < 4 mm thick, (d) isolated pneumocephaly, or (e) closed depressed skull fracture not through the inner table. The Canadian CT head rule had a sensitivity of 100% (95% CI: 91–100) and a specificity of 65.6% (95% CI: 64–67) for need for neurological intervention in this validation study. The Canadian CT head rule had a sensitivity of 100% (95% CI: 98–100) and a specificity of 41.1% (95% CI: 39–43) for the outcome of CIBI.38

The potential impact on CT ordering was evaluated. Using just the five high-risk criteria, the rule had a theoretical CT ordering rate of 35.4% (959/2707), representing a relative reduction of 55.9% from the actual rate of 80.2%.38 With all seven high- and medium-risk criteria together, the theoretical CT rate was 62.4%, a relative reduction of 22.2%. The mean length of stay in the ED for patients who did not undergo CT was more than 2 h less (181 min versus 324 min; P < 0.001) than for patients who had CT.
A preliminary economic evaluation, based on the validation study data, found that widespread use of the rule would be expected to lead to annual cost savings of between $3.5 million and $9.5 million in Canada.38
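For readers who want to verify figures like the 55.9% and 22.2% quoted above, a small Python sketch of the underlying arithmetic follows; the function name is ours.

```python
# Relative reduction = (rate_before - rate_after) / rate_before.
def relative_reduction(before: float, after: float) -> float:
    return (before - after) / before

# High-risk criteria only: actual 80.2% CT rate down to a theoretical 35.4%.
print(f"{relative_reduction(80.2, 35.4):.1%}")  # prints: 55.9%
# All seven high- and medium-risk criteria: 80.2% down to 62.4%.
print(f"{relative_reduction(80.2, 62.4):.1%}")  # prints: 22.2%
```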

Results for Continuous Scale Outcomes: Differences Between Means

If randomization in an RCT was effective in creating reasonably equivalent groups at baseline, the pretreatment group means for outcomes on continuous scales will be close to equal. Therefore, when group means are not meaningfully different at baseline, the magnitude of the between-group treatment effect, when statistically significant, can be most easily conceptualized as the posttreatment difference between group means for these outcome scales. However, clinicians should critically assess the within-group variability, because variance that is much different between groups could be somewhat misleading. In cases where groups are not equivalent at baseline for important prognostic factors, ANCOVA methods can statistically adjust the posttreatment means to account for baseline differences.69 For example, Rydeard et al72 found in a recent RCT that mean scores for the functional disability outcome were significantly different between groups at baseline in spite of randomization. Therefore, they used baseline functional disability outcome scores as a covariate in the statistical analyses, then found that the between-group difference in posttreatment means for functional disability, as adjusted by the ANCOVA method, was statistically significant.

If the treatment under consideration (the experimental treatment) is more effective than the comparison (no treatment, placebo, or a competing treatment), the posttreatment experimental group mean will show greater improvement than the comparison group mean(s). For a scale on which a higher score is a better outcome (eg, muscle strength), the experimental group posttreatment mean will be greater than the comparison group mean if the treatment is effective. For a scale on which a lower score is a better outcome (eg, VAS for pain), the experimental group posttreatment mean will be less than the comparison group mean if the treatment is effective. The magnitude of this posttreatment between-group difference is a measure of the treatment effect and is sometimes called the raw effect size.22 Computation of the raw between-group effect size is the simple subtraction of one group mean from another and is expressed in the relevant units of the outcome scale. Therefore, this point estimate of the raw effect size is conceptually intuitive and is crucial for deciding whether the magnitude of a statistically significant treatment effect is clinically meaningful. For example, Butcher et al13 reported vertical jump takeoff velocity in a control group (2.29 ± 0.35 m/s) and in a trunk stability training group (2.38 ± 0.39 m/s) after 3 weeks of exercise, and found this difference to be statistically significant (P < .05). The raw between-group effect size is, therefore, 0.09 m/s (2.38 − 2.29 m/s). Knowing this value, the clinician can proceed to determine the clinical relevance of the treatment effect.

In contrast to using raw posttreatment scores to calculate the between-group effect size, authors will sometimes use change scores to represent average improvements over time by computing the difference between baseline, or pretreatment, means and posttreatment means. Between-group differences in average change scores are then computed to represent the magnitude of the between-group treatment effect.
This approach was used by Johnson et al49 when they reported that the improvement from baseline to posttreatment in shoulder external rotation in the experimental treatment group (31.3° ± 7.4°) was significantly better (P < .001) than that in the comparison treatment group (3.0° ± 10.8°).

Raw effect sizes are commonly transformed into unitless effect size indices, such as d for the t test and f for ANOVA, which are examples of standardized effect size indices.22 The most common approach in rehabilitation research is to divide the raw effect size by the combined (pooled) SDs. This method has the benefit of accounting for both the magnitude of the treatment effect and the variability of the group means. For example, using values from the between-group comparison in the Butcher et al study13 reported above, the raw effect size was 0.09 m/s, whereas the effect size index (d) was 0.24 (0.09 m/s divided by the pooled SD of 0.37 m/s). Effect size indices provide a general indication of the relative magnitudes of treatment effects. For example, Cohen22 characterized effect size indices for a comparison of 2 means as follows: 0.2, small; 0.5, medium; 0.8, large. Although unitless effect size indices are helpful for comparing the magnitude of effect sizes among studies using different outcome measures, these transformed indices of treatment effect are not as intuitive or as helpful as raw effect sizes for making the crucial comparisons that allow clinicians to judge whether treatment effects exceed thresholds for clinical meaningfulness, as discussed below. However, if variance is much different between or among groups, raw effect sizes may be misleading. In addition, effect size indices can be useful for comparing treatment effects across more than one experiment. For these reasons, readers may wish to consider both raw effect sizes and standardized effect size indices when critically appraising evidence for therapy.
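The arithmetic behind these two effect size measures is shown in the short Python sketch below, using the Butcher et al13 values quoted above; because group sizes are not reported here, the pooled SD is computed assuming equal group sizes.

```python
# Sketch: raw effect size and standardized effect size index (d).
import math

m_treat, sd_treat = 2.38, 0.39   # trunk stability training group (m/s)
m_ctrl, sd_ctrl = 2.29, 0.35     # control group (m/s)

raw_effect = m_treat - m_ctrl                          # in the scale's own units
pooled_sd = math.sqrt((sd_treat**2 + sd_ctrl**2) / 2)  # ~0.37 m/s, equal-n assumption
d = raw_effect / pooled_sd                             # unitless effect size index

print(f"raw effect = {raw_effect:.2f} m/s, d = {d:.2f}")
# prints: raw effect = 0.09 m/s, d = 0.24 (a small effect by Cohen's benchmarks)
```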

The Reference Standard

In a study of a diagnostic test, the test of interest is compared with a reference standard. The reference standard is the criterion that best defines the condition of interest.32 For example, if a test is performed to determine the presence of a meniscal tear in the knee, the most appropriate reference standard would be observation of the meniscus with arthroscopy. The reference standard should have demonstrated validity that justifies its use as a criterion measurement.33 If the reference standard is determined to lack validity, little meaningful information can be derived from the comparison.34 The validity of the reference standard may be compromised by several factors. First, the reference standard should possess acceptable measurement characteristics, as defined by the APTA's standards for tests and measurements.33,35 For example, the Ashworth scale has become a commonly used measure of "muscle spasticity."36 Despite several studies questioning the reliability and construct validity of measurements obtained with the scale in either its original or modified form,37-40 the Ashworth scale continues to be used as a reference standard.41 If a reference standard is not reproducible, or lacks a strong conceptual basis for its use, it should not be used as the criterion against which to judge the adequacy of another test.26

The reference standard should also be consistent with the intended purpose of the diagnostic test. The majority of reference standards used to study diagnostic tests have been measures of pathoanatomy.42 If a pathoanatomical reference standard is consistent with the test's purpose, it can serve as a valid measure for comparison. If a diagnostic test is used to select interventions with the goal of maximizing outcomes, however, a measure of pathoanatomy is unlikely to serve as an appropriate reference standard. As defined by the Guide,3 outcomes are measures of functional limitations, disability, patient satisfaction, and prevention; therefore, diagnostic tests used to select interventions should be tested against a reference standard related to one of these measures. An investigation by Burke et al43 of the Phalen test, which is commonly used for patients with suspected carpal tunnel syndrome (CTS), provides an example of selecting reference standards consistent with the purpose of the diagnostic test. The Phalen test could be examined as a screening test to detect compression of the median nerve or as a test indicating the need for specific interventions (eg, wrist splinting).43 To reflect these different purposes, Burke et al43 compared the Phalen test against 2 reference standards: results of a nerve conduction velocity study and patient-reported improvement after a 2-week course of wrist splinting. The nerve conduction study, in which the distal motor and sensory latencies of the median nerve were measured, served as the reference standard for an examination of the Phalen test's ability to detect nerve compression. Patient-reported improvement after 2 weeks served as a reference standard for the accuracy of the Phalen test as an indication of whether wrist splinting was useful as an intervention. If the reference standard is not consistent with the purpose of the diagnostic test, the results become difficult to interpret.
For example, 2 recent studies examined various tests for sacroiliac (SI) region dysfunction in patients receiving physical therapy.44,45 In both studies, the reference standard was the presence of LBP, judged as positive (patient consulting for LBP) or negative (patient consulting for an upper-extremity condition). By using this reference standard, the researchers examined the accuracy of the SI region tests in distinguishing between individuals with and without LBP. It does not appear, however, that this reference standard is consistent with the purpose of these tests. In the literature, SI region tests are proposed to distinguish patients thought to have SI region dysfunction from those with LBP related to other syndromes46-48 or to determine whether a patient is likely to respond to a particular intervention designed for SI region dysfunction (eg, SI region manipulation).12,49 The results of studies using a reference standard of the presence of LBP, when the issue is whether there is SI region dysfunction, are difficult to interpret because this standard is inconsistent with the purposes for which the tests are commonly used. Determining the usefulness of the tests based on the results of these studies may lead to erroneous conclusions.

Improper use of reference standards in a study may compromise the validity of the research. The reference standard should be applied consistently to all subjects.25,26,32 If the reference standard is expensive or difficult to obtain, it may not be performed on subjects with a low probability of having the condition. Verification (or workup) bias occurs when not all subjects are assessed by use of the reference standard in the same way.27,50 A common example of verification bias is demonstrated by a study of diagnostic accuracy of tests for posterior cruciate ligament (PCL) integrity.51 The reference standard was magnetic resonance imaging (MRI), an appropriate pathoanatomical reference standard for PCL integrity. A group of individuals with no history of knee injury were included in the study. These individuals were assumed to have an intact PCL without MRI verification.51 Another example comes from a study of a screening examination using goniometry for detecting cerebral palsy in preterm infants.52 Goniometric measurements were taken at the hip, knee, and ankle. If the range of motion measurements fell outside a normal range, the child was believed to be at an increased risk of having cerebral palsy.52 Infants with a high suspicion of cerebral palsy were referred to a neurologist, whose evaluation then served as the reference standard. Only 97 of 721 infants were referred, and a less rigorous reference standard consisting of chart reviews was used for the remaining subjects.52 The impact of verification bias is related to the likelihood that an individual not assessed with the reference standard could have the condition. It may be unlikely that an individual with no history of knee injury would have a compromised PCL. The adequacy of using a chart review for identifying cerebral palsy is more questionable, leaving the latter study more susceptible to verification bias. Verification bias can lead to an overestimation of diagnostic accuracy.26,53 The reference standard should also be independent of the diagnostic test.
Incorporation bias occurs when the reference standard includes the diagnostic test being studied.54 An example comes from a study of single-leg hop tests for diagnosing anterior cruciate ligament (ACL) integrity.55 The authors evaluated 50 subjects with a chronic ACL-deficient knee and 60 subjects with no prior knee injury. All subjects performed the hop tests. The reference standard was defined as the mean (± 2 standard deviations) of the absolute value of the right-to-left difference in time to complete the test in the subjects without knee injury. The authors then applied this standard to the results of all 110 subjects and found high levels of diagnostic accuracy for the tests in distinguishing the 2 groups of subjects.55 This result is not surprising given that the interpretation of the reference standard was based on the test results of the subjects without knee injury. Incorporation bias is also likely to inflate the accuracy of a diagnostic test.26

The reference standard should be judged by an individual who does not know the diagnostic test results and the overall clinical presentation of the subject.26,53,56 If blinding is not maintained, judgments of the reference standard may be influenced by expectations based on knowledge of the test results or by some other clinical information.56 Review bias may occur if either the reference standard or the diagnostic test is judged by an individual with knowledge of the other result.53

STEP 4. INCORPORATING EVIDENCE INTO CLINICAL PRACTICE

Once it has been determined through critical appraisal that a particular study or group of studies provides valid, applicable evidence that a treatment yields clinically meaningful benefits, the clinician should integrate the evidence into clinical practice. If a given patient is reasonably similar to those in the study, a clinician should be able to integrate valid evidence with considerable confidence. However, any given patient will have a unique set of prognostic attributes. Clinicians must recognize that treatments typically are not uniformly effective, inasmuch as reported results are for average treatment effects.10 This is another reason why the clinician must integrate the best available evidence with clinical expertise and the goals, values, and expectations of the patient when determining which interventions are preferable for a particular individual.

Many perceived barriers may prevent successful integration of EBP into physical therapist practice.47,58 One barrier is excessive reliance on clinical expertise, which can be associated with failure to acknowledge and incorporate current best evidence into clinical practice. Expertise in physical therapist practice has been described as possession of professional values, decision-making processes, communication styles or skills, specialty certifications, and years of practice in physical therapy.47,75 A study by Childs and colleagues16 found that experienced physical therapists with orthopaedic or sports certifications demonstrate greater knowledge in managing musculoskeletal conditions than therapists without specialty certification. Despite these findings, one cannot infer that patients cared for by expert clinicians will achieve superior outcomes when compared to the outcomes of patients treated by novice clinicians.71,85 In fact, it has been demonstrated that expert clinicians are often resistant to changing their practice behaviors even when their treatment approaches have been disproven.5 Hence, while clinical expertise is important, it is insufficient to assure optimal outcomes. Reliance on clinical experience without including knowledge and application of evidence to clinical care is inconsistent with the principles of EBP.16,85 Therefore, seeking and incorporating the best available evidence should be an integral part of the clinical decision-making process.

Instituting behavior change among practicing clinicians is one of the foremost barriers to successful integration of EBP.17,36,85 While some clinicians are quick to adopt change, many others are unfortunately resistant to change and rely predominantly on their clinical experience rather than incorporating evidence into their practice.9 Although the volume and quality of emerging evidence in many areas of physical therapist practice is mounting rapidly, we acknowledge that there are still many areas where evidence is sparse and inconclusive. In these instances, rather than waiting for the "perfect evidence," clinicians should act on the research evidence that is currently available and follow up by using patient-centered outcomes tools to determine those interventions which are effective for a particular patient and those which are not.70 Critical appraisals of lower-level evidence, such as cohort studies, case series, and case reports, can be performed using the same principles outlined above and in part I of this series.
However, it becomes immediately apparent when appraising lower-level evidence that unprotected validity threats in these types of studies permit substantial bias and severely limit confidence in reported results. The hierarchy of evidence does not exclude expert opinion (level 5 evidence), but opinion should be considered best evidence only with specific knowledge that higher-level evidence does not yet exist. Finally, it should be recognized that the results from higher levels of evidence, such as systematic reviews, might conclude that there is currently insufficient evidence to support one intervention option over another. In these instances, treatment decisions based on clinician expertise and experience (although these are lower forms of evidence in most evidence hierarchies) may in fact be the most appropriate form of guidance to inform clinical decision making.

To illustrate how knowledge of current best evidence, combined with critical appraisal skills, can guide clinical decision making, consider the case of a 74-year-old female with a history of spinal stenosis and cardiovascular disease who indicated that she developed her most recent bout of low back pain after injuring her back while playing with her great-granddaughter 3 weeks previously. Her Modified Low Back Pain Disability Index was 20% and she indicated that her goals were to complete household activities without making her back pain worse and to be able to play with her great-granddaughter in 2 weeks. The most impressive findings from the physical exam include generalized stiffness and loss of motion in both hips and the lumbar spine in flexion. In consultation with the patient, you indicate that her goals seem realistic and that you wish to reassess her Modified Low Back Pain Disability score in 2 weeks and expect her to demonstrate at least a 6-point change. Your intervention strategy includes patient education, joint mobilization to the hips and lumbar spine, and implementation of a body weight-supported walking program.

This patient case illustrates several important issues. Although this patient has 2 potentially negative prognostic factors, a history of recurrent back pain and cardiovascular disease, her Modified Low Back Pain Disability score of 20 indicates a mild level of disability. Because the MCID for the Modified Low Back Pain Disability Questionnaire is 6 points,31 this is chosen as the quantitative goal that seems to best match those described by the patient. The intervention strategy is based on a recently published clinical trial by Whitman and colleagues84 that used a program of patient education, body weight-supported treadmill training, and joint mobilization to the spine and hip joints. The typical subjects in the clinical trial were women with an average age of 69 years and a baseline Modified Oswestry score of 36, which seem to closely match the characteristics of this patient. In addition, the average Modified Low Back Pain Disability score reduction at 6 weeks of the intervention program was approximately 10 points. Therefore, the goal of a 6-point change in 2 weeks seems realistic. As discussed in a previous section, however, MCIDs that are established based on group data can be misleading if applied to individual patients. Therefore, a more conservative approach of establishing goals that exceed the MCID threshold might be a better guideline to ensure that self-report measures represent true clinically important change.

DISCUSSION

Our intent was for this paper to introduce one approach to assessing the value of the physical examination procedures used in athletic training and other health care specialties. Likelihood ratios are appealing in that the values are relatively easy to calculate from published reports and readily applied in clinical practice. Other techniques also permit improved clinical decision making. For example, receiver operating characteristic curves permit interpretation of test results that are points on a continuum (such as systolic blood pressure) rather than dichotomous values and summarize estimates provided from multiple reports.8 Athletic trainers, however, use numerous physical examination procedures that have an intended positive or negative result. Likelihood ratios are the most easily calculated and applied estimates of test performance for these examination procedures.

How should the process we describe be applied in athletic training practice, education, and research? First, practicing clinicians should consider how they interpret the examination procedures learned and practiced over the years. Evaluating examination procedures places the uncertainties associated with the examination process in perspective and permits selection of the tests most likely to help make better clinical decisions. We must also recognize that the examination procedures we discuss are but a small portion of those taught throughout the educational experience in entry-level education programs. Which examination procedures should the student be taught? We believe that those procedures with the best performance characteristics should be emphasized. For example, Scholten et al21 compiled and analyzed the results from 10 studies on the performance characteristics of the Lachman,3 anterior drawer, and pivot shift tests (Table 3). Because the positive and negative LRs for a Lachman test3 indicate that the test is of value in identifying ACL lesions as well as ruling the injury out in those with intact ACLs, the educator may elect to instruct students in the Lachman test3 only. Doing so might foster greater mastery of this test because students are relieved of the responsibility of practicing multiple tests of similar purpose. Regardless of the selection of particular examination procedures included in instruction, an understanding of LRs will improve students' understanding of physical examination procedures in the context of a comprehensive patient evaluation. We believe that the ability to search the literature and calculate LRs and other indicators of test performance should be corequisite to learning and practicing these procedures.

It should also be appreciated that few if any reports have been published related to examination procedures when performed by ATs. Simply because an examination procedure is reported to be valuable when performed by an orthopaedic surgeon does not mean other clinicians will achieve similar results. For example, Cooperman et al22 found that orthopaedists were more skilled at the Lachman test3 and at accurately identifying ACL ruptures than physical therapists. Hurley23 reported a generally poor level of agreement between ATs and an orthopaedic surgeon in the assessment of a sample of subjects with and without known ACL deficiency.
Of particular interest was the fact that only 4 of 22 ATs were found to perform the test in a manner consistent with the original report of the Lachman test by Torg et al.3 These 4 ATs achieved 67% agreement with the orthopaedic surgeon, whereas the other 18 ATs performed what Hurley23 defined as a generalized anterior tibial translation test and achieved only a 19% level of agreement. We believe there is a need to study the performance characteristics of examination procedures performed by ATs.

SENSITIVITY, SPECIFICITY, AND LIKELIHOOD RATIOS

Sensitivity and specificity are central concepts to understanding test performance characteristics. Sensitivity and specificity are related to the ability of a test to identify those with and without a condition and are needed to calculate LRs. Table 1 illustrates how test results can be categorized.6 The inserted values correspond to the following example. Sensitivity is the number of illnesses or injuries that are correctly diagnosed by the clinical examination procedure being investigated (cell A) divided by the true number of illnesses or injuries (cells A + C, based on the criterion or gold standard measure) and is calculated as follows8,15: sensitivity = A/(A + C). For example, consider 35 patients with knee injuries. After assessment via an anterior drawer test, 18 are correctly diagnosed as being ACL deficient. All 35 undergo arthroscopic surgery and 20 are found to have torn their ACL. From the formula above, 18/20 yields a sensitivity of .90. Specificity is the number of individuals correctly classified as not having the condition of concern based on the test being investigated (cell D) divided by the true number of negative cases (cells B + D, based on the criterion or gold standard measure): specificity = D/(B + D). Is it possible to have a positive test in someone without the target condition? Absolutely: these are the cases contained in cell B of the contingency table. From the previous example, 15 patients did not have an ACL tear. Let's assume, however, that 3 of these patients were judged to have positive anterior drawer tests; thus, only 12 were correctly classified as not having injured their ACLs. Therefore, the specificity is 12/15, or .80. Diagnostic procedures may have high sensitivity and low specificity or vice versa. It is often difficult to determine the effect of the estimates of sensitivity and specificity on the usefulness of a procedure in clinical practice. Ideally, a test would have high sensitivity and specificity, but this is often not the case. Furthermore, even for tests with high sensitivity and specificity, the effect of test results on the probability that a condition either is or is not present cannot be calculated directly from these values. To better understand how test performance affects clinical decisions, predictive values or LRs can be calculated. Predictive values are affected by the incidence of the condition being assessed in the population.15,16 Likelihood ratios, sensitivity, and specificity do not vary with changes in incidence.15-17 Thus, LRs offer an approach to assessing test performance that is unaffected by the incidence of the condition being assessed in the population and that incorporates estimates of sensitivity and specificity into a clinically useful value. A positive LR (LR+) indicates the effect of a positive examination finding on the probability that the condition in question exists. For tests with dichotomous results, the LR+ is calculated as follows11: LR+ = sensitivity/(1 - specificity). Using the example above, the LR+ equals .90/(1 - .80) = 4.5. This means that, based on our hypothetical numbers, a positive anterior drawer is 4.5 times more likely to occur in a patient with a torn ACL than in one with an intact ligament.11 The application of LR values in the context of clinical practice will be addressed in the next section. A negative LR (LR-) addresses the effect of a negative examination on the probability that the condition in question is present. A negative result from a diagnostic test with a small LR- suggests that the chance of the condition of concern existing is very low.
The negative likelihood ratio is calculated as follows15,16: LR- = (1 - sensitivity)/specificity. Again using the same values, the LR- equals (1 - .90)/.80 = 0.13. In this case, the examiner would find a negative anterior drawer result 13/100 times as often in injured knees as in uninjured knees. Jaeschke et al18 summarized LRs (positive and negative) into broader categories of clinical value (Table 2). From our example above, one could conclude that for an anterior drawer test, the LR+ of 4.5 suggests that a positive test results in a small to moderate but likely important shift in the probability of a torn ACL. The LR- of 0.13 suggests that a negative test results in a moderate and likely important shift in probability favoring an intact ACL.
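To make the arithmetic above concrete, the following short Python sketch (a minimal illustration, not part of the original report) computes sensitivity, specificity, LR+, and LR- from the hypothetical anterior drawer counts used in this section: 20 ACL-deficient knees (18 detected) and 15 intact knees (12 correctly classified).

def test_characteristics(tp, fp, fn, tn):
    """Sensitivity, specificity, LR+, and LR- from 2 x 2 table counts."""
    sensitivity = tp / (tp + fn)          # cell A / (A + C)
    specificity = tn / (tn + fp)          # cell D / (B + D)
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return sensitivity, specificity, lr_positive, lr_negative

sens, spec, lr_pos, lr_neg = test_characteristics(tp=18, fp=3, fn=2, tn=12)
print(f"sensitivity = {sens:.2f}")  # 0.90
print(f"specificity = {spec:.2f}")  # 0.80
print(f"LR+ = {lr_pos:.1f}")        # 4.5
print(f"LR- = {lr_neg:.3f}")        # 0.125, reported as 0.13 in the text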

Sensitivity, Specificity, and Predictive Values

Sensitivity and specificity values are calculated vertically from the 2 x 2 table and represent the proportion of correct test results among individuals with and without the condition, respectively. Sensitivity (or true positive rate) is the proportion of subjects with the condition who have a positive test result. Specificity (or true negative rate) is the proportion of subjects without the condition who have a negative test result.42 Predictive values are calculated horizontally from the 2 x 2 table and represent the proportion of subjects with a positive or negative test result for whom the result is correct. The positive predictive value is the proportion of subjects with a positive test result who actually have the condition. The negative predictive value is the proportion of subjects with a negative test result who do not have the condition.73 Predictive values might appear to be more useful for applying the results of a study because these values relate to the way these tests are used in clinical decision making: given a test result (positive or negative), what is the probability that the result is correct? Sensitivity and specificity values work in the opposite direction: given that the condition is present or absent, what is the probability that the correct test result will be obtained? Despite their apparent usefulness, predictive values can be deceptive because they are highly dependent on the prevalence of the condition of interest in the sample. Positive predictive values will be lower and negative predictive values will be higher in samples with a low prevalence of the condition. If prevalence is high, the trends reverse.74 Sensitivity and specificity values remain fairly consistent across different prevalence levels.42 A comparison of 2 studies examining the diagnostic accuracy of weakness of the extensor hallucis longus muscle for detecting L5 radiculopathy illustrates this point. Lauder et al75 studied consecutive patients referred to physical medicine physicians with a suspicion of lumbar radiculopathy (Tab. 5). The reference standard was electromyographic findings, and, based on this standard, the prevalence of L5 radiculopathy was 11% (10/94). Kortelainen et al76 studied patients referred for surgery with symptoms of sciatica (Tab. 6). Based on a reference standard of surgical observation of the nerve root, the prevalence of L5 radiculopathy was 57% (229/403). The sensitivity and specificity values remained fairly consistent. The predictive values, however, varied greatly between studies due to disparate prevalence rates, with the study with the higher prevalence of radiculopathy showing a higher positive predictive value. Sensitivity and specificity values provide useful information for interpreting the results of diagnostic tests. Sensitivity represents the ability of the test to recognize the condition when present. A highly sensitive test has relatively few false negative results. High test sensitivity, therefore, attests to the value of a negative test result.77,78 Sackett et al42 have advocated using the acronym "SnNout" (if sensitivity [Sn] is high, a negative [N] result is useful for ruling out [out] the condition). High sensitivity indicates that a test can be used for excluding, or ruling out, a condition when it is negative, but does not address the value of a positive test. Specificity indicates the ability to use a test to recognize when the condition is absent.
A highly specific test has relatively few false positive results and therefore speaks to the value of a positive test.77,78 The acronym applicable in this case is "SpPin" (if specificity [Sp] is high, a positive [P] result is useful for ruling in [in] the condition).42 Unfortunately, few tests possess both high sensitivity and specificity. Knowledge of the sensitivity and specificity of a test can help clinicians refine clinical decision making by allowing them to weigh the relative value of positive or negative results. A recent study79 examining the diagnostic accuracy of clinical tests for detecting subacromial impingement syndrome provides an example. Six tests were compared against a reference standard of MRI of the supraspinatus tendon. No test had high levels of both sensitivity and specificity (Tab. 7). The Hawkin test was the most sensitive, and the drop arm test was the most specific.79 The high sensitivity (92%) indicates that a negative Hawkin test is useful for ruling out subacromial impingement. The low specificity (25%), however, signifies that a positive Hawkin test has little meaning. The drop arm test was very specific (97%), indicating that a positive test is useful for confirming subacromial impingement. The sensitivity of the drop arm test was poor (8%), revealing a high number of false negative results and attesting to the lack of meaning of a negative result.
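The prevalence effect described above is easy to demonstrate numerically. The Python sketch below holds sensitivity and specificity fixed and recomputes predictive values at the two prevalence levels from the radiculopathy studies (11% and 57%); the sensitivity and specificity values used here are illustrative assumptions, not figures from those studies.

def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV from sensitivity, specificity, and prevalence (Bayes' theorem)."""
    tp = sensitivity * prevalence                  # expected true positive fraction
    fp = (1 - specificity) * (1 - prevalence)      # expected false positive fraction
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)

for prevalence in (0.11, 0.57):
    ppv, npv = predictive_values(0.70, 0.90, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV = {ppv:.2f}, NPV = {npv:.2f}")
# prevalence 11%: PPV = 0.46, NPV = 0.96
# prevalence 57%: PPV = 0.90, NPV = 0.69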

Searching the Literature

It is crucial to develop accurate and efficient search strategies when seeking the best available evidence in the literature. Computerized literature searching is an essential skill necessary to efficiently practice EBP.16 A number of searchable databases exist, including MEDLINE and the Cumulative Index of Nursing and Allied Health Literature (CINAHL). Search strategies entail using 1 or more key words that may be found in the article's title or abstract. Additionally, some databases use Medical Subject Headings (MeSHs), which are biomedical terms that designate major concepts within the MEDLINE database.32 Search strategies for a specific MeSH term will reveal articles relevant to that heading and others associated within the respective database. It has been reported that combining MeSH terms and key words yields the most sensitive search results (ability to detect all citations in the database) in MEDLINE, as compared to simply searching by key words or MeSH.51 PubMed offers several helpful tutorials for using MeSH terms in online searches. PubMed Clinical Queries4 is a very helpful and efficient utility available within PubMed (http://www.ncbi.nlm.nih.gov/sites/entrez), the public access portal for MEDLINE searches. Key words entered into PubMed Clinical Queries search fields are automatically incorporated into predetermined EBP-compliant search strategies to find the best evidence to answer foreground questions on diagnosis, prognosis, therapy, or etiology/harm. For each search type, users can specify whether narrow (specific) searches or broad (sensitive) searches are desired. Additional search strategies within Clinical Queries target systematic reviews and clinical prediction guides (rules). Search strategies used within Clinical Queries have been systematically tested to filter results based on study design. This approach can substantially reduce time and effort for a busy clinician searching to identify studies of a particular design; but we must be aware that it does not provide an assessment of how well the study was conducted. To illustrate the efficiency of searches using PubMed Clinical Queries, we can compare results obtained with and without the methodologic filters. A PubMed search using the search string "exercise AND patellofemoral" yields 161 hits when used without the filters. However, a Clinical Queries search using the same string yields 54 hits using the broad, sensitive filter for therapy and only 22 hits (a more manageable number) using the narrow, specific filter for therapy. Examination of the automated transformations of the simple search string using the 2 search hedges reveals that broad searches include lower-quality studies, while narrow searches target higher-level studies. Using these filters, a clinician can avoid inefficient searches that yield too many studies of lesser quality, searching first for studies of higher quality when looking for best available evidence. In addition to electronic search engines, a number of online databases provide clinicians with evidence summaries, as well as quality ratings of the available evidence. A number of evidence sources currently exist, including the Australian-based Physiotherapy Evidence Database (PEDro) (http://www.pedro.fhs.usyd.edu.au/), McMaster University's Health Information Research Institute and Centre for EBP (http://www.bmjupdates.com), the Cochrane Library (www.cochranelibrary.com),
and the American Physical Therapy Association's (APTA) Hooked on Evidence online database (http://www.apta.org/hookedonevidence.org). Hooked on Evidence allows APTA members to perform a quick search on a specific topic, provides a detailed description of the current evidence, and allows clinicians to quickly implement evidence into clinical practice.34 An extensive list of EBP-related databases with advantages and disadvantages of each can be found in the article by MacDermid.33 Additionally, a number of free online rehabilitation and medical journals and lists of these online journals exist, such as the British Medical Journal (http://bmj.bmjjournals.com), BioMed Central (http://www.biomedcentral.com/bmccomplementalternmed), free full-text journal listings (eg, http://www.freemedicaljournals.com), and Google Scholar, which is often able to find full text (http://scholar.google.com). Lastly, Open Door is a new feature available to APTA members through the APTA web site (http://www.apta.org/opendoor). The mission of Open Door is to allow physical therapists easy access to clinical research. This service provides full-text access online to articles directly relevant to physical therapy practice through ProQuest, the Cochrane Library, and CINAHL. Readers are referred to the article by Doig and Simpson16 for more information on conducting efficient literature searches.
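For readers who want to script this kind of filtered search, the Python sketch below queries PubMed through NCBI's public E-utilities interface and compares unfiltered and filtered citation counts. The filter string is our approximation of the published narrow (specific) therapy filter, and current counts will differ from the 2008 figures quoted above; verify the filter against current PubMed Clinical Queries documentation before relying on it.

import json
import urllib.parse
import urllib.request

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
# Approximation of the narrow (specific) therapy filter; treat as an assumption.
NARROW_THERAPY = ("(randomized controlled trial[Publication Type] OR "
                  "(randomized[Title/Abstract] AND controlled[Title/Abstract] "
                  "AND trial[Title/Abstract]))")

def pubmed_count(term):
    """Return the number of PubMed citations matching a search string."""
    url = ESEARCH + "?" + urllib.parse.urlencode(
        {"db": "pubmed", "term": term, "retmode": "json"})
    with urllib.request.urlopen(url) as response:
        return int(json.load(response)["esearchresult"]["count"])

simple = "exercise AND patellofemoral"
print(pubmed_count(simple))                              # unfiltered count
print(pubmed_count(f"({simple}) AND {NARROW_THERAPY}"))  # filtered count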

Interpretations of Apparently Positive Trials: MCID, Effect Size, and CI Limits

A clinical trial is termed "positive" when the null hypothesis is rejected by formal hypothesis testing. In a positive trial, authors conclude that results are statistically significant and that the experimental treatment is more effective than the comparison. Guyatt et al38 use the phrase "apparently positive trial" to communicate the idea that critical appraisal requires an evidence-based practitioner to look beyond statistical significance. Additional judgments must be made about the clinical meaningfulness of the treatment effect and the level of precision in the point estimate of the effect size. These judgments are accomplished by comparing the raw effect size, with its accompanying CI, to the MCID. Even when we conclude that results are clinically meaningful because the point estimate for the raw effect size is greater than the MCID, we must recognize that the true size of the treatment effect may be more or less than the point estimate from sample data. The upper and lower limits of the 95% CI around that point estimate for the effect size give us an indication of just how small or how large the true treatment effect might be in the population of interest. Therefore, we consider the 95% CI to determine whether the MCID is within that interval. If the MCID is within the 95% CI, then we cannot rule out at a 95% level of confidence that the true population treatment effect might be trivial (less than the MCID). On the other hand, if the raw effect size is greater than the MCID and the MCID is excluded from the 95% CI, then we are 95% confident that there is a clinically meaningful benefit of treatment in the population, even if the true magnitude of that benefit is at the limit of the CI suggesting the smallest benefit of treatment. Guyatt et al38 characterize a positive trial in which the 95% CI excludes the MCID as "a definitive trial." Following the example above from Hyland et al,45 we can consider the raw point estimate of the treatment effect (3.5 cm on the pain VAS) in the context of its 95% CI and the MCID (3.0 cm). This study had a small sample of subjects in the 2 groups considered here: 10 patients in the control group and 11 patients in the calcaneal taping group. Entering those sample sizes and the posttreatment pain VAS means and SDs for the 2 groups into the PEDro Confidence Interval Calculator spreadsheet (FIGURE 1), we find that the 95% CI is 2.2 to 4.9. We conclude from this CI that the true treatment effect size in the target population is no less than 2.2 cm on the pain VAS, and no greater than 4.9 cm. Inasmuch as the MCID (3.0 cm) is not excluded by the 95% CI, we cannot rule out a trivial treatment effect in the target population. This is because the study results are compatible with true treatment effects as small as 2.3, 2.5, or 2.7 (etc), which are all smaller than the MCID and are therefore not clinically meaningful. This analysis does not change the fact that a statistically significant treatment effect was found favoring the experimental treatment, nor does it change the fact that the best estimate33 of the population treatment effect (3.5 cm) is clinically meaningful.
Rather, this illustration demonstrates the imprecision inherent in studies with small sample sizes and suggests that additional evidence with larger samples and correspondingly greater precision (less variability) is required before we consider this finding definitive.38 Adequate precision to rule out a trivial treatment effect in a positive trial is illustrated in a study of radial shock wave therapy for calcific tendinitis of the shoulder.14 Posttreatment pain VAS scores (mean ± SD) were significantly better (P = .004) in the treatment group (0.90 ± 0.99) than in the control group (5.85 ± 2.23). The between-group difference was 4.96 cm (95% CI: 4.23 to 5.67). If we accept the MCID value of 3.0 cm for the pain VAS,55 we consider this study to be convincing evidence for a clinically meaningful benefit of treatment, inasmuch as the study results suggest an average treatment effect no less than 4.23 cm in the target population. In other words, the trial is definitive for this outcome, because the 95% CI around the point estimate for the treatment effect excludes the MCID.
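The CI arithmetic performed by tools such as the PEDro Confidence Interval Calculator can be reproduced in a few lines of code. The Python sketch below (using SciPy) computes a pooled-variance t-based 95% CI for a difference between independent group means; the group means and SDs shown are hypothetical values chosen so that the result resembles the Hyland et al example (effect 3.5 cm, 95% CI roughly 2.2 to 4.9), since the original summary statistics are not reproduced here.

from math import sqrt
from scipy.stats import t

def mean_difference_ci(m1, sd1, n1, m2, sd2, n2, confidence=0.95):
    """Difference in means with a pooled-variance t confidence interval."""
    diff = m1 - m2
    df = n1 + n2 - 2
    pooled_variance = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = sqrt(pooled_variance * (1 / n1 + 1 / n2))
    margin = t.ppf((1 + confidence) / 2, df) * se
    return diff, diff - margin, diff + margin

# Hypothetical values: taping group mean 5.0 (SD 1.5, n = 11) vs control 1.5 (SD 1.4, n = 10).
diff, lower, upper = mean_difference_ci(5.0, 1.5, 11, 1.5, 1.4, 10)
print(f"effect = {diff:.1f} cm (95% CI: {lower:.1f} to {upper:.1f})")
MCID = 3.0
print("MCID inside CI: trivial effect not ruled out" if lower < MCID <= upper
      else "MCID excluded from CI: definitive")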

STEP 5. EVALUATING PERFORMANCE ON STEPS 1 THROUGH 4

Although most of this commentary addresses critical appraisal of evidence, this fifth and final step in the process of achieving successful implementation of EBP is arguably the most important. Self-assessment of practice begins as a student in the form of self-observation and judgmental processing and should continue throughout one's professional career.64 The skills of self-awareness assist clinicians in identifying personal strengths as well as limitations.27 It is with reflective practice that physical therapists will refine their efficiency with integrating the best available evidence into clinical practice. Recognition of personal and professional limitations can be difficult and may result in avoidance of the issues, regardless of the internal drive and motivation of the therapist.27 Developing competence in the EBP process will require clinicians to acknowledge times of uncertainty and the need for gathering information. Competence includes self-awareness on behalf of the therapist and the ability to recognize personal limitations, which can be very difficult. Straus et al81 have developed a series of questions (APPENDIX B) to facilitate introspective self-evaluation for the evidence-based practitioner. Therapists should reflect subjectively on their ability to proceed through the first 4 steps, but should also assess patient outcomes objectively and formally in the context of best available evidence. Physical therapists should use reliable and valid outcome measures for every patient they see in clinical practice to ascertain if true and clinically meaningful changes in patient status occurred (ie, did patient improvements exceed the outcome scale's MDC and MCID scores). The data obtained through the use of valid and reliable outcomes tools, along with the self-evaluation of effectiveness and efficiency with the 4 steps, will enhance clinical practice. Clinicians may find it helpful to read one of the several case studies or case series in which clinicians provide detailed descriptions of applying current best evidence in managing patients with a variety of conditions. For example, MacDonald and colleagues57 reported on the management of a series of patients with hip dysfunction who responded positively to novel manual therapy interventions. Similarly, Cleland et al21 and Waldrop82 have published case series that apply recently developed clinical prediction rules to patients. As proposed by Flynn and colleagues,30 the use of minimal data collection forms that include key examination findings and appropriate patient-centered outcome measures will allow students as well as practicing clinicians to monitor their individual clinical performance. With this information, clinicians can compare average patient improvements in clinical settings to average patient improvements in the current best evidence (ie, peer-reviewed, published literature), while accounting for differences between clinical and research settings and contexts. It is ultimately through these quality measurement processes and accountability to EBP principles that therapists become clinicians of excellence.9

Examining Diagnostic Tests: An Evidence-Based Perspective

As explicated by the Guide, diagnosis requires the gathering of data through examination. During the initial examination, data are obtained through the history, systems review, and selected tests and measures.3 Therefore, questions of history and the screening procedures performed during the review of systems are also considered diagnostic tests, along with the various tests performed and measurements obtained. Throughout the examination, data are gathered to evaluate and to form clinical judgments. The result of this diagnostic process is a label, or classification, designed to specifically direct treatment. Individual pieces of data are collected for different purposes during the process.6,7 Some data are collected to focus the examination on a region of the body or to identify a particular pathology (eg, screening tests). Other data are gathered for the purpose of selecting an intervention (eg, tests used for classification). In determining the accuracy of a diagnostic test, the intended purpose of the test should be considered.8 Although, according to the Guide, the end result of the diagnostic process should most often be a classification grouping based largely on impairments and functional limitations instead of pathoanatomy, individual tests may be used to focus the examination or detect conditions not appropriate for physical therapy management. Tests used in this manner need to demonstrate accuracy for identifying the underlying pathoanatomy. An example of a test used for this purpose is the ankle-arm index,9,10 a ratio of ankle to arm systolic blood pressure, as a method of screening for atherosclerotic diseases. Some studies have shown that low ankle-arm index values are indicative of various atherosclerotic diseases,9,10 and such a finding during an examination may indicate the need for referral of the patient to a physician. Another example would occur during the examination of an elderly patient with symptoms in both the lumbar and hip regions. The therapist may want to determine whether the hip symptoms indicate degenerative changes of the hip or whether the symptoms are referred from the lumbar region. Various tests and measures might be considered helpful in making this determination; however, the measurements with the highest diagnostic accuracy for detecting degenerative changes of the hip have been shown to be hip medial (internal) rotation range of motion of less than 15 degrees and hip flexion range of motion of less than 115 degrees.11 The occurrence of these impairments during the examination, therefore, could provide useful diagnostic information, indicating a need to focus the examination on the hip region. Some diagnostic tests are performed by physical therapists because the results, singularly or in combination with other findings, are believed to indicate that a particular intervention will be most effective in maximizing the patient's outcome. Tests used in this manner form the foundation of classification systems and need to demonstrate accuracy for identifying which interventions might be useful.
For example, the observation of frontal-plane displacement of the shoulders relative to the pelvis (ie, lumbar lateral shift) in a patient with low back pain (LBP) is frequently cited as an important examination finding.12-16 This finding has been considered by some to be diagnostic of a lumbar disk herniation16,17; however, the diagnostic accuracy of a lateral shift for detecting the presence of a disk herniation is poor.14 Other measures, such as the straight-leg-raise test, serve as more accurate diagnostic tests for the presence of a lumbar disk herniation.18,19 Despite the lack of accuracy for diagnosing a disk herniation, a lateral shift may be meaningful, not based on its ability to indicate a specific pathoanatomical origin, but because it may indicate which intervention (ie, correction of the lateral shift) will be most useful in reducing pain and disability.15,20 Although it lacks accuracy for detecting a disk herniation, the presence of a lateral shift may still have diagnostic value if it can be demonstrated that patients judged to have a lateral shift who are treated with correction of the shift have outcomes superior to those of patients treated with alternative approaches. No studies to date have investigated this hypothesis. In summary, both clinicians and researchers need to consider the purpose for which a diagnostic test is performed. Tests may serve to focus and refine the examination, or they may be used for classification with the goal of selecting effective interventions. The same test may have the potential to serve both purposes, whereas some tests may be useful for one purpose or neither purpose. This distinction is important in considering how to use the diagnostic process in an evidence-based manner. The purpose of a test has important implications for examining the evidence in support of the use of the test and applying the test to clinical practice.

Confidence Intervals

As is true of all statistics, sensitivity, specificity, and LR values are taken from a sample and represent an estimate of the true value that could be found in the population.84 The confidence interval (CI) attests to the precision of this estimate. A 95% CI is the most common and indicates a range of values within which the population value would lie with 95% certainty.86 If the CI is wide and contains values that are not clinically important, the usefulness of the measure may be questionable. That is, if another estimate were taken from a different sample, the statistic calculated might be substantially different. In the study by Calis et al,79 for example, the drop arm test had the largest positive LR among the tests for subacromial impingement (2.8), but the 95% CI was wide (0.35-21.7), indicating that the positive LR estimated from this sample of 120 patients was not very precise. Formulas for calculating CI ranges for diagnostic statistics have been published.84,86,87 As is apparent in Table 7, the recommended formulas do not result in a symmetrical CI about the statistical estimate.88 The asymmetry is more pronounced as the sensitivity and specificity values move farther from 50% in either direction.86 The width of the CI will also be related to the sample size and the amount of variability in the test being studied. Reporting of a CI with any diagnostic statistic is recommended to permit an assessment of the precision of any estimate of diagnostic accuracy.86,89
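Several of the published formulas work on the log scale, which is one reason the resulting intervals are asymmetric. As a minimal sketch, the Python function below computes a log-method 95% CI for a positive LR from 2 x 2 counts, reusing the anterior drawer counts from the earlier example; the standard-error formula shown is the common log-scale approximation, so consult the cited references84,86,87 for the exact variants they recommend.

from math import exp, log, sqrt

def lr_positive_ci(tp, fp, fn, tn, z=1.96):
    """Positive LR with an asymmetric 95% CI computed on the log scale."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    lr = sensitivity / (1 - specificity)
    se_log = sqrt(1/tp - 1/(tp + fn) + 1/fp - 1/(fp + tn))
    return lr, exp(log(lr) - z * se_log), exp(log(lr) + z * se_log)

lr, lower, upper = lr_positive_ci(tp=18, fp=3, fn=2, tn=12)
print(f"LR+ = {lr:.1f} (95% CI: {lower:.1f} to {upper:.1f})")
# LR+ = 4.5 (95% CI: 1.6 to 12.5); note the asymmetry about the point estimate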

METHODOLOGIC QUALITY AND STUDIES OF DIAGNOSTIC PROCEDURES

Assessing the performance of diagnostic tests and applying this knowledge in clinical practice requires familiarity with the associated terminology. The clinician must be able to evaluate the methodologic quality of studies of diagnostic testing and have a basic understanding of how a few numeric values are derived. This section provides an introduction to the assessment of methodologic quality of studies of diagnostic procedures.

Assessing the Anterior Cruciate Ligament: Patient Example

A 21-year-old female recreational basketball player presents to an outpatient sports medicine clinic on referral from her primary care physician. She reports injuring her left knee last week while playing. She states that when she turned to cut to the left, her shoe stuck to the court, causing her to fall. She notes that the knee was immediately painful and she discontinued playing, went home, and applied ice. She reports that she had substantial swelling the next day and tried to "stay off her feet." She was evaluated by her family physician, given analgesic medication, crutches, and a knee immobilizer, and was referred for further evaluation. Upon presentation, she complains of pain on the medial portion of her knee, as well as a deep pain that cannot be touched. She is unsure of what she heard or felt at the time of injury. Moderate swelling is evident. The clinician estimates the pretest probability of ACL rupture at 50%. An illustration of the effect of various pretest estimates on posttest probability follows. Given the 50% pretest probability estimate, the pretest odds = 0.5/(1 - 0.5) = 1. Upon physical examination, a positive Lachman test is demonstrated. Applying the data from Scholten et al,21 the sensitivity and specificity of the Lachman test3 were estimated to be 0.86 and 0.91, respectively. These values yield a LR+ of 9.6 and a LR- of 0.15. Applying these values to the scenario above, the posttest probability is 91% (9.6/10.6) after a positive test and 13% (0.15/1.15) after a negative test. Thus, the physical examination results have had a large effect on the probability of an ACL lesion and, thus, on treatment decisions.
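The odds-probability conversions in this example (and in the meniscus example that follows) can be wrapped in a small helper. A minimal Python sketch:

def posttest_probability(pretest_probability, likelihood_ratio):
    """Convert pretest probability to odds, apply the LR, and convert back."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)

# Lachman example: 50% pretest probability, LR+ = 9.6, LR- = 0.15.
print(f"{posttest_probability(0.50, 9.6):.0%}")   # 91% after a positive test
print(f"{posttest_probability(0.50, 0.15):.0%}")  # 13% after a negative test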

Assessing the Medial Meniscus: Patient Example

A 33-year-old male recreational soccer player presents with a 2-month history of knee pain resulting from an episode in which the knee was twisted while he was being tackled during a game. He localizes the pain to the medial aspect of the knee and reports that after activity the knee feels swollen and that he experiences a painful catching sensation intermittently. Corea et al19 noted that in a series of patients with meniscal injuries, all had pain, 61% reported painful clicking, and 55% experienced recurrent effusion. Thus, based on this patient's history, a clinician may establish a pretest probability of a meniscus tear of 60%. This level of probability can be transformed into pretest odds using the following formula: odds = probability/(1 - probability).15,16 In this case, .6/(1 - .6) = 1.5, indicating that the patient is 1.5 times more likely to have a meniscus tear than not to have one. To assess the effect of a positive McMurray test,4 the pretest odds are multiplied by the LR+. Turning to the literature, we find that Scholten et al20 calculated a summary LR+ for the McMurray test4 of 3.4. Thus, 1.5 × 3.4 yields posttest odds of 5.1. Posttest probability is calculated by the formula posttest odds/(posttest odds + 1).13 Thus, the posttest probability of a meniscus tear can be calculated as 5.1/6.1 = 0.84. Therefore, using the LR estimate reported21 and the pretest odds provided in this exercise, a positive McMurray test4 has shifted the estimate of the probability of a meniscus tear from 60% to 84%. The same process may be used to assess the effect of a negative McMurray test4 on the same patient. Again, a pretest probability of 60% (pretest odds = 1.5) is assumed. Using the mean LR- of 0.6 estimated by Scholten et al,20 the posttest odds estimate is 1.5 × 0.6 = 0.9. Dividing 0.9 by 1.9 yields a posttest probability of 0.47. Thus, a negative McMurray test results in only a small (13 percentage point) reduction in the probability that a meniscus tear exists (from 60% to 47%) and will likely have little effect on treatment decisions.

Interpretations of Apparently Negative Trials: MCID, Effect Size, and CI Limits

A clinical trial is termed "negative" when we fail to reject the null hypothesis. In a negative trial, authors conclude that results are not statistically significant and that the experimental treatment is no more effective than the comparison. Guyatt et al38 use the phrase "apparently negative trial" to communicate the idea that critical appraisal requires an evidence-based practitioner to be wary of results from negative trials unless adequate statistical power can be demonstrated. The danger is that an underpowered trial might fail to find statistical significance in sample data even when there is a meaningful benefit of treatment in the target population (a type II error). Authors will frequently attempt to address this issue by revealing details of the statistical power analysis used to estimate the required sample size before the study was conducted. This approach is unsatisfying in part because a priori power computations require estimations of variance that may or may not reflect the observed variance in sample data. Guyatt et al38 suggest a different method for determining whether a negative trial has sufficient statistical power. Here again we consider the 95% CI around the point estimate of the raw effect size, to determine whether the MCID is within that interval. If the MCID is within the 95% CI, then we cannot rule out at a 95% level of confidence that the true population treatment effect might be clinically meaningful (greater than the MCID), even though the authors failed to reject the null hypothesis.50 This circumstance would reveal inadequate statistical power in the study, suggesting that we should not accept any conclusion that the treatment is ineffective. On the other hand, if the MCID is excluded from the 95% CI, then we are 95% confident that there is no clinically meaningful benefit of treatment in the population, even if the true magnitude of the treatment effect is at the limit of the CI suggesting the largest between-group difference. Guyatt et al38 characterize a negative trial in which the 95% CI excludes the MCID as "definitely negative." A reader critically appraising a negative trial in which the 95% CI around the treatment effect excludes the MCID can be confident that the failure to find a statistically significant difference is not attributable to a type II error. In other words, if precision in the study is sufficient for the 95% CI to exclude the MCID, the study has adequate statistical power to detect a clinically meaningful difference if one exists in the target population. Authors in a recent RCT23 found no statistically significant difference (P = .33) for knee flexion range of motion outcomes among 3 groups: a control group receiving no time on a continuous passive motion (CPM) machine, a treatment group receiving CPM treatments of 35 minutes duration once daily, and another treatment group receiving CPM treatments of 2 hours duration once daily. The authors considered 10° to be the MCID for this outcome. FIGURE 2 provides a graphical display of 95% CIs for each of the 3 between-group comparisons at the time of discharge from the hospital. The 2 dotted vertical lines represent MCIDs of 10° favoring either of the 2 groups for each plotted comparison. The solid vertical line represents the null value (0°) for the between-group differences. The 95% CIs around each of the between-group effect sizes are represented by horizontal lines with vertical anchors at each end, reflecting the upper and lower limits of the CIs.
Each 95% CI includes the null value, suggesting no statistically significant differences, a finding consistent with results from the traditional hypothesis test (P = .33). However, only 1 of the 3 95% CIs excludes the MCID. Therefore, statistical power in this study was adequate to rule out a clinically meaningful treatment effect in the target population for 1 between-group comparison (CTL-EXP1); but the study power was insufficient to rule out a small but potentially meaningful difference for 2 of the 3 between-group comparisons. Adequate precision and statistical power are illustrated in a negative trial comparing arthroscopy to placebo arthroscopy in patients with knee osteoarthritis.63 Authors determined the MCID for the pain subscale of the Arthritis Impact Measurement Scales (AIMS) to be 10 points. At the 6-week follow-up measurement, the average pain score for the arthroscopy with debridement group (49.9 ± 23.3) was not significantly different (P = .85) from that of the placebo group (50.8 ± 23.2). The difference between means was 0.9 (95% CI: -7.7 to 9.4). Therefore, the largest treatment effect favoring arthroscopy in the target population consistent with results from this study would be 9.4 points on the AIMS pain subscale: a trivial difference. Given that the MCID was excluded from the 95% CI, we conclude that the study had adequate precision and sufficient statistical power to have found a clinically meaningful difference, if one existed, in the target population. This interpretation is the same as that expressed by the authors: "If the 95 percent confidence interval around the estimated size of the effect does not include the minimal important difference, one can reject the hypothesis that the arthroscopic procedures have a small but clinically important benefit."63

Results for Dichotomous Outcomes: Risk Reduction and Number Needed to Treat

Although authors of clinical trials in physical therapy most often select continuous outcome variables, there are many important naturally dichotomous outcomes that should be included in studies of orthopaedic and sports physical therapy. Dichotomous outcomes are those that patients either experience or do not experience. Examples are recurrent dislocations, failure to achieve complete recovery, failure to return to competition, recurrence of low back pain, receiving injections, and subsequent surgery. Because the statistical methods for analyzing dichotomous outcomes quantify reduction in risk, dichotomous outcomes are usually operationalized as negative outcomes (numbers of patients who did have a recurrent dislocation, patients who were not able to return to sport, etc). Important continuous scale outcomes can be dichotomized using the MCID to report numbers of patients who achieve or fail to achieve clinically meaningful improvements in motion, strength, pain reduction, etc.65 For example, Clegg et al18 dichotomized their primary outcome, the WOMAC pain subscale with raw scores ranging from 0 to 500, by reporting the percentage of patients in each study group who achieved at least 20% improvement after treatment.
This cut score is the MCID recommended by the developers of the WOMAC.7 Results for dichotomous outcomes can be reported as odds ratios61 but are frequently reported as absolute risk reduction (ARR), relative risk reduction (RRR), and number needed to treat (NNT).65 Deyle et al25 reported the number of patients who had knee replacement surgery by the time of a 1-year follow-up in each of 2 groups with knee osteoarthritis. In the placebo group, 8 of 41 patients (20%) had surgery compared to only 2 of 42 patients (5%) receiving manual therapy and exercise. The ARR is the difference between these 2 proportions: 20% - 5% = 15% (95% CI: 1% to 28%). The RRR is the reduction in risk relative to that in the comparison group: (20% - 5%) ÷ 20% = 75% (95% CI: 5% to 100%). The NNT is computed by taking the reciprocal of the ARR: 1.0 ÷ 0.15 = 7 (95% CI: 4 to 105). Reporting results in this way reveals that, although the risk of needing surgery within 1 year was 20% in the placebo group, risk was reduced by 15% in absolute terms and by 75% in relative terms by providing manual therapy and exercise. The wide 95% CIs around the point estimates reveal considerable imprecision in the results. The principles discussed above for appraising "apparently" positive and negative trials apply equally to assessing dichotomous outcomes. However, rather than comparing the MCID to point estimates and associated CIs for mean effect sizes, a clinical judgment is required (depending on multiple considerations of context) to determine the minimal clinically important amount of risk reduction for comparison with the point estimates and associated CIs for the ARR, RRR, and NNT. For example, if a clinician considers a 5% RRR for needing total knee arthroplasty to be clinically meaningful, the evidence from Deyle et al25 would be considered "definitive." On the other hand, if a clinician judges a 30% RRR to be minimally clinically meaningful, the point estimate from Deyle et al25 (75% RRR) would be considered promising; but the wide CI around that treatment effect would lead the clinician to seek additional evidence, perhaps from a larger trial with greater precision. The NNT is defined as the number of patients who would need to be treated on average to prevent 1 bad outcome or achieve 1 desirable outcome in a given period of time.54 Therefore, when a low NNT is associated with a treatment, this indicates that relatively few patients need to receive the treatment in order to avoid 1 bad outcome. NNT values are thus used as a measure of treatment effectiveness and are helpful in cost-benefit calculations. However, it should be noted that NNT values alone are not sufficient to determine if an intervention approach should be implemented. Patient values and preferences, the severity of the outcome that would be avoided, and the cost and side effects of the intervention are important determinants that should be considered when making treatment decisions. Thus, the threshold NNT will almost certainly be different for different patients and there is no simple answer to the question of when an NNT is sufficiently low to justify a treatment. TABLE 2 lists several physical therapy-related interventions with associated outcomes, NNTs, and 95% CI values.
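The risk-reduction arithmetic from the Deyle et al example is reproduced in the Python sketch below; the Wald interval used for the ARR is a standard approximation (it recovers the 1% to 28% CI quoted above), while the published CIs for the RRR and NNT were computed by other methods and are not recomputed here.

from math import ceil, sqrt

def risk_reduction(events_control, n_control, events_treated, n_treated, z=1.96):
    """ARR (with a Wald 95% CI), RRR, and NNT for a dichotomous bad outcome."""
    p_control = events_control / n_control
    p_treated = events_treated / n_treated
    arr = p_control - p_treated
    se = sqrt(p_control * (1 - p_control) / n_control
              + p_treated * (1 - p_treated) / n_treated)
    rrr = arr / p_control
    nnt = ceil(1 / arr)  # round up to a whole number of patients
    return arr, arr - z * se, arr + z * se, rrr, nnt

arr, lower, upper, rrr, nnt = risk_reduction(8, 41, 2, 42)
print(f"ARR = {arr:.0%} (95% CI: {lower:.0%} to {upper:.0%})")  # 15% (1% to 28%)
print(f"RRR = {rrr:.0%}, NNT = {nnt}")  # 76% (rounded to 75% in the text), NNT = 7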

A Primer on Selected Aspects of Evidence-Based Practice Relating to Questions of Treatment, Part 1: Asking Questions, Finding Evidence, and Determining Validity

According to the Guide to Physical Therapist Practice,7 patient management consists of 5 interrelated elements: examination, evaluation, diagnosis, prognosis, and interventions and outcomes. Data collected during the initial examination should be evaluated and should facilitate decision making regarding management strategies that are most appropriate for the individual patient. The diagnostic process in physical therapist practice has been described in detail elsewhere.20 Once the diagnostic process surpasses the treatment threshold, or the point in the examination at which the clinician has determined that treatment may begin,20 the clinician must determine the optimal intervention or combination of interventions needed to maximize patient outcomes. The clinician uses data collected during the examination, along with the diagnosis and patient goals, to determine the patient's prognosis and likely response to treatment. All elements of patient management as described by the Guide to Physical Therapist Practice7 relate to components of EBP. However, this clinical commentary will focus on aspects of EBP associated with determining appropriate treatment. In the clinical decision-making process, a certain degree of uncertainty exists with regard to interventions that will most likely maximize the chance of obtaining a successful outcome for an individual patient.14,54 Although the volume and quality of evidence for the efficacy and effectiveness of many commonly used physical therapy interventions is improving, identifying the most appropriate treatment strategy can be a difficult task when faced with varying levels of uncertainty about the validity of a respective study's findings. Efficiently incorporating EBP into clinical practice specifically related to treatment is a 5-step process: (1) developing an answerable question, (2) identifying the evidence for treatment, (3) critically appraising the evidence (which requires an understanding of research design and statistical principles33), (4) incorporating evidence into clinical practice, and (5) evaluating the effectiveness and efficiency with which steps 1 through 4 were carried out when determining an appropriate intervention strategy for the particular patient.46 The purpose of this clinical commentary is to provide a perspective on the first 2 steps related to treatment and on that part of step 3 related to the validity of evidence, with an emphasis on studies of interventions in orthopaedic and sports physical therapy. This commentary is the first of a 2-part series. Part 2 will provide a perspective on principles for interpreting results from evidence for treatment, applying the evidence to patient care, and evaluating proficiency with EBP skills.

A Primer on Selected Aspects of Evidence-Based Practice Relating to Questions of Treatment, Part 2: Interpreting Results, Application to Clinical Practice, and Self-Evaluation

Because clinicians are attempting to apply results from current best evidence to clinical practice, a key question to be answered is, "Are the patients in this study similar to the patient I am managing?" Therefore, patient demographic data, such as age, diagnostic classification, level of impairment/dysfunction at baseline, and level of acuity, are just some of the characteristics that are typically reported in the methods or results section of the study. The clinician may even wish to determine if previously published prognostic studies have identified specific patient demographic data that may predict which patients are more likely to achieve a successful outcome regardless of what treatment is applied. If the clinician determines that patients from the study sufficiently resemble the patient of interest, then the clinician can proceed to a critical appraisal of the study design and results. The EBP approach identifies a finite set of key validity issues to consider and facilitates decisions about the clinical meaningfulness of reported treatment effects. Critical appraisal enables a clinician to answer 3 questions37 after the foreground question is posed and the best evidence is found: (1) Are the results valid? (2) What are the results? (3) How can I apply the results to patient care? The first of these 3 questions was addressed in part 1 of this series. The remaining 2 questions will be addressed in this commentary.

Synthesized Results From Multiple Clinical Trials: Clinical Practice Guidelines

Clinical practice guidelines integrate synthesized evidence with broader cultural, societal, and patient-interest considerations. Results in practice guidelines come in the form of recommendations supported by specified levels of evidence. Readers performing a critical appraisal of a practice guideline should determine the method used by panel members to grade treatment recommendations, and then consider the relative strength of each recommendation. A common scheme for grading recommendations in clinical practice guidelines is reproduced in

Summary

Clinical tests can assist clinicians in increasing their level of certainty about whether or not a suspected condition is present. An understanding of the characteristics of diagnostic tests is essential to their clinical interpretation. A very sensitive test rules a condition out (SnOUT), and a very specific test rules a condition in (SpIN). Where sensitivity and specificity are less than perfect, the likelihood ratio nomogram is a useful aid in quantifying the probability that a client does or does not have a particular condition given a positive or negative test result. The clinician first must quantify the pre-test probability of a suspected condition. The likelihood ratio can then be used to calculate the post-test probability that the condition is present. Tests can be used in series, with the post-test probability of the first test becoming the pre-test probability of the next test, and so on, until the condition is ruled in or out. Where the pre-test probability is either very high or very low, testing is not recommended, as results can provide little additional certainty about whether the condition is present or absent.
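A minimal Python sketch of the serial-testing logic described above, chaining each post-test probability into the next test; the pre-test probability and LR values are illustrative, and note that multiplying LRs in series assumes the tests are conditionally independent.

def apply_likelihood_ratio(probability, lr):
    """One testing step: probability -> odds, apply the LR, -> probability."""
    odds = probability / (1 - probability) * lr
    return odds / (1 + odds)

probability = 0.30                 # illustrative pre-test probability
for lr in (4.5, 9.6):              # two positive tests with illustrative LR+ values
    probability = apply_likelihood_ratio(probability, lr)
    print(f"probability after this test: {probability:.0%}")   # 66%, then 95%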

Impact of clinical decision rules on clinical care of traumatic injuries to the foot and ankle, knee, cervical spine, and head

Clinicians are faced with a multitude of clinical decisions for traumatic injury for patients seen in emergency departments (ED). These decisions are weighted by several factors, including the desire to provide excellent clinical care, fear of malpractice litigation, and pressures to provide care with less cost to the insurers (government or private). Many decisions to perform investigations are based on personal experience and not evidence-based medicine. There are now several clinical decision rules to assist emergency physicians to determine, using evidence-based medicine, which patients require investigations for some of the most frequent traumatic injuries seen in the ED. Clinical decision (or prediction) rules attempt to reduce the uncertainty of medical decision making by standardising the collection and interpretation of clinical data. A decision rule is derived from original research and may be defined as a decision making tool that incorporates three or more variables from the history, physical examination, or simple tests.12,26 These decision rules help clinicians with diagnostic or therapeutic decisions at the bedside. The methodological standards for their development and validation can be summarized (Fig. 1). Unfortunately, many clinical decision rules to our knowledge have not been prospectively assessed to determine their accuracy, reliability, clinical sensibility, or potential impact on clinical practice.
The validation process is very important because many statistically derived rules or guidelines fail to perform well when tested in a new population.9,23,4 The reason for this poor performance may be statistical, i.e., overfitting or instability of the original derived model,5 or may be due to differences in prevalence of disease or differences in how the decision rule is applied.20,44 Likewise, the process of implementation is important to demonstrate the true effect on patient care and is the ultimate test of a decision rule; transportability can be tested at this stage.13 Clinical decision rules must undergo field trials to test their effectiveness, as the rationale for such rules lies in their ability to alter actual patient care.13,43 Our research group has previously conducted derivation, validation and implementation studies for the following widely used clinical decision rules: Ottawa ankle rules,28-30 Ottawa knee rules,31-33 Canadian C-spine rule34,35,37 and the Canadian CT head rule36 (for minor head injuries). The impact of clinical decision rules may be assessed by multiple approaches. Implementation studies likely provide the most concrete evidence that the rules are reliable and work in the real world. Other methods include surveys of physician practices and opinions. Finally, we outline the current assessment of the impact of four well-known clinical decision rules, which were developed using the accepted methodology for conducting clinical decision rules, relating to traumatic injuries. We review the clinical impact of the Ottawa ankle rule, Ottawa knee rule, Canadian C-spine rule and the Canadian CT head rule for minor head injuries.

...

Clinical practice guidelines are another form of synthesized evidence wherein broader cultural, societal, and patient-interest considerations are integrated with the best available evidence. Although the quality and completeness of practice guidelines can vary, the best guidelines are created by panels of experts representing a spectrum of constituencies, using EBP principles to generate explicit grades of recommendations supported by specified levels of evidence. Perhaps the best single online resource for clinical practice guidelines is located at the National Guideline Clearinghouse (www.guideline.gov), sponsored and maintained by the Agency for Healthcare Research and Quality. A simple search of that site in July 2008 using the words "physical therapy" yielded 1163 guidelines. Guyatt et al24 suggest 4 key validity issues for consideration when critically appraising a clinical practice guideline: (1) recommendations should broadly consider all relevant patient groups, treatment options, and outcomes; (2) recommendations should be linked to the best available evidence; (3) values and preferences should be explicitly linked to treatment outcomes; and (4) recommendations should be accompanied by grades of recommendation that indicate the strength of the associated evidence. The AGREE Collaboration (http://www.agreecollaboration.org) has recently developed an instrument to assess the quality of clinical guidelines.

...

Clinicians may gain some insight into relative overall validity for published trials for treatments relevant to physical and occupational therapists by searching the PEDro3 and OTSeeker2 databases. Both databases use the PEDro scale to rate the overall quality of clinical trials based on adherence to the principles of validity discussed above. The PEDro scale ranges from 0 to 10, with 1 point assigned for adequate protection against each of 10 validity threats. The PEDro scale has been shown to have fair to good interrater reliability (ICC1,1 = 0.68; 95% confidence interval [CI]: 0.57 to 0.76).35 PEDro scale scores posted online are independently confirmed when annotated as such. A PEDro search for "Pilates AND low back pain" yields 3 clinical trials.3 Judging by PEDro scale scores alone, seeking best evidence would suggest initial preference for the study by Rydeard et al45 (PEDro scale score of 8) over the study by Gladwell et al22 (PEDro scale score of 5). Although PEDro is a convenient and freely available resource for a quick indication of validity for many trials, consulting PEDro scale scores does not obviate the need for independent professional judgments regarding validity threats as part of the critical appraisal process. Systematic reviews are conducted by employing explicit methods for exhaustive searching and selective inclusion of original studies for analysis based on specified methodologic criteria. Systematic reviews of treatment studies can be performed for RCTs, cohort studies, or case-control studies. Readers must take care to distinguish systematic reviews from unsystematic "literature reviews" in which authors survey published literature without explicit search criteria or without specified selection criteria for studies to include in the review. These unsystematic reviews are more common in older literature and may be considered expert opinion (level 5 evidence), because conclusions by authors are subject to multiple forms of bias. Oxman et al40 suggest 4 key validity issues that one should consider when critically appraising a systematic review: (1) authors should address a clinical foreground question that is explicit and sufficiently narrow in scope; (2) the search for relevant studies should be detailed, exhaustive, and fully revealed; (3) authors should use and report explicit criteria for assessing the methodologic quality of studies considered for inclusion or exclusion in the review; and (4) adequate reliability between 2 or more assessors should be reported for decisions about which studies to include, the quality of included studies, and the data extracted from original studies.

Synthesized Results From Multiple Clinical Trials: Systematic Reviews

Systematic reviews are conducted by employing explicit methods for exhaustive searching and selective inclusion of original studies for analysis based on specified methodologic criteria. Systematic reviews of treatment studies can be performed for RCTs, cohort studies, or case control studies. Readers must take care to distinguish systematic reviews from unsystematic "literature reviews" in which authors survey published literature without explicit search criteria or without specified selection criteria for studies to include in the review. These unsystematic reviews are more common in older literature and may be considered expert opinion (level 5 evidence), because conclusions by authors are subject to multiple forms of bias. Oxman et al40 suggest 4 key validity issues that one should consider when critically appraising a systematic review: (1) authors should address a clinical foreground question that is explicit and sufficiently narrow in scope; (2) the search for relevant studies should be detailed, exhaustive, and fully revealed; (3) authors should use and report explicit criteria for assessing the methodologic quality of studies considered for inclusion or exclusion in the review; (4) adequate reliability between 2 or more assessors should be reported for decisions about which studies to include, the quality of included studies, and the data extracted from original studies.

In spite of language used above to characterize results from a qualifying clinical trial as "definitive," a single trial will rarely provide final or completely conclusive evidence for treatment effectiveness. This is why multiple RCTs with consistent results provide stronger evidence than a single RCT. Consequently, a single systematic review with homogeneity of results from multiple RCTs provides a higher level of evidence (level 1a) than a single RCT with good protections against validity threats (level 1b).2 Systematic reviews are at the top of the evidence hierarchy because they typically use meta-analysis methods, when appropriate, to synthesize evidence from multiple single clinical trials.38 In this way, results from the overall body of best evidence, filtered and selected by explicit methodological quality criteria, are synthesized to provide an overall estimate of treatment effectiveness. Meta-analysis methods allow pooling of sample sizes from among included studies, resulting in substantial advantages: (1) increased statistical power to detect significant treatment effects, and (2) enhanced precision in estimates of effect sizes, reflected in narrower CIs around point estimates. Results of meta-analyses are typically presented in forest plots. A forest plot representing the simplest case from a meta-analysis based on only 2 original studies is shown in FIGURE 3. Note that results from individual trials are represented by point estimates (squares in this example), with horizontal lines representing the CIs. Effect sizes for continuous scale outcomes in a meta-analysis are transformed to a normalized scale, such as a weighted mean difference (WMD) or a standardized mean difference. For dichotomous outcomes, effect sizes in a meta-analysis are typically reported as relative risks or odds ratios. The null value for the treatment effect is represented as a central vertical line in a forest plot. Point estimates for effect sizes plotted on one side of the vertical reference line favor the experimental treatment; points plotted on the other side of the line favor the comparison. If the CI around the point estimate crosses the vertical line, results are not statistically significant because those results are consistent with a zero treatment effect in the target population. FIGURE 3 illustrates results from a systematic review39 for an outcome of pain intensity at 12 weeks, comparing results in patients treated with bed rest with results in patients who stayed active. Two RCTs were included in the meta-analysis: one with a statistically significant effect favoring the recommendation to stay active and one with no statistically significant difference between groups. Without meta-analysis the overall accumulation of evidence might appear to be equivocal, with one study suggesting benefit and another suggesting no benefit. The synthesized result pooling data from both studies in a meta-analysis is represented by the diamond shape labeled "subtotal" in FIGURE 3. This meta-analysis result from aggregated evidence reveals a statistically significant benefit in favor of the recommendation to stay active: quite a different conclusion from the equivocal judgment suggested by a simple count of positive trials versus negative trials.
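To make the pooling arithmetic concrete, the following is a minimal sketch, not code from the original review, applying fixed-effect inverse-variance weighting (the simplest common meta-analysis method) to two mean differences with 95% CIs; the numbers are hypothetical stand-ins for the two trials in FIGURE 3.

```python
import math

def pool_fixed_effect(estimates):
    """Fixed-effect inverse-variance pooling of (effect, ci_lower, ci_upper) triples."""
    total_weight, weighted_sum = 0.0, 0.0
    for effect, lo, hi in estimates:
        se = (hi - lo) / (2 * 1.96)   # a 95% CI spans about +/-1.96 SE
        weight = 1.0 / se ** 2        # more precise studies get more weight
        total_weight += weight
        weighted_sum += weight * effect
    pooled = weighted_sum / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se

# One significant and one nonsignificant trial (hypothetical mean differences):
studies = [(-6.0, -11.0, -1.0),   # CI excludes 0: significant on its own
           (-3.0, -8.0, 2.0)]     # CI crosses 0: not significant on its own
print(pool_fixed_effect(studies))  # approximately (-4.5, -8.0, -1.0)
```

Note how the pooled CI excludes zero even though one trial's CI did not, mirroring the bed rest versus staying active example above.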

The Minimal Detectable Change and Minimal Clinically Important Difference Properties

Decisions about clinical meaningfulness of results involve judgments about thresholds distinguishing trivial effects from clinically important effects. Although any such judgment can be subject to debate and will depend on multiple contextual considerations and local circumstances, these judgments are essential in any critical appraisal of evidence. Because clinicians are frequently interested in identifying the amount of change over time, measurement properties such as minimal detectable change (MDC) are important to consider. Similar to other measures of reliability, such as the standard error of measurement (SEM), the MDC is the smallest real difference, representing the smallest change in score that likely reflects true change rather than measurement error alone.74,77 For example, Stratford and colleagues78 have reported that the Roland-Morris Questionnaire, a commonly used outcome measure for patients with low back pain, has an MDC of 4 points. Therefore, to be confident that 2 scores taken across time represent a true change, the scores would need to differ by more than 4 points. However, the MDC only provides an indication of the minimum change that is detectable by the instrument, and not necessarily the amount of change that could be considered clinically meaningful to the patient. Jaeschke et al46 defined the minimal clinically important difference (MCID) as "the smallest difference in score in the domain of interest which patients perceive as beneficial." There is a growing body of literature outlining methods for determining MCID values,8,46 reporting MCIDs for specific scales, and using MCIDs to make judgments about clinical meaningfulness of treatment effects in clinical trials. Although published MCID values must be considered in the context of the varying methods and intended purposes for their derivations or estimations,8 clinicians unfamiliar with specific scales will often find it helpful to be aware of published MCID values when critically appraising evidence. No single published value for an MCID can be applied uncritically in all circumstances or for all purposes.59 Rather, a published MCID can provide an initial reference point when applying personal clinical expertise to make independent judgments about what distinguishes trivial from clinically important treatment effects in a local context. An illustrative patient scenario integrating patient values with published MCIDs to make patient-relevant judgments in a critical appraisal is given below in the section titled "Step 4. Incorporating Evidence Into Clinical Practice."

Published MCID values for selected outcome scales commonly used in orthopaedic and sports physical therapy are displayed in TABLE 1. Although the definition of the MCID above suggests application to an individual patient, MCID values are commonly employed to make judgments about the clinical meaningfulness of averaged group treatment effects, both for within-group effects25,53 and for between-group effects.20,24,63 Indeed, Jaeschke et al46 explicitly anticipated use of the MCID to make judgments both for individual and group differences. If the observed raw effect size is equal to or greater than the MCID, the treatment effect is considered clinically meaningful. Otherwise, the treatment effect is deemed trivial regardless of whether statistical significance is achieved.
For example, Hyland et al45 found in a RCT that the posttreatment pain VAS outcome in a calcaneal taping group (2.7 ± 1.8) was significantly better (P < .001) than that of the control group (6.2 ± 1.0). Inasmuch as the point estimate for the treatment effect was a posttreatment between-group difference of 3.5 cm favoring calcaneal taping, we can compare this value to an MCID for the pain VAS. If we accept a suggestion of 3.0 cm as a reasonable value for the MCID for the pain VAS,55 we consider the treatment effect in the study sample to be a clinically meaningful benefit because the point estimate of the effect (3.5 cm) is greater than the MCID (3.0 cm). If a reader is not sufficiently familiar with an outcome scale to make an intuitive judgment about clinical meaningfulness of a treatment effect size, and if no published MCID can be found for that outcome scale, it is often helpful to convert the effect size to a percent difference. Following the example from Hyland et al45 above, the percent difference between groups in posttreatment pain VAS (10-cm scale) means was calculated as follows: (6.2 - 2.7) ÷ 6.2 = 56%. Therefore, the mean pain VAS score for the treatment group was 56% lower (better) than that of the control group. Most clinicians would judge a 56% average reduction in pain to be clinically meaningful, even without being familiar with a particular pain outcome scale.
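As a minimal illustration, a sketch using only the values quoted above (not code from either study), the MCID comparison and the percent-difference fallback can be computed directly:

```python
control_mean, treatment_mean = 6.2, 2.7   # posttreatment pain VAS means (cm)
mcid = 3.0                                # suggested MCID for the pain VAS (cm)

effect = control_mean - treatment_mean    # between-group difference: 3.5 cm
print(effect >= mcid)                     # True: clinically meaningful benefit

# Fallback when no published MCID is available: the percent difference.
percent_diff = (control_mean - treatment_mean) / control_mean * 100
print(round(percent_diff))                # 56 (% lower than control)
```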

SUMMARY

Determining the source, validity, strength, and relevance of evidence for treatment decisions requires successful integration of each step of the EBP process. The goal of EBP is to improve efficiency and assist clinicians in selecting interventions that will maximize patient outcomes rather than erroneously selecting interventions with little or no demonstrated effectiveness.56 Identifying appropriate foreground questions, performing literature searches, critically analyzing the best available evidence, applying the best evidence to clinical practice, and assessing the proficiency of the process will ultimately lead to optimal care for our patients. Developing proficiency in the 5-step process of EBP requires strong dedication and effort from students as well as practicing therapists, and at times can be quite challenging. However, as healthcare providers, therapists should approach the challenge of successful integration of EBP with enthusiasm, as the overall goal is to provide the best quality of care and maximize positive outcomes for their patients. They should embrace, and not retreat from, the challenge of integrating the best available evidence, clinical expertise, and patient values into clinical decisions for each patient.

Applying the Evidence—The Consequences of Not Practicing Evidence-Based Diagnosis

Diagnostic tests play a critical role in the management of patients in physical therapy. The results of individual tests are evaluated during the examination process, determining which hypotheses should be ruled in or out, ultimately leading to a decision to use a certain intervention that is believed to provide optimal outcomes for the patient. The ability to judge evidence for diagnostic tests, select the most appropriate test for an individual patient, and interpret the results must become familiar skills if physical therapy diagnosis is to become a more evidence-based process. Many aspects of physical therapist practice, including diagnosis, have been criticized for excessive allegiance to expert opinion and uncritical acceptance of standards that are not based on evidence.108,109 Systems integrating diagnosis and intervention in common usage by physical therapists too frequently owe their popularity to tradition instead of sound data attesting to their usefulness. For example, neurodevelopmental treatment (NDT) is an approach to the management of patients with movement disorders in which the therapist examines factors such as movement patterns and postural reactions and then selects interventions to reduce abnormal movements and improve function.110,111 Even though NDT appears to be the method most commonly used by physical therapists for managing children with cerebral palsy,112 little research has been performed to examine the evidence for examination techniques used within the system or the manner in which the tests are evaluated to determine appropriate interventions.111 Without any validation of the diagnostic decision making underlying intervention choices, it is not surprising that clinical trials comparing patients treated with an NDT-based approach versus other interventions have not demonstrated improved outcomes with the use of the NDT system.113-117 A similar situation exists for the most common treatment approach for patients with LBP, the McKenzie system.118 The McKenzie system uses a variety of examination techniques, the results of which are used to place patients into categories and to determine interventions. Little work has been done to examine the diagnostic process used by the McKenzie system, and the reliability of the classifications is questionable.119 A recent clinical trial comparing outcomes for the McKenzie system with chiropractic care and a patient education pamphlet resulted in essentially no differences among the treatment approaches.120 Reliance on patient management systems that are not evidence-based, in our view, has negative consequences not only for practitioners, but also for the profession of physical therapy as a whole. Both the McKenzie system and NDT have been used in clinical trials as representative of "physical therapy" interventions for patients with LBP or cerebral palsy, respectively.117,120 The negative results of these trials have led to the conclusion that physical therapy may not have a role in the management of these conditions. It should not be surprising, however, that systems whose diagnostic procedures are not evidence-based do not result in improved patient outcomes. If diagnostic decisions had been made on the basis of tests with evidence attesting to their ability to focus the examination and determine the most effective interventions, the results might have been more positive. The McKenzie system and NDT serve only to illustrate a more fundamental problem.
Without evidence-based diagnosis, interventions will continue to be based on unsystematic observation, pathoanatomical theories, ritual, and opinion. Studies examining the outcomes of such interventions will continue, in our opinion, to offer discouraging results. The solution is not only to explore new and innovative interventions, but also to refine the process by which interventions are linked to examination findings by studying evidence-based diagnosis.

The interpretation of diagnostic tests: A primer for physiotherapists

During the patient interview, a physiotherapist develops hypotheses about possible causes or diagnoses for the presenting problem. These hypotheses are then tested during the objective assessment, or physical examination, using clinical tests. A diagnostic test seeks to determine whether a person has a particular condition. The evidence-based practitioner needs to be able to locate and evaluate the quality of research papers that report on the accuracy of diagnostic tests (Greenhalgh 1997, Sackett et al 2000). A preliminary step in becoming an evidence-based practitioner is to acquire a thorough understanding of the characteristics of diagnostic tests. Clinicians need to appreciate the extent to which a positive or negative test result can confirm or disprove their diagnostic hypothesis. The aim of this paper is to provide physiotherapists with an understanding of diagnostic test characteristics and how these can be interpreted in the clinical setting.

Conclusions

Emergency physicians should use the Ottawa ankle rules, Ottawa knee rule, Canadian CT head rule, and Canadian C-spine rule to provide more standardised care, decrease the chances of misdiagnosis, and reduce health care costs. International surveys and implementation studies have confirmed that the rules are effective and can be successfully implemented in a wide spectrum of emergency department settings around the world.

...

Even when randomization procedures are followed, bias from investigators influencing subject enrollment and group composition can threaten validity if allocation to groups is not concealed from those enrolling subjects in the study.49 Concealment of group allocation is typically accomplished by first obtaining informed consent and enrolling a new subject into a clinical trial, and only then opening a sealed envelope obtained from a locked filing cabinet to reveal group assignment. Readers performing a critical appraisal should look for language in a published RCT reflecting these or similar methods for concealing group allocation. Interestingly, despite strong rationale for concealment of group allocation, a study of 2297 RCTs in the PEDro database revealed that only 16% of these studies reported concealment of allocation.38

How Useful Are Physical Examination Procedures? Understanding and Applying Likelihood Ratios

"Assessment" is one of the 6 central domains of athletic training education. The certified athletic trainer (AT) is often the first health care provider encountered by athletes and people engaged in physical activity after injury. The AT, therefore, is often called on to evaluate the injured person and make decisions with regard to return to participation or nonemergent referral. These decisions require that the diagnostic probabilities be considered. The educational model in athletic training calls for students to demonstrate an ability to complete an evaluation of numerous musculoskeletal structures. Although specific physical examination procedures are not identified in the educational standards, the instruction of numerous "special tests" is a longstanding practice. Moreover, the texts used to support athletic training education also reference many special tests. Magee,2 for example, described more than 2 dozen examination procedures and modifications for evaluation of the ligaments and menisci of the knee alone. Such instruction is commonplace, but we believe it may be incomplete. How valuable to the clinical decision-making process are the examination procedures we teach? Which tests should we teach, and which should be omitted? What does the AT need to know to apply research on diagnostic procedures to clinical practice? What should we really be teaching students about the physical examination process? Clinical examination procedures have emerged as individual clinicians attempted to improve their ability to accurately diagnose illnesses and injuries. The clinical examination findings influence treatment and referral decisions and permit prognostication regarding outcome after one or more courses of treatment. Before we allow the results of a particular examination procedure to influence a clinical judgment, however, we should know the performance characteristics of the test. Test performance characteristics can be estimated by comparing the results of a clinical examination procedure with an established diagnostic standard. For musculoskeletal conditions, operative findings and the results of advanced diagnostic imaging (eg, magnetic resonance imaging [MRI]) provide such reference standards. Applying research related to the performance characteristics of musculoskeletal physical examination procedures requires an understanding of terminology and statistical procedures unfamiliar to some clinicians. One method that we believe has great utility in athletic training involves the calculation and interpretation of likelihood ratios (LRs). Our purpose, therefore, is to describe the calculation and interpretation of likelihood ratios for examination procedures performed by ATs. Likelihood ratios are not the only means of describing test performance characteristics, but they are attractive for the reasons we describe, especially for dichotomous clinical examination procedures (eg, Lachman procedure3) in which the intended result is either positive (anterior cruciate ligament [ACL] is torn) or negative (ACL is intact). We first define key terms and review issues of validity and test reliability. The derivation of values needed to understand and apply the results of investigations of diagnostic tests is then provided. For the purpose of this article, we consider the assessment of an injured knee.
More specifically, we present and discuss the performance characteristics of the Lachman3 and McMurray4 tests for ACL and meniscal injuries, respectively. These patient examples link an understanding of likelihood ratios, clinical research, and individual clinical practice and teaching. These concepts can be applied across a spectrum of examination procedures common to athletic training education and practice.

Understanding sensitivity and specificity with the right side of the brain

I first encountered sensitivity and specificity in medical school. That is, I remember my eyes glazing over on being told that "sensitivity = TP/(TP + FN), where TP is the number of true positives and FN is the number of false negatives." As a doctor I continued to encounter sensitivity and specificity, and my bewilderment turned to frustration—these seemed such basic concepts; why were they so hard to grasp? Perhaps the left (logical) side of my brain was not up to the task of comprehending these ideas and needed some help from the right (visual) side. What follows are diagrams that were useful to me in attempting to better visualise sensitivity, specificity, and their cousins positive predictive value and negative predictive value.

Using both sides of the brain

I hope that having worked through sensitivity and specificity from scratch you will be wondering why it initially seemed so confusing. It may be because of our dependence on the left (linguistic) side of the brain. When told that a test has a sensitivity of 94% and a positive predictive value of 1%, our left brain has difficulty grasping how a test can be 94% sensitive and yet be correct only 1% of the time. It is partly misled by the huge difference between prevalence, on the one hand, and sensitivity and specificity on the other. The prevalence of systemic lupus erythematosus is 0.033% while the sensitivity and specificity of the test are about 95%; this difference is of several orders of magnitude. If, for example, we developed a test with sensitivity and specificity of 99.999% rather than 95%, we would be able to boast of a positive predictive value of 97%.

Sensitivity and specificity

I will be using four symbols in these diagrams. Let us start by looking at a hypothetical population (fig 2). The size of the population is 100 and the number of people with the disease is 30. The prevalence of the disease is therefore 30/100 = 30%. Now let us imagine applying a diagnostic test for the disease to this population and obtaining the results shown in figure 3. The test has correctly identified most, but not all, of the people with the disease. It has also correctly labelled as disease free most, but not all, of the well people. Calculating sensitivity and specificity will allow us to quantify these statements. Sensitivity refers to how good a test is at correctly identifying people who have the disease. When calculating sensitivity we are therefore interested in only this group of people (fig 4). The test has correctly identified 24 out of the 30 people who have the disease. Therefore the sensitivity of this test is 24/30 = 80%. Specificity, on the other hand, is concerned with how good the test is at correctly identifying people who are well (fig 5). The test has correctly identified 56 out of 70 well people. The specificity of this test is therefore 56/70 = 80%. Having a high sensitivity is not necessarily a good thing, as we can see from figure 6. This test has achieved a sensitivity of 100% by using the simple strategy of always producing a positive result. Its specificity, however, clearly could not be worse, and the test is useless. By contrast, figure 7 shows the result a perfect test would give us.
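The worked example translates directly into the standard 2 × 2 calculations; the following sketch (not part of the original article) reproduces the figures quoted above:

```python
# Counts from the hypothetical population of 100 (30 diseased, 70 well):
tp, fn = 24, 6     # of the 30 diseased: 24 test positive, 6 are missed
tn, fp = 56, 14    # of the 70 well: 56 test negative, 14 test positive

print(tp / (tp + fn))   # sensitivity: 24/30 = 0.80
print(tn / (tn + fp))   # specificity: 56/70 = 0.80

# The degenerate "always positive" test of figure 6:
tp, fn, tn, fp = 30, 0, 0, 70
print(tp / (tp + fn), tn / (tn + fp))   # sensitivity 1.0, specificity 0.0
```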

Linking History and Physical Examination Findings

If the clinician had estimated a 20% pretest probability of an ACL tear, a positive Lachman test3 would result in a posttest probability of 71%, and a negative Lachman test would result in a posttest probability of 4%. Conversely, an 80% pretest probability estimate combined with a positive Lachman test3 results in a posttest probability of 97%, whereas a negative Lachman test3 yields a 38% posttest probability. These results still represent generally large shifts in probability but illustrate the influence of history, observation, and clinical examination procedure results on diagnostic certainty.
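The probabilities quoted above follow from Bayes' theorem applied on the odds scale. Here is a minimal sketch; the LR values are assumptions chosen to be consistent with the reported posttest probabilities (roughly LR+ ≈ 9.8 and LR− ≈ 0.15), not figures quoted from the source:

```python
def posttest_probability(pretest_p, lr):
    """Convert a pretest probability to a posttest probability via an LR."""
    pretest_odds = pretest_p / (1 - pretest_p)
    posttest_odds = pretest_odds * lr
    return posttest_odds / (1 + posttest_odds)

LR_POS, LR_NEG = 9.8, 0.15   # assumed Lachman LRs, implied by the text

for pretest in (0.20, 0.80):
    print(round(posttest_probability(pretest, LR_POS), 2),   # ~0.71, then ~0.98
          round(posttest_probability(pretest, LR_NEG), 2))   # ~0.04, then ~0.38
```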

...

In an attempt to minimize the effect of rater or subject bias, studies use various blinding schemes. There are 4 categories of study participants who should ideally be blinded to group assignment: (1) patients, (2) treating clinicians, (3) data collectors, and (4) data analysts.48 Although it is usually feasible to blind those from all 4 categories in a pharmaceutical study, this is usually not possible in studies of physical therapy interventions. Physical therapists are usually aware of the treatment they are delivering (rater bias); blinding the patient with sham interventions may be difficult or impossible.29 Additionally, most current Institutional Review Boards require that patients are aware of all of the possible interventions they may receive as part of the informed consent process, which provides another barrier to complete patient blinding. However, the person measuring outcomes in physical therapy trials can almost always be blinded to group assignment in order to minimize rater bias. Authors should reveal this antibias protection with clear language, such as, "An investigator, who was blinded to the treatment condition... performed this measurement."53 Nevertheless, Moseley et al38 found that only 5% of studies in the PEDro database reported using blinded outcome assessors. Therefore, a reader performing a critical appraisal must decide whether blinding occurred and, if not, how serious a threat to validity is posed by this problem. The implication of nonblinding for overall appraisal of the evidence will depend on the context and particulars of the study. For example, self-report outcome tools (eg, Oswestry Scale, WOMAC scale, etc) are not as readily subject to rater bias, even when the outcomes assessor is not blinded to group assignment.

The Role of Reliability

In order to provide useful information, a test should yield reliable results in the clinical setting. That is, performance of the test on different occasions should yield the same result if the status of the patient being examined has not changed. Traditionally, reliability has been emphasized as a precursor to validity, a preliminary step that should be completed prior to initiating any study of validity. The numerous studies examining diagnostic test reliability without any assessment of validity attest to this mind-set. The peril in this approach is that it may lead to the dismissal of potentially useful tests based on an inability to reach an arbitrary threshold of reliability. This could be due to properties of the statistics used to measure reliability. The kappa statistic is the reliability coefficient typically used in studies of agreement between examiners for categorical data.92 The kappa statistic is appropriate for this purpose because it is a chance-corrected measure of agreement; however, it can be subject to deflation based on the prevalence of the condition being measured.35,93 For example, Spitznagel and Helzer94 noted that, if 2 raters of equal ability each performed a test and each rater was known to have 80% sensitivity and 98% specificity when his or her results were compared with a reference standard, the kappa statistic between the raters would be .67 if the errors made by the raters relative to the reference standard were independent. If the same raters, with the same level of accuracy, repeated the test in a second population with a prevalence of only 5%, the kappa value would fall to .52.94 This is an example of the difficulty in interpreting kappa values when prevalence is extremely high or low. Many conditions of interest in physical therapy are rare, and kappa statistics used in these instances may be artificially lowered. In addition, although arbitrary scales exist for categorizing kappa values as poor, fair, good, and so on,92 the threshold level making a test "reliable enough" is not known. For example, Smieja et al95 examined the reliability and diagnostic accuracy of tests used in the identification of patients with diabetes who lacked sufficient protective sensation of the feet. A total of 304 patients were examined, 200 of whom were also examined by a second rater to measure reliability. The reference standard was a Semmes-Weinstein monofilament examination. One diagnostic test that was examined was position sense assessed at the interphalangeal joint of the great toe for a 10-degree change. The kappa value between raters for judgments of position sense was only fair by most standards (.28). The results (Tab. 10), however, show that the position sense test provided useful information when it was positive (specificity = 98%, positive LR = 12.8).95 If the reliability assessment had been performed separate from the study of validity, it is possible that the position sense test would have been discarded from further consideration due to a lack of reliability, and the potential diagnostic value of a positive result may not have been uncovered. Reliability data certainly convey meaningful information; however, we believe that their usefulness is best appreciated when considered in conjunction with data examining diagnostic accuracy or utility. Reliability assessments conducted as independent preliminary studies can lead to the premature exclusion of useful tests or the promotion of highly reliable, but diagnostically meaningless, tests.
To encourage complete examination of a diagnostic test, reliability data should be considered a complement to, not a precursor of, an assessment of diagnostic value. An important role of reliability data in the context of assessing the strength of evidence provided by a diagnostic test is that they may provide an explanation for inadequate accuracy or utility.56,77 When a measurement is found to have little diagnostic meaning and poor reliability, the test's diagnostic ability may be improved if the test is performed in a manner that leads to more reliable measurements.
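For readers who want to verify the prevalence effect on kappa, the statistic is simple to compute from a 2-rater agreement table. The counts below are hypothetical, constructed only to show raw agreement and kappa diverging when the condition is rare:

```python
def cohens_kappa(both_pos, r1_only_pos, r2_only_pos, both_neg):
    """Chance-corrected agreement between 2 raters on a dichotomous test."""
    n = both_pos + r1_only_pos + r2_only_pos + both_neg
    observed = (both_pos + both_neg) / n
    # Expected agreement comes from each rater's marginal positive rate.
    p1 = (both_pos + r1_only_pos) / n
    p2 = (both_pos + r2_only_pos) / n
    expected = p1 * p2 + (1 - p1) * (1 - p2)
    return (observed - expected) / (1 - expected)

# Raters agree on 92 of 100 cases, but the condition is rare (~8% positive):
print(round(cohens_kappa(4, 4, 4, 88), 2))   # 0.46, despite 92% raw agreement
```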

Likelihood ratios

Likelihood ratios summarise the information contained in both sensitivity and specificity (Dujardin et al 1994). A likelihood ratio (LR) tells us how likely a given test result is in people with the condition, compared with how likely it is in people without the condition. Calculation of LRs is simple:
• Likelihood ratio (test +ve) = sensitivity/(1 - specificity)
• Likelihood ratio (test -ve) = (1 - sensitivity)/specificity
However, it is easy to get confused when calculating LRs. To make the LR calculations work, sensitivity and specificity must be expressed as decimals, eg 0.95. Alternatively, they may be expressed as percentages by changing the 1 in 1 - specificity and 1 - sensitivity to 100. Interpreting the LR is also simple. The higher the positive LR, the more certain you can be that a positive test indicates the person has the disorder. The lower the negative LR, the more certain you can be that a negative test indicates the person does not have the disorder. If the LR is exactly 1.0, the test provides no information: the probability that a person has, or does not have, the condition does not change at all, and the diagnostic hypothesis is no closer to being confirmed or rejected. Likelihood ratios in a nutshell:
• A +ve LR of 10 or more indicates that a positive test will be very good at ruling the disorder IN.
• A -ve LR of 0.1 or less indicates that a negative test will be very good at ruling the disorder OUT.
• A LR close to 1.0 will provide little change in the probability that a person has or does not have a disorder.
Now to put LRs into practice.
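As a minimal sketch of the two formulas above (not code from the original paper), using the 80%-sensitive, 80%-specific test from the earlier worked example:

```python
def likelihood_ratios(sensitivity, specificity):
    """Sensitivity and specificity must be decimals, eg 0.80, not 80."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg

print(likelihood_ratios(0.80, 0.80))   # LR+ = 4.0, LR- = 0.25
```

By the rules of thumb above, an LR+ of 4.0 and an LR− of 0.25 mark this test as only modestly useful for ruling the disorder in or out.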

Predictive values

Now let us consider positive predictive value and negative predictive value. We will again use the population introduced in figure 3. Positive predictive value refers to the chance that a positive test result will be correct. That is, it looks at all the positive test results. Figure 8 shows that 24 out of 38 positive test results are correct. The positive predictive value of this test is therefore 24/38 = 63%. Negative predictive value, on the other hand, is concerned only with negative test results (fig 9). In our example, 56 out of 62 negative test results are correct, giving a negative predictive value of 56/62 = 90%. The interesting thing about positive and negative predictive values is that they change if the prevalence of the disease changes. Let's assume that the prevalence of disease in our population has fallen to 10%. If we were to use the same test as before, we would obtain the results in figure 10. The sensitivity and specificity have not changed (sensitivity = 8/10 = 80% and specificity = 72/90 = 80%), but the positive predictive value is now 8/26 = 31% (compared with 63% previously) and the negative predictive value is 72/74 = 97% (compared with 90% previously). In fact, for any diagnostic test, the positive predictive value will fall as the prevalence of the disease falls, while the negative predictive value will rise. This is not really so mystifying if we consider the prevalence to be the probability that a person has the disease before we do the test. A low prevalence simply means that the person we are testing is unlikely to have the disease and therefore, based on this fact alone, a negative test result is likely to be correct. The following real example should make this clearer.
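The prevalence dependence described here can be checked directly. The sketch below (not from the original article) reproduces the figures quoted above from sensitivity, specificity, and prevalence alone:

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Return (PPV, NPV) for a test applied at a given prevalence (all decimals)."""
    tp = prevalence * sensitivity
    fp = (1 - prevalence) * (1 - specificity)
    fn = prevalence * (1 - sensitivity)
    tn = (1 - prevalence) * specificity
    return tp / (tp + fp), tn / (tn + fn)

print(predictive_values(0.30, 0.80, 0.80))   # PPV ~0.63, NPV ~0.90
print(predictive_values(0.10, 0.80, 0.80))   # PPV ~0.31, NPV ~0.97
```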

APPLICATIONS

Now that we have survived this crash course in the generation of LRs, the issues become why this statistic is of value and how it can be applied in practice and teaching. We will illustrate the applications of LRs by discussing 2 examination procedures of the knee. Before doing so, however, let us consider the realities and uncertainties of the examination process to better appreciate the role diagnostic tests, including physical examination procedures, play in the management of patients. In clinical practice, physical examination procedures are performed in the context of a comprehensive evaluation of a patient. The examination procedures are performed based on some level of suspicion that the condition exists. This pretest probability estimate varies with clinicians and the circumstances of the individual patient. The key is to recognize that a level of suspicion regarding diagnostic possibilities exists before the examination procedure of interest (eg, McMurray test) is performed. Fritz and Wainner15 described the relationship between pretest probability and LRs. Understanding this link is prerequisite to understanding the effect of diagnostic test results on posttest probability, or the degree of certainty that a condition does or does not exist after a clinical examination is completed. The issue of pretest estimates raises 2 questions. First, do clinicians really go through this process? Second, what is a "good" estimate? At some level, clinicians must, and we mean must, make a judgment regarding the probability of one or more diagnoses. This process starts before any clinical examination procedures are performed or diagnostic tests ordered. In fact, it is the process by which the clinician selects the components of the remainder of the evaluation. In many cases, one could argue that a clinician's pretest probability is too conservative or too liberal. This is a reminder that clinicians vary in their probability estimates based on experience and specialization, as well as the subtle findings gleaned from history and observation. The effect of various pretest probabilities is discussed later. Probability implies uncertainty. Consider how often you are absolutely, positively certain of a diagnosis at the end of a physical examination. These events happen, of course, but not as often as most of us would like. Uncertainty is inherent in the clinical practice of athletic training. However, decisions regarding referral, plans of treatment, and a physician's use of additional diagnostic studies revolve around the level of certainty (probability) that a condition does or does not exist. The value of specific examination procedures may be in the context of their effect on patient care decisions based on a positive or negative result. Now we move to the application of LRs to the posttest probability of a condition being present or absent. The assessment of the performance characteristics of a McMurray test4 for meniscal injury and a Lachman test3 for ACL lesions provides a spectrum of LRs. Thus, we will apply the results from a comprehensive analysis of each test to patient examples to integrate and illustrate the concepts introduced in this paper.

Canadian C-spine rule

Potential traumatic cervical spine injury is a very frequent problem in emergency departments around the world. In the USA, there are approximately 1 million blunt trauma patients a year with potential cervical spine injury.14 For patients who are alert and oriented and neurologically intact, the risk of spinal injury or acute fracture is less than 1%.21 Despite this low risk, most patients undergo cervical spine imaging studies, resulting in 98% of the studies being negative for acute injury.8,10,3,17,42,22,7 After demonstrating a large variation among Canadian emergency departments and the physicians working within a given emergency department, a clinical decision rule was developed to standardise care.34 This study comprised a prospective derivation phase (N = 8924 patients).35 The prospective validation study assessed the accuracy, reliability, and clinical sensibility of the Canadian C-spine rule in a new set of blunt traumatic injury, neurologically intact patients in whom potential cervical spine injury was a concern (Fig. 4). This study enrolled 8283 patients from nine Canadian tertiary care hospitals.37 The Canadian C-spine rule had a sensitivity of 99.4% (95% CI: 96—100%) and a specificity of 45.1% (95% CI: 44—46%) for cervical spine fracture in this validation study.37 The potential impact on cervical spine imaging was also evaluated: using the Canadian C-spine rule would have produced a theoretical cervical spine imaging rate of 55.9%, representing a relative reduction of 22% from the actual rate of 71.7%. The mean length of stay in the ED for patients who did not undergo cervical spine imaging was almost 2 h less (123 min versus 233 min; P < 0.001) than for patients who had imaging. An implementation study in 12 Canadian sites is currently in progress.
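The clinical payoff of those operating characteristics is easiest to see as a posttest probability. The following is a rough sketch (not a calculation from the source papers) combining the reported sensitivity and specificity with the quoted baseline risk of less than 1%:

```python
def posttest_probability(pretest_p, lr):
    odds = pretest_p / (1 - pretest_p) * lr
    return odds / (1 + odds)

lr_negative = (1 - 0.994) / 0.451               # ~0.013, from the validation study
print(posttest_probability(0.01, lr_negative))  # ~0.0001, about 0.01% residual risk
```

A negative rule result drives an already low fracture risk down to roughly 1 in 7000, which is the quantitative basis for withholding imaging.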

The Diagnostic Test

Practitioners and researchers should be able to describe diagnostic tests in sufficient detail to permit replication of the tests by other therapists. We contend that test descriptions should cover 3 aspects: the intended use, physical performance, and scoring criteria. The intended clinical use of a test is an important consideration, although this aspect of the test description is often overlooked by researchers and practitioners.25 As indicated previously, a diagnostic test may be used for a variety of purposes. If researchers do not clarify the intended purpose of a test under study, it is difficult to assess the appropriateness of the reference standard. When clinicians do not consider the purpose of diagnostic tests used in practice, they are susceptible to viewing tests as either good or bad, without recognition that a test may be useful for one purpose, but inappropriate for another purpose. For example, the KT-1000 knee arthrometer* possesses a high degree of diagnostic accuracy for distinguishing between individuals with and without ACL deficiency,57,58 but it has not been shown to be useful for assisting in the selection of an intervention (surgical versus nonsurgical).59 The manner in which a test is performed should be detailed. A study's results can be generalized to a clinical setting only if a test is performed as it was performed in the study. For example, Katz and Fingeroth60 compared various tests for ACL integrity against a reference standard of observation of the ligament during arthroscopy. The Lachman test demonstrated very good diagnostic accuracy for ACL integrity; however, the test was performed with the subjects under anesthesia. If the results were accepted without consideration of the manner in which the test was performed, a clinician may have unrealistic expectations of the usefulness of test results when applying the test to patients who are not under anesthesia. This is illustrated by a study of the Lachman test performed by physical therapists in a clinical setting, which found lower levels of diagnostic accuracy.61 The description of a diagnostic test should include the criteria used to determine positive and negative results. Many tests used in physical therapy, though well known, may have varied or unclear grading criteria. Testing for centralization in patients with LBP is an example. There is general agreement that centralization is an important diagnostic finding,62-64 but no such consensus exists on precisely what constitutes centralization. Some therapists use definitions strictly based on movement of symptoms from distal to proximal,16,65 whereas other therapists define centralization to include diminishment of pain during testing.63 Such disagreements are not unique to judgments of centralization, and it is crucial for authors to clarify how they defined positive and negative results. It is also important to indicate whether the test cannot be performed or the results are indeterminate for any subjects. Because these occurrences could influence the clinical use of a test, they should be reported and explained.53,66 Measurements obtained with a test also are susceptible to review bias, as previously explained. Review bias can be avoided if the measurements and judgments are made by individuals who are blinded to the reference standard. Diagnostic accuracy may be overestimated if blinding is not maintained.26

Positive and negative predictive values

Predictive values tell us how likely it is that a person who tests positive has the disorder, and how likely it is that a person who tests negative does not have the disorder. Predictive values are also called "post-test probabilities" (Go 1998). Unfortunately the predictive values only apply when the clinical prevalence is identical to that reported in the study. Prevalence changes dramatically depending on where the test is being performed. For example, in a general physiotherapy outpatient department or practice, the prevalence of patients with anterior cruciate ligament (ACL) tears will be lower than in a sports clinic that specialises in knee injuries. Prevalence is also called the "pre-test probability" that a person has the disorder. The pre-test probability of a client having an ACL tear is higher in the sports clinic than in the general practice. Because predictive values only apply to populations with the same prevalence, they are not very useful values. Consequently you can now almost disregard the positive predictive values and negative predictive values. A more useful tool for interpreting clinical tests is the likelihood ratio.

STEP 3B. CRITICALLY APPRAISING THE LITERATURE: WHAT ARE THE RESULTS?

Readers should understand statistical analyses and the presentation of quantitative results when critically appraising an article.26 While an extensive review of data analysis techniques is beyond the scope of this commentary, we will describe a number of statistical concepts and procedures commonly used in the physical therapy literature. Bandy6 conducted a 2-year review of the literature published in the journal Physical Therapy and identified 10 statistical procedures that were used in 80% of the articles reviewed. These were descriptive statistics, 1-way analysis of variance (ANOVA), t tests, factorial ANOVA, intraclass correlation, post hoc analyses, Pearson correlation, regression, chi-square, and nonparametric tests analogous to t tests.6 In this commentary we will review some basic statistical concepts that we feel are important for readers performing critical appraisals. We will also discuss statistical methods used to identify between-group differences in clinical trials that use both continuous scale outcomes and dichotomous scale outcomes, with illustrations from the orthopaedic and sports physical therapy literature.

Evidence-Based Practice and the Diagnostic Process

Recently, the term "evidence-based practice" has entered the lexicon of physical therapists, as it has for most medical professionals. Evidence-based practice has been defined by proponents as "the conscientious and judicious use of current best evidence in making decisions about the care of individual patients."21(p71) Implicit in this definition is the need for a method of determining what constitutes the "best" evidence and how to apply evidence in clinical practice. Substantial effort has gone into the development and dissemination of methods for grading evidence as it relates to treatment effectiveness. Several hierarchical schemes have been promulgated for the purpose of ranking evidence from studies concerning treatment outcomes.22-24 Although the schemes have some variations, all emphasize the importance of factors such as random assignment to treatment groups, completeness of follow-up, and blinding of examiners and patients in determining the quality of evidence. Although principles for evaluating the quality of an article on treatment outcomes are relatively well known, some authors8 contend that the question being asked should determine the nature of the evidence to be sought. Therefore, when seeking to answer a diagnostic question, the rules governing the evaluation of studies regarding treatment outcomes are no longer applicable. Rules for judging evidence offered by a study of a diagnostic test have been elucidated; however, they tend to be less widely known and frequently remain unheeded by researchers designing and reporting studies in this area.25-28 Knowledge of the issues that are important for determining the strength of evidence offered by studies of diagnostic tests is important if the professional dialogue on the diagnostic process in physical therapy is to move forward within a context of evidence-based practice. Central to the concept of evidence-based practice is the integration of evidence into the management of patients. Integration cannot be reduced to a dichotomy (eg, "use the test or don't use the test") but instead involves a complex interaction between the strength of the evidence offered through use of a test and the unique presentation of an individual patient. Diagnostic tests cannot simply be deemed good or bad. The same test may provide important information for certain patients under certain conditions, but not for others. For example, testing vibration perception is useful for diagnosing a lack of protective sensation and an increased risk of ulceration in the feet of patients with diabetes.29 However, vibration perception deficits are of more limited diagnostic value in the examination of a patient suspected of having lumbar spinal stenosis.30 We will next examine further 2 aspects of evidence-based practice as they apply to the diagnostic process. First, we will discuss 2 of the most important considerations for the evaluation of the strength of evidence related to diagnostic tests: study design and data analysis.25-27 Second, we will examine the integration of the evidence into the diagnostic process.

Reliability

Reliability, in the context of physical examination procedures, is the extent to which the results of a test can be replicated when the same phenomenon is measured under the same conditions multiple times.6 The reliability of physical examination procedures can be divided into 2 components: intratester (same tester blindly repeating an examination) and intertester (level of agreement between 2 or more testers).11 Reliability is a requisite of validity; however, reliability alone does not establish validity.11 For example, 2 clinicians might agree on the results of a diagnostic test in 18 of 20 patients (high reliability) but in fact correctly categorize only 20% (low validity) according to a gold standard. In some cases, such as with goniometric measurements12 and assessment of sacroiliac13 and shoulder dysfunction,14 reliability estimates have been reported. In other cases, such as with examination procedures to detect a damaged meniscus, less is known about the intratester or intertester reliability. Unless reliability data are available, the clinician must be cautious in believing reports suggesting that a particular examination procedure is good for detecting and ruling out a particular condition and must recognize that the procedure may not perform as well in other settings or when administered by clinicians with different training and skills. From the above discussion, it becomes apparent that a careful consideration of the research methods is necessary before the results from a study and the conclusions drawn by investigators are reviewed. The research consumer must consider the potential that study methods have biased the data. Furthermore, test reliability, setting, study population, and training of the clinicians involved in the diagnostic process must be considered before generalizing an investigator's conclusions to individual clinical practices. With this foundation, it is time to interpret and apply study results and conclusions. This process requires an understanding of how sensitivity, specificity, and LRs are derived.

Reporting of Results in Treatment Studies

Results published in studies of physical therapy interventions typically include a summary of the findings from a wide variety of tests and measures that quantify the outcome variables selected by the authors to determine the effects of the intervention being studied. In some instances, such as with case reports or case series, raw data from each subject in the study may be presented. However, this approach is not realistic or warranted in studies with larger samples. More commonly, data are analyzed and reported as aggregated group results. Numerical indices are then used to describe attributes of the aggregated data. The mean or average is a measure that describes central tendency in a distribution of scores, and is most useful for variables that are on an interval or ratio scale.69 If data exhibit outliers such that the value of the mean would be distorted, the median is often reported as the measure of central tendency. The median might also be preferred over the mean when sample sizes are so small that they may not represent the target population. For example, in a case series including patients with hip osteoarthritis, MacDonald et al57 reported medians rather than means for all baseline attribute variables and for all outcome variables. If data are from nominal or ordinal scales, the mode or median scores, respectively, are reported to describe central tendency. A comparison of means is frequently used to make judgments about differences between different groups or across various time points in a study. However, means are incomplete descriptors of data because they give information only about central tendency. A more complete description of the data includes an indication of the variability in the distribution of scores (dispersion of the individual data points). The more variable the data, the more dispersed the scores will be. Among several available measures of variability, the SD is the statistic most frequently reported, together with the mean, so that data are characterized according to both central tendency and variability.69 Results are commonly reported as the mean ± SD. For example, Hall et al42 compared headache index results at 4 weeks for their treatment group (31 ± 9) and their placebo group (51 ± 15), revealing a between-group mean difference of 20 points, with somewhat greater variance in the placebo group. If the median is used as the measure of central tendency, the range or interquartile range should be used to describe the variability of the data, as the median may not always be the central value within the given range, especially when the distribution of the data is skewed.
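The choice between mean and median is easy to demonstrate. A minimal sketch (hypothetical scores, not data from the studies cited) showing how a single outlier distorts the mean but not the median:

```python
import statistics

scores = [28, 29, 30, 30, 31, 32, 95]   # hypothetical outcome scores, one outlier
print(statistics.mean(scores))          # ~39.3, pulled upward by the outlier
print(statistics.median(scores))        # 30, a better summary of the typical score
print(statistics.stdev(scores))         # ~24.6, the SD reported alongside the mean
```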

Likelihood Ratios

Sensitivity and specificity values provide useful information; however, they have several shortcomings. These values work in the opposite direction of clinical decision making. Clinicians have knowledge of the test result and want to infer the probability that the result is correct. Sensitivity and specificity values infer the probability of a correct test, given the result of the reference standard. Sensitivity and specificity values can be used as independent estimates of the usefulness of negative and positive test results, but this information cannot be combined and analyzed simultaneously. The actual performance of a diagnostic test is not only related to sensitivity and specificity values, but also dependent on the pretest probability that the condition is present. Useful tests should produce large shifts in probability once the result of the test is known.77,80,81 Sensitivity and specificity values cannot be used to quantify the shift in probability of the condition given a certain test result. The best statistics for summarizing the usefulness of a diagnostic test are likelihood ratios.82,83 Likelihood ratios (LRs) overcome the difficulties cited by combining the information contained in sensitivity and specificity values into a ratio that can be used to quantify shifts in probability once the diagnostic test results are known.84 The positive LR is calculated as sensitivity/(1 - specificity) and indicates the increase in odds favoring the condition given a positive test result. The negative LR is calculated as (1 - sensitivity)/specificity and indicates the change in odds favoring the condition given a negative test result.27 An LR of 1 indicates that the test result does nothing to change the odds favoring the condition, whereas an LR greater than 1 increases the odds of the condition, and an LR less than 1 diminishes the odds of the condition. Table 8 provides a guide for interpreting the strength of an LR.83 A positive LR indicates the shift in odds favoring the condition when the test is positive. It is desirable, therefore, to have a large positive LR. Tests with a large positive LR generally have high specificity because both values attest to the usefulness of a positive test. In the study by Calis et al,79 for example, the drop arm test had the highest specificity (97%) for determining the presence of subacromial impingement syndrome and also the largest positive LR (2.8) (Tab. 7). Because the negative LR indicates the change in odds favoring the condition given a negative result, a small negative LR will indicate a test that is useful for ruling out a condition when negative. Small negative LR values correspond to high sensitivity, as illustrated by the subacromial impingement syndrome tests. The highest sensitivity and smallest negative LR were found for the Hawkins test. A comparison of the horizontal adduction and Speed tests indicates the importance of combining sensitivity and specificity values. The sensitivity of the Speed test (69%) was less than that of the horizontal adduction test (82%). However, because the Speed test was substantially more specific than the horizontal adduction test (56% versus 28%), the negative LR was smaller for the Speed test (0.57 versus 0.65). Diagnostic tests measured on a continuous scale are frequently transformed into multilevel ordinal outcomes based on cutoff scores.
When this is the case, LR values can be calculated for each level of the test.42 Riddle and Stratford85 illustrated this process using the Berg Balance Test. Different test results were used as cutoff scores, and the LR values were calculated for each level. A more detailed explanation of the process can be obtained from the article by Riddle and Stratford.85
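The Speed test versus horizontal adduction comparison is easy to verify from the reported sensitivity and specificity values. A minimal sketch (not from the original article; the small discrepancies from the published LRs of 0.57 and 0.65 reflect rounding of the inputs):

```python
def likelihood_ratios(sensitivity, specificity):
    return sensitivity / (1 - specificity), (1 - sensitivity) / specificity

print(likelihood_ratios(0.69, 0.56))   # Speed test: LR- ~0.55
print(likelihood_ratios(0.82, 0.28))   # horizontal adduction: LR- ~0.64
```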

A real example

So far we have been discussing hypothetical cases. Let us now take a look at the use of the antinuclear antibody test in the diagnosis of systemic lupus erythematosus. I have massaged the numbers slightly to make them easier to illustrate, but they are close to reported figures in both the United Kingdom and Singapore.1 2 The prevalence of systemic lupus erythematosus is 33 in 100 000, and the antinuclear antibody test has a sensitivity of 94% and a specificity of 97%. To visualise this we need to imagine 1000 of the 10 by 10 squares used in the earlier figures (fig 11). Only one of these squares contains some patients with the disease. Figure 12 shows the result of applying the antinuclear antibody test to this population: 31 true positives, 2 false negatives, 96 900 true negatives, and 3067 false positives. There are many more true negative results than false negative results and many more false positive than true positive results. The test therefore has a superb negative predictive value of 96 900/(96 900 + 2) = 99.99% and a depressingly low positive predictive value of 31/(31 + 3067) ≈ 1%. In practice, since most diseases have a low prevalence, even when the tests we use have apparently good sensitivity and specificity we may end up with dismal positive predictive values. Knowing that the positive predictive value of this test is 1%, we may then ask: does a positive test result in a female patient with arthritis, malar rash, and proteinuria really mean that she has only a 1% chance of actually having systemic lupus? The answer is no. Look at it this way—the patient is not a member of the general population. She is from the population of people with symptoms of systemic lupus erythematosus, and in this population the prevalence is much higher than 33 in 100 000. Hence the positive predictive value of the test in her case is going to be much higher than 1%.
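The closing point, that the same test performs very differently once the patient comes from a symptomatic population, can be sketched with a Bayes-style PPV calculation. The 20% symptomatic pretest probability below is an assumed illustrative value, not a figure from the article:

```python
def ppv(prevalence, sensitivity, specificity):
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

print(ppv(0.00033, 0.94, 0.97))   # ~0.01: about 1% in the general population
print(ppv(0.20, 0.94, 0.97))      # ~0.89: assumed symptomatic pretest probability
```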

The Chi-Square Statistic

Studies of diagnostic tests comparing categorical results of a test and a reference standard are frequently analyzed with a chi-square statistic and an accompanying significance level. The chi-square statistic tests the hypothesis that the test results and reference standard have no association, but it does not indicate the strength or direction of any relationship that exists.90 Chi-square statistics and associated probability values cannot assist in the process of probability revision based on test results in individual patients and, therefore, cannot be considered evidence-based statistics.91 Conclusions based strictly on chi-square analyses can be misleading without information on sensitivity, specificity, and LR values.

The study by Burke et al43 on diagnostic tests for patients with suspected CTS illustrates this concern. One diagnostic test examined by the authors was the patient self-report of hand swelling, graded as present (positive) or absent (negative), against a reference standard of response to 2 weeks of splinting. The reference standard was graded as "positive response to splinting" or "no response to splinting" based on patient self-report.43 The authors chose to analyze the data using a chi-square test only and found a statistically significant result (P = .028) (Tab. 9). The authors concluded, "These data suggest that the complaint of subjective swelling in the hand or wrist may be one of the most important findings from the history and clinical examination for determining which patients will, in fact, respond to conservative treatment (splinting)."43

The sensitivity, specificity, and LR values calculated from the data do not support this conclusion. The sensitivity (33.3%) and specificity (49.8%) were low, resulting in a positive LR of 0.66 and a negative LR of 1.34 (Tab. 9). Both LR values are close to 1, with the negative LR slightly greater than 1 and the positive LR slightly less than 1, indicating that the weak relationship between a complaint of swelling and response to splinting is in an inverse direction (ie, the absence of a complaint of swelling is associated with an increased likelihood of response to splinting). Because evidence-based statistics were not reported, we believe that the authors overinterpreted the utility of the test. This example illustrates the necessity of reporting sensitivity, specificity, and LR values to permit an appropriate assessment of a diagnostic test and its interpretation for individual patient decision making.
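The contrast can be reproduced numerically. In the sketch below, the 2 × 2 counts are hypothetical, chosen only so that the sensitivity and specificity roughly match the values reported above; they are not the raw data of Burke et al, so the printed P value will not match theirs exactly.

    from scipy.stats import chi2_contingency

    #                reference +  reference -
    table = [[33, 50],   # swelling reported (test positive)
             [66, 50]]   # no swelling (test negative)

    (tp, fp), (fn, tn) = table
    chi2, p, dof, expected = chi2_contingency(table)

    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    print(f"chi-square P = {p:.3f}")             # "significant" at the .05 level...
    print(f"LR+ = {sens / (1 - spec):.2f}")      # ...yet LR+ ~0.67
    print(f"LR- = {(1 - sens) / spec:.2f}")      # ...and LR- ~1.33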

The Study Population

Subjects included in a study of a diagnostic test should consist of individuals who would be likely to undergo the test in clinical practice.26,53 This also means that individuals who are positive on the reference standard should reflect a continuum of severity, from mild to severe, whereas those who are negative with respect to the reference standard should have conditions commonly confused with the condition of interest and should not be a group of control subjects without impairments or disabilities.34 Many of the tests already cited in this perspective have used groups of subjects without impairments or disabilities who were chosen out of convenience. When subjects without any symptoms, impairments, or disabilities are tested, this does not reflect the way most tests are applied clinically, where distinctions between individuals with similar symptoms are required. Any test should at least be expected to demonstrate greater diagnostic accuracy when attempting to distinguish between individuals without symptoms and those with severe conditions.56

Spectrum (or selection) bias may occur when study subjects are not representative of the population on whom the test is typically applied in practice.26 Spectrum bias, in our opinion, can profoundly affect the results of a study.26 The best method of ensuring a representative sample and avoiding spectrum bias is to utilize a prospective cohort design with a consecutive group of subjects from a clinical population. Use of a case-control design with retrospective selection of subjects for inclusion makes a study susceptible to spectrum bias.53 This type of design occurs when a group of subjects with the condition of interest and a group of comparison subjects are assembled for examination. Even if the use of subjects without known impairments or disabilities is avoided, case-control designs can distort the typical mix of subjects seen in a clinical setting by artificially controlling the prevalence and presentation of the condition of interest, potentially affecting the accuracy and utility of a diagnostic test.28,54,67,68

A comparison of studies examining the diagnostic accuracy of the Phalen test for detecting median nerve compression in the carpal tunnel provides an example of the impact of spectrum bias (Tab. 3). The study by Burke et al43 and 2 other studies69,70 compared the Phalen test against a reference standard involving nerve conduction velocity studies. Similar criteria for judging the reference standard were used in the 3 studies. The description of how the diagnostic test was performed and the grading criteria were nearly identical in 2 studies, but they were not reported in the third study. The greatest difference among the studies was the subjects. In 2 studies,43,70 there were cohorts of subjects with symptoms consistent with CTS. In the third study,69 subjects included those with symptoms consistent with CTS, a few with known diagnoses other than CTS but with a similar presentation (eg, diabetic peripheral neuropathy), and 25 subjects (50 hands tested) without symptoms consistent with CTS. Inclusion of people without symptoms creates a spectrum bias by assembling a population unrepresentative of the clinical population in which the test is typically used. As would be anticipated, the study most subject to spectrum bias also demonstrated the highest level of diagnostic accuracy for the Phalen test (Tab. 3).

...

Successful integration of individual clinical expertise, patient values and expectations, and the best available evidence requires familiarity and skill with the EBP process. Formulating an appropriate question, performing an efficient literature search, critically appraising the best available evidence relative to treatment, applying the best evidence to clinical practice, and assuring proficiency with the process will ultimately lead to improved care for our patients. Developing proficiency in the 5-step EBP process requires strong dedication and effort, and can be quite challenging initially. However, as with any skill attainment, the process gets easier and faster with practice and experience.

This first commentary in a 2-part series reviewed principles relating to formulating foreground questions, searching for the best available evidence for treatment effectiveness, and determining validity of results in studies of interventions for orthopaedic and sports physical therapy. Part 2 of this series will assist readers in interpreting results, applying results to patient care, and evaluating proficiency with EBP skills.

Sensitivity and specificity

Tests are rarely 100% accurate, so false positives and false negatives can occur. The findings of a test are generally plotted against the actual diagnosis in a "two by two" or "truth" table (Figure 1). The characteristics of a diagnostic test, defined in Table 1, are calculated from the truth table.

Where sensitivity or specificity is extremely high (98% to 100%), interpretation of test results is simple. If the sensitivity is extremely high, we can be sure that a negative test will rule the disorder out, because there can be very few false negatives (Cell c). If the specificity is extremely high, we can be sure that a positive test will rule the disorder in, because there can be very few false positives (Cell b). These rules can be recalled using the mnemonics SnOUT and SpIN:

SnOUT: if Sensitivity is high, a negative test will rule the disorder OUT.
SpIN: if Specificity is high, a positive test will rule the disorder IN.

Table 2 shows some SpINs and SnOUTs of interest to physiotherapists. These have been chosen on the basis of their high sensitivities and specificities. Unfortunately, it is rare for sensitivity and specificity to reach such giddy heights.

Sensitivity and specificity tell us how often a test is positive and negative in people whom we already know to have, or not to have, the condition. Clinically, however, we do not initially know whether or not our client has the condition. In this case, it is essential that we know how to interpret a positive or negative test result.

Evaluating the Evidence—Data Analysis

The basic layout for the data analysis in a study of a diagnostic test is depicted in Table 1. The result for each subject fits into only 1 of the 4 categories based on a comparison of the results of the diagnostic test and the diagnosis based on the reference standard. Results in categories "a" (true positive) and "d" (true negative) represent correct test results, whereas categories "b" (false positive) and "c" (false negative) contain erroneous results. From this basic layout, several statistics can be calculated (Tab. 4).56 The overall accuracy of a test can be determined by dividing the number of correct results by the total number of tests conducted.56 A perfect test would have an overall accuracy of 100%; however, no test used in clinical practice can be expected to demonstrate this level of accuracy, and the goal is to characterize the nature of the errors.71 The overall accuracy of a test does not distinguish between false positive and false negative results and therefore has limited usefulness.72
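As a minimal sketch, the statistics described here can be read straight off the four cells of the table; the counts below are arbitrary illustrative numbers.

    # Cells of the 2 x 2 table: a = true positive, b = false positive,
    # c = false negative, d = true negative.
    def table_stats(a, b, c, d):
        total = a + b + c + d
        return {
            "overall_accuracy": (a + d) / total,  # correct results / all tests
            "sensitivity": a / (a + c),           # positives among those with the condition
            "specificity": d / (b + d),           # negatives among those without it
        }

    print(table_stats(a=40, b=10, c=5, d=45))     # accuracy 0.85, but b and c differ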

STEP 3C. CRITICALLY APPRAISING THE LITERATURE: HOW CAN I APPLY THE RESULTS TO PATIENT CARE?

The final question in a critical appraisal of evidence involves a series of deliberate judgments about the relevance and applicability of the evidence to a specific patient in the context of a specific clinical setting. An evidence-based practitioner will need to decide whether the patient under consideration is sufficiently similar to the patients in the study or group of studies for the results to be relevant. For example, the clinician should determine whether the patients enrolled in the study were similar to his/her own patient, considering the inclusion and exclusion criteria, age, gender, race, sociodemographics, stage of illness, comorbidity and disability status, and prognosis.

Next, the practitioner must integrate patient values, preferences, and expectations in shared decision making when selecting a particular treatment. The evidence will be relevant to a given patient only if the outcomes measured in the clinical trial are consistent with the individual patient's goals. Consideration must also be given to whether the treatment as structured in the research study is acceptable to the patient. Many issues must be considered, such as the anticipated frequency and duration of patient visits, the cost of the treatment, possible discomfort or other adverse effects of the intervention of interest and of competing interventions (such as injections, surgery, or other noninvasive interventions), and how consistent the treatment is with patient expectations. This final question also prompts the practitioner to integrate personal clinical expertise. Some treatments require specialty skills or specific equipment that may not be currently available and may not be obtainable in a reasonable amount of time to help a particular patient.

Critical appraisal is an essential skill for an evidence-based practitioner. Although applying the principles outlined above for critical appraisal may be difficult to master initially, the process becomes much easier with practice. Critical appraisal using these principles is the best method to facilitate independent professional judgments about the validity, strength, and relevance of evidence for therapy. A checklist to organize key judgments during a critical appraisal of an RCT is included in APPENDIX A.

Conclusions

The process of diagnosis is an essential task for physical therapists because it serves as the link between examination findings and interventions. To be able to examine diagnosis from an evidence-based perspective, we argue that therapists need to be familiar with the standards defining the "current best evidence" and how the evidence can be used for "making decisions about the care of individual patients."21(p71) The standards relate to several aspects of study design and data analysis. An important first step is to define the purpose for which a diagnostic test is used. The purpose should be reflected in the choice of a reference standard (measurement) against which the results are compared. Both the diagnostic test and the reference standard should be applied consistently in all subjects and judged by blinded examiners. The study sample should be representative of the type of patients on whom the test is typically used in the clinical setting.

The best statistics for application in individual patient decision making are LRs, because they can be used to quantify probability revision based on positive or negative test results. The application of evidence to patient management requires an understanding of probability and the shifts in probability caused by a certain test result. Systems of patient management that link diagnostic tests with interventions may produce less favorable results when the diagnostic process within the system is not evidence-based. More studies are needed to examine commonly used diagnostic methods in physical therapy. The evidence provided by past and future studies should be applied to the management of patients in order to make the practice of physical therapy more evidence-based.

CONCLUSIONS

The special tests performed in the physical examination of patients with musculoskeletal injuries are learned skills. Proficient performance requires formal instruction in and practice of the proper technique. Skilled performance, however, does not assure accuracy. Athletic trainers should understand the value and limitations of these special tests, so that test results are interpreted in the context of the full examination. Furthermore, examination procedure performance characteristics should be considered in the development of the athletic training curriculum. Procedures that are poor discriminators may be best left out of instruction so that students may focus their attention on mastering those skills most useful in clinical practice. Lastly, research into the various aspects of injury evaluation by ATs is needed. Little in the athletic training literature describes how well these clinicians recognize, evaluate, and assess ill and injured athletes, despite the prominence of these responsibilities in the professional and educational standards.

Evaluating the Evidence—Study Design

The strength of evidence provided by any study will be substantially affected, and potentially limited, by the study's design. The optimal study design is the one that most effectively reduces susceptibility to bias (ie, a deviation of the results from the truth in a consistent direction).27,31 For studies investigating treatment outcomes, the design best accomplishing this objective is recognized as the randomized clinical trial. However, if the research question is one of diagnosis, the randomized trial is no longer the most desirable design. The optimal design for examining a diagnostic test, in the opinion of experts, is "a prospective, blind comparison of the test and the reference test in a consecutive series of patients from a relevant clinical population."26(p1062) That is, a study investigating a diagnostic test should utilize a prospective cohort design in which all subjects are evaluated using the diagnostic test or tests and a reference standard representing the definitive, or best, criteria for the condition of interest. When performed in this manner, the results of the test and the reference standard can be summarized in a 2 × 2 table, as depicted in Table 1.

Issues beyond the basic design of a study are important for determining the extent to which the potential for bias has been minimized and for determining the strength of the evidence. For studies of diagnostic tests, the most important issues are the reference standard, the diagnostic test, and the population studied. The most important considerations for each issue are summarized in Table 2 and are described below.

Ottawa knee rule

Traumatic knee injuries account for about 1 million ED visits annually in the USA.16 Approximately 80% of patients underwent radiography prior to the development of knee rules, with over 94% of radiographs being negative for acute fracture.14 The Ottawa knee rules were also prospectively derived (N = 1054 patients)31 and prospectively validated (N = 1096).32 They incorporate simple, well-defined historical and physical findings to determine whether patients require radiography of the knee following a traumatic injury (Fig. 3).

Since their development, there have been further studies to determine their clinical impact. An implementation trial followed the successful validation of the rule.33 This study was a controlled clinical trial with before-after and concurrent controls, and it followed all 3907 patients with acute knee injuries who presented to one of the four study sites during the 2-year study period. The intervention sites had a relative reduction in knee radiography of 26%, versus a relative reduction of 1% for the control sites. No fractures were missed by applying the rule, and patients who were discharged without radiography spent an average of 33 min less in the ED than those who underwent radiography.

An economic analysis was conducted using methodology similar to that for the ankle rules.18 It found that the knee rules were cost-effective. The mean cost savings were US$31 (95% CI: 22 to 44) per Canadian patient, US$34 (95% CI: 24 to 47) per US Medicare patient, and US$55 (95% CI: 34 to 90) per knee injury patient for fee-for-service costs in the USA.18 Given that more than one million Americans present to the ED each year for acute knee injuries, the potential cost savings of using the knee rules are substantial.16,14

The international survey looking at the awareness and use of the ankle rules also had questions addressing the Ottawa knee rule.11 The survey found that physicians reported much less awareness of this rule than of the ankle rules. English-speaking countries (Canada, the US, and the UK) reported the highest awareness, with 63%, 53%, and 29%, respectively. In France, awareness was reported by just 12%, and even fewer in Spain, with just 8% reporting awareness of the rule. This corresponded to use rates among all respondents ranging from a high of 17% in Canada to a low of 3% in France.11

A meta-analysis was also conducted for the Ottawa knee rule. This study pooled data from six studies with 4249 patients with acute knee injuries.2 It found that, assuming a fracture prevalence of 7%, a negative result on the knee rule corresponds to a fracture risk of less than 1.5%.
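Because the rule is a simple disjunction of findings, its logic is easy to encode. The sketch below uses the criteria as they are widely published (the figure cited above is not reproduced here), so treat the wording of each criterion as an assumption rather than a quotation from this article.

    # Ottawa knee rule sketch: radiography is indicated if ANY criterion is met.
    def knee_radiography_indicated(age,
                                   isolated_patellar_tenderness,
                                   fibular_head_tenderness,
                                   cannot_flex_to_90,
                                   cannot_bear_weight_4_steps):
        # cannot_bear_weight_4_steps: both immediately after injury and in the ED
        return (age >= 55
                or isolated_patellar_tenderness
                or fibular_head_tenderness
                or cannot_flex_to_90
                or cannot_bear_weight_4_steps)

    # A 30-year-old with none of the findings would not need a radiograph.
    print(knee_radiography_indicated(30, False, False, False, False))   # False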

Validity

Validity is a complex subject, and our purpose is ultimately to demonstrate the calculation and application of likelihood ratios in clinical practice. These apparently foreign issues are, however, closely linked. Validity has been defined as "the extent to which a test measures what it is intended to measure,"5 or, stated differently, "a term used to characterize the overall correctness of a test."6 Messick,7 however, clarified the issue when he stated that "validity is not a property of a test or measurement as such but rather the meaning of the test scores." Data are the results of measurement. Thus, in the context of physical examination procedures, validity really boils down to whether the data collected (positive or negative results) reflect the reality of the condition of each patient. We shall see that data derived from physical examination procedures with high positive likelihood ratios (LR+) and low negative likelihood ratios (LR−) are more likely to reflect the presence or absence of a condition, respectively, than those with low LR+ or high LR−.

Likelihood ratios for physical examination procedures are derived by comparing the results of a procedure of interest (eg, Lachman test3) with the results of a previously validated examination (eg, arthroscopy), often referred to as the "gold standard." Because an entire population of patients cannot be studied, investigators can only estimate the true performance characteristics (including LR+ and LR−) of a diagnostic test. Likelihood ratios are, in fact, data that provide estimates of how well physical examination procedures measure what they are intended to measure, namely the presence or absence of a medical condition.

Likelihood ratios (the data derived from studies of diagnostic testing) can be influenced, or biased, by the design and study methods employed to assess the performance of physical examination procedures and other diagnostic tests. Therefore, before we address the interpretation and application of LRs, it is important to identify methodologic issues that may threaten the validity of LR estimates.

Hierarchy of Evidence

When evaluating evidence for effectiveness of an intervention, clinicians often find it helpful to use a system to determine the level of evidence for a particular study. A level of evidence is a label reflecting a study's position on the hierarchy of evidence, providing a rough indication of inherent protections against validity threats, or sources of bias, based on the study design and methods.41 After assessing over 100 different systems for rating the strength and quality of evidence, the Agency for Healthcare Research and Quality identified 7 systems that fully address all important domains for a body of evidence.5 Among these 7 systems is one developed by David Sackett and colleagues, freely accessible from the Centre for Evidence-Based Medicine website.1 Levels of evidence applicable to studies exploring the efficacy of clinical treatments were extracted from that system and presented with descriptions in the accompanying table. In addition to identifying the level of evidence on the hierarchy, therapists must also consider critically appraising the study's overall quality and the study's internal and external validity prior to implementing the results in clinical practice.

Biases in Studies of Physical Examination Procedure Performance

When we review studies of diagnostic procedures, our first consideration is: what was the gold standard used for comparison? Authors of a report should provide evidence that the gold standard is adequately accurate for comparison purposes within the context of a study.8,9 Direct observation during surgery is often the most accurate method of confirming musculoskeletal injury. Diagnostic imaging, while highly useful, is not perfectly accurate. Thus, when MRI, for example, is employed as a gold standard for comparison in a study of clinical examination procedures, the values of specificity, sensitivity, and LRs (described in detail shortly) for the MRI should be provided to the research consumer. The consumer must then decide if the gold standard is sufficiently accurate for comparison purposes.

The comparison with an established standard, however, is not the sole concern when evaluating the methods of a study of diagnostic procedures. Bias can be introduced through subject selection8,9 and other methodologic issues. A good study includes a spectrum of patients to whom the test in question would typically be applied in a clinical setting. Spectrum bias is introduced when only patients very likely to have a condition (based on history or other criteria) are studied or when patients who clearly do not have a condition are included.6,8 The results of a study can also be influenced by "work-up" bias. Work-up bias may exist if the gold standard is not applied to everyone in the study.8 For example, if MRI is ordered only for those thought most likely to have sustained an ACL tear, the number of false-negative results on the criterion measure (eg, Lachman test) may be underreported.

Bias may also be introduced when examiners are not blinded to the gold standard test results. We are all human, and we tend to find what we expect, such as a positive Lachman test in a patient with an MRI identifying a torn ACL. Furthermore, and for the same reason, the gold-standard interpreters must be blinded to the clinical results.8,9 For example, if 20 people were tested clinically for ACL deficiency with a Lachman test and 10 were found to be positive, the radiologist reading the MRI should not be aware of which subjects had laxity in the clinical tests.

One final consideration exists when reading research related to diagnostic testing: the generalizability (also referred to as external validity) of the investigators' conclusions.10 Readers need to judge how closely the study setting and patient sample reflect their own environment. Furthermore, readers must be sensitive to the training and expertise of the health care professionals performing, administering, and interpreting the tests of interest. This issue is, in our opinion, of particular concern in athletic training and is discussed further at the end of this paper.

Forming a Question

The first and often most difficult step is the development of a well-built clinical question that facilitates a literature search, ultimately leading to the best evidence available to remove or optimally reduce clinical uncertainty.44 There are 2 types of clinical questions: background and foreground. Background questions are developed to enhance knowledge relative to a specific disorder.46 For example, a clinician may ask "What causes carpal tunnel syndrome?" or "Why do patients develop coronary artery disease?" While these background questions will lead clinicians to information regarding the specific pathology,46 they usually do not provide the clinician with up-to-date information about optimal treatment options for the patient.

Foreground questions of therapy are developed in response to the need to identify evidence regarding the use of a specific intervention in the management of a particular patient.46 As it is the purpose of this commentary to discuss studies of treatment effectiveness, foreground questions will remain the focus of this section. Foreground questions of therapy consist of 4 components: (1) a patient or problem, (2) an intervention, (3) a comparison intervention (if relevant), and (4) an outcome.46 These 4 components may be referred to as PICO (patient, intervention, comparison, outcome). Some examples of foreground questions, including these 4 components, are as follows: (1) In a 38-year-old female with carpal tunnel syndrome, what is the efficacy of exercise and ergonomic interventions compared to no treatment for decreasing pain and disability? (2) In a 43-year-old female with plantar fasciitis, are custom-fit orthotics more effective than prefabricated orthotics in decreasing plantar foot pain?

