
Chapter 212 Art and Science of Guideline Formation

Clinical practice guidelines have become an integral part of the practice of medicine. They are meant to be used by physicians as resources to consider when making treatment decisions for individual patients. They are also frequently used by various organizations for policy and payment decisions. As of February 2009, 2408 sets of clinical guidelines were listed on the National Guidelines Clearinghouse (NGC), with 479 additional guideline sets registered as "in progress" (http://www.guideline.gov). One hundred seventy-four guideline sets in this one database focus on disorders of the spine. Only 25 of these were produced by organized spine surgery, sponsored by either the AANS/CNS Section on Disorders of the Spine or the North American Spine Society. These sets do not include the myriad "technology assessments" commissioned by third-party payers, nor do they include a multitude of guidelines, evidence-based reviews, evidence-informed consensus statements, or other similarly titled systematic literature reviews published and distributed outside of the NGC system. Clinical practice guidelines are here to stay and have proven to be important for the assessment of current best practices, guidance for future research, and defense of unpopular yet effective treatment strategies. The purpose of this chapter is to describe how guidelines are created, both in the ideal situation and in the real world.

Author Group

One of the most useful tools for learning about evidence-based medicine, guidelines, and the application of guidelines to the real world is a small text by David Sackett and the McMaster University group called Evidence Based Medicine.1 We refer to this text several times in this chapter when discussing how to rate evidence and how to apply evidence to clinical situations. In the chapter devoted to the creation of clinical practice guidelines, Dr. Sackett offers the reader a pointed warning, in essence advising practicing clinicians against undertaking guideline development themselves.

Despite this warning, it is absolutely critical that physicians with clinical expertise participate in the formation of clinical practice guidelines. Although epidemiologic support is necessary for the analysis of study design, clinical data cannot be accurately interpreted, and the translation of data into recommendations cannot be made, without an understanding of the clinical significance of the data. This understanding does not come from textbooks. A more reasonable interpretation of Sackett's statement is that it is not efficient or desirable to have individual groups spend the resources to develop practice guidelines at a local level. It makes more sense to have guidelines produced at a national level and to leave the interpretation of those guidelines to the local experts. A series of review articles published in the Journal of the American Medical Association by the same author group offers detailed explanations of many of the concepts discussed in this chapter. That level of detail is beyond the scope of this review, but the reader is encouraged to use these articles as references for further inquiry.2–20

High-quality guidelines ideally have an author group that consists of a multidisciplinary panel of recognized experts in the disease process studied. Depending on the disorder studied, multidisciplinary may mean two related specialties (e.g., orthopaedics and neurosurgery for cervical spine trauma—no other specialties regularly deal with this issue) or perhaps members drawn from five or six disparate specialties (e.g., the American College of Radiology imaging appropriateness criteria, in which multiple specialties treat common clinical scenarios such as low back pain). Epidemiologic support is also crucial, and having an epidemiologist on the author panel is an ideal solution. All panel members should have some understanding of basic statistical methods and access to a statistician.

Conflict of interest is an important issue in the formation of a guidelines author group. Disclosure of such conflicts is the first step in managing them, and the organizing body, be it a medical society, university work group, or insurance carrier, must decide how to manage or resolve the conflict. In some situations, compromises are necessary in order to garner sufficient topical expertise. In most situations, however, author groups can be constructed and organized to mitigate the possibility of industry-related conflicts. It is our opinion that industry-sponsored "study groups" are an inappropriate source for clinical practice guidelines because the membership and strategic direction of these panels may be easily influenced by the sponsoring body. Similarly, technology assessments produced by centers that are funded largely by third-party payers cannot be considered practice guidelines, since they are paid for by entities primarily seeking to limit economic exposure rather than to evaluate clinical efficacy. Furthermore, these panels notoriously lack relevant physician input and tend to place a higher value on study design and author interpretation of data than on common sense and clinical fact. (For example, go to www.ecri.org and review their assessment of "decompressive procedures for lumbosacral pain." You will note that the author group contained only one physician, an Emergency Care Research Institute [ECRI] employee who practices internal medicine. No input was solicited from spine surgeons, physical therapists, rehabilitation physicians, or other specialists, and the topic is clearly ridiculous to anyone who regularly cares for these patients; decompression is not done as a treatment for low back pain, it is done for radiculopathy or stenosis.)

Those in the field of organized spine surgery, including the American Association of Neurological Surgeons and Congress of Neurological Surgeons Joint Section on Disorders of the Spine (Spine Section) and the North American Spine Society (NASS), have been active in guidelines development. The first significant product produced using modern evidence-based review techniques was the set of clinical practice guidelines dealing with cervical spine and spinal cord injury.21 The author group was recruited by Mark Hadley and consisted exclusively of neurosurgeons, both because of the funding agency (the Spine Section) and because of relative inexperience in guidelines formation. The group included general neurosurgeons, pediatric neurosurgeons, and neurosurgical spine specialists. Beverly Walters, a neurosurgeon who had trained in clinical epidemiology at McMaster University, served as the epidemiologist. Each of the authors was employed at an academic center and had the support of local expertise in library science and statistics if necessary. The authors were tutored in evidence-based medicine techniques during four week-long sessions in order to solidify their ability to interpret the medical literature.

These guidelines were unique in the spine world and were qualitatively different from the various consensus-based guidelines that had been published previously (e.g., the NASS Low Back Pain Treatment Guidelines published in 1999). Because they applied to a relatively small patient population, and because they were originally published as a supplement to Neurosurgery, a journal with virtually no penetration into emergency medicine or orthopaedics, they did not receive immediate attention. With the exception of the chapters dealing with the administration of steroids and the safety of traction reduction without MRI, very few recommendations were considered controversial.21

The AANS/CNS Spine Section was then charged with organizing a set of guidelines dealing with the topic of lumbar fusion. The section actively sought input from orthopaedic surgeons and physical medicine specialists in addition to neurosurgeons. Beverly Walters agreed to continue in an advisory capacity, and several members of the cervical spine injury group, including Mark Hadley, were recruited to lend their expertise to the project. Because of the novelty of the process and the time commitment (a month away from home, in addition to the time spent working on the project), it was difficult to recruit non-neurosurgeons. After four well-known orthopaedic surgeons declined, the chairman of the NASS clinical care council, Bill Watters, volunteered himself and helped recruit Jeff Wang from UCLA to serve as the orthopaedic representatives on the panel. We were unable to recruit a physical medicine and rehabilitation physician to the panel, despite overtures to both local and national contacts.

Since the publication of the lumbar fusion guidelines, the visibility of guidelines formation has increased substantially. The economic effect of the recommendations, the timeliness of the publication in relation to a political and popular examination of lumbar fusion, a more easily searchable publication format, and inclusion in the NGC substantially improved the penetration of these guidelines compared with that of the cervical spine injury guidelines. Vocal objections to the formation of clinical practice guidelines by "grass roots" neurosurgeons (via the Council of State Neurosurgical Societies) and others focused attention on the process. The use of guidelines to support continued patient access to spinal surgeons in Washington State and in several national insurance plans by a coalition of national organizations, including the AANS, CNS, NASS, American Academy of Orthopaedic Surgeons (AAOS), and Scoliosis Research Society (SRS), further highlighted the importance of such activities.

Subsequent guidelines efforts sponsored by the spine section or NASS have uniformly included broad representation of relevant clinical specialties. Both organizations require intensive training of author panel members. The AANS/CNS guidelines committee continues to rely on a didactic series of lectures developed by Beverly Walters and moderated by the chairs of the guidelines committee (currently Mark Linskey and Tim Ryken). The NASS has employed an online training module combined with on-the-job training. Bill Watters and Chris Bono have effectively used the NASS infrastructure to develop a primarily web-based mechanism for guidelines formation. Both organizations have now developed a cadre of well-trained clinician authors, both support multidisciplinary guidelines formation, and both support consultation with professional epidemiologists as needed.

Literature Search

The availability of computerized search engines has greatly simplified the identification of potentially useful references. Most guidelines groups use two different search engines and databases to ensure a thorough search. Familiarity with MeSH headings, or consultation with a librarian, is very useful in creating an effective search that is not overly inclusive. Unfortunately, the era of electronic publishing has greatly increased the number of potentially useful references (when just the title and abstract are available for initial screening), and it is not uncommon to obtain several hundred or even several thousand references that require individual review. Several strategies can be used to speed this process. First of all, if sufficient high-quality evidence, such as several concordant randomized trials, exists, lower-quality evidence may be ignored except as background information. For example, about 7 billion papers deal with the use of microdiscectomy for lumbar radiculopathy (OK, an exaggeration). Of these papers, 99.9% are case reports, small case series, technical notes, or historic anecdotes. There are a few large cohort studies with admittedly fatal flaws. Fortunately, several attempted randomized studies have been published within the last few years22,23 that provide higher-quality evidence than all of the other papers. Instead of spending months describing each case series, we can focus our review on a detailed analysis of the higher-quality papers and simply summarize the findings of the various case series. If the primary references are flawed, however, then we must incorporate the lower-quality evidence into the analysis.
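
To make the search step concrete, the short Python sketch below queries PubMed through the NCBI E-utilities esearch endpoint. The MeSH-based query string is a hypothetical illustration of a focused search, not the actual strategy used by any of the guideline groups discussed here; a real search would be designed with a librarian.

```python
import json
import urllib.parse
import urllib.request

# Minimal sketch of a PubMed search via the NCBI E-utilities "esearch"
# endpoint. The MeSH-based query is a hypothetical example of a focused
# search for randomized trials of discectomy for radiculopathy.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

query = (
    '"Diskectomy"[MeSH Terms] AND "Radiculopathy"[MeSH Terms] '
    'AND randomized controlled trial[Publication Type]'
)

params = urllib.parse.urlencode(
    {"db": "pubmed", "term": query, "retmax": 200, "retmode": "json"}
)

with urllib.request.urlopen(f"{BASE}?{params}") as resp:
    result = json.load(resp)["esearchresult"]

# "count" is the total number of matching records; "idlist" holds up to
# retmax PMIDs that would then go on to title/abstract screening.
print(f"{result['count']} records match the query")
print(result["idlist"][:10])
```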

Another way to speed up the literature search and review is to set minimum acceptable criteria for inclusion in the database. This is the strategy used by the Cochrane group, who consider only randomized clinical trials (RCTs) as evidence worthy of review.24 Although this strategy certainly speeds up the review process, many relevant questions in the surgical realm are not particularly amenable to study via RCT. If the Cochrane criteria were applied to the surgical management of symptomatic intracranial extradural hematomas, the conclusion would necessarily be that there is no evidence to support the evacuation of such hematomas: no RCT has ever been performed on this patient population. Although academically pure, adherence to such high standards breaks down in the trenches. A very humorous article in the British Medical Journal pointed out that because skydiving with a parachute is associated with occasional fatality, and because survival after falling out of a plane with no parachute, or a malfunctioning one, has been described, in the absence of a randomized trial it must be concluded that there is no evidence to support the use of a parachute to increase survival when jumping out of an airplane.25
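
A minimum inclusion criterion of this kind is easy to express as a mechanical screen over retrieved records, as in the hedged sketch below. The record layout and the "publication_types" field are hypothetical stand-ins for whatever a reference database actually exports, not any group's real screening pipeline.

```python
# Sketch of enforcing a Cochrane-style minimum inclusion criterion over
# records already retrieved. The record layout and the "publication_types"
# field are hypothetical stand-ins for a real reference database export.
def meets_minimum_criteria(record: dict) -> bool:
    """Keep only randomized controlled trials."""
    return "Randomized Controlled Trial" in record.get("publication_types", [])

records = [
    {"pmid": "100001", "publication_types": ["Randomized Controlled Trial"]},
    {"pmid": "100002", "publication_types": ["Case Reports"]},
    {"pmid": "100003", "publication_types": ["Comparative Study"]},
]

included = [r for r in records if meets_minimum_criteria(r)]
print([r["pmid"] for r in included])  # -> ['100001']
```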

Evidence Grading

Once a dataset of relevant papers has been created, the papers must be read by several members of the author group and graded in terms of the quality of evidence provided. Several grading schemes are commonly used; however, most current guidelines use either a three- or five-point scale, with appropriately designed and performed RCTs being considered the highest level of evidence and expert opinion holding the lowest spot in the rankings. Criteria of quality do differ, however, according to the type of question asked (see Sackett, p. 173, for a useful summary table).1 For example, in evaluating a diagnostic test, if a “gold standard” exists, then a simple comparison between the new test and the gold standard in a single patient population with adequate reporting of results (true and false positives, true and false negatives) is considered high-quality evidence. In the therapeutic realm, where most of our questions exist (what is the best treatment for a patient with a known diagnosis), RCTs are king, with cohort studies (in which two groups of patients are treated for the same disease process with two different strategies) and case-control studies (in which characteristics of a group of interest are compared with characteristics of the general population) providing intermediate levels of evidence above that provided by case series (in which there is nothing to compare the results to) or case reports.1
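
To make the diagnostic-test comparison concrete, the sketch below computes the standard accuracy measures for a new test against a gold standard from a two-by-two table. All of the counts are invented purely for illustration.

```python
# Accuracy of a new diagnostic test against a gold standard, computed from
# a two-by-two table. All counts are invented purely for illustration.
tp, fp = 90, 15   # test positive: with disease / without disease
fn, tn = 10, 85   # test negative: with disease / without disease

sensitivity = tp / (tp + fn)   # fraction of diseased patients detected
specificity = tn / (tn + fp)   # fraction of disease-free patients cleared
ppv = tp / (tp + fp)           # positive predictive value
npv = tn / (tn + fn)           # negative predictive value

print(f"sensitivity = {sensitivity:.2f}")   # 0.90
print(f"specificity = {specificity:.2f}")   # 0.85
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")  # 0.86, 0.89
```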

Identifying the type of study used can sometimes be tricky, and even when studies appear to be well designed, they often have flaws that result in downgrading of the evidence to a lower class. The most common reasons for downgrading evidence derived from clinical studies are flaws in study design, the selection of the study sample, and the nature and quality of the outcomes measures. The Spine Patient Outcomes Research Trial (SPORT) is an example of how problems with study design, whether planned or not, can decrease the quality of evidence derived from randomized studies. The SPORT investigators set out to perform a randomized controlled clinical trial to establish the efficacy of surgery for one of three disorders: lumbar disc herniation, spondylolisthesis, or lumbar stenosis.23,26–28 Patients were screened for eligibility and then offered the opportunity to participate in the clinical trials. The first methodologic concern relates to the fact that only about a quarter of eligible patients consented to participate in the study. The fact that most of the eligible patients declined participation immediately raises the concern that the patients who did consent were different from the general population: perhaps these patients had less severe symptoms, perhaps they were already improving, or perhaps they had a genetic predisposition toward risk taking. To their credit, the investigators did keep track of a group of patients who declined randomization to try to address this concern.

Once patients were randomized, their results were analyzed on an "intent-to-treat" basis. This means that a patient's results stayed with the group to which the patient was assigned, regardless of what happened to the patient. Therefore, if a patient was assigned to nonoperative management, failed, then had surgery, and had a great result from the surgery, the great result was credited to the nonoperative management. Patients who fail in the surgical arm cannot cross back over to the nonoperative arm. This type of analysis creates a significant bias against any nonreversible intervention such as surgery. In fact, if crossover is high enough, then the analysis must be abandoned, which is what happened. A tremendous amount of crossover took place in both directions, creating comparison groups that did not differ in terms of the treatment received. About half of the patients who were randomized to nonsurgical treatment had surgery, as did about half of the patients randomized to surgery, in all three arms of the study.23,26–28 No matter how effective a treatment is, it is impossible to detect a difference between groups if the same proportion of patients in each group receives that treatment. The authors had to resort to an "as treated" analysis, that is, a report of what actually happened to the patients. In all three studies, patients treated surgically enjoyed significant benefit in every outcome measure and at every time point. This is great news for proponents of surgery; however, the study is no longer a randomized study. Because the patients were largely able to choose their treatment regardless of randomization, the study is actually a cohort study and would be considered to provide lower-quality evidence than a randomized trial.
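
The arithmetic behind this point can be made explicit. The sketch below uses invented success rates, not SPORT data, to show how 50% crossover in both directions makes the two intent-to-treat arms indistinguishable even when the treatments differ dramatically.

```python
# Invented success probabilities, not SPORT data: suppose surgery truly
# succeeds in 80% of patients and nonoperative care in 40%.
p_surgery, p_nonop = 0.80, 0.40
crossover = 0.50  # half of each arm actually receives the other treatment

# Intent-to-treat credits outcomes to the assigned arm, so each arm's
# expected success rate is a mixture of the two treatments actually received.
itt_surgical_arm = (1 - crossover) * p_surgery + crossover * p_nonop
itt_nonoperative_arm = (1 - crossover) * p_nonop + crossover * p_surgery

print(f"ITT surgical arm:     {itt_surgical_arm:.2f}")      # 0.60
print(f"ITT nonoperative arm: {itt_nonoperative_arm:.2f}")  # 0.60
# A true 40-point difference between treatments yields a 0-point
# difference between the intent-to-treat arms.
```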

Patient selection can also be manipulated on purpose for specific reasons. When organizing a clinical trial, investigators, particularly those who are motivated to achieve a positive result, try to select a patient population most likely to benefit from the intervention being studied. For example, recently published studies looking at lumbar disc arthroplasty included a very select group of patients without significant spondylosis, facet arthropathy, spondylolisthesis, or stenosis.29,30 Although the arthroplasty group results were equivalent or, in some cases, marginally better than the comparison fusion groups, many authors have pointed out that the population operated on was not representative of the usual lumbar fusion patient population. For example, Wong et al. reported that none of 100 consecutive patients offered lumbar fusion in their practice would have been a candidate for arthroplasty had the study criteria been applied.31 Others have pointed out that the study population did not represent patients to whom many surgeons would offer a lumbar fusion in the first place32 and have questioned the relevance of the arthroplasty studies’ data to the broader fusion population. Therefore, although the trials were well organized and used valid outcomes instruments, any evidence drawn from the data presented must be interpreted in light of the fact that the data are based on a highly select and perhaps irrelevant patient population. For this reason, the data derived from such studies would not be considered to provide high-quality evidence for the general fusion population.

Some form of outcome measure must be used to report results. Outcomes measures may be patient reported (e.g., satisfaction scores, pain scales, or disability indexes). Some measures may be investigator reported, such as the absence or presence of neurologic deficits or other surgical complications. Other measures, such as radiographic measures, laboratory values, and survival statistics, may be reported independently of investigator or patient interpretation. The choice of an outcome measure is important and can influence the quality of information derived from a study. For example, many authors have reported different strategies for enhancing the success rate of lumbar fusion. Comparison of plain radiographs to more definitive assessments of fusion (e.g., operative exploration) has revealed that plain radiographs are a relatively poor diagnostic tool for detecting nonunion.33 Therefore, studies that rely on plain radiographs as an outcome measure for healed fusion would be considered to provide only low-quality data. Similarly, if the objective of a procedure is to provide good patient outcomes, and fusion rates do not necessarily correlate with good outcomes, then measuring fusion rates would not provide useful data about patient outcomes, no matter what method was used to assess bone growth.

These factors and others may lead to downgrading of what at first appears to be high-quality evidence. Unfortunately, it is rare that a surgical trial is free from methodologic flaws. Sometimes these flaws could have been anticipated; sometimes they do not become clear until after the analysis has taken place. It is important to recognize that sometimes "the enemy of the good is the perfect," and that pretty good evidence is likely the best that we are ever going to have (e.g., the SPORT results, a trial that is unlikely to be repeated). When no attempt has been made to develop high-quality data, the literature review can reveal prime areas for future research.

Creation of Recommendations

A common misconception is that once the literature review is completed, recommendations naturally follow. Although this is the case in some circumstances, it is the exception rather than the rule. Value judgments must be made, and here is where clinical expertise, broad representation, and an appreciation of patient-centered outcomes are crucial. Sometimes two equally weighted studies have conflicting results; consider the Fritzell et al. and Brox et al. studies comparing surgical to nonsurgical treatment for low back pain.34,35 Both were randomized studies performed in roughly similar patient populations. The Fritzell et al. group found that surgery was more effective than nonsurgical care for improving patient outcomes, yet the Brox et al. group found no significant differences in the outcomes measures in which they were interested.34,35 Different guidelines groups have made different recommendations regarding the performance of lumbar fusion after reviewing the same literature. A surgical group, focusing on the results reported for back and leg pain, strongly recommended fusion as a treatment strategy in selected patients, whereas a largely medical group, focusing on fear avoidance behavior and work history, offered a much less enthusiastic recommendation.33,36 In other situations, there is simply a mishmash of low-quality data from which the author group needs to draw some conclusion. This is the situation illustrated by the guidelines dedicated to the surgical management of head injury, a field in which it is ethically impossible (at least in North America and most of the developed world) to perform randomized studies.

Authors of guidelines documents use various means to achieve and describe the degree of consensus regarding a particular recommendation. Some formalized processes are based primarily on voting, in which a recommendation is provided along with an indication of the degree of consensus among the author group. In some more informal processes, the verbiage of the recommendation is altered to convey the degree of uncertainty of the author panel. Consider the 2002 recommendation for the use of steroids following cervical spinal cord injury: “Options: Treatment with methylprednisolone for either 24 or 48 hours is recommended as an option in the treatment of patients with acute spinal cord injuries that should be undertaken only with the knowledge that the evidence suggesting harmful side effects is more consistent than any suggestion of clinical benefit.”21

First of all, despite the fact that the data source for the recommendation was an RCT, a low-level (option) recommendation was made, reflecting the author group's concerns regarding the study design and, in particular, the post hoc analysis of data.37 Second, the recommendation, although positive, is riddled with caveats specifically designed to make the reader consider carefully just how much enthusiasm the author group had for it. Clearly, although these investigators wanted to preserve the use of steroids as an option for physicians managing spinal cord injury, they felt it important to emphasize that steroids were not necessarily required for optimal treatment of such patients.

The creation of recommendations requires clinical judgment. Therefore, those who do not have experience treating patients with the disorders in question are not well equipped to make such judgments. Multiple "technology assessments" and other literature-based reviews created by freestanding centers for hire include recommendations that may not reflect reality. For example, ECRI was hired by the Washington State Worker's Compensation Board to review evidence about the performance of lumbar fusion. The firm created recommendations based on the previously discussed Brox et al. study,35 encouraging the use of the nonsurgical treatment described by Brox et al. as an alternative to lumbar fusion. In response, the board issued a coverage decision essentially eliminating the performance of lumbar fusion in the worker's compensation population. It was not until a group from organized spinal surgery, again through a coalition between the AANS/CNS Spine Section and NASS, pointed out that the population of patients treated in the Brox et al. study did not match the vast majority of patients treated in Washington State, and that the "Brox protocol" did not exist in North America, that the decision was reconsidered.

To streamline the process, NASS has developed standardized language for the description of recommendations. The nature of the language is intended to maintain focus on the precise question asked. For example, in the recently published antibiotic prophylaxis guidelines, one of the recommendations was "Prophylactic antibiotics are recommended to decrease the rate of spinal infections following uninstrumented lumbar spinal surgery." This was the answer to the systematic review question: "For patients undergoing spine surgery without spinal implants, does antibiotic prophylaxis result in decreased infection rates as compared to patients who do not receive prophylaxis?" Importantly, this is a stand-alone recommendation that pertains to a specific population of patients undergoing spinal surgery. It does not comment on duration or frequency, nor on patients undergoing instrumented surgery, because these variables are dealt with in other recommendations. To demonstrate this pattern, the recommendation for patients with spinal implants was "Prophylactic antibiotics are recommended to decrease the rate of infections following instrumented spine fusion."

An emerging issue with guidelines creation is conflict of interest. Although searching a library and grading papers are not likely to be influenced by corporate interests, the interpretation of the evidence and creation of recommendations is a step that can be influenced consciously or subconsciously. Transparency, the use of widely representative author panels, the practice of recusal, and multiple levels of peer review are mechanisms that serve to mitigate such conflicts. An even more basic conflict also must be taken into consideration—can a recommendation for a surgical procedure authored solely by those who perform the procedure really be taken seriously? For these reasons, industry sponsorship of practice guidelines should be discouraged, and multidisciplinary panels should be encouraged whenever possible.

References

1. Sackett D.L., Straus S.E., Richardson S., et al. Evidence based medicine, ed 2. Edinburgh: Churchill Livingstone; 2001.

2. Barratt A., Irwig L., Glasziou P., et al. Users’ guides to the medical literature: XVII. How to use guidelines and recommendations about screening. Evidence-Based Medicine Working Group. JAMA. 1999;281(21):2029-2034.

3. Bucher H.C., Guyatt G.H., Cook D.J., et al. Users’ guides to the medical literature: XIX. Applying clinical trial results. A. How to use an article measuring the effect of an intervention on surrogate end points. Evidence-Based Medicine Working Group. JAMA. 1999;282(8):771-778.

4. Dans A.L., Dans L.F., Guyatt G.H., Richardson S. Users’ guides to the medical literature: XIV. How to decide on the applicability of clinical trial results to your patient. Evidence-Based Medicine Working Group. JAMA. 1998;279(7):545-549.

5. Guyatt G.H., Rennie D. Users’ guides to the medical literature. JAMA. 1993;270(17):2096-2097.

6. Guyatt G.H., Sackett D.L., Cook D.J. Users’ guides to the medical literature. II. How to use an article about therapy or prevention. A. Are the results of the study valid? Evidence-Based Medicine Working Group. JAMA. 1993;270(21):2598-2601.

7. Guyatt G.H., Sackett D.L., Cook D.J. Users’ guides to the medical literature. II. How to use an article about therapy or prevention. B. What were the results and will they help me in caring for my patients? Evidence-Based Medicine Working Group. JAMA. 1994;271(1):59-63.

8. Guyatt G.H., Sackett D.L., Sinclair J.C., et al. Users’ guides to the medical literature. IX. A method for grading health care recommendations. Evidence-Based Medicine Working Group. JAMA. 1995;274(22):1800-1804.

9. Guyatt G.H., Sinclair J., Cook D.J., Glasziou P. Users’ guides to the medical literature: XVI. How to use a treatment recommendation. Evidence-Based Medicine Working Group and the Cochrane Applicability Methods Working Group. JAMA. 1999;281(19):1836-1843.

10. Hayward R.S., Wilson M.C., Tunis S.R., et al. Users’ guides to the medical literature. VIII. How to use clinical practice guidelines. A. Are the recommendations valid? The Evidence-Based Medicine Working Group. JAMA. 1995;274(7):570-574.

11. Jaeschke R., Guyatt G., Sackett D.L. Users’ guides to the medical literature. III. How to use an article about a diagnostic test. A. Are the results of the study valid? Evidence-Based Medicine Working Group. JAMA. 1994;271(5):389-391.

12. Jaeschke R., Guyatt G.H., Sackett D.L. Users’ guides to the medical literature. III. How to use an article about a diagnostic test. B. What are the results and will they help me in caring for my patients? The Evidence-Based Medicine Working Group. JAMA. 1994;271(9):703-707.

13. McAlister F.A., Straus S.E., Guyatt G.H., Haynes R.B. Users’ guides to the medical literature: XX. Integrating research evidence with the care of the individual patient. Evidence-Based Medicine Working Group. JAMA. 2000;283(21):2829-2836.

14. McGinn T.G., Guyatt G.H., Wyer P.C., et al. Users’ guides to the medical literature: XXII: how to use articles about clinical decision rules. Evidence-Based Medicine Working Group. JAMA. 2000;284(1):79-84.

15. Naylor C.D., Guyatt G.H. Users’ guides to the medical literature. XI. How to use an article about a clinical utilization review. Evidence-Based Medicine Working Group. JAMA. 1996;275(18):1435-1439.

16. Naylor C.D., Guyatt G.H. Users’ guides to the medical literature. X. How to use an article reporting variations in the outcomes of health services. The Evidence-Based Medicine Working Group. JAMA. 1996;275(7):554-558.

17. Oxman A.D., Cook D.J., Guyatt G.H. Users’ guides to the medical literature. VI. How to use an overview. Evidence-Based Medicine Working Group. JAMA. 1994;272(17):1367-1371.

18. Oxman A.D., Sackett D.L., Guyatt G.H. Users’ guides to the medical literature. I. How to get started. The Evidence-Based Medicine Working Group. JAMA. 1993;270(17):2093-2095.

19. Randolph A.G., Haynes R.B., Wyatt J.C., et al. Users’ Guides to the Medical Literature: XVIII. How to use an article evaluating the clinical impact of a computer-based clinical decision support system. JAMA. 1999;282(1):67-74.

20. Richardson W.S., Wilson M.C., Guyatt G.H., et al. Users’ guides to the medical literature: XV. How to use an article about disease probability for differential diagnosis. Evidence-Based Medicine Working Group. JAMA. 1999;281(13):1214-1219.

21. Hadley M., Walters B., Resnick D., et al. Guidelines for the management of acute cervical spine and spinal cord injuries. Clin Neurosurg. 2002;49:407-498.

22. Peul W.C., van den Hout W.B., Brand R., et al. Prolonged conservative care versus early surgery in patients with sciatica caused by lumbar disc herniation: two year results of a randomised controlled trial. BMJ. 2008;336(7657):1355-1358.

23. Weinstein J.N., Lurie J.D., Tosteson T.D., et al. Surgical versus nonoperative treatment for lumbar disc herniation: four-year results for the Spine Patient Outcomes Research Trial (SPORT). Spine (Phila Pa 1976). 2008;33(25):2789-2800.

24. Gibson J.N., Waddell G. Surgery for degenerative lumbar spondylosis: updated Cochrane Review. Spine (Phila Pa 1976). 2005;30(20):2312-2320.

25. Smith G.C., Pell J.P. Parachute use to prevent death and major trauma related to gravitational challenge: systematic review of randomised controlled trials. BMJ. 2003;327(7429):1459-1461.

26. Weinstein J.N., Lurie J.D., Tosteson T.D., et al. Surgical versus nonsurgical treatment for lumbar degenerative spondylolisthesis. N Engl J Med. 2007;356(22):2257-2270.

27. Weinstein J.N., Lurie J.D., Tosteson T.D., et al. Surgical compared with nonoperative treatment for lumbar degenerative spondylolisthesis. four-year results in the Spine Patient Outcomes Research Trial (SPORT) randomized and observational cohorts. J Bone Joint Surg [Am]. 2009;91(6):1295-1304.

28. Weinstein J.N., Tosteson T.D., Lurie J.D., et al. Surgical versus nonsurgical therapy for lumbar spinal stenosis. N Engl J Med. 2008;358(8):794-810.

29. Blumenthal S., McAfee P.C., Guyer R.D., et al. A prospective, randomized, multicenter Food and Drug Administration investigational device exemptions study of lumbar total disc replacement with the CHARITE artificial disc versus lumbar fusion: part I: evaluation of clinical outcomes. Spine (Phila Pa 1976). 2005;30(14):1565-1575; discussion E387-E391.

30. Delamarter R.B., Fribourg D.M., Kanim L.E., Bae H. ProDisc artificial total lumbar disc replacement: introduction and early results from the United States clinical trial. Spine (Phila Pa 1976). 2003;28(20):S167-S175.

31. Wong D.A., Annesser B., Birney T., et al. Incidence of contraindications to total disc arthroplasty: a retrospective review of 100 consecutive fusion patients with a specific analysis of facet arthrosis. Spine J. 2007;7(1):5-11.

32. Resnick D.K., Watters W.C. Lumbar disc arthroplasty: a critical review. Clin Neurosurg. 2007;54:83-87.

33. Resnick D.K., Choudhri T.F., Dailey A.T., et al. Guidelines for the performance of fusion procedures for degenerative disease of the lumbar spine. Part 7: intractable low-back pain without stenosis or spondylolisthesis. J Neurosurg Spine. 2005;2(6):670-672.

34. Fritzell P., Hagg O., Wessberg P., Nordwall A. 2001 Volvo Award Winner in Clinical Studies: Lumbar fusion versus nonsurgical treatment for chronic low back pain: a multicenter randomized controlled trial from the Swedish Lumbar Spine Study Group. Spine (Phila Pa 1976). 2001;26(23):2521-2532; discussion 2532-2534.

35. Brox J.I., Sorensen R., Friis A., et al. Randomized clinical trial of lumbar instrumented fusion and cognitive intervention and exercises in patients with chronic low back pain and disc degeneration. Spine (Phila Pa 1976). 2003;28(17):1913-1921.

36. Chou R., Baisden J., Carragee E.J., et al. Surgery for low back pain: a review of the evidence for an American Pain Society Clinical Practice Guideline. Spine (Phila Pa 1976). 2009;34(10):1094-1109.

37. Bracken M.B., Shepard M.J., Holford T.R., et al. Administration of methylprednisolone for 24 or 48 hours or tirilazad mesylate for 48 hours in the treatment of acute spinal cord injury. Results of the Third National Acute Spinal Cord Injury Randomized Controlled Trial. National Acute Spinal Cord Injury Study. JAMA. 1997;277(20):1597-1604.