Hypotheses usually involve demonstrating that one treatment is significantly better than another (superiority) or that one treatment is not meaningfully worse than another treatment (non-inferiority, equivalence) (Figure 19-1). When testing an active drug or treatment compared with placebo, the study should be designed to detect the smallest improvement that remains clinically meaningful once the cost and side effects of the drug are taken into account. Placebo-controlled trials should always be superiority trials if the goal is to test the efficacy of the new drug. Sometimes non-inferiority placebo-controlled trials are used to establish the safety of a new drug.

FIGURE 19-1 Types of comparisons.

It is important to determine, a priori, if the goal is to prove superiority or non-inferiority. If the new drug has efficacy comparable with that of an accepted treatment but improves on a secondary property such as side-effect profile or cost, then demonstration of non-inferiority would likely be acceptable. If a study designed as a non-inferiority trial shows superiority, then the superiority of the new drug can be the stated result. However, it is important to remember that a failed superiority trial is not a non-inferiority trial. Non-inferiority trials are, by design, much larger than superiority trials. The driving factor that makes non-inferiority trials large is the non-inferiority margin, that is, the largest amount by which the new drug may be worse than the accepted one while still supporting the claim that the new drug is not meaningfully worse.
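The relationship between superiority, non-inferiority, and the margin can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_result`, the convention that positive differences favor the new drug, and the use of a confidence interval for the difference are choices made here, not details given in the text.

```python
def classify_result(ci_lower, ci_upper, margin):
    """Classify a trial from the confidence interval for (new - accepted).

    Assumptions (illustrative only): positive values favor the new drug,
    and `margin` is the non-inferiority margin given as a positive number.
    """
    if margin <= 0:
        raise ValueError("margin must be a positive number")
    if ci_lower > 0:
        # The whole interval favors the new drug.
        return "superior"
    if ci_lower > -margin:
        # The new drug may be worse, but by less than the margin.
        return "non-inferior"
    if ci_upper < -margin:
        # The whole interval lies beyond the margin.
        return "inferior"
    return "non-inferiority not shown"


print(classify_result(-1.0, 2.0, margin=2.0))
```

Under these assumptions, an interval that crosses zero but stays above the negative margin supports non-inferiority only, which is why a failed superiority trial cannot simply be relabeled after the fact: the margin has to be fixed before the data are seen.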

Other Design Parameters

It is possible to be “unlucky” in a clinical trial in two key ways, each of which means that the result of the clinical trial leads to a conclusion that does not hold in the general population of patients. These are given the non-mnemonic names type I error and type II error. Type I error, whose probability is often denoted alpha, occurs when a difference is detected for the patients in the experiment but that difference does not hold for patients in general. It is common to set this error tolerance to 5%. Type II error, whose probability is often denoted beta, occurs when no difference is detected for patients in the experiment but a difference exists for patients in general. The power of a study, or 1 minus beta, is the chance that if a treatment difference exists, the experiment will detect it.
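To show how alpha, power, and the size of the treatment difference interact, the following sketch computes an approximate sample size per arm. It assumes a continuous outcome with a common standard deviation, equal allocation, a two-sided test, and the usual normal approximation; the function name `n_per_arm` and the example numbers are hypothetical and not taken from the text.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(delta, sigma, alpha=0.05, power=0.80):
    """Approximate patients per arm to detect a mean difference `delta`.

    Normal-approximation formula for two equal arms (an assumption here):
        n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta) ** 2
    """
    if delta <= 0 or sigma <= 0:
        raise ValueError("delta and sigma must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # controls type I error
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


# Hypothetical example: halving the detectable difference roughly
# quadruples the required sample size.
print(n_per_arm(delta=5.0, sigma=10.0))
print(n_per_arm(delta=2.5, sigma=10.0))
```

The same quadratic dependence on the difference to be detected is one way to see why a small non-inferiority margin drives trial size up so sharply.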

Primary Outcome

The human phase of development for a drug or intervention is divided into four parts, and each part has unique objectives (Table 19-1). The choice of the outcome measure should be based on the goals of the development phase.

Table 19-1 Human Phase of Development for a Drug or Intervention

PHASE	GOALS	EXAMPLES OF OUTCOMES
I	Maximum tolerated dose, toxicity, and safety profile	Pharmacokinetic parameters, adverse events
II	Evidence of biological activity	Surrogate markers* such as CRP, blood pressure, cholesterol, arterial plaque measure by intravascular ultrasound, hospitalization for CHF
III	Evidence of impact on hard clinical endpoints	Time to death, myocardial infarction, stroke, hospitalization for CHF, some combination of these events
IV	Postmarketing studies collecting additional information from widespread use of drugs or devices	Death, myocardial infarction, lead fracture, stroke, heart failure, liver failure

CRP, C-reactive protein; CHF, congestive heart failure.

* A surrogate marker is an event that precedes the clinical event, which is (ideally) in the causal pathway to the clinical event. For example, if a drug is thought to decrease risk of myocardial infarction (MI) by reducing arterial calcification, changes in arterial calcification might be used as a surrogate endpoint because these changes should occur sooner than MI, leading to a reduction in trial time. Hospitalization is a challenging surrogate marker because it is not clearly in the causal pathway; hospitalization does not cause MI, but heart failure worsening to the degree that hospitalization is required is in the causal pathway. When hospitalization is used as a measure of new or worsening disease, it is important that the change in disease status and not just the hospitalization event is captured.

Randomization

Randomization is the process of “randomly” assigning individuals or groups of individuals to one of two or more different treatment options. The term random means that the process is governed by chance. Different trial designs may implement randomization in different ways as will be described below. The simplest design randomly allocates study participants between one of two treatment arms. A particular participant is equally likely to be assigned to one or the other arm.

Why Randomize?

The notion of randomly assigning individual observational units to one treatment modality or another was first discussed by R.A. Fisher in the 1920s.

Randomization, then, tends to even out any differences between the study participants assigned to one treatment arm compared with those to the other. The Coronary Artery Surgery Study (CASS) randomly assigned patients with stable class III angina to initial treatment with bypass surgery or medical therapy.² If this trial had not been randomized and treatment assignment had been left to the discretion of the enrolling physician, it is likely that these physicians would have selected patients who they believed would be “good surgical candidates” for the bypass surgery arm. This would have led to a comparison of patients who were good surgical candidates receiving bypass surgery with a group of patients who, for one reason or another, were not good surgical candidates receiving medical treatment. It is likely that the patients selected for medical therapy would have been sicker, with more comorbidities, than the patients selected for surgery. This design would not result in a fair comparison of the two treatment strategies. Randomization levels the playing field.

Intention to Treat

To make randomization work, analysis of RCT data needs to adhere to the principle of “intention to treat.” In its purest form, this means that data from each study participant are analyzed according to the treatment arm to which they were randomized, period. In the case of the Antiarrhythmics Versus Implantable Defibrillators (AVID) trial, this meant that data from patients randomly assigned to the implantable defibrillator arm were analyzed in that arm, whether or not they ever received the device.³ Data from patients randomly assigned to the antiarrhythmic drug treatment arm were analyzed with that arm, even if they had received a defibrillator. This may not make obvious sense, and many trial sponsors have argued that their new treatment just could not show an effect if the patient never received the new treatment. However, the principle of “intention to treat” protects the integrity of the trial by removing a large source of potential bias. In a trial to examine the efficacy of a new antiarrhythmic drug for preventing sudden cardiac death (SCD), for example, a sponsor might be tempted to count events only while the patient was still taking the drug. How, they would argue, could the drug have a benefit if the patient was not taking it? But if, as has happened, the drug exacerbates congestive heart failure (CHF), then patients assigned to the experimental drug would be likely to discontinue taking it, and any subsequent SCD or cardiac arrest would not be attributed to the drug. In fact, it could be argued that the drug led to a situation where the patient was more likely to die of an arrhythmia.
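The difference between an intention-to-treat analysis and an analysis by treatment actually received can be made concrete with a small sketch. The eight patient records below are entirely hypothetical (they are not AVID data), and the field names `assigned`, `received`, and `event` are invented for illustration.

```python
from collections import defaultdict


def event_rates(patients, group_key):
    """Event rate per arm, grouping patients by the given field."""
    counts = defaultdict(lambda: [0, 0])  # arm -> [events, patients]
    for p in patients:
        counts[p[group_key]][0] += p["event"]
        counts[p[group_key]][1] += 1
    return {arm: events / total for arm, (events, total) in counts.items()}


# Hypothetical records: one patient in each arm crossed over.
patients = [
    {"assigned": "device", "received": "device", "event": 0},
    {"assigned": "device", "received": "device", "event": 0},
    {"assigned": "device", "received": "device", "event": 1},
    {"assigned": "device", "received": "drug", "event": 1},
    {"assigned": "drug", "received": "drug", "event": 1},
    {"assigned": "drug", "received": "drug", "event": 0},
    {"assigned": "drug", "received": "drug", "event": 1},
    {"assigned": "drug", "received": "device", "event": 0},
]

intention_to_treat = event_rates(patients, "assigned")  # analyze as randomized
as_treated = event_rates(patients, "received")          # analyze as received

print(intention_to_treat)
print(as_treated)
```

In this made-up data set the two arms have identical event rates when analyzed as randomized, yet the as-treated grouping makes the device look better, simply because the crossovers were not random. That is the kind of bias the intention-to-treat principle is meant to remove.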

Stratification and Site Effect

Although randomization is quite effective in evening out differences between populations assigned to one treatment arm versus another, it is not a perfect method. In some instances, important clinical differences have appeared between the two treatment arms. At one interim analysis of the Cardiac Arrhythmia Suppression Trial (CAST) (personal communication), one treatment arm had a markedly lower mean ejection fraction than the other. This difference evened out before the trial was stopped, but if it had not, adjustments would have been required to analyze the data accounting for this very important prognostic difference between the treatment groups.

When an important clinical factor or a potential differential effect of therapy on one group versus another exists, stratification is used to balance the number of patients assigned to each treatment arm within strata. The AVID trial and many other multicenter RCTs are stratified by clinical site. In the case of arrhythmia trials, the skill and experience of the arrhythmia teams or the availability of particular devices at the sites may vary, leading to different outcomes, depending on where the participants are randomized. By stratifying by site, it can be ensured that participants at each site have an equal probability of being assigned to surgical treatment versus medical treatment. Other important differences may vary by clinical site, for example, in surgical trials where the skill and experience of the surgeon can have an effect on the outcome.

Some investigators get carried away with the concept of stratification to try to design randomization in such a way that all possible factors are balanced. It is easy to design a trial with too many strata. Take, for example, a trial stratifying on the basis of ejection fraction at baseline (<30%, 30% to 50%, and >50%) and a required history of myocardial infarction (MI) being “recent,” that is, within the past 6 months, or distant, that is, more than 6 months ago. With this, six strata have been created, a reasonable number for a sample size of, say, 200 or more subjects, randomized to one of two treatment options (Table 19-2). But if the decision is made to stratify by site as well, with 10 sites, for example, 60 strata would be created, leaving only three or four patients per stratum in a 200-patient trial; with more sites or more factors, the strata can outnumber the expected patients. It has been shown that as the number of strata in a conventional randomization design is increased, the probability of imbalances between treatment groups is, in fact, increased as well.⁴
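The arithmetic behind the proliferation of strata is simply the product of the number of levels of each stratification factor. A minimal sketch, using the ejection fraction and MI history categories from the example above together with a hypothetical list of 10 sites and a hypothetical total of 200 patients:

```python
from math import prod


def count_strata(*factors):
    """Number of strata is the product of the levels of every factor."""
    return prod(len(levels) for levels in factors)


ef_levels = ["<30%", "30% to 50%", ">50%"]
mi_levels = ["recent", "distant"]
site_levels = [f"site {i}" for i in range(1, 11)]  # 10 hypothetical sites

without_sites = count_strata(ef_levels, mi_levels)
with_sites = count_strata(ef_levels, mi_levels, site_levels)
patients_per_stratum = 200 / with_sites  # assuming 200 patients in total

print(without_sites, with_sites, round(patients_per_stratum, 1))
```

With only a handful of patients expected in each stratum, many blocks within strata would never be completed, which is one way the intended balance can be lost.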

Table 19-2 History of Myocardial Infarction

It is important to adjust for stratification factors in the analysis of clinical trials. Failure to adjust for the “nonrandomness” in the randomization will influence the results.

Types of Randomization Designs

The most commonly used and most straightforward randomization design is a “permuted block” design. Using this method, a trial for any number of treatment arms can be designed and the proportion of patients assigned to each treatment arm can be set. Randomization does not mean that it is necessary to have equal numbers of patients assigned to each arm. In many trials of new drugs, in order to gain information on side effects in early-phase studies, the sponsor may decide to allocate twice as many patients to the new treatment as to the control (a 2 : 1 allocation).

Permuted block randomization can best be described as constructing small decks of cards, shuffling the cards, and then dealing them out. For a design with two treatment options, decks with two types of cards (say, hearts and clubs) would be created. For equal allocation, each deck would contain an equal number of hearts and clubs. The size of each deck, or block, would depend on stratification and other factors. The deck is shuffled and as each patient is randomized, the next card is dealt. When all of the cards in the deck have been dealt, a new deck is shuffled and the process is repeated. The size of each deck needs to be a whole-number multiple of the number of treatment arms and can vary over the course of the randomization sequence. In actual practice, the size of the decks is determined in advance, and the decks are shuffled in advance.
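The deck-of-cards description translates almost directly into code. The sketch below is a minimal illustration with a fixed block size; the function name, the arm labels, and the seed are hypothetical, and a production randomization system would add features (varying block sizes, stratification, audit trails) that are omitted here.

```python
import random


def permuted_block_sequence(n_patients, arms=("A", "B"), block_multiple=2,
                            seed=None):
    """Generate treatment assignments by shuffling successive blocks.

    Each block (the "deck") holds `block_multiple` copies of every entry in
    `arms`, so repeating an arm in `arms` gives unequal allocation,
    e.g. arms=("new", "new", "control") for 2:1.
    """
    rng = random.Random(seed)
    sequence = []
    while len(sequence) < n_patients:
        block = list(arms) * block_multiple  # build the deck
        rng.shuffle(block)                   # shuffle it
        sequence.extend(block)               # deal the cards in order
    return sequence[:n_patients]


# Blocks of four with equal allocation between two arms.
print(permuted_block_sequence(12, seed=1))
```

Because every completed block contains each arm in the planned proportion, the arms can never drift far apart, which is the property that makes the design attractive within strata.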

Permuted block designs can lead to the same problems as with too many strata. For this reason, adaptive randomization is sometimes used. Several types of adaptive designs exist. Basically, the next randomization assignment is dependent, in some way, on the characteristics of the patients who have been randomized so far. Baseline adaptive techniques adjust the probabilities of assigning the next patient to one treatment arm or the other on the basis of the baseline characteristics of the patient compared with other patients already randomized.⁵ In the study design described above where randomization will be stratified by the class of angina, recent or distant history of MI, and ejection fraction, the objective is to keep a balance of treatment assignments within each stratum. So, as each patient is randomized, the randomization algorithm will look at the existing balance in that stratum and assign treatment on the basis of a biased coin toss.
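A biased coin toss of the kind described above might look like the following sketch. The probability of 2/3 for favoring the under-represented arm is an assumption chosen for illustration (the text does not specify one), as are the function names and arm labels.

```python
import random


def biased_coin_assign(n_a, n_b, rng, p_favor=2 / 3):
    """Assign the next patient in a stratum using a biased coin.

    n_a and n_b are the numbers already assigned to arms A and B in this
    stratum. If the stratum is balanced, a fair coin is used; otherwise the
    arm that is behind is chosen with probability `p_favor` (an assumed
    value, not one given in the text).
    """
    if n_a == n_b:
        return "A" if rng.random() < 0.5 else "B"
    lagging, leading = ("A", "B") if n_a < n_b else ("B", "A")
    return lagging if rng.random() < p_favor else leading


def randomize_stratum(n_patients, seed=None):
    """Randomize one stratum sequentially and return the final arm counts."""
    rng = random.Random(seed)
    counts = {"A": 0, "B": 0}
    for _ in range(n_patients):
        arm = biased_coin_assign(counts["A"], counts["B"], rng)
        counts[arm] += 1
    return counts


print(randomize_stratum(20, seed=3))
```

Unlike a permuted block, the next assignment is never fully predictable, yet large imbalances within a stratum become unlikely because the coin keeps pulling the counts back together.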

Blinding or Masking Therapy

Ideally, all clinical trials should be double-blind or double-masked studies. That is, neither the patient receiving the treatment nor the medical staff treating the patient should have knowledge of the patient’s treatment assignment. In this way, bias in outcome assessment can be minimized. If the patient is aware of receiving the experimental treatment, he or she may be more likely to report side effects than those who believe that they are receiving placebo or do not know which treatment they are receiving. Similarly, an investigator who knows that the patient has received the experimental treatment may be more likely to see a benefit than if the investigator believes that the patient is receiving a placebo.

Ethically and logistically, many trials cannot be conducted as double-blind trials. For example, trials involving surgical intervention for only one of the study arms cannot, in almost all cases, be conducted as a double-blind study. In a single-blind trial, the patient is unaware of the treatment assignment, but the treating physician is aware of the assignment. Trials of pacemakers might be conducted in a single-blind fashion where participants know that they have a pacemaker but are unaware of the programming mode.

Since the purpose of blinding or masking of therapy is to minimize bias in outcome assessments, a strategy to minimize bias in an unblinded trial is to make use of a blinded event assessor. The Stroke Prevention in Atrial Fibrillation studies used a neurologist unassociated with the routine care of the patient to evaluate the patient in the event that symptoms of stroke were reported. The blinded event assessor was presented with medical records masked to therapy, in this case warfarin versus aspirin. The blinded event assessor evaluated the patient and drafted a narrative based on his or her clinical findings.

In a triple-blind study, the patient, the treating medical staff, and the data coordinating center are all masked to individual treatment assignments. Where the data coordination is being provided by a commercial sponsor, the sponsor can elect to remain blinded to interim study results because of the apparent conflict of interest in making decisions regarding study endpoints. Clinical trials rely on scientific equipoise. As soon as a trend favoring one of the treatment assignments is evident, sponsors and clinicians may make decisions different from what they would make if they had no knowledge of the emerging trend. Early trends often do not pan out, and the experiment can become compromised. It is advisable, whenever possible, to keep the sponsor blinded.

Accuracy of Measurement and Event Ascertainment

Endpoint selection is one of the most critical components of trial design. In considering primary and secondary outcomes, the following principles are important to remember:

1. The outcome should be measurable.

2. It should be possible to define the outcome clearly and unambiguously.

3. The experimental treatment should have a measurable effect on the outcome measure, and the outcome should be important to the patient population being studied.

Principles of Outcome Selection

The Outcome Should Be Measurable

In designing a trial to decrease the frequency of episodes of paroxysmal arrhythmias, measurement of the outcome would be a challenge. From an experimental standpoint, every study participant should be outfitted with a telemetry system that would constantly monitor heart rhythm and report any episodes of the arrhythmia. This is not possible practically. Thus, most studies of antiarrhythmic drugs have used 24-hour Holter recordings as a way of sampling the patient’s heart rhythm and concluding whether the frequency of episodes of atrial fibrillation (AF) or ventricular tachycardia (VT) has decreased or the episodes have stopped.

The FDA now advocates the use of patient-reported outcomes in the case of chronic diseases. Patient-reported outcomes include standardized quality-of-life questionnaires, symptom questionnaires, and electronic or paper patient diaries. The outcome measure is usually a computed scale derived from the patient report. One motivation for the use of such instruments is that it enables the patient to contribute to determining whether the treatment is working or not.

To be convincing as measurable outcomes, such questionnaires need to be validated. Some key components to validation of such questionnaires are as follows: (1) Does the instrument reliably measure what it is intended to assess? In a questionnaire used to assess pain or discomfort, the instrument needs to reveal the presence and intensity of pain as reported by a variety of patients. (2) Are the results reproducible? That is, if the questionnaire is administered several times to a person with the same level of pain or discomfort, will the questionnaire give similar results?

The Outcome Should Be Unambiguously Defined

The American College of Cardiology (ACC) has devoted much effort to standardizing the definition for MI. Whereas earlier trials such as CASS had to define MI for their trials, current studies refer to the ACC definition.⁶

Suppression of arrhythmia as an endpoint is not clearly defined unless the investigators add to the protocol specifics such as the following definition from CAST:

• ≥80% suppression of the frequency of ventricular premature depolarizations (VPDs) and

• ≥90% suppression of episodes of unsustained VT

• As measured on a 24-hour Holter monitor⁷
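An endpoint defined this precisely can be checked mechanically. The sketch below encodes the two suppression thresholds quoted above; the function names, the use of raw event counts from a baseline and an on-treatment Holter recording, and the handling of a zero baseline count (treated as suppressed only if the on-treatment count is also zero) are assumptions made for illustration and are not part of the CAST protocol as quoted.

```python
def suppressed_by(baseline, on_treatment, percent):
    """True if the on-treatment count is at least `percent` below baseline.

    Integer arithmetic avoids floating-point surprises at the threshold.
    With a baseline of zero this is True only when on_treatment is also
    zero (an assumption made here).
    """
    return 100 * on_treatment <= (100 - percent) * baseline


def meets_suppression_criteria(vpd_baseline, vpd_on_treatment,
                               vt_baseline, vt_on_treatment):
    """Apply both criteria: >=80% VPD suppression and >=90% VT suppression."""
    vpd_ok = suppressed_by(vpd_baseline, vpd_on_treatment, 80)
    vt_ok = suppressed_by(vt_baseline, vt_on_treatment, 90)
    return vpd_ok and vt_ok


# Hypothetical Holter counts: VPDs fall from 1000 to 200 and runs of
# unsustained VT fall from 10 to 1, so both thresholds are just met.
print(meets_suppression_criteria(1000, 200, 10, 1))
```

Writing the rule down this way exposes the decisions a protocol has to make explicitly, such as what to do when no events were recorded at baseline.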

The Outcome Should Be Important to the Patient Population

For a patient with AF, what is important? Clearly, avoiding death or disability caused by a stroke or other cardiovascular cause is most important. Thus, some of the most critical trials in patients with AF were the Atrial Fibrillation Follow-up Investigation of Rhythm Management trial (AFFIRM), which examined the effect of two treatment strategies, heart rate control and restoration of sinus rhythm, on overall mortality in patients with atrial fibrillation,⁸ and the Stroke Prevention in Atrial Fibrillation (SPAF),⁹ Aspirin vs. Warfarin Standard Dose (AFASAK),¹⁰ and other trials of anticoagulation to prevent stroke.¹¹ Then, is functional status important? The AFFIRM study attempted to measure change in functional status using the New York Heart Association (NYHA) functional class scale, Canadian Cardiovascular Society Angina Classification, a Mini-Mental State Examination, and a 6-minute walk test.¹² The investigators in the study were able to detect a difference in functional status related to the presence or absence of AF. In contrast, inflammatory measures such as C-reactive protein and degree of stenosis may be good measures of disease burden, but changes in these measures may not be identifiable to patients.

Overreads and Clinical Endpoint Adjudication

As clinical trials have to reach further and include many more sites to recruit enough patients, they depend on the clinical judgment of a large number of physicians to assess study endpoints. In the AFFIRM trial, for example, the more than 200 clinical sites were located primarily in cardiology practices. One of the important endpoints was cardioembolic stroke. Although the patients were, of course, treated by neurologists, the study physicians were, in almost all cases, not neurologists. To make sure that all of the strokes included in the study-defined endpoint met the protocol definition for cardioembolic stroke, the investigators used a clinical events committee (CEC) to evaluate each potential study endpoint by reviewing collected medical records. These records were masked to anything that could reveal the patient’s assigned treatment arm (rhythm or rate control). Two separate CEC neurologists examined records that included imaging reports, hospital discharge summaries, and a narrative discussion provided by the local physician. Concordance was required between two CEC neurologists or consensus by the committee in order for an event to be “ruled in” (E. Nasco, personal communication). In the final analysis, only events confirmed by the CEC were included. In the AFFIRM trial, only 171 of 247 reported strokes were ruled in. In the Trial With Dronedarone to Prevent Hospitalization or Death in Patients with Atrial Fibrillation (ATHENA), an events adjudication committee categorized the causes of death using a modification to a previously published classification scheme that standardized the criteria for determining arrhythmic, nonarrhythmic cardiac, noncardiac vascular, or noncardiovascular death.¹³

Some of the reasons to use CECs are that (1) the assessment of an endpoint event is done without the knowledge of treatment assignment, and (2) the assessment is done consistently by a small group of trained physicians. The use of CECs can thus minimize bias and inconsistency in determining whether a study-defined endpoint has occurred.

Interim Analyses and Adaptive Designs

In the simplest approach, a clinical trial would be designed, the trial would be launched, and participants would be enrolled, treated, and followed up. Follow-up would continue up to a prespecified endpoint, such as 6 months after the last patient was treated. While the trial is ongoing, data would be collected, but no one would look at the results. At the end of the trial, the data would be completed, closed, and analyzed. However, for many trials conducted these days, investigators have valid reasons to look at the data even while the trial is ongoing. In the case of experimental treatments, the sponsor may want an independent group monitoring the trial to make sure that participants are not exposed to undue risk or harm from the adverse effects of the treatment. In the case of trials with long-term follow-ups, it may be possible to reach a reasonable conclusion about the treatment that is being evaluated without continuing the trial to its planned conclusion. For these reasons, many trials are planned with “interim analyses.” Further, regulatory agencies have, in some situations, strongly recommended the use of external, independent monitoring committees, commonly referred to as Data Monitoring Committees (DMCs) or Data and Safety Monitoring Boards.¹,¹⁴ Although the specific charge of DMCs will vary from one trial to another, in general, DMCs are charged with monitoring the trial for the following reasons:

• To protect the safety of the participants

• To protect the scientific integrity of the trial

• In some cases, to monitor for unexpected early demonstration of conclusive efficacy or lack of efficacy

The DMC works in an advisory capacity to the sponsor. That is, the committee makes recommendations to the sponsor about continuing the trial as designed, continuing with changes, or prematurely discontinuing the trial. The actual decision to continue, change, or discontinue is made by the sponsor; however, it is rare that a sponsor would not accept the recommendations of the DMC.

Much has been written about DMCs, which are being used more frequently for late phase II and phase III trials. Ellenberg and colleagues make the following recommendations regarding when a DMC should be used¹⁵:

• Phase IIb, III, and IV in life-threatening diseases or in diseases causing irreversible and serious morbidities

• Trials of novel treatments with potentially life-threatening complications

• Trials in emergency settings where individual consent is not possible

• Trials involving vulnerable populations (e.g., children)

In some other situations such as early-phase trials where dose selection would best be made by a group of experts independent of the investigators or sponsor, an independent DMC might be recommended.

In some trials, it is not practical to implement the interim reviews of data, for example, trials with short-term outcomes such as surgical studies of 30-day mortality rate that have a quick enrollment period. If it is expected that all or most of the patients would be treated in a short period, all of the patients would have been treated by the time data could be collected and reported to a DMC for review; the DMC would not be able to protect patients from exposure if the treatment shows evidence of harm. This is why some device trials are deliberately designed with a pause in recruitment. Such a pause would allow a DMC to examine the results of treatment of the first cohort of patients before proceeding to treat more patients.

For a DMC to adequately monitor a trial, the committee needs information that is sufficiently complete and current so that valid conclusions about the study can be made. The DMC’s most important responsibility is to ensure the safety of the trial participants. For this, complete reporting of adverse events is vital. The evaluation of safety will be made by the DMC in the context of risk/benefit. That is, the DMC and the trial sponsor may be more tolerant of moderate to severe adverse events if the benefit expected is great and the disease is severe. Cancer therapies commonly have a more severe adverse effect profile than do cardioprotective drugs such as β-blockers or statins.

Reports to DMCs should be succinct but give a clear and complete picture of the risks and benefits of the study treatment. It is important to understand that DMC reports are not the same as full, final study reports. They should be focused on the areas that the DMC is charged to review, such as enrollment and performance of the clinical sites in treating and monitoring appropriate study participants, safety of the treatment, adherence of patients to treatment protocols as it would affect the power of the trial, and important measures of efficacy.

The practicalities of collecting and cleaning study data put constraints on the DMC report. While the DMC needs to have data that are as accurate and complete as possible, access to information that is as current as possible is also important. It is common practice to create a snapshot of the study data to prepare reports for the DMC. When to create that snapshot in relation to the planned meeting depends on the mode of data collection and other logistic issues. In general, the snapshot should be made no more than 6 weeks in advance of a DMC meeting where paper case report forms (CRFs) are used and no more than 4 weeks in advance of a DMC meeting where electronic CRFs are used.

During the CAST I study, the Data and Safety Monitoring Board (DSMB) was concerned about whether the reports contained information about all of the deaths that had occurred. For this reason, the coordinating center organized and performed an “events sweep.” During a 2-week period, research coordinators at each of the clinical sites were instructed to call all of their participants to find out, primarily, if they were still alive.¹⁶ This ensured that the DSMB had current (within the past 4 weeks) follow-up data on all randomized participants. This information supplemented the data from a normal database freeze.

Interim monitoring can be classified into two types. (1) In the first type, the DMC is solely concerned with the safety of the patients in the study. Safety data are reviewed periodically during the study, and the DMC may recommend that the trial stop for qualitative safety reasons, or for futility if something about the execution of the trial suggests that the experiment will not be able to test the hypothesis in question. (2) The second type of monitoring includes group sequential trials and many adaptive designs; here the DMC may recommend stopping the study if sufficient evidence of efficacy based on preset stopping rules exists, or the committee may recommend modifications to the study design.

When the DMC is charged with making recommendations regarding early termination of a trial for early demonstration of efficacy, certain statistical techniques guide the analyses that are presented to the DMC. Several approaches are commonly used. Why are these statistical adjustments made? Recall that the alpha level of a trial is the probability that a particular treatment effect is being observed purely by chance. The more often that a treatment effect is looked for, the more likely it is that it will be seen, even if that effect is not real. For this reason, the P value or alpha level is adjusted while taking interim “looks” at the data. The P value is not relevant if the trial does not reach a positive conclusion, so the adjustment is based on the efficacy boundary, which is the criterion at any point in time for claiming success. The DMC can recommend stopping for safety at any time, regardless of whether a boundary has been crossed.
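The inflation of the type I error with repeated looks can be demonstrated by simulation. The sketch below generates trials in which there is truly no treatment effect, tests the accumulating data at equally spaced looks with an unadjusted two-sided 0.05 threshold, and counts how often any look appears “significant.” The function name, the number of simulated trials, the seed, and the equal spacing of looks are all assumptions made for illustration.

```python
import random
from statistics import NormalDist


def false_positive_rate(n_looks, alpha=0.05, n_trials=20000, seed=0):
    """Estimate the chance of at least one 'significant' look under no effect.

    Each look adds an equal-sized batch of data, summarized as a standard
    normal increment; the cumulative Z statistic is the running sum divided
    by the square root of the number of batches seen so far.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # unadjusted threshold
    false_positives = 0
    for _ in range(n_trials):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / look ** 0.5
            if abs(z) > z_crit:
                false_positives += 1
                break  # the trial would have been stopped here
    return false_positives / n_trials


print(false_positive_rate(n_looks=1))
print(false_positive_rate(n_looks=5))
```

With a single look the estimate should sit near the nominal 5%, whereas with five unadjusted looks it is expected to be well above that, which is the motivation for adjusting the boundary at each interim analysis.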

In “group sequential” study designs, the trial is designed and conducted assuming a clinically important treatment effect; this is accompanied by a sample size and follow-up to ensure that if things turn out as expected, enough information will be available at the end of the trial to detect that treatment effect if it is present (“power”) at the significance level (“alpha”) selected. Built into a group sequential design are pre-planned interim examinations of data. At prespecified times, data are analyzed (by treatment group) and a P value for the treatment comparison is computed. If the results indicate that the treatment is effective with a sufficiently small P value, the DMC might recommend that the trial be discontinued and results disclosed.

A number of approaches to the group sequential study design exist. Basically, they each take a different approach to the P values or alpha levels that will be used to determine whether or not to recommend curtailment of the trial. At the interim analysis, data are analyzed by the treatment group, and the appropriate statistic is computed. The P value associated with that statistic is compared with a prespecified alpha. If the computed P value is smaller than the prespecified alpha for the boundary at that look, the DMC might conclude that efficacy has been proven and might recommend that the trial be curtailed. Each time that the DMC formally examines the data, the trial “spends” alpha. This is because the chances that the expected treatment effect is observed by chance alone increase, roughly by the alpha level used as the boundary at that look. So, at the end of a trial, instead of using a P value, for example, of 0.05 to determine if the trial was successful, the data would have to result in a slightly smaller P value to say that the results were significant.

The various approaches to a group sequential design can best be illustrated by Figure 19-2, which depicts a typical O’Brien-Fleming boundary. The O’Brien-Fleming approach is very conservative at early looks at the data and is less conservative as more information is available. The Pocock approach assigns the same alpha level at each look. The actual shape of the O’Brien-Fleming boundary can be adjusted based on the risk tolerance and objectives of the sponsor. The decisions regarding the number of formal interim analyses and the shape of the boundary are important for management to consider carefully.
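The contrast between the two approaches can be illustrated numerically with alpha-spending functions. Note that this is a swapped-in technique: the functions below are the Lan-DeMets spending-function approximations to O’Brien-Fleming-type and Pocock-type boundaries, not the original boundary calculations, and they are not necessarily what Figure 19-2 was drawn from. The function names and the choice of five equally spaced looks are assumptions.

```python
from math import e, log, sqrt
from statistics import NormalDist

_NORMAL = NormalDist()


def obf_alpha_spent(t, alpha=0.05):
    """Cumulative two-sided alpha spent by information fraction t (0 < t <= 1),
    O'Brien-Fleming-type spending function (Lan-DeMets approximation)."""
    z = _NORMAL.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _NORMAL.cdf(z / sqrt(t)))


def pocock_alpha_spent(t, alpha=0.05):
    """Cumulative alpha spent by information fraction t (0 < t <= 1),
    Pocock-type spending function (Lan-DeMets approximation)."""
    return alpha * log(1 + (e - 1) * t)


# Five equally spaced looks (an illustrative choice).
for k in range(1, 6):
    t = k / 5
    print(f"t={t:.1f}  OBF-type={obf_alpha_spent(t):.5f}  "
          f"Pocock-type={pocock_alpha_spent(t):.5f}")
```

Both functions spend the full alpha by the end of the trial, but the O’Brien-Fleming-type function spends almost nothing at early looks, so only an extreme early result would cross the boundary, while the Pocock-type function spends alpha much more evenly.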

FIGURE 19-2 Normalized Z value is the computed statistic that will correspond to a P value. The larger the Z value, the smaller will be the P value. The upper boundary corresponds to efficacy and the lower boundary to lack of efficacy.

The same approach can be taken to monitor for lack of efficacy. While few trials continue long enough to prove “harm,” many are designed with guidance for the DMC so that the trial could be curtailed if the experimental treatment is much less effective than expected or the trial is unlikely to show a benefit for the treatment. In this case, a second boundary is selected to monitor for lack of efficacy. The Multicenter Automatic Defibrillator Implantation Trial (MADIT) used a “triangular” design. In this design, early examinations of the data would have had to show definite harm to prompt a recommendation to stop the trial for lack of efficacy. As the trial progressed, however, the boundary for inefficacy approached the boundary for efficacy; that is, if the results were not trending toward a sufficiently large benefit, the investigators would not have wanted to continue to the end of the trial. CAST was designed with a symmetric boundary; that is, the alpha level at each interim analysis was the same whether the direction of treatment effect was benefit or harm. CAST I was discontinued early because the results exceeded the boundary for harm. In hindsight, CAST might have been designed with an asymmetric boundary, that is, a more conservative boundary for efficacy than for lack of efficacy. As it was, the results were so conclusive that class I antiarrhythmics such as those studied in CAST (encainide, flecainide, and moricizine) are no longer used to suppress post-MI ventricular arrhythmias.

Adaptive clinical trials are basically a class of trial where some aspect of the design is modified on the basis of accumulated data within that trial. One of the oldest examples of an adaptive trial is a study to determine the maximum tolerated dose. For this experiment, three patients are started on a low dose, and if no side effects occur, the dose is increased for the next three patients and so on until side effects emerge. The dose for the next three patients is always determined by the result of the three previous patients, so the dose is adapted on the basis of accumulated data. A second example is when phase II is started with placebo compared with high-dose and low-dose medications. The goal is to determine which dose should be taken into phase III. In a traditional model, the investigators stop after phase II, choose the dose to take forward, and begin a new study. An adaptive model might allow the decision to be made within the study or pause for an interim analysis between phases II and III, but enrollment is never stopped (the latter is called a “seamless” trial).
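The dose-escalation rule described above can be written as a short loop. This is a minimal sketch of the simplified rule in the text only (escalate after a cohort of three with no side effects, stop when side effects emerge); formal dose-finding designs use additional rules, such as expanding a cohort after a single event, that are not modeled here. The function name and the example doses are hypothetical.

```python
def highest_tolerated_dose(dose_levels, side_effects_per_cohort):
    """Return the highest dose given before side effects emerged.

    dose_levels: doses in ascending order.
    side_effects_per_cohort: for each dose, how many of the three
        patients in that cohort had side effects (hypothetical data).
    Returns None if side effects appear at the lowest dose.
    """
    tolerated = None
    for dose, n_with_side_effects in zip(dose_levels,
                                         side_effects_per_cohort):
        if n_with_side_effects > 0:
            break  # side effects emerged: stop escalating
        tolerated = dose  # cohort was clean: this dose is tolerated
    return tolerated


# Hypothetical example: side effects first appear at the third level.
print(highest_tolerated_dose([10, 20, 40, 80], [0, 0, 1, 0]))
```

The dose given to each cohort depends only on what was observed in the cohorts before it, which is what makes the design adaptive.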

It is not uncommon to test patients in phase II who are from a population different from patients in phase III. For example, phase II might enroll patients with permanent AF, and phase III might enroll patients with permanent or paroxysmal AF. Because the patients are different, the rates observed in phase II are not likely to be correct for phase III. Phase III could be started using the phase II rates, and then after 10% of the information is collected, the sample size calculation could be done again with the rate seen so far in phase III to get a more accurate sample size. Most of these adaptive designs cost something in terms of type I error, and controversy surrounds the interpretation because the design of the experiment is not fixed a priori (although this is true for group-sequential trials as well). The most important point is that adaptations must be specified before the trial begins; adaptive trials are not a remedy for poor planning.
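A sample size re-estimation of this kind can be sketched with the standard normal-approximation formula for comparing two proportions. The event rates, the 25% relative reduction, and the function name below are hypothetical values chosen for illustration, not figures from any trial in this chapter. In practice, the re-estimation rule, including whether it uses blinded or unblinded data, would be prespecified in the protocol.

```python
from math import ceil
from statistics import NormalDist


def per_group_sample_size(control_rate, relative_reduction,
                          alpha=0.05, power=0.80):
    """Participants needed per group to compare two event proportions
    with a two-sided test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    treated_rate = control_rate * (1 - relative_reduction)
    variance = (control_rate * (1 - control_rate)
                + treated_rate * (1 - treated_rate))
    difference = control_rate - treated_rate
    return ceil((z_alpha + z_beta) ** 2 * variance / difference ** 2)


# Planning assumption carried over from phase II (hypothetical).
planned = per_group_sample_size(control_rate=0.20, relative_reduction=0.25)
# Lower control event rate observed early in phase III (hypothetical).
revised = per_group_sample_size(control_rate=0.12, relative_reduction=0.25)
print(planned, revised)
```

With the same relative treatment effect, a lower event rate in the phase III population calls for a noticeably larger trial than the phase II rates suggested.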

Issues Related to the Conduct of Clinical Trials

Recruitment

In designing a trial, delineating the population to be studied requires careful thought. These definitions will ultimately determine the conclusions that can be drawn from the trial. For example, the Physicians' Health Study studied 22,071 male physicians in the United States.¹⁷ Because women were excluded from the study, the investigators were limited in the conclusions that they could draw from their data about women.

The inclusion criteria define the disease condition to be studied. For example, in CAST, the inclusion criteria stated that potential participants should have had an MI 6 months to 2 years before enrollment and should have demonstrated sufficient ventricular arrhythmia to warrant treatment (≥6 VPDs per hour on a 24-hour Holter monitor). The investigators gave considerable thought to setting a cutoff of 6 or more VPDs per hour. Would 10 or more VPDs have defined a population that would have better benefited from the treatment? Would a cutoff of 10 VPDs per hour have excluded too many patients? The Cardiac Arrhythmia Pilot Study (CAPS) was designed as a feasibility study to see if a long-term trial could be conducted to test the hypothesis that suppression of ventricular arrhythmias in a post-MI population would decrease the rate of arrhythmic and overall death.¹⁸ Data from CAPS provided needed information about the proportion of patients likely to be excluded if a cutoff of 10 VPDs per hour was used.

Exclusion criteria specify (1) participants in whom the planned treatments would be contraindicated, (2) those who would be unlikely to benefit from the planned treatment, or (3) those who could not be adequately followed up to ascertain outcome.

It has been said that as soon as a trial begins, the prevalence of the disease decreases. In some respects, this is true of trials in cardiology. Over the past 20 years, the rate of death from cardiovascular causes has decreased to the point that cancer is expected to overtake cardiovascular disease as the leading cause of death in the United States.¹⁹ At the same time, the number of clinical trials being conducted in the United States has increased to such an extent that in any community there is more competition for the pool of available patients. Nearly all large clinical trials have struggled with recruitment.

In recent years, more and more trials have expanded into Europe, Asia, and South America to find sufficient numbers of patients to meet enrollment targets. Global trials certainly have their logistical challenges. But, more importantly, they increase the heterogeneity of the population studied and the background care that participants receive. In the United States, for example, patients with even slightly elevated cholesterol have been treated with statins and other lipid-modifying drugs for many years. While these drugs are available in many nations, they are not as widely prescribed for a variety of reasons, including regional differences in diet and general access to medical care to monitor their use.

The ATHENA trial was a multicenter randomized trial of dronedarone in patients with AF.²⁰ Patients with paroxysmal or persistent AF were eligible if they were older than 70 years or had at least one risk factor for cardiac complications (arterial hypertension, diabetes mellitus, previous stroke, transient ischemic attack or systemic embolism, enlarged left atrium, or depressed ejection fraction). As the trial proceeded, overall mortality rates were lower than expected. For this reason, the investigators modified the entry criteria to increase the risk profile of patients enrolled. Patients between ages 70 and 75 years were eligible only if they had one or more of the above risk factors for cardiac death. Patients younger than 70 years were no longer eligible.

Adherence

Adherence describes the ability of the trial participants to “adhere to” the study protocol. It includes their ability to continue to take a prescribed medication or treatment as part of the trial, to come to follow-up visits, and to participate in any patient-reported outcome measures and other activities that might be part of the study protocol. Inherent in sensible trial design is a buffer for expected lack of adherence. In real life, things happen that prevent participants from taking their medication as prescribed, from attending every follow-up visit, and from having the required blood draws done. The poorer the expected adherence, the higher the number of participants who need to be enrolled and followed up to collect enough information to draw a valid conclusion. The effect on sample size can be considerable, and recruitment of additional participants is expensive. It is usually much cheaper to invest in improving adherence than to enroll enough additional participants to overcome poor adherence.

In drug trials, adherence to treatment is frequently measured by counting tablets returned at follow-up. This method of ascertaining whether the participant took the study drug as prescribed is an accepted measure, even though it has its problems. Depending on the medication, if participants take 75% or more of expected doses of study drug, they are considered to be “good” adherers.
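The pill-count calculation is simple arithmetic, sketched below. The 75% threshold comes from the text; the function names and the example counts are hypothetical, and the sketch assumes that every tablet not returned was actually taken, which is one of the known weaknesses of the method.

```python
def pill_count_adherence(dispensed, returned, expected_doses):
    """Fraction of expected doses taken, inferred from returned tablets."""
    if expected_doses <= 0:
        raise ValueError("expected_doses must be positive")
    taken = dispensed - returned  # assumes unreturned tablets were taken
    return taken / expected_doses


def is_good_adherer(dispensed, returned, expected_doses, threshold=0.75):
    """Apply the 75%-of-expected-doses rule described in the text."""
    return pill_count_adherence(dispensed, returned,
                                expected_doses) >= threshold


# Hypothetical visit: 90 tablets dispensed, 18 returned, 84 doses expected.
print(round(pill_count_adherence(90, 18, 84), 2))
print(is_good_adherer(90, 18, 84))
```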

In the ATHENA study, the study drug was discontinued in 696 (30.2%) of 2301 patients assigned to dronedarone and 716 (30.8%) of 2327 patients assigned to placebo. This trial was designed with 80% power to show a 15% reduction in mortality and hospitalization rates for cardiac cause. If all of the patients had remained on their assigned study drug, the trial could have been completed with far fewer participants.
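A rough sense of how discontinuation affects sample size can be obtained from a common back-of-the-envelope approximation: if a fraction d of participants assigned to active treatment stop taking it and derive no further benefit, the observed treatment effect shrinks by a factor of (1 − d), and the required sample size grows by about 1/(1 − d)². The sketch below applies that approximation to the dronedarone discontinuation rate quoted above. It is an illustration under those stated assumptions, not the calculation used by the ATHENA investigators, and it ignores the timing of discontinuation and any crossover in the placebo group.

```python
def adherence_inflation_factor(discontinuation_rate):
    """Approximate multiplier on sample size when a fraction of the
    active-treatment group stops treatment and gains no benefit."""
    if not 0 <= discontinuation_rate < 1:
        raise ValueError("discontinuation_rate must be in [0, 1)")
    return 1 / (1 - discontinuation_rate) ** 2


# 30.2% discontinuation in the dronedarone group, as quoted in the text.
factor = adherence_inflation_factor(0.302)
print(f"sample size multiplier: {factor:.2f}")
```

Under this approximation, a discontinuation rate near 30% roughly doubles the number of participants needed compared with perfect adherence.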

Conclusion

Well-designed clinical trials are powerful tools for advancing medical science. The conduct of trials presents challenges to the study leadership, sponsors, recruiting sites, and study participants. The true benefit from a clinical trial, however, cannot be realized without clear, accurate, and complete dissemination of the results, whether positive or not.

Medical journals publishing the results of clinical trials have set certain standards for reporting these results. This has been done to prevent the publication of false or misleading reports.²¹ Consolidated Standards of Reporting Trials (CONSORT) is an organization that has examined the quality of reporting of medical studies.²² CONSORT was formed by a group of clinical trialists and journal editors who originally met in 1993 to set a scale for evaluating published results from clinical trials. As a result of their examination, CONSORT has published standards for results manuscripts, which include the following key points:

• Clear definition of the inclusion and exclusion criteria and settings in which participants were recruited

• Clear definition of the primary and secondary objectives and outcome measures, including any measures taken to enhance the quality of the measures

• Description of treatment allocation and randomization as well as masking of therapy

• Succinct summary of participants as they flowed through the study from recruitment through randomization, treatment, follow-up, and drop-out

• Description of the sample size determination, interim analyses, and statistical method for comparing treatment groups and for measuring effect size

• Summary of baseline and demographic characteristics of the participant population by treatment group

The publication of the primary results from the ATHENA trial is a good example of adherence to CONSORT. In addition to the points listed above, the publication includes a study diagram clearly depicting the disposition of all patients randomized.

The real result of clinical trials research is improvement in the way that diseases are treated and prevented. If the results of a clinical trial are not published or disseminated in a trustworthy manner, it is unlikely that the practice of health care professionals will change on the basis of trial results. It has been estimated that it takes, on average, 17 years for the results of a clinical trial to bring about an effect in general practice.²³ The Beta-blocker Heart Attack Trial (BHAT) in 1982 showed that treatment with β-blockers after an MI would significantly decrease mortality in this at-risk population.²⁴ And yet, β-blockers did not become the standard of care until the AHA/ACC published guidelines for their use in 1996.²⁵,²⁶ The Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack Trial (ALLHAT) was a National Heart, Lung, and Blood Institute-sponsored trial of antihypertensives and lipid-lowering treatment to reduce the rate of cardiac death and nonfatal MI in participants with hypertension and at least one other risk factor for cardiovascular disease.²⁷,²⁸ The trial showed that aggressive treatment with antihypertensives could significantly decrease the mortality rate in this population. Following the conclusion of the trial and the publication of the initial results, the ALLHAT study group embarked on a project to disseminate the study results in such a way as to maximize their impact.²⁹ Their dissemination plan included the usual press releases, presentations at national medical meetings, and refereed articles in major medical journals. In addition, they targeted the “determinants of clinician behavior” to communicate more directly with clinicians about the implications of the ALLHAT results. They took their message to places where physicians routinely went to get information and conducted numerous in-person, interactive sessions. Their approach included, among other strategies, Web forums, online continuing medical education sources, and other more immediately available media.

Thus, if medical practice is to improve, well-controlled and well-conducted clinical trials are needed, together with clear, accurate, and complete reporting of their results.

References

1 International Conference on Harmonization of Technical Requirements for Registration of Pharmaceuticals for Human Use. Statistical principles for clinical trials (E9). February 5, 1998. Available at http://www.ich.org/fileadmin/Public_Web_site/ICH_Products/Guidelines/Efficacy/E9/Step4/E9_Guideline.pdf. Accessed October 19, 2009.

2 CASS Principal Investigators. National Heart, Lung, and Blood Institute Coronary Artery Surgery Study. Circulation. 1981;63(II):6.

3 AVID Investigators. Antiarrhythmics Versus Implantable Defibrillators (AVID)—rationale, design, and methods. Am J Cardiol. 1995;75:470-475.

4 Hallstrom AP, Davis K. Imbalance in treatment assignments in stratified blocked randomization. Control Clin Trials. 1988;9:375-382.

5 Efron B. Forcing a sequential experiment to be balanced. Biometrika. 1971;58:403-417.

6 The Joint European Society of Cardiology/American College of Cardiology Committee. Myocardial infarction redefined—a consensus document of the Joint European Society of Cardiology/American College of Cardiology committee for the redefinition of myocardial infarction. J Am Coll Cardiol. 2000;36:959-969.

7 Epstein AE, Hallstrom AP, Rogers WJ, et al, for the CAST Investigators. Mortality following ventricular arrhythmia suppression by encainide, flecainide, and moricizine after myocardial infarction: The original design concept of the Cardiac Arrhythmia Suppression Trial (CAST). JAMA. 1993;270(20):2451-2455.

8 The AFFIRM Investigators. A comparison of rate control and rhythm control in patients with atrial fibrillation. N Engl J Med. 2002;347(23):1825-1833.

9 The Stroke Prevention in Atrial Fibrillation Investigators. Stroke prevention in atrial fibrillation study, final results. Circulation. 1991;84:527-539.

10 Petersen P, Boysen G, Godtfredsen J, et al. Placebo-controlled, randomised trial of warfarin and aspirin for prevention of thromboembolic complications in chronic atrial fibrillation. The Copenhagen AFASAK study. Lancet. 1989;i:175-179.