Chapter 12 Statistics and Clinical Trials
We define a clinical trial as a designed study involving the treatment of prospectively accrued humans that is specified by a document (protocol) with specific goals and analysis plans. Meinert1 describes some of the unique aspects of clinical trials that distinguish them from other medical research studies, including, among others, observational studies and case-control studies, and enumerates the requirements of a good protocol document.
Interaction between an Investigator and a Statistician
Clinical Trials
Phase I Trials
Traditional Cohorts-of-3 Phase I Design
Although the cohort-of-3 phase I design has considerable appeal (it is straightforward to conduct, easy to explain, and has considerable historical precedent), it poses increasing challenges in modern clinical trials, particularly with the newer agents discussed later. For instance, a cohort-of-3 design may require a long time and many patients to reach the MTD, with many patients treated at suboptimal dose levels. Goldberg and colleagues2 conducted a cohort-of-3 phase I study to determine the MTD and DLT of CPT-11 in the regimen of CPT-11/5-FU/LV for patients with metastatic or locally advanced cancer, where CPT-11 was administered on day 1 and 5-fluorouracil/leucovorin was administered on days 2 to 5 of a 21-day cycle. Fifty-six patients were accrued during a period of 38 months, and 13 dose levels were studied. A different design coupled with a different end point might have shortened the study significantly.
Newer Phase I Designs
Cytostatic or targeted therapies are novel cancer therapies for which the shape of the dose-response curve is unknown. In theory, the dose-response curve could be monotone nondecreasing, quadratic, or increasing with a plateau. As such, the assumption that increasing doses will always lead to increasing efficacy, which is an appropriate assumption for cytotoxic agents, is no longer reasonable. In a phase I trial with such an agent, an appropriate goal is to estimate the biologically optimal dose (BOD) (i.e., the dose that has maximal efficacy with acceptable toxicity). In other words, the trial must incorporate both toxicity and efficacy to estimate the BOD. One developmental pathway is to use the continual reassessment method (CRM), a Bayesian approach, to model the dose-response relationship. The CRM was first introduced by O’Quigley and coauthors3; the main idea is to calculate a starting dose using a preliminary estimate (prior information) of the MTD and then to use each patient’s DLT status at a given dose level, together with a statistical model, to define future dose levels. The CRM has been widely considered in testing cytostatic or targeted therapies.
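The CRM updating step described above can be sketched in a few lines. This is a minimal illustration of a toxicity-only CRM, not the method of any particular trial: the one-parameter power model, the normal prior with variance 1.34, the skeleton of prior DLT probabilities, the target rate of 0.25, and the function name `crm_recommend` are all assumptions chosen for the example, and practical safeguards (such as not skipping dose levels) are omitted.

```python
from math import exp

def crm_recommend(skeleton, target, doses_given, dlt_flags,
                  prior_var=1.34, grid_lo=-5.0, grid_hi=5.0, grid_n=2001):
    """Return the index of the dose whose posterior mean DLT probability
    is closest to the target rate.

    Assumed model: P(DLT at dose i) = skeleton[i] ** exp(a),
    with prior a ~ Normal(0, prior_var). The posterior is approximated
    on an evenly spaced grid of values of a.
    """
    step = (grid_hi - grid_lo) / (grid_n - 1)
    grid = [grid_lo + k * step for k in range(grid_n)]

    # Unnormalised posterior weight at each grid point: prior x likelihood.
    weights = []
    for a in grid:
        w = exp(-a * a / (2.0 * prior_var))
        for dose, dlt in zip(doses_given, dlt_flags):
            p = skeleton[dose] ** exp(a)
            w *= p if dlt else (1.0 - p)
        weights.append(w)
    total = sum(weights)

    # Posterior mean DLT probability at every dose level.
    post_mean = [
        sum(w * skeleton[i] ** exp(a) for a, w in zip(grid, weights)) / total
        for i in range(len(skeleton))
    ]
    return min(range(len(skeleton)), key=lambda i: abs(post_mean[i] - target))

# Hypothetical skeleton (prior guesses of DLT probability per dose level).
skeleton = [0.05, 0.10, 0.25, 0.35, 0.50]

# Three patients treated at dose index 2, all with a DLT: de-escalate.
after_toxic = crm_recommend(skeleton, 0.25, [2, 2, 2], [True, True, True])
# Three patients treated at dose index 2, none with a DLT: escalate.
after_clean = crm_recommend(skeleton, 0.25, [2, 2, 2], [False, False, False])
print("next dose index after 3/3 DLTs:", after_toxic)
print("next dose index after 0/3 DLTs:", after_clean)
```

Each new patient's outcome is appended to `doses_given` and `dlt_flags`, and the recommendation is recomputed before the next patient is treated. Estimating a BOD would additionally require an efficacy model, which this sketch does not attempt.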
Phase II Trials
In phase II trials, the goal is to establish clinical activity and further evaluate the treatment’s toxicity. Unlike phase I trials, which commonly accrue patients with a variety of cancers, a phase II trial should restrict accrual to a reasonably well-defined patient population. Historically, the response rate has been the most common end point for phase II trials. However, in the last decade, progression-free survival/disease-free survival (PFS/DFS) rates and overall survival (OS) rates have been used increasingly as the primary end point in the phase II setting, in addition to a series of new end points. Because imaging subtleties for PFS/DFS are much less problematic than for the response end point (one only needs to distinguish between progression and no progression, not tumor response) and because PFS/DFS are not affected by therapies administered subsequent to the therapy under study (OS is affected), PFS is gaining popularity as the primary end point for both phase II and phase III trials in advanced disease. As might be expected, PFS/DFS as an end point does present multiple challenges; details can be found in the study of Wu and Sargent.4
Traditional Single-Arm Phase II Designs: One-Stage Design
The considerations just given might lead to the following study design criteria:
Using well-known methods, the statistician could determine that each of these criteria would be satisfied if 40 patients were accrued, and the treatment would be recommended for further consideration if, and only if, at least 7 of the 40 patients responded. Table 12-1 gives the probabilities that the treatment will be carried forward for various hypothetical response rates.
It is worth noting that the formal decision rule is to conclude the treatment is worth carrying forward if the estimated response rate is at least 7/40 = 17.5% but not if the response rate is less than or equal to 6/40 = 15%. As Table 12-1 indicates, the probability of observing a response rate of at least 17.5% in our trial if the true response rate is 10% is only .10. In contrast, the probability of observing a response rate of at least 17.5% is .90 if the true response rate is 25%, and a positive recommendation is extremely likely (probability = .98) if the true response rate is as high as 30%. Other factors may enter into the final decision, particularly if the observed response rate is close to 15% or 20%.
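The probabilities quoted above follow directly from the binomial distribution and can be checked with a short calculation. The sketch below assumes only the rule stated in the text (40 patients, carry forward if at least 7 respond); the helper name `prob_at_least` and the particular list of hypothetical response rates are illustrative choices.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

N_PATIENTS = 40      # total accrual in the one-stage design
MIN_RESPONSES = 7    # carry the treatment forward if responses >= 7

for true_rate in (0.05, 0.10, 0.15, 0.20, 0.25, 0.30):
    p_forward = prob_at_least(MIN_RESPONSES, N_PATIENTS, true_rate)
    print(f"true response rate {true_rate:.2f}: "
          f"P(carry forward) = {p_forward:.3f}")
```

Changing `N_PATIENTS` or `MIN_RESPONSES` shows how sensitive the error probabilities are to the choice of cutoff.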
Traditional Single-Arm Phase II Designs: Two-Stage Design
This new design has some nice properties. The maximum number of patients required is the same as before, and, as shown in Table 12-2, the new design still satisfies each of the desired criteria 1 to 3 listed previously. However, if the new treatment has a true response rate of 5% or less, the study has at least a 0.46 chance of stopping after 15 patients have been accrued, thereby sparing 25 patients from treatment with an inactive regimen.
Now consider a design that requires early stopping if only 0 or 1 of the first 15 patients responds; otherwise, 25 more patients are accrued, and the treatment is recommended for further consideration if at least 7 patients respond. This design provides much better protection against the possibility of accruing an excessive number of patients to an ineffective regimen. However, if this design is used, it would be slightly less likely that a truly effective regimen would be recommended for further consideration. Table 12-3 shows the properties of this design.
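The operating characteristics of a two-stage rule such as the one just described can be computed exactly by summing over the first-stage outcomes. The sketch below is illustrative: the function name `two_stage` and its parameter names are not from the text, and the default values encode the rule above (stop if 0 or 1 of the first 15 respond; otherwise accrue 25 more and recommend if at least 7 of 40 respond).

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def two_stage(p, n1=15, n2=25, r1=1, min_total=7):
    """Return (P(stop early), P(recommend)) for a two-stage design.

    Stage 1: accrue n1 patients and stop if responses <= r1.
    Stage 2: otherwise accrue n2 more patients and recommend the
    treatment if the total number of responses is >= min_total.
    """
    p_stop = sum(binom_pmf(x, n1, p) for x in range(r1 + 1))
    p_recommend = 0.0
    for x1 in range(r1 + 1, n1 + 1):
        still_needed = max(min_total - x1, 0)
        p_enough = sum(binom_pmf(x2, n2, p)
                       for x2 in range(still_needed, n2 + 1))
        p_recommend += binom_pmf(x1, n1, p) * p_enough
    return p_stop, p_recommend

for true_rate in (0.05, 0.10, 0.25, 0.30):
    p_stop, p_rec = two_stage(true_rate)
    print(f"true rate {true_rate:.2f}: P(stop at 15) = {p_stop:.3f}, "
          f"P(recommend) = {p_rec:.3f}")
```

Calling `two_stage(0.05, r1=0)` gives an early-stopping probability of 0.95^15, about 0.46, which matches the figure quoted for the preceding design if that design stops only when none of the first 15 patients responds (an inference from the quoted probability, not a rule stated in the text).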
Numerous authors have discussed optimal strategies for choosing phase II designs, including Gehan,5 Herson,6 Lee, Staquet and Simon,7 Fleming,8 Chang and associates,9 Simon,10 Therneau, Wieand and Chang,11 Bryant and Day,12 and Thall, Simon and Estey.13
Newer Phase II Designs: Randomized Phase II Design
Single-arm phase II designs can demonstrate biologic activity of an agent with relatively small sample size and short study duration. However, the potential for patient selection bias, the evolving methodologies for the assessment of “success” (e.g., changes in CT scanners, changes in response criteria), and the general lack of robustness of the historical rates on which this type of design relies result in a high possibility that a positive single-arm phase II trial will be followed by a negative phase III study.14 A potential solution to this fundamental limitation of single-arm phase II designs is the use of a randomized design. For example, two types of randomized phase II designs, the randomized selection designs and the randomized screening designs, have been proposed and used widely.
In a selection design, all arms are considered experimental arms. These could be different new regimens or different doses or schedules of the same regimen. The sample size is calculated so that an inferior arm (for instance, one in which the response rate is 15% lower) will be selected for the future phase III study only with a small probability, perhaps 10%. In a standard selection design, the arm with the highest estimated success rate for the primary end point will be recommended for a follow-up phase III study, no matter how small the difference of estimates might be among arms.15 On the other hand, a flexible selection design, or “pick-the-winner” design, will recommend the arm with the best estimated primary end point only if the difference is larger than a prespecified criterion.16 Otherwise, toxicities, expense, and other factors will be taken into consideration when making the decision.
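For a two-arm standard selection design with a binary end point, the chance of selecting the inferior arm can be computed exactly, and the per-arm sample size can be found by search. The sketch below makes several assumptions that the text does not state: the response rates (0.20 versus 0.35, reading "15% lower" as an absolute difference), random tie-breaking when both arms have the same number of responses, and the helper names `prob_select_inferior` and `smallest_n`.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def prob_select_inferior(n, p_inferior, p_superior):
    """Probability that the inferior arm shows more responses than the
    superior arm with n patients per arm; ties are split at random
    (an assumption made for this sketch)."""
    pmf_inf = [binom_pmf(x, n, p_inferior) for x in range(n + 1)]
    pmf_sup = [binom_pmf(x, n, p_superior) for x in range(n + 1)]
    total = 0.0
    for x_inf in range(n + 1):
        for x_sup in range(n + 1):
            joint = pmf_inf[x_inf] * pmf_sup[x_sup]
            if x_inf > x_sup:
                total += joint
            elif x_inf == x_sup:
                total += 0.5 * joint
    return total

def smallest_n(p_inferior, p_superior, max_error=0.10, n_max=200):
    """Smallest per-arm n with P(select inferior arm) <= max_error,
    or None if no n up to n_max qualifies."""
    for n in range(1, n_max + 1):
        if prob_select_inferior(n, p_inferior, p_superior) <= max_error:
            return n
    return None

n_per_arm = smallest_n(0.20, 0.35)
print("patients per arm:", n_per_arm)
print("P(select inferior arm):",
      round(prob_select_inferior(n_per_arm, 0.20, 0.35), 3))
```

A flexible ("pick-the-winner") rule could be examined the same way by counting an outcome as a selection only when the observed difference exceeds the prespecified criterion.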
If a standard of care is included in the comparison, a screening design should be applied.17 A screening phase II design is similar to a randomized phase III trial but has larger type I and type II errors. For instance, a range of 10% to 20% for both error rates is acceptable. A screening phase II design provides a head-to-head comparison between the experimental regimen and the standard of care using a relatively small sample size, and it gives clear guidance as to the likelihood of success if the agent is moved forward into phase III testing.
Phase III Trials
It is generally recognized that judging the value of a new therapy by comparing it with historical data may give an erroneous impression of the therapy’s efficacy. Pocock18 related a number of illuminating examples of this phenomenon. Therefore the phase III trial, in which a new agent or modality is tested against an accepted standard treatment in a randomized comparison, is considered the most satisfactory method of establishing the value of the proposed treatment.
Under most circumstances, the goal of a phase III trial is to determine whether the proposed treatment is superior to the standard treatment (superiority trial). Occasionally, the goal is to “show that the difference between the new and active control treatment is small, small enough to allow the known effectiveness of the active control to support the conclusion that the new test drug is also effective” (noninferiority trial).19 Because the goal of a phase III trial is to make a definitive conclusion regarding a new treatment’s efficacy, enough patients need to be accrued to guarantee a small probability of false-positive results (i.e., declaring that a regimen is effective when in fact it is not) and a large probability of true-positive results (i.e., declaring that a regimen is effective when in fact it is). The probability of false-positive results is also called the size of the study, which usually is limited to 5% or less. The probability of true-positive results is also called the power of the study, which usually is set at 80% to 95%. The primary end points for comparison are usually survival and PFS/DFS, and a common secondary end point is quality of life (QOL). In this section, we discuss the rationale for requiring randomized treatment assignment and stratification in a phase III trial. We introduce the intent-to-treat principle, and we emphasize the importance of monitoring in ongoing trials.
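To illustrate how size and power drive the scale of a phase III trial with a survival end point, the sketch below uses Schoenfeld's approximation for the number of events needed by a log-rank test. This formula is not given in the text; the two-sided test, 1:1 randomization, the target hazard ratio of 0.75, and the function name `required_events` are all assumptions made for the example.

```python
from math import ceil, log
from statistics import NormalDist

def required_events(hazard_ratio, alpha=0.05, power=0.90):
    """Approximate number of events for a two-sided log-rank test with
    1:1 randomization (Schoenfeld's approximation):

        events = 4 * (z_{1 - alpha/2} + z_{power})^2 / (ln HR)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(4 * (z_alpha + z_power) ** 2 / log(hazard_ratio) ** 2)

for power in (0.80, 0.90, 0.95):
    events = required_events(0.75, alpha=0.05, power=power)
    print(f"size 0.05, power {power:.2f}, HR 0.75: {events} events")
```

The result is a number of events, not patients; the number of patients to accrue depends on the expected event rate and follow-up time, which this sketch does not model.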