Statistics and Clinical Trials

Published on 03/04/2015 by admin

Filed under Hematology, Oncology and Palliative Medicine

Last modified 03/04/2015


Chapter 12 Statistics and Clinical Trials

In this chapter we address some of the statistical issues associated with clinical trials. We begin with a short description of the basic role of statistics and statisticians and illustrate some sample communications between investigators and statisticians. We provide an overview of the statistical issues relevant to each phase of clinical trials (phases I, II, and III) and discuss some of the unique aspects for phase II and III trials. We introduce the standard and newly developed phase II experimental designs. We are particularly interested in the essential roles of randomization and stratification in comparing new and standard therapies. We discuss the important intent to treat principle, which is fundamental to analyzing phase III trials. We emphasize the importance of appropriate monitoring in ongoing trials. In addition, we consider some of the special problems that arise in the analysis of survival data that are caused by the phenomenon of censoring, which occurs when a patient’s time to death cannot be completely determined either because he or she becomes lost to follow-up before death or is still alive when the data are to be analyzed. Finally, we briefly discuss several relevant topics in modern clinical trials: surrogate end points, biomarkers, and adaptive design.

Our intent in this chapter is to provide the reader with insight regarding the use of statistics (and statisticians) in the design and analysis of clinical trials rather than to provide all of the details required for an investigator to perform his or her own analyses. Several excellent texts provide the details required in performing analyses, and we cite several such references.

We define a clinical trial as a designed study involving the treatment of prospectively accrued humans that is specified by a document (protocol) with specific goals and analysis plans. Meinert1 describes some of the unique aspects of clinical trials that distinguish them from other medical research studies, including, among others, observational studies and case-control studies, and enumerates the requirements of a good protocol document.

Why Are Statistics and Statisticians Useful?

Collaboration between statisticians and clinical investigators can be extremely fruitful, and we hope that by the time the reader finishes this chapter, several advantages of such collaborations will have become apparent. However, at the most basic level, statisticians are interested in ensuring that at the end of an experiment, the conclusion is valid and reproducible. Mathematically, this dictates emphasis on two quantities: variance and bias. Any investigator involved with clinical research data quickly becomes aware of the tremendous variation between individuals. Two patients who present with cancer may have nearly identical clinical (and pathologic) characteristics and receive the same treatment, but one may fail (experience relapse or die) within months whereas the other may be cured. In fact, in many oncology clinical studies, particularly those involving the effect of treatment on the time to failure, we are still unable to “explain” as much as half the variation among patients when we fit a model using the known prognostic variables. This is often quite different from the experience of laboratory researchers, who may be accustomed to seeing little variation among sampling units relative to the size of the treatment effects they are trying to measure. A practical definition of variance could be the unexplained differences in outcomes that exist among seemingly similar patients.

If one looks at two sets of 15 patients who differ in one characteristic (possibly treatment) and notices that one group has a median survival time that is 2 months longer than the other group, one cannot immediately conclude that the characteristic is truly associated with survival difference. Statistical techniques will be required to determine whether a median survival difference of 2 months was quite likely to occur even if two sets of 15 patients who did not differ in the characteristic had been compared.
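The point can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions that are not part of the chapter: survival times are drawn from an exponential distribution with a true median of 12 months in both groups, there is no censoring, and the function name and default values are hypothetical. It estimates how often two groups of 15 patients with identical true survival show sample medians that differ by 2 months or more.

```python
import math
import random
import statistics

def chance_of_large_median_gap(n_per_group=15, true_median=12.0, gap=2.0,
                               n_sim=5000, seed=1):
    """Fraction of simulated trials in which two groups with the SAME true
    survival distribution have sample medians differing by at least `gap`."""
    rng = random.Random(seed)
    # Exponential rate that gives the assumed true median (median = ln 2 / rate).
    rate = math.log(2) / true_median
    hits = 0
    for _ in range(n_sim):
        group_a = [rng.expovariate(rate) for _ in range(n_per_group)]
        group_b = [rng.expovariate(rate) for _ in range(n_per_group)]
        if abs(statistics.median(group_a) - statistics.median(group_b)) >= gap:
            hits += 1
    return hits / n_sim

fraction = chance_of_large_median_gap()
print(f"Share of null comparisons with a median gap of 2+ months: {fraction:.2f}")
```

Under these assumptions, a gap of this size arises by chance in well over half of the simulated comparisons, which is why an observed 2-month difference between two small groups cannot be taken at face value.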

Bias, on the other hand, can occur in medical research studies if treatments are compared between groups of patients who are not equivalent in terms of characteristics that are associated with prognosis. The fundamental strategy used in clinical trials to avoid biased treatment comparisons is randomization of treatment assignments, which guarantees that the treatment assigned to a patient is not related to his or her prognosis. In observational studies, such an assurance is simply not possible. Although it may be feasible to adjust for potential sources of bias at the data analysis stage, based on our present knowledge of factors influencing treatment outcomes within individual patients, there is no way to ensure that patients receiving two different treatments are identical in the absence of randomization. Because it is difficult (practically impossible) to identify or quantify all sources of bias, results obtained from randomized clinical trials are generally accepted to be more reliable than those obtained from observational studies. Therefore the role of the statistician in controlling bias has two aspects: (1) ensuring patient comparability between treatment groups at the study design stage, and (2) using appropriate statistical methodology in an attempt to eliminate biases at the analysis stage.

Clinical Trials

Clinical trials performed to develop and test new agents or modalities in oncology are often categorized as phase I, phase II, or phase III trials according to their primary aims.

Phase I Trials

Phase I trials are generally the first trials involving human subjects in which a new treatment is tested. The goal of a phase I trial is to test a new regimen’s tolerability and toxicity. These trials usually enroll a limited number of patients who have exhausted other treatment options. Phase I trials may enroll patients with a specific tumor type only, or be open to patients with different tumors. Different primary end points and study designs should be considered for regimens with different cancer therapeutic mechanisms.

Traditional Cohorts-of-3 Phase I Design

Historically, cancer therapies have been designed to act as cytotoxic, or cell-killing, agents. The fundamental assumption regarding the dose-related activity of such agents is that there exists a monotone nondecreasing dose-response curve, meaning that as the size of the dose increases, tumor shrinkage will also increase, and this phenomenon should translate into increasing clinical benefit. Under this assumption, both the toxicity of and the clinical benefit of the agent under study will increase as the dose increases; therefore, an appropriate goal of a phase I trial is to find the highest dose with acceptable toxicity. Because the monotone nondecreasing dose-response curve has been observed for most cytotoxic therapies, toxicity has historically been used as the primary end point for identifying the dose that has the greatest likelihood of being effective in subsequent testing.

In this context, the typical goal for phase I clinical trials has been to determine the maximum tolerated dose (MTD), which has traditionally been defined as the highest dose level at which no more than one of six patients experiences unacceptable, or dose-limiting, toxicity (DLT). The traditional design for this type of agent is the cohort-of-3 design. Specifically, three patients are enrolled at the starting dose level. If no DLT is observed, three patients will be enrolled at the next higher dose level. If one DLT is observed, another three patients will be enrolled at the same dose level. If only one DLT is observed among the six patients at a given dose level, three patients will be enrolled at the next higher dose level. If two or more DLTs are observed (out of three or six patients at a given dose level), the MTD will be considered to have been exceeded, and the next lower dose level will be defined as the MTD as long as six patients have been evaluated at that level.
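The escalation rules just described can be sketched as a short simulation. This is a minimal sketch rather than a validated implementation: the function name, the list of true DLT probabilities, and the random seed are hypothetical, and two conventions are assumptions not stated in the text (the highest available dose level is expanded to six patients before being declared the MTD, and the function returns None if even the lowest level is too toxic).

```python
import random

def three_plus_three(dlt_probs, rng):
    """Simulate one cohort-of-3 trial.

    dlt_probs: assumed true DLT probability at each dose level (hypothetical).
    Returns the index of the dose level declared the MTD, or None if the
    lowest level already exceeds the MTD.
    """
    n_levels = len(dlt_probs)
    treated = [0] * n_levels       # patients evaluated at each level
    dlts = [0] * n_levels          # DLTs observed at each level
    max_level = n_levels - 1       # highest level not yet shown to exceed the MTD
    level = 0
    while True:
        # Enroll a cohort of three patients at the current level.
        treated[level] += 3
        dlts[level] += sum(rng.random() < dlt_probs[level] for _ in range(3))

        if dlts[level] >= 2:
            # Two or more DLTs (out of three or six): MTD exceeded at this level.
            max_level = level - 1
            if max_level < 0:
                return None
            level = max_level
            if treated[level] >= 6:
                return level       # six patients already evaluated one level down
            continue               # otherwise bring that level up to six patients

        if treated[level] == 3 and (dlts[level] == 1 or level == max_level):
            continue               # expand the same level to six patients

        if level == max_level:
            return level           # at most one DLT in six: declare the MTD
        level += 1                 # 0 of 3, or at most 1 of 6: escalate

# Example run with made-up DLT probabilities for five dose levels.
rng = random.Random(0)
print("MTD index:", three_plus_three([0.05, 0.10, 0.20, 0.35, 0.50], rng))
```

Running the function many times with different seeds gives a rough picture of how often each dose level would be selected and how many patients are treated below it.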

The cohort-of-3 phase I design has considerable appeal: it is straightforward to conduct, easy to explain, and has considerable historical precedent. However, its use poses increasing challenges in modern clinical trials, particularly with the newer agents that will be discussed later. For instance, one challenge in a cohort-of-3 design is that it may require a long time and many patients to reach the MTD, with many patients treated at suboptimal dose levels. Goldberg and colleagues2 conducted a cohort-of-3 phase I study to determine the MTD and DLT of CPT-11 in the regimen of CPT-11/5-FU/LV for patients with metastatic or locally advanced cancer, where CPT-11 was administered on day 1 and 5-fluorouracil/leucovorin was administered on days 2 to 5 of a 21-day cycle. Fifty-six patients were accrued during a period of 38 months, and 13 dose levels were studied. Use of a different design coupled with a different end point might have shortened the study significantly.

Newer Phase I Designs

Cytostatic or targeted therapies are novel cancer therapies in which the shape of the dose-response curve is unknown. In theory, the dose-response curve could be monotone nondecreasing, quadratic, or increasing with a plateau. As such, the assumption that increasing doses will always lead to increasing efficacy, which is an appropriate assumption for cytotoxic agents, is no longer reasonable. In a phase I trial with such an agent, an appropriate goal is to estimate the biologically optimal dose (BOD) (i.e., the dose that has maximal efficacy with acceptable toxicity). In other words, the trial must incorporate both toxicity and efficacy to estimate the BOD. One developmental pathway is to use the continual reassessment method (CRM), a Bayesian approach, to model the dose-response relationship. The CRM was first introduced by O’Quigley and coauthors3; the main idea is to calculate a starting dose using a preliminary estimate (prior information) of the MTD and then to use each patient’s DLT status at a given dose level, together with a statistical model, to define future dose levels. The CRM has been widely considered in testing cytostatic or targeted therapies.
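The updating step at the heart of the CRM can be sketched as follows. This is a deliberately simplified, toxicity-only illustration and not the method as specified by O’Quigley and coauthors: the one-parameter power model, the normal prior with standard deviation 1, the grid approximation to the posterior, the prior guesses of DLT probability (the "skeleton"), and the 25% target are all assumptions chosen for the example.

```python
import math

def crm_recommend(skeleton, target, doses_given, dlt_flags, prior_sd=1.0):
    """Recommend the dose index whose estimated DLT probability is closest
    to `target`, given the DLT outcomes observed so far.

    Assumed model: P(DLT at dose i) = skeleton[i] ** exp(a), a ~ Normal(0, prior_sd).
    """
    # Evaluate the unnormalized posterior of `a` on a symmetric grid.
    grid = [-4 + 8 * k / 400 for k in range(401)]
    weights = []
    for a in grid:
        log_w = -a * a / (2 * prior_sd ** 2)          # log prior, up to a constant
        for dose, had_dlt in zip(doses_given, dlt_flags):
            p = skeleton[dose] ** math.exp(a)
            log_w += math.log(p if had_dlt else 1 - p)  # Bernoulli log-likelihood
        weights.append(math.exp(log_w))
    # Posterior mean of `a`, then plug it in to estimate each dose's DLT probability.
    a_hat = sum(a * w for a, w in zip(grid, weights)) / sum(weights)
    estimates = [s ** math.exp(a_hat) for s in skeleton]
    return min(range(len(skeleton)), key=lambda i: abs(estimates[i] - target))

skeleton = [0.05, 0.10, 0.25, 0.35, 0.50]   # hypothetical prior DLT guesses
# Three patients treated at dose index 2, all with a DLT: the model steps down.
print(crm_recommend(skeleton, 0.25, [2, 2, 2], [True, True, True]))
# Three patients treated at dose index 2, none with a DLT: the model does not step down.
print(crm_recommend(skeleton, 0.25, [2, 2, 2], [False, False, False]))
```

In practice the recommendation is recomputed after each patient or cohort, and published variants add safeguards such as forbidding skipped dose levels; none of those refinements are shown here.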

In addition to Bayesian-based designs, several other newer phase I designs have been proposed to overcome the challenges of the cohort-of-3 design. Each proposal has its own advantages and challenges. For instance, the original CRM was criticized for its recommended dose levels (a starting dose that was too high and an interval between dose levels that was too large) and for the limited number of patients per dose level. Careful consideration and teamwork between the clinician and statistician are critical in selecting the most appropriate phase I design tailored to the specific clinical situation.

Phase II Trials

In phase II trials, the goal is to establish clinical activity and further evaluate the treatment’s toxicity. Unlike phase I trials, which commonly accrue patients with a variety of cancers, a phase II trial should restrict accrual to a reasonably well-defined patient population. Historically, the response rate has been the most common end point for phase II trials. However, in the last decade, progression-free survival/disease-free survival (PFS/DFS) rates and overall survival (OS) rates have been used increasingly as the primary end point in the phase II setting, in addition to a series of new end points. Because imaging subtleties for PFS/DFS are much less problematic than for the response end point (one only needs to distinguish between progression and no progression, not tumor response) and because PFS/DFS are not affected by therapies administered subsequent to the therapy under study (OS is affected), PFS is gaining popularity as the primary end point for both phase II and phase III trials in advanced disease. As might be expected, PFS/DFS as an end point does present multiple challenges; details can be found in the study of Wu and Sargent.4

Traditional Single-Arm Phase II Designs: One-Stage Design

The starting point for a discussion of phase II trials is the single-arm, one-stage design. In these trials, patients are accrued to a single arm and are treated at a single-dose level that has been suggested by earlier phase I trials. If the treatment appears sufficiently active in comparison with historical success rates (typically, tumor response) in patients with disease and prognosis similar to those accrued to the phase II trial, the treatment may be considered a viable candidate for the phase III setting. Toxicities, expense, and other considerations will often influence the decision as to whether the drug has enough promise compared with standard regimens to warrant further interest.

Let’s explain this type of design by means of an example. Suppose that in recent treatment trials for a given disease, the response rate ranged between 15% and 20%. The new trial might be designed to show that the new treatment will not be worth continuing if the true response rate is 10% or less and that the new treatment should be continued if the true response rate is 25% or greater. Such a design implies, by default, that if the response rate is in the range of 15% to 20%, the probability of either decision being made (i.e., to abandon the new therapy or to bring it to phase III testing) may be nearly equal. One critical consideration is to make sure that patients in the new study are comparable to those enrolled in the earlier studies, a factor that in many cases is impossible to verify.

As described, the typical phase II trial is structured to provide the basis for a recommendation either to abandon the therapy or to consider it for further testing. Working together, the investigator and the statistician choose an “unacceptable” response rate, p0 (10% in our example) and a “promising” response rate, p1 (25% in our example). The study is designed so that if the response rate is as low as p0 there will be little chance that the treatment will be recommended, whereas if the response rate is as high as p1 there is a high probability that the treatment will be recommended. These benchmark response rates should in no case be chosen arbitrarily; considerable thought must be given to their selection, and they should be fully justified by comparison with historical data in patients with comparable disease and prognosis.

The considerations just given might lead to the following study design criteria:

Using well-known methods, the statistician could determine that each of these criteria would be satisfied if 40 patients were accrued, and the treatment would be recommended for further consideration if, and only if, at least 7 of the 40 patients responded. Table 12-1 gives the probabilities that the treatment will be carried forward for various hypothetical response rates.

It is worth noting that the formal decision rule is to conclude the treatment is worth carrying forward if the estimated response rate is at least 7/40 = 17.5% but not if the response rate is less than or equal to 6/40 = 15%. As Table 12-1 indicates, the probability of observing a response rate of at least 17.5% in our trial if the true response rate is 10% is only .10. In contrast, the probability of observing a response rate of at least 17.5% is .90 if the true response rate is 25%, and a positive recommendation is extremely likely (probability = .98) if the true response rate is as high as 30%. Other factors may enter into the final decision, particularly if the observed response rate is close to 15% or 20%.
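The probabilities quoted for this design follow from the binomial distribution and can be checked with a few lines of code. The sketch below assumes only that patient responses are independent with a common true response rate; the function name is hypothetical.

```python
from math import comb

def prob_at_least(r, n, p):
    """P(X >= r) for X ~ Binomial(n, p): the chance that at least r of n
    patients respond when the true response rate is p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r, n + 1))

# Decision rule from the text: recommend the treatment if at least 7 of 40 respond.
for true_rate in (0.05, 0.10, 0.15, 0.20, 0.25, 0.30):
    chance = prob_at_least(7, 40, true_rate)
    print(f"true response rate {true_rate:.2f}: P(recommend) = {chance:.2f}")
```

The values at true response rates of 10%, 25%, and 30% should agree, to two decimals, with the .10, .90, and .98 cited above.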

Traditional Single-Arm Phase II Designs: Two-Stage Design

Given that many experimental treatments ultimately do not demonstrate efficacy, ethical considerations lead to the usual recommendation of an early stopping rule for lack of efficacy in phase II trials. For example, suppose that the first 15 patients who are accrued do not respond to the therapy. At that point, an investigator might begin to believe that the treatment is no better than standard therapy and might even be worse. This raises the question of whether the trial should be terminated. On the other hand, there is some probability that 7 of the next 25 patients will respond, resulting in a recommendation for continued consideration of the treatment. Statisticians have developed clinical trial designs that allow a prospectively specified examination of the data from the ongoing trial to address this issue.

Generally, the issue of early stopping rules is as follows: When designing a trial, it may be appropriate to introduce the possibility of stopping early if there is strong evidence that the treatment is no more and possibly less effective than the standard treatment, as long as doing so would not substantially reduce the probability of detecting a true beneficial effect (power). Of course, one might also want to stop early if the treatment appears to be extremely effective, although there is no ethical constraint against assigning patients to an apparently effective treatment, unless doing so would unduly prolong the further development of a very promising therapy.

Consider the following modification of the phase II study design just described. Accrual will proceed in two stages. In the first stage, 15 patients will be accrued, treated, and observed for clinical response. If no patients respond, the trial will be terminated, and the candidate treatment will not be recommended for further consideration; otherwise, an additional 25 patients will be accrued. If in total at least seven responses are observed, the treatment will be recommended for further consideration; otherwise, it will not be so recommended.

This new design has some nice properties. The maximum number of patients required is the same as before, and, as shown in Table 12-2, the new design still satisfies each of the desired criteria 1 to 3 listed previously. However, if the new treatment has a true response rate of 5% or less, the study has at least a .46 chance of stopping after 15 patients have been accrued, thereby sparing 25 patients from treatment with an inactive regimen.

Now consider a design that requires early stopping if only 0 or 1 of the first 15 patients responds; otherwise, 25 more patients are accrued, and the treatment is recommended for further consideration if at least 7 patients respond. This design provides much better protection against the possibility of accruing an excessive number of patients to an ineffective regimen. However, if this design is used, it would be slightly less likely that a truly effective regimen would be recommended for further consideration. Table 12-3 shows the properties of this design.

Notice that with this design, there is a .83 probability that the study will terminate after 15 patients are entered if the regimen has only a 5% response rate. However, the probability of recommending the treatment is now only .86 if there is a 25% response rate.
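The operating characteristics of such two-stage rules can be computed exactly by summing over the possible first-stage outcomes. The following is a minimal sketch under the assumption of independent responses with a common true response rate; the function and parameter names are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_stage(n1, r1, n2, r, p):
    """Two-stage design: stop after n1 patients if responses <= r1; otherwise
    accrue n2 more and recommend the treatment if total responses >= r.
    Returns (probability of early stopping, probability of recommending)."""
    p_stop = sum(binom_pmf(x1, n1, p) for x1 in range(r1 + 1))
    p_recommend = 0.0
    for x1 in range(r1 + 1, n1 + 1):
        still_needed = max(r - x1, 0)   # responses required in the second stage
        p_second = sum(binom_pmf(x2, n2, p) for x2 in range(still_needed, n2 + 1))
        p_recommend += binom_pmf(x1, n1, p) * p_second
    return p_stop, p_recommend

# Early stopping if 0 of the first 15 respond, true response rate 5%.
print("stop if 0 of 15:", round(two_stage(15, 0, 25, 7, 0.05)[0], 2))
# Early stopping if 0 or 1 of the first 15 respond; recommend if at least 7 of 40 respond.
print("stop if 0-1 of 15:", round(two_stage(15, 1, 25, 7, 0.05)[0], 2))
print("P(recommend) at 25%:", round(two_stage(15, 1, 25, 7, 0.25)[1], 2))
```

These calls should reproduce, to two decimals, the early-stopping probabilities of .46 and .83 and the .86 probability of recommendation quoted in the text. Varying n1 and r1 shows the trade-off directly: stricter early stopping protects more patients from an inactive regimen at the cost of some power.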

Numerous authors have discussed optimal strategies for choosing phase II designs, including Gehan,5 Herson,6 Lee, Staquet and Simon,7 Fleming,8 Chang and associates,9 Simon,10 Therneau, Wieand and Chang,11 Bryant and Day,12 and Thall, Simon and Estey.13

Newer Phase II Designs: Randomized Phase II Design

Single-arm phase II designs can demonstrate biologic activity of an agent with a relatively small sample size and short study duration. However, the potential for patient selection bias, the evolving methodologies for the assessment of “success” (e.g., changes in CT scanners or in response criteria), and the general lack of robustness of the historical rates on which this type of design relies result in a high probability that a positive single-arm phase II trial will be followed by a negative phase III study.14 A potential solution to this fundamental limitation of single-arm phase II designs is the use of a randomized design. Two types of randomized phase II designs, the randomized selection design and the randomized screening design, have been proposed and used widely.

In a selection design, all arms are considered experimental arms. These could be different new regimens or different doses or schedules of the same regimen. The sample size is calculated to guarantee that an inferior arm (for instance, one in which the response rate is 15% lower) will be selected for the future phase III study only with a small probability, perhaps 10%. In a standard selection design, the arm with the highest estimated success rate for the primary end point will be recommended for a follow-up phase III study, no matter how small the difference in estimates among arms might be.15 On the other hand, a flexible selection design, or “pick-the-winner” design, will recommend the arm with the best estimated primary end point only if the difference is larger than a prespecified criterion.16 Otherwise, toxicities, expense, and other factors will be taken into consideration when making the decision.
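For a two-arm standard selection design with a response end point, the chance of picking the inferior arm can be computed exactly. The sketch below rests on assumptions that go beyond the text: the 15% gap is read as 15 percentage points, the response rates of 20% and 35% are made up for illustration, ties are broken by a fair coin flip, and the function names are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_select_inferior(n, p_inferior, p_superior):
    """Probability that the inferior arm shows the higher observed response
    count with n patients per arm (ties split evenly, an assumption)."""
    pmf_inf = [binom_pmf(k, n, p_inferior) for k in range(n + 1)]
    pmf_sup = [binom_pmf(k, n, p_superior) for k in range(n + 1)]
    prob = 0.0
    below = 0.0   # running P(superior arm has fewer than k responses)
    for k in range(n + 1):
        prob += pmf_inf[k] * (below + 0.5 * pmf_sup[k])
        below += pmf_sup[k]
    return prob

def smallest_n(p_inferior, p_superior, max_error=0.10, n_max=500):
    """Smallest per-arm sample size keeping the wrong-selection probability
    at or below max_error, or None if n_max is not enough."""
    for n in range(1, n_max + 1):
        if prob_select_inferior(n, p_inferior, p_superior) <= max_error:
            return n
    return None

n = smallest_n(0.20, 0.35)
print("patients per arm:", n)
print("P(select inferior arm):", round(prob_select_inferior(n, 0.20, 0.35), 3))
```

Because a selection design only ranks the arms and makes no formal test of their difference, the sample size it requires is much smaller than that of a comparative trial with the same treatment difference.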

If a standard of care is included in the comparison, a screening design should be applied.17 A screening phase II design is similar to a randomized phase III trial but has larger type I and type II errors. For instance, a range of 10% to 20% for both error rates is acceptable. A screening phase II design provides a head-to-head comparison between the experimental regimen and the standard of care using a relatively smaller sample size and clear guidance as to the likelihood of success if the agent is moved forward into phase III testing.

Overall, randomized phase II designs balance prognostic factors among all arms, avoid patient selection bias, and provide robust arm comparison results. The major concerns for randomized phase II designs are (1) the larger sample size required compared with a single-arm design, (2) the relatively high false-negative and false-positive rates that result from a sample size that is still small for a comparative trial, and (3) the possible desire (or ethical mandate) to skip a confirmatory phase III study given a positive phase II result.

Another type of newly developed randomized phase II design is the phase II/III design. A phase II/III design starts with a phase II component. Patients are randomized among several experimental arms, or between a standard-of-care arm and one or more experimental arms. The experimental arms are compared with a historical control or with the standard-of-care control to select the arms that enter the phase III component. The main advantages of this type of design are (1) the speed of transition from phase II to phase III and (2) the fact that data obtained from the phase II component can be applied toward the phase III comparison as long as the protocol treatment is not altered between phases.

Phase III Trials

It is generally recognized that judging the value of a new therapy by comparing it with historical data may give an erroneous impression of the therapy’s efficacy. Pocock18 related a number of illuminating examples of this phenomenon. Therefore the phase III trial, in which a new agent or modality is tested against an accepted standard treatment in a randomized comparison, is considered the most satisfactory method of establishing the value of the proposed treatment.

Under most circumstances, the goal of a phase III trial is to determine whether the proposed treatment is superior to the standard treatment (superiority trial). Occasionally, the goal is to “show that the difference between the new and active control treatment is small, small enough to allow the known effectiveness of the active control to support the conclusion that the new test drug is also effective” (noninferiority trial).19 Because the goal of a phase III trial is to reach a definitive conclusion regarding a new treatment’s efficacy, enough patients need to be accrued to guarantee a small probability of false-positive results (i.e., declaring that a regimen is effective when in fact it is not) and a large probability of true-positive results (i.e., declaring that a regimen is effective when in fact it is). The probability of false-positive results is also called the size of the study, which usually is limited to 5% or less. The probability of true-positive results is also called the power of the study, which usually is set at 80% to 95%. The primary end points for comparison are usually survival and PFS/DFS, and a common secondary end point is quality of life (QOL). In this section, we discuss the rationale for requiring randomized treatment assignment and stratification in a phase III trial. We introduce the intent to treat principle, and we emphasize the importance of monitoring in ongoing trials.
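To give a sense of how size and power drive the scale of a phase III trial with a survival end point, the sketch below uses a standard approximation that is not taken from this chapter: for a two-sided log-rank comparison with equal allocation, the required number of events is about 4(z_{1-alpha/2} + z_{power})^2 / (ln HR)^2, where HR is the hazard ratio to be detected. The function name and the hazard ratio of 0.75 are hypothetical.

```python
from math import ceil, log
from statistics import NormalDist

def required_events(hazard_ratio, alpha=0.05, power=0.90):
    """Approximate number of events (e.g., deaths) needed to detect the given
    hazard ratio with a two-sided test of size alpha and the stated power,
    assuming 1:1 randomization and proportional hazards."""
    z = NormalDist().inv_cdf            # standard normal quantile function
    numerator = 4 * (z(1 - alpha / 2) + z(power)) ** 2
    return ceil(numerator / log(hazard_ratio) ** 2)

for power in (0.80, 0.90, 0.95):
    print(f"power {power:.2f}: about {required_events(0.75, power=power)} events")
```

The calculation counts events, not patients; the number of patients to be accrued also depends on the accrual rate, the length of follow-up, and the expected event rate, which is one reason early collaboration with a statistician matters.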