Statistics and Clinical Trials

Chapter 12 Statistics and Clinical Trials

In this chapter we address some of the statistical issues associated with clinical trials. We begin with a short description of the basic role of statistics and statisticians and illustrate some sample communications between investigators and statisticians. We provide an overview of the statistical issues relevant to each phase of clinical trials (phases I, II, and III) and discuss some of the unique aspects for phase II and III trials. We introduce the standard and newly developed phase II experimental designs. We are particularly interested in the essential roles of randomization and stratification in comparing new and standard therapies. We discuss the important intent to treat principle, which is fundamental to analyzing phase III trials. We emphasize the importance of appropriate monitoring in ongoing trials. In addition, we consider some of the special problems that arise in the analysis of survival data that are caused by the phenomenon of censoring, which occurs when a patient’s time to death cannot be completely determined either because he or she becomes lost to follow-up before death or is still alive when the data are to be analyzed. Finally, we briefly discuss several relevant topics in modern clinical trials: surrogate end points, biomarkers, and adaptive design.

Our intent in this chapter is to provide the reader with insight regarding the use of statistics (and statisticians) in the design and analysis of clinical trials rather than to provide all of the details required for an investigator to perform his or her own analyses. Several excellent texts provide the details required in performing analyses, and we cite several such references.

We define a clinical trial as a designed study involving the treatment of prospectively accrued humans that is specified by a document (protocol) with specific goals and analysis plans. Meinert1 describes some of the unique aspects of clinical trials that distinguish them from other medical research studies, including, among others, observational studies and case-control studies, and enumerates the requirements of a good protocol document.

Why Are Statistics and Statisticians Useful?

Collaboration between statisticians and clinical investigators can be extremely fruitful, and we hope that by the time the reader finishes this chapter, several advantages of such collaborations will have become apparent. However, at the most basic level, statisticians are interested in ensuring that at the end of an experiment, the conclusion is valid and reproducible. Mathematically, this dictates emphasis on two quantities: variance and bias. Any investigator involved with clinical research data quickly becomes aware of the tremendous variation between individuals. Two patients who present with cancer may have nearly identical clinical (and pathologic) characteristics and receive the same treatment, but one may fail (experience relapse or die) within months whereas the other may be cured. In fact, in many oncology clinical studies, particularly those involving the effect of treatment on the time to failure, we are still unable to “explain” as much as half the variation among patients when we fit a model using the known prognostic variables. This is often quite different from the experience of laboratory researchers, who may be accustomed to seeing little variation among sampling units relative to the size of the treatment effects they are trying to measure. A practical definition of variance could be the unexplained differences in outcomes that exist among seemingly similar patients.

If one looks at two sets of 15 patients who differ in one characteristic (possibly treatment) and notices that one group has a median survival time that is 2 months longer than the other group, one cannot immediately conclude that the characteristic is truly associated with survival difference. Statistical techniques will be required to determine whether a median survival difference of 2 months was quite likely to occur even if two sets of 15 patients who did not differ in the characteristic had been compared.
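
To make this point concrete, the following sketch (ours, not part of the chapter) simulates two groups of 15 patients drawn from the same exponential survival distribution with a hypothetical 12-month median and tallies how often the observed median survivals differ by 2 months or more purely by chance; all parameter values are illustrative assumptions.

```python
# Two groups of 15 patients drawn from the SAME exponential survival distribution (median
# 12 months, an arbitrary choice): how often do the observed median survivals differ by
# 2 months or more purely by chance?
import numpy as np

rng = np.random.default_rng(0)
scale = 12.0 / np.log(2)            # exponential scale giving a 12-month median
n_per_group, n_sims = 15, 10_000

diffs = [abs(np.median(rng.exponential(scale, n_per_group))
             - np.median(rng.exponential(scale, n_per_group)))
         for _ in range(n_sims)]
print("P(|median difference| >= 2 months) ≈", np.mean(np.array(diffs) >= 2.0))
```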

Bias, on the other hand, can occur in medical research studies if treatments are compared between groups of patients who are not equivalent in terms of characteristics that are associated with prognosis. The fundamental strategy used in clinical trials to avoid biased treatment comparisons is randomization of treatment assignments, which guarantees that the treatment assigned to a patient is not related to his or her prognosis. In observational studies, such an assurance is simply not possible. Although it may be feasible to adjust for potential sources of bias at the data analysis stage, based on our present knowledge of factors influencing treatment outcomes within individual patients, there is no way to ensure that patients receiving two different treatments are identical in the absence of randomization. Because it is difficult (practically impossible) to identify or quantify all sources of bias, results obtained from randomized clinical trials are generally accepted to be more reliable than those obtained from observational studies. Therefore the role of the statistician in controlling bias has two aspects: (1) ensuring patient comparability between treatment groups at the study design stage, and (2) using appropriate statistical methodology in an attempt to eliminate biases at the analysis stage.

Clinical Trials

Clinical trials performed to develop and test new agents or modalities in oncology are often categorized as phase I, phase II, or phase III trials according to their primary aims.

Phase I Trials

Phase I trials are generally the first trials involving human subjects in which a new treatment is tested. The goal of a phase I trial is to test a new regimen’s tolerability and toxicity. These trials usually enroll a limited number of patients who have exhausted other treatment options. Phase I trials may enroll patients with a specific tumor type only, or be open to patients with different tumors. Different primary end points and study designs should be considered for regimens with different cancer therapeutic mechanisms.

Traditional Cohorts-of-3 Phase I Design

Historically, cancer therapies have been designed to act as cytotoxic, or cell-killing, agents. The fundamental assumption regarding the dose-related activity of such agents is that there exists a monotone nondecreasing dose-response curve, meaning that as the size of the dose increases, tumor shrinkage will also increase, and this phenomenon should translate into increasing clinical benefit. Under this assumption, both the toxicity of and the clinical benefit of the agent under study will increase as the dose increases; therefore, an appropriate goal of a phase I trial is to find the highest dose with acceptable toxicity. Because the monotone nondecreasing dose-response curve has been observed for most cytotoxic therapies, toxicity has historically been used as the primary end point for identifying the dose that has the greatest likelihood of being effective in subsequent testing.

In this context, the typical goal for phase I clinical trials has been to determine the maximum tolerated dose (MTD), which has traditionally been defined as the highest dose level at which only one of six patients experiences unacceptable, or dose-limiting, toxicity (DLT). The traditional design for this type of agent is the cohort-of-3 design. Specifically, three patients are enrolled at the starting dose level. If no DLT is observed, three patients will be enrolled at the next higher dose level. If one DLT is observed, another three patients will be enrolled at the same dose level. Out of the six patients at a given dose level, if one DLT is observed, three patients will be enrolled at the next higher dose level. If two or more DLTs are observed (out of three or six patients at a given dose level), then the MTD will be considered as exceeded, and the next lower dose level will be defined as the MTD as long as six patients have been evaluated on that level.
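
The escalation rule just described is mechanical enough to simulate. The sketch below is our illustration, not part of the chapter: the dose-toxicity probabilities are hypothetical, and for brevity it omits the requirement that six patients be evaluated at the dose finally declared the MTD.

```python
# A sketch of the cohort-of-3 ("3 + 3") escalation rule described above, applied to a
# hypothetical set of true DLT probabilities.
import numpy as np
from collections import Counter

def cohort_of_3_trial(true_dlt_probs, rng):
    """Simulate one trial; return (declared MTD level or None, number of patients treated)."""
    n_patients, dose = 0, 0
    while True:
        dlts = rng.binomial(3, true_dlt_probs[dose])        # first cohort of 3 at this level
        n_patients += 3
        if dlts == 1:                                       # expand to 6 patients
            dlts += rng.binomial(3, true_dlt_probs[dose])
            n_patients += 3
        if dlts >= 2:                                       # MTD exceeded at this level
            return (dose - 1 if dose > 0 else None), n_patients
        if dose == len(true_dlt_probs) - 1:                 # no higher level left to test
            return dose, n_patients
        dose += 1                                           # 0/3 or 1/6 DLTs: escalate

rng = np.random.default_rng(1)
true_dlt_probs = [0.05, 0.10, 0.20, 0.35, 0.50]             # hypothetical dose-toxicity curve
results = [cohort_of_3_trial(true_dlt_probs, rng) for _ in range(5000)]
print("mean patients per trial:", np.mean([n for _, n in results]))
print("declared MTD levels:", Counter(m for m, _ in results))
```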

Although the cohort-of-3 phase I design has considerable appeal (it is straightforward to conduct, easy to explain, and has considerable historical precedent), its use poses increasing challenges in modern clinical trials, particularly with the newer agents discussed later. For instance, one challenge of the cohort-of-3 design is that it may require a long time and many patients to reach the MTD, with many patients treated at suboptimal dose levels. Goldberg and colleagues2 conducted a cohort-of-3 phase I study to determine the MTD and DLT of CPT-11 in the regimen of CPT-11/5-FU/LV for patients with metastatic or locally advanced cancer, in which CPT-11 was administered on day 1 and 5-fluorouracil/leucovorin was administered on days 2 to 5 of a 21-day cycle. Fifty-six patients were accrued during a period of 38 months, and 13 dose levels were studied. Use of a different design coupled with a different end point may have shortened the study significantly.

Newer Phase I Designs

Cytostatic or targeted therapies are novel cancer therapies in which the shape of the dose-response curve is unknown. In theory, the dose-response curve could be a monotone nondecreasing, quadratic, or increasing with a plateau curve. As such, the assumption that increasing doses will always lead to increasing efficacy, which is an appropriate assumption for cytotoxic agents, is no longer reasonable. In a phase I trial with such an agent, an appropriate goal is to estimate the biologically optimal dose (BOD) (i.e., the dose that has maximal efficacy with acceptable toxicity). In other words, the trial must incorporate both toxicity and efficacy to estimate the BOD. One developmental pathway is to use the continual reassessment method (CRM), a Bayesian approach, in modeling the dose-response relationship. The CRM was first introduced by O’Quigley and coauthors3; the main idea is to calculate a starting dose using a preliminary estimate (prior information) of the MTD and then use each patient’s DLT status at a given dose level and a statistical model to define future dose levels. The CRM has been widely considered in testing cytostatic or targeted therapies.
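
The sketch below illustrates the flavor of a CRM dose-recommendation step using a common one-parameter power model, p_i(a) = skeleton_i^exp(a), with a normal prior on a; the skeleton, prior variance, and target DLT rate are our illustrative assumptions rather than values from O'Quigley and coauthors.3

```python
# A sketch of a CRM dose-recommendation step with a one-parameter power model,
# p_i(a) = skeleton_i ** exp(a), and a normal prior on a. The skeleton, prior variance,
# and target DLT rate are illustrative assumptions.
import numpy as np

skeleton = np.array([0.05, 0.10, 0.20, 0.35, 0.50])   # prior guesses of DLT probability by dose
target = 0.25                                          # target DLT rate
grid = np.linspace(-4, 4, 801)                         # grid over the model parameter a
prior = np.exp(-grid**2 / (2 * 1.34)) / np.sqrt(2 * np.pi * 1.34)

def crm_recommend(doses_given, dlt_observed):
    """Return the recommended next dose and the posterior-mean DLT probability at each dose."""
    loglik = np.zeros_like(grid)
    for d, y in zip(doses_given, dlt_observed):
        p = skeleton[d] ** np.exp(grid)
        loglik += y * np.log(p) + (1 - y) * np.log(1 - p)
    post = prior * np.exp(loglik)
    post /= np.trapz(post, grid)                       # normalize the posterior over the grid
    p_hat = np.array([np.trapz(skeleton[d] ** np.exp(grid) * post, grid)
                      for d in range(len(skeleton))])
    return int(np.argmin(np.abs(p_hat - target))), p_hat

# Example: three patients treated at dose level 1 (0-indexed), one of whom had a DLT.
next_dose, p_hat = crm_recommend([1, 1, 1], [0, 1, 0])
print("recommended next dose level:", next_dose)
```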

In addition to Bayesian-based designs, several other newer phase I designs have been proposed to overcome the challenges of the cohort-of-3 design. Each proposal has its own advantages and unavoidable challenges. For instance, the original CRM was criticized for the dose levels it recommended (a starting dose that could be too high and intervals between dose levels that could be too large) and for the limited number of patients treated per dose level. Careful consideration and teamwork between the clinician and statistician are critical in selecting the most appropriate phase I design tailored to the specific clinical situation.

Phase II Trials

In phase II trials, the goal is to establish clinical activity and further evaluate the treatment’s toxicity. Unlike phase I trials, which commonly accrue patients with a variety of cancers, a phase II trial should restrict accrual to a reasonably well-defined patient population. Historically, the response rate has been the most common end point for phase II trials. However, in the last decade, progression-free survival/disease-free survival (PFS/DFS) rates and overall survival (OS) rates have been used increasingly as the primary end point in the phase II setting, in addition to a series of new end points. Because imaging subtleties for PFS/DFS are much less problematic than for the response end point (one only needs to distinguish between progression and no progression, not tumor response) and because PFS/DFS are not affected by therapies administered subsequent to the therapy under study (OS is affected), PFS is gaining popularity as the primary end point for both phase II and phase III trials in advanced disease. As might be expected, PFS/DFS as an end point does present multiple challenges; details can be found in the study of Wu and Sargent.4

Traditional Single-Arm Phase II Designs: One-Stage Design

The starting point for a discussion of phase II trials is the single-arm, one-stage design. In these trials, patients are accrued to a single arm and are treated at a single-dose level that has been suggested by earlier phase I trials. If the treatment appears sufficiently active in comparison with historical success rates (typically, tumor response) in patients with disease and prognosis similar to those accrued to the phase II trial, the treatment may be considered a viable candidate for the phase III setting. Toxicities, expense, and other considerations will often influence the decision as to whether the drug has enough promise compared with standard regimens to warrant further interest.

Let’s explain this type of design by means of an example. Suppose that in recent treatment trials for a given disease, the response rate ranged between 15% and 20%. The new trial might be designed to show that the new treatment will not be worth continuing if the true response rate is 10% or less and that the new treatment should be continued if the true response rate is 25% or greater. Such a design implies, by default, that if the response rate is in the range of 15% to 20%, the probability of either decision being made (i.e., to abandon the new therapy or to bring it to phase III testing) may be nearly equal. One critical consideration is to make sure that patients in the new study are comparable to those enrolled in the earlier studies, a factor that in many cases is impossible to verify.

As described, the typical phase II trial is structured to provide the basis for a recommendation either to abandon the therapy or to consider it for further testing. Working together, the investigator and the statistician choose an “unacceptable” response rate, p0 (10% in our example) and a “promising” response rate, p1 (25% in our example). The study is designed so that if the response rate is as low as p0 there will be little chance that the treatment will be recommended, whereas if the response rate is as high as p1 there is a high probability that the treatment will be recommended. These benchmark response rates should in no case be chosen arbitrarily; considerable thought must be given to their selection, and they should be fully justified by comparison with historical data in patients with comparable disease and prognosis.

The considerations just given might lead to the following study design criteria:

Using well-known methods, the statistician could determine that each of these criteria would be satisfied if 40 patients were accrued, and the treatment would be recommended for further consideration if, and only if, at least 7 of the 40 patients responded. Table 12-1 gives the probabilities that the treatment will be carried forward for various hypothetical response rates.

It is worth noting that the formal decision rule is to conclude the treatment is worth carrying forward if the estimated response rate is at least 7/40 = 17.5% but not if the response rate is less than or equal to 6/40 = 15%. As Table 12-1 indicates, the probability of observing a response rate of at least 17.5% in our trial if the true response rate is 10% is only .10. In contrast, the probability of observing a response rate of at least 17.5% is .90 if the true response rate is 25%, and a positive recommendation is extremely likely (probability = .98) if the true response rate is as high as 30%. Other factors may enter into the final decision, particularly if the observed response rate is close to 15% or 20%.
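
The probabilities in Table 12-1 come from binomial tail calculations. A minimal sketch of that arithmetic, assuming the 40-patient design with a cutoff of 7 responses described above, is as follows.

```python
# The probability that at least 7 of 40 patients respond, for several hypothetical true
# response rates; these are the quantities tabulated in Table 12-1.
from scipy.stats import binom

n, cutoff = 40, 7
for p in [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]:
    prob_recommend = binom.sf(cutoff - 1, n, p)   # P(X >= 7) when X ~ Binomial(40, p)
    print(f"true response rate {p:.2f}: P(recommend) = {prob_recommend:.2f}")
```

The printed values agree with the figures quoted in the text: roughly .10 when the true response rate is 10%, .90 at 25%, and .98 at 30%.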

Traditional Single-Arm Phase II Designs: Two-Stage Design

Given that many experimental treatments ultimately do not demonstrate efficacy, ethical considerations lead to the usual recommendation of an early stopping rule for lack of efficacy in phase II trials. For example, suppose that the first 15 patients who are accrued do not respond to the therapy. At that point, an investigator might begin to believe that the treatment is no better than standard therapy and might even be worse. This raises the question of whether the trial should be terminated. On the other hand, there is some probability that at least 7 of the next 25 patients will respond, resulting in a recommendation for continued consideration of the treatment. Statisticians have developed clinical trial designs that allow a prospectively specified examination of the data from the ongoing trial to address this issue.

Generally, the issue of early stopping rules is as follows: When designing a trial, it may be appropriate to introduce the possibility of stopping early if there is strong evidence that the treatment is no more and possibly less effective than the standard treatment, as long as doing so would not substantially reduce the probability of detecting a true beneficial effect (power). Of course, one might also want to stop early if the treatment appears to be extremely effective, although there is no ethical constraint against assigning patients to an apparently effective treatment, unless doing so would unduly prolong the further development of a very promising therapy.

Consider the following modification of the phase II study design just described. Accrual will proceed in two stages. In the first stage, 15 patients will be accrued, treated, and observed for clinical response. If no patients respond, the trial will be terminated, and the candidate treatment will not be recommended for further consideration; otherwise, an additional 25 patients will be accrued. If in total at least eight responses are observed, the treatment will be recommended for further consideration; otherwise, it will not be so recommended.

This new design has some nice properties. The maximum number of patients required is the same as before, and, as shown in Table 12-2, the new design still satisfies each of the desired criteria 1 to 3 listed previously. However, if the new treatment has a true response rate of 5% or less, the study has a 0.46 chance of stopping after 15 patients have been accrued, therefore sparing 25 patients from treatment with an inactive regimen.

Now consider a design that requires early stopping if only 0 or 1 of the first 15 patients responds; otherwise, 25 more patients are accrued, and the treatment is recommended for further consideration if at least 7 patients respond. This design provides much better protection against the possibility of accruing an excessive number of patients to an ineffective regimen. However, if this design is used, it would be slightly less likely that a truly effective regimen would be recommended for further consideration. Table 12-3 shows the properties of this design.

Notice that with this design, there is a .83 probability that the study will terminate after 15 patients are entered if the regimen has only a 5% response rate. However, the probability of recommending the treatment is now only .86 if there is a 25% response rate.
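
The operating characteristics quoted above follow from the same binomial arithmetic, now applied in two stages. A sketch for the second two-stage design (stop if 0 or 1 of the first 15 respond; otherwise recommend the treatment if at least 7 of all 40 respond) is given below.

```python
# Operating characteristics of the second two-stage design: stop after 15 patients if 0 or
# 1 respond; otherwise accrue 25 more and recommend the treatment if at least 7 of the 40
# respond in total.
from scipy.stats import binom

n1, n2, stop_if_at_most, recommend_if_at_least = 15, 25, 1, 7

def two_stage_probs(p):
    p_stop_early = binom.cdf(stop_if_at_most, n1, p)
    p_recommend = sum(
        binom.pmf(x1, n1, p) * binom.sf(recommend_if_at_least - x1 - 1, n2, p)
        for x1 in range(stop_if_at_most + 1, n1 + 1)
    )
    return p_stop_early, p_recommend

for p in [0.05, 0.10, 0.25, 0.30]:
    stop, rec = two_stage_probs(p)
    print(f"true rate {p:.2f}: P(stop after stage 1) = {stop:.2f}, P(recommend) = {rec:.2f}")
```

At a true response rate of 5% the early-stopping probability is about .83, and at 25% the probability of recommending the treatment is about .86, matching the figures cited above.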

Numerous authors have discussed optimal strategies for choosing phase II designs, including Gehan,5 Herson,6 Lee, Staquet and Simon,7 Fleming,8 Chang and associates,9 Simon,10 Therneau, Wieand and Chang,11 Bryant and Day,12 and Thall, Simon and Estey.13

Newer Phase II Designs: Randomized Phase II Design

Single-arm phase II designs can demonstrate biologic activity of an agent with relatively small sample size and short study duration. However, the potential for patient selection bias, the evolving methodologies for the assessment of “success” (i.e., changes in CT scanners, changes in response criteria), and a general lack of robustness of historical rates of this type of design result in a high possibility that a positive single-arm phase II trial will be followed by a negative phase III study.14 A potential solution to this fundamental limitation of single-arm phase II designs is the use of a randomized design. For example, two types of randomized phase II designs, the randomized selection designs and the randomized screening designs, have been proposed and used widely.

In a selection design, all arms are considered experimental arms. These could be different new regimens or different doses or schedules of the same regimen. The sample size is calculated to guarantee that with a very small probability, perhaps 10%, an inferior arm (for instance, one in which the response rate is 15% lower) will be selected for the future phase III study. In a standard selection design, the arm with the highest estimated success rate for the primary end point will be recommended for a follow-up phase III study, no matter how small the difference of estimates might be among arms.15 On the other hand, a flexible selection design, or “pick-the-winner” design, will recommend the arm with the best estimated primary end point if the difference is larger than a prespecified criterion.16 Otherwise, toxicities, expense, and other factors will be taken into consideration when making the decision.
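
A small simulation conveys how the selection error behaves. The sketch below is our illustration (the response rates and per-arm sample size are hypothetical): it estimates how often a two-arm pick-the-winner rule selects the arm whose true response rate is 15 percentage points lower.

```python
# How often does a two-arm "pick the winner" rule select the arm whose true response rate
# is 15 percentage points lower? The response rates and per-arm sample size are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
p_better, p_worse, n_per_arm, n_sims = 0.35, 0.20, 35, 20_000

x_better = rng.binomial(n_per_arm, p_better, n_sims)
x_worse = rng.binomial(n_per_arm, p_worse, n_sims)
# Ties are broken in favor of the better arm here; a real selection design would prespecify
# how ties are handled (toxicity, cost, and so on).
print("P(inferior arm selected) ≈", np.mean(x_worse > x_better))
```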

If a standard of care is included in the comparison, a screening design should be applied.17 A screening phase II design is similar to a randomized phase III trial but has larger type I and type II errors. For instance, a range of 10% to 20% for both error rates is acceptable. A screening phase II design provides a head-to-head comparison between the experimental regimen and the standard of care using a relatively smaller sample size and clear guidance as to the likelihood of success if the agent is moved forward into phase III testing.

Overall, randomized phase II designs balance prognostic factors among all arms, avoid patient selection bias, and provide robust arm comparisons. The major concerns for randomized phase II designs are (1) the larger required sample size, (2) the relatively high false-negative and false-positive rates that result from a sample size that is still small by phase III standards, and (3) the possible desire (or ethical mandate) to skip a confirmatory phase III study given a positive phase II result.

Another type of newly developed randomized phase II design is the phase II/III design. A phase II/III design starts with a phase II component in which patients are randomized among several experimental arms, or between a standard-of-care arm and one or more experimental arms. Experimental arms may be compared with a historical control or the standard-of-care control to select arms for entering the phase III component. The main advantages of this type of design are (1) the speed of transition from phase II to phase III and (2) the fact that data obtained from the phase II component can be applied toward the phase III comparison as long as the protocol treatment is not altered between phases.

Phase III Trials

It is generally recognized that judging the value of a new therapy by comparing it with historical data may give an erroneous impression of the therapy’s efficacy. Pocock18 related a number of illuminating examples of this phenomenon. Therefore the phase III trial, in which a new agent or modality is tested against an accepted standard treatment in a randomized comparison, is considered the most satisfactory method of establishing the value of the proposed treatment.

Under most circumstances, the goal of a phase III trial is to determine whether the proposed treatment is superior to the standard treatment (superiority trial). Occasionally, the goal is to “show that the difference between the new and active control treatment is small, small enough to allow the known effectiveness of the active control to support the conclusion that the new test drug is also effective” (noninferiority trial).19 Because the goal of a phase III trial is to make a definitive conclusion regarding a new treatment’s efficacy, enough patients need to be accrued to guarantee a small probability of false-positive results (i.e., declaring that a regimen is effective when in fact it is not) and a large probability of true-positive results (i.e., declaring that a regimen is effective when in fact it is). The probability of false-positive results is also called the size of the study, which usually is limited to 5% or less. The probability of true-positive results is also called the power of the study, which usually is set at 80% to 95%. The primary end points for comparison are usually survival and PFS/DFS, and a common secondary end point is quality of life (QOL). In this section, we discuss the rationale for requiring randomized treatment assignment and stratification in a phase III trial. We introduce the intent to treat principle, and we emphasize the importance of monitoring in ongoing trials.
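
As a rough illustration of the sample-size arithmetic, the sketch below uses the Schoenfeld approximation for the number of events required by a log-rank comparison with 1:1 randomization; the hazard ratio, significance level, and power are illustrative choices, not values from the chapter.

```python
# Number of events required by a log-rank comparison under the Schoenfeld approximation,
# d = 4 * (z_{1-alpha/2} + z_{1-beta})**2 / (log HR)**2, for 1:1 randomization.
# The hazard ratio, significance level, and power are illustrative choices.
import numpy as np
from scipy.stats import norm

alpha, power, hazard_ratio = 0.05, 0.90, 0.75
z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
events_needed = 4 * (z_alpha + z_beta) ** 2 / np.log(hazard_ratio) ** 2
print(f"approximately {int(np.ceil(events_needed))} deaths must be observed")
```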

Rationale for Randomization

A random assignment is defined as any assignment resulting from a mechanism ensuring an equal probability to all possible assignments. The most obvious reason for randomized assignment is to eliminate the possibility of bias, that is, the intentional or inadvertent assignment of treatments by means of procedures that allow patient characteristics to influence the assignments in a way that may promote an imbalance in patient characteristics between the two treatment arms. The basic idea is very simple. If chance alone determines the assignments and enough patients are accrued, the law of averages will ensure that the characteristics of patients in different treatment arms will be very nearly identical.

Treatments must be assigned in such a way that the accruing physician or coordinator will have no reasonable basis for predicting the arm to which a patient will be assigned. For example, if an accruing physician suspects that the next patient to be enrolled will be assigned to the new treatment, he or she may preferentially recruit patients who might have more favorable outcomes, thereby allowing subtle differences in patient characteristics to influence treatment comparisons. For this reason, it is common for the randomization of patients to be permitted only through contact with a centralized randomization coordinator.

Other potential sources of bias exist that cannot be eliminated by randomized treatment assignment. In some diseases, application of a treatment may convey psychological effects or promote expectations that affect some end points of interest. In cancer research, the most obvious example is that treatment with chemotherapy or other agents gives rise to expectations of toxicity. Therefore, whenever it is feasible without causing undue hardship to patients, treatment assignments should be masked by use of placebos. Perhaps even more important is the need to keep medical personnel unaware of actual treatment assignments whenever possible to eliminate bias in assessing response, relapse, or attribution of causality to adverse events.

Stratification of Treatment Assignments

For trials with large sample sizes, randomization of treatment assignments—together with the law of averages—will result in nearly balanced patient characteristics across treatment arms. However, if certain patient characteristics are known to be strongly prognostic of outcome, a better balance can be achieved by allowing separate randomization of treatments to patients within subpopulations defined by the various levels of prognostic variables. Such a restricted randomization scheme is referred to as stratified random assignment, and the prognostic variables are referred to as stratification variables. For example, in clinical trials of adjuvant therapy for breast cancer, it is known that rates of relapse are strongly related to the number of axillary nodes found to be positive for tumor on histologic examination and are less strongly related to the age of the patient. Therefore, the assignment of treatments to patients might be stratified within eight subpopulations, defined by combinations of nodal status (node negative, 1 to 3 positive nodes, 4 to 9 positive nodes, 10+ positive nodes) and age at randomization (≤49 years, ≥50 years). In this way, stratification (to balance the effects of strongly prognostic characteristics) and randomization (to prohibit bias resulting from less obvious or even unforeseen factors) may be used together to promote an unbiased comparison of treatments.

In some cases, it is difficult or even impossible to achieve completely balanced treatment assignments within every combination of the levels of all potential stratification variables. For example, in multicenter cancer clinical trials, it is generally thought to be good practice to balance treatment assignments within each accruing institution because patient selection may differ from institution to institution. However, it is not unusual for cooperative groups in oncology to conduct phase III clinical trials using several hundred accrual sites. Clearly, it would be difficult to stratify completely on accruing institution together with other potentially prognostic variables, because the number of resulting combinations of levels of all the stratification variables would be excessive. To circumvent this problem, cooperative groups commonly use algorithms that dynamically maintain nearly balanced assignment of treatments across the levels of all stratification variables marginally but that do not guarantee balance within all combinations of the levels of all of the stratification variables. This approach also results in nearly balanced treatment assignments at all times; this will be beneficial if, as often happens, the characteristics of patients for whom the trial is considered most appropriate in the judgment of accruing physicians tend to change as the trial matures. Details of several such dynamic balancing algorithms may be found in articles by Taves,20 Pocock and Simon,21 Freedman and White,22 and Begg and Iglewicz.23
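
A minimal sketch of one such dynamic balancing ("minimization") scheme, in the spirit of the Taves and Pocock-Simon proposals, is shown below. The stratification factors, levels, and biased-coin probability are illustrative assumptions, and real implementations differ in the imbalance measure and tie-breaking details.

```python
# A toy "minimization" assignment rule: each new patient goes, with high probability, to
# the arm that leaves the marginal totals of the stratification factors most nearly balanced.
# The factors, levels, and biased-coin probability are illustrative assumptions.
import random
from collections import defaultdict

counts = defaultdict(lambda: [0, 0])   # (factor, level) -> number of patients on arm A, arm B

def minimization_assign(patient_levels, p_favored=0.8):
    """patient_levels: e.g. {'nodes': '1-3', 'age': '<=49', 'site': 'site_001'}."""
    imbalance = [0, 0]
    for factor, level in patient_levels.items():
        for arm in (0, 1):
            hypothetical = counts[(factor, level)].copy()
            hypothetical[arm] += 1                     # imbalance on this margin if sent to `arm`
            imbalance[arm] += abs(hypothetical[0] - hypothetical[1])
    if imbalance[0] == imbalance[1]:
        arm = random.randint(0, 1)                     # no preference: pure randomization
    else:
        favored = 0 if imbalance[0] < imbalance[1] else 1
        arm = favored if random.random() < p_favored else 1 - favored
    for factor, level in patient_levels.items():       # record the assignment
        counts[(factor, level)][arm] += 1
    return "A" if arm == 0 else "B"

print(minimization_assign({"nodes": "4-9", "age": "<=49", "site": "site_001"}))
```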

Intent to Treat

If treatment assignments are randomized as described earlier, one can be reasonably confident that important prognostic factors will be balanced across treatment arms. However, if any patients are excluded at the time the data are analyzed, the possibility of bias arises once again. An analysis that includes all randomized patients, regarded as though each had received precisely the treatment to which he or she had been randomized, irrespective of compliance or administrative errors (e.g., errors in eligibility or other protocol deviations), is an intent to treat (ITT) analysis. Another type of analysis is called per-protocol (PP) analysis, in which only eligible patients who adhered to the treatment according to the protocol are included. In this section, we show how treatment comparisons may be seriously biased by excluding patients from an analysis, even for apparently valid reasons. We then show how an intent to treat analysis may effectively reduce biases that can creep into analyses when not all patients actually receive the treatments to which they have been randomized. We discuss these issues in the context of the following example:

Example 1. A clinical trial was designed to compare the effect of radiation therapy relative to observation in patients with rectal cancer. Three hundred patients were randomized between radiation therapy (arm A) and observation (arm B), 150 to each arm. Suppose that all patients actually received all of their intended therapy and that after 7 years of follow-up, survival curves were as shown in Figure 12-1A. Although the survival curve for patients receiving radiation therapy lies above that for patients in the observation-only arm, suggestive of a benefit from radiation therapy, the difference is not quite statistically significant (p = .058).

In any clinical trial, the patients who are accrued exhibit considerable variation in their prognoses. Suppose that we could make use of certain prognostic patient characteristics that were unavailable in the analysis summarized previously. Suppose further that by using this new information, we could identify the 10% of patients with the worst prognoses. Then we might redraw the survival curves after stratifying according to risk, resulting in the plots shown in Figure 12-1B. An analysis of these data that takes this stratification into account again yields a p value of .058, which agrees with the p value from the unstratified analysis to three decimal places. This insensitivity of analysis to stratification is a direct consequence of the randomization used in the design of the trial, which tends to balance the poor-prognosis patients equally across the two arms of the trial.

Returning to the original analysis of these data, suppose that not all patients in arm A had actually received the radiation therapy to which they had been randomized. The statistician must decide what to do with these noncompliant cases. There are at least three ways to proceed: (1) it may seem logical to discard data corresponding to the noncompliant patients because they did not receive the radiation therapy that was assigned to them; (2) it might even seem reasonable to analyze the noncompliant patients as though they had been randomized to arm B rather than arm A because, in fact, they received no adjuvant therapy, similar to those patients who were randomized to observation; (3) the intent to treat principle states that the noncompliers should be analyzed just as though they had received radiation therapy, even though in actuality they did not.

We now consider the possible consequences associated with these three possibilities. Under the assumption that patients’ reasons for refusal of treatment are completely unassociated with prognosis, methods 1 and 2 just presented should lead to estimated treatment differences that are essentially identical to the analysis of Figure 12-1A. If there is a treatment difference, the intent to treat analysis will tend to underestimate this difference because some of the patients who were not irradiated are counted as having received that therapy. The differences in the three analyses should be rather small under this assumption of no association between refusal of treatment and prognosis.

Alternatively, it is rather likely that reasons for refusal of treatment will be associated with patient prognosis. For example, patients with poorer performance status, or their physicians, may be less liable to accept assignment to a treatment that is known to be associated with significant toxicity. Deteriorating health may lead to difficulties in traveling to receive treatment; advancing disease may correlate with a patient’s level of depression or anxiety, which, in turn, may correlate with compliance. Suppose that it is the 10% of patients with the worst prognosis who refuse to accept their radiotherapy. Under these conditions, analysis of the data after deleting the noncompliant patients overstates the effect of radiotherapy, as shown in Figure 12-2A. This is because the worst-prognosis patients are deleted from consideration in arm A, but no similar deletion of poor-prognosis patients is made in arm B. The result is a spuriously significant p value of .021. Treating the noncompliers as though they had been randomized to observation (see Fig. 12-2B) results in an even more serious overstatement of the radiotherapy effect, associated with a p value of less than .009. In contrast, the intent to treat analysis yields a slightly attenuated treatment effect (p = .065) (Fig. 12-2C) but no serious misrepresentation of the true effect.

Why did the intent to treat analysis correspond to the “truth” more closely than the analysis that omitted the subset of patients who did not receive their assigned radiation? This is because the effect of the prognostic factor (risk group) was larger than the treatment effect and the likelihood of patient refusal was associated with both the treatment received and the prognostic group of the patient. Although patient prognoses were balanced as randomized, they are not balanced as treated. In this case, the baseline prognosis of patients who actually received radiation was better than the baseline prognosis of those who were under observation. It is precisely this type of imbalance that the intent to treat analysis is designed to prevent.

There is no one correct answer as to the best approach to analysis when not all patients receive their assigned therapy. However, there is general agreement among statisticians that it is always appropriate to perform an intent to treat analysis in which each patient is analyzed according to the treatment assigned at randomization, regardless of what treatment was actually received. Results of any other analysis should be compared with the results of the intent to treat analysis, and if the results differ substantively, the interpretability of the data must be questioned.

The prior examples illustrated only one type of problem that may be addressed by an intent to treat analysis. Biases similar to those just described can occur if patients elect to cross over study arms, accept a nonstudy therapy, or receive only a portion of the assigned therapy or if they are determined to be ineligible. Therefore we advocate attempting to obtain complete follow-up for all patients registered to every phase III trial, and performing an intent to treat analysis. Sometimes this may not be possible, because patients who refuse protocol therapy may also refuse to be followed. If only 1% or 2% of patients fall into this category, there is little danger of important bias, but if this number is larger, say 5% to 10%, there is a real danger that the study results may be biased. One might then want to perform what are called sensitivity analyses to examine the possible effect of having so many patients without follow-up. In its simplest form, one might perform one analysis assuming that all of the patients without outcome data on arm A failed early on, whereas those on arm B survived; and then perform an analysis assuming that all of the patients without outcome data on arm A survived, whereas those on arm B failed. If one reaches the same conclusion about treatment effect in either analysis, then the exclusions do not represent a serious problem.
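
A schematic version of this best-case/worst-case sensitivity analysis, using a simple binary "alive at 5 years" outcome and entirely hypothetical counts rather than full survival curves, might look like the following.

```python
# Best-case/worst-case sensitivity analysis for patients lost to follow-up, sketched with a
# binary "alive at 5 years" outcome and entirely hypothetical counts.
def five_year_rate(alive, dead):
    return alive / (alive + dead)

obs_a = {"alive": 90, "dead": 50, "lost": 10}   # arm A: followed patients plus 10 lost
obs_b = {"alive": 80, "dead": 60, "lost": 10}   # arm B: followed patients plus 10 lost

# Scenario 1: every lost patient on arm A died early and every lost patient on arm B survived.
worst_for_a = (five_year_rate(obs_a["alive"], obs_a["dead"] + obs_a["lost"])
               - five_year_rate(obs_b["alive"] + obs_b["lost"], obs_b["dead"]))
# Scenario 2: the reverse.
best_for_a = (five_year_rate(obs_a["alive"] + obs_a["lost"], obs_a["dead"])
              - five_year_rate(obs_b["alive"], obs_b["dead"] + obs_b["lost"]))
print(f"difference in 5-year survival (A - B) could range from {worst_for_a:.3f} to {best_for_a:.3f}")
```

If both extremes lead to the same qualitative conclusion about the treatment effect, the missing follow-up is unlikely to change the interpretation of the trial.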

There is often disagreement concerning the inclusion of ineligible patients in analyses, particularly in analyses dependent on patient covariates, because ineligible patients may not fit any reasonable classification. The decision as to whether it is appropriate to exclude ineligible patients depends on the methods used for determining ineligibility. For example, in some studies, each patient is reviewed for eligibility in a uniform way by a reviewer (or review committee) who is unaware of the assigned treatment and uses only information that was obtained before randomization. If that is the only mechanism for classifying patients as ineligible, one could reasonably exclude such patients from analyses. However, any less stringent approach that allows patients to be classified as ineligible after randomization could introduce subtle biases. For example, in a trial of radiation versus observation, one might routinely perform a pretreatment examination immediately after a patient is assigned to radiation therapy. If, during that examination, it is determined that the patient had not met an eligibility requirement, exclusion would cause a bias because a similar patient not assigned to radiation would not have had the pretreatment examination and would, therefore, not have been determined to be ineligible.

To summarize, we recommend always performing an intent to treat analysis using all patients as randomized and reporting the results of that analysis in any publication. Further analyses may be performed, especially for noninferiority trials, but one should be aware of potential biases similar to those described previously. Further discussion of these issues can be found in articles by Pocock18 and Gail.24

The data used in Example 1 were generated using the following assumptions. The expected 5-year survival rate for an untreated good-risk patient was .73, and for an untreated bad-risk patient, .54. The reduction in the death rate associated with radiation treatment was assumed to be 20% in each group. For Figure 12-2, we assumed that all the poor-risk patients had an expected 5-year survival rate of .54 because none received treatment.

Monitoring Ongoing Trials

In clinical trials, patients enter a study sequentially over time; therefore, information about the treatment accumulates as the trial progresses. An interim analysis is a planned analysis conducted before the final planned analysis, which allows the study sponsor to evaluate the trial’s probability of success while controlling the overall error rates. Interim analyses are usually conducted at prespecified time points, typically quantified by the number of events observed. Interim analyses can be used to monitor superiority or futility, or both: If a new regimen is convincingly demonstrated to be superior to the active control, there is an ethical obligation to stop the randomization and provide the new regimen to every patient. If an interim analysis shows that there is little chance for the new regimen to outperform the control, it would be prudent to terminate the trial and conserve patients for other promising regimens.

There are different approaches in setting interim monitoring boundaries for superiority. For instance, Pocock25 proposed to spend the type I error (false-positive rate) equally throughout all interim and final analyses. O’Brien and Fleming26 proposed to reserve most of the type I error for the final analysis. The general approach is to set the interim boundaries conservatively so that the trial will not stop early for efficacy unless very convincing evidence is present.
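
The reason conservative boundaries are needed can be seen in a short simulation of our own (normally distributed outcomes, no true treatment difference, and an illustrative sample size and number of looks): repeatedly testing the accumulating data at an unadjusted .05 level inflates the overall false-positive rate well above .05.

```python
# Repeatedly testing accumulating data at an unadjusted two-sided .05 level. Outcomes are
# standard normal with no true treatment difference, so any "significant" result is a false
# positive; the sample size and number of looks are illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n_per_arm, n_sims = 200, 5000
looks = (0.25, 0.50, 0.75, 1.00)          # analyses after 25%, 50%, 75%, and 100% of the data
rejections = 0
for _ in range(n_sims):
    a = rng.normal(0, 1, n_per_arm)
    b = rng.normal(0, 1, n_per_arm)
    for frac in looks:
        m = int(n_per_arm * frac)
        z = (a[:m].mean() - b[:m].mean()) / np.sqrt(2 / m)
        if abs(z) > norm.ppf(0.975):      # nominal .05 test at every look, no adjustment
            rejections += 1
            break
print("overall false-positive rate ≈", rejections / n_sims)   # well above .05
```

Boundaries such as Pocock’s or O’Brien and Fleming’s are constructed precisely so that the overall false-positive rate across all looks remains at the nominal level.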

Typically, setting the interim boundaries for futility is not done as conservatively as setting them for superiority. A nice rule of thumb proposed by Wieand and coauthors27 is as follows: For a study with a time-to-event primary end point, an interim futility check is conducted when 50% of the events targeted for the final analysis have been observed. If the hazard ratio of the experimental treatment versus the control is larger than 1 (i.e., the outcomes on the experimental arm are poorer than those on the control arm at that time), the trial should be stopped for reasons of futility. The rule is simple to follow, and the probability of falsely concluding that an experimental regimen is no better than the control when, in fact, the experimental regimen actually works is 2% or less.

Interim monitoring mainly focuses on treatment efficacy. To monitor the trial in other respects—for instance, to protect the safety of enrolled patients; to ensure the validity of study results; or to identify unacceptably slow accrual, unusually high dropout rates, or unacceptable ineligibility rates—a Data Safety Monitoring Board (DSMB) is usually recommended, if not required. The DSMB is an independent panel of experts; at a minimum, a physician and a statistician are usually required. Other members may include epidemiologists, laboratory scientists, a patient representative, or representatives of other groups. DSMB members have access to all data and report directly to the study sponsor. In current clinical trial practice, the U.S. Food and Drug Administration (FDA) strongly recommends using a DSMB for all phase III trials, and many institutional review boards now require use of a DSMB.

Survival Analysis

Survival analysis differs from other types of statistical analysis in that the analysis of time to an event often is complicated by the lack of complete follow-up for every patient (i.e., there is censored data). An example illustrates why this is important. The example uses hypothetical data chosen to illustrate clearly the problem that censoring can cause.

Example 2. Suppose a radiation oncologist decides to review the outcomes of rectal cancer patients treated with radiation therapy. Of primary interest is the determination of the proportion of patients who remain relapse free for 5 years. Suppose patients have been accrued over the past 7½ years, and by chance the patients have started treatment in pairs at intervals of 1 year. Suppose, furthermore, that if no more patients were accrued and the oncologist could wait 5 more years to obtain 5-year follow-up for every patient, he or she would find that half of them were relapse free for more than 5 years. This (unobserved) relapse pattern is shown in Figure 12-3.

Twelve years after the first patient was entered, the 5-year relapse-free pattern would be as shown in Figure 12-4 (time 0 is the date of entry for each patient). If the investigator had waited 12 years (i.e., until 5-year data for all 16 patients were complete), he or she would have observed that half of the patients were relapse free 5 years after entry (i.e., the 5-year relapse-free rate is .50). Suppose, in fact, that the investigator decided to look at the patients’ experience 7½ years after the first patient had entered. Then the investigator would have observed everything to the left of the vertical line in Figure 12-5. Translated into time from entry, this would be represented as shown in Figure 12-6. In that case, the investigator would have complete 5-year data for seven of the eight patients who relapsed and for three patients who were relapse free at 5 years (those who were entered at year 0, 1, or 2). For the other six patients, his or her knowledge would be that they were relapse free for some length of time less than 5 years. The investigator’s first instinct might be to exclude these six patients from analysis because he or she does not know what their relapse status will be after 5 years of follow-up. Of the remaining 10 patients, 7 are known to have relapsed and 3 are known to be disease free for more than 5 years, so that the estimated relapse-free rate at 5 years is .30. This estimate does not seem to reflect the data accurately.

The estimate is so different from .50 because the method of calculation is biased. The cause of the bias can be seen by studying the seventh and eighth patients in the series (i.e., the patients entered 3 years after the first patient entered). Notice that the seventh patient entered was still relapse free when the radiation oncologist performed his or her analysis but had been followed for only 4½ years (i.e., did not have a known status at 5 years) and, hence, was excluded. However, the eighth patient (who was entered on the same day) had relapsed by the time of analysis and, therefore, was included. The bias is that patients who relapse have a better chance of being included in the analysis than those who do not relapse. One way to avoid a biased estimate would be to exclude all patients who did not have the potential to be monitored for 5 years. Therefore the oncologist would only be able to use the information from the first six patients entered. In this artificial example, this would have led to a correct estimate of the 5-year relapse-free rate because three of the first six patients relapsed within 5 years.

This method is unsatisfying, because not all of the available information is used. In statistical terms, the disadvantage of our proposed solution is that there is considerably more variance associated with an estimate that uses the data from only six patients than one that uses all of the available data. For example, if patients 6 and 7 had entered the study in the opposite order, the relapse-free estimate would jump from 50% to 67%. A better method is described in the next section.

Kaplan-Meier Method

A common way to summarize survival data is to estimate the “survival curve,” which shows—for each value of time—the proportion of subjects who survive at least that length of time. Figure 12-7 shows survival curves for patients with operable breast cancer who have been treated with either preoperative or postoperative chemotherapy (doxorubicin and cyclophosphamide) in a large clinical trial. In this section, we discuss the estimation of survival curves using a method proposed by Kaplan and Meier.28

Several excellent references are available that show exactly how a Kaplan-Meier estimate is computed and describe other properties such as the variance of the estimator. A nice summary with formulas for computations can be found in Chapter 2 of Parmar and Machin’s text.29 Here, we offer two examples that provide some insight regarding the Kaplan-Meier method. The first example concerns a data set having no censored observations; the second example extends these ideas to accommodate censored observations.

Example 3. Suppose four patients are entered into a study, possibly at different times, and all four die, with death times of 10 months, 20 months, 25 months, and 40 months, respectively, from time of entry. We want to estimate the 3-year survival rate. Three of the four patients died within 3 years from entry, and one patient remained alive more than 3 years from entry (i.e., one fourth of the patients entered lived at least 3 years). Therefore a reasonable estimate of the 3-year survival rate is S(3) = .25. This estimate (the proportion of patients who remained alive at 3 years) is referred to as the empirical survival estimate, and the graph of estimates obtained this way at all time points is called the empirical survival curve.

Example 4. Suppose that in the previous example three of the patients have died at 10 months, 25 months, and 40 months, respectively. The fourth patient is still alive but has been monitored for only 20 months. We want to estimate the 3-year survival rate. Notice that if we could wait another 16 months to obtain the estimate (so that we would have 3-year data for all patients), we would obtain an estimate of either one fourth (if the fourth patient dies soon) or one half (if the patient remains alive for another 16 months). The rather intuitive approach we used in Example 3 to estimate the 3-year survival rate does not work here, because we do not know whether there will be two or three survivors when all the patients have died or have been monitored for 3 years (i.e., we do not know how to handle the patient whose follow-up was censored at 20 months). Kaplan and Meier proposed an approach that updates the survival estimate at the time of each death using only those patients who are at risk of failing at each update.

Because none of the patients died during the first 10 months, the Kaplan-Meier estimate of the probability of surviving to any time point less than 10 months is equal to 1. Because four patients are alive at 10 months but only three quarters of them survive beyond 10 months, the Kaplan-Meier estimate changes to three quarters for time points beyond 10 months but before the next death. Although one patient is censored at 20 months, this gives no information regarding the likelihood of a death, so the Kaplan-Meier estimate remains at three quarters for all time points between 10 months and 25 months. One of the two patients who are still at risk just before 25 months dies at that time. Therefore the estimate of the probability of a patient surviving beyond 25 months, given that the patient survived at least 25 months, is one half. The Kaplan-Meier estimate of surviving more than 25 months is the product of three quarters (the estimated probability a patient will survive until 25 months) times one half (the estimated probability that a patient will survive more than 25 months given that the patient was alive at 25 months), which is three eighths. There are no other deaths before 3 years, so the Kaplan-Meier estimate of the 3-year survival rate is three eighths. The Kaplan-Meier method uses all the relevant available data from each patient but excludes the patient who was censored at 20 months from all computations beyond that time.

If the Kaplan-Meier approach is applied to the data in Example 3, the estimate S(3) is equal to the (probability of surviving 10 months) × (probability of surviving more than 10 months given survival of 10 months) × (probability of surviving more than 20 months given survival of 20 months) × (probability of surviving more than 25 months given survival of 25 months), or 1 × 3/4 × 2/3 × 1/2 = 1/4, which matches the empirical survival estimate. In fact, the Kaplan-Meier estimate and the empirical estimate always match when they are applied to uncensored data.

The reader may verify that application of the Kaplan-Meier approach to the data in Figure 12-6 will result in an estimate of .45 for the probability of remaining relapse free through 5 years.
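
A compact sketch of the product-limit calculation described above, applied to the data of Examples 3 and 4, is shown below; it is an illustration of the idea rather than production code.

```python
# A bare-bones product-limit (Kaplan-Meier) estimate. Times are in months; an event flag of
# 0 marks a censored observation.
def kaplan_meier(times, events, t_query):
    data = sorted(zip(times, events))
    at_risk, surv, i = len(data), 1.0, 0
    while i < len(data):
        t = data[i][0]
        if t > t_query:                       # later times do not affect S(t_query)
            break
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        if deaths:
            surv *= (at_risk - deaths) / at_risk
        removed = sum(1 for tt, _ in data if tt == t)   # deaths and censorings leave the risk set
        at_risk -= removed
        i += removed
    return surv

# Example 3: deaths at 10, 20, 25, and 40 months -> S(36 months) = 1/4
print(kaplan_meier([10, 20, 25, 40], [1, 1, 1, 1], 36))
# Example 4: deaths at 10, 25, and 40 months, one patient censored at 20 months -> 3/8
print(kaplan_meier([10, 25, 40, 20], [1, 1, 1, 0], 36))
```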

Log-Rank Statistic

Perhaps the most common application of survival analysis techniques is the comparison of survival times (or other times to event, such as time to disease relapse) between two or more groups of patients that differ in some aspect (e.g., male versus female, or treated versus untreated). We discuss the log-rank statistic, the stratified log-rank statistic, and the Cox proportional hazard model in some detail. These methods are commonly used to compare survival times between groups. We avoid excessive mathematical detail; instead, we concentrate on showing how these methods accommodate censored survival times and, in cases of the stratified log-rank statistic and the Cox proportional hazard model, how differences in individual patient characteristics may be controlled to avoid biased group comparisons.

The log-rank statistic deals with the problem of censoring by comparing the groups only when a patient within any of the groups experiences an “event” (if survival times are to be compared across groups, an event would be a death; if times to relapse are to be compared, an event would be a relapse, and so on). This idea is most easily explained in the context of a simple example. Suppose we monitored three patients receiving a standard treatment regimen (this might even be no treatment), which we refer to as treatment A, and three other patients receiving an experimental regimen, which we refer to as treatment B. Suppose, furthermore, that the survival times for the patients receiving treatment A are 10, 40+, and 55 days, respectively, and for the patients receiving treatment B, 15, 50, and 60 days, respectively (a plus sign after a value denotes a censored time; i.e., a patient with a time of 40+ days was last known to be alive at 40 days and no further follow-up is available). These data are represented graphically in Figure 12-8; the three patients receiving treatment A are labeled A1, A2, and A3, and those receiving treatment B are labeled B1, B2, and B3. Deaths are denoted by the letter D, and censored survival times are labeled with the letter C.

When evaluating the log-rank statistic, the first computation occurs at time t = 10, the time at which the first death is observed. Just before this point in time, all six patients are known to be alive (and, hence, are “at risk” to die at time t = 10). Three of these patients received treatment A and three received treatment B. Exactly one of these six patients is known to have died at time t = 10, and he or she received treatment A. Table 12-4 summarizes the status of patients at this time point. Notice that at the time of this computation there are three patients on each arm. Therefore, if treatment B was equivalent to treatment A and only one death occurred on one arm or the other, there would be a one-half chance that the death is on arm A. In fact, the death is on arm A; hence, we observe one death when the probability of observing a death on arm A is one half (i.e., there is one half more death on arm A than is expected).

Observations at time t = 15 are listed in Table 12-5. At this point there would be a two-fifths probability that the death would occur on treatment A (if the treatments were equivalent), but the death did not occur on treatment A, so there were two-fifths fewer deaths on arm A than expected (i.e., the observed deaths minus the expected number is 0 − 2/5 = −2/5).

The next computation occurs at time t = 50; observations are listed in Table 12-6. Notice that patient A2 is not included in this table, even though she is not known to have died at any time before t = 50. Because she was lost to follow-up (censored) at time 40, she is no longer “at risk” at time 50. At this time, there would be a one-third probability that the death would occur on treatment A (if the treatments were equivalent), but the death does not occur on treatment A, so there are one-third fewer deaths on arm A than expected (i.e., the observed deaths minus the expected number is 0 − 1/3 = −1/3).

One may go through the same computations at time t = 55 (Table 12-7) and will determine that the number of observed deaths minus expected deaths on treatment A is 1 − 1/2 = 1/2. At time t = 60, all remaining patients are on the same arm, so the observed minus the expected number of deaths must be 0. Adding up the observed minus the expected numbers of deaths at times 10, 15, 50, and 55 (1/2 − 2/5 − 1/3 + 1/2 ≈ 0.27), one obtains 2 observed deaths minus 1.733 expected deaths, so there were approximately 0.27 more deaths than expected on arm A, suggesting that this treatment might be harmful. However, one’s intuition is that this is not a significant difference (i.e., such a small difference could easily be attributed to the play of chance), and in fact this is true. Therefore there is no strong evidence in this example that treatments A and B differ.

More formally, the log-rank statistic is defined as the difference between the observed and expected numbers of deaths on one of the two treatment arms. The statistic may be standardized by dividing by the square root of its variance, yielding a score that, under the hypothesis of equivalent treatments, is approximately standard normal. In the present example, the variance of the log-rank statistic can be shown to equal 0.9622, so the standardized test statistic is (2 − 1.733)/√0.9622 ≈ 0.27, corresponding to a two-sided p value of about .79.
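To make these computations concrete, the short sketch below implements the observed-minus-expected calculation, the variance, and the standardized statistic for the six-patient example. It is an illustration only (the function name and data layout are ours, not a standard library routine), and readers performing real analyses would normally rely on established statistical software.

```python
from math import erfc, sqrt

def logrank_observed_expected(times_a, events_a, times_b, events_b):
    """Return (observed deaths on A, expected deaths on A, variance) for the log-rank test.

    times_*  : survival or censoring times
    events_* : 1 if the time is a death, 0 if the time is censored
    """
    data = [(t, e, "A") for t, e in zip(times_a, events_a)] + \
           [(t, e, "B") for t, e in zip(times_b, events_b)]
    death_times = sorted({t for t, e, _ in data if e == 1})

    obs = exp_a = var = 0.0
    for dt in death_times:
        # Everyone whose time is at least dt (including anyone censored exactly at dt)
        # is still "at risk" when this death occurs.
        at_risk = [(t, e, g) for t, e, g in data if t >= dt]
        n = len(at_risk)
        n_a = sum(1 for _, _, g in at_risk if g == "A")
        d = sum(1 for t, e, _ in at_risk if t == dt and e == 1)             # deaths at dt
        d_a = sum(1 for t, e, g in at_risk if t == dt and e == 1 and g == "A")
        obs += d_a
        exp_a += d * n_a / n                                                # expected deaths on A
        if n > 1:                                                           # hypergeometric variance
            var += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    return obs, exp_a, var

# The six-patient example: a "+" time in the text (e.g., 40+) is entered as censored.
times_a, events_a = [10, 40, 55], [1, 0, 1]
times_b, events_b = [15, 50, 60], [1, 1, 1]

o, e, v = logrank_observed_expected(times_a, events_a, times_b, events_b)
z = (o - e) / sqrt(v)        # approximately standard normal if the treatments are equivalent
p = erfc(abs(z) / sqrt(2))   # two-sided p value
print(f"observed = {o:.0f}, expected = {e:.3f}, variance = {v:.4f}, z = {z:.2f}, p = {p:.2f}")
```

Running this sketch reproduces the hand computation above: 2 observed versus about 1.73 expected deaths, a variance of about 0.96, a standardized statistic of about 0.27, and a two-sided p value of about .79.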

Stratified Log-Rank Statistic

The log-rank test as defined previously uses only survival information and group membership. Other patient characteristics are not taken into account. We now introduce a method to adjust for the possible effects of other covariates. Suppose that we had the following survival times (in days) for a group of breast cancer patients who had differing numbers of positive lymph nodes and who received either treatment A or treatment B:

1 to 3 positive nodes
Treatment A: 91, 160+, 230+
Treatment B: 32, 101, 131+, 155+, 190+, 210+

4 or more positive nodes
Treatment A: 11, 22, 42, 63, 72, 110, 120, 141, 155, 180+, 200+, 220+
Treatment B: 51, 83, 110, 170+

If we were to ignore our knowledge of the patients’ lymph node status and compute a log-rank statistic as we did before, our computations at times 11 and 91 would be as shown in Tables 12-8 and 12-9. If one completes these computations at each time of death, the total for arm A is 10 observed deaths when 8.85 were expected. This translates to 1.15 excess events on arm A.

However, when one examines the distribution of patients, it is not surprising that there is an excess number of events on arm A. After all, 12 of 15 patients on arm A had four or more positive nodes, whereas only 4 of 10 patients on arm B had four or more positive nodes. Given that one would expect patients with four or more positive nodes to be at greater risk than those with one to three positive nodes, it is not clear whether the observation of 1.15 excess deaths on arm A represents a possible treatment effect or reflects the nodal imbalance. That is, the comparison of treatments A and B using the ordinary log-rank statistic is biased by an imbalance of disease stage between the two study arms.

The stratified log-rank test provides a method to control for the association of nodal status with patient survival. To illustrate how this works, let us recompute the observed and expected deaths at times of 11 and 91 days, but this time adjusting the definition of the risk sets to account for nodal status. Notice that when we performed these computations for the (unstratified) log-rank test at t = 11, we estimated the probability that the death that occurred was on arm A to be 15/25, based on the fact that 15 of the 25 patients at risk at that time were on arm A. This implicitly assumed that, in the absence of a treatment effect, all patients are at the same risk of dying regardless of nodal status. However, the first patient who died had four or more positive nodes, and the patients at the same risk of dying as that patient are the other patients with four or more positive nodes. Because there were 16 patients with four or more positive nodes at risk of dying at t = 11, and 12 of these patients were on arm A, perhaps we should instead have estimated the probability that the patient who died at t = 11 was on arm A (given no treatment effect) to be 12/16, or three quarters. This is the method used in computing the stratified log-rank test. We now examine the computations at times of 11 and 91 days using this stratified approach (Tables 12-10 and 12-11).

If one restricts these computations to the patients with four or more positive nodes, the sum of the observed deaths on treatment A minus the expected is 9 − 9.11 = −0.11. Among the patients with one to three positive nodes, similar computations yield 1 observed death and 0.99 expected deaths. Summing these values across both strata gives 10 observed deaths on treatment A compared with 10.10 expected deaths, for a difference of −0.10. Thus the stratified approach leads to almost no difference between the numbers of observed and expected deaths, indicating that there is essentially no estimated treatment difference after adjusting for nodal status.

Notice that the difference between the two approaches is seen in the computation of the probability of a death occurring on treatment A at each time point. If one understands the method of computing these probabilities (expected values), one can begin to understand the possible biases of different approaches.

In the prior example, the (unstratified) log-rank test had a potential bias because no adjustment was made for the probable increased risk associated with having four or more positive nodes. This was important because there was a large imbalance in the proportion of such patients by treatment. If this characteristic had been in balance across treatments, the stratified and unstratified approaches would have given much closer results.
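The same idea can be shown in code. The sketch below simply re-enters the breast cancer example data and reuses the logrank_observed_expected function from the earlier log-rank sketch, so it is again an illustration rather than a standard routine.

```python
# Node-group data from the example above; a "+" time is entered as censored (event = 0).
strata = {
    "1-3 nodes": {
        "A": ([91, 160, 230], [1, 0, 0]),
        "B": ([32, 101, 131, 155, 190, 210], [1, 1, 0, 0, 0, 0]),
    },
    "4+ nodes": {
        "A": ([11, 22, 42, 63, 72, 110, 120, 141, 155, 180, 200, 220],
              [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]),
        "B": ([51, 83, 110, 170], [1, 1, 1, 0]),
    },
}

# Unstratified comparison: pool all patients and ignore nodal status.
pooled_a = ([t for s in strata.values() for t in s["A"][0]],
            [e for s in strata.values() for e in s["A"][1]])
pooled_b = ([t for s in strata.values() for t in s["B"][0]],
            [e for s in strata.values() for e in s["B"][1]])
o, e, _ = logrank_observed_expected(*pooled_a, *pooled_b)
print(f"unstratified: {o:.0f} observed vs {e:.2f} expected deaths on arm A")

# Stratified comparison: compute observed and expected within each nodal stratum, then sum.
total_obs = total_exp = 0.0
for group, arms in strata.items():
    o, e, _ = logrank_observed_expected(*arms["A"], *arms["B"])
    total_obs += o
    total_exp += e
print(f"stratified:   {total_obs:.0f} observed vs {total_exp:.2f} expected deaths on arm A")
```

The unstratified calculation gives 10 observed versus about 8.85 expected deaths on arm A, whereas summing within the nodal strata gives 10 observed versus just over 10.1 expected, in agreement with the computations above.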

Cox Proportional Hazard Model

The stratified log-rank test as defined previously provides a method to control for the association of nodal status with patient survival. However, for adjusting for additional explanatory variables, the most popular method is the Cox proportional hazard model.30 The Cox model describes the relationship between survival experience and prognostic or explanatory variables; it estimates and tests the hazard ratio between different groups. In the Cox regression model, explanatory variables can be continuous (for instance, age, weight, or systolic blood pressure), categorical (for instance, gender, race, or Eastern Cooperative Oncology Group [ECOG] performance score), or an interaction between different categorical variables. The Cox regression model does not make any parametric assumption about the survival probability distribution of each group; however, it does assume proportional hazards between groups. If we use hi(t) and h0(t) to denote the hazards of death at time t for the ith patient and for the baseline patient (that is, a patient with all explanatory variables taking the value 0), respectively, the proportional hazard model can be expressed as

hi(t) = h0(t) exp(β1x1 + β2x2 + … + βpxp)

where x1, x2, …, xp are p different explanatory variables, h0(t) is called the baseline hazard function, and hi(t)/h0(t) is a constant, called the hazard ratio. The βs are the regression coefficients. For a continuous explanatory variable xp, βp is the log of the hazard ratio associated with a one-unit increase in xp, with all other explanatory variables held the same. For a categorical explanatory variable xl, assume, for instance, that xl = 0 if the patient is randomized to the control group and xl = 1 if the patient is randomized to the experimental group. Then βl is the log of the hazard ratio for patients in the experimental group versus the control group, with all other explanatory variables held the same.
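For example (using an illustrative coefficient value, not one taken from the text), if the estimated coefficient for such a treatment indicator were βl = −0.357, the estimated hazard ratio would be exp(−0.357) ≈ 0.70; that is, holding the other explanatory variables fixed, patients in the experimental group would have an estimated hazard of death approximately 30% lower than patients in the control group at every point in time.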

Because the Cox regression model assumes proportional hazards between groups, it is necessary to check this assumption before drawing conclusions from the estimates. Most model-checking procedures are based on graphs or plots of model-fitting residuals. If the proportional hazard assumption does not fit the data, even after transforming the data, other models should be considered.
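As an illustration of fitting and checking such a model in practice, the sketch below simulates a small data set and fits a Cox model using the third-party pandas, NumPy, and lifelines packages (all assumed to be available); the variable names, the simulated hazard ratio, and the covariate effects are hypothetical.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 200
treatment = rng.integers(0, 2, n)            # 0 = control, 1 = experimental
age = rng.normal(60, 10, n)                  # a continuous covariate

# Simulate exponential survival times whose hazard depends on the covariates;
# the "true" treatment hazard ratio of 0.70 is purely illustrative.
hazard = 0.02 * np.exp(np.log(0.70) * treatment + 0.02 * (age - 60))
event_time = rng.exponential(1.0 / hazard)
censor_time = rng.exponential(80.0, n)       # random censoring

df = pd.DataFrame({
    "time": np.minimum(event_time, censor_time),
    "event": (event_time <= censor_time).astype(int),  # 1 = death observed, 0 = censored
    "treatment": treatment,
    "age": age,
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()            # the exp(coef) column gives the estimated hazard ratios

# Residual-based checks of the proportional hazard assumption
cph.check_assumptions(df)
```

In the fitted summary, exp(coef) for the treatment indicator estimates the hazard ratio discussed above, and check_assumptions reports residual-based diagnostics of the proportional hazard assumption.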

Current Topics in Phase III Clinical Trials

Surrogate Endpoint

A primary endpoint is the principal measure of a treatment’s efficacy; therefore it needs to be clinically relevant and sensitive to the effect of the treatment. The overall survival (OS) rate has historically been the primary outcome for most phase III trials because of its clear virtues: it is simple to measure, unambiguous, the least susceptible to investigator bias, and of unquestionable clinical relevance. Despite these advantages, the role of OS as a primary endpoint has been challenged in modern phase III clinical trials. Two primary challenges to the OS endpoint are that, in many cases, an extensive follow-up period is needed to obtain survival status and that, in many diseases, multiple effective therapies are now available. In the latter case, OS is affected by all therapies given to a patient and, as such, this endpoint is insensitive to the impact of changing a single line of therapy.

One of the alternatives is to use a surrogate endpoint instead of a clinical endpoint in a phase III trial. A surrogate endpoint is an endpoint obtained sooner, at lesser cost, or less invasively than the long-term clinical efficacy endpoint. When using a surrogate endpoint, one would like to make the same inference as if one had observed a true endpoint. In circumstances where the speed of the study is a major concern, where second-line therapies that affect the OS potentially obscure the assessment of first-line treatment benefit, or where cross-over becomes unavoidable, a surrogate endpoint might be considered in place of the long-term clinical endpoint.

In validating an endpoint as a legitimate surrogate endpoint, a meta-analysis is usually required because relationships demonstrated in one trial may not generalize to another. In addition, heterogeneity among the trials included in the meta-analysis strengthens the case that the surrogate relationship is robust across settings rather than an artifact of an individual trial. In conducting such a meta-analysis, use of individual patient data rather than summary data is highly recommended, and, ideally, both positive and negative trials are included. Examples of meta-analyses conducted to examine potential surrogate endpoints can be found in articles by Sargent and colleagues31 and Burzykowski and associates.32

Biomarkers

A marker is a single trait, or a group of traits, that differentiates patients with respect to an outcome of interest. If a marker can be used to identify patients with differing risks of a clinical outcome (such as progression or death) in the absence of therapy or when receiving nontargeted standard treatment, the marker is usually called a prognostic marker. If a marker predicts differential efficacy of a specific therapy, the marker is called a predictive marker.

Validation of a prognostic marker is usually conducted retrospectively in patients treated with placebo or a standard treatment. Validation of a predictive marker can also be conducted retrospectively using data from a randomized controlled trial33; however, a prospective randomized controlled trial is ideal. Two types of clinical designs can be used to validate a predictive marker: a targeted/selection design or an unselected design. A targeted trial enrolls only the patients who are most likely to respond to the experimental therapy based on their molecular expression levels. On the one hand, a targeted trial could result in a large saving of patients, leaving more patients available for other trials. On the other hand, it may miss efficacy in the excluded patients and forgoes the opportunity to test the association of the biologic endpoints with clinical outcomes.

Different types of unselected designs have been proposed and discussed. For instance, the marker-by-treatment interaction design (Fig. 12-9) stratifies patients according to their marker status and randomly assigns the patients in each marker group to two different treatments. The hypotheses and sample size estimates may differ in each marker-defined subgroup. Either a formal test for a marker-by-treatment interaction or a separate superiority test within each marker group can be conducted. Another popular unselected design is the marker-based strategy design (Fig. 12-10), in which the random treatment assignment may be based on the patient’s marker status or may be independent of it. Examples of each type of design can be found in the article by Sargent and colleagues.34

Adaptive Design

An adaptive design can be defined as a design “that allows adaptations to trial procedures (for instance, eligibility criteria, study dose, or treatment procedure, etc.) and/or statistical procedures (for instance, randomization, study design, study hypothesis, sample size, or analysis methods, etc.) of the trial after its initiation without undermining the validity and integrity of the trial.”35 Many types of adaptation can be applied to an ongoing trial. For instance, a study can be resized based on an interim review of the outcomes in the control group. Assume, for example, that in the design of a study the progression-free survival (PFS) under the standard of care was assumed to be 12 months but, on an interim review of accrued data from the control arm, the estimated PFS was 15 months. In this case, more patients need to be accrued or patients need to be followed longer for the same hazard ratio to be detected. No statistical penalty is required for this review, because only data from the control arm have been analyzed.
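A rough, hypothetical sketch of this resizing calculation follows; it is not from the text. It combines Schoenfeld's approximation for the required number of events with an exponential PFS model, treating both arms as having roughly the control-arm event rate for simplicity; the target hazard ratio, significance level, power, and follow-up time are all assumed values.

```python
from math import exp, log

def required_events(hazard_ratio, z_alpha=1.960, z_beta=0.842):
    """Schoenfeld's approximate number of events for a 1:1 randomized comparison
    (the default quantiles correspond to two-sided alpha = 0.05 and power = 0.80)."""
    return 4 * (z_alpha + z_beta) ** 2 / log(hazard_ratio) ** 2

def event_probability(median_pfs_months, follow_up_months):
    """Probability of an event within the follow-up window under an exponential model."""
    lam = log(2) / median_pfs_months          # hazard implied by the median
    return 1 - exp(-lam * follow_up_months)

d = required_events(hazard_ratio=0.75)        # roughly 380 events needed
for median in (12, 15):                       # design assumption vs. interim estimate
    p = event_probability(median, follow_up_months=24)
    print(f"control median PFS {median} months: roughly {d / p:.0f} patients "
          f"needed to observe {d:.0f} events within 24 months of follow-up")
```

Under these assumed values, a control median of 15 rather than 12 months raises the number of patients needed by roughly 10% to 15% for the same follow-up period, which is exactly the kind of resizing contemplated above; equivalently, the originally planned number of patients could be retained and followed longer.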

As another example of an adaptive design, assume that a study starts with n experimental arms and one control arm. Based on a prespecified plan, all experimental arms are compared with the control arm at the interim analysis, and only the most promising experimental arm is carried forward along with the control arm when patient accrual resumes. In other words, the study starts with n + 1 arms and ends with 2 arms, with n − 1 arms dropped after the interim analysis. Because no conclusions regarding superiority are allowed to be made at the interim analysis, no statistical penalty for the interim review is needed. Under other circumstances, if an adaptation involves (1) altering the randomization ratio between the arms, (2) resizing the trial based on an interim comparison between the experimental and control arms, or (3) changing the primary endpoint, then a penalty for looking at the data before the end of the study is definitely needed. In planning an adaptive design, the possible adaptations must be specified in the protocol before accrual takes place, and in applying an adaptive design, an efficient mechanism for processing and analyzing data is critical.

References

1 Meinert CL. Clinical Trials: Design, Conduct, and Analysis. Oxford: Oxford University Press; 1989.

2 Goldberg RM, Kaufmann SH, Atherton P, et al. A phase I study of sequential irinotecan and 5-Fluorouracil/leucovorin. Ann Oncol. 2002;13:1674-1680.

3 O’Quigley J, Pepe M, Fisher L. Continual reassessment method. A practical design for phase I clinical trials in cancer. Biometrics. 1990;46:33-48.

4 Wu W, Sargent D. Choice of endpoints in cancer clinical trials. In: Kelly K, Halabi S, editors. Fundamentals of Oncology Clinical Trials: The Art and Science Behind Successful Clinical Trials. New York: Demos Medical, 2008.

5 Gehan A. The determination of the number of patients required in a follow-up trial of new chemotherapeutic agent. J Chronic Dis. 1961;13:346-353.

6 Herson J. Predictive probability early termination plans for phase II clinical trials. Biometrics. 1979;35:775-783.

7 Lee YJ, Staquet M, Simon R. Two-stage plans for patient accrual in phase II cancer clinical trials. Cancer Treat Rep. 1979;63:1721-1726.

8 Fleming TR. One-sample multiple testing procedure for phase II clinical trials. Biometrics. 1982;38:143-152.

9 Chang MN, Therneau TM, Wieand HS, Cha S. Designs for group sequential phase II clinical trials. Biometrics. 1987;43:865-874.

10 Simon R. Optimal two-stage designs for phase II clinical trials. Control Clin Trials. 1989;10:1-10.

11 Therneau TM, Wieand HS, Chang M. Optimal designs for a grouped sequential binomial trial. Biometrics. 1990;46:771-781.

12 Bryant J, Day R. Incorporating toxicity considerations into the design of two-stage phase II clinical trials. Biometrics. 1995;51:1372-1383.

13 Thall PF, Simon RM, Estey EH. Bayesian sequential monitoring designs for single-arm clinical trials with multiple outcomes. Stat Med. 1995;14:357-379.

14 Tang H, Foster NR, Grothey A, et al. Comparison of error rates in single-arm versus randomized phase II cancer clinical trials. J Clin Oncol. 2010;28(11):1936-1941.

15 Simon R, Wittes RE, Ellenberg SS. Randomized phase II clinical trials. Cancer Treat Rep. 1985;69:1375-1381.

16 Sargent DJ, Goldberg RM. A flexible design for multiple armed screening trials. Stat Med. 2001;20:1051-1060.

17 Rubinstein LV, Korn EL, Freidlin B, et al. Design issues of randomized phase II trials and a proposal for phase II screening trials. J Clin Oncol. 2005;23(28):7199-7206.

18 Pocock SJ. Clinical Trials: A Practical Approach. New York: Wiley; 1984. pp 182-186

19 Guidance for Industry Non-Inferiority Clinical Trials. U.S. Department of Health and Human Services, Food and Drug Administration, March 2010. Available at www.fda.gov

20 Taves DR. Minimization. A new method of assigning patients to treatment and control groups. Clin Pharmacol Ther. 1974;15:443-453.

21 Pocock SJ, Simon R. Sequential treatment assignment with balancing for prognostic factors in the controlled clinical trial. Biometrics. 1975;31:103-115.

22 Freedman LS, White SJ. On the use of Pocock and Simon’s method for balancing treatment numbers over prognostic factors in the controlled clinical trial. Biometrics. 1976;32:691-694.

23 Begg CB, Iglewicz B. A treatment allocation procedure for sequential clinical trials. Biometrics. 1980;36:81-90.

24 Gail MH. Eligibility exclusions, losses to follow-up, removal of randomized patients, and uncounted events in cancer clinical trials. Cancer Treat Rep. 1985;69:1107-1113.

25 Pocock SJ. Group sequential methods in the design and analysis of clinical trials. Biometrika. 1977;64(2):191-199.

26 O’Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics. 1979;35(3):549-556.

27 Wieand S, Schroeder G, O’Fallon JR. Stopping when the experimental regimen does not appear to help. Stat Med. 1994;13:1453-1458.

28 Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Statist Assoc. 1958;53:457-481.

29 Parmar KB, Machin D. Survival Analysis: A Practical Approach. New York: Wiley; 1995.

30 Cox DR. Regression models and life-tables (with discussion). J Roy Statist Soc Ser B. 1972;34:187-220.

31 Sargent DJ, Wieand HS, Haller DG, et al. Disease free survival versus overall survival as a primary end point for adjuvant colon cancer studies. Individual patient data from 20898 patients on 18 randomized trials. J Clin Oncol. 2005;23(34):8664-8670.

32 Burzykowski T, Buyse M, Piccart-Gebhart MJ. Evaluation of tumor response, disease control, progression-free survival, and time to progression as potential surrogate end points in metastatic breast cancer. J Clin Oncol. 2008;26(12):1987-1992.

33 Simon RM, Paik S, Hayes DF. Use of archived specimens in evaluation of prognostic and predictive biomarkers. J Natl Cancer Inst. 2009;101(21):1446-1452.

34 Sargent DJ, Conley BA, Allegra C, Collette L. Clinical trial designs for predictive marker validation in cancer treatment trials. J Clin Oncol. 2005;23(9):2020-2027.

35 Chow SC, Chang M. Adaptive design methods in clinical trials—A review. Orphanet J Rare Dis. 2008;3:11.