It is worth noting that the formal decision rule is to conclude the treatment is worth carrying forward if the estimated response rate is at least 7/40 = 17.5% but not if the response rate is less than or equal to 6/40 = 15%. As Table 12-1 indicates, the probability of observing a response rate of at least 17.5% in our trial if the true response rate is 10% is only .10. In contrast, the probability of observing a response rate of at least 17.5% is .90 if the true response rate is 25%, and a positive recommendation is extremely likely (probability = .98) if the true response rate is as high as 30%. Other factors may enter into the final decision, particularly if the observed response rate is close to 15% or 20%.
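The probabilities quoted above are upper-tail binomial probabilities, and a short calculation can reproduce them. The following is a minimal sketch rather than the authors' own computation; it assumes the single-stage rule described in the text (40 patients, recommend if at least 7 respond), and the function name `prob_at_least` is introduced here for illustration only.

```python
from math import comb

def prob_at_least(r, n, p):
    """P(X >= r) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r, n + 1))

# Probability of a positive recommendation (at least 7 responses in 40 patients)
# for several true response rates.
for p in (0.10, 0.25, 0.30):
    print(f"true response rate {p:.2f}: P(at least 7 of 40) = {prob_at_least(7, 40, p):.2f}")
# prints 0.10, 0.90, and 0.98 for the three rates, respectively
```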

Traditional Single-Arm Phase II Designs: Two-Stage Design

Given that many experimental treatments ultimately do not demonstrate efficacy, ethical considerations lead to the usual recommendation of an early stopping rule for lack of efficacy in phase II trials. For example, suppose that the first 15 patients who are accrued do not respond to the therapy. At that point, an investigator might begin to believe that the treatment is no better than standard therapy and might even be worse. This raises the question of whether the trial should be terminated. On the other hand, there is still some probability that 7 of the next 25 patients will respond, resulting in a recommendation for continued consideration of the treatment. Statisticians have developed clinical trial designs that allow a prospectively specified examination of the data from the ongoing trial to address this issue.

Generally, the issue of early stopping rules is as follows: When designing a trial, it may be appropriate to introduce the possibility of stopping early if there is strong evidence that the treatment is no more and possibly less effective than the standard treatment, as long as doing so would not substantially reduce the probability of detecting a true beneficial effect (power). Of course, one might also want to stop early if the treatment appears to be extremely effective, although there is no ethical constraint against assigning patients to an apparently effective treatment, unless doing so would unduly prolong the further development of a very promising therapy.

Consider the following modification of the phase II study design just described. Accrual will proceed in two stages. In the first stage, 15 patients will be accrued, treated, and observed for clinical response. If no patients respond, the trial will be terminated, and the candidate treatment will not be recommended for further consideration; otherwise, an additional 25 patients will be accrued. If in total at least seven responses are observed, the treatment will be recommended for further consideration; otherwise, it will not be so recommended.

This new design has some attractive properties. The maximum number of patients required is the same as before, and, as shown in Table 12-2, the new design still satisfies each of the desired criteria 1 to 3 listed previously. In addition, if the new treatment has a true response rate of 5% or less, the study has at least a .46 chance of stopping after 15 patients have been accrued, thereby sparing 25 patients from treatment with an inactive regimen.

TABLE 12-2 Operating Characteristics of a Two-Stage Phase II Study Design Allowing Early Stopping for Zero Responses in the First 15 Patients

Now consider a design that requires early stopping if only 0 or 1 of the first 15 patients responds; otherwise, 25 more patients are accrued, and the treatment is recommended for further consideration if at least 7 patients respond. This design provides much better protection against the possibility of accruing an excessive number of patients to an ineffective regimen. However, if this design is used, it would be slightly less likely that a truly effective regimen would be recommended for further consideration. Table 12-3 shows the properties of this design.

TABLE 12-3 Operating Characteristics of a Two-Stage Phase II Study Design Allowing Early Stopping for Zero Responses or One Response in the First 15 Patients

Notice that with this design, there is a .83 probability that the study will terminate after 15 patients are entered if the regimen has only a 5% response rate. However, the probability of recommending the treatment is now only .86 if there is a 25% response rate.
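The operating characteristics of both two-stage designs follow from binomial probabilities for each stage. The sketch below is an illustration under stated assumptions, not the calculation used to build Tables 12-2 and 12-3: it assumes 15 first-stage and 25 second-stage patients, and a recommendation when at least 7 of the 40 patients respond. The function names and parameter names are ours.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_tail(r, n, p):
    """P(X >= r) for X ~ Binomial(n, p); equals 1 when r <= 0 and 0 when r > n."""
    return sum(binom_pmf(k, n, p) for k in range(max(r, 0), n + 1))

def two_stage(p, stop_if_at_most, n1=15, n2=25, min_total=7):
    """Return (P(stop after stage 1), P(treatment is recommended))."""
    p_stop = sum(binom_pmf(x, n1, p) for x in range(stop_if_at_most + 1))
    # Continue only when stage 1 has more than `stop_if_at_most` responses;
    # stage 2 must then supply the remaining responses needed.
    p_recommend = sum(
        binom_pmf(x1, n1, p) * binom_tail(min_total - x1, n2, p)
        for x1 in range(stop_if_at_most + 1, n1 + 1)
    )
    return p_stop, p_recommend

for cutoff in (0, 1):
    for p in (0.05, 0.25):
        p_stop, p_rec = two_stage(p, cutoff)
        print(f"stop if <= {cutoff} responses, true rate {p:.2f}: "
              f"P(stop early) = {p_stop:.2f}, P(recommend) = {p_rec:.2f}")
```

Under these assumptions the early-stopping probabilities at a 5% response rate come out near .46 and .83, and the probability of recommendation at a 25% response rate under the second design comes out near .86, in line with the values quoted in the text.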

Numerous authors have discussed optimal strategies for choosing phase II designs, including Gehan,⁵ Herson,⁶ Lee, Staquet and Simon,⁷ Fleming,⁸ Chang and associates,⁹ Simon,¹⁰ Therneau, Wieand and Chang,¹¹ Bryant and Day,¹² and Thall, Simon and Estey.¹³

Newer Phase II Designs: Randomized Phase II Design

Single-arm phase II designs can demonstrate biologic activity of an agent with relatively small sample size and short study duration. However, the potential for patient selection bias, the evolving methodologies for the assessment of “success” (e.g., changes in CT scanners, changes in response criteria), and a general lack of robustness of the historical rates on which this type of design relies result in a high possibility that a positive single-arm phase II trial will be followed by a negative phase III study.¹⁴ A potential solution to this fundamental limitation of single-arm phase II designs is the use of a randomized design. Two types of randomized phase II designs, the randomized selection design and the randomized screening design, have been proposed and used widely.

In a selection design, all arms are considered experimental arms. These could be different new regimens or different doses or schedules of the same regimen. The sample size is calculated to guarantee that an inferior arm (for instance, one in which the response rate is 15% lower) will be selected for the future phase III study only with a small probability, perhaps 10%. In a standard selection design, the arm with the highest estimated success rate for the primary end point will be recommended for a follow-up phase III study, no matter how small the difference of estimates might be among arms.¹⁵ On the other hand, a flexible selection design, or “pick-the-winner” design, will recommend the arm with the best estimated primary end point only if the difference is larger than a prespecified criterion.¹⁶ Otherwise, toxicities, expense, and other factors will be taken into consideration when making the decision.
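The link between sample size and the chance of selecting an inferior arm can be illustrated with an exact binomial calculation. The sketch below is a simplified illustration, not the method of the cited references: it assumes two arms with a binary response end point, equal numbers of patients per arm, a standard selection rule (highest observed response rate wins), and ties broken by a fair coin flip. The response rates of 35% and 20% and the function names are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_select_inferior(n, p_better, p_worse):
    """Probability that the truly worse arm shows the higher observed
    response rate with n patients per arm (ties split 50/50, an assumption)."""
    total = 0.0
    for x in range(n + 1):            # responses on the truly better arm
        px = binom_pmf(x, n, p_better)
        for y in range(n + 1):        # responses on the truly worse arm
            py = binom_pmf(y, n, p_worse)
            if y > x:
                total += px * py
            elif y == x:
                total += 0.5 * px * py
    return total

# Hypothetical response rates of 35% and 20%; scan a few per-arm sample sizes.
for n in (10, 20, 30, 40, 50):
    print(f"n per arm = {n}: P(select inferior arm) = "
          f"{prob_select_inferior(n, 0.35, 0.20):.3f}")
```

Scanning sample sizes in this way shows how a designer could pick the smallest n that keeps the probability of a wrong selection below a chosen threshold such as 10%.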

If a standard of care is included in the comparison, a screening design should be applied.¹⁷ A screening phase II design is similar to a randomized phase III trial but has larger type I and type II errors. For instance, a range of 10% to 20% for both error rates is acceptable. A screening phase II design provides a head-to-head comparison between the experimental regimen and the standard of care using a relatively smaller sample size and clear guidance as to the likelihood of success if the agent is moved forward into phase III testing.

Overall, randomized phase II designs balance prognostic factors among all arms, avoid patient selection bias, and provide robust arm comparison results. The major concerns for randomized phase II designs are (1) the larger sample size required relative to single-arm designs, (2) the relatively high false-negative and false-positive rates that result from a sample size that is still small relative to a phase III trial, and (3) the possible desire (or ethical mandate) to skip a confirmatory phase III study given a positive phase II result.

Another type of newly developed randomized phase II design is the phase II/III design. A phase II/III design starts with a phase II component. Patients are randomized among several experimental arms or between a standard-of-care arm and one or more experimental arms. Experimental arms could be compared with the historical control or the standard-of-care control to select arms for entering the phase III component. The main advantages of this type of design are (1) the speed of transition from phase II to phase III and (2) the fact that data obtained from the phase II component could be applied toward the phase III comparison as long as the protocol treatment is not altered between phases.

Phase III Trials

It is generally recognized that judging the value of a new therapy by comparing it with historical data may give an erroneous impression of the therapy’s efficacy. Pocock ¹⁸ related a number of illuminating examples of this phenomenon. Therefore the phase III trial, in which a new agent or modality is tested against an accepted standard treatment in a randomized comparison, is considered the most satisfactory method of establishing the value of the proposed treatment.

Under most circumstances, the goal of a phase III trial is to determine whether the proposed treatment is superior to the standard treatment (superiority trial). Occasionally, the goal is to “show that the difference between the new and active control treatment is small, small enough to allow the known effectiveness of the active control to support the conclusion that the new test drug is also effective” (noninferiority trial).¹⁹ Because the goal of a phase III trial is to reach a definitive conclusion regarding a new treatment’s efficacy, enough patients need to be accrued to guarantee a small probability of false-positive results (i.e., declaring that a regimen is effective when in fact it is not) and a large probability of true-positive results (i.e., declaring that a regimen is effective when in fact it is). The probability of false-positive results is also called the size of the study, which usually is limited to 5% or less. The probability of true-positive results is also called the power of the study, which usually is set at 80% to 95%. The primary end points for comparison are usually survival and PFS/DFS, and a common secondary end point is quality of life (QOL). In this section, we discuss the rationale for requiring randomized treatment assignment and stratification in a phase III trial. We introduce the intent to treat principle, and we emphasize the importance of monitoring in ongoing trials.

We now consider the possible consequences associated with these three possibilities. Under the assumption that patients’ reasons for refusal of treatment are completely unassociated with prognosis, methods 1 and 2 just presented should lead to estimated treatment differences that are essentially identical to the analysis of Figure 12-1A. If there is a treatment difference, the intent to treat analysis will tend to underestimate this difference because some of the patients who were not irradiated are counted as having received that therapy. The differences in the three analyses should be rather small under this assumption of no association between refusal of treatment and prognosis.

Alternatively, it is rather likely that reasons for refusal of treatment will be associated with patient prognosis. For example, patients with poorer performance status, or their physicians, may be less likely to accept assignment to a treatment that is known to be associated with significant toxicity. Deteriorating health may lead to difficulties in traveling to receive treatment; advancing disease may correlate with a patient’s level of depression or anxiety, which, in turn, may correlate with compliance. Suppose that it is the 10% of patients with the worst prognosis who refuse to accept their radiotherapy. Under these conditions, analysis of the data after deleting the noncompliant patients overstates the effect of radiotherapy, as shown in Figure 12-2A. This is because the worst-prognosis patients are deleted from consideration in arm A, but no similar deletion of poor-prognosis patients is made in arm B. The result is a spuriously significant p value of .021. Treating the noncompliers as though they had been randomized to observation (see Fig. 12-2B) results in an even more serious overstatement of the radiotherapy effect, associated with a p value of less than .009. In contrast, the intent to treat analysis yields a slightly attenuated treatment effect (p = .065) (Fig. 12-2C) but no serious misrepresentation of the true effect.

Figure 12-2 A, Survival curves that would be observed if the poor-risk patients who refuse radiation therapy (RT) are excluded from the analyses (percent survival versus years from randomization). B, Survival curves that would be observed if the poor-risk patients who refuse radiation therapy are included as untreated patients (percent survival versus years from randomization). C, Survival curves that would be observed if the poor-risk patients who refuse radiation therapy are included as radiation therapy patients (i.e., the results that would be obtained using an intent to treat analysis [percent survival vs. years from randomization]).

Why did the intent to treat analysis correspond to the “truth” more closely than the analysis that omitted the subset of patients who did not receive their assigned radiation? This is because the effect of the prognostic factor (risk group) was larger than the treatment effect and the likelihood of patient refusal was associated with both the treatment received and the prognostic group of the patient. Although patient prognoses were balanced as randomized, they are not balanced as treated. In this case, the baseline prognosis of patients who actually received radiation was better than the baseline prognosis of those who were under observation. It is precisely this type of imbalance that the intent to treat analysis is designed to prevent.

There is no one correct answer as to the best approach to analysis when not all patients receive their assigned therapy. However, there is general agreement among statisticians that it is always appropriate to perform an intent to treat analysis in which each patient is analyzed according to the treatment assigned at randomization, regardless of what treatment was actually received. Results of any other analysis should be compared with the results of the intent to treat analysis, and if the results differ substantively, the interpretability of the data must be questioned.

The prior examples illustrated only one type of problem that may be addressed by an intent to treat analysis. Biases similar to those just described can occur if patients elect to cross over study arms, accept a nonstudy therapy, or receive only a portion of the assigned therapy or if they are determined to be ineligible. Therefore we advocate attempting to obtain complete follow-up for all patients registered to every phase III trial, and performing an intent to treat analysis. Sometimes this may not be possible, because patients who refuse protocol therapy may also refuse to be followed. If only 1% or 2% of patients fall into this category, there is little danger of important bias, but if this number is larger, say 5% to 10%, there is a real danger that the study results may be biased. One might then want to perform what are called sensitivity analyses to examine the possible effect of having so many patients without follow-up. In its simplest form, one might perform one analysis assuming that all of the patients without outcome data on arm A failed early on, whereas those on arm B survived; and then perform an analysis assuming that all of the patients without outcome data on arm A survived, whereas those on arm B failed. If one reaches the same conclusion about treatment effect in either analysis, then the exclusions do not represent a serious problem.
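The simplest form of sensitivity analysis described above can be sketched as two extreme imputations of the missing outcomes. The example below is illustrative only: the patient counts are hypothetical, the function name is ours, and it compares only the point estimates of the failure rates, whereas a real analysis would repeat the planned significance test under each scenario.

```python
def extreme_case_differences(fail_a, obs_a, miss_a, fail_b, obs_b, miss_b):
    """Failure-rate difference (arm A minus arm B) under two extreme
    assumptions about patients with no outcome data.

    fail_x: observed failures; obs_x: patients with follow-up;
    miss_x: patients without follow-up.
    """
    n_a = obs_a + miss_a
    n_b = obs_b + miss_b
    # Scenario 1: every missing patient on arm A failed, every one on arm B survived.
    worst_for_a = (fail_a + miss_a) / n_a - fail_b / n_b
    # Scenario 2: every missing patient on arm A survived, every one on arm B failed.
    best_for_a = fail_a / n_a - (fail_b + miss_b) / n_b
    return worst_for_a, best_for_a

# Hypothetical trial with 200 patients per arm.
few_missing = extreme_case_differences(60, 198, 2, 90, 198, 2)      # 1% missing
many_missing = extreme_case_differences(60, 180, 20, 75, 180, 20)   # 10% missing
print("1% missing: ", few_missing)    # both differences favor arm A
print("10% missing:", many_missing)   # the two differences have opposite signs
```

When both scenarios point in the same direction, as in the first hypothetical case, the missing follow-up is unlikely to change the conclusion; when they disagree, as in the second, the possibility of bias cannot be dismissed.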

There is often disagreement concerning the inclusion of ineligible patients in analyses, particularly in analyses dependent on patient covariates, because ineligible patients may not fit any reasonable classification. The decision as to whether it is appropriate to exclude ineligible patients depends on the methods used for determining ineligibility. For example, in some studies, each patient is reviewed for eligibility in a uniform way by a reviewer (or review committee) who is unaware of the assigned treatment and uses only information that was obtained before randomization. If that is the only mechanism for classifying patients as ineligible, one could reasonably exclude such patients from analyses. However, any less stringent approach that allows patients to be classified as ineligible after randomization could introduce subtle biases. For example, in a trial of radiation versus observation, one might routinely perform a pretreatment examination immediately after a patient is assigned to radiation therapy. If, during that examination, it is determined that the patient had not met an eligibility requirement, exclusion would cause a bias because a similar patient not assigned to radiation would not have had the pretreatment examination and would, therefore, not have been determined to be ineligible.

To summarize, we recommend always performing an intent to treat analysis using all patients as randomized and reporting the results of that analysis in any publication. Further analyses may be performed, especially for noninferiority trials, but one should be aware of potential biases similar to those described previously. Further discussion of these issues can be found in articles by Pocock ¹⁸ and Gail.²⁴

The data used in Example 1 were generated using the following assumptions. The expected 5-year survival rate for an untreated good-risk patient was .73, and for an untreated bad-risk patient, .54. The reduction in the death rate associated with radiation treatment was assumed to be 20% in each group. For Figure 12-2, we assumed that all the poor-risk patients had an expected 5-year survival rate of .54 because none received treatment.
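The direction of the biases discussed in this section can be checked with a small expected-value calculation based on the assumptions just listed. The sketch below does not reproduce the simulated data or the p values in Figure 12-2. It additionally assumes, hypothetically, that half of the patients are bad risk, that a 20% reduction in the death rate means a constant hazard ratio of 0.80 (so that treated survival equals untreated survival raised to the power 0.8), and that the refusers are 10% of arm A, all bad risk.

```python
S_GOOD, S_BAD = 0.73, 0.54   # untreated 5-year survival, from the text
HAZARD_RATIO = 0.80          # 20% reduction in death rate (proportional hazards assumed)
FRAC_BAD = 0.50              # hypothetical: risk-group mix is not stated in this passage
FRAC_REFUSE = 0.10           # worst-prognosis patients on arm A who refuse radiation

def treated(s):
    """5-year survival after treatment under a constant hazard ratio."""
    return s ** HAZARD_RATIO

def five_year_differences():
    """Expected 5-year survival difference (radiation minus observation)
    under full compliance and under the three analyses discussed."""
    good, bad = 1 - FRAC_BAD, FRAC_BAD
    arm_b = good * S_GOOD + bad * S_BAD
    full_compliance_a = good * treated(S_GOOD) + bad * treated(S_BAD)
    # Survival mass contributed by arm A patients who accept radiation.
    compliers = good * treated(S_GOOD) + (bad - FRAC_REFUSE) * treated(S_BAD)
    itt_a = compliers + FRAC_REFUSE * S_BAD              # refusers kept on arm A
    excluded_a = compliers / (1 - FRAC_REFUSE)           # refusers deleted
    as_treated_b = (arm_b + FRAC_REFUSE * S_BAD) / (1 + FRAC_REFUSE)  # refusers moved to B
    return {
        "full_compliance": full_compliance_a - arm_b,
        "intent_to_treat": itt_a - arm_b,
        "exclude_refusers": excluded_a - arm_b,
        "refusers_as_untreated": excluded_a - as_treated_b,
    }

for name, diff in five_year_differences().items():
    print(f"{name:>22}: {diff:+.3f}")
```

Under these assumptions the intent to treat difference is slightly smaller than the full-compliance difference, whereas excluding the refusers or counting them as untreated inflates it, which matches the pattern described for Figure 12-2.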

Monitoring Ongoing Trials

In clinical trials, patients enter a study sequentially over time; therefore, information about the treatment accumulates as the trial progresses. An interim analysis is a planned analysis conducted before the final planned analysis, which allows the study sponsor to evaluate the trial’s success probability while controlling the overall error rates. As the trial progresses, interim analyses are usually conducted at prespecified time points, which are typically defined by the total number of events observed. Interim analyses can be used to monitor superiority or futility, or both. If a new regimen is convincingly demonstrated to be superior to the active control, there is an ethical obligation to stop the randomization and provide the new regimen to every patient. If an interim analysis shows that there is little chance for the new regimen to outperform the control, it would be prudent to terminate the trial so that patients can be offered other promising regimens.

There are different approaches in setting interim monitoring boundaries for superiority. For instance, Pocock ²⁵ proposed to spend the type I error (false-positive rate) equally throughout all interim and final analyses. O’Brien and Fleming²⁶ proposed to reserve most of the type I error for the final analysis. The general approach is to set the interim boundaries conservatively so that the trial will not stop early for efficacy unless very convincing evidence is present.

Typically, interim boundaries for futility are not set as conservatively as those for superiority. A useful rule of thumb proposed by Wieand and coauthors ²⁷ is as follows: Assuming that a study has a time-to-event primary end point, an interim futility check will be conducted when 50% of the events targeted for the final analysis have been observed. If the hazard ratio of the experimental treatment versus the control is larger than 1 (i.e., the outcomes on the experimental arm are poorer than those on the control arm at that time), the trial should be stopped for reasons of futility. The rule is simple to follow, and the probability of falsely concluding that an experimental regimen is no better than the control when, in fact, the experimental regimen is effective is 2% or less.
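The small chance of a false futility stop under this rule can be approximated with the usual normal approximation for the log hazard ratio. The sketch below is not taken from the cited paper; it assumes 1:1 randomization, a final analysis sized for a one-sided type I error of .025 and 90% power at the design alternative, and an estimated log hazard ratio that is approximately normal with variance 4/d when d events have been observed. The function name and defaults are ours.

```python
from math import sqrt
from statistics import NormalDist

def prob_false_futility_stop(alpha_one_sided=0.025, power=0.90, info_fraction=0.5):
    """Approximate P(estimated hazard ratio > 1 at the interim look)
    when the design alternative is true."""
    z = NormalDist()
    # With D = 4 * (z_alpha + z_beta)^2 / log(HR)^2 events at the final analysis,
    # the standardized treatment effect ("drift") at the end of the trial is:
    drift_final = z.inv_cdf(1 - alpha_one_sided) + z.inv_cdf(power)
    # Only a fraction of the events is available at the interim look.
    drift_interim = drift_final * sqrt(info_fraction)
    # The estimated HR exceeds 1 when the standardized estimate falls below zero.
    return z.cdf(-drift_interim)

print(f"{prob_false_futility_stop():.3f}")  # about 0.011 under these assumptions
```

With these assumed design parameters the probability is roughly 1%; the exact figure depends on the type I error and power chosen for the trial, and designs with lower power give somewhat larger values.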

Interim monitoring mainly focuses on treatment efficacy. To monitor other aspects of the trial (for instance, to protect the safety of enrolled patients; to ensure the validity of study results; or to identify unacceptably slow accrual rates, unusually high dropout rates, or unacceptable ineligibility rates), a Data Safety Monitoring Board (DSMB) is usually recommended, if not required. The DSMB is an independent panel of experts. Usually, a physician and a statistician are required. Other members could include epidemiologists, laboratory scientists, a patient representative, or representatives of different groups. DSMB members have access to all data and report directly to the study sponsor. In current clinical trial practice, the U.S. Food and Drug Administration (FDA) strongly recommends using a DSMB for all phase III trials, and many institutional review boards now require use of a DSMB.

Survival Analysis

Survival analysis differs from other types of statistical analysis in that the analysis of time to an event often is complicated by the lack of complete follow-up for every patient (i.e., there is censored data). An example illustrates why this is important. The example uses hypothetical data chosen to illustrate clearly the problem that censoring can cause.

Example 2. Suppose a radiation oncologist decides to review the outcomes of rectal cancer patients treated with radiation therapy. Of primary interest is the determination of the proportion of patients who remain relapse free for 5 years. Suppose patients have been accrued over the past 7.5 years, and by chance the patients have started treatment in pairs at intervals of 1 year. Suppose, furthermore, that if no more patients were accrued and the oncologist could wait 5 more years to obtain 5-year follow-up for every patient, he or she would find that half of them were relapse free for more than 5 years. This (unobserved) relapse pattern is shown in Figure 12-3.

Figure 12-3 Date of entry (E) and relapse (R) of 16 patients monitored over a 12-year period; 8 did not experience relapse.

Twelve years after the first patient was entered, the 5-year relapse-free pattern would be as shown in Figure 12-4 (time 0 is the date of entry for each patient). If the investigator had waited 12 years (i.e., until 5-year data for all 16 patients were complete), he or she would have observed that half of the patients were relapse free 5 years after entry (i.e., the 5-year relapse-free rate is .50). Suppose, in fact, that the investigator decided to look at the patients’ experience 7.5 years after the first patient had entered. Then the investigator would have observed everything to the left of the vertical line in Figure 12-5. Translated into time from entry, this would be represented as shown in Figure 12-6. In that case, the investigator would have complete 5-year data for seven of the eight patients who relapsed and for three patients who were relapse free at 5 years (those who were entered at year 0, 1, or 2). For the other six patients, his or her knowledge would be that they were relapse free for some length of time less than 5 years. The investigator’s first instinct might be to exclude these six patients from analysis because he or she does not know what their relapse status will be after 5 years of follow-up. Of the remaining 10 patients, 7 are known to have relapsed and 3 are known to be disease free for more than 5 years, so that the estimated relapse-free rate at 5 years is 3/10 = .30. This estimate does not seem to reflect the data accurately.

Figure 12-4 Relapse history of 16 patients seen 12 years after the beginning of the study: time from entry into the study until relapse or last follow-up. C, censored; R, relapsed.

Figure 12-5 Status of follow-up 7.5 years after the first patient was entered in the trial. E, entry; R, relapse.

Figure 12-6 Relapse history of 16 patients seen 7.5 years after the beginning of the study: time from entry into the study until relapse or last follow-up. C, censored; R, relapsed.

The estimate is so different from .50 because the method of calculation is biased. The cause of the bias can be seen by studying the seventh and eighth patients in the series of patients (i.e., the patients entered 3 years after the first patient entered). Notice that the seventh patient entered was still relapse free when the radiation oncologist performed his or her analysis but had only been followed for 4.5 years (i.e., did not have a known status at 5 years) and, hence, was excluded. However, the eighth patient (who was entered on the same day) had relapsed by the time of analysis and, therefore, was included. The bias is that patients who relapse have a better chance of being included in the analysis than those who do not relapse. One way to avoid a biased estimate would be to exclude all patients who did not have the potential to be monitored for 5 years. Therefore the oncologist would only be able to use the information from the first six patients entered. In this artificial example, this would have led to a correct estimate of the 5-year relapse-free rate because three of the first six patients relapsed within 5 years.

This method is unsatisfying, because not all of the available information is used. In statistical terms, the disadvantage of our proposed solution is that there is considerably more variance associated with an estimate that uses the data from only six patients than one that uses all of the available data. For example, if patients 6 and 7 had entered the study in the opposite order, the relapse-free estimate would jump from 50% to 67%. A better method is described in the next section.

Kaplan-Meier Method

A common way to summarize survival data is to estimate the “survival curve,” which shows—for each value of time—the proportion of subjects who survive at least that length of time. Figure 12-7 shows survival curves for patients with operable breast cancer who have been treated with either preoperative or postoperative chemotherapy (doxorubicin and cyclophosphamide) in a large clinical trial. In this section, we discuss the estimation of survival curves using a method proposed by Kaplan and Meier.²⁸

Figure 12-7 Survival of patients with operable breast cancer treated with doxorubicin and cyclophosphamide.

Several excellent references are available that show exactly how a Kaplan-Meier estimate is computed and describe other properties such as the variance of the estimator. A nice summary with formulas for computations can be found in Chapter 2 of Parmar and Machin’s text.²⁹ Here, we offer two examples that provide some insight regarding the Kaplan-Meier method. The first example concerns a data set having no censored observations; the second example extends these ideas to accommodate censored observations.

Example 3. Suppose four patients are entered into a study, possibly at different times, and all four die, with death times of 10 months, 20 months, 25 months, and 40 months, respectively, from time of entry. We want to estimate the 3-year survival rate. Three of the four patients died within 3 years from entry, and one patient remained alive more than 3 years from entry (i.e., one fourth of the patients entered lived at least 3 years). Therefore a reasonable estimate of the 3-year survival rate is S(3) = .25. This estimate (the proportion of patients who remained alive at 3 years) is referred to as the empirical survival estimate, and the graph of estimates obtained this way at all time points is called the empirical survival curve.

Example 4. Suppose that in the previous example three of the patients have died at 10 months, 25 months, and 40 months, respectively. The fourth patient is still alive but has been monitored for only 20 months. We want to estimate the 3-year survival rate. Notice that if we could wait another 16 months to obtain the estimate (so that we would have 3-year data for all patients), we would obtain an estimate of either one fourth (if the fourth patient dies soon) or one half (if the patient remains alive for another 16 months). The rather intuitive approach we used in Example 3 to estimate the 3-year survival rate does not work here, because we do not know whether there will be two or three survivors when all the patients have died or have been monitored for 3 years (i.e., we do not know how to handle the patient whose follow-up was censored at 20 months). Kaplan and Meier proposed an approach that updates the survival estimate at the time of each death using only those patients who are at risk of failing at each update.

Because none of the patients died during the first 10 months, the Kaplan-Meier estimate of the probability of surviving to any time point less than 10 months is equal to 1. Because four patients are alive at 10 months but only three quarters of them survive beyond 10 months, the Kaplan-Meier estimate changes to three quarters for time points beyond 10 months but before the next death. Although one patient is censored at 20 months, this gives no information regarding the likelihood of a death, so the Kaplan-Meier estimate remains at three quarters for all time points between 10 months and 25 months. One of the two patients who are still at risk just before 25 months dies at that time. Therefore the estimate of the probability of a patient surviving beyond 25 months, given that the patient survived at least 25 months, is one half. The Kaplan-Meier estimate of surviving more than 25 months is the product of three quarters (the estimated probability a patient will survive until 25 months) times one half (the estimated probability that a patient will survive more than 25 months given that the patient was alive at 25 months), which is three eighths. There are no other deaths before 3 years, so the Kaplan-Meier estimate of the 3-year survival rate is three eighths. The Kaplan-Meier method uses all the relevant available data from each patient but excludes the patient who was censored at 20 months from all computations beyond that time.

If the Kaplan-Meier approach is applied to the data in Example 3, the estimate S(3) is equal to the (probability of surviving 10 months) × (probability of surviving more than 10 months given survival of 10 months) × (probability of surviving more than 20 months given survival of 20 months) × (probability of surviving more than 25 months given survival of 25 months), or 1 × 3/4 × 2/3 × 1/2 = 1/4, which matches the empirical survival estimate. In fact, the Kaplan-Meier estimate and the empirical estimate always match when they are applied to uncensored data.
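The product of conditional survival probabilities used in Examples 3 and 4 can be written as a short routine. This is a minimal sketch for illustration, not a full implementation (it gives no variance estimate, for example); the function name and argument layout are ours, and exact fractions are used so that the results can be compared directly with the text.

```python
from fractions import Fraction

def kaplan_meier(times, died, t):
    """Kaplan-Meier estimate of the probability of surviving beyond time t.

    times: follow-up time for each patient (months)
    died:  True if the patient died at that time, False if censored
    """
    estimate = Fraction(1)
    # The estimate changes only at distinct death times up to t.
    death_times = sorted({ti for ti, d in zip(times, died) if d and ti <= t})
    for u in death_times:
        at_risk = sum(1 for ti in times if ti >= u)   # still under observation at u
        deaths = sum(1 for ti, d in zip(times, died) if d and ti == u)
        estimate *= Fraction(at_risk - deaths, at_risk)
    return estimate

# Example 3: all four patients die; 3-year (36-month) survival.
print(kaplan_meier([10, 20, 25, 40], [True, True, True, True], 36))   # 1/4
# Example 4: the patient followed for 20 months is censored.
print(kaplan_meier([10, 20, 25, 40], [True, False, True, True], 36))  # 3/8
```

The censored patient contributes to the number at risk at 10 months but is dropped from the risk set at 25 months, which is exactly how the estimate of three eighths arises.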

The reader may verify that application of the Kaplan-Meier approach to the data in Figure 12-6 will result in an estimate of .45 for the probability of remaining relapse free through 5 years.

Log-Rank Statistic

Perhaps the most common application of survival analysis techniques is the comparison of survival times (or other times to event, such as time to disease relapse) between two or more groups of patients that differ in some aspect (e.g., male versus female, or treated versus untreated). We discuss the log-rank statistic, the stratified log-rank statistic, and the Cox proportional hazards model in some detail. These methods are commonly used to compare survival times between groups. We avoid excessive mathematical detail; instead, we concentrate on showing how these methods accommodate censored survival times and, in the cases of the stratified log-rank statistic and the Cox proportional hazards model, how differences in individual patient characteristics may be controlled to avoid biased group comparisons.

The log-rank statistic deals with the problem of censoring by comparing the groups only when a patient within any of the groups experiences an “event” (if survival times are to be compared across groups, an event would be a death; if times to relapse are to be compared, an event would be a relapse, and so on). This idea is most easily explained in the context of a simple example. Suppose we monitored three patients receiving a standard treatment regimen (this might even be no treatment), which we refer to as treatment A, and three other patients receiving an experimental regimen, which we refer to as treatment B. Suppose, furthermore, that the survival times for the patients receiving treatment A are 10, 40+, and 55 days, respectively, and for the patients receiving treatment B, 15, 50, and 60 days, respectively. A plus sign after a value denotes a censored time (i.e., a patient with a time of 40+ days was last known to be alive at 40 days and no further follow-up is available). These data are represented graphically in Figure 12-8; the three patients receiving treatment A are labeled A1, A2, and A3, and those receiving treatment B are labeled B1, B2, and B3. Deaths are denoted by the letter D, and censored survival times are labeled with the letter C.

Figure 12-8 Survival times of six patients treated with one of two regimens. C, censored; D, death.

When evaluating the log-rank statistic, the first computation occurs at time t = 10, the time at which the first death is observed. Just before this point in time, all six patients are known to be alive (and, hence, are “at risk” to die at time t = 10). Three of these patients received treatment A and three received treatment B. Exactly one of these six patients is known to have died at time t = 10, and he or she received treatment A. Table 12-4 summarizes the status of patients at this time point. Notice that at the time of this computation there are three patients on each arm. Therefore, if treatment B were equivalent to treatment A and only one death occurred on one arm or the other, there would be a one-half chance that the death is on arm A. In fact, the death is on arm A; hence, we observe one death on arm A when one half of a death was expected (i.e., the observed deaths minus the expected number is 1 − 1/2 = 1/2).

TABLE 12-4 Observed Deaths (t = 10)

Observations at time t = 15 are listed in Table 12-5. At this point five patients remain at risk, two on treatment A and three on treatment B, so there would be a two-fifths probability that the death would occur on treatment A (if the treatments were equivalent). The death did not occur on treatment A, so there were two-fifths fewer deaths on arm A than expected (i.e., the observed deaths minus the expected number is 0 − 2/5 = −2/5).

TABLE 12-5 Observed Deaths (t = 15)

The next computation occurs at time t = 50; observations are listed in Table 12-6. Notice that patient A2 is not included in this table, even though she is not known to have died at any time before t = 50. Because she was lost to follow-up (censored) at time 40, she is no longer “at risk” at time 50. At this time, three patients remain at risk, one on treatment A and two on treatment B, so there would be a one-third probability that the death would occur on treatment A (if the treatments were equivalent). The death does not occur on treatment A, so there are one-third fewer deaths on arm A than expected (i.e., the observed deaths minus the expected number is 0 − 1/3 = −1/3).

TABLE 12-6 Observed Deaths (t = 50)

One may go through the same computations at time t = 55 (Table 12-7) and will determine that the number of deaths minus expected deaths on treatment A is 1 − 1/2 = 1/2. At time t = 60, all remaining patients are on the same arm, so the observed minus the expected number of deaths must be 0. Adding up the observed minus the expected number of deaths at times 10, 15, 50, and 55, one obtains 2 observed deaths minus 1.733 expected deaths, so that there were 0.27 more deaths than expected on arm A, suggesting that this treatment might be harmful. However, one’s intuition is that this is not a significant difference (i.e., such a small difference could easily be attributed to the play of chance), and in fact this is true. Therefore there is no strong evidence in this example that treatments A and B differ.
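The bookkeeping in this example can be checked with a short Python sketch. The tuple layout and the function name `observed_minus_expected` are choices made for this illustration only; the survival times are those of the six patients above, and exact fractions are used so the terms match the text.

```python
from fractions import Fraction

# (time in days, died, arm) for the six example patients.
# died=False marks a censored time such as 40+.
patients = [
    (10, True, "A"), (40, False, "A"), (55, True, "A"),
    (15, True, "B"), (50, True, "B"), (60, True, "B"),
]

def observed_minus_expected(data, arm="A"):
    """Return (time, observed, expected) on `arm` at each distinct death time."""
    rows = []
    death_times = sorted({t for t, died, _ in data if died})
    for t in death_times:
        # Patients still under observation at time t form the risk set.
        at_risk = [(u, d, a) for u, d, a in data if u >= t]
        n_total = len(at_risk)
        n_arm = sum(1 for _, _, a in at_risk if a == arm)
        deaths = sum(1 for u, d, _ in at_risk if d and u == t)
        deaths_arm = sum(1 for u, d, a in at_risk if d and u == t and a == arm)
        # Under equivalent treatments, deaths are shared in proportion to
        # the number at risk on each arm.
        expected = Fraction(deaths * n_arm, n_total)
        rows.append((t, deaths_arm, expected))
    return rows

rows = observed_minus_expected(patients)
for t, obs, exp in rows:
    print(f"t={t}: observed={obs}, expected={exp}, difference={obs - exp}")

total_observed = sum(obs for _, obs, _ in rows)
total_expected = sum(exp for _, _, exp in rows)
print(f"total: {total_observed} observed, {float(total_expected):.3f} expected")
```

The per-time expected values are 1/2, 2/5, 1/3, 1/2, and 0, which sum to 26/15, or about 1.733 expected deaths on arm A.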

TABLE 12-7 Observed Deaths (t = 55)

More formally, the log-rank statistic is defined to be the difference between the observed and expected numbers of deaths on one of the two treatment arms. The statistic may be standardized by dividing by the square root of its variance, yielding a score that, under the hypothesis of equivalent treatments, is approximately standard normal. In the present example, the variance of the log-rank statistic can be shown to equal 0.9622, so the standardized test statistic is (2 − 1.733)/√0.9622 = 0.267/0.981 = 0.27, corresponding to a two-sided p value of about .79.
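The variance quoted here can be checked term by term with the usual hypergeometric variance of the log-rank statistic. The notation is introduced only for this sketch: at the jth death time, n_{Aj} and n_{Bj} are the numbers at risk on each arm, n_j is their sum, and d_j is the number of deaths.

```latex
\begin{align*}
V &= \sum_j \frac{n_{Aj}\, n_{Bj}\, d_j\, (n_j - d_j)}{n_j^{2}\,(n_j - 1)}
   = \frac{3 \cdot 3}{6^{2}} + \frac{2 \cdot 3}{5^{2}}
     + \frac{1 \cdot 2}{3^{2}} + \frac{1 \cdot 1}{2^{2}} \\
  &= 0.25 + 0.24 + 0.2222 + 0.25 = 0.9622, \\[4pt]
Z &= \frac{O - E}{\sqrt{V}} = \frac{2 - 1.733}{\sqrt{0.9622}}
   \approx \frac{0.267}{0.981} \approx 0.27 .
\end{align*}
```

With a single death at each time (d_j = 1), each term reduces to n_{Aj} n_{Bj} / n_j². The death at t = 60 contributes nothing because only arm B is still at risk.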

Stratified Log-Rank Statistic

The log-rank test as defined previously uses only survival information and group membership. Other patient characteristics are not taken into account. We now introduce a method to adjust for the possible effects of other covariates. Suppose that we had the following survival times (in days) for a group of breast cancer patients who had differing numbers of positive lymph nodes and who received either treatment A or treatment B:

1 to 3 nodes
Treatment A	91, 160+, 230+
Treatment B	32, 101, 131+, 155+, 190+, 210+

4 or more nodes
Treatment A	11, 22, 42, 63, 72, 110, 120, 141, 155, 180+, 200+, 220+
Treatment B	51, 83, 110, 170+

If we were to ignore our knowledge of the patients’ lymph node status and compute a log-rank statistic as we did before, our computations at times 11 and 91 would be as shown in Tables 12-8 and 12-9. If one completes these computations at each time of death, the total for arm A is 10 observed deaths when 8.85 were expected. This translates to 1.15 excess events on arm A.

TABLE 12-8 Observed Deaths (t = 11)*

TABLE 12-9 Observed Deaths (t = 91)*

However, when one examines the distribution of patients, it is not surprising that there is an excess number of events on arm A. After all, 12 of 15 patients on arm A had four or more positive nodes, whereas only 4 of 10 patients on arm B had four or more positive nodes. Given that one would expect patients with four or more positive nodes to be at greater risk than those with one to three positive nodes, it is not clear whether the observation of 1.15 excess deaths on arm A represents a possible treatment effect or reflects the nodal imbalance. That is, the comparison of treatments A and B using the ordinary log-rank statistic is biased by an imbalance of disease stage between the two study arms.

The stratified log-rank test provides a method to control for the association of nodal status with patient survival. To illustrate how this occurs, let us recompute the observed and expected deaths at times of 11 and 91 days, but this time we adjust our definition of risk sets to include nodal status. Notice that when we performed these computations for the log-rank test at t = 11 (as shown previously), we estimated the likelihood that the death that occurred was on arm A to be 15/25, based on the fact that 15 of the 25 patients at risk at that time were on arm A. This implicitly assumed that in the absence of a treatment effect, all patients are at the same risk of dying regardless of nodal status. However, the first patient who died was one who had four or more positive nodes. The patients who would be at the same risk of dying as that patient are the other patients with four or more nodes. Because we knew there were 16 patients with four or more nodes who were at risk of dying at t = 11 and that 12 of these patients were on arm A, perhaps we should have estimated the probability to be three quarters that if a patient with four or more nodes died at time t = 11, the patient would be on arm A (given no treatment effect). This is the method used when computing the stratified log-rank test. We now examine the computations at times of 11 and 91 days using this stratified approach (Tables 12-10 and 12-11).

TABLE 12-10 Observed Deaths (t = 11, 4 or More Positive Nodes)*

TABLE 12-11 Observed Deaths (t = 91, 1 to 3 Positive Nodes)*

If one restricts these computations only to the patients with four or more positive nodes, the sum of the observed deaths on treatment A minus the expected is 9 − 9.11 = −.11. In the patients with one to three positive nodes, similar computations yield 1 observed death and 0.99 expected deaths. Summing these values across both strata gives 10 observed deaths on treatment A compared with 10.10 expected deaths for a difference of −.10. Thus the stratified approach leads to almost no difference between the number of observed and expected deaths, indicating that there is essentially no estimated treatment difference after adjusting for nodal status.
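The two calculations can be compared side by side in a short Python sketch using the survival times listed earlier. The helper names `parse` and `observed_expected` are hypothetical. The sketch also assumes that a patient censored at a death time (the 155+ patient on treatment B) is still counted as at risk at that time, a convention that reproduces the totals quoted in the text.

```python
def parse(times):
    """Turn '160+' into (160, False) and '91' into (91, True): (time, died)."""
    return [(int(s.rstrip("+")), not s.endswith("+")) for s in times.split(", ")]

strata = {
    "1-3 nodes": {
        "A": parse("91, 160+, 230+"),
        "B": parse("32, 101, 131+, 155+, 190+, 210+"),
    },
    "4+ nodes": {
        "A": parse("11, 22, 42, 63, 72, 110, 120, 141, 155, 180+, 200+, 220+"),
        "B": parse("51, 83, 110, 170+"),
    },
}

def observed_expected(arm_a, arm_b):
    """Observed and expected deaths on arm A, summed over all death times."""
    observed = sum(1 for _, died in arm_a if died)
    expected = 0.0
    death_times = sorted({t for t, died in arm_a + arm_b if died})
    for t in death_times:
        n_a = sum(1 for u, _ in arm_a if u >= t)  # at risk on arm A at time t
        n_b = sum(1 for u, _ in arm_b if u >= t)  # at risk on arm B at time t
        deaths = sum(1 for u, died in arm_a + arm_b if died and u == t)
        expected += deaths * n_a / (n_a + n_b)
    return observed, expected

# Unstratified: pool both nodal groups within each arm.
pooled_a = [p for s in strata.values() for p in s["A"]]
pooled_b = [p for s in strata.values() for p in s["B"]]
obs_pooled, exp_pooled = observed_expected(pooled_a, pooled_b)

# Stratified: compute within each nodal group, then add the results.
per_stratum = {name: observed_expected(s["A"], s["B"]) for name, s in strata.items()}
obs_strat = sum(o for o, _ in per_stratum.values())
exp_strat = sum(e for _, e in per_stratum.values())

print(f"unstratified: O={obs_pooled}, E={exp_pooled:.2f}")
for name, (o, e) in per_stratum.items():
    print(f"{name}: O={o}, E={e:.2f}")
print(f"stratified:   O={obs_strat}, E={exp_strat:.2f}")
```

The only change between the two analyses is the risk set used at each death time: all patients still at risk, or only those in the same nodal group. The unrounded stratified total is about 10.107, which prints as 10.11; the 10.10 in the text comes from adding the rounded per-stratum values 9.11 and 0.99.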

Notice that the difference between the two approaches is seen in the computation of the probability of a death occurring on treatment A at each time point. If one understands the method of computing these probabilities (expected values), one can begin to understand the possible biases of different approaches.

In the prior example, the (unstratified) log-rank test had a potential bias because no adjustment was made for the probable increased risk associated with having four or more positive nodes. This was important because there was a large imbalance in the proportion of such patients by treatment. If this characteristic had been in balance across treatments, the stratified and unstratified approaches would have given much closer results.

Cox Proportional Hazard Model

The stratified log-rank test as defined previously provides a method to control for the association of nodal status with patient survival. However, for adjusting for additional explanatory variables, the most popular method is the Cox proportional hazard model.³⁰ The Cox model explores the relationship between survival experience and prognostic or explanatory variables. It estimates and tests the hazard ratio between different groups. In the Cox regression model, explanatory variables may be continuous (for instance, age, weight, or systolic blood pressure), categorical (for instance, gender, race, or Eastern Cooperative Oncology Group [ECOG] performance score), or interactions between explanatory variables. The Cox regression model does not make any parametric assumption about the survival probability distribution of each group; however, it does assume that the hazards of any two groups are proportional over time. If we use h_i(t) to denote the hazard of death at time t for the ith patient and h₀(t) to denote the hazard for the baseline patient (that is, a patient for whom all explanatory variables take the value 0), the proportional hazard model can be expressed as

h_i(t) = h₀(t) exp(β_1 x_1 + β_2 x_2 + … + β_p x_p)

where x_1, x_2, …, x_p are the values of the p explanatory variables for the ith patient, h₀(t) is called the baseline hazard function, and the ratio h_i(t)/h₀(t), called the hazard ratio, is constant over time. The βs are the regression coefficients. For a continuous explanatory variable x_k, β_k is the log of the hazard ratio associated with a one-unit increase in x_k when all other explanatory variables are held fixed. For a categorical explanatory variable x_l, assume, for instance, that x_l = 0 if the patient is randomized to the control group and x_l = 1 if the patient is randomized to the experimental group. Then β_l is the log of the hazard ratio for patients in the experimental group versus the control group when all other explanatory variables are the same.
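To make the role of β concrete, the following Python sketch fits a Cox model with a single explanatory variable to the six-patient example from the log-rank discussion. It maximizes the partial likelihood by Newton-Raphson and assumes there are no tied death times. Coding treatment A as x = 1 is an arbitrary choice for this illustration, and the function names are hypothetical. In practice one would use an established statistical package rather than hand-written code.

```python
import math

# (time, died, x) for the six example patients; x = 1 for treatment A, 0 for B.
data = [
    (10, True, 1), (40, False, 1), (55, True, 1),
    (15, True, 0), (50, True, 0), (60, True, 0),
]

def score_and_information(beta, rows):
    """First derivative and negative second derivative of the Cox partial
    log-likelihood for one covariate, assuming no tied death times."""
    score, info = 0.0, 0.0
    for t, died, x in rows:
        if not died:
            continue  # censored patients contribute only through risk sets
        # Each patient at risk at time t gets weight exp(beta * x_j).
        risk_set = [(math.exp(beta * xj), xj) for u, _, xj in rows if u >= t]
        total_w = sum(w for w, _ in risk_set)
        mean_x = sum(w * xj for w, xj in risk_set) / total_w
        mean_x2 = sum(w * xj * xj for w, xj in risk_set) / total_w
        score += x - mean_x
        info += mean_x2 - mean_x ** 2
    return score, info

def fit_cox(rows, tol=1e-10, max_iter=50):
    """Newton-Raphson search for the beta that maximizes the partial likelihood."""
    beta = 0.0
    for _ in range(max_iter):
        score, info = score_and_information(beta, rows)
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta

score0, info0 = score_and_information(0.0, data)
beta_hat = fit_cox(data)
print(f"score at beta=0: {score0:.4f}, information at beta=0: {info0:.4f}")
print(f"log hazard ratio: {beta_hat:.3f}, hazard ratio: {math.exp(beta_hat):.3f}")
```

At β = 0 the score equals the observed minus expected deaths on arm A (about 0.267) and the information equals the log-rank variance (0.9622), which shows how closely the two methods are related. The fitted hazard ratio is somewhat above 1 for treatment A relative to treatment B, but with six patients it serves only as an illustration.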

Because the Cox regression model assumes proportional hazards between different groups, it is necessary to check this assumption before drawing conclusions from the estimates. Most model-checking procedures are based on graphs or plots of model-fitting residuals. If the proportional hazard assumption does not fit the data even after data transformation, other models should be considered.

Current Topics in Phase III Clinical Trials

Surrogate Endpoint

A primary endpoint is the principal measure of a treatment’s efficacy; therefore it needs to be clinically relevant and sensitive to the effect of the treatment. The overall survival (OS) rate has historically been the primary outcome for most phase III trials because of its clear virtues: it is simple to measure, unambiguous, the least susceptible to investigator bias, and of unquestionable clinical relevance. Despite these many advantages, the role of OS as a primary endpoint is challenged in modern phase III clinical trials. Two primary challenges to the OS endpoint are that in many cases an extensive follow-up period is needed to obtain survival status and that in many diseases, multiple effective therapies are now available. When multiple therapies are available, OS is affected by all therapies given to a patient and, as such, is insensitive to the impact of changing a single line of therapy.

One of the alternatives is to use a surrogate endpoint instead of a clinical endpoint in a phase III trial. A surrogate endpoint is an endpoint obtained sooner, at lesser cost, or less invasively than the long-term clinical efficacy endpoint. When using a surrogate endpoint, one would like to make the same inference as if one had observed a true endpoint. In circumstances where the speed of the study is a major concern, where second-line therapies that affect the OS potentially obscure the assessment of first-line treatment benefit, or where cross-over becomes unavoidable, a surrogate endpoint might be considered in place of the long-term clinical endpoint.

In validating an endpoint as a legitimate surrogate endpoint, a meta-analysis is usually required because relationships presented in one trial may not be generalizable to another. In addition, heterogeneity between the trials included in the meta-analysis strengthens the robustness of results from individual trials. In conducting a meta-analysis, using individual patient data instead of summary data is highly recommended, and, ideally, both positive and negative trials are included. Examples of meta-analyses conducted to examine potential surrogate endpoints can be found in articles by Sargent and colleagues³¹ and Burzykowski and associates.³²

Biomarkers

A marker is a single trait, or a group of traits, that differentiates patients with respect to an outcome of interest. If a marker can be used to identify patients with differing risks of a clinical outcome (such as progression or death) in the absence of therapy or when receiving nontargeted standard treatment, the marker is usually called a prognostic marker. If a marker predicts differential efficacy of a specific therapy, the marker is called a predictive marker.

Validation of a prognostic marker is usually conducted retrospectively in patients treated with placebo or a standard treatment. Validation of a predictive marker could also be conducted retrospectively based on data from a randomized controlled trial.³³ However, a prospective randomized controlled trial would be ideal. Two types of clinical designs can be used to validate a predictive marker: a targeted/selection design or an unselected design. A targeted trial enrolls only patients who are most likely to respond to the experimental therapy based on their molecular expression levels. On the one hand, a targeted trial could substantially reduce the number of patients required, leaving more patients available for other trials. On the other hand, it could miss efficacy in other patients and miss the opportunity to test the association of the biologic endpoints with clinical outcomes.

Different types of unselected designs have been proposed and discussed. For instance, the marker-by-treatment interaction design (Fig. 12-9) stratifies patients according to their marker status and randomly assigns patients in each marker group to two different treatments. Hypothesis setting and sample size estimation could be different in each marker-defined subgroup. Either a formal test for marker-by-treatment interaction or a separate superiority test within each marker group could be conducted. Another popular unselected design is the marker-based strategy design (Fig. 12-10), in which the random treatment assignment could be based on the patient’s marker status or could be independent of it. Examples for each type of design can be found in the article by Sargent and colleagues.³⁴

Figure 12-9 Marker by treatment interaction to test a predictive factor question; same treatments in both prognostic groups.

Figure 12-10 A, Marker-based strategy design to test predictive factor question; no randomization in non–marker-based arm. B, Marker-based strategy design to test predictive factor question; randomization in both arms.

Adaptive Design

An adaptive design could be defined as a design “that allows adaptations to trial procedures (for instance, eligibility criteria, study dose, or treatment procedure, etc.) and/or statistical procedures (for instance, randomization, study design, study hypothesis, sample size, or analysis methods, etc.) of the trial after its initiation without undermining the validity and integrity of the trial.”³⁵ Many types of adaptation could be applied to an ongoing trial. For instance, a study could be resized based on an interim review of the outcomes in the control group. Assume, for example, that in the design of a study, the median progression-free survival (PFS) under the standard of care was assumed to be 12 months. However, on an interim review of accrued data from the control arm, the estimated median PFS was 15 months. Because events now occur more slowly than planned, more patients need to be accrued or patients need to be followed longer for the same hazard ratio to be detected. No statistical penalty is required for this review, because only data from the control arm have been analyzed.
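The effect of a longer-than-planned control-arm PFS on trial size can be illustrated with a rough Python sketch. Everything beyond the 12- and 15-month figures is an assumption made for illustration: PFS is taken to be exponentially distributed, the 12- and 15-month figures are read as medians, randomization is 1:1, every patient is followed for a fixed 24 months, the target hazard ratio is 0.75, and the required number of events comes from the Schoenfeld approximation with two-sided α = .05 and 90% power.

```python
import math
from statistics import NormalDist

def required_events(hazard_ratio, alpha=0.05, power=0.90):
    """Schoenfeld approximation for a 1:1 randomized trial, two-sided alpha."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return 4 * (z_alpha + z_beta) ** 2 / math.log(hazard_ratio) ** 2

def event_probability(median, follow_up):
    """Chance of an event within `follow_up` months under exponential PFS."""
    return 1 - math.exp(-math.log(2) * follow_up / median)

def required_patients(control_median, hazard_ratio, follow_up):
    """Total patients needed for the expected events to reach the target,
    assuming every patient is followed for exactly `follow_up` months."""
    # Under exponential PFS, a hazard ratio HR scales the median by 1/HR.
    experimental_median = control_median / hazard_ratio
    avg_event_prob = 0.5 * (
        event_probability(control_median, follow_up)
        + event_probability(experimental_median, follow_up)
    )
    return math.ceil(required_events(hazard_ratio) / avg_event_prob)

HR = 0.75        # hypothetical target hazard ratio
FOLLOW_UP = 24   # hypothetical fixed follow-up, in months

events = required_events(HR)
n_planned = required_patients(12, HR, FOLLOW_UP)  # design assumption
n_revised = required_patients(15, HR, FOLLOW_UP)  # interim control-arm estimate
print(f"events needed: {math.ceil(events)}")
print(f"patients if control median PFS is 12 months: {n_planned}")
print(f"patients if control median PFS is 15 months: {n_revised}")
```

The number of events needed depends only on the hazard ratio, α, and power, so it does not change. What changes is the proportion of patients expected to have an event during follow-up, which is why more patients or longer follow-up is needed.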

For another example of an adaptive design, assume that a study started with n experimental arms and one control arm. Within this study, based on a prespecified plan, all experimental arms are to be compared with the control arm at the interim analysis, and only the most promising experimental arm will continue alongside the control arm when patient accrual is resumed. In other words, the study will have started with n + 1 arms and ended with 2 arms, and n − 1 arms will have been dropped after the interim analysis. Because no conclusions regarding superiority were allowed to be made at the interim analysis, no statistical penalty for the interim review is needed. Under other circumstances, if an adaptation relates to (1) altering the randomization ratio between the arms, (2) resizing the trial based on an interim comparison between the experimental and control arms, or (3) changing the primary endpoint, then a penalty for looking at the data before the end of the study is definitely needed. In planning an adaptive design, possible adaptations must be specified in the protocol before accrual takes place. In applying an adaptive design, an efficient mechanism for processing and analyzing data is critical.