Severity of Illness Scoring Systems

Published on 07/03/2015 by admin

Filed under Critical Care Medicine

Last modified 22/04/2025


Rui Moreno


When you can measure a phenomenon about which you are talking and express it in numbers, you know something about it. But, when you can not express it in numbers, your knowledge is vague and unsatisfactory. It may be the beginning of knowledge, but you progressed very little toward the state of science.

LORD KELVIN (1824-1907)

The goal of intensive care is to provide the highest quality of treatment in order to achieve the best outcomes for critically ill patients. Although intensive care medicine has developed rapidly over the years, there is still little scientific evidence as to which treatments and practices are really effective in the real world. Moreover, intensive care now faces major economic challenges, which increase the need to provide evidence not only on the effectiveness but also on the efficiency of practices. Intensive care is, however, a complex process, which is carried out on very heterogeneous populations and is influenced by several variables, including cultural background and the differing structure and organization of health care systems. It is, therefore, extremely difficult to reduce the quality of intensive care to something measurable, to quantify it, and then to compare it among different institutions. In addition, in recent years patient safety has become a mandatory dimension in evaluating the quality of care, leading to major changes in the way benchmarking should be evaluated and reported.¹

Although quality encompasses a variety of dimensions, the main interest to date is focused on effectiveness and efficiency: It is clear that other issues are less relevant if the care being provided is either ineffective or harmful. Therefore, the priority must be to evaluate effectiveness. The instruments available to measure effectiveness in intensive care derive from the science of outcome research. The starting point for this science was the high degree of variability in medical processes, which was found during the first part of the twentieth century, when epidemiologic research was developing. The variation in medical practices led to the search for the “optimal” therapy for each syndrome or disease through the repeated performance of randomized controlled trials (RCTs). However, the undertaking of RCTs in intensive care is fraught with ethical and other difficulties. For this reason, observational studies to evaluate the effects of intensive care treatment are still frequently employed and sometimes more informative than prior RCTs.² Outcome research provides the methods necessary to compare different patients or groups of patients, especially different institutions. Risk adjustment (also called case-mix adjustment) is the method of choice to standardize the severity of illness of the individual or groups of patients. The purpose of risk adjustment is to take into account all of the characteristics of patients known to affect their outcome, in order to understand the differences due to the treatment received and the conditions (timing, setting, standardization) in which that treatment has been delivered. Conceptually, the quantification of individual patients should be made by the use of severity scores, and the evaluation of groups of patients is done by summing up the probabilities of death given by the model for each individual patient and its comparison with actual fatality.

This chapter describes the methods and systems available for assessing and comparing severity of illness and outcome in critically ill patients. It starts with a brief historical outline of the development of scoring systems, then explains how such systems are designed and constructed. Next, it describes the available systems with their applications and limitations. Finally, it focuses on the potential applications of these systems at both the patient level and the intensive care unit (ICU) level.

Historical Perspective

Scoring systems have been broadly used in medicine for several decades. In 1953 Virginia Apgar ³ published a very simple scoring tool, the first general severity score designed to be applicable to a general population of newborn children. It was composed of five variables, easily evaluated at the patient’s bedside, that reflect cardiopulmonary and central nervous system function. Its simplicity and accuracy have never been improved on, and any child born in a hospital today receives an Apgar score at 1 and 5 minutes after birth. Nearly 50 years ago Dr. Apgar commented on the state of research in neonatal resuscitation: “Seldom have there been such imaginative ideas, such enthusiasms and dislikes, and such unscientific observations and study about one clinical picture.” She suggested that part of the solution to this problem would be a “simple, clear classification or grading of newborn infants which can be used as the basis for discussion and comparison of the results of obstetric practices, types of maternal pain relief and the effects of resuscitation.” Thirty years later, physicians working in ICUs found themselves using the same tools and applying them in the same way.

Efforts to improve risk assessment during the 1960s and 1970s were directed at improving our ability to quickly select those patients most likely to benefit from promising new treatments. For example, Child and Turcotte⁴ created a score to measure the severity of liver disease and estimate mortality risk for patients undergoing shunting. In 1967, Killip and Kimball classified the severity of acute myocardial infarction by the presence and severity of signs of congestive heart failure.⁵ In 1974 Teasdale and Jennett introduced the Glasgow Coma Scale (GCS) for reproducibly evaluating the severity of coma.⁶ The usefulness of the GCS score has been confirmed by the consistent relationship between poor outcome and a reduced score among patients with a variety of diseases. The GCS score is reliable and easy to perform, but problems with the timing of evaluation, the use of sedation, inter- and intraobserver variability, and its use in prognostication have caused strong controversies.⁷ Nevertheless, the GCS remains the most widely used neurologic measure for risk assessment.

The 1980s brought an explosive increase in the use of new technology and therapies in critical care. The rapidity of change and the large and growing investment in these high-cost services prompted demands for better evidence for the indications and benefit of critical care. For this reason several researchers developed systems to evaluate and compare the severity of illness and outcome of critically ill patients. The first of these systems was the Acute Physiology And Chronic Health Evaluation (APACHE) system, published by Knaus and associates in 1981,⁸ followed soon after by Le Gall and colleagues with the Simplified Acute Physiology Score (SAPS).⁹ The APACHE system was later updated to APACHE II,¹⁰ and a new system, the Mortality Probability Model (MPM), joined the group.¹¹ By the beginning of the 1990s different systems were available to describe and classify ICU populations, to compare severity of illness, and to predict mortality risk in their patients. These systems performed well, but there were concerns about errors in prediction caused by differences in patient selection and lead-time bias. There were also concerns about the size and representativeness of the databases used to develop the three systems and about poor calibration within patient subgroups and across geographic locations. These concerns, in part, led to the development of their subsequent versions such as APACHE III,¹² the SAPS II,¹³ and the MPM II,¹⁴ all published between 1991 and 1993.

During the mid-1990s, the need to quantify not only mortality but also morbidity risks in specific groups of patients became evident and led to the development of the so-called organ dysfunction scores, such as the Multiple Organ Dysfunction Score (MODS),¹⁵ the Logistic Organ Dysfunction System (LODS) score,¹⁶ and the Sequential Organ Failure Assessment (SOFA) score.¹⁷

Severity of Illness Assessment and Outcome Prediction

The evaluation of severity of illness in the critically ill patient is made through the use of severity scores and outcome prediction models. This distinction is crucial to understanding the differences, limitations of use, and aims of each one.

• Severity scores are instruments that aim at stratifying patients based on their severity, assigning to each patient an increasing score as the severity of the illness increases.

• Outcome prediction models, apart from their ability to stratify patients according to their severity, aim at predicting a certain outcome (usually the vital status at hospital discharge) based on a given set of prognostic variables and a certain modeling equation.
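As a minimal sketch of what a "modeling equation" looks like for the logistic regression models discussed in this chapter, the prognostic variables are combined into a logit, which is then converted to a probability of death. The symbols below (k predictors x_i with coefficients β_i and intercept β_0) are illustrative notation and are not taken from any specific published model.

```latex
\operatorname{logit} = \beta_0 + \sum_{i=1}^{k} \beta_i x_i,
\qquad
P(\text{death}) = \frac{e^{\operatorname{logit}}}{1 + e^{\operatorname{logit}}}
```

A severity score alone stops before this step: it ranks patients but does not supply the coefficients needed to turn the ranking into a probability.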

The development of this kind of system, applicable to heterogeneous groups of critically ill patients, started in the 1980s (Table 73.1). The first general severity of illness score applicable to most critically ill patients was the APACHE score.⁸ Developed in the George Washington University Medical Center in 1981 by William Knaus and coworkers, the APACHE system was created to evaluate, in an accurate and reproducible form, the severity of disease in this population.^18–20 Two years later, Jean-Roger Le Gall and coworkers published a simplified version of this model, the SAPS.²¹ This model soon became very popular in Europe, especially in France. Another simplification of the original APACHE system, the APACHE II, was published in 1985 by the authors of the original model.¹⁰ This system introduced the prediction of mortality risk, which requires the selection of a major reason for ICU admission from a list comprising 50 operative and nonoperative diagnoses. The MPM,²² developed by Stanley Lemeshow, provided additional contributions for the prediction of prognosis, using logistic regression techniques. Further developments in this field include the third version of the APACHE system (APACHE III)¹² and the second versions of the SAPS (SAPS II)¹³ and MPM (MPM II).¹⁴ All of them use multiple logistic regression to select and weight the variables and are able to compute the probability of hospital mortality risk for groups of critically ill patients. It has been demonstrated that they perform better than their older counterparts,^23,24 and they represented the state of the art in this field by the end of the last century.

Table 73.1

General Severity Scores and Outcome Prediction Models

APACHE, Acute Physiology and Chronic Health Evaluation; ICUs, intensive care units; MPM, Mortality Probability Model; SAPS, Simplified Acute Physiology Score.

^*These models are based on previous versions, developed by the same investigators (Lemeshow et al 11,181; see online list of references for this chapter).

^†The numbers presented are those for the admission component of the model (MPM II₀). MPM II₂₄ was developed using data for 15,925 patients from the same ICUs.

^‡MPM II₂₄ uses only 13 variables.

From the early 1990s onward, the performance of these instruments slowly deteriorated as their calibration was progressively lost. Differences in the baseline characteristics of the admitted patients, in the circumstances of the ICU admission, and in the availability of general and specific therapeutic measures introduced an increasing gap between actual mortality rate and predicted mortality risk.²⁵ Overall, in the last years of the century, there was an increase in the mean age of the admitted patients, a larger number of chronically sick patients and immunosuppressed patients, and an increase in the number of ICU admissions due to sepsis.^26,27 Although most of the models kept an acceptable discrimination, their calibration (or prognostic accuracy) deteriorated to such a point that major changes were needed.

The use of these instruments outside their sampling space also led to some misapplication, especially for risk adjustment in clinical trials.^28–30

In the early 2000s, several attempts were made to improve the old models. In addition, a new generation of general outcome prediction models was built that included models such as the MPM III developed in the IMPACT database in the United States,³¹ new models based on computerized analysis by hierarchical regression developed by some of the authors of the APACHE systems,³² the APACHE IV,³³ and the SAPS 3 admission model, developed by hierarchical regression in a worldwide database.^34,35 Models based on other statistical techniques such as artificial neural networks and genetic algorithms have been proposed but, outside academic use, they were never widely adopted.^36,37 These approaches have been reviewed more than once,³⁸ and will be summarized later.

Recalibrating and Expanding Existing Models

All the existing general outcome prediction models used logistic regression equations to estimate the probabilities of a given outcome in a patient with a certain set of predictive variables. Consequently, the first approach to improve the calibration of a model when the original model is not able to adequately describe the population is to customize the model.³⁹ Several methods and suggestions have been proposed for this exercise,⁴⁰ based usually on one of two strategies:

• First-level customization, or the customization of the logit: a new equation relating the score to the probability is developed, such as the ones proposed by Le Gall or Apolone.^41,42

• Second-level customization, or the customization of the coefficients of the variables in the model, as described for the MPM II₀ model.³⁹ This can be done either by keeping the relative weights of the variables unchanged or by changing these weights as well. The latter technique marks the limit of second-level customization: beyond this point, the researcher is developing a new model rather than customizing an existing one. Usually the researcher customizing an existing model assumes that the relative weight of the variables in the model is constant.

Both of these methods have been used in the past with partial success in increasing the prognostic ability of the models.^39,43 However, both fail when the problem lies in the discrimination of the score or in its poor performance in subgroups of patients (poor uniformity of fit).⁴⁴ This failure can be explained by the lack of additional variables that are more predictive in the specific context. The addition of new variables to an existing model has been done before^45,46 and can be an appropriate approach in some cases. However, it can lead to very complex models, requires the collection of special data, and is more expensive and time-consuming. The best tradeoff between the burden of data collection and accuracy should be tailored case by case. It should be noted that first-level customization is nothing more than a mathematical translation of the original logit in order to get a different probability of mortality risk; its aim is to improve the calibration of a model, not its discrimination. It should therefore not be considered when an improvement in discrimination is the goal.
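The idea of first-level customization can be sketched as follows, under the assumption that "a new equation relating the score to the probability" is read as refitting only an intercept and a slope on the original logit. The function names, the Newton-Raphson fitting loop, and the ten-patient data set are hypothetical and serve illustration only; they are not the published method of any of the cited groups.

```python
import math

def sigmoid(z):
    """Convert a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def first_level_customization(original_logit, died, iterations=25):
    """Fit new_logit = a + b * original_logit by maximum likelihood.

    Uses Newton-Raphson on the two-parameter logistic likelihood,
    starting from the uncustomized model (a = 0, b = 1).
    """
    a, b = 0.0, 1.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(original_logit, died):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient with respect to a
            g1 += (y - p) * x      # gradient with respect to b
            h00 += w               # entries of the information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            raise ValueError("information matrix is singular")
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Hypothetical sample in which the original model overestimates risk.
original_logit = [-2, -2, -1, -1, 0, 0, 1, 1, 2, 2]
died = [0, 0, 0, 0, 0, 1, 0, 1, 0, 1]

a, b = first_level_customization(original_logit, died)
original_expected = sum(sigmoid(x) for x in original_logit)
customized_expected = sum(sigmoid(a + b * x) for x in original_logit)
```

Because the transformation is monotonic whenever the fitted slope is positive, the ranking of patients is unchanged: the expected number of deaths moves toward the observed number, while discrimination stays exactly where it was.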

A third level of customization can be imagined, through the introduction in the model of new prognostic variables and the recomputation of the weights and coefficients for all variables, but, as mentioned before, this technique crosses the border between customizing a model and building a new predictive model. In past years, all these approaches have been tried.^47–49

Building New Models

Three general outcome prediction models have been developed and finally published: the SAPS 3 admission model in 2005, the APACHE IV in 2006, and the MPM III in 2007.⁵⁰

The SAPS 3 Admission Model

Developed by Rui Moreno, Philipp Metnitz, Eduardo Almeida, and Jean-Roger Le Gall on behalf of the SAPS 3 Outcomes Research Group, the SAPS 3 model was published in 2005.^34,35 The study used a total of 19,577 patients consecutively admitted to 307 ICUs all over the world from 14 October to 15 December 2002. This high-quality multinational database was built to reflect the heterogeneity of current ICU case mix and typology all over the world, trying not to focus only on Western Europe and the United States. Consequently, the SAPS 3 database better reflects important differences in patients’ and health care systems’ baseline characteristics that are known to affect outcome. These include, for example, different genetic makeups, different styles of living, and a heterogeneous distribution of major diseases within different regions, as well as issues such as access to the health care system in general and to intensive care in particular, or differences in availability and use of major diagnostic and therapeutic measures within the ICUs. Although the integration of ICUs outside Europe and the United States surely increased its representativeness, it must be acknowledged that the extent to which the SAPS 3 database reflects case mix on ICUs worldwide cannot be determined yet.

Based on data collected at ICU admission (±1 hour), the authors developed regression coefficients by using multilevel logistic regression to estimate the probability of hospital death. The final model, which comprises 20 variables, exhibited good discrimination without major differences across patient typologies; calibration was also satisfactory. Customized equations for major areas of the world were computed and demonstrate a good overall goodness of fit. It is interesting that the determinants of hospital mortality probability changed remarkably from the early 1990s,¹² with chronic health status and circumstances of ICU admission now being responsible for almost three fourths of the prognostic power of the model.

To allow all interested parties to see the calculation of SAPS 3, completely free of charge, extensive electronic supplementary material (http://dx.doi.org/10.1007/s00134-005-2762-6 and http://dx.doi.org/10.1007/s00134-005-2763-5, both accessed 30/06/2013) was published together with the study reports, including the complete and detailed description of all variables as well as additional information about SAPS 3 performance. Moreover, the SORG provides at the project website (www.saps3.org) several additional resources: First, a Microsoft Excel sheet is available and can be used to calculate a SAPS 3 “on the fly.” Second, a small Microsoft Access database allows for the calculation, storage, and export of SAPS 3 data elements.

However, like all outcome prediction models, SAPS 3 is slowly losing calibration, as recently demonstrated by several groups.^51–55 It seems to retain a good level of reliability and discrimination, but a recalibration process, which can be relatively easy to perform, will be needed in the coming years.

The APACHE IV Model

In early 2006, Jack E. Zimmerman, one of the authors of the original APACHE models, published the APACHE IV model in collaboration with colleagues from Cerner Corporation (Vienna, VA).³³ The study was based on a database of 110,558 consecutive admissions during 2002 and 2003 to 104 ICUs in 45 U.S. hospitals participating in the APACHE III database. The APACHE IV model uses the worst values during the first 24 hours in the ICU and a multivariate logistic regression procedure to estimate the probability of hospital death.

Predictor variables were similar to those in APACHE III, but new variables were added and different statistical modeling has been used. The accuracy of APACHE IV predictions was analyzed in the overall database and in major patient subgroups. APACHE IV had good discrimination and calibration. For 90% of 116 ICU admission diagnoses, the ratio of observed to predicted mortality was not significantly different from 1.0. Predictions were compared with the APACHE III versions developed 7 and 14 years previously: there was little change in discrimination, but aggregate mortality risk was systematically overestimated as model age increased. When examined across disease, predictive accuracy was maintained for some diagnoses but for others seemed to reflect changes in practice or therapy. A predictive model for risk-adjusted ICU length of stay was also published by the same group.⁵⁶ More information about the model and the possibility to compute the probability of death for individual patients is available at the website of Cerner Corporation (www.criticaloutcomes.cerner.com).

The MPM₀ III Model

The MPM₀ III was published by Tom Higgins and associates in 2007.⁵⁰ It was developed using data from ICUs in the United States participating in Project IMPACT, but almost no published data are available to evaluate its behavior outside the development cohort. As with the previous MPM models, the MPM₀ III does not allow the computation of a score but directly estimates the probability of death in the hospital.

Developing Predictive Models

Selecting the Target Population

Although named “general,” most of the existing models are not applicable to all ICU patients. Patients with burns, admitted with coronary ischemia (or to rule out myocardial infarction), who were young (less than 16 or 18 years of age), in the postoperative phase of cardiac surgery, or with a very short length of ICU stay were explicitly excluded from the development of the majority of systems. This limitation is especially important when we evaluate specialized ICUs, with a particular case mix, but can also be important in general ICUs. In many cases, the application of exclusion criteria can imply the analysis of just a small proportion of the admitted patients, resulting in significant errors.

Outcome Selection

Outcome selection identifies the end point of interest, and at a minimum, the selected outcome should be as follows:

• A relatively common event

• Easily defined, recognized, and measured

• Clinically relevant

• Independent of therapeutic decisions

Fatality meets all the preceding criteria; however, there are confounding factors to be considered when using death as an outcome. The location of the patient at the time of death can considerably reduce hospital mortality rates. For example, in a study of 116,340 ICU patients, a significant decline in the ratio of observed and predicted death was attributed to a decrease in hospital mortality rate as a result of earlier discharge of patients with a high severity of illness to skilled nursing facilities.⁵⁷ In the APACHE III study, a significant regional difference in mortality rate was entirely secondary to variations in hospital length of stay.⁵⁸ Improvements in therapy, such as the use of thrombolysis in myocardial infarction or steroids in Pneumocystis pneumonia and the acquired immunodeficiency syndrome⁵⁹ can dramatically reduce hospital mortality rate. Increases in the use of advance directives, do-not-resuscitate orders, and limitation or withdrawal of therapy all increase hospital mortality rates.

Variations in any of the previous factors will lead to differences between observed and predicted deaths that have little to do with case mix or effectiveness of therapy. Predictive instruments directed at long-term mortality predictions provide accurate prognostic estimates within the first month after hospital discharge, but their accuracy falls off considerably thereafter, because other factors, such as HIV infection or malignancy, dominate the long-term survival pattern.⁴⁵ Despite these caveats, fatality is the most useful outcome for designing general severity of illness scores and predictive instruments.

Other outcome measures represent important issues in improving ICU care. These issues include the following:

• Morbidity and complication rates

• Organ dysfunction

• Resource use

• Duration of mechanical ventilation, use of pulmonary artery catheters

• Quality of life after ICU/hospital discharge

• Length of stay in the ICU

Case-mix adjustment is indispensable for studying morbidity, resource utilization, and length of stay. Although these outcomes are difficult to define and are sensitive to local conditions, they are related to the cost of care and have therefore been useful in measuring and comparing ICU efficiency.

All current outcome prediction models aim at predicting vital status at hospital discharge. It is thus incorrect to use them to predict other outcomes, such as the vital status at ICU discharge. This approach will result in a gross underestimation of mortality rates.⁶⁰

Data Collection

The next step in the development of a general outcome prediction model is the evaluation, selection, and registration of the predictive variables. At this stage major attention should be given to the variable definitions as well as to the time frames for data collection.^61–63 Very frequently models have been applied incorrectly, the most common errors being related to the following:

• The definitions of the variables

• The time frames for the evaluation and registration of the data

• The frequency of measurement and registration of the variables

• The applied exclusion criteria

• Data handling before analysis

It should be noted that all existing models have been calibrated for nonautomated (i.e., manual) data collection. The use of electronic patient data management systems (with high sampling rates) has been demonstrated to have a significant impact on the results:^64,65 the higher the sampling rate, the more outliers will be found and thus the higher the scores will be. The intra- and interobserver reliability should always be evaluated and reported, together with the frequency of missing values.
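The effect of the sampling rate can be illustrated with a small sketch. The heart-rate series, the 2-minute spike, and the choice of the maximum as the "worst" value are all hypothetical assumptions made for illustration; real scores define the worst value per variable.

```python
def worst_value(samples):
    """Highest heart rate in the window (treated as 'worst' in this sketch)."""
    return max(samples)

# Hypothetical minute-by-minute heart rate over 3 hours with a brief spike.
per_minute = [80] * 180
per_minute[95] = 150
per_minute[96] = 145

# Hourly manual charting only sees one reading per hour.
hourly = per_minute[::60]

manual_worst = worst_value(hourly)         # spike is missed
automated_worst = worst_value(per_minute)  # spike is captured
```

The same patient therefore contributes a normal value under hourly charting and an extreme value under automated collection, which is how a model calibrated on manual data can end up producing higher scores when fed high-frequency data.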

Selection of Variables

The number of variables used in severity and prognostic systems is influenced by the data collection burden, statistical considerations, measurement reliability, and frequency. Variable selection reflects a balance between adding variables with a diminishing impact on outcome and limiting variables to the strongest predictors to ease data collection and minimize processing errors. Variables should have these characteristics:

• Readily available and clinically relevant

• Plausible relationship to outcome and easily defined and measured

• Independent of treatment processes

• Verifiable by checks of data accuracy

Initial selection of variables can be either deductive (subjective), using terms that are known or suspected to influence outcome, or inductive (objective) using any deviation from homeostasis or normal health status. The deductive approach employs a group of experts, who supply a consensus regarding the measurements and events most strongly associated with the outcome. This approach is faster and requires less computational work; APACHE I and SAPS I both started this way. A purely inductive strategy, used by MPM, begins with the database, and tests candidate variables with a plausible relationship to outcome. In the SAPS 3 model several complementary methods have been used, such as logistic regression on mutually exclusive categories built using smoothed curves based on LOWESS (locally weighted scatterplot smoothing),⁶⁶ and multiple additive regression trees (MART).⁶⁷

As a practical matter, neither technique is used exclusively; all systems now use a combination of these techniques. Variables that have been used in severity and prognostic systems include the following:

• Age

• Chronic disease status or comorbid conditions

• Circumstances of ICU admission

• Physiologic measures

• Reasons for ICU admission and admitting diagnoses

• Cardiopulmonary resuscitation (CPR), mechanical ventilation prior to ICU admission

• Location and length of stay before admission

• Emergency surgery and operative status

Predictor variables should be easily defined and reliably measured to ensure uniform data collection and minimize scoring variations. For statistical purposes, variables are considered dichotomous (e.g., surgery or not), categorical (e.g., disease classification or patient location before admission), or continuous (e.g., blood pressure or heart rate). With very large sample sizes, some continuous variables may be rendered dichotomous or categorical if it is discovered that there are strong and biologically sound threshold values beyond which their numerical value has no additional significance.

Weights for ICU admission diagnosis or reason for ICU admission (e.g., asthma vs. acute respiratory distress syndrome) significantly augment prognostic accuracy because a similar extent of physiologic derangement reflects substantial variations in mortality risk for different diseases. Interestingly, the circumstance of ICU admission such as the planned or unplanned character of the admission has been demonstrated to be very important. Systems that include weights for admitting diagnosis must have sufficient numbers of patients in each disease category to perform statistical analyses. Predictive instruments that ignore admitting diagnosis reduce the data collection burden, but perform poorly in ICUs with a case mix that differs significantly from the development database.

Location and length of stay before ICU admission account, at least partly, for lead-time bias, which has an important impact on outcome. For example, a patient treated for 2 days and then admitted to the ICU is at greater risk of death than a patient with the same diagnosis and severity of illness admitted from the emergency department.

The accuracy of any scoring system depends on the quality of the database from which it was developed. Even with well-defined variables, significant interobserver variability is reported.^68,69

In calculating the scores, several practical issues should be discussed.^70,71 First, exactly which value for any parameter should be considered? For many of the simpler variables, several measurements will be taken during any 24-hour period. Should the lowest, highest, or an average be taken as the representative value of that day? There is a general consensus that, for the purposes of the score, the worst value in any 24-hour period should be considered. Second, what about missing values? Should the last known value repeatedly be considered as representative until a new value is obtained, or should the mean value between two successive values be taken? Both options make assumptions that may influence the reliability of the score. The first option assumes that we have no knowledge of the evolution of values with time and the second assumes that changes are usually fairly predictable and regular. However, we prefer this second option because values may be missing for several days and repeating the last known value may involve considerable errors in calculation. In addition, changes in most of the variables measured (platelet count, bilirubin, urea) are, in fact, usually fairly regular, moving up or down in a systematic manner.
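The two options for missing values can be compared in a short sketch. The function names and the platelet series are hypothetical, and the second function implements a literal reading of "the mean value between two successive values"; a scoring protocol may define the rule differently.

```python
def carry_forward(values):
    """Repeat the last known value until a new one is obtained."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None before the first known value
    return filled

def mean_of_neighbours(values):
    """Fill each gap with the mean of the known values on either side."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        mean = (values[left] + values[right]) / 2
        for i in range(left + 1, right):
            filled[i] = mean
    return filled  # gaps before the first or after the last value stay None

# Hypothetical daily platelet counts (x10^9/L) with two missing days.
platelets = [200, None, None, 80]

option_1 = carry_forward(platelets)
option_2 = mean_of_neighbours(platelets)
```

In this example the first option reports a normal platelet count for two days during which the count was in fact falling, which is the kind of error the text attributes to repeating the last known value.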

Validation of the Model

All predictive models developed for outcome prediction need, of course, to be validated, that is, to demonstrate their ability to predict the outcome under evaluation. Three aspects should be evaluated in this context: the first aspect is the calibration, or the degree of correspondence between the predictions of the model and observed results. The second is discrimination, or the capability of the model to distinguish observations with a positive outcome from those with a negative outcome. The third is the uniformity of fit of the model, which is related to the performance over various subgroups of patients.
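Discrimination as defined above can be quantified in several ways. One common choice, which is an assumption of this sketch and is not named in the text, is the area under the receiver operating characteristic curve: the probability that a randomly chosen non-survivor received a higher predicted risk than a randomly chosen survivor. The function name and the five-patient sample are hypothetical.

```python
def auc(died, predicted_risk):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    deaths = [p for d, p in zip(died, predicted_risk) if d == 1]
    survivors = [p for d, p in zip(died, predicted_risk) if d == 0]
    if not deaths or not survivors:
        raise ValueError("need at least one death and one survivor")
    wins = 0.0
    for p_death in deaths:
        for p_surv in survivors:
            if p_death > p_surv:
                wins += 1.0
            elif p_death == p_surv:
                wins += 0.5
    return wins / (len(deaths) * len(survivors))

# Hypothetical sample: 1 = died in hospital, 0 = survived.
died = [0, 0, 1, 0, 1]
risk = [0.10, 0.20, 0.60, 0.70, 0.80]
area = auc(died, risk)
```

A value of 0.5 would mean the model ranks patients no better than chance, and 1.0 would mean every non-survivor was ranked above every survivor. Here one survivor (risk 0.70) outranks one non-survivor (risk 0.60), so five of the six pairs are ordered correctly.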

The evaluation of the calibration and discrimination has been named goodness of fit. The evaluation of the performance of the model in major subgroups has been named uniformity of fit.

Goodness of Fit

The evaluation of the goodness of fit comprises the evaluation of calibration and discrimination in the analyzed population. Calibration evaluates the degree of correspondence between the estimated probabilities of fatality and the actual fatality in the analyzed sample. Several methods are usually proposed: observed/estimated (O/E) mortality ratios, Flora’s Z score,⁷² Hosmer-Lemeshow goodness of fit tests,^73–75 and Cox calibration regression and calibration curves.

O/E mortality ratios are computed by dividing the observed fatality (in other words, the number of deaths) by the predicted fatality (in other words, the sum of the probabilities of death of all patients in the sample). In a perfectly calibrated model this value should be 1.
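The computation just described can be sketched in a few lines of Python. The function name and the five-patient sample are hypothetical and serve only to show the arithmetic.

```python
def oe_ratio(died, predicted):
    """Observed/estimated mortality ratio.

    died: 1 if the patient died, 0 otherwise.
    predicted: model probability of death for the same patient.
    """
    observed = sum(died)        # number of deaths
    expected = sum(predicted)   # sum of the individual probabilities of death
    return observed / expected


# Hypothetical sample: 2 observed deaths, predicted probabilities summing to 2.
died = [1, 0, 0, 1, 0]
predicted = [0.8, 0.1, 0.3, 0.6, 0.2]
print(round(oe_ratio(died, predicted), 3))
```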

Hosmer-Lemeshow goodness of fit tests are two chi-square statistics proposed for the formal evaluation of the calibration of predictive models.^73–75 In the Ĥ test, patients are classified into 10 groups according to their probabilities of death. Then, a chi-square statistic is used to compare the observed number of deaths and the observed number of survivors with the predicted number of deaths and the predicted number of survivors in each of the groups. The formula is (Equation 1):

$$\hat{H} = \sum_{l=1}^{g} \frac{(o_l - e_l)^2}{e_l\,(1 - \bar{\pi}_l)} \qquad (1)$$

with g being the number of groups (usually 10), o_l the number of events observed in group l, e_l the number of events expected in the same group, and π̄_l the mean estimated probability, always in group l. The resulting statistic is then compared with a chi-square table with 8 degrees of freedom (model development) or 10 degrees of freedom (model validation), in order to know if the observed differences can be explained exclusively by random fluctuation. The Hosmer-Lemeshow Ĉ test is similar, with the 10 groups containing an equal number of patients. Hosmer and Lemeshow demonstrated that the grouping method used in the Ĉ statistic behaves better when most of the probabilities are low.⁷³
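The following Python sketch illustrates the Ĉ variant of the statistic, in which patients are sorted by predicted probability and split into groups of (nearly) equal size. It is an illustration under stated assumptions, not a validated implementation: the function name is hypothetical, ties are not handled specially, and the comparison with the chi-square table is left to the reader.

```python
def hosmer_lemeshow_c(died, predicted, groups=10):
    """Hosmer-Lemeshow C statistic with equal-sized groups.

    For each group l the term (o_l - e_l)^2 / (e_l * (1 - mean_p_l)) is added,
    where o_l is the observed number of deaths, e_l the expected number (sum of
    the probabilities) and mean_p_l the mean predicted probability in the group.
    """
    pairs = sorted(zip(predicted, died))   # order patients by predicted risk
    n = len(pairs)
    statistic = 0.0
    for l in range(groups):
        chunk = pairs[l * n // groups:(l + 1) * n // groups]
        if not chunk:
            continue                       # fewer patients than groups
        o_l = sum(d for _, d in chunk)
        e_l = sum(p for p, _ in chunk)
        mean_p = e_l / len(chunk)
        # Undefined if every prediction in a group is exactly 0 or 1.
        statistic += (o_l - e_l) ** 2 / (e_l * (1 - mean_p))
    return statistic


# Hypothetical sample with two risk levels, evaluated with two groups.
predicted = [0.2] * 5 + [0.6] * 5
died = [1, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(round(hosmer_lemeshow_c(died, predicted, groups=2), 3))
```

The value returned would then be compared with the chi-square distribution using the degrees of freedom given in the text.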

These tests are nowadays considered to be mandatory for the evaluation of calibration,⁷⁶ although they are subject to criticism.^77,78 It should be stressed that the analyzed sample must be large enough to have the power to detect the lack of agreement between predicted and observed mortality rates.⁴⁰

The Hosmer-Lemeshow tests are very sensitive to the size of the sample. When the sample is small, the test is usually underpowered to detect poor fit; when the sample is very large, even minor, unimportant differences between the predicted and the observed fatality will result in a significant chi-square. For this reason, more and more investigators, especially those dealing with large databases, prefer to use the Cox calibration regression, in which the relation between the expected and the observed probabilities is assessed by logistic regression, with hospital death as the dependent variable and the natural logarithm of the odds of the probabilities given by the model as the independent variable. If the intercept is 0 and the slope of the model is 1, then the calibration is perfect.⁴⁹
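A Cox calibration regression of this kind can be sketched without any statistical library by fitting the two-parameter logistic model with Newton-Raphson iterations. The function name, the fixed number of iterations, and the sample are assumptions made for illustration; in practice an established statistical package would normally be used, and predicted probabilities of exactly 0 or 1 would need special handling.

```python
import math


def cox_calibration(died, predicted, iterations=25):
    """Fit logit(P(death)) = a + b * logit(predicted) by Newton-Raphson.

    Returns (a, b): intercept and slope. Perfect calibration gives a = 0, b = 1.
    """
    x = [math.log(p / (1 - p)) for p in predicted]   # log odds of each prediction
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, died):
            mu = 1 / (1 + math.exp(-(a + b * xi)))   # fitted probability
            w = mu * (1 - mu)
            g0 += yi - mu                            # gradient, intercept
            g1 += (yi - mu) * xi                     # gradient, slope
            h00 += w                                 # information matrix terms
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det             # Newton step
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Hypothetical sample in which observed fatality matches the predictions exactly.
predicted = [0.2] * 10 + [0.5] * 10 + [0.8] * 10
died = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2
intercept, slope = cox_calibration(died, predicted)
print(round(intercept, 3), round(slope, 3))
```

Because the observed proportions in this constructed sample equal the predicted probabilities, the fit returns an intercept close to 0 and a slope close to 1.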

This is one of the best methods to provide the user with a quantitative measure of calibration, but it does not fully indicate the deviations between observed and predicted mortality rates, in particular the direction, extent, and risk classes affected by these deviations. This last information is more or less provided by the traditional calibration plot, which, however, is not a real curve (single points are theoretically independent from each other), is not a formal statistical method, and consequently does not provide the uncertainty of the estimate. Another weak point of the most commonly used calibration assessment methods (i.e., the Hosmer-Lemeshow statistics combined with the traditional calibration plot) is that they average the risk of patients in each decile, thus not using the entire information carried by each patient. These shortcomings were recently addressed by a new statistical approach, the GiViTI calibration belt,⁵³ used by Poole and colleagues in the comparison of the calibration of SAPS II and SAPS 3.⁵³ However, it is also not perfect, because most of the time the distribution of risks is concentrated on the left (low-risk) side, and the test becomes less helpful and informative. Calibration curves are also used to describe the calibration of a predictive model. These graphics compare observed and predicted mortality risks. They can be misleading, because the number of patients usually decreases from left to right (when we move from low probabilities to high probabilities), and as a consequence, even small differences in high-severity groups appear visually more important than small differences in low-probability groups. It should be stressed that calibration curves are not a formal statistical test.

Discrimination evaluates the capability of the model to distinguish patients who die from patients who survive. This evaluation can be made using a nonparametric test such as Harrell’s C index, which is based on the rank order of the predictions.⁷⁹ This index measures the probability that, for any two randomly chosen patients of whom one died and one survived, the patient who died was assigned the greater predicted probability of the outcome of interest (death). It has been shown that this index is directly related to the area under the receiver operating characteristic (ROC) curve and that it can be obtained as the parameter of the Mann-Whitney-Wilcoxon statistic.⁸⁰ Additional computations can be used to compute the confidence interval of this measure.⁸¹
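The pairwise definition of the C index lends itself to a direct, if inefficient, computation. The sketch below compares every patient who died with every patient who survived; the function name and the five-patient sample are hypothetical, and counting ties as one half is a common convention assumed here.

```python
def c_index(died, predicted):
    """Proportion of (dead, survivor) pairs in which the dead patient had the
    higher predicted probability of death; ties count as one half."""
    dead = [p for p, d in zip(predicted, died) if d == 1]
    alive = [p for p, d in zip(predicted, died) if d == 0]
    concordant = 0.0
    for p_dead in dead:
        for p_alive in alive:
            if p_dead > p_alive:
                concordant += 1.0
            elif p_dead == p_alive:
                concordant += 0.5
    return concordant / (len(dead) * len(alive))


# Hypothetical sample: 2 deaths and 3 survivors, 5 of the 6 pairs are concordant.
died = [1, 1, 0, 0, 0]
predicted = [0.9, 0.4, 0.5, 0.2, 0.1]
print(round(c_index(died, predicted), 3))
```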

The concept of the area under the ROC curve is derived from psychophysical tests. In an ROC curve, a series of two by two contingency tables is built, with the cutoff varying from the smallest to the largest score value. For each table, the true-positive rate (sensitivity) and the false-positive rate (1 minus the specificity) are calculated. The final plot of all possible pairs of true-positive versus false-positive rates then gives the visual representation of the ROC curve.
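The construction can be sketched as follows: one contingency table is built per distinct cutoff, each table yields one (false-positive rate, true-positive rate) point, and the area is obtained by the trapezoidal rule. The function names and the sample are hypothetical; cutoffs are visited from the largest to the smallest value here so that the points run from (0, 0) to (1, 1).

```python
def roc_points(died, predicted):
    """Return (false-positive rate, true-positive rate) pairs, one per cutoff.

    Requires at least one death and one survivor in the sample.
    """
    positives = sum(died)
    negatives = len(died) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(predicted), reverse=True):
        # Two by two table for this cutoff: predicted death if p >= cutoff.
        tp = sum(1 for p, d in zip(predicted, died) if p >= cutoff and d == 1)
        fp = sum(1 for p, d in zip(predicted, died) if p >= cutoff and d == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under consecutive ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical sample of 2 deaths and 3 survivors.
died = [1, 1, 0, 0, 0]
predicted = [0.9, 0.4, 0.5, 0.2, 0.1]
points = roc_points(died, predicted)
print(points)
print(round(area_under_curve(points), 3))
```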

The interpretation of the area under the ROC curve is easy: a virtual model with perfect discrimination would have an area of 1.0, whereas a model with discrimination no better than chance has an area of 0.5. Discriminative ability is said to be satisfactory when the area under the ROC curve is greater than 0.70. General outcome prediction models usually have areas greater than 0.80. Several methods have been described to compare the areas under two (or more) ROC curves,^82–84 but they can be misleading if the shape of the curves is different.⁸⁵

Other measures based on classification tables have been utilized, describing sensitivity, specificity, positive and negative predictive values, and correct classification rates. However, because these calculations must use a fixed cutoff (usually 10%, 50%, or 90%), their value is limited.
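To make the dependence on the cutoff concrete, the sketch below builds a single classification table at a fixed cutoff and derives the measures listed above. The function name and the sample are hypothetical; the 50% cutoff is one of those mentioned in the text.

```python
def classification_table(died, predicted, cutoff=0.5):
    """Classification measures at one fixed cutoff (predicted death if p >= cutoff).

    Assumes every cell needed as a denominator is non-empty.
    """
    tp = fp = tn = fn = 0
    for d, p in zip(died, predicted):
        if p >= cutoff:
            if d == 1:
                tp += 1
            else:
                fp += 1
        else:
            if d == 1:
                fn += 1
            else:
                tn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive predictive value": tp / (tp + fp),
        "negative predictive value": tn / (tn + fn),
        "correct classification rate": (tp + tn) / len(died),
    }


# Hypothetical sample; changing the cutoff changes every measure.
died = [1, 1, 0, 0, 0]
predicted = [0.9, 0.4, 0.5, 0.2, 0.1]
for name, value in classification_table(died, predicted, cutoff=0.5).items():
    print(name, round(value, 3))
```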

The relative importance of calibration and discrimination depends on the intended use of the model. Some authors advise that for group comparison, calibration is especially important⁸⁶ and that for decisions involving individual patients, both parameters are important.⁸⁷

Uniformity of Fit

The evaluation of calibration and discrimination in the analyzed sample is now current practice. More complex is the identification of subgroups of patients in which the behavior of the model is nonoptimal. These subgroups can be viewed as influential observations in model building, and their contribution to the global error of the model can be very large.⁸⁸

The most important subgroups are related to the case-mix characteristics that may be related to the outcome of interest:

• The intrahospital location before ICU admission

• The surgical status

• The degree of physiologic reserve (age, comorbid conditions)

• The acute diagnosis (including the presence, site, and extension of infection at ICU admission)

Although some authors such as Rowan and Goldhill in the United Kingdom^89,90 and Apolone and Sicignano in Italy^42,91 have suggested that the behavior of a model can depend to a significant extent on the case mix of the sample, no consensus exists about the subpopulations that should mandatorily be analyzed.⁴⁴

Updating Severity Scores

Changes in the characteristics of the populations, changes in the therapy of major diseases, and the introduction of new diagnostic methods all make periodic updates of the models necessary. Moreover, the use of a model outside its development population may require its modification and adaptation.

Using a Severity of Illness Score

Calculating a Severity of Illness Score

Using the original score sheets (or well-developed and validated computer software), a score is assigned to each variable, depending on its deviation from normal values. The arithmetic sum of these variable scores (the sum score) represents the severity score for that patient, which is then used in the equation to predict hospital death. As described earlier, this approach was not chosen by any of the MPM systems, in which the variables are directly used to compute a probability of death in the hospital by a logistic regression equation.

Transforming the Score into a Probability of Death

The transformation of the severity score into a probability of death in the hospital uses a logistic regression equation. The dependent variable (hospital mortality) y is related to the set of independent (predictive) variables by the equation:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$

with b₀ being the intercept of the model, x₁ to x_k the predictive variables and b₁ to b_k the estimated regression coefficients. The probability of death is then given by:

$$\text{Probability of death} = \frac{e^{y}}{1 + e^{y}}$$

with the logit being y as described before. The logistic transformation included in this equation allows the S-shaped relationship between the two variables to become linear (on the logit scale). In the extremes of the score (very low or very high values) changes in the probability of death are small; for intermediate values, even small changes in the score are associated with very large changes in the probability of death. This ensures that outliers do not influence the prediction too much.
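The logistic transformation described above can be illustrated with a short Python sketch. The intercept of -4.0 and the single coefficient of 0.1 per score point are purely hypothetical values chosen to show the S-shaped behavior; they do not belong to any published model.

```python
import math


def logit(intercept, coefficients, values):
    """y = b0 + b1*x1 + ... + bk*xk."""
    return intercept + sum(b * x for b, x in zip(coefficients, values))


def probability_of_death(y):
    """Logistic transformation of the logit: e^y / (1 + e^y)."""
    return math.exp(y) / (1 + math.exp(y))


# Hypothetical one-variable model in which the predictor is the severity score.
B0, B1 = -4.0, 0.1
for score in (5, 10, 35, 40, 45, 70, 75):
    y = logit(B0, [B1], [score])
    print(score, round(probability_of_death(y), 3))
```

With these assumed coefficients, a 5-point change near the extremes (5 to 10, or 70 to 75) moves the probability only slightly, whereas the same change around a score of 40 moves it by more than ten percentage points.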

Application of a Severity of Illness Score: Evaluation of Patients

All existing models aim at predicting an outcome (vital status at hospital discharge) based on a given set of variables: they estimate the outcome of a patient with a certain clinical condition (defined by the registered variables), treated in a hypothetical reference ICU. Several issues, however, need to be taken into account in order to apply one of the previously described models in another population:

• Patient selection

• Evaluation and registration of the predictive variables

• Evaluation and registration of the outcome

• Computation of the severity score

• Transformation of the score into a probability of death

After validation, the utility and applicability of a model must be evaluated. The literature is full of models developed in large populations that failed when applied in other contexts.^42,43,89,92–96 Thus, this question can only be answered by validating the model in its final population. The potential applications of a model, and consequently its utility, are different for individual patients and for groups.⁹⁷

Evaluating Individual Patients

Some evidence suggests that statistical methods perform better than clinicians in predicting outcome,^98–105 or that they can help clinicians in the decision-making process.^106–108 This opinion is, however, controversial,^109–111 especially for decisions to withdraw or to withhold therapy.¹¹² Moreover, the application of different models to the same patient frequently results in very different predictions.¹¹³ Thus, application of these models to individual patients for decision making is not recommended.¹¹⁴

It should not be forgotten that such statistical models are of a probabilistic nature. A well-calibrated model, applied to an individual patient, may, for example, predict a hospital mortality rate of 46% for this individual; this, however, just means that for a group of 100 patients with a similar severity of illness, 46 patients are predicted to die; it makes no statement as to whether the individual patient is among the 46% who will eventually die or among the 54% who will eventually survive.

It should be noted that severity scores have been proposed for applications as diverse as determining the use of total parenteral nutrition¹¹⁵ or identifying futility in intensive care medicine.¹¹⁶ Some authors demonstrated that knowledge of predictive information will not have an adverse effect on the quality of care, helping at the same time to decrease the consumption of resources and to increase the availability of beds.¹¹⁷

One field in which the scientific community is in agreement is the stratification of patients for inclusion into clinical trials and for the comparison of the balance of randomization to different groups.¹¹⁸

Evaluating Groups of Patients

At a group level, general outcome prediction models have been proposed for two objectives: distribution of resources and performance evaluation. Several studies have described methods used to identify and to characterize patients with a low risk of death.^119–123 This type of patient, who requires only basic monitoring and general care, could possibly be transferred to other areas of the hospital.^108,124 One could, however, also argue that these patients have a low mortality risk only because they have been monitored and cared for in an ICU.¹²⁵ In addition, the use of current instruments is not recommended for triage in the emergency department,¹²⁶ and the use of early physiologic indicators outside the ICU is being questioned.¹²⁷

Moreover, patient costs in the ICU depend on the amount of nursing workload required (and utilized). Patient characteristics (diagnosis, degree of physiologic dysfunction) are thus not the only determinants: costs depend also on the practices and policies in a given ICU. Focusing on the effective use of nursing workload¹²⁸ or the dynamic evolution of the patient^129,130 thus seems a more promising strategy than approaches based exclusively on the condition of the patients during the first hours in the ICU or on the O/E length of stay in the ICU.^58,131,132

On the other hand, general outcome prediction models have been proposed to identify patients who require more resources.¹³³ Unfortunately, these patients can only rarely be identified at ICU admission, because their degree of physiologic dysfunction during the first 24 hours in the ICU usually tends to be moderate, although very variable.^134–136 And even if some day these patients might be well identified, the question of what to do with this information remains.

Another important area in which this type of model has been used is the evaluation of ICU performance. Several investigators proposed the use of standardized mortality ratios (SMR) for performance evaluation, assuming that current models can take into account the main determinants of mortality risk.¹³⁷ The SMR is computed by dividing the observed number of deaths by the predicted number of deaths (the sum of the individual probabilities of death of all the patients in the sample). Additional computations can be made to estimate the confidence interval of this ratio.¹³⁸

The interpretation of the SMR is easy: a ratio lower than 1 implies a performance better than the reference population and a ratio greater than 1 indicates a performance worse than the reference population. This methodology has been used for international comparison of ICUs,* comparison of hospitals,^† ICU evaluation,^143–146 management evaluation,^142,147,148 and the influence of organization and management factors on the performance of the ICU.¹⁴⁹
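As an illustration, the SMR and an approximate confidence interval can be computed as in the sketch below. The interval shown is one simple approximation assumed for this example (observed deaths treated as a Poisson count, expected deaths treated as fixed); it is not necessarily the computation used in the cited reference. The function name and the 100-patient sample are hypothetical.

```python
import math


def smr_with_interval(died, predicted, z=1.96):
    """Standardized mortality ratio with an approximate 95% interval.

    Assumption: observed deaths follow a Poisson distribution and the expected
    number of deaths is fixed, so the standard error is sqrt(observed) / expected.
    """
    observed = sum(died)
    expected = sum(predicted)
    smr = observed / expected
    half_width = z * math.sqrt(observed) / expected
    return smr, smr - half_width, smr + half_width


# Hypothetical ICU: 100 patients, 16 deaths observed, 20 deaths predicted.
died = [1] * 16 + [0] * 84
predicted = [0.2] * 100
smr, lower, upper = smr_with_interval(died, predicted)
print(round(smr, 3), round(lower, 3), round(upper, 3))
```

In this constructed sample the ratio is below 1, but the approximate interval includes 1, so the apparent better-than-reference performance could be due to chance.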

Before applying this methodology, six questions should always be answered:

1. Can we evaluate and register all the data needed for the computation of the models?

2. Can the models be used in the large majority of our patients?

3. Are existent models able to control for the main patient characteristics related to mortality?

4. Has the reference population been well chosen and are the models well calibrated to this population?

5. Is the sample size large enough to detect meaningful differences?

6. Is vital status at ICU discharge the main performance indicator?

Each of these assumptions has been questioned in past years and there is no definitive answer at this time. However, most investigators believe that performance is multidimensional and consequently that it should be evaluated in several dimensions.^25,150 The problem of sample size seems especially important with respect to the risk of a type II error (in other words, to say that there are no differences when they exist).

Moreover, the comparison between observed and predicted might make more sense if done separately in low-, intermediate-, and high-risk patients, because the performance of an ICU can change according to the severity of the admitted patients. This approach was advocated in the past based on theoretical concerns,^151–153 but was used only in a small number of studies.^149,154 Multilevel modeling with varying slopes can be an answer for the developers of such models.^25,155
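A stratified comparison of this kind might be sketched as follows. The risk boundaries of 10% and 50% are hypothetical cutoffs chosen only for the example, as are the function name and the sample; the text does not specify how the risk bands should be defined.

```python
def stratified_oe(died, predicted, cutoffs=(0.1, 0.5)):
    """Observed/predicted death ratio within low-, intermediate-, and high-risk bands.

    cutoffs are assumed boundaries between the bands; None is returned for an
    empty band.
    """
    low, high = cutoffs
    strata = {"low": [0, 0.0], "intermediate": [0, 0.0], "high": [0, 0.0]}
    for d, p in zip(died, predicted):
        name = "low" if p < low else "intermediate" if p < high else "high"
        strata[name][0] += d      # observed deaths in the band
        strata[name][1] += p      # predicted deaths in the band
    return {name: (obs / exp if exp > 0 else None)
            for name, (obs, exp) in strata.items()}


# Hypothetical sample with two patients in each risk band.
died = [0, 0, 1, 0, 1, 1]
predicted = [0.05, 0.05, 0.25, 0.25, 0.8, 0.8]
print(stratified_oe(died, predicted))
```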

Organ Dysfunction/Failure Scoring Systems

Organ failure scores are designed to describe organ dysfunction/failure more than to predict survival. In the development of organ function scores, three important principles need to be remembered.¹⁷ First, organ failure is not a simple all-or-nothing phenomenon; rather, a spectrum or continuum of organ dysfunction exists from very mild altered function to total organ failure. Second, organ failure is not a static process and the degree of dysfunction may vary with time during the course of disease so that scores need to be calculated repeatedly. Third, the variables chosen to evaluate each organ need to be objective, simple, and available but reliable, routinely measured in every institution, specific to the organ in question, and independent of patient variables, so that the score can be easily calculated for any patient in any ICU. Interobserver variability in scoring can be a problem with more complex systems^63,156 and the use of simple, unequivocal variables can avoid this potential problem. Ideally, scores should be independent of therapeutic variables, as stressed by Marshall and coworkers,¹⁵ but in fact, this is virtually impossible to achieve as all factors are more or less treatment dependent. For example, the PaO₂/FiO₂ ratio is dependent on ventilatory conditions and positive end-expiratory pressure (PEEP), platelet count may be influenced by platelet transfusions, urea levels are affected by hemofiltration, and so on.

The process of organ function description is relatively new and there is no general agreement on which organs to assess and which parameters to use. Many different scoring systems have been developed for assessing organ dysfunction,^15–17,157–165 differing in the organ systems included in the score, the definitions used for organ dysfunction, and the grading scale used.^71,166 The majority of scores include six key organ systems (cardiovascular, respiratory, hematologic, central nervous, renal, and hepatic), with other systems, such as the gastrointestinal system, less commonly included. Early scoring systems assessed organ failure as either present or absent, but this approach is very dependent on where the limits for organ function are set, and newer scores consider organ failure as a spectrum of dysfunction. Most scores have been developed in the general ICU population, but some were aimed specifically at the septic patient.^17,158,159,163,164 Three of the more recently developed systems are discussed below, the main difference between them being in their definition of cardiovascular system dysfunction (Table 73.2).

Table 73.2

Organ Dysfunction/Failure Scoring Systems

GCS, Glasgow Coma Scale; HR, heart rate; LOD, Logistic Organ Dysfunction; MAP, mean arterial pressure; MODS, Multiple Organ Dysfunction Score; SAPS, Simplified Acute Physiology Score; SOFA, Sequential Organ Failure Assessment; WBC, white blood cells.

^*Data from Marshall JC, Cook DA, Christou NV, et al: Multiple organ dysfunction score: A reliable descriptor of a complex clinical outcome. Crit Care Med 1995;23:1638-1652.

^†Data from Vincent J-L, Moreno R, Takala J, et al: The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure. Intensive Care Med 1996;22:707-710.

^‡Data from Le Gall JR, Klar J, Lemeshow S, et al, The ICU Scoring Group: The logistic organ dysfunction system. A new way to assess organ dysfunction in the intensive care unit. JAMA 1996;276:802-810.

Multiple Organ Dysfunction Score

This scoring system was developed by a literature review of clinical studies of multiple organ failure from 1969 to 1993.¹⁵ Optimal descriptors of organ dysfunction were thus identified and validated against a clinical database. Six organ systems were chosen, and a score of 0 to 4 allotted for each organ according to function (0 being normal function through 4 for most severe dysfunction) with a maximum score of 24. The worst score for each organ system in each 24-hour period is taken for calculation of the aggregate score. A high initial Multiple Organ Dysfunction Score (MODS) correlated with ICU mortality risk and the delta MODS (calculated as the MODS over the whole ICU stay less the admission MODS) was even more predictive of outcome.¹⁵ In a study of 368 critically ill patients the MODS was found to better describe outcome groups than the APACHE II or the organ failure score, although the predicted risk of mortality was similar for all scoring systems.¹⁶⁷ The MODS has been used to assess organ dysfunction in clinical studies of various groups of critically ill patients, including those with severe sepsis.^168–171
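The aggregate and delta scores can be illustrated with already graded organ scores. The sketch below assumes that the "MODS over the whole ICU stay" is the sum of the worst score recorded for each organ during the stay; that reading, the organ labels, the function names, and the three days of data are all assumptions made for illustration. The thresholds that convert physiologic values into the 0 to 4 grades are not reproduced here.

```python
ORGANS = ("cardiovascular", "respiratory", "hematologic",
          "central nervous", "renal", "hepatic")


def aggregate_score(organ_scores):
    """Sum of six organ scores, each graded from 0 (normal) to 4 (most severe)."""
    for organ in ORGANS:
        if not 0 <= organ_scores[organ] <= 4:
            raise ValueError(f"{organ} score must be between 0 and 4")
    return sum(organ_scores[organ] for organ in ORGANS)


def delta_score(days):
    """Worst score per organ over the whole stay, summed, minus the admission score.

    days[0] is taken as the admission day (assumption of this sketch).
    """
    worst = {organ: max(day[organ] for day in days) for organ in ORGANS}
    return aggregate_score(worst) - aggregate_score(days[0])


# Hypothetical daily worst organ scores for one patient over three days.
days = [
    dict(zip(ORGANS, (1, 2, 0, 0, 1, 0))),   # admission, aggregate 4
    dict(zip(ORGANS, (3, 2, 1, 0, 2, 0))),   # day 2, aggregate 8
    dict(zip(ORGANS, (2, 3, 1, 0, 1, 1))),   # day 3, aggregate 8
]
print([aggregate_score(day) for day in days], delta_score(days))
```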

Sequential Organ Failure Assessment Score

The sequential organ failure assessment (SOFA) score was developed in 1994 during a consensus conference organized by the European Society of Intensive Care and Emergency Medicine in an attempt to provide a means of quantitatively and objectively describing the degree of organ failure over time in individual septic patients and in groups of such patients.¹⁷ Initially termed the sepsis-related organ failure assessment score, the score was then renamed the sequential organ failure assessment because it was realized that it could be applied equally to nonseptic patients. In devising the score, the participants of the conference decided to limit to six the number of systems studied: respiratory, coagulation, hepatic, cardiovascular, central nervous system, and renal. A score of 0 is given for normal function through 4 for most abnormal, and the worst values on each day are recorded. Individual organ function can thus be assessed and monitored over time, and an overall global score can also be calculated. A high total SOFA score (SOFA max) and a high delta SOFA (the total maximum SOFA minus the admission total SOFA) have been shown to be related to a worse outcome,^129,172 and the total score has been shown to increase over time in nonsurvivors compared to survivors.¹⁷² The SOFA score has been used for organ failure assessment in several clinical trials, including one in septic shock patients.^173–176

Logistic Organ Dysfunction System Score

This score was developed in 1996 using multiple logistic regression applied to selected variables from a large database of ICU patients.¹⁶ To calculate the score, each organ system receives points according to the worst value for any variable for that system on that day. If no organ dysfunction is present, the score is 0, rising to a maximum of 5. As the relative severity of organ dysfunction differs between organ systems, the Logistic Organ Dysfunction System (LODS) score allows for the maximum of 5 points to be awarded only to the neurologic, renal, and cardiovascular systems. For the pulmonary and coagulation systems, a maximum of 3 points can be given for the most severe levels of dysfunction; and for the liver, the most severe dysfunction receives only 1 point. Thus, the total maximum score is 22. The LODS score is designed to be used as a once-only measure of organ dysfunction in the first 24 hours of ICU admission, rather than as a repeated assessment measure. The LODS system is quite complex and seldom used; nevertheless, it has been used to assess organ dysfunction in clinical studies.¹⁷⁷
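The organ-specific maxima described above, and the resulting total of 22, can be checked with a short sketch. The dictionary and function names are hypothetical, and the check only enforces the upper limit for each organ system; it does not encode which intermediate point values the LODS actually assigns.

```python
# Maximum points per organ system as described in the text.
LODS_MAX_POINTS = {
    "neurologic": 5,
    "renal": 5,
    "cardiovascular": 5,
    "pulmonary": 3,
    "coagulation": 3,
    "liver": 1,
}


def lods_total(points):
    """Sum organ points, rejecting values outside 0..maximum for each system.

    Organ systems missing from `points` are counted as 0 (no dysfunction).
    """
    total = 0
    for organ, cap in LODS_MAX_POINTS.items():
        value = points.get(organ, 0)
        if not 0 <= value <= cap:
            raise ValueError(f"{organ} points must be between 0 and {cap}")
        total += value
    return total


print(sum(LODS_MAX_POINTS.values()))   # 5 + 5 + 5 + 3 + 3 + 1
print(lods_total({"renal": 3, "pulmonary": 1, "liver": 1}))
```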