The reliability of these intersections is inextricably bound to the ability to distinguish each row from every other row.¹ This requires the assignment of a unique identifier, or primary key, to every row within the table. A common instinct when identifying rows in a table that houses patient information is to use the patient’s name as the primary key. This solution, however, breaks down as soon as two different patients with the same name are entered. The medical record number is usually a better alternative, providing a unique value for identifying each patient. However, for reasons discussed previously (patient privacy law), the medical record number is not generally a viable option. A more appropriate method is to assign an independent, arbitrary value as a primary key for the row. One column within the table is dedicated to the primary keys (see Fig. 29-2) and is structured to require that each value is unique.
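As a minimal sketch of an arbitrary primary key, the following uses Python's built-in sqlite3 module. The sample patients and the second birthdate are hypothetical and are not taken from Figure 29-2; only the column names "PatientKey," "LastName," "FirstName," and "Birthdate" come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Patient (
        PatientKey INTEGER PRIMARY KEY,  -- arbitrary value, unique per row
        LastName   TEXT NOT NULL,
        FirstName  TEXT NOT NULL,
        Birthdate  TEXT NOT NULL
    )
    """
)

# Two different patients who share a name (hypothetical data).
# SQLite assigns the next free integer when PatientKey is omitted.
conn.executemany(
    "INSERT INTO Patient (LastName, FirstName, Birthdate) VALUES (?, ?, ?)",
    [("Smith", "Jane", "1960-02-20"), ("Smith", "Jane", "1975-11-03")],
)
rows = conn.execute(
    "SELECT PatientKey, LastName, FirstName, Birthdate "
    "FROM Patient ORDER BY PatientKey"
).fetchall()

# Reusing an existing key value is rejected by the uniqueness requirement.
try:
    conn.execute("INSERT INTO Patient VALUES (1, 'Doe', 'John', '1980-01-01')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Both patients named Jane Smith are stored, each under a different key, while an attempt to reuse a key fails.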

By assigning a distinct value as primary key for each row, two different patients with the same name can now be identified unambiguously. The uniqueness of the primary key is important because it serves as a device to connect different tables within the database. Establishment of these connections, or relationships, across tables becomes essential as the database is normalized (a process of “tuning” the data storage system, discussed later in this chapter). If each row cannot be identified and referenced individually, relationships between separate tables become confused and unreliable. In the relational model, a table’s primary key provides a means for other tables to reference its information. When the primary key of one table is stored in another as a link between them, it is called a foreign key, and it establishes the relationship between the two tables. As a result, data elements that are stored in separate tables in a database can be combined to form new tables (called derived tables), as Figure 29-3 demonstrates. By linking records from the patient and physician tables through the “PhysicianForeignKey” column, a derived table is created that contains the relevant data from both tables.

FIGURE 29-3. Derived tables.

Although this example is somewhat trivial, the ability of the primary/foreign key model to connect otherwise disjointed tables is clear. As the discussion develops, the importance of this concept will become more evident. The application of the primary/foreign key model is one of the building blocks for normalizing the relational system.
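The primary/foreign key link described above can be sketched with Python's sqlite3 module. The rows are hypothetical rather than a copy of Figure 29-3, and the names "PhysicianKey" and "PhysicianName" are assumptions; only "PatientKey" and "PhysicianForeignKey" appear in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(
    """
    CREATE TABLE Physician (
        PhysicianKey  INTEGER PRIMARY KEY,
        PhysicianName TEXT NOT NULL
    );
    CREATE TABLE Patient (
        PatientKey          INTEGER PRIMARY KEY,
        LastName            TEXT NOT NULL,
        FirstName           TEXT NOT NULL,
        PhysicianForeignKey INTEGER NOT NULL
            REFERENCES Physician (PhysicianKey)
    );
    INSERT INTO Physician VALUES (1, 'Jones'), (2, 'Lee');
    INSERT INTO Patient VALUES
        (1, 'Smith', 'Jane', 1),
        (2, 'Brown', 'Tom',  2);
    """
)

# The derived table: rows from both tables combined through the foreign key.
derived = conn.execute(
    """
    SELECT p.LastName, p.FirstName, ph.PhysicianName
    FROM Patient AS p
    JOIN Physician AS ph ON ph.PhysicianKey = p.PhysicianForeignKey
    ORDER BY p.PatientKey
    """
).fetchall()

# A patient row that points at a nonexistent physician is refused.
try:
    conn.execute("INSERT INTO Patient VALUES (3, 'Doe', 'John', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

The join result exists only for the duration of the query, while the stored data remain in the two separate tables.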

Normalization

The rules of normalization, originally defined by Dr. E. F. Codd, deal primarily with the elimination of data redundancies that lead directly to flawed data and impractical, inefficient data management in relational systems.² The rules of normalization provide solid guidelines for building effective relational database systems. Normalization leverages the actual structure of the database to improve the integrity of the data. In practice, normalization is manifested as a “spreading” of the data, as information is stored throughout the database in many separate tables that are interrelated. Entities should be grouped and related in the same manner that they would be observed in their real-world roles. In the same way, the differences should be maintained by using separate tables (i.e., a patient table should not contain information concerning the physician). Although this idea is fairly simple, it is the foundation of normalizing the database.

Originally, there were only three rules of normalization, but subsequent rules have been added. The rules of normalization are ordered by their degree of specificity, and each higher-order rule is contingent on compliance with each of the previous rules. A database that is in second normal form (the term used to describe a database that complies with the second rule of normalization) must also be in first normal form. Each rule is more rigid than its predecessor and more difficult to use. The highest-order rules, in fact, are so strict that they can actually cause a decline in the performance of a relational system. It is uncommon for a production database to achieve anything higher than third normal form.

First Rule of Normalization

The first rule of normalization is somewhat academic: each column in a given row contains one, and only one, value. Violation of this principle is relatively easy to recognize and correct. It would seem unnatural, for instance, to include a column with the heading “Physician/Diagnosis” that contains both the name of the physician and the patient’s diagnosis. This problem is easily resolved by separating the two independent values into two distinct columns, “Physician” and “Diagnosis.” A subtler example is demonstrated in the storage of a patient’s name in a single column, rather than creating one column for the first name and another for the last name. Arguments can be made that this is not truly a violation of first normal form, but the two-attribute approach is more suitable because of the common use of last name as an identifier and sort item for groups of patients.
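The correction of a combined “Physician/Diagnosis” column can be sketched in a few lines of Python. The sample values, and the assumption that the two values are separated by a slash, are hypothetical; the function name to_first_normal_form is introduced here only for illustration.

```python
# Rows that violate first normal form: one column holds two values.
# The "Name/Diagnosis" slash format is an assumption made for this sketch.
rows = [
    {"PatientKey": 1, "Physician/Diagnosis": "Jones/Asthma"},
    {"PatientKey": 2, "Physician/Diagnosis": "Lee/Hypertension"},
]


def to_first_normal_form(row):
    """Split the combined value into two single-valued columns."""
    physician, diagnosis = row["Physician/Diagnosis"].split("/", 1)
    return {
        "PatientKey": row["PatientKey"],
        "Physician": physician.strip(),
        "Diagnosis": diagnosis.strip(),
    }


normalized = [to_first_normal_form(row) for row in rows]
```

After the split, each column holds exactly one value, so the rows can be filtered or sorted by physician or by diagnosis independently.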

The higher-order rules of normalization deal more specifically with the reduction of data in the relational system. The storage of duplicate information in multiple locations causes the process of modification to become unruly. For example, in the database depicted in Figure 29-2, if Dr. Jones gets married, triggering a name change, two rows are affected (those with values of 1 and 3 in the “PatientKey” column). As a result, the physician values stored in the “Physician” column of each record must be updated, signaling a data storage redundancy. In Figure 29-3, this redundancy is corrected by isolating the physician information into its own table (“Physician”). The data have been effectively reduced, so that the same change requires the update of only one row. This type of data reduction demonstrates the importance of the primary key in the relational model. Separate, related tables are “bridged” by storing the primary key from one table (e.g., “PatientKey”) as a foreign key in another (e.g., “PatientForeignKey”).
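The effect of this redundancy on updates can be sketched by counting the rows each design has to change. The table name "PatientFlat," the columns "PhysicianKey" and "PhysicianName," the sample rows, and the new surname are all hypothetical; only the detail that patient rows 1 and 3 share Dr. Jones follows the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Redundant design: the physician name is repeated in every patient row.
    CREATE TABLE PatientFlat (
        PatientKey INTEGER PRIMARY KEY,
        LastName   TEXT,
        Physician  TEXT
    );
    INSERT INTO PatientFlat VALUES
        (1, 'Smith', 'Jones'), (2, 'Brown', 'Lee'), (3, 'Smith', 'Jones');

    -- Reduced design: the name is stored once and referenced by key.
    CREATE TABLE Physician (
        PhysicianKey  INTEGER PRIMARY KEY,
        PhysicianName TEXT
    );
    CREATE TABLE Patient (
        PatientKey          INTEGER PRIMARY KEY,
        LastName            TEXT,
        PhysicianForeignKey INTEGER REFERENCES Physician (PhysicianKey)
    );
    INSERT INTO Physician VALUES (1, 'Jones'), (2, 'Lee');
    INSERT INTO Patient VALUES (1, 'Smith', 1), (2, 'Brown', 2), (3, 'Smith', 1);
    """
)

# The same name change, applied to each design; rowcount reports rows modified.
flat_updated = conn.execute(
    "UPDATE PatientFlat SET Physician = 'Jones-Miller' WHERE Physician = 'Jones'"
).rowcount
normalized_updated = conn.execute(
    "UPDATE Physician SET PhysicianName = 'Jones-Miller' "
    "WHERE PhysicianName = 'Jones'"
).rowcount
```

In the redundant design two rows must change, and missing either one leaves the data inconsistent; in the reduced design a single row changes and every patient linked to it reflects the new name.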

Second Rule of Normalization

Although this design strengthens the overall structure of the database, Figure 29-3 has yet to satisfy the standard set by the second rule of normalization: every nonkey attribute must be irreducibly dependent on the primary key.³ The second rule deals with the logical grouping of data elements. Tables should be designed to mirror their real-world counterparts. A table commissioned to store patient data should contain attributes of the patient only, completely separate from other entities, such as diagnosis or physician.

To achieve second normal form, the tables must be restructured. Duplication can be easily identified while reviewing the content of the database, as shown in Figure 29-3. The patient named Jane Smith, who was born February 20, 1960, has two rows in the “Patient” table. As a result, her name and date of birth are repeated unnecessarily. This repetition is caused by the inclusion of the attribute “Diagnosis” as part of the “Patient” table, even though it is functionally independent. To rectify this situation, the “Patient” table must be separated again into a set of smaller tables. This process, known as decomposition, must be “lossless” to maintain the integrity of the data. Just as the term implies, lossless decomposition is a process that retains all essential data and removes redundant values while preserving the ability to reproduce the content of the original table, as needed. This process is demonstrated in Figure 29-3, in which the “Patient” and “Physician” tables are stored separately but can be joined to form a derived table that contains the data from both. It should be noted that derived tables are temporary and should not be included in the long-term data storage design. Derived tables simply provide a convenient, short-term view of related data from separate tables.
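A lossless decomposition can be checked mechanically: split the table, join the pieces again, and compare the result with the original. The sketch below does this in plain Python with hypothetical rows. It assumes, for this toy data only, that name plus birthdate identifies a person; real data would need a more reliable identifier, as discussed earlier.

```python
# Original table with repeated patient details (hypothetical rows).
# Columns: LastName, FirstName, Birthdate, Diagnosis
original = {
    ("Smith", "Jane", "1960-02-20", "Asthma"),
    ("Smith", "Jane", "1960-02-20", "Hypertension"),
    ("Brown", "Tom", "1972-07-15", "Migraine"),
}

# Decompose: one row per patient under an arbitrary integer key,
# plus one row per (patient key, diagnosis) pair.
patients = {}      # PatientKey -> (LastName, FirstName, Birthdate)
key_of = {}        # reverse lookup, used only while decomposing
diagnoses = set()  # (PatientForeignKey, Diagnosis)

for last, first, birth, diagnosis in sorted(original):
    person = (last, first, birth)
    if person not in key_of:
        key_of[person] = len(key_of) + 1
        patients[key_of[person]] = person
    diagnoses.add((key_of[person], diagnosis))

# Rejoin through the key; a lossless decomposition reproduces the original.
rejoined = {patients[fk] + (diagnosis,) for fk, diagnosis in diagnoses}
is_lossless = rejoined == original
```

Jane Smith's name and birthdate are now stored once, yet the original three rows can still be rebuilt on demand.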

In the current example (see Fig. 29-3), the “Diagnosis” column is the source of the redundancy and must be sequestered to its own table. However, this separation must be done without any data loss. To accomplish this, an “Appointment” table should be added to serve as a bridge between each patient and his or her associated diagnoses. The “Appointment” table also connects patients and physicians.

The relationship between patients and appointments is established by storing the “PatientKey” for each patient in the “PatientForeignKey” column. The relationship between the “Patient” and “Appointment” tables in the database mirrors the relationship between patients and appointments in reality. The relationship can be best described as “one-to-many,” in which one patient can have many appointments. If this relationship is built into the database design, a patient can have multiple appointments (requiring multiple entries in the “Appointment” table) but only one entry is required in the “Patient” table. As a result, the data redundancy visible in Figure 29-3 (in columns “LastName,” “FirstName,” and “Birthdate”) is eliminated.

The process of decomposition continues as the diagnosis and physician information are also separated. The relationships between the patient and the associated physician and diagnoses must be maintained. The “Appointment” table is used to connect the “Patient,” “Physician,” and “Diagnosis” tables. Once again, the database design draws from a real-world example. An appointment is the point in the treatment process at which the patient meets with the physician and the physician determines the diagnosis. The database model is a natural extension of this relationship. The restructured database is shown in Figure 29-4.
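One possible shape for the restructured design is sketched below with Python's sqlite3 module. It is not a reproduction of Figure 29-4: the sample rows are hypothetical, and the names "AppointmentKey," "PhysicianKey," "PhysicianName," "DiagnosisKey," "DiagnosisName," and "DiagnosisForeignKey" are assumptions. Only "PatientKey," "PatientForeignKey," "PhysicianForeignKey," and the table names come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(
    """
    CREATE TABLE Patient (
        PatientKey INTEGER PRIMARY KEY,
        LastName TEXT, FirstName TEXT, Birthdate TEXT
    );
    CREATE TABLE Physician (PhysicianKey INTEGER PRIMARY KEY, PhysicianName TEXT);
    CREATE TABLE Diagnosis (DiagnosisKey INTEGER PRIMARY KEY, DiagnosisName TEXT);

    -- The bridge: each appointment links one patient, physician, and diagnosis.
    CREATE TABLE Appointment (
        AppointmentKey      INTEGER PRIMARY KEY,
        PatientForeignKey   INTEGER NOT NULL REFERENCES Patient (PatientKey),
        PhysicianForeignKey INTEGER NOT NULL REFERENCES Physician (PhysicianKey),
        DiagnosisForeignKey INTEGER NOT NULL REFERENCES Diagnosis (DiagnosisKey)
    );

    INSERT INTO Patient VALUES
        (1, 'Smith', 'Jane', '1960-02-20'), (2, 'Brown', 'Tom', '1972-07-15');
    INSERT INTO Physician VALUES (1, 'Jones'), (2, 'Lee');
    INSERT INTO Diagnosis VALUES (1, 'Asthma'), (2, 'Hypertension'), (3, 'Migraine');
    INSERT INTO Appointment VALUES (1, 1, 1, 1), (2, 2, 2, 3), (3, 1, 1, 2);
    """
)

# Derived view: one row per appointment, assembled from all four tables.
derived = conn.execute(
    """
    SELECT p.LastName, p.FirstName, ph.PhysicianName, d.DiagnosisName
    FROM Appointment AS a
    JOIN Patient   AS p  ON p.PatientKey    = a.PatientForeignKey
    JOIN Physician AS ph ON ph.PhysicianKey = a.PhysicianForeignKey
    JOIN Diagnosis AS d  ON d.DiagnosisKey  = a.DiagnosisForeignKey
    ORDER BY a.AppointmentKey
    """
).fetchall()

# One-to-many: a single Patient row, several Appointment rows.
jane_rows = conn.execute(
    "SELECT COUNT(*) FROM Patient WHERE LastName = 'Smith' AND FirstName = 'Jane'"
).fetchone()[0]
jane_appointments = conn.execute(
    "SELECT COUNT(*) FROM Appointment WHERE PatientForeignKey = 1"
).fetchone()[0]
```

In this sketch the patient with two diagnoses occupies one row in "Patient" and two rows in "Appointment," which is the one-to-many relationship described above.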