IDS 410

6 Physical Database Design and Performance

Charles E. Oyibo


Introduction

Our previous discussions about conceptual data modeling and the logical database design phases of the database development process (including E-R/EER notations and the relational data model and normalization) do not explain how data will be processed or stored. The purpose of physical database design is to translate the logical description of data into the technical specifications for storing and retrieving data. The goal is to create a design for storing data that will provide adequate performance and ensure database integrity, security, and recoverability.

Physical Database Design Process

The information needed for physical file and database design includes the results of the earlier phases: the normalized relations, estimates of data volume and frequency of use, definitions of each attribute, and expectations for response time, data integrity, backup, recovery, and security.

Physical database design requires the following critical decisions:

  1. Choosing the storage format (called data type) for each attribute from the logical data model. The format is chosen to minimize storage space and maximize data integrity.
  2. Grouping attributes from the logical data model into physical records.
  3. Arranging similarly structured records in secondary memory so that individual and groups of records (called file organizations) can be stored, retrieved, and updated rapidly. Considerations should also be given to protecting data and recovering data after errors are found.
  4. Selecting structures (called indexes and database architectures) for storing and connecting files to make retrieving related data more efficient.
  5. Preparing strategies for handling queries against the database that will optimize performance and take advantage of the file organization and indexes that we have specified.

Data Volume and Usage Analysis

An easy way to show the statistics about data volumes and usage (access frequency) is to add notation to the EER diagram that represents the final set of normalized relations from logical database design. The volume and frequency statistics are generated during the systems analysis phase of the systems development process, when systems analysts are studying current and proposed data processing and business activities.

Fortunately, precise numbers are not necessary. What is crucial is the relative size of the numbers, which will suggest where the greatest attention needs to be given in order to achieve the best possible performance.

Designing Fields

Field: The smallest unit of named application data recognized by system software, such as a programming language or database management system.

A field corresponds to a simple attribute from the logical data model; each component of a composite attribute is therefore represented by a separate field.

Choosing Data Types

Data Type: A detailed coding scheme recognized by system software, such as a DBMS, for representing organizational data.

Selecting a data type involves four objectives that will have different relative importance for different applications.

  1. Minimize storage space
  2. Represent all possible values
  3. Improve data integrity
  4. Support all data manipulations
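The storage-space objective can be made concrete with Python's `struct` module, which reports the byte width of common fixed-size formats. This is only an illustration of the trade-off (the formats stand in for SMALLINT/INT/BIGINT-style column types, not any particular DBMS's):

```python
import struct

# Storage required for a quantity-on-hand field under different choices:
# a 2-byte, 4-byte, and 8-byte signed integer.
sizes = {fmt: struct.calcsize(fmt) for fmt in ("h", "i", "q")}
print(sizes)  # {'h': 2, 'i': 4, 'q': 8}

# A 2-byte integer holds values only up to 32,767; if quantities can exceed
# that, the smaller type fails the "represent all possible values" objective
# and a wider (costlier) type must be chosen.
assert struct.pack("h", 32767)  # largest value a 2-byte signed int holds
```

The point is that objectives 1 and 2 pull in opposite directions: the narrowest type that still represents every possible value is usually the right choice.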

Coding and Compression Techniques

A field with a limited number of possible values can be translated into a code that requires less space. By creating a code or translation table for the product Finish field (with possible values Birch, Maple, and Oak), each Finish field value can be replaced by a code, a cross-reference to the look-up table, similar to a foreign key. (Note that the code table would not appear in the conceptual or logical model; the code table is a physical construct to achieve data processing performance improvements, not a set of data with business value.)
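The Finish example above can be sketched with SQLite standing in for the DBMS (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Code (look-up) table: each Finish value is stored once, and PRODUCT
# stores only a one-character code that cross-references it, much like
# a foreign key.
cur.execute("CREATE TABLE finish_code (code CHAR(1) PRIMARY KEY, finish TEXT)")
cur.executemany("INSERT INTO finish_code VALUES (?, ?)",
                [("B", "Birch"), ("M", "Maple"), ("O", "Oak")])
cur.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, finish CHAR(1))")
cur.execute("INSERT INTO product VALUES (1, 'B')")

# Decoding joins the short code back to the full value.
cur.execute("""SELECT p.product_id, f.finish
               FROM product p JOIN finish_code f ON p.finish = f.code""")
print(cur.fetchall())  # [(1, 'Birch')]
```

Each PRODUCT row now stores one character instead of a full string, at the cost of a join when the full value is needed.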

A form of a code table is used by data compression techniques, such as file zip routines. Also related to data compression techniques are encryption techniques.

Controlling Data Integrity

For many DBMSs, data integrity controls (i.e., controls on the possible values a field can assume) can be built into the physical structure of the fields and enforced by the DBMS on those fields. The data type enforces one type of data integrity control, since it limits the type of data and the length of a field value. Other integrity controls that a DBMS might support include a default value, a range control, a null value control, and referential integrity.
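Three of these controls can be sketched in one table definition, again using SQLite as a stand-in (the PRODUCT table and its columns are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One table showing a null value control, a range control, and a default value.
cur.execute("""CREATE TABLE product (
    product_id     INTEGER PRIMARY KEY,
    description    TEXT NOT NULL,                    -- null value control
    standard_price REAL CHECK (standard_price > 0),  -- range control
    finish         TEXT DEFAULT 'Oak'                -- default value
)""")

cur.execute("INSERT INTO product (product_id, description) VALUES (1, 'End Table')")
cur.execute("SELECT finish FROM product WHERE product_id = 1")
print(cur.fetchone())  # ('Oak',) -- the default value was applied

try:
    cur.execute("INSERT INTO product VALUES (2, 'Chair', -10.0, 'Maple')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # the range control rejects the negative price
```

Because the DBMS enforces these rules, every application that touches the table gets the same protection without repeating the logic in each program.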

Handling Missing Data

When a field may be null, simply entering no value may be sufficient. But if we needed to generate, say, a report of customers by zip code, how would we handle customers with unknown zip codes? Two options are covered by the data integrity controls mentioned above: use a default value or do not permit missing (null) values. Other possible methods for handling missing data are (Babad & Hoffer, 1984):

  1. Substitute an estimate for the missing value; however, such estimates must be marked so that users know that they are not actual values
  2. Track missing data so that special reports and other system elements cause people to quickly resolve unknown values. This can be done by setting up a trigger in the database definition.

Trigger: A routine that will automatically execute when some event occurs or time period elapses.

  3. Perform sensitivity testing so that missing data are ignored unless knowing a value might significantly change results. This is the most complex of the methods mentioned and hence requires the most sophisticated programming. Such routines for handling missing data may be written in application programs. (Many modern DBMSs now support more sophisticated programming capabilities, such as case expressions, user-defined functions, and triggers, so that such logic can be available in the database for all users without application-specific programming.)
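The tracking approach can be sketched with a SQLite trigger (the table, trigger, and log names are illustrative): each insert with a missing zip code is recorded in a log table that a follow-up report can draw on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, zip TEXT)")
cur.execute("CREATE TABLE missing_zip_log (customer_id INTEGER)")

# Trigger that tracks rows inserted with a missing zip code, so that a
# special report can prompt someone to resolve the unknown values.
cur.execute("""CREATE TRIGGER log_missing_zip AFTER INSERT ON customer
               WHEN NEW.zip IS NULL
               BEGIN
                   INSERT INTO missing_zip_log VALUES (NEW.customer_id);
               END""")

cur.execute("INSERT INTO customer VALUES (1, '60607')")
cur.execute("INSERT INTO customer VALUES (2, NULL)")
cur.execute("SELECT customer_id FROM missing_zip_log")
print(cur.fetchall())  # [(2,)] -- only the customer with the unknown zip
```

The trigger fires automatically on every insert, so no application program can forget to record the missing value.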

Designing Physical Records and Denormalization

In a logical data model, we group into a relation those attributes that are determined by the same primary key. In contrast, a

Physical record is a group of fields stored in adjacent memory locations and retrieved and written together as a unit by a DBMS.

The design of a physical record involves choosing the sequencing of fields into adjacent storage locations to achieve two goals: efficient use of secondary storage and data processing speed.

The efficient use of secondary storage is influenced by both the size of the physical record and the structure of secondary storage. Operating systems read data from hard disks in units called pages (not physical records) ...

Page: The amount of data read or written by an operating system in one secondary storage (disk) input or output operation. For I/O with a magnetic tape, the equivalent term is record block.

If page length is not an integer multiple of the physical record length, wasted space may occur at the end of the page.

Blocking Factor: The number of physical records per page.

If storage space is scarce and physical records cannot span pages, creating multiple physical records from one logical relation will minimize wasted storage space. Some DBMSs will also block multiple physical records into a data block; in this case, the DBMS manages the data block, whereas the OS manages a page.
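The blocking-factor arithmetic is a simple floor division; the page and record sizes below are assumed for illustration.

```python
# With a fixed page size and records that cannot span pages, the blocking
# factor is the floor of page size over record size, and the remainder is
# the wasted space at the end of each page.
PAGE_SIZE = 4096     # bytes per page (assumed)
RECORD_SIZE = 300    # bytes per physical record (assumed)

blocking_factor = PAGE_SIZE // RECORD_SIZE   # records that fit per page
wasted_per_page = PAGE_SIZE % RECORD_SIZE    # bytes unused per page
print(blocking_factor, wasted_per_page)  # 13 196
```

Here 13 records fit per page and 196 bytes are wasted; trimming the record to 256 bytes would raise the blocking factor to 16 with no waste, which is the kind of adjustment the text describes.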

Denormalization

Our preceding discussion focuses on efficient use of storage space. In most cases, the second goal of physical record design--efficient data processing--dominates the design process. Efficient processing of data depends on how close together related data are.

Often all the attributes that appear within a relation are not used together, and data from different relations are needed together to answer a query or produce a report. Thus although normalized relations solve data maintenance anomalies and minimize redundancies, normalized relations, if implemented one for one as physical records, may not yield efficient data processing.

Denormalization: The process of transforming normalized relations into unnormalized physical record specifications.

The following are some common denormalization opportunities:

  1. Two entities with a one-to-one relationship: If the access frequency between the two entity types is high, it may result in better processing efficiency to combine the two relations into one record definition.
  2. A many-to-many relationship (associative entity) with nonkey attributes: Rather than joining three files to extract data from the two basic entities in the relationship, processing efficiency might be achieved by combining attributes from one of the entities into the record representing the many-to-many relationship, thus avoiding one join operation in many data access modules. (This would be most advantageous if joining occurs often.)
  3. Reference data: Reference data exists in an entity on the one-side of a one-to-many relationship, and this entity participates in no other database relationships. We should consider merging the two entities in this situation into one record definition when there are few instances of the entity on the many-side for each entity instance on the one-side.

The opportunities listed above deal with combining tables to avoid doing joins. However, denormalization can also be used to create more tables by partitioning a relation into multiple tables.

Horizontal Partitioning

Horizontal Partitioning: Distributing the rows of a table into several separate files. (Breaking a relation into multiple record specifications by placing different rows in different records based upon common column values.) Example: breaking a customer relation into four regional customer files based on the value of a Region field.
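A minimal sketch of the regional-customer example, with SQLite standing in for the DBMS: rows of one logical CUSTOMER relation are routed into separate physical tables by the value of the Region field. (The table names and routing scheme are illustrative; real DBMSs such as Oracle do this declaratively, as the list below notes.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One physical table per region value -- four partitions of one relation.
for region in ("north", "south", "east", "west"):
    cur.execute(f"CREATE TABLE customer_{region} (customer_id INTEGER, name TEXT)")

# Each row is placed in the partition matching its Region value.
customers = [(1, "Alpha", "north"), (2, "Beta", "south"), (3, "Gamma", "north")]
for cid, name, region in customers:
    cur.execute(f"INSERT INTO customer_{region} VALUES (?, ?)", (cid, name))

cur.execute("SELECT COUNT(*) FROM customer_north")
print(cur.fetchone())  # (2,) -- a regional query touches only its partition
```

A query restricted to one region now scans only that partition's rows, which is the efficiency benefit the text describes.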

We should note that horizontal partitioning is very similar to creating supertype/subtype relationships.

The Oracle DBMS supports several forms of horizontal partitioning ... :

  1. Key range partitioning
  2. Hash partitioning
  3. Composite partitioning

Vertical Partitioning

Vertical Partitioning: Distributing the columns of a table into several separate physical records (repeating the primary key in each of the records). Example: breaking a part relation by placing the part number along with accounting-related part data in one record specification, the part number along with engineering-related part data in another record specification, and the part number along with sales-related part data in yet another record specification.

Advantages of Partitioning

  1. Efficiency: Data used together are stored close to one another and separate from data not used together
  2. Local optimization: Each partition of data can be stored to optimize performance for its own use
  3. Security: Data not relevant to one group of users can be segregated from data they are allowed to use
  4. Recovery and uptime: Smaller files will take less time to recover, and other files are still accessible if one file is damaged, so the effect of damage is isolated
  5. Load balancing: Files can be allocated to different storage areas, which minimizes contention for access to the same storage area or even allows for parallel access to different areas.

Disadvantages of Partitioning

  1. Inconsistent access speed: Different partitions may yield different access speeds, thus confusing users. Also, when data must be combined across partitions, users may have to deal with significantly slower response times than in a nonpartitioned approach.
  2. Complexity: Partitioning is usually not transparent to programmers, who will have to write more complex programs when combining data across partitions.
  3. Extra space and update time: Data may be duplicated across the partitions, taking extra storage space compared to storing all the data in normalized fields. Updates which affect data in multiple partitions can take more time than if one file were used.

(Denormalization by) Data Replication

In data replication, the same data are purposely stored in multiple places in the database in order to achieve increased processing (data retrieval speeds). (Improved speed is worthwhile only if the replicated data is frequently accessed in each of the relations it is replicated in, and if the costs for extra secondary storage and data maintenance are not prohibitive.)

Designing Physical Files

Physical File: A named portion of secondary memory (such as a magnetic tape or hard disk) allocated for the purpose of storing physical records.

An important structure for physical storage space in Oracle is a tablespace.

A tablespace is a named set of disk storage elements in which physical files from one or more database tables may be stored.

Oracle has responsibility for managing the storage of data inside a tablespace, whereas the operating system has many responsibilities for managing a tablespace as a whole as it would any operating system file (e.g. handling file level security, allocating space, and responding to disk read and write errors).

A tablespace may be spread over several extents (where an extent is a contiguous section of disk storage space). When a tablespace needs to enlarge to hold more data, it is assigned another extent ... Managing tablespaces, or physical database files, is a significant job of the database administrator in an Oracle environment.

Pointer

All files are organized by using two basic constructs to link one piece of data with another piece of data: sequential storage and pointers.

Pointer: A field of data that can be used to locate a related field or record of data. Generally, a pointer contains the address, or location, of the associated data.

(We define pointers here because a knowledge of what pointers are is necessary for understanding file organization, discussed below.)

File Organization

File Organization: A technique for physically arranging the records of a file on secondary storage devices.

With modern DBMSs, we do not have to design file organizations, though we may be allowed to select an organization and its parameters for a table or physical file. In choosing a file organization for a particular file in a database, we should consider seven important factors:

  1. Fast data retrieval
  2. High throughput for processing data input and maintenance transactions
  3. Efficient use of storage space
  4. Protection from failures or data loss
  5. Minimizing need for reorganization
  6. Accommodating growth
  7. Security from unauthorized use

Often these objectives conflict, and we must select a file organization that provides a reasonable balance among the criteria within the resources available. We now consider three families of basic file organizations.

Sequential File Organization

Sequential File Organization: The storage of records in a file in sequence according to a primary key value.

Because of their inflexibility (a program must normally scan a file from the beginning until the desired record is located), sequential files are not used in a database, but may be used for files that backup data from a database.

Indexed File Organizations

Indexed File Organization: The storage of records either sequentially or nonsequentially with an index that allows software to locate individual records.

An index is a table or other data structure used to determine the location of rows in a file that satisfy some condition.

Each index entry matches a key value with one or more records, and can point to unique records or to potentially more than one record. An index that allows each entry to point to more than one record is called a secondary key.

Secondary key: One field or combination of fields for which more than one record may have the same combination of values. Also called nonunique key.

When the terms primary and secondary index are used, there are four types of indexes:

  1. Unique primary index (UPI)
  2. Nonunique primary index (NUPI)
  3. Unique secondary index (USI)
  4. Nonunique secondary index (NUSI)

Indexes can be built on top of indexes, since indexes (being files themselves) can become so large as to benefit from indexing. Each index entry, then, has a key value and a pointer to another index or to a data record.

One of the most powerful capabilities of indexed file organizations is the ability to create multiple indexes (such as in a library, where there are indexes on title, author, subject, etc., pointing to the same set of books). Also, multiple indexes can be manipulated together (e.g., "find products with a Birch finish AND that cost under $500") to increase gains in processing efficiency. Logical AND, OR, and NOT operations can be handled by simply manipulating the results of index scans, thus avoiding the more costly access to data records that do not meet all the qualifications of the query.
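The index-manipulation idea can be sketched with plain Python sets; the index contents and record addresses below are made up for illustration.

```python
# Each secondary index maps a key value to the set of record addresses
# holding that value; AND/OR on query conditions become set
# intersection/union on the indexes, with no data records touched.
finish_index = {"Birch": {101, 104, 107}, "Oak": {102, 105}}
price_under_500_index = {True: {101, 102, 107}, False: {104, 105}}

# "Birch finish AND costs under $500" -- resolved entirely in the indexes.
matches = finish_index["Birch"] & price_under_500_index[True]
print(sorted(matches))  # [101, 107]
```

Only the two surviving records need to be fetched from disk; every record eliminated by the intersection was rejected without an I/O operation.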

The general structure of a hierarchical index is called a tree (albeit an upside-down one, with the root node at the top and leaf nodes at the bottom). The performance of a tree of index entries is influenced by the properties of the tree structure used. A common type of tree used by DBMSs to store indexes is a balanced tree, or B-tree.

Another type of index is called a bitmap index.

Bitmap index: A table of bits in which each row represents one of the distinct values of a key and each column is a bit that, when on, indicates that the record at that bit's column position has the associated field value.

A bitmap is ideal for attributes that have only a few possible values. In proper situations, because bit manipulation and searching is so fast, the speed of query processing with a bitmap can be 10 times faster than a conventional tree index.
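A minimal bitmap-index sketch for the Finish attribute, using Python integers as the bit vectors (the record values are made up for illustration):

```python
# One row of bits per distinct value; bit position i is on when record i
# has that value.
records = ["Birch", "Oak", "Birch", "Maple", "Oak"]  # field value per record

bitmaps = {}
for i, value in enumerate(records):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)

# Record positions with a Birch finish, recovered from the bitmap alone.
birch = bitmaps["Birch"]
positions = [i for i in range(len(records)) if birch & (1 << i)]
print(positions)  # [0, 2]

# Boolean queries are single machine-word operations on the bitmaps:
birch_or_oak = bitmaps["Birch"] | bitmaps["Oak"]
print(bin(birch_or_oak))  # 0b10111
```

Because ORing two bitmaps is one machine operation regardless of how many records the bits cover, this is why bitmaps excel for attributes with few distinct values.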

Another important type of index, especially in data warehousing and other decision support applications, is a join index.

A join index is an index on columns from two or more tables that come from the same domain of values.

Simply stated, a join says find rows in the same or different tables that have values that match some criterion.

Hashed File Organizations

Hashed file organization: A storage system in which the address of each record is determined using a hashing algorithm.

Hashing algorithm: A routine that converts a primary key value into a relative record number (or relative file address).

A typical hashing algorithm uses the technique of dividing each primary key value by a suitable prime number and then using the remainder of the division as the relative storage location. (Another technique must then be used to resolve the duplicates, or overflow, that can occur when two or more keys hash to the same address, known as "hash clash".)
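The division-remainder technique can be sketched in a few lines; the prime and key values are illustrative (real files use much larger primes), and overflow is resolved here by chaining colliding records in a list.

```python
# Relative address = primary key value mod a prime number of buckets.
PRIME = 7
buckets = [[] for _ in range(PRIME)]

def store(key):
    buckets[key % PRIME].append(key)  # chain overflow records in the bucket

for key in (1001, 1008, 1234, 4567):
    store(key)

print(1001 % PRIME, 1008 % PRIME)  # 0 0 -- these two keys hash clash
print(buckets[0])  # [1001, 1008] -- both chained at the same relative address
```

Retrieval recomputes the remainder to find the bucket directly, then scans the (usually short) chain, which is why hashing gives near one-access lookup by primary key.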

Hash index table: A file organization that uses hashing to map a key into a location in an index, where there is a pointer to the actual data record matching the hash key.

Summary on File Organizations

The DBMS will handle the management of the file organization. What is important for the DB designer is to understand the properties of the different file organizations so that we can choose the most appropriate one for the type of database processing required in the database and application we are designing.

Clustering Files

Some DBMSs allow adjacent secondary memory space to contain rows from several tables. In this case, a physical file does not contain records with identical structures. For example, in Oracle, rows from one, two, or more related tables that are often joined together can be stored in the same disk area. A cluster is defined by the table and the column or columns by which the tables are usually joined.

Clustering reduces the time to access related records compared to the normal allocation of different files to different areas of a disk. Time is reduced since related records are closer to each other than if the records were stored in separate files in separate areas of the disk.

Designing Controls for Files

One additional aspect of a database file about which we may have design options is the types of controls we can use to protect the file from destruction or contamination, or to reconstruct the file if it is damaged. The proprietary storage format provided by the DBMS offers a basic level of control, but we may require additional security controls on fields, files (tables), or databases. Backup procedures provide a copy of a file and of the transactions that have changed the file. When a file is damaged, the file copy or current file along with the log of transactions is used to recover the file to an uncontaminated state. Further, in terms of security, the most effective method is to encrypt the contents of the file so that only programs with access to the decryption routine will be able to see the file contents.

Using and Selecting Indexes

Indexes on a file can be created for either a primary or secondary key or both. The index is itself a table with two columns: the key and the address of the record or records that contain that key value. For a primary key, there will be only one entry in the index for each key value.

Creating a Unique Key Index

CREATE UNIQUE INDEX CUSTINDEX ON CUSTOMER(CUSTOMER_ID)

Creating a Secondary (Nonunique) Key Index

CREATE INDEX DESCINDX ON PRODUCT(DESCRIPTION)
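The two statements can be run as-is against SQLite (the CUSTOMER and PRODUCT tables below are illustrative). Note that the unique index doubles as an integrity control: a duplicate CUSTOMER_ID is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer (customer_id INTEGER, name TEXT)")
cur.execute("CREATE TABLE product (product_id INTEGER, description TEXT)")

# The two index statements from the text.
cur.execute("CREATE UNIQUE INDEX custindex ON customer(customer_id)")
cur.execute("CREATE INDEX descindx ON product(description)")

cur.execute("INSERT INTO customer VALUES (1, 'Alpha')")
try:
    cur.execute("INSERT INTO customer VALUES (1, 'Beta')")
except sqlite3.IntegrityError:
    print("duplicate key rejected by unique index")
```

The nonunique index on DESCRIPTION imposes no such restriction; it exists only to speed retrieval on that column.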

When to Use Indexes

During physical database design, we must choose which attributes to use to create indexes. There is a trade-off between improved performance for retrievals through the use of indexes and degraded performance for inserting, deleting, and updating the records in a file. Thus, indexes should be used generously for databases intended primarily to support data retrieval, such as decision support and data warehouse applications, and should be used judiciously for databases that support transaction processing and other applications with heavy updating requirements, since the indexes impose additional overhead.

The following are some rules of thumb for choosing indexes for relational databases.

  1. Indexes are most useful for larger tables.
  2. Specify a unique index for the primary key of each table.
  3. Indexes are most useful for columns that frequently appear in WHERE clauses of SQL commands, either to qualify the rows to select or for linking (joining) tables.
  4. Use an index for attributes referenced by ORDER BY (sorting) and GROUP BY (categorizing) clauses. (Be sure that the DBMS will, in fact, use indexes on attributes listed in these clauses.)
  5. Use an index when there is significant variety in the values of an attribute. (Oracle suggests that indexing is not useful when there are fewer than 30 different values for an attribute, and an index is clearly useful when there are 100 or more different values for an attribute. Similarly, an index will be helpful only if the results of a query which uses that index do not exceed roughly 20 percent of the total number of records in the file (Schumacher, 1997).)
  6. Check the DBMS for the limit, if any, on the number of indexes allowable per table.
  7. Be careful indexing attributes that have null values. For many DBMSs, rows with a null value will not be referenced in the index (so they cannot be found from an index search of ATTRIBUTE = NULL). Such a search will have to be done by scanning the file.

RAID: Improving File Access Performance by Parallel Processing

The increasing feasibility of using multiple computer components (a result of the constant reduction in the cost and size of computer technologies) allows parallel processing of data--the result being that four I/O operations, when done in parallel, take only as long as one such operation.

Parallel database processing and fault tolerance can be accomplished by a database designer using hardware or software technology called Redundant Array of Inexpensive Disks (RAID).

Redundant Array of Inexpensive Disks (RAID): A set, or array, of physical disk drives that appear to the database user (and programs) as if they form one large logical storage unit. (Thus, RAID does not change the logical or physical structure of application programs or database queries.)

In order to maximize I/O performance of RAID, all the disk drives need to be kept busy. This is accomplished by striping--the balancing of the workload across the disk drives.

Stripe: The set of pages on all disks in a RAID that are the same relative distance from the beginning of the disk drive.
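A sketch of how striping balances the workload (the disk count and page numbers are illustrative): page n of a file is placed on disk n mod d, so a multi-page read touches every drive.

```python
# Round-robin page placement across the drives of a RAID.
NUM_DISKS = 4

def disk_for_page(page_number):
    return page_number % NUM_DISKS

placement = {p: disk_for_page(p) for p in range(8)}
print(placement)  # {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3}

# Reading pages 0..3 touches four different drives, so the four I/O
# operations can proceed in parallel, taking about as long as one.
```

Pages 0 through 3 form one stripe, pages 4 through 7 the next; any run of four consecutive pages keeps all four drives busy.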

The details of parallel processing are usually implemented at the OS level.

RAID has one important risk: the increased likelihood of a disk drive failure somewhere in the array. To cope with this risk and to make disk storage fault tolerant, many types of RAID technologies redundantly store data, so that at least one copy of the data is accessible when demanded, or store extra error correction codes so that damaged or lost data can be rebuilt.

Choosing Among RAID Levels

RAID-0

RAID-1

RAID 0+1

RAID-2

RAID-3

RAID-4

RAID-5

Designing Databases

Choosing Database Architecture

A particular database management or data warehousing system supports one of five different architectures:

  1. Hierarchical database model
  2. Network database model
  3. Relational database model
  4. Object-oriented database model
  5. Multidimensional database model

Optimizing for Query Performance



Charles E. Oyibo
IDS :: CBA :: UIC