Ch. 8 Physical Database Design and Database Infrastructure
Physical Database Design Process
Goal: Data processing efficiency - requires info that has been collected and produced during prior systems development phases
Repositories
Informtion repository: A component that stores metadata that describe an organization's data and data processing resources, manages the total information processing environment and combines information about an organization's business information and its application portfolio
Denormalization pt 2
Leads to more storage space for raw data and more space for database overhead - primary purpose to increase data retrieval efficiency
Database recovery
Mechanisms for restoring a database quickly and accurately after loss or damage DBA: responsibility to ensure that all critical data in a database are protected and can be recovered in the event of loss
User-defined procedures
User exits (or interfaces) that allow system designers to define their own security procedures in addition to the authorization rules
Control Objectives for Information and Related Technology (COBIT)
an open standard published by the IT Governance Institute and the Information Systems Audit and Control Association (SACA)
Authorization rules
controls incorporated in a data management system that restrict access to data and restrict the actions that people may take when they access data
Incorrect data
difficult to detect and lead to complications When incorrect data have been processed, the database can be recoverd in 1 of the ways - backward recovery may be used - few errors is occurred - a series of compensation transactions can be introduced to correct errors - restart from the most recent checkpoint before the error occurred and process subsequent transactions again without the error
Hashed file organization
hashed file organization: a storage system in which the address for each record is determined using a hashing algorithm hashing algorithm: a routine that converts a primary key value into a record address Limitation: only 1 key can be used for hashing-based retrieval Hash index table: uses hashing to map a key into a location in an index Pointer: a field of data indicating a target address that can be used to locate a related field or record of data
3 factor authentication
important 3rd factor -> biometric attribute that is unique for each individual user - fingerprints, voiceprints, eye pictures, and signature dynamics SMART CARD
When to use indexes
indexes should be used for databases intended primarily to support data retrieval, such as for decision support and data warehouse applications - indexes are most useful on larger tables - unless DBMS is automatically, specify a unique index for the primary key of each table - indexes are useful for columns that frequently appear in WHERE clauses of SQL commands either to qualify the rows to select or to join tables - Use an index for attributes referenced in ORDER BY (sorting) and GROUP BY (categorizing) clauses - Use an index when there is significant variety in the values of an attribute - Before creating an index on a field with long values, consider first creating a compressed version of the value - if the key for the index is going to be used for determining the location where the record will be stored, then the key for this index should be a surrogate key so that the values cause records to be evenly spread across the storage space - Check your DBMS for the limit, if any, on the number of indexes allowable per table. - Be careful of indexing attributes that have null values.
Data dictionaries and repositories
metadata: data that describe the properties or characteristics of end-user data and the context of that data Data dictionary: A repository of information about a database that documents data elements of a database -> stores metadata or information about the database including attribute names and definitions for each table in the database System catalog: A system-created database that describes all database objects, including data dictionary information, and also includes user access information
hot backup
only a selected portion of the database
Designing physical database files
physical file: a named portion of secondary memory (such as a hard disk) allocated for the purpose of storing physical records EXTENTS: allow a physical file to be split into separate pieces DBA: needs to knw extensive details about how the database management system manages physical storage space tablespace: a named logical storage unit in which data from one or more database tables, views, or other database objects may be stored (consists of 1 or several physical operating system files)
IT operations
policies and procedures in place related to the day-to-day management of the infrastructure, applications, and databases in an organization - Vendor management: helps maintain data quality and availability -> review hardware/sftware
incremental backups
record changes made since the last backup but which do not take as much time to complete
Normalized relations
solve data maintenance anomalies and minimize redundancies (and storage space) - may not yield efficient data processing - creates a large number of tables
Two factor authentication
something the user has (card or token) and something the user knows (PIN) possible issues: Cards can be lost/stolen and PINs can be intercepted
Encryption
the coding or scrambling of data so that humans cannot read them - can be used to protect highly sensitive data
cold backup
the database is shut down
Heap file organization
the records in the file are not stored in any particular order
Indexed file organization
the storage of records either sequentially or nonsequentially with an index that allows software to locate individual needs - index: a table or other data structure used to determine in a file the location and records that satisfy some condition
Sequential file organization
the storage of records in a file in sequence according to a primary key value - to locate a particular record, a program must scan the file from the beginning until the desired record is located
Creating a Secondary (Nonunique) Key index
to speed up such retrievals, you can define a index on each attribute that you use to qualify a retrieval
Recovery and Restart Procedures
Disk mirroring: to be able to switch to an existing copy of a database - allows for a faster recovery - hot-swappable: a damaged disk can be rebuilt from the mirrored disk with no disruption in service to the user
Vertical partioning
Distribution of the columns of a logical relation into several separate physical tables, repeating the primary key in each of the tables
Designing fields
Field: the smallest unit of application data recognized by system software - corresponds to a simple attribute in the logical data model
Basic Recovery Facilities
- Backup facilities: provide periodic backup copies of portions of or the entire database - Journalizing facilities: maintain an audit trail of transactions and database changes - A checkpoint facility: DBMS periodically suspends all processing and synchronizes its files and journals to establish a recovery point - Recovery manager: allows the DBMS to restore the database to a correct condition and restart processing transactions
Physical database design has decisions that will impact integrity and performance of the application system
- Choosing the data type for each attribute - Giving the database management system guidance regarding how to group attributes from the logical data model into physical records - Giving the database management system guidance regarding how to arrange similarly structured records in secondary memory (hard disks), using a file organization so that individual and groups of records can be stored, retrieved, and updated rapidly - Selecting structures for storing and connecting files to make retrieving related data more efficient - Preparing strategies for handling queries against the database that will optimize performance and take advantage of the file organizations and indexes that you have specified
Clustering files
- DBMS allow adjacent secondary memory space to contain rows from several tables - Cluster: defined by the tables and the column/columns by which the tables are usually joined - time is reduced bc related records will be closer to each other than if the records are stored in separate files in separate areas of the disk - best used when the records are fairly static
Controlling data integrity
- Data integrity controls can be built into the physical structure of the fields and controls enforced by the DBMS on those fields TYPICAL INTEGRITY CONTROLS - default value: value a field will assume unless a user enters an explicit value for an instance of that field - range control: limits the set of permissible values a field may assume - null value control: each primarykey must have an integrity control that prohibits a null value - referential integrity: a form of range control in which the value of that field must exist as the value in some field in another row of the same or a different table
Requirements for physical file and database design
- Normalized relations (estimates for the range of the number of rows in each table) - Definitions of each attribute - Descriptions of where and when data are used in various ways - Expectations or requirements for response time and data security, backup, recovery, retention, and integrity - Descriptions of the technologies (database management systems) used for implementing the database
Important security features of data management software
- View or subschemas, which restrict user views of the database - Domains, assertions, checks, and other integrity controls defined as database objects - Authorized rules - User-defined procedures - Encryption procedures - Authentication schemes - Backup, journaling, and checkpointing capabilities
Physical Database Design as a Basis for Regulatory Compliance
- W/O careful physica design, an organization cnnot demonstrate that its data are accurate and well protected
Logical access to data
- about the securtiy procedures in place to prevent unauthorized access to the data - organizations must establish administrative policies and procedures that serve as a context for implementing these measures - removing authorizations and authentications when someone's status is different - Controlling physical access -> limiting access to particular areas in the building (swipe or access cards) - backup data types is kept in fireproof data safes and/or kept off-site at a safe location - security guards, card swipe system, handprint recognition system
Date Volume and Usage Analysis
- beneficial to maintain an understanding of the size and usage patterns of the database throughout its life cycle - access frequencies are estimated from the timing of events, transaction volumes, the number of concurrent users, and reporting and querying activities
Code lookup table
- code table is a physical construct to achieve data processing performance improvements, not a set of data with business value
Parallel Query Processing
- common aproach: replicate the query so that each copy works against a portion of the database, usually a horizontal partition
Tablespace
- consists of logical units (segments) and are divided into extents - extent: a contiguous section of disk storage space - data blocks: smallest unit of storage - each tablespace can be stored in 1 or multiple data files, but each data file is associated with only 1 tablespace and only 1 database - A DBA has the ability to manage the disk space allocated to tablespaces and a number of parameters related to the way free space is managed within a database
Denormalize with caution
- denormalization can increase the chance of errors and inconsistencies (caused by reintroducing anomalies into the database) and can force the reprogramming of systems if business rules change - denormalization optimizes certain data processing at the expense of other data processing, so if the frequencies of different processing activities change, the benefits of denormalization may no longer exist
SOX & Databases
- designed to ensure the integrity of public companies' financial statements - ensuring sufficient control and security over the financial systems and IT infrastructure in use within an organization
Data replication
- final form of denormalization - the same data is purposely stored in multiple places in the database
Partitioning
- leads to differences between the logical data model and the physical tables
Secondary key
- one field or a combination of fields for which more than one record may have the same combination of values -> called a nonunique key - important for supporting many reporting requirements and for providing rapid ad hoc data retrieval
Diaster Recovery
- organization-wide responsibility - DBA is responsible for developing plans for recovering the organization's data and for restoring data operations COMPONENTS OF A RECOVERY PLAN - develop a detailed written diaster recovery plan - choose and train a multidisciplinary team to carry out the plan - establish a backup data center at an off-site location - send backup copies of databases to the backup data center ona scheduled basis
Dangers of denormalization
- performance doesn't depend only on the number of tables accessed but how the tables are organized in the database - to avoid problems associated with the data anomalies in denormalized databases, try other things to achieve that performance
Secure Sockets Layer (SSL)
- popular implementation of the 2 key method - provides data encryption, server authentication, and other services in a TCP/IP connection
Integrity Controls
- protect data from unauthorized use and update - integrity controls limit the values a field may hold and the actions that can be performed on data or trigger the execution of some procedure FORM of INTEGRITY CONTROL: domain -> ADV: if it has to change, it can be changed in1 plan Assertions: powerful constraints that enforce certain desirable database conditions -> checked automatically by the DBMS when transactions are run involving tables or fields on which assertions exist
Using and selecting indexes
- scanning every row in a table looking for the desired rows may be unacceptably slow - Indexes can greatly speed up the process - an index can be created for either a primary key, a secondary key, or both
Backward recovery (rollback)
- the backout or undo of unwanted changes to a database.Before images o the records that have been changed are applied to the database and the database is returned to an earlier state. Rollback is used to reverse the changes made by transactions that have been aborted or terminated abnormally
IT change management
- the process by which changes to operational systems and databases are authorized KEY ISSUE: adequate segregation of duties between people who had access to databases in the 3 common environments: development, test, and production
Designing Controls for Files
- the types of controls you can use to protect the file from destruction or contamination or to reconstruct the file if it is damaged - backup procedures provide a copy of a file and of the transaction that have changed the file
Horizontal partioning
- type of denormalization - distribution of the rows of a logical relation into several separate tables - implements a logical relation as multiple physical tables by placing different rows into different tables based on common column values - can make maintenance of a table more efficient because fragmenting and rebuilding can be isolated to single partitions as storage space needs to be reorganized - can be more secure bc it can prohibit users from seeing certain rows of data
Denormalization
- used to improve efficient processing of data retrieval and quick access to stored data BEST KNOWN APPROACH: combining several logical tables into 1 physical table to avoid the need to bring related data back together when they are retrieved from the database - the process of transforming normalized relations into nonnormalized physical record specifications
Handling missing data
- using a default value - not permitting missing values - substitute an estimate of the missing value (with a formulua) - track missing data so that special reports/system elements cause people to resolve unknown values quickly TRIGGER: a routine that will automatically execute when some event occurs or time period passes - perform sensitivity testing so that missing data are ignored unless knowing a value might significantly change results
Encryption: 2 common forms
1 key method (DES - Data Encryption Standard) -> both the sender and the receiver need to know the key that is used to scramble the transmitted or stored data 2 key (Asymmetric encryption): employs a private and a public key
TYPES OF DENORMALIZATION
1. 2 entities w/ a one-to-one relationship: 2. many-to-many relationship 3. Reference Data
Key focus of SOX audits is around 3 areas of control
1. IT change management 2. Logical access to data 3. IT operations
Authroization factors
1. user knows -> password, personal ID number 2. user posses -> smart or token 3. unique personal characteristic -> fingerprint OR retinal scan Password: first line of defense (1 factor authentication) -> a couple of issues including written down, shared w/ others
Backup facility
A DBMS copy utility that produces a backup copy (or save) of an entire database or a subset of a database - the copy should be stored in a secured location where it is protected from loss or damage, used to restore the database in the event of hardware failure, catastrophic loss, or damage
Database-as-a-Service (DBaaS)
A cloud computing approach in which the service consists of a data management platform service
Platform-as-a-Service (PaaS)
A cloud computing approach in which the service consists of infrastructure resources and additional tools and services that allow application and data management solution developers to reach a higher level of productivity than with pure infrastructure resources
Software-as-a-Service
A cloud computing approach in which the service consists of software solutions/applications intended to directly address the needs of a noncomputing activity
Infrastructure-as-a-Service (IaaS)
A cloud computing approach in which the service consists primarily of hardware and various types of systems software resources
Checkpoint facility
A facility by which a DBMS periodically refuses to accept any new transactions. The system is in a quiet state, and the database & transaction logs are synchronized
Cloud computing
A model for provisioning and acquiring computing services on demand using centralized resources that are accessed either through the public Internet or a private network
Recovery manager
A module of a DBMS that restores the database to a correct condition when a failure occurs and then resumes processing user questions
Creating a Unique Key Index
CREATE UNIQUE INDEX
File organization
A technique for physically arranging the records of a file on secondary storage devices 7 important factors: - fast data retrieval - high throughput for processing data input and maintenance transactions - efficient use of storage space - protection from failures or data loss - minimizing need for reorganization - accommodating growth - security from unauthorized use
Restore/Run
A technique that involves reprocessing the day's transactions (up to the point of failure) against the backup copy of the database ADV: simpicity DIS: processing transactions takes hours and sequencing of transactions will be different than originally processed so different results
Forward recovery (rollforward)
A technique that starts with an earlier copy of a database. After images are applied to the database and the database is quickly moved forward to a later state - faster and accurate than restore/rerun bc of the time-consuming logic of reprocessing each transaction
Aborted transaction
A transaction in progress that terminates abnormally REASONS: human error, input of invalid data, hardware failure, and deadlock When a transaction aborts, you will want to "back out" the transaction and remove any changes that have been made (but not committed) to the database
Committee of Sponsoring Organizations (COSO) of the Treadway Commission
A voluntary private-sector organization dedicated to improving the quality of financial reporting through business ethics, effective internal controls, and corporate governance
Data dictionaries
ACTIVE data dictionary: managed automatically by the database management software -> consistent w/ the current structure and definition of the database bc they are maintained by the system itself PASSIVE data dictionary: managaed by the user(s) of the system and is modified whenever the structure of the database is changed -> may not be current
Triggers
Can be used for security purposes - prohibit inappropriate actions - Cause special handling procedures to be executed - cause a row to be written to a log file to echo important information about the user and a transaction being made to sensitive data so that the log can be reviewed by human or automated procedures for possible inappropriate behavior
Types of database failure
Common types of problems are aborted transactions, incorrect data, system failure, and database loss or destruction
Choosing Data Types
Data type: A detailed coding scheme recognized by system software for representing organizational data Selecting a data type involves 4 objectives: 1. Represent all possible values 2. Improve data integrity 3. Support all data manipulations 4. Minimize storage space
Who is responsible for physical database design?
Database administrator (DBA): primary responsibility for physical database design - a technical function responsible for logical and physical database design and for dealing with technical issues - understand the data models built by data administration and be capable of transforming them into efficient and appropriate logical and physical database designs - implements the standards and procedures established by the data administrator, including enforcing programming standards, data standards, policies, and procedures
Repository engine - 5 core function
Object management Relationship management Dynamic extensibility Version management Configuration management
Designing a database for optimal query performance
PURPOSE: optimize the performance of database processing Database processing: adding, deleting, and modifying a database
SOX (sabanes-oxyley act)
Purpose: protect investors by improving the accuracy and reliability of corporate disclosures made pursuant to the securities laws and for other purposes
Database destruction
The database itself is lost, destroyed, or cannot be read
OVERRIDING AUTOMATIC QUERY OPTIMIZATION
With some DBMSs, you can force the DBMS to do the steps differently or to use the capblities
Before image
a copy of a record before it has been modified
After image
a copy of the same record after it has been modified
Smart card
a credit card-sized plastic card with an embedded microprocessor chip that can store, process, and output electronic data in a secure manner
Journalizing facilities
an audit trail of transactions and database changes transaction log: contains a record of the essential data for each transaction that is processed against the database database change log: contains before and after images of records that have been modified by transactions