BUS 440: Test 1 (chapters 1-6)
Data redundancy
relational database facilitates the control of data redundancies through the use of FKs. ● Unless data redundancy must be increased to make the database service crucial information purposes or exists to preserve the historical accuracy of the data.
Data ● Data management
raw facts that have not been processed to give it any real meaning. ● Building blocks of information. ● Data management: generation, storage and retrieval of data.
Data-Modeling Checklist ■ Business Rules ■ Data Modeling □ Naming conventions ▹ Entity names ▹ Attribute names ▹ Relationship names □ Entities □ Attributes □ Relationships □ ER Model
(1.) Business Rules - ● Properly document and verify all business rules with the end-users. ● Ensure that all business rules are written precisely, clearly and simply. ● Business rules must help identify entities, attributes, relationships and constraints. ● Identify the source of all business rules, and ensure each business rule is justified, dated and signed off by an approving authority. ---------------- (2.) Data Modeling - ● Naming conventions: all names should be limited in length; (a.) Entity names: should be nouns and unique within the model. (b.) Attribute names: should be descriptive of the characteristic and unique within the model. (c.) Relationship names: should be active or passive verbs. ---- ● Entities: ○ Each entity should represent a single subject. ○ Each entity should represent a set of distinguishable entity instances. ○ All entities should be in 3NF or higher (any entities below 3NF should be justified). ○ The granularity of the entity instance should be clearly identified. ○ The PK should be clearly defined. ---- ● Attributes: ○ Should be simple and single-valued (atomic data). ○ Should document default values, constraints, synonyms and aliases. ○ Derived attributes should be clearly identified and include source(s). ○ Should not be redundant (unless required for transaction accuracy, performance or maintaining a history). ○ Nonkey attributes must be fully-dependent on the PK attribute. ---- ● Relationships: ○ Should clearly identify relationship participants. ○ Should clearly define participation, connectivity and document cardinality. ---- ● ER Model: ○ Should be validated against expected processes: inserts, updates and deletions. ○ Should evaluate where, when and how to maintain a history. ○ Should not contain redundant relationships—except as required (i.e., same rule as what was said about Attributes two black bullet points above). ○ Should minimize data redundancy to ensure single-place updates. ○ Should conform to the minimal data rule—"All that is needed is there, and all that is there is needed."
Evolution of file system data processing (1.) Manual file systems (2.) Computerized file systems (3.) File system redux
(1.) From manual file systems—just a system of file folders and filing cabinets... ---- (2.) To computerized file systems—where a data-processing specialist would create a computer-based system that would track data and produce required reports... - Lengthy development times, difficulty getting quick answers, complex system administration, lack of security, limited data sharing and extensive programming. - All departments stand alone, with no sharing of data (so data could be different in each department). ---- (3.) To file system redux—modern end-user productivity tools; includes spreadsheet programs like Microsoft Excel.
Five components of a database system
(1.) Hardware (2.) Software (3.) People ○ System administrators oversee general operations. ○ DBAs ensure the database is functioning properly. ○ Database designers (self-explanatory) ○ End-users (self-explanatory) (4.) Procedures ○ Instructions and rules that govern the database. (5.) Data
The Evolution of Data Models
(1.) Hierarchical (2.) Network (3.) Relational (4.) Entity-Relationship (5.) Semantic (6.) Object-Oriented (7.) Extended Relational (8.) NoSQL
Developing an ER Diagram
(a.) Create a detailed narrative of the organization's description of operations. (b.) Identify business rules based on the descriptions. (c.) Identify main entities and relationships from the business rules. (d.) Develop the initial ERD. (e.) Identify the attributes and PKs that adequately describe entities. (f.) Revise and review the ERD.
Relationships within the relational database
1:M 1:1 M:N
Chapter 2
Data Models
Chapter 1
Database Systems
Chapter 4
Entity-Relationship (ER) Modeling Entity-relationship model (ERM) is the basis of an entity-relationship diagram (ERD).
Chapter 6
Normalization of Database Tables
Chapter 3
The Relational Database Model ● Relational database models enable logical representation of the data and its relationships. ● Logical simplicity yields simple and effective database design methodologies. ● Facilitated by the creation of data relationships based on a logical construct (a relation).
Linking table
The implementation of a composite entity.
Data modeling
process of creating a specific data model for a determined problem domain.
Database management system (DBMS)
a collection of programs that manages the database structure and controls access to data stored in the database. ● It is the intermediary between the user and the database. ● Enables data to be shared and presents the end-user with an integrated view of the data. ● Receives and translates application requests into operations required to fulfill the requests. ● Hides the database's internal complexity from the application programs/users. ● Essentially, it is the middle-man between the database structure and the end-users.
*** (3.) Relational model ● Advantages ● Disadvantages
a method of structuring data using relations, which are grid-like mathematical structures consisting of columns and rows. ● Produces an automatic transmission database that replace standard transmission databases. ● Based on a relation - ○ Relation or table: matrix of intersecting tuples and attributes; ■ Tuple: rows ■ Attribute: columns ● Describes a precise set of data manipulation constructs. ---- ● Advantages: - Structural independence is promoted by using independent tables. - Tabular view improves conceptual simplicity. - Ad hoc query capability is based on SQL. - Isolates the end-user from physical-level details. - Improves implementation and management simplicity. ---- ● Disadvantages: - Requires substantial hardware and system software overhead. - Conceptual simplicity gives untrained people the tools to use a good system poorly. - May promote information problems.
Structural dependence
access to a file is dependent on its own structure. ● All file system programs are modified to conform to a new file structure.
Model
an abstraction of a real-world object or event.
Relationships ● Participants ● Connectivity ● Cardinality ● Existence dependence ● Existence independence ● Weak (non-identifying) relationship ● Strong (identifying) relationship ● Optional participation ● Mandatory participation ● Relationship degree ○ Unary relationship - Recursive relationship ○ Binary relationship ○ Ternary relationship
association between entities that always operate in both directions. ● Participants: entities that participate in a relationship. ● Connectivity: describes the relationship classification; □ e.g., -||--------|<- or ->O--------|<- ● Cardinality: expresses the minimum and maximum number of entity occurrence associated with one occurrence of a related entity; □ e.g., (1,1), (1,4), or (3,0), (1,6). ● Existence dependence: entity exists in the database only when it is associated with another related entity occurrence. - Must have a FK and it cannot be null. - e.g, EMPLOYEE claims DEPENDENT. ● Existence independence: entity exists apart from all of its related entities (referred to as a strong entity or a regular entity). ● Weak (non-identifying) relationship: PK of the related entity does not contain a PK component of the parent entity. ● Strong (identifying) relationship: PK of the related entity contains a PK component of the parent entity. ● Optional participation: one entity occurrence does not require a corresponding entity occurrence in a particular relationship; □ e.g., connectivity with: ->O-------- (a cardinality of (0,X)). ● Mandatory participation: one entity occurrence requires a corresponding entity occurrence in a particular relationship. - i.e., has to be 1:M, 1:1 or M:N. ---- ● Relationship degree: indicates the number of entities or participants associated with a relationship. ○ Unary relationship: association is maintained within a single entity. ■ Recursive relationship: relationship exists between occurrences of the same entity set. ○ Binary relationship: two entities are associated. ○ Ternary relationship: three entities are associated.
Entity
at the ER modeling level, refers to the entity set and not to a single entity occurrence. ● An entity in the ERM corresponds to a table (not a row) in the relational environment. ● ERM refers to a table row as an entity instance or entity occurrence.
Key attribute
attribute that is a part of a key.
Business rules ○ Allow the designer to...
brief, precise and unambiguous description of a policy, procedure or principle. ● Enables defining the basic building blocks. ● Describes main and distinguishing characteristics of the data. ● Come from: company managers, policy makers, department managers, written documentations and direct interviews with end-users. ● Helps standardize a company's view of the data. ● Is a communication tool between users and designers. ● Allow the designer to: (1.) Understand the nature, role, scope of data and business processes. (2.) Develop appropriate relationship participation rules and constraints. (3.) Create an accurate data model.
Attributes ● Required attribute ● Optional attribute ○ Domain ○ Identifiers ● Composite identifier ● Composite attribute ● Simple attribute ● Single-valued attribute ● Multivalued attributes ● Derived attribute
characteristics of entities. ● Required attribute: must have a value and cannot be left empty. ● Optional attribute: does not require a value and can be left empty. ○ Domain: set of possible values for a given attribute. ○ Identifiers: one or more attributes that uniquely identify each entity instance. ▹ Bold and underline indicates the PK. ● Composite identifier: PK composed of more than one attribute. ● Composite attribute: attribute that can be subdivided to yield additional attributes. ● Simple attribute: attribute that cannot be subdivided. ● Single-valued attribute: attribute that has only a single value. ● Multivalued attributes: attributes that have many values. - Requires creating: (1.) Several new attributes, one for each component of the original multivalued attribute. (2.) A new entity composed of the original multivalued attribute's components. ● Derived attribute: attribute whose value is calculated from other attributes. - Derived using an algorithm; - e.g., calculating an employee's age instead of entering it. - Saves CPU processing times, data access time and storage. - Computation always yields current value; data value is readily available. - Requires constant maintenance to ensure derived value is current.
Schema
conceptual organization of the entire database as viewed by the database administrator.
Entity integrity ● Null ○ Flags ○ NOT NULL constraint ○ UNIQUE constraint
condition in which each row in the table has its own unique identity. ● All of the values in the PK must be unique. ● No key attribute in the PK can contain a null. ---- ○ Null: absence of any data value that could represent an unknown attribute value, a known but missing value or an inapplicable condition. - Ways to handle nulls: ■ Flags: special codes used to indicate the absence of some value. ■ NOT NULL constraint: placed on a column to ensure that every row in the table has a value for that column. ■ UNIQUE constraint: restriction placed on a column to ensure that no duplicate values exist for that column.
Keys ○ Determination ○ Dependencies - Functional dependence ■ Determinant ■ Dependent - Full-functional dependence
consist of one or more attributes that determine other attributes. ● Used to ensure that each row in a table is uniquely identifiable; establishes relationships; and ensures the integrity of the data. ---- ○ Determination: state in which knowing the value of one attributes makes it possible to determine the value of another. ▸ Is the basis for establishing the role of a key. ▸ Based on the relationships among the attributes. ---- ○ Dependencies: - Functional dependence: value of one or more attributes determine the value of one or more other attributes. ■ Determinant: attribute whose value determines another. ■ Dependent: attribute whose value is determined by the other attribute. - Full-functional dependence: entire collection of attributes in the determinant is necessary for the relationship.
General-purpose databases
contains a wide variety of data used in multiple disciplines.
*** (6.) Object-oriented data model (OODM) ● Attributes ● Class ● Class hierarchy ● Inheritance ● Unified modeling language (UML) ○ Advantages ○ Disadvantages
contains data and their relationships with operations that are performed on it. ○ Basic building block for autonomous structures. ○ Abstraction of a real-world entity. ○ Is said to be a semantic data model. ● Attributes: describe the properties of an object. ● Class: collection of similar objects with shared structure and behavior organized in a class hierarchy. ● Class hierarchy: resembles an upside-down tree in which each class has only one parent. ● Inheritance: object inherits methods and attributes of parent class. ● Unified modeling language (UML): describes sets of diagrams and symbols to graphically model a system. ---- ● Advantages: - Semantic content is added. - Visual representation includes semantic content. - Inheritance promotes data integrity. ---- ● Disadvantages: - Slow development of standards caused vendors to supply their own enhancements. - Complex navigational system. - Learning curve is steep. - High system overhead slows transactions.
Discipline-specific databases
contains data focused on specific subject areas.
Cloud database
created and maintained using cloud data services that provide defined performance measures for the database.
Data dependence
data access changes when data storage characteristics change. ● Practical significance of data dependence is difference between logical and physical format.
Distributed database
data is distributed across different sites.
Centralized database
data is located at a single site.
Data indepedence
data storage characteristics is changed without affecting the program's ability to access the data.
Structured data
data that has been formatted (structure is applied based on the type of processing to be performed; e.g., "$37,819", "387.19", Area code: "37819".
Semi-structured data
data that has been processed to some extent.
Unstructured data
data's original state; e.g., "37819".
NoSQL (Not only SQL)
describes a new generation of databases that is not based on traditional relational database model.
Operational database
designed to support a company's day-to-day operations such as sales transactions. Also known as online transaction processing (OLTP).
Schema data definition language (DDL)
enables the database administrator to define the schema components.
Data manipulation language (DML)
environment in which data can be managed and is used to work with the data in the database.
Normalization ● Denormalization
evaluating and correcting table structures to minimize data redundancies. ● Reduce data anomalies. ● Assigns attributes to tables based on determination. ● Normal forms: 1NF, 2NF and 3NF ● Structural point-of-view of normal forms (higher normal forms are better than lower). ● Properly designed 3NF structures meet the requirement of 4NF. ● Used while designing a new database structure. - Analyzes the relationship among the attributes within each entity. - Determines if the structure can be improved. ● Improves the existing data structure and creates an appropriate database design. ---- ○ Denormalization: produces a lower normal form, resulting in increased performance and greater data redundancy.
Referential integrity
every reference to an entity instance by another entity instance is valid or null.
Structural independence
file structure is changed without affecting the application's ability to access the data.
Relational database management system (RDBMS)
performs basic functions provided by the hierarchical and network DBMS systems and makes the relational data model easier to understand and implement; hides the complexities of the relational model from the user.
Subschema
portion of the database seen by the application programs that produce the desired information from the data within the database.
Big data ● 3 Vs ○ Disadvantages
finding new and better ways to store large amounts of data for efficient and effective exchange of all data types. ● Aims at finding new and better ways to manage large amounts of web- and sensor-generated data as well as provide high performance and scalability at a reasonable cost. ---- Defined by the 3 Vs: ▸ Volume—the amount of data being stored; ▹ Social media greatly increased the volume of data. ▸ Velocity—the speed with which data grows and the need to process this data quickly. ▸ Variety—data being collected originates from many different sources and is stored in many different formats. ---- ○ Disadvantages: - Volume does not allow the usage of conventional structures. - Expensive. - OLAP tools proved inconsistent dealing with unstructured data. ---- ● Big data new technologies include: Hadoop, Hadoop Distributed File System (HDFS), MapReduce and NoSQL.
*** (4.) Entity-relationship model ● Entity-relationship diagram (ERD) ● Entity instance/occurrence ● Connectivity ○ Advantages ○ Disadvantages
graphical representation of entities and their relationships in the database structure. ● Entity-relationship diagram (ERD): uses graphic representations to model database components. ● Entity instance/occurrence: rows in the relational table. ● Connectivity: term used to label the relationship types. ---- ● Advantages: - Visual modeling yields conceptual simplicity. - Visual representation makes it an effective communication tool. - Is integrated with the dominant relational model. ---- ● Disadvantages: - Limited constraint representation. - Limited relationship representation. - No data manipulation language. - Loss of information content occurs when attributes are removed from entities to avoid crowded displays.
Composite entity (bridge or associative entity)
helps avoid problems inherent to M:N relationships; include the PKs of tables to be linked.
Database systems ● Current-generation DBMS software...
logically related data stored in a single logical data repository and then physically distributed among multiple storage facilities. ● DBMS eliminates most of file system's problems. ● Current-generation DBMS software: (1.) Stores data structures, relationships between structures and access paths. (2.) Defines, stores and manages all access paths and components.
*** (1.) Hierarchical models ● Advantages ● Disadvantages
manages large amounts of data for complex manufacturing projects. ● Represented by an upside-down tree which contains segments—equivalent of a file system's record type. ● Depicts a set of one-to-many (1:M) relationships. ---- ● Advantages: - Promotes data sharing. - Parent/child relationship promotes conceptual simplicity and data integrity. - Database security is provided and enforced by DBMS. - Efficient with 1:M relationships. ---- ● Disadvantages: - Requires knowledge of physical storage characteristics. - Navigational system requires knowledge of hierarchical path. - Changes in structure requires changes in all application programs. - Implementation limitations. - No data definitions. - Lack of standards.
Single-user database ○ Desktop database
only supports one user at a time. ○ Desktop database: runs on a PC.
Index ● Index Key ● Unique index
orderly arrangement to logically access rows in a table. ● Index Key: index's reference point that leads to data location identified by the key. ● Unique index: Index Key that can have only one pointer value associated with it. ○ Each index is associated with only one table.
*** (2.) Network models ● Advantages ● Disadvantages
represent complex data relationships. ● Improve database performance and impose a database standard. ● Depicts both one-to-many (1:M) and many-to-many (M:N) relationships. ---- ● Advantages: - Conceptual simplicity. - Handles more relationship types. - Data access is flexible. - Data owner/member relationship promotes data integrity. - Conformance to standards. - Includes data definition language (DDL) and data manipulation language (DML). ---- ● Disadvantages: - System complexity limits efficiency. - Navigational system yields complex implementation, application development and management. - Structural changes require changes in all applications programs.
Extensible Markup Language (XML)
represents data elements in textual format. Supports semi-structured data.
Islands of information
scattered data locations that increases the probability of having different versions of the same data. - Data-modeling skills are vital in the data-design process. - Good data modeling facilitates communication between the designer, user and the developer.
Data models
simple representations of complex real-world data structures. ● Useful for supporting a specific problem domain. ● Are a communication tool. ● Give an overall view of the database. ● Organize data for various users. ● Are an abstraction of the creation of a good database.
Analytical database ● Data warehouse ● Online analytical processing (OLAP) ○ Data warehouse and OLAP together create...
stores historical data and business metrics used exclusively for tactical or strategic decision-making. ○ Data warehouse: stores data in a format optimized for decision support. ○ Online analytical processing (OLAP): enables retrieving, processing and modeling data from the data warehouse. ○ Data warehouse and OLAP together create business intelligence, which is a comprehensive approach to gathering data and using it for decision-making.
Multi-user database ○ Workgroup databases ○ Enterprise database
supports multiple users at the same time. ○ Workgroup databases: supports a small number of users (< 50 users) or a specific department. ○ Enterprise database: supports many users (> 50 users) across many departments.
Relational algebra
theoretical way of manipulating table contents using relational operators.
Data redundancy ● Data anomaly
unnecessarily storing the same data in the different places. ● Poor data security. ● Data inconsistency. ● Increased likelihood of data-entry errors when complex entries are made in different files. ● Data anomaly: develops when not all of the required changes in the redundant data are made successfully; ○ Updated anomalies ○ Insertion anomalies ○ Deletion anomalies
Surrogate Keys
used by designers when the PK is considered to be unsuitable. ● System-defined attribute. ● Created and managed by the DBMS. ● Have a numeric value which is automatically incremented for each new row.
Associative entities (aka composite or bridge entities)
used to represent M:N relationships between two or more entities. ● Is in a 1:M relationship with parent entities (composed of the PK attributes of each parent entity). ● May also contain additional attributes that play no role in connective process.
Information
produced by the processing of data, giving data meaning. ● Enables knowledge creation. ● Should be accurate, relevant and timely. ● Is used to reveal the meaning of data.
Database ● End-user data ● Metadata
shared, integrated computer structure that stores a collection of: (1.) End-user data: raw facts that are important to the end-user. (2.) Metadata: data about data that is integrated and managed (describes characteristics and relationships of the data).
Conversion to Second Normal Form (2NF)
▸ Steps: (1.) Make new tables to eliminate partial dependencies. (2.) Reassign corresponding dependent attributes. ● Table is in 2NF when... ○ It's in 1NF. ○ Includes no partial dependencies. ---- ▹ Example (cont.): PROJECT (PROJ_NUM —> PROJ_NAME) EMPLOYEE (EMP_NUM —> EMP_NAME, JOB_CLASS, CHG_HOUR) ASSIGNMENT (PROJ_NUM, EMP_NUM —> ASSIGN_HOURS) ▸ Transitive dependency - (JOB_CLASS —> CHG_HOUR) ▸ PKs - (a.) PROJECT table = PROJ_NUM (b.) EMPLOYEE table = EMP_NUM (c.) ASSIGNMENT table = PROJ_NUM + EMP_NUM
Conversion to Third Normal Form (3NF) ● Determinant
▸ Steps: (1.) Make new tables to eliminate transitive dependencies. (2.) Reassign corresponding dependent attributes. - Determinant: any attribute whose value determines other values within a row. ● Table is in 3NF when... ○ It's in 2NF. ○ Contains no transitive dependencies. ---- ▹ Example (cont.): PROJECT (PROJ_NUM —> PROJ_NAME) EMPLOYEE (EMP_NUM —> EMP_NAME, JOB_CLASS) ASSIGNMENT (PROJ_NUM, EMP_NUM —> ASSIGN_HOURS) JOB (JOB_CLASS —> CHG_HOUR)
Entity-relationship model notation (Crows foot) ● 1:M ● M:N ● 1:1
● 1:M -||--------|<- ● M:N ->|--------|<- ● 1:1 -||--------||-
Characteristics of a relational table ● Rows ● Columns ○ Each intersection of a row and column represents...
● A table is perceived as a two-dimensional structure composed of rows and columns. ● Rows (tuples) - ○ Represent a single entity occurrence within the entity set. ● Columns - ○ Represent an attribute. ○ Must have a distinct name. ○ All values must conform to the same data format. ○ Has a specific range of values (an attribute domain). ● The order of rows and columns is immaterial to the DBMS. ● Each table must have an attribute or combination of attributes that uniquely identifies each row.
Database design
● Activities that focus on the design of the database structure that will be used to store and manage end-user data. ● A well-designed database facilitates data management and generates accurate and valuable information. ● Poorly designed databases cause difficult-to-trace errors. ● See figures 1.3 and 1.4 (splitting information into multiple tables increases efficiency and operability but runs the risk of data redundancy).
Advantages of a DBMS ○ Data quality
● Better data integration and less data inconsistency (different versions of the same data in different places). ● Increased user productivity. ● Better data sharing, data security, data access and decision-making. ○ Data quality: promoting accuracy, validity and timeliness of data.
Data dictionary and the system catalog ● Data dictionary ● System catalog ○ Homonym ○ Synonym
● Data dictionary: description of all tables in the database created by the user and the designer. ○ Contains metadata. ○ Often called the database designer's database. ---- ● System catalog: system data dictionary that describes all objects within the database; - i.e., similar to the data dictionary but with much more data. ---- ● Homonyms and synonyms must be avoided to lessen confusion. ○ Homonym: same name is used to label different attributes. ○ Synonym: different names are used to describe the same attribute.
DBMS functions - ● Data dictionary ● Data storage management ● Data transformation and presentation ● Security management ● Multi-user access control ● Backup and recovery management ● Data-integrity management ● Database access languages and application programming interfaces ● Database-communicated interfaces ○ Disadvantages
● Data dictionary: used by the DBMS to lookup the required data-component structures and relationships. ● Data storage management - ○ Performance tuning: ensures efficient performance of the database in terms of storage and access speed. ● Data transformation and presentation: transforms entered data to conform to required data structures. ● Security management: enforces user security and data privacy. ● Multi-user access control: sophisticated algorithms that ensure that multiple users can access the database at the same time without compromising its integrity. ● Backup and recovery management: enables recovery of the database after a failure. ● Data-integrity management: minimizes redundancy and maximizes consistency. ● Database access languages and application programming interfaces - (1.) Query language: lets the user specify what must be done without having to specify how. (2.) Structured Query Language (SQL): de facto query language and data-access standard supported by the majority of DBMS vendors. ● Database-communicated interfaces: accepts end-user requests via multiple, different network, environments. ---- ○ Disadvantages of database systems: - Increased costs - Management complexity - Maintaining currency - Vendor dependence - Frequent update/replacement cycles
Database design challenges - Conflicting goals
● Database design must conform to design standards. ● Need for high processing speed may limit the number and complexity of logically desirable relationships. ● Need for maximum information generation may lead to loss of clean design structures and high transaction speed.
Naming conventions ● Entity names ● Attribute names
● Entity names: are required to be descriptive of the objects in the business environment and use terminology that is familiar to the end-users. ● Attribute names: are required to be descriptive of the data represented by the attribute. ● Proper naming facilitates communication between parties and promotes self-documentation.
Data model basic building blocks - ● Entity ● Relationship ● Constratint
● Entity: a uniqueness about how the data will be stored. (1.) Attribute: a characteristic of an entity; e.g., Entity = customer, Attribute = age (as a numerical value). ● Relationship: describes an association among entities; (1.) One-to-many (1:M) (2.) Many-to-many (M:N or M:M) (3.) One-to-one (1:1) ● Constraint: set of rules to ensure data integrity; e.g., GPA must be between 0.00 and 4.00.
Requirements for a good normalized set of tables ○ Atomic attribute ○ Atomicity ○ Granularity
● Evaluate PK assignments and naming conventions. ● Refine attribute atomicity. ○ Atomic attribute: cannot be further subdivided. ○ Atomicity: characteristic of an atomic attribute. ● Identify new attributes and new relationships. ● Refine PKs as required for data granularity. ○ Granularity: level of detail represented by the values stored in a table's row. ● Maintain historical accuracy and evaluate using derived attributes.
Boyce-Codd Normal Form (BCNF) ○ Candidate Key
● Every determinant in the table should be a Candidate Key. ○ Candidate Key: same characteristics as PK but not chosen to be the PK. ● Equivalent to 3NF when the table contains only one Candidate Key. ● Violated only when the table contains more than one Candidate Key. ● Considered to be a special case of 3NF. ---- ▹ Example: ■ 1NF - (A, B —> C, D) □ Partial dependency: (B —> C) ---- ▹ Example (cont.): ■ 3NF and BCNF - (A, C —> D) (C —> B)
Object/Relational and XML *** (7.) Extended relational model (ERDM) ● Object/Relational Database Management System (O/R DBMS) ● Extensible markup language (XML)
● Extended relational model (ERDM): supports OO features and complex data representation. ● Object/Relational Database Management System (O/R DBMS): based on ERDM but focuses on better data management. ● Extensible markup language (XML): manages unstructured data for efficient and effective exchange of all data types.
Conversion to First Normal Form (1NF) ● Repeating group ○ Steps... ● Dependency diagram ○ 1NF describes tabular format in which...
● Repeating group: group of multiple entries of the same type can exist for any single key attribute occurrence (existence proves the presence of data redundancies). ● Enables reducing data redundancies. ▸ Steps: (1.) Eliminate the repeating groups. (2.) Identify the PK. (3.) Identify all dependencies. ● 1NF describes tabular format in which: - All key attributes are defined. - There are no repeating groups in the table. - All attributes are dependent on the PK. ● All relational tables satisfy 1NF requirements. ● Some tables contain partial dependencies (subject to data redundancies and various anomalies). ---- ▹ Example: ■ Partial dependencies - (PROJ_NUM —> PROJ_NAME) (EMP_NUM —> EMP_NAME, JOB_CLASS, CHG_HOUR) ▸ Transitive dependency - (JOB_CLASS —> CHG_HOUR) ▸ PKs - (a.) PROJ_NUM (b.) EMP_NUM
Database Abstraction Levels ● External model ● Conceptual model ○ Conceptual schema ○ Logical design ● Internal model ● Physical model ○ Physical independence
● External model: end-users' view of the data environment. ○ ER diagrams are used to represent the external views (divided up into sections). ○ External schema: specific representations of an external view. ---- ● Conceptual model: represents a global view of the entire database by the entire organization. ○ Conceptual schema: basis for the identification and high-level description of the main data objects. ○ Has a macro-level view of the data environment. ○ Is software and hardware independent. ○ Logical design: task of creating a conceptual data model. ---- ● Internal model: representing database as seen by the DBMS mapping conceptual model to the DBMS. ○ Internal schema: specific representation of an internal model; uses the database constructs supported by the chosen database. ○ Is software dependent and hardware independent. ○ Logical independence: changing the internal model without affecting the conceptual model. ---- ● Physical model: describes the way data are saved on storage media such as disks/tapes. ○ Operates at the lowest level of abstraction. ○ Requires the definition of physical storage and data access methods. ○ Relational model aimed at logical level and does not require physical-level details. ○ Physical independence: changes in physical model do not affect internal model.
Basic File terminology - ● Field ● Record ● File
● Field: a character or group of characters that has a specific meaning; a field is used to define and store data. ● Record: a logically connected set of one or more fields that describes something; ○ e.g., the fields that constitute a record for a customer might consist of their name, address, phone number, date of birth, etc. ● File: a collection of related records; ○ e.g., a file might contain data about the students currently enrolled at a university.
Functional dependencies ● Functional dependence ● Fully-functional dependence (Composite Key) ● Partial dependency ● Transitive dependency
● Functional dependence: the attribute B is fully-functionally dependent on the attribute A if each value of A determines one and only one value of B. - Generalized definition: attribute A determines attribute B if all of the rows in the table that agree in value for attribute A also agree in value for attribute B. ● Fully-functional dependence (Composite Key): if attribute B is functionally dependent on a Composite Key A but not on any subset of that Composite Key, the attribute B is fully-functionally dependent on A. ● Partial dependency: functional dependence in which the determinant is only part of the PK. ● Transitive dependency: an attribute functionally depends on another nonkey attribute.
Problems with file system data processing
● Lengthy development times. ● Difficulty getting quick answers. ● Complex system administration. ● Lack of security and limited data sharing. ● Extensive programming.
Normalization and database design
● Normalization should be part of the design process. ● Proposed entities must meet the required form before table structures are created. ● Principles and normalization procedures need to be understood to be understood to redesign and modify databases. - ERD is created through iterative process. - Normalization focuses on the characteristics of specific entities. ● Even if attributes are already in one table, it is better to create a new table (and create a new PK) if there are attributes that depend on each other within the original table. ... ○ If you have an EMPLOYEE table with: (PK) EMP_NUM EMP_FNAME EMP_INITIAL EMP_LNAME JOB_CODE JOB_DESCRIPTION ○ Create a JOB table with: (PK) JOB_CODE JOB_DESCRIPTION ○ Then have JOB_CODE be a FK in the EMPLOYEE table.
NoSQL Databases (Not using SQL) ● Advantages ● Disadvantages
● Not based on the relational model. ● Supports distributed database architectures. ● Provides high scalability, high availability and fault tolerance. ● Supports large amounts of sparse data. ● Geared towards performance, rather than transaction consistency. ● Typically stores data in key-value stores. ---- ○ Advantages: - High scalability, availability and fault tolerance are provided. - Uses low-cost commodity hardware. - Supports Big data. - Key-value model improves storage efficiency. ---- ○ Disadvantages - Complex programming is required. - There is no relationship or transaction integrity support. - In terms of data consistency, it provides an eventually consistent model.
Translating business rules into data model components ○ Nouns are... ○ Verbs are... ○ Relationships are...
● Nouns —> entities ● Verbs —> relationships among entities ● Relationships are bi-directional - ○ Questions to help identity the relationship type: (1.) How many instances of B are related to one instance of A? (2.) How many instances of A are related to one instance of B?
Normalization process
● Objective is to ensure that each table conforms to the concept of well-formed relations. - Each table represents a single subject. - No data item will be unnecessarily stored in more than one table. - All nonprime attributes in a table are dependent on the PK. - Each table is void of insertion, update and deletion anomalies. ● Ensures that all tables are in at least 3NF. ● Works one relation at a time.
Types of Keys ● Primary Key (PK) ● Foreign Key (FK) ● Secondary Key (SK) ● Composite Key ● Superkey ● Candidate Key
● Primary Key (PK): attribute or combination of attributes that uniquely identifies a row. - Requirement: all PK entries are unique; no part of a PK may be null. - Purpose: each row will have a unique identity, and FK values can properly reference PK values; ○ e.g., no invoice can have a duplicate number or be null. ---- ● Foreign Key (FK): an attribute or combination of attributes in one table whose values must either match the PK in another table or be null. - Requirement: a FK may may have either a null entry or an entry that matches a PK in a table to which it is related. - Purpose: it is possible for an attribute to not have a corresponding value, but it is impossible to have an invalid entry; it is impossible to delete a row in a table whose PKs have mandatory matching FK values in another table; ○ e.g., it is impossible to have an invalid Sales Representative Number. ---- ● Secondary Key (SK): an attribute or combination of attributes strictly for data retrieval purposes. ▸ Not all customers may remember their Order Number, so have them enter their last name, zip code and phone number in order to retrieve their order information; - That data is stored only for facilitation of their data retrieval. ---- ● Superkey: a key that can uniquely identify every single attribute in the row; - It is any set of columns that, combined together, are unique. - There are typically many Superkeys per table, and the same column may be shared by many Superkeys. - They are not very useful by themselves, and are more of a mental tool for identifying Candidate Keys. ---- ● Candidate Key: a minimal (irreducible) Superkey; a Superkey that does not contain a subset of attributes that is itself a Superkey; - If any column is removed it would no longer be unique. - There are typically significantly fewer Candidate Keys than Superkeys. ---- ● Composite Key: a key that is composed of more than one attribute. - It is a minimal Superkey that has multiple columns.
Relational set operators ● Select (restrict) ● Project ● Union ○ Union-compatible ● Intersect ● Difference ● Product ● Divide ○ Join (1.) Natural join - Join columns (2.) Theta join (3.) Equijoin (4.) Inner join (5.) Outer join (5a.) Left outer join (5b.) Right outer join
● Select (restrict): unary operator that yields a horizontal subset of a table. ○ Gives you certain rows (same number of columns) from one table to answer a question about the data; - e.g., selecting products with price < $2 ---- ● Project: unary operator that yields a vertical subset of a table. ○ Takes a whole column from a certain table; - e.g., just Product Code column. ---- ● Union: combines all rows from two tables, excluding duplicate rows. ▸ Union-compatible: tables share the same number of columns and their corresponding columns share compatible domains; ○ i.e., the same table (with the same number of columns) but with additional rows (because the two tables are combined). ---- ● Intersect: yields only the rows that appear in both tables. ○ Tables must be union-compatible to yield valid results. ○ i.e., a filter. ---- ● Difference: yields all rows in one table that are not found in the other table. ○ Tables must be union-compatible to yield valid results. ○ Rows from table A minus rows from table B. ---- ● Product: yields all possible pairs of rows from two tables. ○ If table A has 6 rows/3 columns and table B has 3 rows/3 columns, the product table will have every combination of the two tables (i.e., 18 rows/6 columns). ---- ● Divide: uses one 2-column table as the dividend and one single-column table as the divisor. ○ Output is a single column that contains all values from the second column of the dividend that are associated with every row of the divisor; ○ e.g., if you have a table with the columns CODE and LOC, you divide it by the column CODE and get one entry of the column LOC. ---------------- ● Join: allows information to be intelligently combined from two or more tables; types: (1.) Natural join: links tables by selecting only the rows with common values in their common attributes. ▸ Join columns: common columns. (2.) Theta join: extension of natural join, denoted by adding a theta subscript after the JOIN symbol. (3.) Equijoin: links tables on the basis of an equality condition that compares specific columns of each table. (4.) Inner join: only returns matched records from the tables that are being joined. ---- (5.) Outer join: matched pairs are retained and unmatched values in other table are left null. (5a.) Left outer join: yields all of the rows in the first table, including those that do not have a matching value in the second table. (5b.) Right outer join: yields all of the rows in the second table, including those that do not have matching values in the first table.
Normalization process (cont.) ● Starts by... ▸ 1NF ▸ 2NF ▸ 3NF ▹ BCNF ▸ 4NF
● Starts by: (a.) Identifying the dependencies of a relation (table). (b.) Progressively breaking the relation into new sets of relations. ---- ▸ First Normal Form (1NF) - Table format, no repeating groups, PK identified. ▸ Second Normal Form (2NF) - 1NF and no partial dependencies. ▸ Third Normal Form (3NF) - 2NF and no transitive dependencies. ---- ▹ Boyce-Codd Normal Form (BCNF) - Every determinant is a Candidate Key. (*** Special case of 3NF). ▸ Fourth Normal Form (4NF) - 3NF and no independent multivalued dependencies.
Fourth Normal Form (4NF)
● Table is in 4NF when... ○ It's in 3NF. ○ Has no multivalued dependencies. ▸ Rules: (a.) All attributes must be dependent on the PK, but they must be independent of each other. (b.) No row can contain two or more multivalued facts about an entity.
ERD depicts
● The conceptual database as viewed by the end-user. ● The database's main components: entities, attributes and relationships.