infs final
end-users
business-users; users using a database system to support their tasks and processes
row
tuple OR record
relationship
ER modeling construct depicting how entities are related; Within an ER diagram, each entity must be related to at least one other entity via a relationship
Mapping multiple relationships between the same entities
Each relationship is mapped
form
Enables data input and retrieval for end users; Provides an interface into a database relation or query
transitive functional dependency
occurs when nonkey columns functionally determine other nonkey columns of a relation; Nonkey column is a column in a relation that is neither a primary nor a candidate key column.
column
attribute OR field
update anomalies
anomalies in relations that contain redundant (unnecessarily repeating) data, caused by update operations: Insertion anomaly Deletion anomaly Modification anomaly
multivalued attribute
attribute for which instances of an entity can have multiple values for the same attribute; Used in cases in which there is a variable number of values that can be assigned to the particular attribute of the entity
partial key
attribute of a weak entity that combined with the unique attribute of the owner entity uniquely identifies the weak entity's instances; Combination of the partial key and the unique attribute from the owner entity uniquely identifies every instance of the weak entity; in 1:M -> weak entity must have a partial key attribute; in 1:1 -> weak entity doesn't need to have a partial key attribute
front-end applications
provide a mechanism for easy interaction between the users and the DBMS
operational (transactional) information
the information collected and used in support of day to day operational needs in businesses and other organizations
mapping 1:M relationships
The relation mapped from the entity on the M side of the 1:M relationship has a foreign key that corresponds to the primary key of the relation mapped from the 1 side of the 1:M relationship
database system
computer-based system whose purpose is to enable an efficient interaction between the users and information captured in a database
analytical information
the information collected and used in support of analytical tasks; is based on operational (transactional) information
dimensionally modeled data warehouse
Data warehouse as a collection of dimensionally modeled intertwined data marts (i.e. constellation of dimensional models) that integrates analytically useful information from the operational data sources Same as the normalized data warehouse approach when it comes to the utilization of operational data sources and the ETL process Dimensionally modeled data warehouse approach was championed by Ralph Kimball, and hence, it is often referred to as the Kimball approach. -A set of commonly used dimensions known as conformed dimensions is designed first -Fact tables corresponding to the subjects of analysis are then subsequently added -A set of dimensional models is created where each fact table is connected to multiple dimensions, and some of the dimensions are shared by more than one fact table -In addition to the originally created set of conformed dimensions, additional dimensions are included as needed -The result is a data warehouse that is a collection of intertwined dimensionally modeled data marts, i.e. a constellation of stars -Can be used as a source for dependent data marts and other views, subsets, and/or extracts
conceptual database model
a visualization of requirements by using a conceptual data modeling technique (such as entity-relationship (ER) modeling)
optional attribute
attribute that is allowed to not have a value
candidate keys
multiple unique attributes; when an entity has more than one unique attribute each unique attribute is also called a candidate key; one is chose later as the "primary key" for the table corresponding to the entity that contains the candidate keys
relationship instances
occurrences of a relationship; Occur when an instance of one entity is related to an instance of another entity via a relationship; Relationship themselves are depicted in the ER diagrams while relationship instances are not; Relationship instances are eventually recorded in the database that is created based on the ER diagram
entity instances (entity members)
occurrences of an entity; Entities themselves are depicted in the ER diagrams while entity instances are not; Entity instances are eventually recorded in the database that is created based on the ER diagram
partial functional dependency
occurs when a column of a relation is functionally dependent on a component of a composite primary key; -Only composite primary keys have separate components, while single-column primary keys do not have separate components -Hence, partial functional dependency can occur only in cases when a relation has a composite primary key
data warehouse modeling (logical data warehouse modeling)
creation of the data warehouse data model that is implementable by the DBMS software
logical database modeling
creation of the database model that is implementable by the DBMS software; follows conceptual database modeling
cardinality constraints
depict how many instances of one entity can be associated with instances of another entity; Maximum cardinality: -One (represented by a straight bar: I) -Many (represented by a crow's foot symbol) Minimum cardinality (participation): -Optional (represented by a circular symbol: 0) -Mandatory (represented by a straight bar: I)
attribute
depiction of a characteristic of an entity; Represents the details that will be recorded for each entity instance; Within one entity, each attribute must have a different name
type 1
-Changes the value in the dimension's record -The new value replaces the old value. -No history is preserved -The simplest approach, used most often when a change in a dimension is the result of an error
types of relationships
1:1 1:M M:N; maximum cardinality-wise
mapping ternary relationships
-Ternary relationships are used as many-to-many-to-many relationships -A new relation is created with foreign keys from the participating entities forming a composite primary key of the new relation
Mapping 1:1 relationships
1:1 relationships are mapped in the same way as 1:M relationships; One of the resulting relations will have a foreign key pointing to the primary key of another resulting relation; One of the mapped relations is chosen to have a foreign key referring to the primary key of the other mapped relation; In cases when there is no particular advantage in choosing which resulting relation will include a foreign key, the choice can be arbitrary In other cases one choice can be more efficient than the other
normalized data warehouse
Envisions a data warehouse as an integrated analytical database modeled by using the traditional database modeling techniques of ER modeling and relational modeling, resulting in a normalized relational database schema -Populated with the analytically useful data from the operational data sources via the ETL process -Serves as a source of data for dimensionally modeled data marts and for any other non-dimensional analytically useful data sets Data warehouse as a normalized integrated analytical database was first proposed by Bill Inmon, and hence, the normalized data warehouse approach is often referred to as the Inmon approach.
detailed vs aggregated fact tables
Fact tables in a dimensional model can contain either detailed data or aggregated data In detailed fact tables -each record refers to a single fact In aggregated fact tables -each record summarizes multiple facts
source systems
In the context of data warehousing, source systems are operational databases and other operational data repositories (in other words, any sets of data used for operational purposes) that provide analytically useful information for the data warehouse's subjects of analysis Every operational data store that is used as a source system for the data warehouse has two purposes: 1) The original operational purpose 2) As a source system for the data warehouse Source systems can include external data sources like market research data, census data, stock market data, weather data.
slowly changing dimension
Typical dimension in a star schema contains: -Attributes whose values do not change (or change extremely rarely) such as store size and customer gender -Attributes whose values change occasionally and sporadically over time, such as customer zip and employee salary. Dimension that contains attributes whose values can change referred to as a slowly changing dimension Most common approaches to dealing with slowly changing dimensions Type 1 Type 2 Type 3
Composite primary key
a primary key that is composed of multiple columns; Column names of a composite primary key are underlined, because combined together they form the primary key
relationship roles
additional syntax that can be used in ER diagrams at the discretion of a data modeler to clarify the role of each entity in a relationship; can be used in relationships of any degree, but their usefulness is most apparent when they are used in unary relationships.
composite unique attribute
attribute that is composed of several attributes and whose value is different for each entity instance
composite attribute
attribute that is composed of several attributes; Not an additional attribute of an entity; Its purpose is to indicate a situation in which a collection of attributes has an additional meaning, besides the individual meanings of each attribute; can share components with other composite attributes
unique attribute
attribute whose value is different for each entity instance; Every regular entity must have at least one unique attribute
derived attribute
attribute whose values are calculated and not permanently stored in a database; The value of a derived attribute is calculated from the stored values of other attributes and/or additional available data (such as current date).
relational database
collection of related relations within which each relation has a unique name
analytical database
collects and presents analytical information in support of analytical tasks
operational database
collects and presents operational information in support of daily operational procedures and processes
primary key
column (or a set of columns) whose value is unique for each row -Each relation must have a primary key -The name of the primary key column is underlined in order to distinguish it from the other columns in the relation
operational vs analytical data
data makeup differences: -typical time-horizon -> days/months vs years -detailed vs summarized (and/or detailed) -current vs values over time (snapshots) technical differences: -small vs large amounts used in a process -high vs low/modest frequency of access -can be updated vs read (and append) only -non redundant vs redundancy not an issue functional differences: -used by all types of employees for tactical purposes vs used by a narrower set of users for decision making -application oriented vs subject oriented
metadata
data that describes the structure and the properties of the data; is essential for the proper understanding and use of the data
DBAs
database administrators; perform the tasks related to the maintenance and administration of a database system
developing front-end (BI) applications
designing and creating applications for indirect use by the end-users -Front-end applications are included in most data warehousing systems and are often referred to as business intelligence (BI) applications -Front-end applications contain interfaces (such as forms and reports) accessible via a navigation mechanism (such as a menu) The design and creation of data warehouse front-end applications can take place in parallel with data warehouse creation. For example, the look and feel of the front-end applications, as well as the number and functionalities of particular components (e.g., forms and reports), can be determined before the data warehouse is implemented. This can be done based on the data warehouse model and the requirements specifying the capabilities and functionalities of the system needed by the end users. The actual creation of the front-end application involves connecting it to the implemented data warehouse, which can only be done once the data warehouse is implemented.
next version of the database
follows the same development steps as the initial version; In most cases, after a certain period of use, the need for modifications and expansion of the existing database system becomes apparent, and the development of a new version of the existing database system is initiated.
unary relationship
recursive relationship; occurs when an entity is involved in a relationship with itself; degree 1 relationship
information
refers to the data that is accessed by a user for some particular purpose; typically, getting the needed information from a collection of data requires performing an activity, such as searching through, processing, or manipulating the data in some form or fashion
degree of a relationship
reflects how many entities are involved in the relationship
database metadata
represents the structure of the database; database content that is not the data itself (data about the data); contains: names of data structures, data types, data descriptions, other information describing the characteristics of the data
database end users
use a database system to support their work- or life-related tasks and processes; Users differ in: -Level of technical sophistication -Amount of data that they need -Frequency with which they access the database system; Arguably, database end users are the most important category of people involved with database systems. They are the reason for the existence of database systems; The quality of a database system is measured by how quickly and easily it can provide the accurate and complete information needed by its end users.
check
used to specify a constraint on a particular column of a relation
database implementation
using a DBMS to implement the database model as an actual database; most modern databases are implemented using a relational DBMS (RDBMS) software; SQL is a language used by most relational DBMS software packages-includes commands for creating, modifying, and deleting database structures
relational schema
visual depiction of the relational database model
fact tables
-Contain measures related to the subject of analysis and the foreign keys (associating fact tables with dimension tables) -The measures in the fact tables are typically numeric and are intended for mathematical computation and quantitative analysis For example, if the subject of the business analysis is sales, one of the measures in the fact table sales could be the sale's dollar amount. The sale amounts can be calculated and recalculated using different mathematical functions across various dimension columns. For example, the total and average sale can be calculated per product brand, customer gender, customer income level, and so on. Additional possible fact attributes: A fact table contains -Foreign keys connecting the fact table to the dimension tables -The measures related to the subject of analysis In addition to the measures related to the subject of analysis, in certain cases fact tables can contain other attributes that are not measures Two of the most typical additional attributes that can appear in the fact table are: -Transaction identifier -Transaction time
type 2
-Creates a new additional dimension record using a new value for the surrogate key every time a value in a dimension record changes -Used in cases where history should be preserved Can be combined with the use of timestamps and row indicators Timestamps -columns that indicates the time interval for which the values in the records are applicable Row indicator -column that provides a quick indicator of whether the record is currently valid
ER modeling vs normalization
-ER modeling followed by mapping into a relational schema is one of the most common database design methods -When faced with a non-normalized table, instead of identifying functional dependencies and going through normalization to 2NF and 3NF, a designer can analyze the table and create an ER diagram based on it (and subsequently map it into a relational schema)
type 3
-Involves creating a "previous" and "current" column in the dimension table for each column where changes are anticipated -Applicable in cases in which there is a fixed number of changes possible per column of a dimension, or in cases when only a limited history is recorded. -Can be combined with the use of timestamps
independent data marts
-Stand-alone data marts are created by various groups within the organization, independent of other stand-alone data marts in the organization -Consequently, multiple ETL systems are created and maintained -Independent data marts are considered an inferior strategy -Inability for straightforward analysis across the enterprise -The existence of multiple unrelated ETL infrastructures -In spite of obvious disadvantages, a significant number of corporate analytical data stores are developed as a collection of independent data marts This strategy does not result in a data warehouse, but in a collection of unrelated independent data marts. While the independent data marts within one organization may end up as a whole, containing all the necessary analytical information, such information is scattered and difficult or even impossible to analyze as one unit.
star schema
-The result of dimensional modeling is a dimensional schema containing facts and dimensions -The dimensional schema is often referred to as the star schema In the star schema, the chosen subject of analysis is represented by a fact table; Designing the star schema involves considering which dimensions to use with the fact table representing the chosen subject For every dimension under consideration, two questions must be answered: Question 1: Can the dimension table be useful for the analysis of the chosen subject? Question 2: Can the dimension table be created based on the existing data sources?
steps in the development of database systems
1) database requirements -collection, definition, visualization (conceptual modeling) 2) database modeling (logical modeling) 3) database implementation/ developing front-end applications 4) database deployment 5) database use/ database administration and maintenance
snowflake model
A star schema that contains the dimensions that are normalized Snowflaking is usually not used in dimensional modeling Not-normalized (not snowflaked) dimensions provide for simpler analysis Normalization is usually not necessary for analytical databases Analytical databases are typically read only. Hence, no danger of update anomalies.
weak entity
ER diagram construct depicting an entity that does not have a unique attribute of its own
index
Mechanism for increasing the speed of data search and data retrieval on relations with a large number of records; Most relational DBMS software tools enable definition of indexes; Instead of simply sorting on the indexed column and applying binary search, different contemporary DBMS tools implement indexes using different logical and technical approaches, such as: -Clustering indexes -Hash indexes -B+ trees -etc. Each of the available approaches has the same goal - increase the speed of search and retrieval on the columns that are being indexed; CREATE INDEX Example: CREATE INDEX custname_index ON customer(custname); -Once this statement is executed, the effect is that the searches and retrievals involving the CustName column in the relation CUSTOMER are faster DROP INDEX Example: DROP INDEX custname_index ON customer(custname); This statement drops the index, and the index is no longer used
data warehouse components
Source systems Extraction-transformation-load (ETL) infrastructure Data warehouse (target system) Front-end application
Mapping unary relationships
Unary relationships in ER diagrams are mapped in the same way as binary relationships Mapping 1:M unary relationships: -The relation mapped from an entity involved in a 1:M unary relationship contains a foreign key that corresponds to its own primary key Mapping M:N unary relationships: -In addition to the relation representing the entity involved in a unary M:N relationship, another relation is created to represent the M:N relationship itself -This new relation has two foreign keys, both of them corresponding to the primary key of the relation representing the entity involved in the unary M:N relationship -Each of the foreign keys is used as a part of the composite primary key of the new relation Mapping 1:1 unary relationships: -Mapped in the same way as 1:M unary relationships
data warehouse front-end (BI) applications
Used to provide access to the data warehouse for users who are engaging in indirect use
business rules
User defined constraints that specify restrictions on databases that are not a part of the standard notation for creating ER diagrams
associative entity
construct used as an alternative way of depicting M:N relationships; Associative entities do not have unique or partially unique attributes, and often do not have any attributes at all; For relationships with a degree higher than 2 such as ternary relationships, associative entities provide a way to eliminate potential ambiguities in the ER diagrams; Associative entities are not necessary constructs for depicting binary or unary relationships (for binary or unary relationships, associative entities are simply another way of depicting a relationship).
entities
constructs that represent what the database keeps track of; The basic building blocks of an ER diagram; Represent various real world notions, such as people, places, objects, events, items, and other concepts; Within one ERD each entity must have a different name
Autonumber data type option
enables automatic generation of consecutive numeric data values in a column
new version of the data warehouse
follows the same development steps as the initial version; In most cases, after a certain period of use, the need for modifications and expansion of the existing data warehouse becomes apparent, and the development of a new version of the existing data warehouse system is initiated.
creating the data warehouse
using a DBMS to implement the data warehouse data model as an actual data warehouse -Typically, data warehouses are implemented using a relational DBMS (RDBMS) software Regular relational DBMS packages, such as Microsoft SQL Server or Oracle, can be used for implementing both operational databases and data warehouses. However, other relational DBMS packages, such as Teradata, that are specialized for processing large amounts of data typically found in data warehouses are also available for implementation of data warehouses. Creating data marts can be accomplished in the same fashion as creating data warehouses by using relational DBMS packages. In some cases, data marts can also be implemented as so-called "cubes," using a data management technology different from relational DBMS as described in Chapter 9.
denormalization
-reversing the effect of normalization by joining normalized relations into a relation that is not normalized, in order to improve query performance -The data that resided in fewer relations prior to normalization is spread out across more relations after normalization -This has an effect on the performance of data retrievals -Denormalization can be used in dealing with the normalization vs. performance issue -Denormalization is not a default process that is to be undertaken in all circumstances -Instead, denormalization should be used judiciously, after analyzing its costs and benefits
Relational database constraints
-rules that a relational database has to satisfy in order to be valid Implicit constraints: -The implicit relational database model rules that a relational database must satisfy in order to be valid User-defined constraints: -Database constraints that are added by the database designer
data mart
A data store based on the same principles as a data warehouse, but with a more limited scope; independent data mart: -Stand-alone data mart, created in the same fashion as the data warehouse -Independent data mart has its own source systems and ETL infrastructure Dependent data mart: -Does not have its own source systems -The data comes from the data warehouse -Dependent data marts provide users with a subset of the data from the data warehouse, in cases when users or applications do not want, need, or are not allowed to have access to all the data in the entire data warehouse.
relationship attributes
In some cases M:N relationships can actually have attributes of their own
data quality
The data in a database is considered of high quality if it correctly and non-ambiguously reflects the real-world it is designed to represent ; Data quality characteristics: Accuracy -the extent to which data correctly reflects the real-world instances it is supposed to depict Uniqueness: -requires each real-world instance to be represented only once in the data collection -The uniqueness data quality problem is sometimes also referred to as data duplication Completeness: -the degree to which all the required data is present in the data collection Consistency: -the extent to which the data properly conforms to and matches up with the other data Timeliness: -the degree to which the data is aligned with the proper time window in its representation of the real world -Typically, timeliness refers to the "freshness" of the data Conformity: -the extent to which the data conforms to its specified format
foreign key
column in a relation that refers to a primary key column in another (referred) relation; A mechanism that is used to depict relationships in the relational database model; For every occurrence of a foreign key, the relational schema contains a line pointing from the foreign key to the corresponding primary key
DBMS
database management system; software used for: creation of databases, insertion, storage, retrieval, update, and deletion of the data in the database, maintenance of databases
requirements collection, definition, and visualization
results in the requirements specifying which data the future database system will hold and in what fashion, and what the capabilities and functionalities of the database system will be; the collected requirements should be clearly defined and stated in a written document, and then visualized; the first and most critical step in the development of the database
dimensional modeling
-A data design methodology used for designing subject-oriented analytical databases, such as data warehouses or data marts -Commonly, dimensional modeling is employed as a relational data modeling technique -In addition to using the regular relational concepts (primary keys, foreign keys, integrity constraints, etc.) dimensional modeling distinguishes two types of tables: 1) Dimensions 2) Facts -A modeling technique tailored specifically for analytical database design purposes -Regularly used in practice for modeling data warehouses and data marts -as a relational modeling technique, dimensional modeling, just like standard relational modeling, designs relational tables that have primary keys and are connected to each other via foreign keys, while conforming to the standard relational integrity constraints
2NF
-A table is in 2NF if it is in 1NF and if it does not contain partial functional dependencies -If a relation has a single-column primary key, then there is no possibility of partial functional dependencies -Such a relation is automatically in 2NF and it does not have to be normalized to 2NF -If a relation with a composite primary key has partial dependencies, then it is not in 2NF, and it has to be normalized it to 2NF; -For a relational table, the process of normalization starts by examining if a relational table is in 2NF.; -Normalization of a relation to 2NF creates additional relations for each set of partial dependencies in a relation -The primary key of the additional relation is the portion of the primary key that functionally determines the columns in the original relation -The columns that were partially determined in the original relation are part of the additional table -The original table remains after the process of normalizing to 2NF, but it no longer contains the partially dependent columns
Mapping entities into relations
-Each regular entity becomes a relation -Each regular attribute of a regular entity becomes a column of the newly created relation -If an entity has a single unique attribute, then that attribute becomes the primary key in the resulting mapped relation composite attributes: -Each component of a composite attribute is mapped as a column of a relation -The composite attribute itself does not appear in the mapped relation unique composite attributes: -mapped as a relation with a composite primary key optional attributes: -mapped as an optional column candidate keys (multiple unique attributes): -One of the candidate keys is chosen by the database designer as the primary key during the mapping process; -Other candidate keys are mapped as non-primary key columns (U) multivalued attributes: -An entity containing the multi-valued attribute is mapped without the multi-valued attribute -The multi-valued attribute is mapped as a separate relation that has a column representing the multi-valued attribute and a foreign key column referring to the primary key of the relation resulting from the entity itself -Both of these columns form a composite primary key for the separate relation derived attributes: -Derived attributes are not mapped as a part of the relational schema -They are implemented as a part of the database front-end application
implicit constraints
-Each relation in a relational schema must have a different name -Each relation must satisfy the following conditions: --Each column must have a different name --Each row must be unique --In each row, each value in each column must be single valued --Domain constraint - all values in each column must be from the same predefined domain --The order of columns is irrelevant --The order of rows is irrelevant -Primary key constraint - each relation must have a primary key, which is a column (or a set of columns) whose value is unique for each row -Entity integrity constraint -Referential integrity constraint
designer-added entities (tables) and keys
-Even if a relation is in 3NF additional opportunities for streamlining database content may still exist -Designer-added entities (tables) and designer-added keys can be used for additional streamlining -Augmenting databases with designer added tables and keys is not a default process that is to be undertaken in all circumstances -Instead, augmenting databases with designer added tables and keys should be done judiciously, after analyzing pros and cons for each augmentation
update operations
Insert operation: Used for entering new data in the relation Delete operation: Used for removing data from the relation Modify operation: Used for changing the existing data in the relation; Update operation terminology note: in practice there are two different uses of the term "update operation": A) Update operation as a collective term for insert, delete and modify operations B) Update operation as a synonym for the modify operation In this chapter we will use the term update operation as defined in A)
1NF
A table is in 1NF if each row is unique and no column in any row contains multiple values; -1NF states that each value in each column of a table must be a single value from the domain of the column; -Every relational table is, by definition, in 1NF; -Related multi-valued columns - columns in a table that refer to the same real-world concept (entity) and can have multiple values per record; -Normalizing to 1NF involves eliminating groups of related multi-valued columns; -Normalizing to 1NF is not a part of the normalization process of relational databases.; -Normalizing to 1NF is done only when a non-relational table is being converted to a relation.
3NF
A table is in 3NF if it is in 2NF and if it does not contain transitive functional dependencies; -For a relational table that is in 2NF, the process of normalization continues by examining if a relational table is in 3NF.; -Normalization of a relation to 3NF creates additional relations for each set of transitive dependencies in a relation. -The primary key of the additional relation is the nonkey column (or columns) that functionally determined the nonkey columns in the original relation -The nonkey columns that were transitively determined in the original relation are part of the additional table. -The original table remains after normalizing to 3NF, but it no longer contains the transitively dependent columns
characteristics of dimensions and facts
A typical dimension contains relatively static data, while in a typical fact table, records are added continually, and the table rapidly grows in size. In a typical dimensionally modeled analytical database, dimension tables have orders of magnitude fewer records than fact tables
datawarehouse
A typical organization maintains and utilizes a number of operational data sources.; The operational data sources include the databases and other data repositories which are used to support the organization's day-to-day operations ; A data warehouse is created within an organization as a separate data store whose primary purpose is data analysis.; Two main reasons for the creation of a data warehouse as a separate analytical database 1) The performance of operational day-to-day tasks involving data use can be severely diminished if such tasks have to compete for computing resources with analytical queries 2) It is often impossible to structure a database which can be used in an efficient manner for both operational and analytical purposes
dimensional modeling vs ER modeling
Both ER modeling and dimensional modeling are viable alternatives for modeling data warehouses/data marts, and can be used within the same project For example, dimensional modeling can be used during the requirements collection process for collecting, refining, and visualizing initial requirements. Based on the resulting requirements visualized as a collection of facts and dimensions, an ER model for a normalized physical data warehouse can be created if there is a preference for a normalized data warehouse. Once a normalized data warehouse is created, a series of dependent data marts can be created using dimensional modeling.
creating a data warehouse
Involves using the functionalities of database management software to implement the data warehouse model as a collection of physically created and mutually connected database tables; Most often, data warehouses are modeled as relational databases; Consequently, they are implemented using a relational DBMS
creating ETL infrastructure
Creating necessary procedures and code for: -Automatic extraction of relevant data from the operational data sources -Transformation of the extracted data, so that its quality is assured and its structure conforms to the structure of the modeled and implemented data warehouse -The seamless load of the transformed data into the data warehouse Due to the amount of details that have to be considered, creating ETL infrastructure is often the most time- and resource-consuming part of the data warehouse development process ETL infrastructure has to account for and reconcile all of the differences in the metadata and the data between the operational sources and the target data warehouses. In many cases, organizations have multiple separate operational sources with overlapping information. In such cases, the process of creating the ETL infrastructure involves deciding how to bring in such information without creating misleading duplicates (i.e., how to bring in all the useful information while avoiding the uniqueness data quality problem).
database analysts, designers, developers
Database analysts - involved in the requirements collection, definition, and visualization stage ; Database designers (a.k.a. database modelers or architects) - involved in the database modeling stage; Database developers - in charge of implementing the database model as a functioning database using the DBMS software; It is not uncommon for the same people to perform more than one of these roles. In fact (especially in smaller companies and organizations), the same people may be in charge of all aspects of the database system, including the design, implementation, administration, and maintenance.
database scope
Databases can vary in their scope from small single-user (personal) databases to large enterprise databases that can be used by thousands of end-users; Regardless of their scope, all databases go through the same fundamental development steps (requirements, modeling, implementation, deployment, use, etc.); The difference in the scope of databases is reflected in the size, complexity and cost in time and resources required for each of the steps
ERD
ER diagram; the result of ER modeling; Serves as a blueprint for the database
front-end application analysts and developers
Front-end application analysts: in charge of collecting and defining requirements for front-end applications; Front-end applications developers: in charge of creating the front-end applications
granularity of fact tables
Granularity describes what is depicted by one row in the fact table Detailed fact tables have fine level of granularity because each record represents a single fact Aggregated fact tables have a coarser level of granularity than detailed fact tables as records in aggregated fact tables always represent summarizations of multiple facts Due to their compactness, coarser granularity aggregated fact tables are quicker to query than detailed fact tables Coarser granularity tables are limited in terms of what information can be retrieved from them One way to take advantage of the query performance improvement provided by aggregated fact tables, while retaining the power of analysis of detailed fact tables, is to have both types of tables coexisting within the same dimensional model, i.e. in the same constellation Aggregation is requirement-specific while a detailed granularity provides unlimited possibility for analysis. You can always obtain an aggregation from the finest grain, but the reverse is not true.
referential integrity constraint lines
In a relational schema lines pointing from the foreign key to the corresponding primary key
Mapping M:N relationships
In addition to the two relations representing the two entities involved in the M:N relationship, another relation is created to represent the M:N relationship itself (can have it's own attribute); This new relation has two foreign keys, corresponding to the primary keys of the two relations representing the two entities involved in the M:N relationship; The two foreign keys form the composite primary key of the new relation; Associative entities are mapped into relational database constructs in the identical way as M:N relationships
referential integrity constraint
In each row of a relation containing a foreign key, the value of the foreign key EITHER matches one of the values in the primary key column of the referred relation OR the value of the foreign key is null (empty). -Regulates the relationship between a table with a foreign key and a table with a primary key to which the foreign key refers; delete and update implementation options: Delete options DELETE RESTRICT DELETE CASCADE DELETE SET-TO-NULL DELETE SET-TO-DEFAULT Update options UPDATE RESTRICT UPDATE CASCADE UPDATE SET-TO-NULL UPDATE SET-TO-DEFAULT
Referential integrity constraint
In each row of a relation containing a foreign key, the value of the foreign key EITHER matches one of the values in the primary key column of the referred relation OR the value of the foreign key is null (empty).; A rule that defines values that are valid for use in foreign keys
exact min/max cardinality
In some cases the exact minimum and/or maximum cardinality in relationships is known in advance; Exact minimum/and or maximum cardinalities can be depicted in ER diagrams
line-item vs transaction-level detailed fact table
Line-item detailed fact table -Each row represents a line item of a particular transaction Transaction-level detailed fact table -Each row represents a particular transaction
Implementing user-defined constraints
Methods for implementing user-defined constraints include: -CHECK clause -Assertions and triggers -Coding in specialized database programming languages that combine SQL with additional non-SQL statements for processing data from databases (such as PL/SQL) -Embedding SQL with code written in regular programming languages (such as C++ or Java) In many cases the logic of user-defined constraints is not implemented as a part of the database, but as a part of the front-end database application; For the proper use of the database, it is important that user-defined constraints are implemented fully; From the business use point of view, the enforcement of user-defined constraints is more important than which actual method of implementing the user-defined constraint was chosen. Often, when there is more than one method available, the choice is made based on technical considerations, such as which of the methods provides for the fastest execution.
streamlining functional dependencies
Not all functional dependencies need to be depicted; The following types of functional dependencies can be omitted: -Trivial functional dependencies -Augmented functional dependencies -Equivalent functional dependencies
report
Presents the data and calculations on the data from one or more tables from the database in a formatted way; The data that is retrieved via reports is formatted and arranged to be displayed on the screen or printed as a hard copy
data quality actions
Preventive data quality actions: -Actions taken to preclude data quality problems Corrective data quality actions: -Actions taken to correct the data quality problems
database front-end
Provides access to the database for indirect use; In most cases, a portion of intended users (often a majority of the users) of the database lack the time and/or expertise to engage in the direct use of the data in the database. It is not reasonable to expect every person who needs to use the data from the database to write his or her own queries and other statements.; In addition to the forms and reports, database front-end applications can include many other components and functionalities, such as: -menus -charts -graphs -maps -etc. The choice of how many different components to use and to what extent is driven by the needs of the end-users; A database can have multiple sets of front-end applications for different purposes or groups of end-users; Front-end applications can be accessible separately on their own or via an interface that allows the user to choose an application that they need.
multiple relationships between same entities
Same entities in an ER diagram can be related via more than one relationship
data warehouse
The data warehouse is a structured repository (The data warehouse is a database containing analytically useful information Any database is a structured repository with its structure represented in its metadata, not a disorganized random mass of data) of integrated (The data warehouse integrates the analytically useful data from the various operational databases (and possibly other sources) Integration refers to this process of bringing the data from multiple data sources into a singular data warehouse. In the process of bringing the data from operational databases into the data warehouse, no data is actually removed from the operational sources. Instead, the analytically useful data from various operational databases is copied and brought into the data warehouse.), subject-oriented (The term subject-oriented refers to the fundamental difference in the purpose of an operational database system and a data warehouse. An operational database system is developed in order to support a specific business operation A data warehouse is developed to analyze specific business subject areas), enterprise-wide (The term enterprise-wide refers to the fact that the data warehouse provides an organization-wide view of the analytically useful information it contains), historical (The term historical refers to the larger time horizon in the data warehouse than in the operational databases), and time-variant (The term time variant refers to the fact that a data warehouse contains slices or snapshots of data from different periods of time across its time horizon With the data slices, the user can create reports for various periods of time within the time horizon) data. The purpose of the data warehouse is the retrieval of analytical information. -A data warehouse is developed for the retrieval of analytical information, and it is not meant for direct data entry by the users. -The only functionality available to the users of the data warehouse is retrieval -The data in the data warehouse is not subject to changes. -The data in the data warehouse is referred to as non-volatile, static, or read-only -The data in the data warehouse is not subject to modifications, insertions or deletions by the users. -New data in the data warehouse is periodically loaded from the operational data sources and appended to the existing data, in an automatic fashion. -The data that eventually gets older than the required time horizon is purged from the data warehouse. A data warehouse can store detailed and/or summarized data. -A data warehouse, depending on its purpose, may include the detailed data or summary data or both -A data warehouse that contains the data at the finest level of detail is the most powerful
dw
The data warehouse is sometimes referred to as the target system, to indicate the fact that it is a destination for the data from the source systems; A typical data warehouse periodically retrieves selected analytically useful data from the operational data sources; In so-called "active" data warehouses, the retrieval of data from operational data sources is continuous.
functional dependencies
The functional dependencies that are used as a basis for the typical normalization process can be classified in one of the three categories: -Partial functional dependency -Full key functional dependency -Transitive functional dependency
ETL infrastructure
The infrastructure that facilitates the retrieval of data from operational databases into the data warehouses ETL includes the following tasks: -Extracting analytically useful data from the operational data sources -Transforming such data so that it conforms to the structure of the subject-oriented target data warehouse model (while ensuring the quality of the transformed data) -Loading the transformed and quality assured data into the target data warehouse
data warehouse/ data mart modeling approaches
Three of the most common data warehouse and data mart modeling approaches: -Normalized data warehouse -Dimensionally modeled data warehouse -Independent data marts
surrogate key
Typically, in a star schema all dimension tables are given a simple, non-composite system-generated key, also called a surrogate key Values for the surrogate keys are typically simple auto-increment integer values Surrogate key values have no meaning or purpose except to give each dimension a new column that serves as a primary key within the dimensional model instead of the operational key
Mapping weak entities
Weak entities are mapped in a same way as regular entities with one addition: -The resulting relation has a composite primary key that is composed of the partial identifier and the foreign key corresponding to the primary key of the owner entity
constellation/ galaxy of stars
When multiple subjects of analysis can share the same dimensions, a dimensional model contains more than one fact table This approach enables: -Quicker development of analytical databases for multiple subjects of analysis, because dimensions are re-used instead of duplicated -Straightforward cross-fact analysis
developing front-end applications
designing and creating applications for indirect use by the end-users; are based on the database model and the requirements specifying the front-end functionalities; contain interfaces (such as forms and reports) accessible via a navigation mechanism (such as a menu); The design and creation of front-end applications can commence and proceed in parallel with database implementation; The actual creation of a front-end application involves connecting it to the database. Connecting front-end application to the database can only be done once the database is implemented.
dimensions
dimension tables: -Contain descriptions of the business, organization, or enterprise to which the subject of analysis belongs -Columns in dimension tables contain descriptive information that is often textual (e.g., product brand, product color, customer gender, customer education level), but can also be numeric (e.g., product weight, customer income level) -This information provides a basis for analysis of the subject For example, if the subject of the business analysis is sales, it can be analyzed by dimension columns such as product brand, customer gender, customer income level, and so on.
direct interaction
end-user communicating with the database directly through DBMS
indirect interaction
end-user communicating with the database through front-end applications
owner entity
entity whose unique attribute provides a mechanism for identifying instances of a weak entity
ER modeling
entity-relationship modeling; conceptual database modeling technique ; Enables the structuring and organizing of the requirements collection process; Provides a way to graphically represent the requirements
data
facts that are recorded and can be accessed; data formats: text, numbers, figures, graphics, images, audio/video recordings and more; data is recorded and kept because it is considered to be of use to an intended user
augmented functional dependency
functional dependency that contains an existing functional dependency; Does not add new information to what is already described by the existing functional dependency; It can be omitted; ex] if a functional dependency A -> B exists in a relation then, A, C -> B is an augmented functional dependency
Entity integrity constraint
in a relational table, no primary key column can have null (empty) values; -A rule stating that no primary key column can be optional -Every RBMS enforces this rule
relational database model
logical database model that represents a database as a collection of related tables; most contemporary commercial DBMS software packages are relational DBMS (RDBMS) software packages
full key functional dependency
occurs when a primary key functionally determines the column of a relation and no separate component of the primary key partially determines the same column; -If a relation has a single component (non-composite) primary key, the primary key fully functionally determines all the other columns of a relation -If a relation has a composite key, and portions of the key partially determine columns of a relation, then the primary key does not fully functionally determine the partially determined columns
trivial functional dependency
occurs when an attribute (or a set of attributes) functionally determines itself or its subset; are not depicted; ex] A -> A A, B -> A, B A, B -> A CampainMgrID, CampaignMgrName -> CampaignMgrName
deletion anomaly
occurs when deletion of data about a real-world entity forces deletion of data about another real-world entity
insertion anomaly
occurs when inserting data about one real-world entity requires inserting data about another real-world entity
functional dependency
occurs when the value of one (or more) column(s) in each record of a relation uniquely determines the value of another column in that same record of the relation; ex] A -> B ClientID -> ClientName
equivalent functional dependency
occurs when two columns (or sets of columns) that functionally determine each other determine other columns; If one of the equivalent functional dependencies is depicted, the other equivalent functional dependency can be omitted; ex] if functional dependencies: A -> B B -> A exists in a relation, then: A -> B B -> A are equivalent functional dependencies, and: A -> B, X B -> A,X are equivalent functional dependencies, and: Y, A -> B, X Y, B -> A, X are equivalent functional dependencies
modification anomaly
occurs when, in order to modify one real-world value, the same modification has to be made multiple times
data warehouse administration and maintenance
performing activities that support the data warehouse end user, including dealing with technical issues, such as: -Providing security for the information contained in the data warehouse -Ensuring sufficient hard-drive space for the data warehouse content -Implementing the backup and recovery procedures
database administration and maintenance
performing activities that support the database end user, including dealing with technical issues, such as: -Providing security for the information contained in the database -Ensuring sufficient hard-drive space for the database content -Implementing the backup and recovery procedures
Designer-created primary key
primary key column, not called for by the original requirements, added to a table by the database designer; Often used in conjunction with the autonumber data type option
normalization
process used to improve the design of relational databases; The normalization process involves examining each table and verifying if it satisfies a particular normal form; If a table satisfies a particular normal form, then the next step is to verify if that relation satisfies the next higher normal form; If a table does not satisfy a particular normal form, actions are taken to convert the table into a set of tables that satisfy the particular normal form; Normalizing to first normal form is done on non-relational tables in order to convert them to relational tables; Normalizing to subsequent normal forms (e.g., second normal form, third normal form) improves the design of relational tables that contain redundant information and alleviates the problem of update anomalies; exceptions: -In general, database relations are normalized to 3NF in order to eliminate unnecessary data redundancy and avoid update anomalies -However, normalization to 3NF should be done judiciously and pragmatically, which may in some cases call for deliberately not normalizing certain relations to 3NF
relation
relational table OR table; table in a relational database -a table containing rows and columns -the main construct in the relational database model -every relation is a table, but not every table is a relation; In order for a table to be a relation the following conditions must hold: 1) Each column must have a name (within one table, each column name must be unique) 2) Within one table, each row must be unique 3) Within each row, each value in each column must be single valued (multiple values of the content represented by the column are not allowed in any rows of the table) 4) All values in each column must be from the same (predefined) domain 5) Order of columns is irrelevant 6) Order of rows is irrelevant
identifying relationship
relationship between a weak entity and its owner entity in which each instance of a weak entity is associated with exactly one instance of an owner entity; Each weak entity must be associated with its owner entity via an identifying relationship; Unique attribute from the owner entity uniquely identifies every instance of the weak entity via an identifying relationship
binary relationship
relationship between two entities; degree 2 relationship
ternary relationship
relationship involving three entities; degree 3 relationship; uses an associative entity to connect entities; In practice, ternary relationships are relatively rare, and relationships of degree higher than 3 are rarer still; In most cases when a designer is tempted to create relationships of degrees higher than 2, he or she should explore the possibility of creating additional entities instead.
data warehouse deployment
releasing the data warehouse and its front-end (BI) applications for use by the end users Typically, prior to this step, the initial load populating the created data warehouse with an initial set of data from the operational data sources via the ETL infrastructure is executed.
database deployment
releasing the database system for use by the end users; also involves populating the implemented database with the initial set of data
requirements collection, definition, and visualization dw
results in the requirements specifying the desired capabilities and functionalities of the future data warehouse -The requirements are based on the analytical needs that can be met by the data in the internal data source systems and available external data sources -The requirements are collected through interviewing various stakeholders of the data warehouse -In addition to interviews, additional methods for eliciting requirements from the stakeholders can be used The first and most critical step in the development of the data warehouse. The requirements collection process aspires to analytically take advantage of all data available for consideration, but it cannot be based on data that is not available or does not exist. Additional methods for eliciting requirements include focus groups, questionnaires, surveys, and observations of existing analytical practices to determine what users really do with data and what data they actually use and need The collected requirements should be clearly defined and stated in a written document, and then visualized as a conceptual data model
database
structured collection of related data stored on a computer medium; organizes the data in a way that facilitates efficient access to the information captured in the data
normal form
term representing a set of particular conditions (whose purpose is reducing data redundancy) that a table has to satisfy; From a lower to a higher normal form, these conditions are increasingly stricter and leave less possibility for redundant data; There are several normal forms, most fundamental of which are: First normal form (1NF) Second normal form (2NF) Third normal form (3NF)
database use
the insertion, modification, deletion, and retrieval of the data in the database system; The database system can be used indirectly, via the front-end applications, or directly via the DBMS; SQL includes commands for insertion, modification, deletion and retrieval of the data. These commands can be issued by front-end applications (indirect use), or directly by the end-users themselves (direct use)
data warehouse use
the retrieval of the data in the data warehouse Indirect use -Via the front-end (BI) applications Direct use -Via the DBMS -Via the OLAP (BI) tools