OIM 350 Exam #2


analytical information

-information collected and used in support of analytical tasks
-analytical information is based on operational (transactional) information

operational (transactional) information

information collected and used in support of day-to-day operational needs

update operations

insert operation: used for entering new data into the relation
delete operation: used for removing data from the relation
modify operation: used for changing the existing data in the relation
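
A minimal SQL sketch of the three update operations; the CLIENT table and its data are hypothetical, used only for illustration:

CREATE TABLE client (
  clientid   CHAR(4) PRIMARY KEY,
  clientname VARCHAR(25) );

-- insert operation: enter new data into the relation
INSERT INTO client VALUES ('C111', 'Tibor');

-- modify operation: change existing data in the relation
UPDATE client SET clientname = 'Tibor K.' WHERE clientid = 'C111';

-- delete operation: remove data from the relation
DELETE FROM client WHERE clientid = 'C111';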

update anomalies

when relations contain redundant (unnecessarily repeating) data, anomalies can result

insertion anomaly: when inserting data about one entity requires inserting data about another entity
deletion anomaly: when deletion of data about an entity forces deletion of data about another entity
modification anomaly: when modifying one value requires the same modification to be made multiple times
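
A minimal sketch of a modification anomaly, assuming a hypothetical non-normalized table in which an advisor's name repeats for every advised student:

CREATE TABLE studentadvisor (
  studentid   CHAR(4) PRIMARY KEY,
  advisorid   CHAR(2),
  advisorname VARCHAR(25) );   -- redundant: repeats for every student of the advisor

INSERT INTO studentadvisor VALUES ('S111', 'A1', 'Joe');
INSERT INTO studentadvisor VALUES ('S222', 'A1', 'Joe');

-- modification anomaly: one real-world change requires the same
-- modification to be made in multiple records
UPDATE studentadvisor SET advisorname = 'Joseph' WHERE advisorid = 'A1';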

ER modeling and relational modeling

ER modeling:
-primary technique for visualizing database requirements, used for conceptual modeling of operational databases
relational modeling:
-standard method for logical modeling of operational databases
*both techniques can also be used during the development of data warehouses and data marts

DW: structured repository

-a DW is a database containing analytically useful information
-any database is a structured repository, with structure represented by metadata (specific format)
The data warehouse is a database containing analytically useful information. Any database is a structured repository, with its structure represented in its metadata. Hence, the data warehouse, being a database, is also a structured repository. In other words, the data warehouse is not a disorganized random mass of data.

*detailed and/or summarized data

-a DW may include detailed data, summary data, or both
-a DW that contains data at the finest level of detail is most powerful
A data warehouse, depending on its purpose, may include detailed data or summary data, or both. The detailed data is also called atomic data or transaction-level data. For example, a table in which each ATM transaction is recorded as a separate record contains detailed data (i.e., atomic or transaction-level data). On the other hand, a table in which a record represents calculations based on multiple instances of transaction-level data contains summarized data at a coarser level of detail. For example, a summarized data record could represent the total amount of money withdrawn in one month from one account via an ATM.

methods for implementing user-defined constraints:

-CHECK clause
-assertions and triggers
-coding in specialized database programming languages that combine SQL with additional non-SQL statements for processing database data (such as PL/SQL)
-embedding SQL in code written in regular programming languages (such as C++ or Java)
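
As a sketch of the assertions-and-triggers option, here is a minimal trigger in MySQL-style syntax; the employee table, its salary column, and the 10% raise rule are all hypothetical assumptions:

DELIMITER //
CREATE TRIGGER salary_raise_check
BEFORE UPDATE ON employee
FOR EACH ROW
BEGIN
  -- hypothetical user-defined business rule: raises may not exceed 10%
  IF NEW.salary > OLD.salary * 1.10 THEN
    SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Raise exceeds 10% limit';
  END IF;
END//
DELIMITER ;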

data warehouse

-a DW is sometimes referred to as the target system, i.e., a destination for the data from source systems
-a typical DW retrieves selected, analytically useful data from the operational data sources

steps in the development of data warehouses

*requirements collection, definition, and visualization: specifies desired functionalities
~based on the data in the internal data source system and external data sources
~requirements are collected through interviewing various stakeholders
~collected requirements should be clearly defined in a written document and visualized as a conceptual data model
*data warehouse modeling (logical data warehouse modeling)
~creation of the data warehouse data model that is implementable by the DBMS software
*creating the data warehouse
~using a DBMS to implement the data warehouse data model as an actual data warehouse
~typically, data warehouses are implemented using relational DBMS (RDBMS) software
*creating ETL infrastructure: creating the necessary procedures and code for:
~automatic extraction of relevant data from the operational data sources
~transformation of the extracted data, so that its quality is assured and its structure conforms to the structure of the modeled and implemented DW
~seamless load of the transformed data into the DW
~creating the ETL infrastructure is often the most time- and resource-consuming part of the data warehouse development process
*developing front-end (BI) applications: design and create applications for end users
~front-end applications are included in most data warehousing systems and are referred to as business intelligence (BI) applications
~front-end applications contain interfaces (such as forms and reports) accessible via a navigation mechanism (such as a menu)
*data warehouse deployment: releasing the data warehouse and its front-end (BI) applications for use by the end users
*data warehouse use: the retrieval of the data in the data warehouse
~indirect use: via the front-end (BI) applications
~direct use: via the DBMS, or via OLAP (BI) tools
*data warehouse administration and maintenance: performing activities that support the data warehouse end user, such as:
~providing security for the information contained in the data warehouse
~ensuring sufficient hard-drive space for the data warehouse content
~implementing backup and recovery procedures
**the next (new) version of the data warehouse follows the same development steps as the initial version

normalized data warehouse (Inmon)

-integrated analytical database modeled with traditional modeling techniques (ER and relational)
-the final result is a normalized relational database schema
-populated with analytically useful data from the operational data sources via the ETL process
-serves as a source of data for dimensionally modeled data marts and other analytically useful data sets

data warehouse/data mart modeling approaches

three of the most common data warehouse and data mart modeling approaches:
*normalized data warehouse (Inmon)
*dimensionally modeled data warehouse (Kimball)
*independent data marts

names of the 3 approaches for handling dimension tables with slowly changing attribute values

-Type 1: no record of any changes is kept; the new value simply overwrites the old value in the current record
-Type 2: history is preserved by creating a new record that serves as a reference point for the change
-Type 3: "previous" and "current" columns indicate whether a value is current or not

data mart

-a data store built on the same principles as a DW, but with a more limited scope
*independent data mart
~stand-alone data mart, created in the same fashion as a DW
~an independent data mart has its own source systems and ETL infrastructure
*dependent data mart
~does not have its own source systems
~its data comes from a DW

*time-variant

-a data warehouse contains slices or snapshots of data from different periods of time across the time horizon
-with the data slices, the user can create reports for various periods of time within the time horizon
The term "time-variant" refers to the fact that a data warehouse contains slices or snapshots of data from different periods of time across its time horizon. With these data slices, the user can create reports for various periods of time within the time horizon. For example, if the subject of analysis is cost and the time horizon is a number of years, we can analyze and compare the cost for the first quarter from a year ago versus the cost for the first quarter from two years ago.

dimensional modeling

-a modeling technique tailored specifically for analytical database design purposes
-regularly used in practice for modeling data warehouses and data marts
-a methodology used for designing subject-oriented analytical databases, such as data warehouses or data marts
-in addition to using relational concepts (primary keys, foreign keys, integrity constraints, etc.), dimensional modeling distinguishes two types of tables: dimensions and facts

form

-enables data input and retrieval for end users
-provides an interface into a database relation or query

data warehouse components

-source systems
-extraction-transformation-load (ETL) infrastructure
-data warehouse
-front-end applications

star schema

-the result of dimensional modeling is a dimensional schema (model) containing facts and dimensions
-the dimensional schema is often referred to as a star schema
-an extended, more detailed version of the star schema is the snowflake schema
*the chosen subject of analysis is represented by a fact table
*the design of a star schema considers which dimensions to use with the fact table representing the chosen subject
*for every dimension under consideration, two questions must be answered:
~1. can the dimension table be useful for the analysis of the chosen subject?
~2. can the dimension table be created based on the existing data sources?
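
A minimal star schema sketch, assuming a hypothetical retail example: SALES is the fact table, and CALENDAR, STORE, and PRODUCT are the dimensions it connects to via foreign keys:

CREATE TABLE calendar (
  calendarkey INTEGER PRIMARY KEY,     -- surrogate key
  fulldate    DATE,
  dayofweek   VARCHAR(9) );

CREATE TABLE store (
  storekey    INTEGER PRIMARY KEY,     -- surrogate key
  storezip    CHAR(5),
  storesize   INTEGER );

CREATE TABLE product (
  productkey   INTEGER PRIMARY KEY,    -- surrogate key
  productname  VARCHAR(25),
  productprice NUMERIC(7,2) );

CREATE TABLE sales (
  calendarkey   INTEGER REFERENCES calendar (calendarkey),
  storekey      INTEGER REFERENCES store (storekey),
  productkey    INTEGER REFERENCES product (productkey),
  transactionid CHAR(8),               -- non-measure attribute
  unitssold     INTEGER,               -- measure
  dollarssold   NUMERIC(9,2),          -- measure
  PRIMARY KEY (calendarkey, storekey, productkey, transactionid) );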

two different uses of "update operation"

1. a collective term for the insert, delete, and modify operations (the usage we follow)
2. a synonym for the modify operation

types of functional dependencies: functional dependencies used as a basis for the normalization process can be classified into one of three categories:

1. partial functional dependency
2. full key functional dependency
3. transitive functional dependency

two main reasons for creating a separate data warehouse (analytical database)

1. the performance of operational databases can be severely diminished if day-to-day tasks have to share computing resources with analytical queries
2. it is difficult to structure a database that can be used efficiently for both operational and analytical purposes

3 data quality issues

3 examples from the 6 fundamental data quality characteristics

dependent data mart

A small DW (1 to 2 subjects) that is created directly from a data warehouse

Second normal form (2NF)

A table is in 2NF if it is in 1NF and it does not contain partial functional dependencies.
If a relation has a single-column primary key, then there is no possibility of partial functional dependencies; such a relation is automatically in 2NF and does not have to be normalized to 2NF.
If a relation with a composite primary key has partial dependencies, then it is not in 2NF and has to be normalized to 2NF.
Normalization of a relation to 2NF creates additional relations for each set of partial dependencies in the relation.
The primary key of each new relation is the portion of the original primary key that functionally determines the columns moved to it.
The columns that were partially determined in the original relation become part of the new relation.
The original table remains after the process of normalizing to 2NF, but it no longer contains the partially dependent columns.
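
A minimal 2NF sketch, assuming a hypothetical GRADEREPORT(studentid, courseid, studentname, grade) relation in which studentname depends on studentid alone (a partial dependency). Normalizing moves the partially dependent column into a new relation:

-- new relation: its primary key is the portion of the old composite key
-- that determined the partially dependent column
CREATE TABLE student (
  studentid   CHAR(4) PRIMARY KEY,
  studentname VARCHAR(25) );

-- original relation, minus the partially dependent column
CREATE TABLE gradereport (
  studentid CHAR(4) REFERENCES student (studentid),
  courseid  CHAR(6),
  grade     CHAR(1),                  -- fully dependent on the whole key
  PRIMARY KEY (studentid, courseid) );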

user-defined constraints

CHECK:
-used to specify a constraint on a particular column of a relation
-in the SQL sketch below: four inserts accepted, two inserts rejected due to the CHECK constraint
INDEX:
-mechanism for increasing the speed of data search and data retrieval on relations with a large number of records
-the vast majority of relational DBMS software tools enable the definition of indexes
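
A minimal reconstruction of the CHECK example alluded to above; the employee table and the salary range are hypothetical, and some DBMSs (e.g., older MySQL versions) do not enforce CHECK:

CREATE TABLE employee (
  empid  CHAR(4) PRIMARY KEY,
  salary NUMERIC(9,2) CHECK (salary BETWEEN 50000 AND 200000) );

INSERT INTO employee VALUES ('E1',  60000);   -- accepted
INSERT INTO employee VALUES ('E2', 120000);   -- accepted
INSERT INTO employee VALUES ('E3',  50000);   -- accepted (boundary value)
INSERT INTO employee VALUES ('E4', 200000);   -- accepted (boundary value)
INSERT INTO employee VALUES ('E5',  40000);   -- rejected by the CHECK constraint
INSERT INTO employee VALUES ('E6', 250000);   -- rejected by the CHECK constraint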

INDEXING

-the examples provided a simplified illustration of the principles on which an index is based
-instead of simply sorting on the indexed column and applying binary search, different DBMS tools implement indexes using different logical and technical approaches, such as:
*clustering indexes
*hash indexes
*B+ trees
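
A minimal index-definition sketch using the standard CREATE INDEX syntax supported by most relational DBMSs; the customer table and column names are hypothetical:

CREATE TABLE customer (
  customerid    CHAR(4) PRIMARY KEY,
  customerlname VARCHAR(25) );

-- define an index on the column that is frequently searched
CREATE INDEX customer_lname_index ON customer (customerlname);

-- queries filtering on the indexed column can now use the index
-- instead of scanning every record
SELECT * FROM customer WHERE customerlname = 'Smith';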

normalization: several normal forms

First normal form (1NF)
Second normal form (2NF)
Third normal form (3NF)

full key functional dependency: when a primary key functionally determines a column of a relation and no separate component of the primary key partially determines the same column

-if a relation has a single-component (non-composite) primary key, the primary key fully functionally determines all the other columns of the relation
-if a relation has a composite key, and portions of the key partially determine columns of the relation, then the primary key does not fully functionally determine the partially determined columns

OLAP Operations (PivotTable Assignment)

Slice: a subset of a multidimensional array
Dice: a slice on two or more dimensions
Drill Down/Up: navigating among levels of data, ranging from the most summarized (up) to the most detailed (down)
Pivot: used to change the dimensional orientation of a report or an ad hoc query-page display
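
Rough SQL analogies for these operations, reusing the hypothetical SALES star schema sketched earlier (real OLAP tools perform these operations on a multidimensional cube rather than via hand-written SQL):

-- slice: fix a value on one dimension
SELECT *
FROM sales s JOIN store st ON s.storekey = st.storekey
WHERE st.storezip = '60111';

-- dice: fix values on two or more dimensions
SELECT *
FROM sales s
JOIN store st  ON s.storekey = st.storekey
JOIN product p ON s.productkey = p.productkey
WHERE st.storezip = '60111' AND p.productname = 'Zzz Bag';

-- drill up: move to a more summarized level (total sales per store zip)
SELECT st.storezip, SUM(s.dollarssold) AS totalsales
FROM sales s JOIN store st ON s.storekey = st.storekey
GROUP BY st.storezip;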

5 data warehouse architectures (find one you clearly understand and talk about what type of data warehouse it is); be familiar with these

-the architectures come in two modeling versions: dimensionally modeled and normalized
-federated architecture: involves a large number of source systems

when to choose data mart vs data warehouse

data warehouse: choose when you want to analyze the whole company, have the time and resources to get it done, and have users across the entire organization, i.e., a business model that wants to empower every employee in the corporate office to make decisions
data mart: choose when you are on a budget, need to save time, or only care about a subset of the company

purpose of a data warehouse, and what are its benefits compared to an operational database?

decision making, data retrieval, and analytical purposes: analyzing data to make decisions

compare kimball and inmon approaches toward DW development

Inmon (normalized data warehouse): first build an integrated analytical database modeled with traditional ER and relational techniques, resulting in a normalized relational schema; it then serves as a source for dimensionally modeled dependent data marts.
Kimball (dimensionally modeled data warehouse): design a set of conformed dimensions first, then add fact tables for each subject of analysis; the resulting DW is a collection of intertwined dimensionally modeled data marts, i.e., a constellation of stars.
Both approaches populate the DW with analytically useful data from the operational data sources via the ETL process.

First normal form (1NF)

Normalizing to 1NF is done on non-relational tables in order to convert them to relational tables.
Normalizing to subsequent normal forms (e.g., second normal form, third normal form) improves the design of relational tables that contain redundant information and alleviates the problem of update anomalies.
A table is in 1NF if each row is unique and no column in any row contains multiple values.
1NF states that each value in each column of a table must be a single value from the domain of the column.
Every relational table is, by definition, in 1NF.
Related multivalued columns are columns in a table that refer to the same entity and can have multiple values per record.
Normalizing to 1NF involves eliminating groups of related multivalued columns.
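
A minimal 1NF sketch, assuming a hypothetical non-relational table that stored several phone numbers in one multivalued column; normalizing to 1NF moves the multivalued column into its own table with one value per row:

CREATE TABLE customer1nf (
  customerid   CHAR(4) PRIMARY KEY,
  customername VARCHAR(25) );

CREATE TABLE customerphone (
  customerid CHAR(4) REFERENCES customer1nf (customerid),
  phone      VARCHAR(15),
  PRIMARY KEY (customerid, phone) );   -- one row per phone number

-- a customer with two phones is now two rows, not one multivalued cell
INSERT INTO customerphone VALUES ('C111', '555-1111');
INSERT INTO customerphone VALUES ('C111', '555-2222');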

transitive functional dependency: occurs when nonkey columns functionally determine other nonkey columns of a relation

a nonkey column is a column in a relation that is neither a primary key nor a candidate key column

normalization: process used to improve the design of relational databases

normal form: a term representing a set of particular conditions (which reduce data redundancy) that a table has to satisfy
from a lower to a higher normal form, these conditions become increasingly stricter and leave less possibility for redundant data

functional dependency

occurs when the value of one (or more) column(s) in each record of a relation uniquely determines the value of another column in that same record of the relation
ex: A -> B, as in ClientID -> ClientName

partial functional dependency: occurs when a column of a relation is functionally dependent on a component of a composite primary key

because only composite primary keys have separate components, a partial functional dependency can occur only when a relation has a composite primary key

Type 1

~changes the value in the dimension's record: the new value replaces the old value
~no history is preserved
~the simplest approach; used most often when a change in a dimension is the result of an error

slowly changing dimension

~dimension tables typically include:
-attributes whose values do not change (or change rarely), such as store size and customer gender
-attributes whose values change occasionally over time, such as customer zip and employee salary
~a dimension that contains attributes whose values can change is referred to as a slowly changing dimension
~the most common approaches to dealing with slowly changing dimensions are Type 1, Type 2, and Type 3

line-item versus transaction-level detailed fact table (both are examples of detailed fact tables)

~line-item detailed fact table:
*each row represents a line item of a particular transaction
*TransactionID repeats
~transaction-level detailed fact table:
*each row represents a particular transaction
*each row is identified by a unique TransactionID

data warehouse definition (next slide: 6 characteristics of data warehouse)

-a structured repository of integrated, subject-oriented, enterprise-wide, historical, and time-variant data
-purpose: retrieval of analytical information; a data warehouse can store detailed and/or summarized data
The data warehouse is a structured repository of integrated, subject-oriented, enterprise-wide, historical, and time-variant data. The purpose of the data warehouse is to enable the retrieval of analytical information. A data warehouse can store detailed and/or summarized data.

dimensionally modeled data warehouse

-a collection of dimensionally modeled intertwined data marts (i.e., a constellation of dimensional models)
-integrates analytically useful information from the operational data sources
-same as the normalized data warehouse approach when it comes to the utilization of operational data sources and the ETL process
-a set of commonly used dimensions, known as conformed dimensions, is designed first
-fact tables corresponding to the subjects of analysis are subsequently added
-a set of dimensional models is created, with each fact table connected to multiple dimensions
-some dimensions are shared by more than one fact table
-additional dimensions are included as needed
-the resulting DW is a collection of intertwined dimensionally modeled data marts, i.e., a constellation of stars
-can be used as a source for dependent data marts and other views, subsets, and/or extracts

dimensional tables (dimensions)

-contain descriptions of the business, organization, or enterprise to which the subject of analysis belongs
-columns often contain descriptive information that is textual (e.g., product brand, product color, customer gender)
-columns can also be numeric (e.g., product weight, customer income level)
-columns provide a basis for analysis of the subject

fact tables

-contain measures related to the subject of analysis
-contain foreign keys (associating facts with dimension tables)
-can also contain other attributes that are not measures (transaction identifier, transaction time)
-can contain either detailed data or aggregated data
-measures in fact tables are typically numeric and intended for mathematical computation and quantitative analysis

Type 2

-creates a new, additional dimension record with a new value for the surrogate key every time a value in a dimension record changes
-used in cases where history should be preserved
-can be combined with the use of timestamps and row indicators
*timestamps: columns that indicate the time interval for which the values in the record are applicable
*row indicator: a column that provides a quick indicator of whether the record is currently valid
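
A minimal Type 2 sketch, assuming a hypothetical CUSTOMER dimension: when a customer's zip changes, a new record with a new surrogate key value is added, while timestamps and a row indicator mark which record is currently valid:

CREATE TABLE customerdim (
  customerkey   INTEGER PRIMARY KEY,   -- surrogate key: new value per change
  customerid    CHAR(4),               -- operational key: repeats across versions
  customerzip   CHAR(5),
  effectivedate DATE,                  -- timestamp: start of the validity interval
  expirydate    DATE,                  -- timestamp: end of the validity interval
  rowindicator  CHAR(1) );             -- 'C' = current, 'N' = not current

-- the old record is expired, and a new record preserves the history
INSERT INTO customerdim VALUES (1, 'C111', '60111', DATE '2020-01-01', DATE '2023-06-30', 'N');
INSERT INTO customerdim VALUES (2, 'C111', '60999', DATE '2023-07-01', NULL, 'C');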

data quality

-data in a database is of high quality if it correctly and non-ambiguously reflects the real world it represents
-data quality characteristics:
*accuracy: the extent to which data correctly reflects the real-world instances it is supposed to depict
*uniqueness: requires each real-world instance to be represented only once in the data collection (the uniqueness data quality problem is sometimes also referred to as data duplication)
*completeness: the degree to which all the required data is present in the data collection
*consistency: the extent to which the data properly conforms to and matches up with the other data
*timeliness: the degree to which the data is aligned with the proper time window in its representation of the real world (typically, timeliness refers to the "freshness" of the data)
*conformity: the extent to which the data conforms to its specified format
-preventive data quality actions: actions taken to preclude data quality problems
-corrective data quality actions: actions taken to correct data quality problems

*enterprise-wide

-the data warehouse provides an organization-wide view of analytically useful information
The data warehouse is a repository of analytically useful information in the organization. The term "enterprise-wide" refers to the fact that the data warehouse provides an organization-wide view of the analytically useful information it contains. For example, if one of the subjects of the data warehouse is cost, then all of the analytically useful data regarding cost within the operational data sources throughout the entire organization will be brought into the data warehouse.

*retrieval of analytical information

-developed for the retrieval of analytical information, not for direct data entry by users
-retrieval is the only functionality available to DW users
-DW data is not subject to changes
-data in a DW is referred to as non-volatile, static, or read-only
A data warehouse is developed for the retrieval of analytical information, and it is not meant for direct data entry by the users. The only functionality available to the users of the data warehouse is retrieval. The data in the data warehouse is not subject to modifications, insertions, or deletions by the end users. New data in the data warehouse is periodically loaded from the operational data sources and appended to the existing data, in an automatic fashion. The data that eventually gets older than the required time horizon is automatically purged from the data warehouse (and possibly archived and/or summarized). However, the data in the data warehouse is not subject to changes. That is why the data in the data warehouse is referred to as nonvolatile, static, or read-only.

characteristics of dimensions and facts

-dimension tables must have relatively static data (exception: a date dimension, whose records are added regularly)
-fact tables: records are added continually, so the table rapidly grows in size
-in analytical databases (data warehouses), dimension tables typically have orders of magnitude fewer records than fact tables

surrogate key in star schemas

-dimension tables are often given a simple, non-composite, system-generated key, called a surrogate key
-values are simple auto-increment integer values
-used as the primary key
-surrogate keys typically have no other meaning or purpose
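
A minimal surrogate-key sketch; GENERATED ... AS IDENTITY is the standard SQL form (MySQL uses AUTO_INCREMENT instead), and the table and column names are hypothetical:

CREATE TABLE productdim (
  productkey  INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key
  productid   CHAR(4),       -- original operational key, kept as a plain attribute
  productname VARCHAR(25) );

-- the DBMS assigns the auto-increment integer value itself
INSERT INTO productdim (productid, productname) VALUES ('P100', 'Zzz Bag');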

key attributes of a data warehouse

-enterprise: collects information about the entire organization
-integrated: collects information from many different sources
-subject-oriented (most important): focuses on one specific subject to be analyzed; the subject is the actual thing you are looking at, the numbers you are representing (e.g., sales, profit), so pick the specific measure you want to use (subject, measure, and fact all refer to the same thing)
-structured: the data is held in a database whose structure is represented by metadata
-historical: can go back many years, whereas operational databases can only go back a couple of years before they start getting too big
-time-variant: contains snapshots of data from different points in time

ETL infrastructure

-facilitates the retrieval of data from operational databases into the DW
-ETL includes the following tasks:
*extracting analytically useful data from the operational data sources
*transforming the data so that it conforms to the structure of the subject-oriented target DW model
*loading the transformed and quality-assured data into the target DW
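
A heavily simplified ETL sketch: data is extracted from a hypothetical operational table, transformed (a derived measure is computed and the operational key is mapped to a surrogate key), and loaded into a hypothetical warehouse fact table. Real ETL infrastructure involves far more code and quality-assurance logic:

INSERT INTO dw_salesfact (transactionid, productkey, unitssold, dollarssold)
SELECT t.tid,                        -- extract from the operational source
       p.productkey,                 -- transform: map operational key to surrogate key
       t.quantity,
       t.quantity * t.unitprice      -- transform: derive the dollar measure
FROM ops_salestransaction t
JOIN productdim p ON p.productid = t.productid;   -- load into the target DW table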

detailed versus aggregated fact tables

-fact tables in a dimensional model can contain either detailed data or aggregated data
-in detailed fact tables, each record refers to a single fact:
*line-item detailed fact table
*transaction-level detailed fact table
-in aggregated fact tables, each record summarizes multiple facts

granularity of the fact tables

-granularity describes what is depicted by one row in the fact table
-detailed fact tables have a fine level of granularity because each record represents a single fact
-aggregated fact tables have a coarser level of granularity than detailed fact tables, as records in aggregated fact tables always represent summarizations of multiple facts
-due to their compactness, coarser-granularity aggregated fact tables are quicker to query than detailed fact tables
-coarser-granularity tables are limited in terms of what information can be retrieved from them
-one way to take advantage of the query performance improvement provided by aggregated fact tables, while retaining the power of analysis of detailed fact tables, is to have both types of tables coexisting within the same dimensional model, i.e., in the same constellation
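
A minimal sketch of deriving a coarser-granularity aggregated fact table from a detailed one, reusing the hypothetical SALES fact table sketched earlier (CREATE TABLE ... AS is supported by many, though not all, RDBMSs):

-- each resulting row summarizes all detailed facts for one product on one day
CREATE TABLE sales_per_product_per_day AS
SELECT calendarkey,
       productkey,
       SUM(unitssold)   AS totalunits,
       SUM(dollarssold) AS totaldollars
FROM sales
GROUP BY calendarkey, productkey;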

DW: integrated

-integrates useful data from operational databases and other sources
-integration: the process of bringing the data from multiple data sources into a singular data warehouse
The idea behind a data warehouse is to create a repository of analytical information in the organization. This repository is physically separate from the existing operational databases in the organization. The data warehouse integrates the analytically useful data from those various operational databases. Integration refers to the process of bringing the data from multiple operational databases into a single data warehouse. Of course, in this process no data is actually removed from the operational sources. Instead, the analytically useful data from various operational databases is copied and brought into the data warehouse.

Type 3

-involves creating a "previous" and a "current" column in the dimension table for each column where changes are anticipated
-applicable in cases in which there is a fixed number of possible changes per column of a dimension, or in cases when only a limited history is recorded
-can be combined with the use of timestamps
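
A minimal Type 3 sketch, assuming a hypothetical customer dimension in which only the zip column is expected to change:

CREATE TABLE customerdim3 (
  customerkey INTEGER PRIMARY KEY,
  customerid  CHAR(4),
  previouszip CHAR(5),    -- value before the most recent change
  currentzip  CHAR(5) );  -- value after the most recent change

-- recording a zip change: the old current value moves to the "previous" column
UPDATE customerdim3
SET previouszip = currentzip,
    currentzip  = '60999'
WHERE customerid = 'C111';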

fact table attributes

-measures related to the subject of analysis
-foreign keys connecting the fact table to the dimension tables
-sometimes, fact tables can contain other attributes that are not measures; two common ones are:
*transaction identifier
*transaction time

multiple facts in a dimensional model

-when multiple subjects of analysis share the same dimensions, the dimensional model contains more than one fact table
-a dimensional model with multiple fact tables is called a constellation or galaxy of stars
-this approach enables:
~integrated development of analytical databases for multiple subjects of analysis
~dimensions being re-used instead of duplicated
~straightforward cross-fact analysis

source systems

-operational databases and repositories that provide analytically useful information for the DW subject areas
-each operational data store has two purposes:
*its original operational purpose
*serving as a source system for the data warehouse
-source systems can include external data sources

data warehouse front-end (BI) applications

-provides access to DW for users who are engaging in indirect use

database front-end

-provides access to the database for indirect use
*form:
-enables data input and retrieval for end users
-provides an interface into a database relation or query
*report:
-presents data and calculations on data from one or more tables
-formatted and arranged to be displayed on the screen or printed as a hard copy
*database front-end applications can include many other components and functionalities, such as menus, charts, graphs, and maps
~there can be multiple sets of front-end applications for different purposes or groups of end users

*subject-oriented

-refers to the fundamental difference in purpose between an operational database system and a data warehouse
-operational database system: developed to support a specific business operation
-data warehouse: developed to analyze specific business subject areas
The term "subject-oriented" refers to the fundamental difference in the purpose of an operational database system and a data warehouse. An operational database system is developed in order to support a specific business operation. On the other hand, a data warehouse is developed to analyze specific business subject areas.

*historical

-refers to the larger time horizon in a DW compared to operational databases
-operational databases typically hold about 1 year of data
-DWs hold many years of data
When compared to operational information, analytical information has a longer time horizon. Given that a data warehouse contains analytical information, its time horizon is longer (usually substantially so) than the time horizon in the operational databases. The term "historical" refers to the larger time horizon in data warehouses than in operational databases. For example, many traditional operational databases have a time horizon of 60 to 90 days, whereas it is quite common for data warehouses to contain multiple years' worth of data.

list the 4 options for enforcing referential integrity with DELETE and UPDATE operations. how are these options implemented in a relational database?

-restrict: does not allow changes or deletions that would impact records in the other table; allows them if there is no impact on the other table
-cascade: propagates the change or deletion to the matching records in the other table
-set to null: the impacted foreign key values are set to a blank (null) value
-set to default: the impacted foreign key values are set to a previously defined default value
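
A sketch of how these options are declared on a foreign key (hypothetical tables; RESTRICT, or the equivalent NO ACTION, is typically the default, and only one ON DELETE option can be chosen at a time):

CREATE TABLE department (
  deptid CHAR(2) PRIMARY KEY );

CREATE TABLE employee_ri (
  empid  CHAR(4) PRIMARY KEY,
  deptid CHAR(2) DEFAULT 'D1',       -- the value used by SET DEFAULT
  FOREIGN KEY (deptid) REFERENCES department (deptid)
    ON DELETE SET NULL               -- alternatives: RESTRICT | CASCADE | SET DEFAULT
    ON UPDATE CASCADE );             -- changes to deptid propagate to employees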

6 fundamental data quality characteristics

1. Accuracy: Accuracy of the data refers to the extent to which data correctly reflects the real-world instances it is supposed to depict. For example, if a customer's name is spelled incorrectly in a database, such an entry has an accuracy data quality problem. Correcting the typo would solve the accuracy data quality problem.
2. Uniqueness: Uniqueness of the data requires each real-world instance to be represented only once in the data collection. The uniqueness data quality problem is sometimes also referred to as data duplication. For example, a uniqueness data quality problem would occur if two records in the relation containing the data about customers referred to the exact same customer. Correcting this uniqueness problem would involve first recognizing that two records represent the same customer and then deleting one of those two records or combining the two records into one record.
3. Completeness: Completeness of the data refers to the degree to which all the required data is present in the data collection. For example, if a relation containing data about patients has a value for weight that is missing for one of the patients, such an omission indicates a data completeness problem. Every patient has a weight, and the patient record without it is incomplete. On the other hand, a missing value for the insurance policy number of a patient is not necessarily an indication of a completeness data quality problem. If a patient is uninsured, the missing value for the insurance policy number is actually the correct depiction of a real-world situation. However, if a patient is insured, the missing value for an insurance policy number constitutes a completeness data quality problem.
4. Consistency: Consistency of the data refers to the extent to which the data properly conforms to and matches up with the other data. For example, an organization may have two different sources of information about the overall company profits. If those two sources list two different figures for company overall profits, those two figures are inconsistent with each other.
5. Timeliness: Timeliness of the data refers to the degree to which the data is aligned with the proper time window in its representation of the real world. Typically, timeliness refers to the "freshness" of the data. If the data is sufficiently current for carrying out the users' tasks at hand, then such data does not have a timeliness problem. On the other hand, if the data is not up-to-date enough for proper use, then such data has a timeliness problem.
6. Conformity: Conformity of the data refers to the extent to which the data conforms to its specified format. Conformity data quality problems occur when an instance of data does not conform to a pre-agreed-upon format for that data. For example, let us assume that the dollar amount in a bank transaction has to be stored and presented with the currency symbol $ preceding the amount. If certain bank transactions have dollar amounts formatted with the currency symbol $ following the amount, such records have conformity data quality problems. Correcting this problem would involve formatting every amount in all bank transactions with the currency symbol $ preceding the amount.

Third normal form (3NF)

A table is in 3NF if it is in 2NF and it does not contain transitive functional dependencies.
Normalization of a relation to 3NF creates new relations for each set of transitive dependencies.
The primary key of the new relation is the nonkey column (or columns) that functionally determined the nonkey columns in the original relation.
The nonkey columns that were transitively determined in the original relation become part of the additional table.
The original table remains after normalizing to 3NF, but it no longer contains the transitively dependent columns.
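
A minimal 3NF sketch, assuming a hypothetical EMPLOYEE(empid, empname, deptid, deptname) relation in which the nonkey column deptid determines deptname (a transitive dependency). Normalizing moves the transitively determined column into a new relation:

-- new relation: its primary key is the determining nonkey column from before
CREATE TABLE dept (
  deptid   CHAR(2) PRIMARY KEY,
  deptname VARCHAR(25) );

-- original relation, minus the transitively dependent column
CREATE TABLE employee3nf (
  empid   CHAR(4) PRIMARY KEY,
  empname VARCHAR(25),
  deptid  CHAR(2) REFERENCES dept (deptid) );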

Operational data sources

-operational data sources include databases and other data repositories that support an organization's day-to-day operations
-a data warehouse is created and used as a separate analytical data store for the purpose of data analysis

independent data marts

~stand-alone data marts created by various groups within the organization, independent of other stand-alone data marts in the organization
-consequently, multiple ETL systems are created and maintained
~independent data marts are considered an inferior strategy:
-inability to perform straightforward analysis across the enterprise
-existence of multiple unrelated ETL infrastructures
~in spite of these obvious disadvantages, a significant number of corporate analytical data stores are developed as a collection of independent data marts

