OIM 350 Exam 2
augmented functional dependencies
Functional dependencies that contain an existing functional dependency; they add no new information to what the existing dependency already describes and can therefore be omitted -not depicted in the set of identified functional dependencies -ex. if A->B, then A,C->B is an augmented functional dependency
surrogate key
a simple, non-composite key in a dimension table whose values are simple auto-numbered integers; gives the dimension a new column that serves as the PK within the dimensional model instead of the operational key
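As a sketch (table and column names below are hypothetical), SQLite's INTEGER PRIMARY KEY behaves as an auto-numbered surrogate key that sits alongside the operational key carried over from the source system:

```python
import sqlite3

# SQLite treats an INTEGER PRIMARY KEY as an auto-numbered rowid, so
# CustomerKey gets simple integer values independent of the operational
# key (CustomerID) brought in from the source system.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_dim (
        CustomerKey  INTEGER PRIMARY KEY,  -- surrogate key
        CustomerID   TEXT,                 -- operational (natural) key
        CustomerName TEXT
    )
""")
conn.executemany(
    "INSERT INTO customer_dim (CustomerID, CustomerName) VALUES (?, ?)",
    [("C-100", "Acme"), ("C-200", "Globex")],
)
rows = conn.execute(
    "SELECT CustomerKey, CustomerID FROM customer_dim ORDER BY CustomerKey"
).fetchall()
print(rows)  # [(1, 'C-100'), (2, 'C-200')]
```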
update anomalies
problems that occur when relations contain redundant, unnecessarily repeated data
Data warehouse Components (4)
1. Source systems: operational databases that provide analytically useful info for the data warehouse's subject of analysis
2. Data warehouse: the collection of analytical data from the source systems (the destination for the data)
3. ETL (extraction-transformation-load): facilitates the retrieval of data from the operational databases into the data warehouse -extracts the useful data -transforms it so it conforms to the structure of the subject-oriented target data -loads the data into the warehouse
4. Front-end applications: allow straightforward access to data warehouse functionalities for users engaging in indirect use
When would you choose a Data mart vs. Data warehouse?
data mart = based on the same principles as a data warehouse but with a more limited scope -main differences: smaller, contains data about one subject, does not have an enterprise-wide focus, easier to implement (less time, money, resources), fewer data sources *use a data mart when focusing on one line of business; use a data warehouse when you want a big-picture view with data from multiple sources
read operation
data retrieval operation used for reading the data from relations
update operations
operations for inserting data into relations, modifying existing data in relations, and deleting data from relations
Types of Normal Forms
*1NF: each value in each column of a table must be a single value from the domain of the column; a table is in 1NF if each row is unique and no column in any row contains multiple values from the column's domain -eliminate groups of related multivalued columns: create a new separate table for each group of related multivalued columns, or increase the number of records · Column names are unique · Rows are unique · There are no multivalued attributes · All values in the same column are from the same domain · Column order is irrelevant
*2NF: the normalization process starts by examining whether a relational table is in 2NF; a table is in 2NF if it is in 1NF and contains no partial functional dependencies -a relation with a single-column primary key is automatically in 2NF -create an additional relation for each set of partial dependencies: the PK of the additional relation is the portion of the original PK that functionally determines the columns, and the columns that were partially determined in the original relation become part of the additional table
*3NF: a table is in 3NF if it is in 2NF and contains no transitive functional dependencies -create an additional relation for each set of transitive dependencies: the PK of the additional relation is the non-key column that functionally determined the other non-key columns in the original relation -the original table remains but no longer contains the transitively dependent columns
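The 3NF step can be sketched in plain Python (the StudentID/AdvisorID/AdvisorName columns are hypothetical): the transitively dependent column moves into a new relation whose PK is its determinant.

```python
# Hypothetical table with a transitive dependency:
# StudentID -> AdvisorID and AdvisorID -> AdvisorName, so AdvisorName
# is transitively dependent on the key StudentID.
students = [
    {"StudentID": 1, "AdvisorID": "A1", "AdvisorName": "Lee"},
    {"StudentID": 2, "AdvisorID": "A1", "AdvisorName": "Lee"},
    {"StudentID": 3, "AdvisorID": "A2", "AdvisorName": "Kim"},
]

# 3NF decomposition: the original relation keeps the key and the
# determinant; the transitively dependent column moves to a new
# relation keyed by AdvisorID (duplicates collapse away).
student_rel = [{"StudentID": r["StudentID"], "AdvisorID": r["AdvisorID"]}
               for r in students]
advisor_rel = sorted({(r["AdvisorID"], r["AdvisorName"]) for r in students})

print(advisor_rel)  # [('A1', 'Lee'), ('A2', 'Kim')]
```

Note how the repeated ("Lee") value that caused redundancy in the original table appears only once after decomposition.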
detailed vs aggregated fact tables
-detailed fact tables = each record refers to a single fact; more records in the fact table; fine level of granularity, provides unlimited possibilities for analysis -aggregated fact tables = each record summarizes multiple facts; the FKs to the dimension tables form the fact table's composite PK (as many PK components as dimensions); coarser level of granularity because records always represent summarizations of multiple facts; quicker to query than detailed fact tables, but there are limits on what info can be retrieved from them
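The relationship between the two can be sketched in Python (StoreKey/ProductKey dimensions are hypothetical): summing the detailed facts per combination of dimension keys yields the aggregated fact table.

```python
from collections import defaultdict

# Hypothetical detailed fact table: one record per sale (finest grain).
detailed_facts = [
    {"StoreKey": 1, "ProductKey": 10, "UnitsSold": 2},
    {"StoreKey": 1, "ProductKey": 10, "UnitsSold": 3},
    {"StoreKey": 2, "ProductKey": 10, "UnitsSold": 1},
]

# Aggregated fact table: one record per (StoreKey, ProductKey); the
# dimension FKs together act as the composite PK, and each record
# summarizes multiple detailed facts.
totals = defaultdict(int)
for f in detailed_facts:
    totals[(f["StoreKey"], f["ProductKey"])] += f["UnitsSold"]

aggregated_facts = [
    {"StoreKey": s, "ProductKey": p, "TotalUnits": t}
    for (s, p), t in sorted(totals.items())
]
print(aggregated_facts)
```

Once aggregated, the individual sales can no longer be recovered, which is exactly the trade-off the card describes: faster queries, less analytical flexibility.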
Types of Update anomalies (3)
1. insertion anomaly: occurs when a user who wants to insert data about one real-world entity is forced to enter data about another real-world entity 2. deletion anomaly: occurs when a user who wants to delete data about one real-world entity is forced to delete data about another real-world entity 3. modification anomaly: occurs when, in order to modify one value, the same modification has to be made multiple times
dimensional modeling
A data design methodology used for designing subject-oriented analytical databases (data warehouses or data marts); uses regular relational concepts -two types of tables: 1. dimension tables = contain descriptions of the business/organization to which the subject of analysis belongs (columns contain descriptive info that is often textual but can also be numeric; provides the basis for analysis of the subject) 2. fact tables = contain measures related to the subject of analysis and FKs associating them with the dimension tables; the measures are typically numeric and intended for mathematical computation -a dimension's primary key is a surrogate key; fact tables can also contain non-measure attributes (transaction identifier, transaction time)
Compare Kimball and Inmon approaches toward DW development
Inmon: Normalized Data Warehouse approach -> top-down, enterprise-wide view -a central data warehouse modeled as an ER model that is then mapped into a normalized relational database model serving as the physical store for the data warehouse; populated with analytical info from operational databases via the ETL process *uses an ER model; uses the dimensional model for data marts only; data marts are physically separated from the data warehouse
Kimball: Dimensionally Modeled Data Warehouse approach (data mart approach) -> not normalized, bottom-up -a collection of dimensionally modeled, intertwined data marts; differs from the normalized data warehouse in the technique used for modeling the data warehouse: -a set of commonly used dimensions (conformed dimensions) is created first -fact tables corresponding to subjects of analysis are added -a set of dimensional models is created in which each fact table is connected to multiple dimensions -result: a constellation of stars (star schemas) *uses the dimensional model (star schemas) to organize the data; unnecessary to separate data marts from the dimensional data warehouse; analytical systems can access the data directly
Differences between Analytical (data warehouse) vs Operational Info
Operational: shorter time horizon (60-90 days), detailed data, represents the current state of affairs, processes smaller amounts of data in queries issued much more often, regularly updated by users, aims to reduce data redundancy, used by all types of employees/users, application-oriented Analytical: larger time horizon because it must support analysis of trends over time, contains both detailed and summarized data, represents current and past data, processes larger amounts of data, retrieval only (updates are not allowed), not subject to update anomalies, used by a narrower set of users for decision-making purposes, subject-oriented
Describe the 3 approaches (types) for handling dimension tables with slowly changing attribute values
Type 1: used when the change in a dimension is the result of an error; simply changes the value in the dimension's record, so the new value replaces the old Type 2: used when history should be preserved; creates a new dimension record with a new surrogate key value every time a value in a dimension record changes; most commonly used approach; uses timestamps indicating the time interval (start/end date) during which the values were valid, and a row indicator column showing which records are currently valid Type 3: used when there is a fixed number of possible changes per dimension column, or when only limited history is recorded; creates a previous and a current column in the dimension table for each column where changes are anticipated; uses timestamps
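A minimal Type 2 sketch in Python (all column names hypothetical): the current record is expired and a new record is appended with a new surrogate key, timestamps, and a row indicator.

```python
from datetime import date

# Hypothetical customer dimension using Type 2 handling: each change
# creates a NEW record with a new surrogate key, start/end dates, and
# a row indicator ("Current").
dim = [{"CustomerKey": 1, "CustomerID": "C-100", "City": "Boston",
        "RowStart": date(2020, 1, 1), "RowEnd": None, "Current": True}]

def scd_type2_update(dim, customer_id, column, new_value, change_date):
    """Expire the current record and append a new current record."""
    for row in dim:
        if row["CustomerID"] == customer_id and row["Current"]:
            next_key = max(r["CustomerKey"] for r in dim) + 1
            row["RowEnd"] = change_date      # close the old record's interval
            row["Current"] = False           # old record is no longer current
            new_row = dict(row, CustomerKey=next_key, RowStart=change_date,
                           RowEnd=None, Current=True)
            new_row[column] = new_value
            dim.append(new_row)
            return

scd_type2_update(dim, "C-100", "City", "Amherst", date(2023, 6, 1))
current = [r for r in dim if r["Current"]]
print(len(dim), current[0]["City"])  # 2 Amherst
```

Both the old and new city values survive, so facts recorded before the change still join to the historically correct dimension record.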
linear search
a part of indexing -looking up a particular record based on a value in an unsorted column; the records are examined one by one, so it is slow
What is data quality? List and define the fundamental data quality characteristics.
data quality = data in a database is considered of high quality if it correctly reflects the real world it is designed to represent Accuracy: the extent to which the data correctly reflects its real-world counterpart Completeness: the degree to which all the required data is present in the data collection (no gaps/missing info) Consistency: the extent to which the data properly matches up with other data (ex. same currency across the dataset) Conformity: the extent to which the data follows the same format across the database Uniqueness: requires each real-world instance to be represented only once in the dataset; no data duplication Timeliness: the freshness of the data; the data is aligned with the proper time window it is supposed to represent
binary search
divides a sorted list into two parts of the same size by looking up the value in the middle of the list -if the searched-for value is larger than the middle value, the half containing the smaller values is eliminated; if it is smaller, the half containing the larger values is eliminated -the process repeats on the remaining half
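A minimal implementation of the idea:

```python
def binary_search(sorted_list, target):
    """Repeatedly halve the search interval of an ascending sorted list."""
    low, high = 0, len(sorted_list) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_list[mid] == target:
            return mid           # index of the found value
        elif sorted_list[mid] < target:
            low = mid + 1        # target is larger: discard smaller half
        else:
            high = mid - 1       # target is smaller: discard larger half
    return -1                    # value not in the list

print(binary_search([3, 8, 15, 21, 42], 21))  # 3
```

Each comparison halves the remaining list, so the search takes on the order of log2(n) steps instead of the n steps a linear search needs.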
What is drill down and drill up?
drill down: navigating the levels of data toward the most detailed (ex. presenting all the months instead of just the whole year) drill up: navigating the levels of data toward the most summarized (more compiled info, ex. showing the total across all distribution channels)
line-item detailed fact table
each row represents a line item of a particular transaction; contains the most detail; has a composite (two-part) primary key, so transaction ID values repeat across rows
transaction level detailed fact table
each row represents a particular transaction (single-column primary key); each row is identified by a unique transaction ID
Types of data marts
independent: a stand-alone data mart created in the same way as a data warehouse; has its own source systems and ETL infrastructure dependent: does not have its own source systems; its data comes from the data warehouse; a way to provide users with a smaller portion of the data warehouse's data when they do not need, or are not allowed, access to all the data in the warehouse
How to identify good primary keys
non-significant, arbitrary identifier IDs -want as few columns in the primary key as possible -the PK should NOT have repeated values in the table -if no single column qualifies, find a combination of columns whose combined values do not repeat
redundant data
a relation with multiple instances of data referring to the same occurrence
What is slice and dice?
slice: a subset of a multidimensional array, selected by fixing a value on one dimension dice: a slice on two or more dimensions (more than one slice at a time)
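Both operations can be sketched as filters over fact records (the Store/Product/Year dimensions below are hypothetical):

```python
# Hypothetical fact records with three dimensions (Store, Product, Year).
facts = [
    {"Store": "Boston",  "Product": "Soap",  "Year": 2023, "Units": 5},
    {"Store": "Boston",  "Product": "Towel", "Year": 2023, "Units": 2},
    {"Store": "Amherst", "Product": "Soap",  "Year": 2023, "Units": 7},
    {"Store": "Boston",  "Product": "Soap",  "Year": 2024, "Units": 4},
]

# Slice: fix a value on ONE dimension.
boston_slice = [f for f in facts if f["Store"] == "Boston"]

# Dice: fix values on TWO OR MORE dimensions at once.
boston_soap_dice = [f for f in facts
                    if f["Store"] == "Boston" and f["Product"] == "Soap"]

print(len(boston_slice), len(boston_soap_dice))  # 3 2
```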
equivalent functional dependency
-not depicted in the set of identified functional dependencies -occurs when two columns (or sets of columns) that functionally determine each other also determine other columns -ex. A->B and B->A
check clause
used to specify a constraint on a particular column of a relation
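A minimal sketch using SQLite (the table and constraint are hypothetical): an insert that violates the CHECK clause is rejected.

```python
import sqlite3

# A CHECK clause restricts which values a column may hold; here Price
# must be positive.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product (
        ProductID INTEGER PRIMARY KEY,
        Price REAL CHECK (Price > 0)
    )
""")
conn.execute("INSERT INTO product VALUES (1, 9.99)")    # passes the check
try:
    conn.execute("INSERT INTO product VALUES (2, -5)")  # violates the check
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Only the first row survives; the database itself, not the application, enforces the rule.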
Normal Forms
what normalization is based on; a set of particular conditions that a table has to satisfy, which reduce data redundancy -normalizing to first normal form is performed on a non-relational table in order to convert it into a relational table
referential integrity constraint ( in regards to database implementation)
deletions/modifications of records in the relation on the PK side affect the records in the relation on the FK side, but deletions/modifications of records in the relation on the FK side do not affect the records in the relation on the PK side
database front end application components
*provide access to the database for indirect use -FORM: purpose is to enable data input and retrieval for end users in a straightforward way -REPORT: purpose is to present the data, and calculations on the data, from one or more database tables in a formatted way; used strictly for the retrieval of data
normalization
-process used to improve the design of relational databases that contain redundant data and are therefore prone to update anomalies -the normalization process is based on analyzing the functional dependencies that exist within each relation in the database -based on normal forms
characteristics of dimensions and facts
-the # of rows in any dimension table is relatively small compared to the # of records in the fact table -a typical dimension contains relatively static data, while in a typical fact table records are added continually -fact tables contain measures related to the subject of analysis but can also contain other (degenerate dimension) attributes: -transaction identifier -transaction time
Describe and draw the 5 data warehousing architectures. Which architectures would Kimball and Inmon support (choose)?
1. Independent Data Marts: source systems -> ETL staging area -> independent data marts (atomic/summarized data) -> end-user access and applications
2. Data Mart Bus Architecture w/ Linked Dimensional Data Marts: source systems -> ETL staging area -> dimensionalized data marts linked by conformed dimensions (atomic/summarized data) -> end-user access and applications -Kimball
3. Hub-and-Spoke Architecture (Corporate Information Factory): source systems -> ETL staging area -> normalized relational warehouse (atomic data) -> end-user access and applications, with dependent data marts (summarized/some atomic data) connected to both of the last two -Inmon
4. Centralized Data Warehouse Architecture: source systems -> ETL staging area -> normalized relational warehouse (atomic/some summarized data) -> end-user access and applications -Inmon
5. Federated Architecture: existing data warehouses, data marts, and legacy systems -> data mapping/metadata (logical/physical integration of common data elements) -> end-user access and applications
Datawarehouse and Datamart Modeling Approaches(3)
1. Normalized Data Warehouse (Inmon): a central data warehouse modeled as an ER model that is then mapped into a normalized relational database model serving as the physical store for the data warehouse; populated with analytical info from operational databases via the ETL process 2. Dimensionally Modeled Data Warehouse (Kimball): a collection of dimensionally modeled, intertwined data marts; differs from the normalized data warehouse in the technique used for modeling the data warehouse: -a set of commonly used dimensions (conformed dimensions) is created first -fact tables corresponding to subjects of analysis are added -a set of dimensional models is created in which each fact table is connected to multiple dimensions -result: a constellation of stars (star schemas) 3. Independent Data Marts: stand-alone data marts created by various groups in the organization, independent of one another; multiple ETL systems must be created and maintained -considered an inferior strategy
Types of Functional Dependencies (that are used in normalization process)
1. Partial Functional Dependency: when a column of a relation is functionally dependent on a component of a composite primary key -can only occur when a relation has a composite primary key -ex. AdCampaignID and ModusID together form a composite primary key; a partial dependency exists if one of them alone functionally determines another column in the same table 2. Full Key Functional Dependency: when the primary key functionally determines a column of a relation and no separate component of the primary key partially determines the same column; occurs when the relation has a single-column primary key, OR when it has a composite primary key whose components only together functionally determine the column 3. Transitive Functional Dependency: when non-key columns functionally determine other non-key columns of a relation ***NOTE: every column that is not part of the primary key has to be functionally determined by the full key or by a partial dependency
trivial functional dependencies
when an attribute (or set of attributes) functionally determines itself -not depicted in the set of identified functional dependencies -ex. A->A or A,B->A,B
steps in development of data warehouse
1. Requirements collection, definition, and visualization: results in end-user requirements specifying the functionalities of the future data warehouse
2. Data warehouse modeling: creation of data warehouse models that are implementable by the data management software
3. Creating the data warehouse: implementing the data warehouse model as the actual (initially empty) data warehouse
4. Creating ETL infrastructure: creating the necessary procedures and code for automatic extraction of data from the operational databases and transformation of the data so it conforms to the structure of the data warehouse
5. Developing front-end applications: designing and creating applications for indirect use by the end users
6. Data warehouse deployment: releasing the data warehouse and front-end applications for use by the end users
7. Data warehouse use: the data can be used by the end users indirectly via the front-end applications
8. Data warehouse administration and maintenance: dealing with technical issues, ensuring sufficient hardware space, implementing recovery procedures
List the four options for enforcing referential integrity with DELETE and UPDATE operations? How are these options implemented in a relational database?
1. Restrict -delete: does not allow a record to be deleted if its PK value is referred to by a FK value -update: does not allow the PK value of a record to be changed if it is referred to by a FK value
2. Cascade (whatever change you make on one side is propagated to the other side) -delete: allows a record to be deleted even if its PK value is referred to by FK values, but all the records containing those referring FK values are also deleted -update: allows the PK value of a record to be changed, but all the FK values that refer to it are also changed and set to the new PK value
3. Set-to-null -delete/update: allows a record to be deleted/changed even if its PK value is referred to by FK values of records in another relation; in all the records whose FK value refers to the PK of the record being deleted/changed, the FK value is set to null
4. Set-to-default -delete/update: allows a record to be deleted/changed even if its PK value is referred to by FK values of records in another relation; in all the records whose FK value refers to the PK of the record being deleted/changed, the FK value is set to a predetermined default value
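Two of the options can be demonstrated in SQLite (the tables are hypothetical; note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON` on the connection):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enable FK enforcement
conn.executescript("""
    CREATE TABLE dept (DeptID INTEGER PRIMARY KEY, DeptName TEXT);
    CREATE TABLE emp_cascade (
        EmpID INTEGER PRIMARY KEY,
        DeptID INTEGER REFERENCES dept(DeptID) ON DELETE CASCADE
    );
    CREATE TABLE emp_setnull (
        EmpID INTEGER PRIMARY KEY,
        DeptID INTEGER REFERENCES dept(DeptID) ON DELETE SET NULL
    );
    INSERT INTO dept VALUES (1, 'Sales');
    INSERT INTO emp_cascade VALUES (10, 1);
    INSERT INTO emp_setnull VALUES (20, 1);
""")
conn.execute("DELETE FROM dept WHERE DeptID = 1")

# CASCADE: the referring record was deleted along with the dept record.
print(conn.execute("SELECT COUNT(*) FROM emp_cascade").fetchone()[0])  # 0
# SET NULL: the referring record remains, but its FK is now NULL.
print(conn.execute("SELECT DeptID FROM emp_setnull").fetchone()[0])    # None
```

Restrict is SQLite's behavior when no ON DELETE action is declared: the DELETE on `dept` would simply raise an IntegrityError.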
3 update operations
1. insert operation= used for entering new data into the relation 2. delete operation= used for removing data from the relation 3. modify operation= used for changing existing data in the relation
What is a DW and what are the key attributes of a DW?
A data warehouse is a separate data store whose main purpose is the retrieval of analytical information -structured repository: contains analytically useful data, not just a disorganized collection of data -time-variant & historical: contains snapshots of data from different periods of time; has a larger time horizon -integrated: brings data from multiple data sources into a single database -enterprise-wide: provides an organization-wide view of the analytical info (big picture) -subject-oriented: the main difference in purpose from an operational database; a data warehouse is developed to analyze specific business subject areas (delivers info about a theme, expressed in numbers) -contains both detailed and summarized data (but the detailed data is the most powerful)
write operation
update operation used to update the data content in the relations
index
a mechanism for increasing the speed of data search and retrieval on relations with a large number of records -linear search and binary search
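A sketch in SQLite (table and index names are hypothetical): EXPLAIN QUERY PLAN shows the same query switching from a full (linear) scan to an index search once an index exists.

```python
import sqlite3
import random

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (SaleID INTEGER PRIMARY KEY, Amount INTEGER)")
conn.executemany("INSERT INTO sale (Amount) VALUES (?)",
                 [(random.randint(1, 1000),) for _ in range(1000)])

# Without an index on Amount, SQLite must scan every row (linear search).
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sale WHERE Amount = 500").fetchall()
print(plan_before)  # plan detail mentions a SCAN of the table

# With an index, SQLite searches the index instead of scanning the table.
conn.execute("CREATE INDEX idx_amount ON sale(Amount)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sale WHERE Amount = 500").fetchall()
print(plan_after)  # plan detail mentions idx_amount
```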
What is the purpose of a DW and what are its benefits compared to an operational database?
reasons for creating the data warehouse as a separate analytical database: 1. the performance of operational tasks involving data use can be severely degraded if those tasks have to compete for computing resources with analytical queries 2. it is often very difficult to structure a database that can be used in a straightforward manner for both operational and analytical purposes -it also summarizes info over time, giving businesses the tools and intelligence for decision making because they can see the big picture and the trends going on in the company
functional dependencies
when the value of one (or more) column(s) in each record of a relation uniquely determines the value of another column in that same record -the determinants are typically IDs -ex. columns ClientID and ClientName: ClientID functionally determines ClientName because a particular ClientID value can be associated with only one ClientName value, but the same does not hold vice versa
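The definition can be checked mechanically over sample data; this sketch (a hypothetical helper, not from the course) tests whether each X value maps to exactly one Y value:

```python
# X functionally determines Y if every X value is associated with
# exactly one Y value across all records of the relation.
def determines(records, x, y):
    seen = {}
    for r in records:
        if r[x] in seen and seen[r[x]] != r[y]:
            return False        # one X value maps to two different Y values
        seen[r[x]] = r[y]
    return True

clients = [
    {"ClientID": 1, "ClientName": "Acme"},
    {"ClientID": 2, "ClientName": "Acme"},  # same name, different ID
    {"ClientID": 1, "ClientName": "Acme"},
]
print(determines(clients, "ClientID", "ClientName"))  # True
print(determines(clients, "ClientName", "ClientID"))  # False (not vice versa)
```

This mirrors the flashcard's ClientID/ClientName example: the dependency holds in one direction only, because two clients can share a name.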