Data Warehousing


Smaller surrogate keys translate into.....

smaller fact table rows

What are the *3 Measure Types*?

1. Additive 2. Semi-additive 3. Non-additive "You can always sum up numbers, but sometimes it makes less sense to....."

What are the 3 types of dimension change?

Type-1: *Overwrite* the dimension attribute Type-2: *Adding* a new dimension record/row Type-3: *Adding* a new attribute/column

How could we apply a Type-2 Change (Adding a new record) in the context of these rapidly changing dimension tables?

Use *mini-dimensions*

Describe Non-Additive Measures....

You can never sum these facts, regardless of the dimensions present. ex: a discount percentage cannot be summed together
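
The discount example can be sketched in a few lines. This is only an illustration with made-up sales lines (the prices and discount amounts are assumptions): adding the percentages produces a meaningless number, while summing the additive components and then deriving the ratio gives a usable overall percentage.

```python
# Hypothetical sales lines: (list_price_total, discount_amount).
lines = [
    (100.0, 10.0),   # 10% discount
    (300.0, 60.0),   # 20% discount
]

# Wrong: adding the percentages gives a number with no business meaning.
summed_pct = sum(100 * discount / price for price, discount in lines)

# Better: sum the additive components, then derive the ratio.
total_price = sum(price for price, _ in lines)
total_discount = sum(discount for _, discount in lines)
overall_pct = 100 * total_discount / total_price

print(summed_pct)   # 30.0
print(overall_pct)  # 17.5
```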

Data warehouses manage _____________ data. Databases manage ______________ data.

historical; current

What is a *Snapshot Fact Node*:

Measures in the fact node represent a "picture" of the activity at the end of a given period. (This allows one to measure the status of an organization.)

What is a *Transactional Fact Node*:

Measures in the fact table represent an event that occurred at a point in time. ex: - Retail sales: line items on a sales ticket - Financial services: individual deposits / withdrawals - Communications: individual calls

What are *Semi-Additive Measures*:

- Can be added only along some of the dimensions - Measures of intensity - Ex. account balance, inventory balance (snapshots of something taken at a point in time)

Snapshots usually only have....

....One fact table row per time period (at the occurrence of each snapshot)

In a TFT (transactional fact table), there are more ________________, but fewer _______________ than in a PFT (periodic fact table). Why?

dimensions; measures. In TFTs, more detail is collected, and dimensions are needed to describe it. In PFTs, the data for the period can easily be summed up in measures; hence, there are more measures.

When end-users are using the data-warehouse, what is the main challenge? What's a potential solution?

Getting end users to define what information they want from the warehouse and how they will use it; focus on the "business" dimensions used for decision making.

Although surrogate keys seem like replacements for primary keys, all DWs should.... Why?

represent surrogate keys; They cannot uniquely identify the data in the row. They're also smaller in length.

What are *Senior Business Management Sponsor(s)*? Why are they important?

Someone who will pay for the data warehouse and is willing to use it.

OLAP Cubes Offer......

sophisticated security options

A *fact table* holds...... A *dimension table* stores.....

the data to be analyzed; data about the ways in which the data in the fact table can be analyzed.

A _____________ fact node represents events occurring at a point in time

transactional

SURROGATE KEYS SHOULD BE ___________-_____________ INTEGERS

Auto-Incremented ....These dimension surrogate keys are simple integers, assigned in sequence, starting with the value 1, every time a new key is needed. (The date dimension is exempt from the surrogate key rule; this highly predictable and stable dimension can use a more meaningful primary key.)
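
As a rough sketch of the rule above, sequential key assignment might look like the following. The class name `SurrogateKeyGenerator`, the method `key_for`, and the customer codes are hypothetical; a real warehouse would normally let the database or ETL tool generate the keys.

```python
class SurrogateKeyGenerator:
    """Hands out sequential integer surrogate keys, starting at 1."""

    def __init__(self):
        self._next_key = 1
        self._by_natural_key = {}

    def key_for(self, natural_key):
        # Reuse the existing surrogate key if this natural key was seen before.
        if natural_key not in self._by_natural_key:
            self._by_natural_key[natural_key] = self._next_key
            self._next_key += 1
        return self._by_natural_key[natural_key]


gen = SurrogateKeyGenerator()
keys = [gen.key_for(code) for code in ["CUST-A", "CUST-B", "CUST-A"]]
print(keys)  # [1, 2, 1]
```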

How do surrogate keys increase performance?

Because surrogate keys are simple integers, performance is superior to that of a typical operational transaction system. It allows more efficient joins and smaller indices.

What are the constructs of a STAR Schema?

Fact Node - the key table in the center (contains measures) Dimension Node - the dependent tables Dimension Edge - the relationships between dimension tables and facts

What is an example of a *Factless Fact Table* being needed?

Factless fact tables are used for tracking a process or collecting stats *examples:* - Tracking student attendance or registration events - Tracking insurance-related accident events - Identifying building, facility, and equipment schedules for a hospital or university

Describe Additive Measures....

Facts (remember - numerical values) that can be added with any dimension in the fact table. ex: You can add hourly sales to get the sales for a day, week, month, quarter, or year. You can add sales across stores or regions.

Describe Semi-Additive Measures....

Facts that can be summed up for some of the dimensions in the fact table, but not all. These facts usually depend on the point in time at which the "transaction" records them. *Number of Items per Day* ex 1: if you have the number of items in the warehouse for each day, you can sum up the items for a given day (the warehouse total for that day), but you can't sum the daily figures over the entire year -- suppose you sold everything in the warehouse every day; would the total for the year be the quantity currently in the warehouse? *Number of Employees per Day* ex 2: you can aggregate department headcounts to give an organization total, but you cannot aggregate them over time. The Sales department headcount for March 31 may be 20 employees, and for April 30 it may be 23, but that does not mean that the total headcount at the end of April is 43.
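
The headcount example can be worked through in code. This is a minimal sketch: the Sales figures (20 and 23) come from the card above, while the year, the Support department, and the function names are assumptions added for illustration.

```python
# Hypothetical month-end headcount snapshots: (snapshot_date, department, headcount).
snapshots = [
    ("2024-03-31", "Sales", 20),
    ("2024-03-31", "Support", 12),
    ("2024-04-30", "Sales", 23),
    ("2024-04-30", "Support", 12),
]


def total_headcount_on(date, rows):
    """Summing across departments for one date is valid."""
    return sum(count for d, _, count in rows if d == date)


def latest_headcount(department, rows):
    """Across time, take the latest snapshot instead of summing."""
    dept_rows = [(d, count) for d, dept, count in rows if dept == department]
    return max(dept_rows)[1]  # ISO dates sort chronologically as strings


print(total_headcount_on("2024-04-30", snapshots))  # 35
print(latest_headcount("Sales", snapshots))         # 23, not 20 + 23 = 43
```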

Thus - based on the previously mentioned challenge - when we develop a data-warehouse what should be our first step? What do we focus on to accomplish this step?

Figure out what information the end-users want to collect from the DW; Focus on the *dimensions* and *measures* in order to get the proper requirements

*2. Declare granularity* Explain this...

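How many data attributes an entity has. ex: - Name, Address, Gender, City, State, and Country of a person is *High Granularity* - Male, Female, and Transgender as values of that person's Gender is *Low Granularity*. In dimensional modeling, declaring the grain also means stating exactly what a single fact table row represents (for example, one line item on a sales ticket).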

What's a good economical justification for developing a data warehouse?

Identifying costs and return on investment for the implementation of the data warehouse. For instance, should a business *build* or *buy* a data-warehouse?

Data warehouses provide..... Databases provide....

Information to support decision making; Information to support day-to-day operations

Define *feasibility*

It is the availability of clean data..... Emphasis on *clean* data!! (Garbage in = garbage out)

Elaborate on how a dimension modelling/design should be *extensible*

It should easily allow for the integration of new data

To elaborate on normalization...... what is *snowflaking*?

It's a method of normalizing the dimension tables in a star schema. For instance, *in a de-normalized star schema, every dimension is represented by a single table*. After snowflaking, a dimension has "child tables" that hold descriptions of its rows without cluttering the main dimension table.

What is *Normalization*? Why do it?

It's simply the process of removing redundant attributes from a de-normalized dimension table; *Why*? 1. Normalized tables are easier to update 2. Savings in storage space

Define *understandability* regarding dimension modelling....

Keeping everything as simple as possible, but not watered down

Define *Measures*:

Measures are the core of the dimensional model and are data elements that can be summed, averaged, or mathematically manipulated. *Measures are facts* - the quantitative values of the event (numbers we want to analyze)

What are *mini-dimensions*?

Mini-dimensions contain the rapidly changing attributes of the original dimension and are treated as a stand-alone dimension. A mini-dimension table is joined directly to the fact table rather than being snowflaked off the original dimension.

What's the effect on query performance when adding dimensions?

No adverse effect -- DWs are typically optimized for this kind of design paradigm

Explain *Online Analytical Processing (OLAP)*....

OLAP has a very simple concept. *It pre-calculates most of the queries that are typically very hard to execute over tabular databases, namely aggregation, joining, and grouping.* These queries are calculated during a process that is usually called 'building' or 'processing' of the OLAP cube. This process happens overnight, and by the time end users get to work - data will have been updated.
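
To make the "pre-calculation" idea concrete, here is a toy sketch that aggregates one additive measure for every subset of dimensions up front, so later lookups need no grouping work. The fact rows and the function name `build_cube` are assumptions; real OLAP engines use far more elaborate storage and processing.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: dimension values plus one additive measure.
facts = [
    {"region": "East", "product": "A", "sales": 10},
    {"region": "East", "product": "B", "sales": 5},
    {"region": "West", "product": "A", "sales": 7},
]
dimensions = ("region", "product")


def build_cube(rows, dims, measure):
    """Pre-aggregate the measure for every subset of dimensions."""
    cube = {}
    for size in range(len(dims) + 1):
        for group in combinations(dims, size):
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[d] for d in group)] += row[measure]
            cube[group] = dict(totals)
    return cube


cube = build_cube(facts, dimensions, "sales")
print(cube[("region",)])  # {('East',): 15, ('West',): 7}
print(cube[()])           # {(): 22}
```

Once the cube is built, a question such as "sales by region" is a dictionary lookup rather than a scan of the fact rows.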

What is an OLAP Cube?

Online Analytical Processing cube: a multidimensional array for analyzing data, often pictured with three dimensions.

What are pros/cons of the Type-2 Dimension Changing?

Pro: In this methodology *all history of dimension changes is kept in the database.* Cons: This could be a very expensive database option, as many changes would result in large dimension tables and more disk space used. (It is recommended to use these in databases where attribute changes are unlikely to take place)

What are pros/cons of Type-1 Dimension Changing?

Pros: Easy to maintain records and doesn't create large tables Cons: You lose historic information.

What are pros/cons of the Type-3 Dimension Changing?

Pros: Preserves historical values Cons: The number of historical values preserved is limited to the number of columns used for each value (keeping 3 past values would require 3 additional columns). Thus, this method is rarely needed.

*3. Identify Dimensions* Explain this....

Dimensions provide the "who, what, where, when, why, and how" context for the event: the hierarchies that the foreign keys in your fact table lead to.

Who has access to databases vs. data-warehouses?

*Data warehouses*: usually accessed by decision makers and data analysts. *Databases*: non-management employees typically have access.

Databases use Online Transaction Processing (OLTP), while data warehouses use...

*Online Analytical Processing (OLAP)*

Databases adopt an entity relationship diagram model, while data warehouses adopt a.....

*multidimensional data model*

*4. Identify Facts* Explain this....

- A *fact* is literally an entry in the fact table.... Quantitative measurements that result from your "business process" occurring

What are disadvantages of normalization/snowflaking?

- schema is less intuitive - ability to browse through content is more difficult - additional joins are needed that degrade query performance (esp. in browsers)

What are the 2 main purposes of *dimensions*?

1. Constrain querying 2. Filter the query result set

What are the 4 Steps to the Dimensional Modeling Design Process?

1. Select business process 2. Declare granularity 3. Identify dimensions 4. Identify facts

Key decisions made during design of a dimensional model:

1. Select the "business" process 2. Declare the granularity 3. Identify the dimensions 4. Identify the measures (facts)

The fact table consists of ____ types of columns. Name and Describe them.....

1. The *foreign key columns*, which allow joins with the dimension tables in the STAR Schema 2. The *measure columns*, which contain the data being analyzed ------------------ ex: Every sale is a fact that happens, and the fact table is used to record these facts.
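
The two column types can be seen in a tiny star schema built with Python's standard `sqlite3` module. The table and column names (`dim_store`, `fact_sales`, `store_key`, `sales_amount`) and the data are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE fact_sales (
        store_key INTEGER REFERENCES dim_store(store_key),  -- foreign key column
        sales_amount REAL                                    -- measure column
    );
    INSERT INTO dim_store VALUES (1, 'East'), (2, 'West');
    INSERT INTO fact_sales VALUES (1, 10.0), (1, 15.0), (2, 7.0);
""")

# Join the fact table to its dimension and aggregate the measure by region.
rows = conn.execute("""
    SELECT d.region, SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_store AS d ON d.store_key = f.store_key
    GROUP BY d.region
    ORDER BY d.region
""").fetchall()
print(rows)  # [('East', 25.0), ('West', 7.0)]
```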

What 2 types of fact nodes / tables are there?

1. Transactional 2. Snapshot

What question should you ask to determine if an entry is a fact or not?

"Does the attribute take on lots of values, and is it used in calculations?" If yes... FACT

While data-warehouses support complex transactions, databases can only do....

"Short" and "canned" transactions

Does a transactional or snapshot fact node have more measures?

Snapshot

The logical data model is usually structured as a....... Why?

Star Schema - Users are better able to navigate the model - Relational queries usually perform better against this structure

A ____________ _____________ is a good relational DB for building a ______________ ____________________.

Star Schema; OLAP Cube

What is a *Surrogate Key*? Why are they important?

System generated key values; Important because all data warehouse keys should represent surrogate keys

Explain the Type-3 Dimension Changing: *Adding* a new attribute/column

This changes an attribute and preserves historical information without adding a new row/record. 2 columns are used: one for the current value and another for the previous value(s).
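
A minimal sketch of a Type-3 change, assuming a single `previous_` column per tracked attribute (the function name, column naming, and region values are hypothetical). The second call shows the limitation: only as many old values survive as there are extra columns.

```python
def apply_type3_change(row, attribute, new_value):
    """Shift the current value into a 'previous_' column, then overwrite it."""
    row["previous_" + attribute] = row[attribute]
    row[attribute] = new_value
    return row


customer = {"surrogate_key": 1, "region": "East", "previous_region": None}
apply_type3_change(customer, "region", "West")
print(customer["region"], customer["previous_region"])  # West East
apply_type3_change(customer, "region", "North")
print(customer["previous_region"])  # West; "East" is no longer stored
```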

Does a transactional or snapshot fact node have more dimensions?

Transactional

Periodic fact tables allow one to measure the __________ of an organization

status

What are *Additive Measures*:

-Can be added along all dimensions -Measures of activity •Ex. $Sales, quantity sold, ...

What are *Non-additive measures*:

-Can't be added along any dimension

In a PFT, there is _______ fact table row per ________ __________.

one; time period

If no property of the fact can fulfill the purpose of the primary key (meaning there is no, single identifier for the row to be referenced), what do we use?

A *surrogate key* - a unique key that is artificially assigned to identify the row when there's no primary key. ....It should still correspond to an attribute in the fact table as a primary key would

Define *facts:*

A fact is a value, or measurement, which represents a fact about the managed entity or system ex: - tch_req_total = 1000 - tch_req_success = 820 - tch_req_fail = 180

What is a *fact table*?

A fact table is the *central table in a star schema of a data warehouse*. It stores quantitative information for analysis and is often denormalized.

What is a *Factless Fact Table*?

A factless fact table is a fact table that does not contain any facts (measures). It contains only dimension keys and captures events that are recorded for information only and are not used in calculations.

What is a *multidimensional data model*?

A multidimensional model views data in the form of a data-cube. A data cube enables data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.

What is a *Periodic/Snapshot Fact Table*?

A row in a periodic snapshot fact table *summarizes many measurement events occurring over a standard period, such as a day, a week, or a month*. The grain is the period, *the grain is not the individual transaction.*
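
One way to picture the change of grain is to roll transaction rows up to one row per period. The sketch below assumes a monthly period and made-up account transactions; the function name `monthly_snapshot` and the measure names are illustrative only.

```python
from collections import defaultdict

# Hypothetical transactional fact rows: (iso_date, account_key, amount).
transactions = [
    ("2024-01-05", 1, 100.0),
    ("2024-01-20", 1, -30.0),
    ("2024-02-03", 1, 50.0),
    ("2024-01-11", 2, 200.0),
]


def monthly_snapshot(rows):
    """Roll transactions up to one row per (month, account)."""
    totals = defaultdict(lambda: {"net_amount": 0.0, "transaction_count": 0})
    for iso_date, account_key, amount in rows:
        month = iso_date[:7]  # "YYYY-MM"
        totals[(month, account_key)]["net_amount"] += amount
        totals[(month, account_key)]["transaction_count"] += 1
    return dict(totals)


snapshot = monthly_snapshot(transactions)
print(snapshot[("2024-01", 1)])  # {'net_amount': 70.0, 'transaction_count': 2}
```

Note that the snapshot row carries more measures (a net amount and a count) than any single transaction row did.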

What is an issue with performing relational queries using the star schema?

Access time

How do you determine events that did not occur?

Coverage - activity = events which did not occur
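
The subtraction is a plain set difference. In this sketch the coverage and activity tables are modeled as sets of dimension-key tuples; the student and class keys are made up.

```python
# Hypothetical factless fact rows, each a tuple of dimension keys (student_key, class_key).
coverage = {(1, 101), (2, 101), (3, 101)}   # students registered for class 101
activity = {(1, 101), (3, 101)}             # students who actually attended

did_not_occur = coverage - activity
print(did_not_occur)  # {(2, 101)}
```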

Explain the Type-2 Dimension Changing: *Add* a new dimension record/row

Creating a new additional record. You capture an attribute change by adding a new row with a new surrogate key to the dimension table. Both the prior and new rows contain the natural key (or other durable identifier) as an attribute. Flags mark current data as "active" and historical data as "inactive".
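
A minimal sketch of a Type-2 change on an in-memory dimension follows. The `is_active` flag mirrors the card above; the `start_date`/`end_date` columns, the function name, and the customer data are assumptions commonly paired with that flag, not something the card specifies.

```python
from datetime import date


def apply_type2_change(dimension_rows, natural_key, new_attributes, effective):
    """Expire the active row for natural_key and append a new active row."""
    next_key = max((row["surrogate_key"] for row in dimension_rows), default=0) + 1
    for row in dimension_rows:
        if row["natural_key"] == natural_key and row["is_active"]:
            row["is_active"] = False
            row["end_date"] = effective
    dimension_rows.append({
        "surrogate_key": next_key,
        "natural_key": natural_key,
        **new_attributes,
        "start_date": effective,
        "end_date": None,
        "is_active": True,
    })
    return next_key


customers = [{
    "surrogate_key": 1, "natural_key": "CUST-A", "city": "Austin",
    "start_date": date(2023, 1, 1), "end_date": None, "is_active": True,
}]
new_key = apply_type2_change(customers, "CUST-A", {"city": "Denver"}, date(2024, 6, 1))
print(new_key, [(r["surrogate_key"], r["city"], r["is_active"]) for r in customers])
# 2 [(1, 'Austin', False), (2, 'Denver', True)]
```

Existing fact rows keep pointing at surrogate key 1, so history stays attached to the old attribute values, while new facts use key 2.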

What is an issue with the OLAP Cube?

Data Sparsity (too many empty cells)
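
Sparsity can be estimated as the share of possible cells that hold no data. The dimension members and populated cells below are invented for illustration.

```python
# Hypothetical populated cube cells keyed by (region, product, month).
cells = {("East", "A", "Jan"): 10, ("West", "B", "Feb"): 4}
regions, products, months = ["East", "West"], ["A", "B", "C"], ["Jan", "Feb"]

possible = len(regions) * len(products) * len(months)
sparsity = 1 - len(cells) / possible
print(possible, round(sparsity, 3))  # 12 0.833
```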

*1. Select business process* Explain this...

These are the operational activities performed by your organization. More specifically, *this is the measurement event to be modeled*

What are *dimensions*?

They simply provide *attributes* that provide more data about a fact in the fact table.

Define *Dimensions*:

They're the set of companion tables to a fact table, which contains the measures

Explain what a rapidly changing large dimension is....

This is an extremely large dimension whose attributes constantly change. The *rapid growth of this dimension will impact maintenance and performance as the dimension grows* (assuming Type-2 Changes are used) *Examples:* - Government agencies -> large "people" dimension (100 million or more) - Large retail stores -> large "product" dimension (several million) - Insurance companies -> large "automobile" dimension (millions of records)

Explain the Type-1 Dimension Changing: *Overwriting* the dimension attribute

This type is easy to maintain and is often used for data whose changes are caused by processing corrections (e.g. removing special characters, correcting spelling errors).
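
A Type-1 change amounts to an in-place update. The sketch below uses a made-up product row with a misspelled name; the function name is hypothetical.

```python
def apply_type1_change(dimension_rows, natural_key, attribute, corrected_value):
    """Overwrite the attribute in place; the old value is not kept."""
    updated = 0
    for row in dimension_rows:
        if row["natural_key"] == natural_key:
            row[attribute] = corrected_value
            updated += 1
    return updated


products = [{"surrogate_key": 1, "natural_key": "SKU-1", "name": "Widgit#"}]
apply_type1_change(products, "SKU-1", "name", "Widget")
print(products[0]["name"])  # Widget
```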

Are transactions for data warehouses a.) read-only or b.) read/write? What about databases?

a.) read-only; b.) read/write

Which contains different levels of granularity? a.) Databases b.) Data-warehouses

b.) Data-warehouses; Rather, databases are reflective of "the current state of the world"

When developing a data-warehouse, we need *Compelling Business Motivation*. What is this?

it ensures the data warehouse systems align with the strategic business motivations and initiatives.

Define *Granularity*:

Means 'Level of Division'. It is the extent to which an entity can be divided into attributes, so there are high levels of granularity as well as low levels of granularity. examples: - Employee: has a high level of granularity, with attributes like name, phone, email, etc. - Full Name: has a low level of granularity, with attributes like first name, middle name, last name.

Measures in the fact table represent an event that..... Measures in the fact node represent.....

occurred at a point in time; a "picture" of the activity at the end of a given period.

