Chapter 8: Big Data, Data Warehouses, and Business Intelligence Systems

Lakukan tugas rumah & ujian kamu dengan baik sekarang menggunakan Quizwiz!

What are the problems in operational data that create the need to clean data before loading the data into a data warehouse?

(1) Dirty Data (2) Missing Values (3) Inconsistent Data (4) Data not integrated but stored in different DBMSs. (5) Data in wrong format (6) Too much Data

What were the first two nonrelational data stores to be developed, and who developed them?

(1) Dynamo, developed by Amazon.com (2) BigTable, developed by Google

What are the characteristics of a Reporting System?

(1) Filter, sort, group and make simple calculations (2) Summarize current status (3) Compare current status to past or predicted status. (4) Classify entities (customers, products, employees, etc.) (5) Report delivery crucial

What are a couple of examples of data transformation?

(1) Instead of storing full phone numbers, you may want to just store the area codes or prefixes. (2) Data stored with two letter state codes may need to be transformed to full state names.

What are the challenges of fully replicating a DB but allowing only one computer to process updates?

(1) Make sure that only logically consistent copies of the DB are distributed. (2) Make sure the locations using the replicas understand that the replicas may not be completely updated.

What are the problems with operational DB's that limit their usefulness for BI applications?

(1) Need to check and modify operational data (2) Need to change or transform operational data

What are the characteristics of a Data Mining Application?

(1) Often employ sophisticated statistical and mathematical techniques. (2) Used for: ~What-if analyses ~Predictions ~Decisions (3) Results often incorporated into some other report or system.

What are the three sources of data for BI systems?

(1) Operational DB's (2) Extracts from Operational DB's (3) Purchased Data

What are two main categories of BI systems?

(1) Reporting (2) Data Mining

How do structured storage systems compare to RDBMS systems?

(1) Structured storage column families can have variable columns and data stored in each row in a way that is impossible in an RDBMS table. (2) Structured storage is not in 1NF or BCNF

If more than one computer can update a replicated DB what three problems can occur?

(1) There may be inconsistent updates. (2) One computer may delete a record that another computer is updating. (3) Changes may be made that violate uniqueness constraints.

What is an Operational Database?

(1) Used for structured transaction data processing (2) Current data are used (3) Data are inserted, updated, and deleted by users.

What is a Dimensional Database?

(1) Used for unstructured analytical data processing (2) Current and historical data are used (3) Data are loaded and updated systematically, not by users.

How are Structured Storage systems organized?

(1)The smallest unit of storage is called a column, but is really the equivalent of an RDBMS table cell. (2) A column consists of three elements: the column name, the column value or datum and a timestamp to record when the value was stored in the column. (3) Columns can be grouped into sets referred to as super columns. (4) Columns and super columns are grouped to create column families, which are the structured storage equivalent of RDBMS tables. (5) In a column family we have rows of grouped columns, and each row has RowKey, which is similar to the PK used in and RDBMS table. But a row in a column family doesn't have to have the same number of columns as another row in the same column family. They can have variable columns and data stored in each row. (6) All column families are contained in a keyspace, which provides the set of RowKey values that can be used in the data store.

Zettabyte (ZB)

1,024 EB

Terabyte (TB)

1,024 GB

Megabyte (MB)

1,024 KB

Gigabyte (GB)

1,024 MB

Exabyte (EB)

1,024 PB

Petabyte (PB)

1,024 TB

Yottabyte (YB)

1,024 ZB

Byte

8 bits [Store one character]

What is a Star Schema?

A DB design that uses a denormalized design to store historical data. It has a fact table at the center and dimension tables radiating out from the center. The fact table is always fully normalized, but dimension table may be non-normalized.

What is a Data Warehouse?

A DB system that has data, programs, and personnel that specialize in the preparation of data for BI processing, for the entire business. The data is: ~usually summarized, not detailed ~historical, not current ~multi-dimensional ~tuned for queries not insert, update, and delete transactions ~can be un-normalized

What is a Distributed DB?

A DB that has: (1) Been split into sections called partitions, and has the partitions stored on different computers, -OR- (2) Copies of the DB called replicas stored on different computers -OR- (3) Been both partitioned and replicated.

What is Hadoop?

A Distributed File System which provides standard file services to clustered servers so that their file systems can function as one distributed file system.

What is Cassandra?

A NoSQL nonrelational data store.

What is an OLAP dimension?

A characteristic of a measure. If the measure is PURCHASE, then dimensions could include PurchaseDate, CustomerType, etc.

What is a Slowly Changing Dimension?

A column of historical data that changes over time.

What is an RFM analysis?

A customer classification technique that consider how Recently (R) a customer ordered, how Frequently (F) a customer orders, and considers how much Money (M) the customer spends per order.

What is a Conformed Dimension?

A dimension table that is used in more than one star schema to maintain consistency within the data warehouse. It then links to two or more fact tables.

What is an OLAP measure?

A dimensional model fact - the data item of interest that is to be summed or averaged or otherwise processed in the OLAP report.

What is Structured Storage?

A nonrelational DBMS that is associated with the NoSQL movement.

What is a Measure?

A quantitative or factual record of business activity about the entity represented by the fact table.

What is an Object-Relational DB?

A relational DB that has the full functionality of a relational DB, but can also handle some object data.

What is an OLAP cube?

A report. Despite its name however, it may have any number of axes.

What is Distributed Two-Phase Locking?

A sophisticated form of record locking that must be used when DB transactions are processed on two or more machines.

What is an ETL system?

A system that reads data from operational DB's.

What is a Fact Table?

A table in a dimensional DB that is used to store measures of business activity, which are quantitative or factual data about the entity represented by the fact table.

What is a Dimension Table?

A table used to record values of attributes that describe the fact measures in the fact table, and these attributes are used in queries to select and group the measures in the fact table.

Kilobyte (KB)

About 1,024 bytes

Explain one way to partition a DB that has three tables: T1, T2, and T3.

Assume we have three DB servers: S1, S2 and S3. Put one table on one server: Server S1 holds Table T1 Server S2 holds Table T2 Server S3 holds Table T3

Explain one way to replicate a DB that has three tables: T1, T2, and T3.

Assume we have three DB servers: S1, S2, and S3. Put all three tables on each server: Server S1 holds Tables T1, T2 and T3 Server S2 holds Tables T1, T2 and T3 Server S3 holds Tables T1, T2 and T3

Why are OODBMSs not successful?

Because by the time they were introduced too much data was already stored in relational DBMS formats, and organizations do not want to have to convert this data to a different format just to be able to use it.

Explain why relational DBs are difficult to use for object persistence.

Because object structures are generally more complicated than a table row. Typically several rows of several tables are required to store the data about the object. This results in the need for special structures (essentially mini-databases) in the relational database just to hold the object data.

Why is the Time Dimension important in a Dimensional Model?

Because, as they are used for the analysis of historical data, they must be designed to handle data that changes over time. An example would be a customer that moved to a different address or state.

What is a commonly used example of MapReduce?

Counting how many times each word is used in a document.

What is Click-stream Data?

Data about a customer's clicking behavior on a Web page; such data are often analyzed by e-commerce companies.

What is Metadata?

Data concerning the structure of data in a DB stored in the data dictionary. Used to describe tables, columns, constraints, indexes, and so forth.

What does it mean to transform data?

Data in data warehouses is summary data, and may need to be aggregated to be used correctly. Data could also be in a different format and thus may need to be changed though it is at the same level of aggregation.

What is an example of Data Warehouse Metadata?

Data source, format, assumptions, and constraints

What does a Two-Dimensional Matrix show?

Each cell shows the total number of each product that has been purchased by each customer.

What does a Three-Dimensional Cube show?

Each cell will show the total quantity of each product that has been purchased by each customer on a specific date.

What does ETL stand for?

Extract, Transform, Load

What is the nonrelational data store for Hadoop called?

HBase

What are Business Information (BI) Systems?

Information systems that assist managers and other business professionals to (1) analyze current and past events and (2) predict future events.

What functions does an ETL perform?

It cleans and prepares the data for BI processing. This can be a complex process.

What is MapReduce Processing?

It is used to break a large analytical task into smaller tasks, assign each smaller task to a separate computer in the cluster, and then gather the results of each of those tasks and combine them in the final product of the original tasks.

What is the history of Hadoop?

It originated as part of Cassandra, but the project has spun off a nonrelational data store and a query language of its own.

What does OODBMS stand for?

Object-Oriented Database Management System

What organizations should consider using a distributed database?

Ones that have: (1) experienced database teams (2) a substantial budget for the project (3) a significant amount of time for the project (4) data communications expertise

What does OLAP stand for? What is its purpose?

Online Analytical Processing and its purpose is to provide the ability to sum, count, average, and perform other simple arithmetic operations on groups of data.

What is the query language for Hadoop called?

Pig

What problems can occur in a distributed DB that is partitioned but not replicated?

Problems only occur if a transaction updates data that spans two or more of the distributed partitions. The transaction must be initiated on all the servers, but must be allowed to commit on any one computer only if it can be committed on all the servers.

What is a Data Mart?

Stores data that reflects one component or functional area of the business. It is like a retail store in a supply chain.

What is meant by the term Object Persistence?

Storing the values of the properties of an object.

What is a Drill Down?

The ability to further divide data into more detail.

What solution is used to prevent the problems of more than one computer updating a replicated DB?

The associated problems can be prevented by using Distributed Two-Phase Locking.

What is the Enterprise Data Warehouse (EDW) Architecture?

The combined system, when the data mart structure is combined with the data warehouse architecture. In this configuration, the data warehouse maintains all enterprise BI data and acts as the authoritative source for data extracts provided to the data marts. The data marts receive all their data from the data warehouse; they do not add or maintain any additional data.

What does the term "Reduce" refer to in MapReduce?

The combining of the individual results into the final result.

What must be done when fully replicating a DB but allowing only one computer to process updates?

The copies of the updates must be periodically sent to the servers holding the other replicas.

What is Big Data?

The current term for the enormous datasets generated by Web applications such as search tool (for example Google and Bing) and Web 2.0 social networks (for example Facebook, LinkedIn, and Twitter). Although these new and very visible Web applications are highlighting the problems of dealing with large datasets, these problems were already present in other areas, such as scientific research and business operations.

What is the NoSQL movement (a.k.a. the Not Only SQL movement)?

The movement form Big Data DBs to the use of non-relational DBMSs, often known as structured storage. It is typically a distributed, replicated DB used where this type of a DMBS is needed to support large datasets. For example, both Facebook and Twitter use the Apache Software Foundations Cassandra Database.

What does the term "Map" refer to in MapReduce?

The work that is done on each individual computer.

What is the distinguishing characteristic of OLAP reports?

They are dynamic, which means that the user can change the format of the report while preparing it.

How do BI systems differ from transaction processing systems?

They do not support normal operations such as order processing.

What do BI's do with Extracts from Operational DB's?

They may run on separate DBMS using data imported from operational DBMSs.

What do BI's do with Operational DB's?

They read and process operational DBMS data, but do not insert, modify, or delete the data.

What do Operational Systems do?

They support primary business activities, such as sales, purchasing and inventory-control systems. They use DBMS to both read data from and store data to the DB.

Why are data warehouses necessary?

To overcome the problems of using operational DBs for BI applications.

What is the purpose of a OODBMS?

To provide a DBMS that stores object data and supported object persistence.

What are Operational Systems also known as?

Transactional Systems or Online Transaction Processing (OLTP) Systems

When does Sparse Data occur?

When the number of attributes (columns in a relational database) is large but the actual number of data instances is low.


Set pelajaran terkait

FINN 3013 Exam 1 Practice Questions

View Set

Organic Chemistry Chapter 17 Lipids and their functions in biochemical systems

View Set

Theatre 130 Final Kevin Ferguson

View Set

Chapter 14: The Nervous System: The Spinal Cord and Spinal Nerves

View Set

NOS 130- Lesson 5 Configuring Security & Devices

View Set

Potter & Perry: Chapter 48 Skin and Wound Care

View Set

Area, Perimeter, and Circumference

View Set

English Midterm Assessment Concorde

View Set