BIA Exam 2

data cleansing (scrubbing)

the detection and correction of low-quality, redundant data

online transaction processing (OLTP)

updating (i.e. inserting, modifying and deleting), querying and presenting data from databases for operational purposes

delete set-to-default

a referential integrity constraint option that allows a record to be deleted even if its PK value is referred to by a FK value of a record in another relation; if the record is deleted, the referring FK values are set to a default value

linear (sequential) search

a search method that finds a particular value in a list by checking elements sequentially, one at a time, until the searched-for value is found
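A minimal sketch in Python (illustrative only, not from the course materials):

```python
def linear_search(items, target):
    """Check elements one at a time until the target is found."""
    for index, value in enumerate(items):
        if value == target:
            return index          # position of the first match
    return -1                     # target not present

print(linear_search([7, 3, 9, 4], 9))  # prints 2
```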

binary search

a search method that takes advantage of sorted lists by repeatedly halving the portion of the list that could contain the searched-for value
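A minimal sketch in Python (illustrative only), assuming the list is already sorted in ascending order:

```python
def binary_search(sorted_items, target):
    """Repeatedly halve the search interval of a sorted list."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid
        elif sorted_items[mid] < target:
            low = mid + 1             # target can only be in the upper half
        else:
            high = mid - 1            # target can only be in the lower half
    return -1

print(binary_search([3, 4, 7, 9], 7))  # prints 2
```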

accuracy

the extent to which data correctly reflects the real-world instances it is supposed to depict

conformity

the extent to which the data conforms to its specified format

data warehouse use

- the retrieval of the data in the data warehouse -Indirect use = via the front-end (BI) applications -Direct use = via the DBMS or via the OLAP (BI) tools

data mart

A data store with the same principles as a DW, but with a more limited scope -Three of the most common data warehouse and data mart modeling approaches: --Normalized data warehouse --Dimensionally modeled data warehouse --Independent data marts

constellation (galaxy) of stars

A dimensional model with multiple fact tables

corrective data quality actions

Actions taken to correct the data quality problems

preventive data quality actions

Actions taken to preclude data quality problems

production release

Actual deployment of functioning system

What is meant by the statement - "Each of us is now a walking data generator."

We are all using phones, going on websites, and using Facebook; companies are looking at and tracking what we look at.

beta release

Deployment of system to a selected group of users to test system usability

timestamps

columns that indicate the time interval for which the values in the records are applicable

How did comScore make their big data more consumable, and less overwhelming, for customers?

Actionable insights, visualizations, and dashboards; supported wizards and knowledge portals; formatted data in a specific way so it is easier to digest.

detailed data

data composed of single instances of data

type 1 approach

-Changes the value in the dimension's record --The new value replaces the old value. -No history is preserved -Simplest approach, used most often when a change in a dimension is the result of an error

delete cascade

a referential integrity constraint option that allows a record to be deleted even if its PK value is referred to by a FK value of a record in another relation; if the record is deleted, the records whose FK values refer to it are also deleted
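A hedged sketch of the cascade behavior using Python's built-in sqlite3 module; the table and column names (dept, emp) are illustrative, not from the course materials:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE dept (deptid INTEGER PRIMARY KEY, dname TEXT)")
conn.execute("""CREATE TABLE emp (
                  empid  INTEGER PRIMARY KEY,
                  deptid INTEGER REFERENCES dept(deptid) ON DELETE CASCADE)""")
conn.execute("INSERT INTO dept VALUES (1, 'Sales')")
conn.execute("INSERT INTO emp VALUES (10, 1)")

conn.execute("DELETE FROM dept WHERE deptid = 1")   # allowed: delete cascades
print(conn.execute("SELECT COUNT(*) FROM emp").fetchone()[0])  # 0 - referencing row was deleted too
```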

Online analytical processing (OLAP) tools

tools enabling end users to engage in ad-hoc analytical querying of data warehouses

backup

saving additional physical copies of the data

Why is it challenging to find professionals that can effectively work with Big Data?

Because there is a large scope of different skills needed. You can't just hire computer scientists, because they are missing the business side of it, and vice versa: business people don't have the technical side. The skills span communication, math, coding, business, and domain expertise. Business students bring the business and communication side and some domain expertise; computer scientists bring math, coding, and statistics.

COMMIT

Causes all the updates to be recorded on the disk

multiuser system

Data manipulation component used by multiple users at the same time

single-user system

Data manipulation component used by one user at a time

transaction-level detailed fact table

Each row represents a particular transaction

alpha release

Internal deployment of a system to members of development team for initial testing of functionality

authentication

Login procedure using user ID and password

database front-end

Provides access to the database for indirect use -Can include many other components and functionalities, such as: menus, charts, graphs, maps, etc. -Can be multiple sets of front-end applications for different purposes or groups of end-users

pivot (rotate)

Reorganizes the values displayed in the original query result by moving values of a dimension column from one axis to another

data dictionary

Repository of the metadata

ROLLBACK

Rolls back all the updates since the last COMMIT
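A minimal sketch of COMMIT versus ROLLBACK using Python's built-in sqlite3 module (the account table and its values are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100)")
conn.commit()                                   # updates so far are recorded permanently

conn.execute("UPDATE account SET balance = balance - 40 WHERE id = 1")
conn.rollback()                                 # undo all updates since the last COMMIT

print(conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0])  # 100
```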

What is a DW and what are the key attributes of a DW. What is the purpose of a DW and what are its benefits compared to an operational database?

Subject-oriented, integrated, time-variant (time series), nonvolatile, summarized, not normalized, metadata, web-based, relational/multi-dimensional, client/server, real-time/right-time/active ...

application development component

Used to develop front-end applications

CHECK

Used to specify a constraint on a particular column of a relation
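A hedged sketch of a CHECK constraint via Python's sqlite3 module; the product table, its columns, and the price rule are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE product (
                  productid INTEGER PRIMARY KEY,
                  price     NUMERIC CHECK (price > 0))""")
conn.execute("INSERT INTO product VALUES (1, 9.99)")        # passes the constraint
try:
    conn.execute("INSERT INTO product VALUES (2, -5)")      # violates CHECK (price > 0)
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```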

update cascade

a referential integrity constraint option that allows the PK value of a record to be changed even if it is referred to by FK values of records in another relation; if the PK value is changed, all the FK values that refer to it are changed as well

decryption key

reverts the information to its original state

data warehouse modeling

(logical data warehouse modeling ) - creation of the data warehouse data model that is implementable by the DBMS software

slice and dice

-Adds, replaces, or eliminates --Specified dimension attributes, or --Specific values of dimension attributes -Slice = Filter -Dice = 2 or more slices

conformed dimensions

A set of commonly used dimensions

dimensional modeling

-A data design methodology used for designing subject-oriented analytical databases, such as data warehouses or data marts -Commonly, dimensional modeling is employed as a relational data modeling technique -In addition to using the regular relational concepts (primary keys, foreign keys, integrity constraints, etc.) dimensional modeling distinguishes two types of tables: --Dimensions --Facts

data warehouse

-A structured repository of integrated, subject-oriented, enterprise-wide, historical, and time-variant data. -Purpose: retrieval of analytical information. A data warehouse can store detailed and/or summarized data. -DW sometimes referred to as target system - a destination for data from source systems -Typical DW retrieves selected, analytically useful data from operational data sources

access privileges

-Assigned to the database user account -Determine user's privileges on database columns, relations and views -Include the following actions: -SELECT, UPDATE, ALTER, DELETE, INSERT

catalog

-Data dictionary created by the DBMS

type 3 approach

-Involves creating a "previous" and "current" column in the dimension table for each column where changes are anticipated -Applicable in cases in which there is a fixed number of changes possible per column of a dimension, or in cases when only a limited history is recorded. -Can be combined with the use of timestamps

recovery log

-Logs database updates -Ensures against loss of updates

source systems

-Operational databases and repositories that provide analytically useful information in DW subject areas -Each operational data store has two purposes: --Original operational purpose --Source system for the data warehouse -Source systems can include external data sources

checkpoints

-Part of a recovery log -Indicates a point when updates are written on the disk

report

-Presents data and calculations on data from one or more tables -Formatted and arranged to be displayed on the screen or printed as a hard copy

authorization matrix

-implements access privileges -Provided by the DBMS -Managed by the DBA

creating the data warehouse

-using a DBMS to implement the data warehouse data model as an actual data warehouse -Typically, data warehouses are implemented using a relational DBMS (RDBMS) software

transaction identifier

a column representing transaction ID

assertion

a mechanism for specifying user-defined constraints

multidimensional database model

a model for implementing dimensionally modeled data in which the database is implemented as a collection of cubes

update set-to-null

a referential integrity constraint option that allows the PK value of a record to be changed even if it is referred to by a FK value of a record in another relation; if the PK value is changed, the referring FK values are set to null

update set-to-default

a referential integrity constraint option that allows the PK value of a record to be changed even if it is referred to by a FK value of a record in another relation; if the PK value is changed, the referring FK values are set to a default value

update restrict

a referential integrity constraint option that does not allow the PK value of a record to be changed if it is referred to by a FK value of a record in another relation

delete set-to-null

a referential integrity constraint option that allows a record to be deleted even if its PK value is referred to by a FK value of a record in another relation; if the record is deleted, the referring FK values are set to null

delete restrict

a referential integrity constraint option that does not allow a record to be deleted if its PK value is referred to by a FK value of a record in another relation
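A hedged sketch showing how the delete/update options from the cards above are typically declared in a FOREIGN KEY clause (table and column names are illustrative; exact option support and syntax vary by DBMS), using Python's sqlite3 module:

```python
import sqlite3

ddl = """
CREATE TABLE emp (
  empid  INTEGER PRIMARY KEY,
  deptid INTEGER DEFAULT 0,
  FOREIGN KEY (deptid) REFERENCES dept(deptid)
    ON DELETE SET NULL        -- alternatives: CASCADE, SET DEFAULT, RESTRICT
    ON UPDATE CASCADE         -- alternatives: SET NULL, SET DEFAULT, RESTRICT
)
"""
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dept (deptid INTEGER PRIMARY KEY, dname TEXT)")
conn.execute(ddl)   # the chosen options are enforced once PRAGMA foreign_keys = ON
```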

snowflake models

a star schema in which the dimension tables are normalized

row indicator

column that provides a quick indicator of whether the record is currently valid

detailed fact tables

each record refers to a single fact

aggregated fact tables

each record summarizes multiple facts

query optimization

examining multiple ways of executing the same query and choosing the fastest option
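One way to peek at the execution strategy the optimizer chose is SQLite's EXPLAIN QUERY PLAN (other DBMSs have similar EXPLAIN commands); the sales table and index names below are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (saleid INTEGER PRIMARY KEY, custid INTEGER, amount NUMERIC)")
conn.execute("CREATE INDEX idx_sales_cust ON sales(custid)")

plan = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM sales WHERE custid = 42").fetchall()
for row in plan:
    print(row)   # shows whether the optimizer picked the index or a full table scan
```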

What is preattentive processing and how can a dashboard provide it? Give some examples. Which forms of preattentive processing are best for presenting quantitative information?

The goal is to support automatic processing instead of sequential processing: instead of reading every line of text (e.g., to find the 5s), if you just bold all of the 5s you can easily pick them out. It uses certain attributes, such as line length, bolding, shape, enclosure, and saturation, to make something stand out so the user can find it easily. The forms best for presenting quantitative information are line length and x-y coordinate (spatial) representation.

operational information (transactional information)

information collected and used in support of day to day operational needs -time:days/months, detailed, current -small amounts used in a process, high frequency of access, can be updated, non-redundant -used by all types of employees for tactical purposes, application oriented

What are the general design guidelines for dashboards?

No scrolling; use white space to draw attention; don't overuse color; maximize the data-ink ratio. The user should be able to drill down, and the dashboard should be intuitive to use: the user should be able to use it automatically, without any training.

do we really need big data?

privacy issues regarding big data..

online analytical processing (OLAP)

querying and presenting data from data warehouses and/or data marts for analytical purposes

recovery

recovering the content of the database after a failure

data warehouse deployment

releasing the data warehouse and its front-end (BI) applications for use by the end users

consistency

the extent to which the data properly conforms to and matches up with the other data

Why are bullet graphs and sparklines these graphics better than other widgets such as gauges and other graphics?

We want most of the ink devoted to data, not non-data ink: no ink for backgrounds, gridlines, borders, or too much color. -Sparklines = space-efficient time-series context for measures; give a quick sense of a measure. -Bullet graph = displays a key measure along with a comparative measure (like a target) and a qualitative scale, so it instantly shows whether the measure is good or bad.

What does passive measurement mean and how did comScore use it?

Tracking users while they are online, without any input from or contact with the user (not asking them what they are doing). The user gives comScore permission to monitor everything they do online; comScore gets users to agree to this by paying them. It collects data from panelists and compiles it to see how they feel about certain companies and products.

encryption key

- information scrambling algorithm

dimensionally modeled data warehouse

-Collection of dimensionally modeled intertwined data marts (i.e. constellation of dimensional models) that integrates analytically useful information from the operational data sources -Same as normalized data warehouse approach when it comes to the utilization of operational data sources and the ETL process -Fact tables corresponding to the subjects of analysis are subsequently added -A set of dimensional models is created w/each fact table connected to multiple dimensions, and some dimensions are shared by more than one fact table -Additional dimensions are included as needed -Resulting DW is a collection of intertwined dimensionally modeled data marts, i.e. a constellation of stars -Can be used as a source for dependent data marts and other views, subsets, and/or extracts

dimension tables (dimensions)

-Contain descriptions of the business, organization, or enterprise to which the subject of analysis belongs -Columns in dimension tables contain descriptive information that is often textual (e.g., product brand, product color, customer gender, customer education level), but can also be numeric (e.g., product weight, customer income level) -This information provides a basis for analysis of the subject

fact tables

-Contain measures related to the subject of analysis and the foreign keys (associating fact tables with dimension tables) -The measures in the fact tables are typically numeric and are intended for mathematical computation and quantitative analysis -Foreign keys connecting the fact table to the dimension tables -Measures related to the subject of analysis -Sometimes, fact tables can contain other attributes that are not measures. Two common ones are: --Transaction identifier --Transaction time

type 2 approach

-Creates a new additional dimension record using a new value for the surrogate key every time a value in a dimension record changes -Used in cases where history should be preserved -Can be combined with the use of timestamps and row indicators
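A hedged sketch of the Type 2 approach using Python's sqlite3 module; the customer dimension, its columns (surrogate key, operational key, zip, row indicator), and the sample values are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer_dim (
                  customerkey INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
                  customerid  TEXT,                               -- operational key
                  zip         TEXT,
                  row_current TEXT                                -- row indicator
                )""")
conn.execute("INSERT INTO customer_dim (customerid, zip, row_current) VALUES ('C1', '60601', 'Y')")

# Customer C1 moves: retire the old row, insert a new row with a new surrogate key
conn.execute("UPDATE customer_dim SET row_current = 'N' WHERE customerid = 'C1' AND row_current = 'Y'")
conn.execute("INSERT INTO customer_dim (customerid, zip, row_current) VALUES ('C1', '60614', 'Y')")

for row in conn.execute("SELECT * FROM customer_dim"):
    print(row)   # both rows are preserved, so history is kept
```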

dependent data mart

-Does not have own source systems -Data comes from a DW

form

-Enables data input and retrieval for end users -Provides an interface into a database relation or query

refresh load

-Every subsequent load is referred to as a refresh load -time period for loading new data (e.g. hourly, daily). -Determined in advance: --Based on needs of DW users and technical feasibility --In active DW, loads occur in continuous micro batches

extraction-transformation-load (ETL)

-Facilitates retrieval of data from operational databases into DW -ETL includes the following tasks: -Extracting analytically useful data from operational data sources -Transforming data to conform to structure of the subject-oriented target DW model -Loading transformed and quality-assured data into target DW
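A minimal end-to-end ETL sketch in Python, using two in-memory SQLite databases as stand-ins for a source system and the target DW; all table names, columns, and the quality rule are illustrative assumptions:

```python
import sqlite3

# Source system (stand-in for an operational database)
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (orderid TEXT, amount NUMERIC)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [("O1", 100.0), ("O2", None), ("O3", 250.0)])

# Target data warehouse
dw = sqlite3.connect(":memory:")
dw.execute("""CREATE TABLE sales_fact (
                salekey INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key added in transformation
                orderid TEXT, amount NUMERIC)""")

# Extract: pull analytically useful data from the source
rows = source.execute("SELECT orderid, amount FROM orders").fetchall()
# Transform: simple data quality check (drop rows with a missing amount)
clean = [(oid, amt) for oid, amt in rows if amt is not None]
# Load: batch insert into the target DW table
dw.executemany("INSERT INTO sales_fact (orderid, amount) VALUES (?, ?)", clean)

print(dw.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0])  # 2 rows loaded
```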

granularity

-Granularity describes what is depicted by one row in the fact table -Detailed fact tables have fine level of granularity because each record represents a single fact -Aggregated fact tables have a coarser level of granularity than detailed fact tables as records in aggregated fact tables always represent summarizations of multiple facts -Due to their compactness, coarser granularity aggregated fact tables are quicker to query than detailed fact tables -Coarser granularity tables are limited in terms of what information can be retrieved from them -One way to take advantage of the query performance improvement provided by aggregated fact tables, while retaining the power of analysis of detailed fact tables, is to have both types of tables coexisting within the same dimensional model, i.e. in the same constellation
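A hedged sketch of rolling a detailed fact table up into a coarser-granularity aggregated fact table with GROUP BY, via Python's sqlite3 module; table and column names are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_detail (datekey INTEGER, productkey INTEGER, qty INTEGER)")
conn.executemany("INSERT INTO sales_detail VALUES (?, ?, ?)",
                 [(1, 10, 2), (1, 10, 3), (1, 20, 1), (2, 10, 4)])

# Aggregated fact table: one row per date/product combination (coarser grain)
conn.execute("""CREATE TABLE sales_daily AS
                SELECT datekey, productkey, SUM(qty) AS total_qty
                FROM sales_detail
                GROUP BY datekey, productkey""")

for row in conn.execute("SELECT * FROM sales_daily ORDER BY datekey, productkey"):
    print(row)   # fewer, summarized rows; quicker to query, less detail available
```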

How has Hadoop enhanced our ability to leverage (effectively use) Big Data?

-High-level question -It gives us parallel processing plus distributed file storage and processing. And because of the current time period, the hardware for a cluster of 200 nodes is inexpensive and has the combined abilities of a supercomputer. -Cheap technology

Compare Kimball and Inmon approaches toward DW development.

-Inmon: EDW approach (top-down) -Kimball: data mart approach (bottom-up)

first load

-Initial load -populates empty DW tables -Can involve large amounts of data, depending on desired time horizon of the DW

normalized data warehouse

-Integrated analytical database modeled w/traditional database modeling techniques of ER modeling and relational modeling, resulting in a normalized relational database schema -Populated with analytically useful data from the operational data sources via the ETL process -Serves as source data for dimensionally modeled data marts and for any other non-dimensional analytically useful data sets

executive dashboard

-Intended for use by higher level decision makers within an organization -Contains organized easy-to-read display of critically important queries describing organizational performance -In general, the usage of executive dashboards should require little or no effort or training -Executive dashboards can be web-based

Similarities and differences across the three main systems

-Operational databases: day-to-day, not designed for analytical decision making; typically up to 1 year of data; ER diagrams, relational models, SQL
-Analytical databases/warehouses: more than 1 year, many years of data; the purpose is to support decision making; the data is not changed
-Big data / Hadoop: no data models, because data comes in so fast it could not be put into models efficiently; constantly changing; can store many years of data; can have a short-term or long-term scope depending on the needs

developing front-end (BI) applications

-Provides access to DW for users who are engaging in indirect use -design and create applications for end-users -Front-end applications included in most data warehousing systems, referred to as business intelligence (BI) applications -Front-end applications contain interfaces (such as forms and reports) accessible via a navigation mechanism (such as a menu)

subject-oriented

-Refers to the fundamental difference in the purpose of an operational database system and a data warehouse. -Operational database system - developed to support a specific business operation -data warehouse - developed to analyze specific business subject areas

extraction

-Retrieval of analytically useful data from operational data sources to be loaded into DW -Examination of available sources -Available sources and requirements determine DW model -DW model provides a blueprint for ETL infrastructure and extraction procedures

When would you choose a Data mart vs. Data warehouse? See Table 2.3 in slides.

-Scope: DM: one subject area; EDW: multiple subject areas
-Development time: DM: months; EDW: years
-Development cost: DM: $10,000-$100,000+; EDW: $1,000,000+
-Development difficulty: DM: low to medium; EDW: high
-Data prerequisite for sharing: DM: common (within business area); EDW: common (across enterprise)
-Sources: DM: only some operational and external systems; EDW: many operational and external systems
-Size: DM: megabytes to several gigabytes; EDW: gigabytes to petabytes
-Time horizon: DM: near-current and historical data; EDW: historical
-Data transformations: DM: low to medium; EDW: high
-Update frequency: DM: hourly, daily, weekly; EDW: weekly, monthly
-Hardware: DM: workstations and departmental servers; EDW: enterprise servers and mainframe computers
-Operating system: DM: Windows and Linux; EDW: Unix, z/OS, OS/390
-Databases: DM: workgroup or standard database servers; EDW: enterprise database servers
-Number of simultaneous users: DM: 10s; EDW: 100s-1,000s
-User types: DM: business area analysts and managers; EDW: enterprise analysts and senior executives
-Business spotlight: DM: optimizing activities within the business area; EDW: cross-functional optimization and decision making

independent data mart

-Stand-alone data mart, created in the same fashion as DW -has own source systems and ETL infrastructure -multiple ETL systems are created and maintained -an inferior strategy --Inability for straightforward analysis across the enterprise --The existence of multiple unrelated ETL infrastructures -In spite of obvious disadvantages, a significant number of corporate analytical data stores are developed as a collection of independent data marts

data quality

-The data in a database is high quality if it correctly and non-ambiguously reflects the real world it represents -Data quality characteristics: accuracy, uniqueness, completeness, consistency, timeliness, conformity

star schema

-The result of dimensional modeling is a dimensional schema containing facts and dimensions -The dimensional schema is often referred to as the star schema -An extended, more detailed version of the star schema is the snowflake schema -In the star schema, the chosen subject of analysis is represented by a fact table -Designing the star schema involves considering which dimensions to use with the fact table representing the chosen subject -For every dimension under consideration, two questions must be answered: Question 1: Can the dimension table be useful for the analysis of the chosen subject? Question 2: Can the dimension table be created based on the existing data sources?
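A hedged sketch of a very small star schema (one fact table with foreign keys to two dimension tables), declared via Python's sqlite3 module; all table and column names are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""CREATE TABLE product_dim (
                  productkey INTEGER PRIMARY KEY,            -- surrogate key
                  brand TEXT, color TEXT)""")                -- descriptive attributes
conn.execute("""CREATE TABLE calendar_dim (
                  datekey INTEGER PRIMARY KEY,
                  fulldate TEXT, year INTEGER)""")
conn.execute("""CREATE TABLE sales_fact (
                  productkey INTEGER REFERENCES product_dim(productkey),
                  datekey    INTEGER REFERENCES calendar_dim(datekey),
                  units_sold INTEGER,                        -- measures for quantitative analysis
                  revenue    NUMERIC)""")
```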

slowly changing dimensions

-Typical dimension in a star schema contains: --Attributes whose values do not change (or change rarely) such as store size and customer gender --Attributes whose values change occasionally over time, such as customer zip and employee salary. -Dimension that contains attributes whose values can change referred to as a slowly changing dimension -Most common approaches to dealing with slowly changing dimensions: Type 1, Type 2, Type 3

ETL infrastructure

-Typically includes use of specialized ETL software tools and/or writing code -Due to the amount of detail involved, ETL infrastructure is often the most time and resource consuming part of DW development process -Although labor intensive, creation of ETL infrastructure is predetermined by results of requirements collection and DW modeling processes (specifies sources and target)

database administration component

-Used for technical, administrative, and maintenance tasks of database systems -DCL (Data Control Language) and TCL (Transaction Control Language) SQL commands are used during these tasks

data definition component

-Used to create the components of the database --E.g. database tables, referential integrity constraints connecting the created tables. -Uses DDL (Data Definition Language) SQL commands

data manipulation component

-Used to insert, read, update, and delete information in a database -Uses DML (Data Manipulation Language) SQL commands

What are the essential skills for a data scientist?

-ability to write code -communicate in a language their stakeholders will understand -storytelling with data -associative thinking

What are some of the principles of evidence-based decision-making?

-Empowering employees: cashiers working in the store (the Japan example) were empowered and leveraged because they know a lot more than you give them credit for; they were empowered to manage inventory and be in control of the products and the success of the store. This has nothing to do with big data.
-A single version of the truth: when there are many different systems, one group in the organization could believe one version of the numbers is right while another group is getting different numbers.
-Scorecards.
-Explicitly stated business rules: cardinalities, referential integrity, cascade updates and deletes; they impact how a business functions. Example: a store will accept returns without receipts. Why? Checking could take a long time, it improves customer satisfaction, and customers are more likely to buy something if they know they will be able to return it. This rule impacts the volume of sales and how many returns the store has. Explicitly state what the business rules are and keep managing them.
-Hold employees accountable, through dashboards.
-Use coaching to improve performance: you want all of your employees to be successful, so coach them, providing feedback and constructive criticism.
-None of these require big data.

creating ETL infrastructure

-creating necessary procedures and code for: -Automatic extraction of relevant data from the operational data sources -Transformation of extracted data, so that quality is assured and structure conforms to the structure of the modeled and implemented DW -The seamless load of the transformed data into DW -Due to the details that have to be considered, creating ETL infrastructure is often the most time- and resource-consuming part of the data warehouse development process

database policies and standards

-database development --E.g. naming conventions -database use --E.g. business rules -database management and administration --E.g. policy for assigning administration tasks -Common purpose for database policies and standards is to reflect and support business processes and business logic

analytical information

-information collected and used in support of analytical tasks -Analytical information is based on operational (transactional) information -time: years, summarized, values over time (snapshots) -large amounts used in a process, low/modest frequency of access, read only, redundancy not an issue -used by a narrower set of users for decision making, subject oriented

load

-load extracted, transformed, and quality-assured data into target DW -Automatic, batch process inserts data into DW tables, without user involvement

Imagine that you are helping to develop a Balanced Scorecard for comScore. Describe metrics/KPIs for each of the four Balanced Scorecard dimensions based on the case study.

-Metric = a measure, a fact that comScore has identified as a way to evaluate its performance. KPI = key performance indicator, looking at the happiness and success of their customers with their products.
-KPIs/metrics used to evaluate success: customer retention (success at market research), upselling (what else could they sell them), and usage (comScore monitored what customers actually used of their data; if they are using it, it is useful, if not, then there is something wrong).
-comScore feels it is important to educate its employees; the courses employees took could measure how well comScore educated them (another example of a KPI).
-In this class, grades, or our future jobs and whether they involve analytics, could be a measure of the success of this program.

data warehouse administration and maintenance

-perform activities that support the data warehouse end user, such as: -Provide security for information contained in the data warehouse -Ensure sufficient hard-drive space for the data warehouse content -Implement backup and recovery procedures

What is the New Deal and how does it address concerns with Big Data and privacy?

-The problem the New Deal is trying to address: privacy, knowing where everyone is, tracking customers.
-The New Deal is about privacy and control over your data: companies have central data about you, and you decide what can be used and who can see it. "I'll give you my data if you tell me what I get out of it." Example: Waze; in order to use it, you have to allow it to track you and use your data, because it provides a valuable service.
-The New Deal sets up a dashboard so consumers can decide what they want companies to know about them and who can see it.

uniqueness

-requires each real-world instance to be represented only once in the data collection -The uniqueness data quality problem is sometimes also referred to as data duplication

requirements collection, definition, and visualization

-specifies desired capabilities and functionalities of the future DW -Based on data in the internal data source systems and external data sources -Requirements are collected through interviewing various stakeholders -Collected requirements should be clearly defined in a written document, and visualized as a conceptual data model

completeness

-the degree to which all the required data is present in the data collection

transformation

-transforming the structure of extracted data in order to fit the structure of the target data warehouse model -E.g. adding surrogate keys -Data quality control and improvement are included in the transformation process --Data from data sources frequently exhibit data quality problems --Data sources often contain overlapping information

OLAP/BI tools

-two purposes: 1) Ad-hoc direct analysis of dimensionally modeled data 2) Creation of front-end (BI) applications -Designed for analysis of dimensionally modeled data -Regardless of which DW approach is chosen, data accessible by the user is typically structured as a dimensional model --OLAP/BI tools can be used on analytical data stores created with different modeling approaches -Allow users to query fact and dimension tables by using simple point-and-click query-building applications (SAP BO Analysis, SAP BeXQuery Designer, Excel, Tableau, etc.) -Based on point-and-click actions of user, the OLAP/BI tool writes and executes the code in the language of the DBMS (e.g. SQL) that hosts the data warehouse or data mart that is being queried -Basic OLAP/BI tool features: Slice and Dice Pivot (Rotate) Drill Down / Drill Up -Require dimensional organization of underlying data for performing basic OLAP operations (slice, pivot, drill) -Additional OLAP/BI Tool functionalities: --Graphically visualize the answers --Create and examine calculated data --Determine comparative or relative differences --Perform exception analysis, trend analysis, forecasting, and regression analysis --Number of other analytical functions -Many OLAP/BI tools are web-based

What are the three large data platforms at comScore and what purpose do they serve? In answering this question, be sure to address why comScore needs all three platforms.

1) Greenplum is a large processing database for event-level data analysis; with it you can see and understand what exactly is happening and why. It covers a 50-day time period and loads hourly: the current, right-now scenario, just streaming and capturing, not doing much to the actual data. 2) The Greenplum enterprise data warehouse has aggregated and historical data; through it, you can see trends over a long period of time (beyond 50 days). Data that is useful, streamed in, and integrated is put into the data warehouse. 3) Hadoop is a platform with a much larger and longer history and more data types; it is an infrastructure that supports all big data technologies, with a longer storage time period for in-depth analysis. comScore needs all three platforms because they serve different purposes; you cannot have one of these platforms without the others. comScore's data is growing at fast volumes, and all three platforms are essential to keep up with this growth. They need the first tier because the most recent data is the most relevant; the middle tier stores 12 months because there is too much data to keep more than one year there; if they want more than one year they need to go to a larger platform like Hadoop. Every data platform is different and has a different time period that it functions for.

Describe the four sources of comScore's online data.

1) Panel Data. This was collected from 2 million internet users in and outside the United States. All members granted comScore permission to get their total online usage. This allowed comScore to find the exact number of ads displayed on computers and if their online purchasing was immediate or delayed. 2) Census Data. This is based upon sensors placed on most of the Top 100 US digital media properties. They supported several platforms like mobile devices, gaming units and smart TVs. 3) Perceptual Data. This is collected from panel members through given proprietary surveys. 4) Data obtained from strategic partners. ComScore obtained information from their partners, like a store, that shows customer's purchases. With this information and the information from the panelists, they were able to connect online advertisement with offline in-store purchases.

What are the three Vs and how do they define Big Data? Give examples of each

1) Variety= the different apps we are using, cell phone signals, GPS location, browsing history, products purchases, videos, pictures, how many times we share pictures and videos back and forth. 2) Volume= all the data we use. For instance, it is estimated that Walmart collects more than 2.5 petabytes of data every hour from its customer transactions. A petabyte is one quadrillion bytes, or the equivalent of about 20 million filing cabinets' worth of text. An exabyte is 1,000 times that amount, or one billion gigabytes. 3) Velocity= we are doing it very fast, real time, tracking, cell phone signals. the speed of using apps

Describe and draw the 5 data warehousing architectures (Figure 2.7 in slides). Which architectures would Kimball and Inmon support (choose)?

1) Independent data marts: source systems => staging area => independent data marts (atomic/summarized data) => end-user access and applications
2) Linked dimensional data marts: source systems => staging area => dimensionalized data marts linked by conformed dimensions => end-user access and applications
3) Hub and spoke (corporate information factory): source systems => staging area => normalized relational warehouse => end-user access and applications, plus dependent data marts
4) Centralized DW: source systems => staging area => normalized relational warehouse => end-user access and applications
5) Federated: existing data warehouses, data marts, and legacy systems => data mapping/metadata, logical/physical integration of common data elements => end-user access and applications
Kimball's bottom-up approach corresponds to the linked dimensional data marts (bus) architecture; Inmon's top-down approach corresponds to the hub-and-spoke (corporate information factory) architecture.

steps in the development of data warehouses

1) requirements 2) modeling 3) creating data warehouse 4) creating ETL infrastructure 5) developing front-end (BI) applications 6) DWH deployment 7) DWH use 8) DWH administration and maintenance

surrogate key

-Dimension tables are often given a simple, non-composite system-generated key -Values for surrogate keys are simple auto-increment integer values -Used as primary key instead of operational key -Surrogate key values typically have no other meaning or purpose

line-item detailed fact table

Each row represents a line item of a particular transaction

complete mirrored backup

Ensures against complete database destruction

drill up

Makes the granularity of the data in the query result coarser -Drilling up and drilling down does not filter your overall data. The overall total does not change.

drill down

Makes the granularity of the data in the query result finer -Drilling up and drilling down does not filter your overall data. The overall total does not change.
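A hedged sketch of drill-down as a change in GROUP BY granularity, via Python's sqlite3 module; the sales table, regions, and stores are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, store TEXT, revenue NUMERIC)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("East", "S1", 100), ("East", "S2", 50), ("West", "S3", 80)])

# Coarser grain: revenue per region
print(conn.execute("SELECT region, SUM(revenue) FROM sales GROUP BY region").fetchall())
# Drill down to store level: finer rows, but the overall total (230) does not change
print(conn.execute("SELECT region, store, SUM(revenue) FROM sales GROUP BY region, store").fetchall())
```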

index

Mechanism for increasing the speed of data search and data retrieval on relations with a large number of records -Most relational DBMS software tools enable definition of indexes
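A minimal sketch of defining an index via Python's sqlite3 module (illustrative table, column, and index names; most relational DBMSs use very similar CREATE INDEX syntax):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (custid INTEGER PRIMARY KEY, lastname TEXT, city TEXT)")
conn.execute("CREATE INDEX idx_customer_lastname ON customer(lastname)")

# The optimizer can now use the index to speed up lookups such as:
rows = conn.execute("SELECT * FROM customer WHERE lastname = 'Smith'").fetchall()
print(rows)
```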

What is metadata and why is it important? Provide an example of metadata

-Data about data. In a DW, metadata describes the contents of the DW and the manner of its acquisition and use. -It matters especially in the analytical setting because there is more data and more variety, and metadata also shows where the data came from.

aggregated data

data representing summarization of multiple instances of data

What is the data-ink ratio in the context of dashboard design

The proportion of ink used to display data versus total ink; a dashboard should devote more ink to data than to non-data elements such as backgrounds, gridlines, and borders.

