BIA Exam 2
data cleansing (scrubbing)
the detection and correction of low-quality, redundant data
online transaction processing (OLTP)
updating (i.e. inserting, modifying and deleting), querying and presenting data from databases for operational purposes
delete set-to-default
a referential integrity constraint option that allows a record to be deleted even if its PK value is referred to by a FK value of a record in another relation; if the record is deleted, the referring FK values are set to a predetermined default value
linear (sequential) search
a search method that finds a particular value in a list by checking elements sequentially, one at a time, until the searched-for value is found
binary search
a search method that takes advantage of sorted lists by repeatedly halving the portion of the list that could contain the searched-for value
accuracy
the extent to which data correctly reflects the real-world instances it is supposed to depict
conformity
the extent to which the data conforms to its specified format
data warehouse use
-The retrieval of the data in the data warehouse
-Indirect use: via the front-end (BI) applications
-Direct use: via the DBMS or via the OLAP (BI) tools
data mart
-A data store based on the same principles as a DW, but with a more limited scope
-Three of the most common data warehouse and data mart modeling approaches:
--Normalized data warehouse
--Dimensionally modeled data warehouse
--Independent data marts
constellation (galaxy) of stars
A dimensional model with multiple fact tables
corrective data quality actions
Actions taken to correct the data quality problems
preventive data quality actions
Actions taken to preclude data quality problems
production release
Actual deployment of functioning system
What is meant by the statement - "Each of us is now a walking data generator."
We are all constantly using phones, visiting websites, and posting on social media such as Facebook; companies track what we look at, so everyday activity generates a stream of data.
beta release
Deployment of system to a selected group of users to test system usability
timestamps
columns that indicate the time interval for which the values in the records are applicable
How did comScore make their big data more consumable, and less overwhelming, for customers?
By delivering actionable insights through visualizations and dashboards, supported by wizards and knowledge portals; comScore formatted data in a specific way so it was easier to digest.
detailed data
data composed of single instances of data
type 1 approach
-Changes the value in the dimension's record
--The new value replaces the old value
-No history is preserved
-Simplest approach; used most often when a change in a dimension is the result of an error
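A minimal SQL sketch of a Type 1 change (the table and column names are hypothetical, used only for illustration):
  -- Type 1: overwrite the old value in place; no history is preserved
  UPDATE customer_dim
  SET    customer_zip = '60611'   -- corrected value replaces the old one
  WHERE  customer_key = 1001;     -- surrogate key of the affected dimension record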
delete cascade
a referential integrity constraint option that allows a record to be deleted even if its PK value is referred to by a FK value of a record in another relation; if the record is deleted, all the records whose FK values refer to it are also deleted
Online analytical processing (OLAP) tools
tools enabling end users to engage in ad-hoc analytical querying of data warehouses
backup
saving additional physical copies of the data
Why is it challenging to find professionals that can effectively work with Big Data?
Because a broad scope of different skills is needed. You can't just hire computer scientists, because they are missing the business side; likewise, business people often lack the technical side. The role requires communication, math, coding, business knowledge, and domain expertise. Business students bring the business and communication side and some domain expertise; computer scientists bring math, coding, and statistics.
COMMIT
Causes all the updates to be recorded on the disk
multiuser system
Data manipulation component used by multiple users at the same time
single-user system
Data manipulation component used by one user at a time
transaction-level detailed fact table
Each row represents a particular transaction
alpha release
Internal deployment of a system to members of development team for initial testing of functionality
authentication
Login procedure using user ID and password
database front-end
-Provides access to the database for indirect use
-Can include many other components and functionalities, such as menus, charts, graphs, maps, etc.
-There can be multiple sets of front-end applications for different purposes or groups of end users
pivot (rotate)
Reorganizes the values displayed in the original query result by moving values of a dimension column from one axis to another
data dictionary
Repository of the metadata
ROLLBACK
Rolls back all the updates since the last COMMIT
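A small illustration of COMMIT and ROLLBACK in SQL, assuming a hypothetical account table:
  UPDATE account SET balance = balance - 100 WHERE acct_id = 1;
  UPDATE account SET balance = balance + 100 WHERE acct_id = 2;
  COMMIT;    -- both updates are now permanently recorded on the disk
  UPDATE account SET balance = balance - 500 WHERE acct_id = 1;
  ROLLBACK;  -- undoes every update issued since the last COMMIT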
What is a DW and what are the key attributes of a DW. What is the purpose of a DW and what are its benefits compared to an operational database?
-Key attributes: subject-oriented, integrated, time-variant (time series), nonvolatile, summarized, not normalized, includes metadata
-Web-based, relational/multi-dimensional, client/server, real-time/right-time/active
-Purpose: the retrieval of analytical information in support of decision making, whereas an operational database supports day-to-day operations and is continuously updated
...
application development component
Used to develop front-end applications
CHECK
Used to specify a constraint on a particular column of a relation
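A brief example of a CHECK constraint on one column of a hypothetical product relation:
  CREATE TABLE product
  ( product_id    CHAR(3)       PRIMARY KEY,
    product_name  VARCHAR(25),
    product_price NUMERIC(7,2)  CHECK (product_price > 0) );  -- constraint on the price column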
update cascade
a referential integrity constraint option that allows the PK value of a record to be changed even if it is referred to by FK values of records in another relation; if the PK value is changed, all the FK values that refer to it are changed as well
decryption key
reverts the information to its original state
data warehouse modeling
(also called logical data warehouse modeling) the creation of the data warehouse data model that is implementable by the DBMS software
slice and dice
-Adds, replaces, or eliminates specified dimension attributes, or specific values of dimension attributes
-Slice = a filter on one value of a dimension attribute
-Dice = two or more slices combined
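Behind the scenes, a slice typically becomes a filter in the SQL that the OLAP/BI tool generates; a sketch against a hypothetical sales star schema:
  -- slice: keep only one value of a dimension attribute (store_region = 'Midwest')
  SELECT p.product_category, SUM(f.dollars_sold) AS total_sales
  FROM   sales_fact f
         JOIN store_dim s   ON f.store_key = s.store_key
         JOIN product_dim p ON f.product_key = p.product_key
  WHERE  s.store_region = 'Midwest'
  GROUP BY p.product_category;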
conformed dimensions
A set of commonly used dimensions that are shared by multiple fact tables in a dimensional model
dimensional modeling
-A data design methodology used for designing subject-oriented analytical databases, such as data warehouses or data marts
-Commonly employed as a relational data modeling technique
-In addition to the regular relational concepts (primary keys, foreign keys, integrity constraints, etc.), dimensional modeling distinguishes two types of tables:
--Dimensions
--Facts
data warehouse
-A structured repository of integrated, subject-oriented, enterprise-wide, historical, and time-variant data
-Purpose: retrieval of analytical information; a data warehouse can store detailed and/or summarized data
-Sometimes referred to as the target system, i.e., a destination for data from the source systems
-A typical DW retrieves selected, analytically useful data from the operational data sources
access privileges
-Assigned to the database user account
-Determine the user's privileges on database columns, relations, and views
-Include actions such as SELECT, UPDATE, ALTER, DELETE, and INSERT
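A sketch of granting and revoking access privileges with SQL DCL commands; the user and table names are hypothetical:
  GRANT SELECT, INSERT ON customer TO analyst1;   -- give analyst1 read and insert privileges
  REVOKE INSERT ON customer FROM analyst1;        -- later, remove only the insert privilege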
catalog
-The data dictionary created and maintained by the DBMS itself
type 3 approach
-Involves creating a "previous" and "current" column in the dimension table for each column where changes are anticipated -Applicable in cases in which there is a fixed number of changes possible per column of a dimension, or in cases when only a limited history is recorded. -Can be combined with the use of timestamps
recovery log
-Logs database updates -Ensures against loss of updates
source systems
-Operational databases and repositories that provide analytically useful information in DW subject areas -Each operational data store has two purposes: --Original operational purpose --Source system for the data warehouse -Source systems can include external data sources
checkpoints
-Part of a recovery log -Indicates a point when updates are written on the disk
report
-Presents data and calculations on data from one or more tables -Formatted and arranged to be displayed on the screen or printed as a hard copy
authorization matrix
-implements access privileges -Provided by the DBMS -Managed by the DBA
creating the data warehouse
-Using a DBMS to implement the data warehouse data model as an actual data warehouse
-Typically, data warehouses are implemented using relational DBMS (RDBMS) software
transaction identifier
a column representing transaction ID
assertion
a mechanism for specifying user-defined constraints
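A sketch using the standard SQL CREATE ASSERTION syntax (note that many RDBMS products do not actually implement assertions); the claim table is hypothetical:
  CREATE ASSERTION max_open_claims
  CHECK ( NOT EXISTS ( SELECT customer_id
                       FROM   claim
                       GROUP BY customer_id
                       HAVING COUNT(*) > 5 ) );  -- no customer may have more than 5 claims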
multidimensional database model
a model for implementing dimensionally modeled data in which the database is implemented as a collection of cubes
update set-to-null
a referential integrity constraint option that allows the PK value of a record to be changed even if it is referred to by a FK value of a record in another relation; if the PK value is changed, the referring FK values are set to null
update set-to-default
a referential integrity constraint option that allows the PK value of a record to be changed even if it is referred to by a FK value of a record in another relation; if the PK value is changed, the referring FK values are set to a predetermined default value
update restrict
a referential integrity constraint option that does not allow the PK value of a record to be changed if it is referred to by a FK value of a record in another relation
delete set-to-null
a referential integrity constraint option that allows a record to be deleted even if its PK value is referred to by a FK value of a record in another relation; if the record is deleted, the referring FK values are set to null
delete restrict
a referential integrity constraint option that does not allow a record to be deleted if its PK value is referred to by a FK value of a record in another relation
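A sketch of how the delete and update options above are declared on a foreign key; the table names are hypothetical, and exact option support varies by DBMS:
  CREATE TABLE employee
  ( emp_id   CHAR(4) PRIMARY KEY,
    emp_name VARCHAR(25),
    dept_id  CHAR(2),
    FOREIGN KEY (dept_id) REFERENCES dept (dept_id)
      ON DELETE CASCADE      -- alternatives: RESTRICT | SET NULL | SET DEFAULT
      ON UPDATE CASCADE );   -- alternatives: RESTRICT | SET NULL | SET DEFAULT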
snowflake models
a star schema in which the dimension tables are normalized
row indicator
column that provides a quick indicator of whether the record is currently valid
detailed fact tables
each record refers to a single fact
aggregated fact tables
each record summarizes multiple facts
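A sketch of how an aggregated fact table might be populated from a detailed one (hypothetical table and column names):
  -- each inserted row summarizes many rows of the detailed fact table
  INSERT INTO sales_per_day_fact (date_key, store_key, total_units, total_dollars)
  SELECT date_key, store_key, SUM(units_sold), SUM(dollars_sold)
  FROM   sales_detail_fact
  GROUP BY date_key, store_key;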
query optimization
examining multiple ways of executing the same query and choosing the fastest option
What is preattentive processing and how can a dashboard provide it? Give some examples. Which forms of preattentive processing are best for presenting quantitative information?
The goal is to support automatic (preattentive) processing instead of sequential processing. Instead of reading every line of text (e.g., to find all the 5s), simply bolding all the 5s lets you pick them out instantly. A dashboard provides preattentive processing through attributes such as line length, boldness, shape, enclosure, and color saturation, i.e., by making something stand out so the user can find it easily. The forms best suited to presenting quantitative information are line length and 2-D (x/y) spatial position.
operational information (transactional information)
information collected and used in support of day-to-day operational needs
-Time horizon: days/months; detailed; current
-Small amounts used in a process; high frequency of access; can be updated; non-redundant
-Used by all types of employees for tactical purposes; application-oriented
What are the general design guidelines for dashboards?
No scrolling; use white space to draw attention; don't overuse color; maximize the data-ink ratio; support drill-down; be intuitive, so that the user can operate the dashboard automatically without any training.
do we really need big data?
privacy issues regarding big data..
online analytical processing (OLAP)
querying and presenting data from data warehouses and/or data marts for analytical purposes
recovery
recovering the content of the database after a failure
data warehouse deployment
releasing the data warehouse and its front-end (BI) applications for use by the end users
consistency
the extent to which the data properly conforms to and matches up with the other data
Why are bullet graphs and sparklines these graphics better than other widgets such as gauges and other graphics?
We want most of the ink devoted to data, not decoration: no ink spent on backgrounds, gridlines, borders, or excessive color. Sparklines provide space-efficient, time-series context for measures, giving a quick sense of how a measure is trending. A bullet graph displays a key measure along with a comparative measure (such as a target) and a qualitative scale, so it instantly declares whether the measure is good or bad.
What does passive measurement mean and how did comScore use it?
Passive measurement means tracking users online without requiring any input from, or contact with, the user; comScore does not ask panelists what they are doing. The user grants comScore permission to monitor everything they do online, and comScore gets users to agree to this by compensating them. comScore collects data from these panelists and compiles it to see how users behave toward, and feel about, certain companies and products.
encryption key
-The piece of information used by the scrambling (encryption) algorithm to encode the data
dimensionally modeled data warehouse
-A collection of dimensionally modeled, intertwined data marts (i.e., a constellation of dimensional models) that integrates analytically useful information from the operational data sources
-Same as the normalized data warehouse approach when it comes to the utilization of operational data sources and the ETL process
-Fact tables corresponding to the subjects of analysis are subsequently added
-A set of dimensional models is created, with each fact table connected to multiple dimensions, and some dimensions shared by more than one fact table
-Additional dimensions are included as needed
-The resulting DW is a collection of intertwined dimensionally modeled data marts, i.e., a constellation of stars
-Can be used as a source for dependent data marts and other views, subsets, and/or extracts
dimension tables (dimensions)
-Contain descriptions of the business, organization, or enterprise to which the subject of analysis belongs -Columns in dimension tables contain descriptive information that is often textual (e.g., product brand, product color, customer gender, customer education level), but can also be numeric (e.g., product weight, customer income level) -This information provides a basis for analysis of the subject
fact tables
-Contain measures related to the subject of analysis and the foreign keys associating the fact table with the dimension tables
-The measures are typically numeric and are intended for mathematical computation and quantitative analysis
-Sometimes fact tables contain other attributes that are not measures; two common ones are:
--Transaction identifier
--Transaction time
type 2 approach
-Creates a new additional dimension record using a new value for the surrogate key every time a value in a dimension record changes -Used in cases where history should be preserved -Can be combined with the use of timestamps and row indicators
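A minimal Type 2 sketch in SQL, assuming a hypothetical customer dimension with a surrogate key, timestamps, and a row indicator:
  -- close out the current record for the customer whose zip code changed
  UPDATE customer_dim
  SET    effective_end = CURRENT_DATE, row_current = 'N'
  WHERE  customer_id = 'C111' AND row_current = 'Y';
  -- insert a new record with a new surrogate key value and the new attribute value
  INSERT INTO customer_dim
    (customer_key, customer_id, customer_zip, effective_start, effective_end, row_current)
  VALUES (2001, 'C111', '60611', CURRENT_DATE, NULL, 'Y');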
dependent data mart
-Does not have own source systems -Data comes from a DW
form
-Enables data input and retrieval for end users -Provides an interface into a database relation or query
refresh load
-Every subsequent load is referred to as a refresh load -time period for loading new data (e.g. hourly, daily). -Determined in advance: --Based on needs of DW users and technical feasibility --In active DW, loads occur in continuous micro batches
extraction-transformation-load (ETL)
-Facilitates the retrieval of data from operational databases into the DW
-ETL includes the following tasks:
--Extracting analytically useful data from the operational data sources
--Transforming the data so that it conforms to the structure of the subject-oriented target DW model
--Loading the transformed and quality-assured data into the target DW
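A greatly simplified sketch of one transform-and-load step, assuming a hypothetical staging table and target dimension (surrogate key assignment and other details omitted):
  INSERT INTO customer_dim (customer_id, customer_name, customer_gender)
  SELECT customer_id,
         TRIM(customer_name),                                  -- transformation: clean up names
         CASE WHEN gender IN ('F', 'f', 'FEMALE') THEN 'F'
              ELSE 'M' END                                     -- transformation: standardize codes
  FROM   staging_customer;                                     -- data previously extracted into staging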
granularity
-Granularity describes what is depicted by one row in the fact table
-Detailed fact tables have a fine level of granularity because each record represents a single fact
-Aggregated fact tables have a coarser level of granularity, as each of their records summarizes multiple facts
-Due to their compactness, coarser-granularity aggregated fact tables are quicker to query than detailed fact tables, but they are limited in terms of what information can be retrieved from them
-One way to get the query performance of aggregated fact tables while retaining the analytical power of detailed fact tables is to have both types of tables coexist within the same dimensional model, i.e., in the same constellation
How has Hadoop enhanced our ability to leverage (effectively use) Big Data?
-A high-level question
-Hadoop provides parallel processing along with distributed file storage and processing; today the hardware for a cluster of, say, 200 nodes is inexpensive yet has the combined capability of a supercomputer
-Cheap technology
Compare Kimball and Inmon approaches toward DW development.
-Inmon: EDW approach (top-down)
-Kimball: data mart approach (bottom-up)
first load
-Initial load -populates empty DW tables -Can involve large amounts of data, depending on desired time horizon of the DW
normalized data warehouse
-Integrated analytical database modeled w/traditional database modeling techniques of ER modeling and relational modeling, resulting in a normalized relational database schema -Populated with analytically useful data from the operational data sources via the ETL process -Serves as source data for dimensionally modeled data marts and for any other non-dimensional analytically useful data sets
executive dashboard
-Intended for use by higher level decision makers within an organization -Contains organized easy-to-read display of critically important queries describing organizational performance -In general, the usage of executive dashboards should require little or no effort or training -Executive dashboards can be web-based
Similarities and differences across the three main systems
-Operational databases: day-to-day processing, not designed for analytical decision making; typically hold up to about 1 year of data; designed with ER diagrams and relational models and queried with SQL
-Analytical databases/data warehouses: hold many years of data; purpose is to support decision making; the data is not changed once loaded
-Big data / Hadoop: no predefined data models, because data arrives too fast to be modeled efficiently; constantly changing; can store many years of data and can have a short-term or long-term scope depending on the needs
developing front-end (BI) applications
-Provides access to DW for users who are engaging in indirect use -design and create applications for end-users -Front-end applications included in most data warehousing systems, referred to as business intelligence (BI) applications -Front-end applications contain interfaces (such as forms and reports) accessible via a navigation mechanism (such as a menu)
subject-oriented
-Refers to the fundamental difference in the purpose of an operational database system and a data warehouse. -Operational database system - developed to support a specific business operation -data warehouse - developed to analyze specific business subject areas
extraction
-Retrieval of analytically useful data from operational data sources to be loaded into DW -Examination of available sources -Available sources and requirements determine DW model -DW model provides a blueprint for ETL infrastructure and extraction procedures
When would you choose a Data mart vs. Data warehouse? See Table 2.3 in slides.
-Scope: DM: one subject area; EDW: multiple subject areas
-Development time: DM: months; EDW: years
-Development cost: DM: $10,000-$100,000+; EDW: $1,000,000+
-Development difficulty: DM: low to medium; EDW: high
-Data prerequisite for sharing: DM: common (within business area); EDW: common (across enterprise)
-Sources: DM: only some operational and external systems; EDW: many operational and external systems
-Size: DM: megabytes to several gigabytes; EDW: gigabytes to petabytes
-Time horizon: DM: near-current and historical data; EDW: historical data
-Data transformations: DM: low to medium; EDW: high
-Update frequency: DM: hourly, daily, weekly; EDW: weekly, monthly
-Hardware: DM: workstations and departmental servers; EDW: enterprise servers and mainframe computers
-Operating system: DM: Windows and Linux; EDW: Unix, z/OS, OS/390
-Databases: DM: workgroup or standard database servers; EDW: enterprise database servers
-Number of simultaneous users: DM: 10s; EDW: 100s-1,000s
-User types: DM: business area analysts and managers; EDW: enterprise analysts and senior executives
-Business spotlight: DM: optimizing activities within the business area; EDW: cross-functional optimization and decision making
independent data mart
-Stand-alone data mart, created in the same fashion as DW -has own source systems and ETL infrastructure -multiple ETL systems are created and maintained -an inferior strategy --Inability for straightforward analysis across the enterprise --The existence of multiple unrelated ETL infrastructures -In spite of obvious disadvantages, a significant number of corporate analytical data stores are developed as a collection of independent data marts
data quality
-The data in a database is of high quality if it correctly and unambiguously reflects the real world it represents
-Data quality characteristics: accuracy, uniqueness, completeness, consistency, timeliness, conformity
star schema
-The result of dimensional modeling is a dimensional schema containing facts and dimensions, often referred to as the star schema
-An extended, more detailed version of the star schema is the snowflake schema
-In the star schema, the chosen subject of analysis is represented by a fact table
-Designing the star schema involves considering which dimensions to use with the fact table representing the chosen subject
-For every dimension under consideration, two questions must be answered:
--Question 1: Can the dimension table be useful for the analysis of the chosen subject?
--Question 2: Can the dimension table be created based on the existing data sources?
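A compact SQL sketch of a star schema, with hypothetical calendar and product dimensions and a sales fact table:
  CREATE TABLE calendar_dim
  ( calendar_key   INTEGER PRIMARY KEY,   -- surrogate key
    full_date      DATE,
    calendar_month INTEGER,
    calendar_year  INTEGER );
  CREATE TABLE product_dim
  ( product_key    INTEGER PRIMARY KEY,   -- surrogate key
    product_id     CHAR(3),               -- operational key
    product_name   VARCHAR(25) );
  CREATE TABLE sales_fact
  ( calendar_key   INTEGER REFERENCES calendar_dim (calendar_key),
    product_key    INTEGER REFERENCES product_dim (product_key),
    units_sold     INTEGER,               -- measure
    dollars_sold   NUMERIC(9,2),          -- measure
    PRIMARY KEY (calendar_key, product_key) );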
slowly changing dimensions
-A typical dimension in a star schema contains:
--Attributes whose values do not change (or change rarely), such as store size and customer gender
--Attributes whose values change occasionally over time, such as customer zip code and employee salary
-A dimension that contains attributes whose values can change is referred to as a slowly changing dimension
-Most common approaches to dealing with slowly changing dimensions: Type 1, Type 2, Type 3
ETL infrastructure
-Typically includes the use of specialized ETL software tools and/or writing code
-Due to the amount of detail involved, the ETL infrastructure is often the most time- and resource-consuming part of the DW development process
-Although labor intensive, the creation of the ETL infrastructure is predetermined by the results of the requirements collection and DW modeling processes (which specify the sources and the target)
database administration component
-Used for technical, administrative, and maintenance tasks of database systems -DCL (Data Control Language) and TCL (Transaction Control Language) SQL commands are used during these tasks
data definition component
-Used to create the components of the database --E.g. database tables, referential integrity constraints connecting the created tables. -Uses DDL (Data Definition Language) SQL commands
data manipulation component
-Used to insert, read, update, and delete information in a database -Uses DML (Data Manipulation Language) SQL commands
What are the essential skills for a data scientist?
-Ability to write code
-Ability to communicate in language their stakeholders will understand
-Storytelling with data
-Associative thinking
What are some of the principles of evidence-based decision-making?
-Empower employees and leverage their knowledge: e.g., cashiers working in stores in Japan were empowered to manage inventory and take ownership of the products and the success of the store, because frontline employees know more than they are usually given credit for (this has nothing to do with big data)
-A single version of the truth: when many different systems exist, one group may believe one version of the numbers while another group gets different numbers; everyone should work from the same numbers
-Use scorecards
-Explicitly state business rules and keep managing them: business rules (e.g., cardinalities, referential integrity, cascade updates/deletes) affect how a business functions; for example, accepting returns without receipts saves time at the register, improves customer satisfaction, and makes customers more likely to buy, which impacts sales volume and return volume
-Hold employees accountable, e.g., through dashboards
-Use coaching to improve performance: you want all of your employees to be successful, so provide feedback and constructive criticism to help them get there
-Note that none of these principles require big data
creating ETL infrastructure
-creating necessary procedures and code for: -Automatic extraction of relevant data from the operational data sources -Transformation of extracted data, so that quality is assured and structure conforms to the structure of the modeled and implemented DW -The seamless load of the transformed data into DW -Due to the details that have to be considered, creating ETL infrastructure is often the most time- and resource-consuming part of the data warehouse development process
database policies and standards
-database development --E.g. naming conventions -database use --E.g. business rules -database management and administration --E.g. policy for assigning administration tasks -Common purpose for database policies and standards is to reflect and support business processes and business logic
analytical information
-information collected and used in support of analytical tasks -Analytical information is based on operational (transactional) information -time: years, summarized, values over time (snapshots) -large amounts used in a process, low/modest frequency of access, read only, redundancy not an issue -used by a narrower set of users for decision making, subject oriented
load
-load extracted, transformed, and quality-assured data into target DW -Automatic, batch process inserts data into DW tables, without user involvement
Imagine that you are helping to develop a Balanced Scorecard for comScore. Describe metrics/KPIs for each of the four Balanced Scorecard dimensions based on the case study.
-A metric is a measure that comScore has identified as a way to evaluate its performance; a KPI is a key performance indicator, e.g., one that looks at the happiness and success of customers with comScore's products
-KPIs/metrics used to evaluate success include customer retention (success of the market research), upselling (what else could be sold to existing customers), and usage (monitoring what customers actually used of their data: if they are using it, it is useful; if not, something is wrong)
-comScore also considers it important to educate its employees, so the courses employees complete could be used as a KPI for how well the company is educating them
-Similarly, for this class, grades or whether our future jobs involve analytics could be used as measures of the success of the program
data warehouse administration and maintenance
-perform activities that support the data warehouse end user, such as: -Provide security for information contained in the data warehouse -Ensure sufficient hard-drive space for the data warehouse content -Implement backup and recovery procedures
What is the New Deal and how does it address concerns with Big Data and privacy?
-The problem the New Deal is trying to address: privacy, i.e., companies knowing where everyone is and tracking customers
-The New Deal on Data is about privacy and data ownership: companies hold a central store of data about you, and you decide what can be used and who can see it ("I'll give you my data if you tell me what I get out of it")
-Example: to use Waze, you have to allow it to track you and use your data, because it offers a valuable service in return
-The New Deal envisions a dashboard where consumers decide what others can know about them and who can see it
uniqueness
-requires each real-world instance to be represented only once in the data collection -The uniqueness data quality problem is sometimes also referred to as data duplication
requirements collection, definition, and visualization
-Specifies the desired capabilities and functionalities of the future DW
-Based on data in the internal data source systems and external data sources
-Requirements are collected through interviewing various stakeholders
-Collected requirements should be clearly defined in a written document and visualized as a conceptual data model
completeness
-the degree to which all the required data is present in the data collection
transformation
-transforming the structure of extracted data in order to fit the structure of the target data warehouse model -E.g. adding surrogate keys -Data quality control and improvement are included in the transformation process --Data from data sources frequently exhibit data quality problems --Data sources often contain overlapping information
OLAP/BI tools
-Two purposes: 1) ad-hoc direct analysis of dimensionally modeled data and 2) creation of front-end (BI) applications
-Designed for analysis of dimensionally modeled data; regardless of which DW approach is chosen, the data accessible by the user is typically structured as a dimensional model, so OLAP/BI tools can be used on analytical data stores created with different modeling approaches
-Allow users to query fact and dimension tables using simple point-and-click query-building applications (SAP BO Analysis, SAP BEx Query Designer, Excel, Tableau, etc.)
-Based on the point-and-click actions of the user, the OLAP/BI tool writes and executes the code in the language of the DBMS (e.g., SQL) that hosts the data warehouse or data mart being queried
-Basic OLAP/BI tool features: slice and dice, pivot (rotate), drill down / drill up; these basic operations require dimensional organization of the underlying data
-Additional OLAP/BI tool functionalities: graphically visualize the answers; create and examine calculated data; determine comparative or relative differences; perform exception analysis, trend analysis, forecasting, and regression analysis; and a number of other analytical functions
-Many OLAP/BI tools are web-based
What are the three large data platforms at comScore and what purpose do they serve? In answering this question, be sure to address why comScore needs all three platforms.
1) Greenplum event-level database: a large parallel-processing database for event-level data analysis, covering roughly a 50-day window and loaded hourly; with it you can see and understand exactly what is happening right now, since the data is streamed and captured with little additional processing.
2) Greenplum enterprise data warehouse: holds aggregated and historical data (beyond the 50-day window); data that proves useful is integrated and loaded here so trends can be seen over a longer period of time.
3) Hadoop: a platform with much larger capacity, a longer history, and more data types; an infrastructure that supports the big data technologies and in-depth analysis over long time periods.
comScore needs all three platforms because they serve different purposes; it seems you cannot have one of these platforms without the others. comScore's data volume is growing fast, and all three platforms are essential to keep up with that growth: the first year of data is the most relevant, the middle tier (the enterprise data warehouse) stores about 12 months because there is too much data to keep more than one year there, and anything older requires the larger Hadoop platform. Each data platform is different and covers a different time period.
Describe the four sources of comScore's online data.
1) Panel Data. This was collected from 2 million internet users in and outside the United States. All members granted comScore permission to get their total online usage. This allowed comScore to find the exact number of ads displayed on computers and if their online purchasing was immediate or delayed. 2) Census Data. This is based upon sensors placed on most of the Top 100 US digital media properties. They supported several platforms like mobile devices, gaming units and smart TVs. 3) Perceptual Data. This is collected from panel members through given proprietary surveys. 4) Data obtained from strategic partners. ComScore obtained information from their partners, like a store, that shows customer's purchases. With this information and the information from the panelists, they were able to connect online advertisement with offline in-store purchases.
What are the three Vs and how do they define Big Data? Give examples of each
1) Variety: the different kinds of data generated, e.g., app usage, cell phone signals, GPS location, browsing history, products purchased, videos, pictures, and how often we share pictures and videos back and forth.
2) Volume: the sheer amount of data. For instance, it is estimated that Walmart collects more than 2.5 petabytes of data every hour from its customer transactions. A petabyte is one quadrillion bytes, or the equivalent of about 20 million filing cabinets' worth of text; an exabyte is 1,000 times that amount, or one billion gigabytes.
3) Velocity: the speed at which data is generated and processed, e.g., real-time tracking of cell phone signals and app usage.
Describe and draw the 5 data warehousing architectures (Figure 2.7 in slides). Which architectures would Kimball and Inmon support (choose)?
1) Independent data marts: source systems => staging area => independent data marts (atomic/summarized data) => end-user access and applications
2) Linked dimensional data marts (bus architecture): source systems => staging area => dimensionalized data marts linked by conformed dimensions => end-user access and applications
3) Hub and spoke (corporate information factory): source systems => staging area => normalized relational warehouse => dependent data marts => end-user access and applications
4) Centralized data warehouse: source systems => staging area => normalized relational warehouse => end-user access and applications
5) Federated: existing data warehouses, data marts, and legacy systems => data mapping/metadata, logical/physical integration of common data elements => end-user access and applications
Kimball would choose the linked dimensional data marts (bus) architecture; Inmon would choose the hub-and-spoke (corporate information factory) architecture.
steps in the development of data warehouses
1) requirements 2) modeling 3) creating data warehouse 4) creating ETL infrastructure 5) developing front-end (BI) applications 6) DWH deployment 7) DWH use 8) DWH administration and maintenance
surrogate key
-Dimension tables are often given a simple, non-composite, system-generated key (the surrogate key)
-Values for surrogate keys are simple auto-increment integer values
-Used as the primary key instead of the operational key
-Surrogate key values typically have no other meaning or purpose
line-item detailed fact table
Each row represents a line item of a particular transaction
complete mirrored backup
Ensures against complete database destruction
drill up
Makes the granularity of the data in the query result coarser -Drilling up and drilling down does not filter your overall data. The overall total does not change.
drill down
Makes the granularity of the data in the query result finer -Drilling up and drilling down does not filter your overall data. The overall total does not change.
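In SQL terms, drilling down typically means adding a finer-grained attribute to the grouping, while the overall total stays the same; a sketch assuming a hypothetical sales fact table joined to a calendar dimension:
  -- drill down from yearly totals to monthly totals by adding calendar_month to the grouping
  SELECT c.calendar_year, c.calendar_month, SUM(f.dollars_sold) AS total_sales
  FROM   sales_fact f
         JOIN calendar_dim c ON f.calendar_key = c.calendar_key
  GROUP BY c.calendar_year, c.calendar_month;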
index
Mechanism for increasing the speed of data search and data retrieval on relations with a large number of records -Most relational DBMS software tools enable definition of indexes
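A typical index definition (the syntax below is common across most relational DBMSs; the table and column are hypothetical):
  CREATE INDEX customer_lname_idx ON customer (customer_lname);  -- speeds up searches and retrievals by last name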
What is metadata and why is it important? Provide an example of metadata
-Metadata is data about data; in a DW, metadata describes the contents of the DW and the manner of its acquisition and use
-It is especially important in an analytical environment because there is more data and more variety, and metadata also shows where the data came from
aggregated data
data representing summarization of multiple instances of data
What is the data-ink ratio in the context of dashboard design
The proportion of a dashboard's ink (pixels) devoted to presenting data versus the total ink used; the design goal is to maximize it, i.e., spend most of the ink on data rather than on non-data decoration such as backgrounds, borders, and gridlines.