OIM 350 exam 3


DBMS components

**data definition component
-used to create the components of the database
*e.g. database tables, referential integrity constraints connecting the created tables
-uses DDL (data definition language) SQL commands
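As a sketch of the data definition component in action, the snippet below issues DDL through Python's sqlite3 module to create two hypothetical tables linked by a referential integrity constraint (the table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce referential integrity

# DDL: create a table
conn.execute("""
    CREATE TABLE vendor (
        vendor_id   INTEGER PRIMARY KEY,
        vendor_name TEXT NOT NULL
    )""")

# DDL: a referential integrity constraint connecting the created tables
conn.execute("""
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        vendor_id  INTEGER NOT NULL REFERENCES vendor(vendor_id)
    )""")
```

With the constraint in place, the DBMS rejects a `product` row that points at a nonexistent `vendor` row.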

Enablers of Big Data Analytics

In-memory analytics:
-storing and processing the complete data set in RAM
In-database analytics:
-placing analytic procedures close to where data is stored
Grid computing & MPP:
-use of many machines and processors in parallel (MPP - massively parallel processing)
Appliances:
-combining hardware, software, and storage in a single unit for performance and scalability

ETL

**ETL infrastructure
-typically includes use of specialized ETL software tools and/or writing code
-due to the amount of detail involved, ETL infrastructure is often the most time- and resource-consuming part of the DW development process
-although labor intensive, creation of the ETL infrastructure is predetermined by the results of the requirements collection and DW modeling processes (they specify the sources and the target)

Appendix I - NOSQL DATABASES

**NOSQL database example: MongoDB
-a document-oriented NoSQL database
-organized around collections of documents
*collection - equivalent to a table
*document - equivalent to a table row
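To make the collection/table and document/row analogy concrete, here is a sketch using plain Python dictionaries (a real system would use the pymongo driver against a running MongoDB server; the field names below are hypothetical):

```python
# A relational "customer" table row vs. the equivalent MongoDB document.

# table row: a fixed, flat set of columns
row = ("C-1", "Ada Lovelace", "London")

# document in a "customers" collection: a flexible, self-describing structure
document = {
    "_id": "C-1",
    "name": "Ada Lovelace",
    "city": "London",
    "orders": [                      # nested data a flat table row cannot hold
        {"order_id": 101, "total": 42.50},
    ],
}
```

Because documents need not follow a fixed schema, two documents in the same collection can carry different fields.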

OLAP

**OLAP/BI tools
-require dimensional organization of the underlying data for performing basic OLAP operations (slice, pivot, drill)
-additional OLAP/BI tool functionalities:
*graphically visualize the answers
*create and examine calculated data
*determine comparative or relative differences
*perform exception analysis, trend analysis, forecasting, and regression analysis
*a number of other analytical functions
-many OLAP/BI tools are web-based

OLAP

**OLAP/BI tools
-basic OLAP/BI tool features:
*slice and dice
*pivot (rotate)
*drill down/drill up

OLAP

**OLAP/BI tools
-allow users to query fact and dimension tables by using simple point-and-click query-building applications (SAP BO Analysis, SAP BEx Query Designer, Excel, Tableau, etc.)
-based on the point-and-click actions of the user, the OLAP/BI tool writes and executes code in the language of the DBMS (e.g. SQL) that hosts the data warehouse or data mart being queried
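A simplified sketch of that translation step: the function below stands in for the OLAP/BI tool, turning the user's selections (a fact table, a measure, a dimension attribute - all hypothetical names) into SQL and running it against the host DBMS, here SQLite:

```python
import sqlite3

def build_olap_query(fact, measure, dimension_attr):
    # The BI tool generates SQL from the user's point-and-click selections
    # (a simplified illustration, not any vendor's actual query generator).
    return (f"SELECT {dimension_attr}, SUM({measure}) AS total "
            f"FROM {fact} GROUP BY {dimension_attr}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")  # hypothetical fact table
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 100.0), ("East", 50.0), ("West", 75.0)])

sql = build_olap_query("sales", "amount", "region")
result = dict(conn.execute(sql).fetchall())
# result == {"East": 150.0, "West": 75.0}
```

The user never sees the generated SQL; only the aggregated answer is displayed.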

OLAP

**OLAP/BI tools - two purposes:
-ad-hoc direct analysis of dimensionally modeled data
-creation of front-end (BI) applications: provide access for indirect use

OLAP

**OLAP/BI tools: an interface to data warehouses modeled using different approaches
-designed for analysis of dimensionally modeled data
-regardless of which DW approach is chosen, the data accessible by the user is typically structured as a dimensional model
*OLAP/BI tools can be used on analytical data stores created with different modeling approaches

database administration

**access privileges
-authorization matrix - implements access privileges
*provided by the DBMS
*managed by the DBA
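A minimal sketch of an authorization matrix: users as rows, database objects as columns, and the granted actions in each cell (the user, table, and privilege names are hypothetical):

```python
# users (rows) x database objects (columns) -> granted actions
authorization_matrix = {
    "analyst": {"sales": {"SELECT"},           "customer": {"SELECT"}},
    "clerk":   {"sales": {"SELECT", "INSERT"}, "customer": set()},
}

def has_privilege(user, table, action):
    # The DBMS consults the matrix before executing a statement.
    return action in authorization_matrix.get(user, {}).get(table, set())
```

In a real DBMS the matrix is maintained by the DBA through DCL commands rather than edited directly.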

DBMS components

**data administration component
-used for technical, administrative, and maintenance tasks of database systems
-DCL (data control language) and TCL (transaction control language) SQL commands are used during these tasks
**application development component
-used to develop front-end applications

DBMS components

**data manipulation component
-used to insert, read, update, and delete information in a database
-uses DML (data manipulation language) SQL commands
-single-user systems
~data manipulation component used by one user at a time
-multiuser systems
~data manipulation component used by multiple users at the same time
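The four DML operations can be sketched with sqlite3 against a hypothetical `employee` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")

# DML: insert, read, update, and delete
conn.execute("INSERT INTO employee VALUES (1, 'Ava', 50000)")
conn.execute("INSERT INTO employee VALUES (2, 'Ben', 55000)")
conn.execute("UPDATE employee SET salary = 60000 WHERE id = 2")
conn.execute("DELETE FROM employee WHERE id = 1")
rows = conn.execute("SELECT name, salary FROM employee").fetchall()
# rows == [("Ben", 60000.0)]
```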

data warehouse deployment

**data warehouse deployment
-release the populated data warehouse and front-end (BI) applications for use by end users
~alpha release
*internal deployment of the system to members of the development team for initial testing of functionality
~beta release
*deployment of the system to a selected group of users to test system usability
~production release
*actual deployment of the functioning system

database administration

**database administration
-activities needed for proper functioning of a database system:
*monitor and maintain the database system
*secure the database against unauthorized access
*provide database backup and recovery
*ensure database integrity
*optimize database performance
*develop and implement database policies and standards

database administration

**developing and implementing database policies and standards
-policies and standards for database development
*e.g. naming conventions
-policies and standards for database use
*e.g. business rules
-policies and standards for database management and administration
*e.g. policy for assigning administration tasks
-the common purpose of database policies and standards is to reflect and support business processes and business logic

ETL

**extraction - retrieval of analytically useful data from operational data sources, to be loaded into the DW
-examination of available sources
-available sources and requirements determine the DW model
-the DW model provides a blueprint for the ETL infrastructure and extraction procedures

ETL

**load - loading extracted, transformed, and quality-assured data into the target DW
-automatic, batch process that inserts data into DW tables without user involvement
-initial load (first load) populates the empty DW tables
*can involve large amounts of data, depending on the desired time horizon of the DW
-every subsequent load is referred to as a refresh load
-refresh cycle - time period for loading new data (e.g. hourly, daily), determined in advance:
*based on the needs of DW users and technical feasibility
*in an active DW, loads occur in continuous micro batches
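The initial-load/refresh-load distinction can be sketched with a high-water mark: the initial load populates the empty table, and each refresh inserts only rows extracted since the previous load (the field names and in-memory "warehouse" list are hypothetical stand-ins):

```python
warehouse = []          # stands in for the target DW fact table
last_loaded = 0         # high-water mark left by the previous load

def load(batch, high_water_mark):
    # batch process: inserts rows without user involvement
    new_rows = [r for r in batch if r["extract_seq"] > high_water_mark]
    warehouse.extend(new_rows)
    return max((r["extract_seq"] for r in new_rows), default=high_water_mark)

# initial load populates the empty DW table
last_loaded = load([{"extract_seq": 1, "amount": 10},
                    {"extract_seq": 2, "amount": 20}], last_loaded)

# refresh load on the next cycle picks up only the new row
last_loaded = load([{"extract_seq": 2, "amount": 20},
                    {"extract_seq": 3, "amount": 30}], last_loaded)
```

Real ETL tools track the high-water mark (or use change-data-capture) so a refresh never reloads rows already in the warehouse.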

database administration

**monitor and maintain the database system
-sample activities:
*recognize when maintenance activities are needed
*observe usage of tables
*manage and upgrade database software and hardware resources
-data dictionary: repository of the metadata
-catalog: data dictionary created by the DBMS
*sample catalog

database administration

**optimizing database performance
-seeks to minimize the response time for database queries
-involves actions such as:
*indexing
*denormalization
*view materialization
*query optimization
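Indexing is the easiest of these to demonstrate. In the sketch below (hypothetical table and index names), SQLite's `EXPLAIN QUERY PLAN` shows the query using the index instead of scanning the whole table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")

# Without an index this lookup scans every row; with it, the DBMS can
# jump straight to the matching rows, reducing query response time.
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 7"
).fetchall()
# the plan mentions the index rather than a full table scan
```

The trade-off: the index speeds up reads but must be maintained on every insert, update, and delete.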

database administration

**providing database backup and recovery
-DBMS actions in the event of a failure:
*rolling back to the checkpoint state
*redoing the updates in the recovery log since the last checkpoint
-TCL command COMMIT
*causes all the updates to be recorded on the disk
-TCL command ROLLBACK
*rolls back all the updates since the last COMMIT
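COMMIT and ROLLBACK can be sketched with sqlite3 (hypothetical `account` table; `isolation_level = None` lets us issue the TCL commands explicitly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None          # manage transactions explicitly
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO account VALUES (1, 100.0)")

conn.execute("BEGIN")
conn.execute("UPDATE account SET balance = balance - 40 WHERE id = 1")
conn.execute("COMMIT")               # updates are made permanent

conn.execute("BEGIN")
conn.execute("UPDATE account SET balance = balance - 999 WHERE id = 1")
conn.execute("ROLLBACK")             # undoes all updates since the last COMMIT

balance = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
# balance == 60.0 - the committed update survived, the rolled-back one did not
```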

database administration

**providing database backup and recovery
-backup - saving additional physical copies of the data
-recovery - recovering the content of the database after a failure
-recovery log
*logs database updates
*ensures against loss of updates
-checkpoint
*part of a recovery log
*indicates a point when updates are written on the disk

database administration

**providing database backup and recovery
-complete mirrored backup
*ensures against complete database destruction
**ensuring database integrity
-preventing insertion, modification, or deletion actions that result in invalid, corrupt, or low-quality data in the database
-database integrity can be compromised through events such as:
*unauthorized malicious data updates
*update failure
*accidental misuse

database administration

**securing the database against unauthorized access
-preventing unauthorized access to data
-using methods such as:
*authentication
*access privileges
*encryption

OLAP

**slice and dice
-adds, replaces, or eliminates:
*specified dimension attributes, or
*specific values of dimension attributes
-slice = filter
-dice = 2 or more slices
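A sketch of slice vs. dice on a handful of hypothetical fact rows - a slice applies one filter on a dimension attribute, and stacking a second filter turns it into a dice:

```python
facts = [
    {"region": "East", "year": 2023, "sales": 100},
    {"region": "East", "year": 2024, "sales": 120},
    {"region": "West", "year": 2024, "sales": 80},
]

# slice: filter on one dimension attribute value
slice_east = [f for f in facts if f["region"] == "East"]

# dice: a second slice on another dimension attribute
dice = [f for f in slice_east if f["year"] == 2024]
```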

ETL

**transformation - transforming the structure of the extracted data to fit the structure of the target data warehouse model
-e.g. adding surrogate keys
-data quality control and improvement are included in the transformation process
*data from data sources frequently exhibit data quality problems
*data sources often contain overlapping information
-data cleansing (scrubbing) - the detection and correction of low-quality, redundant data
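The two transformation tasks named above - assigning surrogate keys and cleansing redundant source data - can be sketched like this (the source field names and values are hypothetical):

```python
extracted = [
    {"cust_code": "a-1", "name": " Ada Lovelace "},
    {"cust_code": "A-1", "name": "Ada Lovelace"},   # overlapping/duplicate record
    {"cust_code": "B-2", "name": "Grace Hopper"},
]

transformed, seen = [], set()
for row in extracted:
    code = row["cust_code"].upper()                 # standardize the code format
    if code in seen:
        continue                                    # cleansing: drop the duplicate
    seen.add(code)
    transformed.append({
        "customer_key": len(transformed) + 1,       # surrogate key for the DW
        "cust_code": code,
        "name": row["name"].strip(),                # cleansing: trim stray whitespace
    })
```

Real ETL tools apply the same ideas at scale, often with fuzzy matching rather than exact-key deduplication.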

DBMS Components

*DBMS software is used for:
1. creation of databases
2. manipulation of data in databases (i.e. insertion, storage, retrieval, update, and deletion)
3. maintenance of databases
4. creating front-end applications (in some DBMS packages)

Big Data Technologies Hadoop - Demystifying Facts

*Hadoop consists of multiple products
*Hadoop is open source but available from vendors too
*Hadoop is an ecosystem, not a single product
*HDFS is a file system, not a DBMS
*Hive resembles SQL but is not standard SQL
*Hadoop and MapReduce are related but not the same
*MapReduce provides control for analytics, not analytics itself
*Hadoop is about data diversity, not just data volume
*Hadoop complements a DW; it is rarely a replacement

Big Data Technologies - MapReduce

*MapReduce distributes the processing of very large multi-structured data files across a large cluster of ordinary machines/processors
*goal - achieving high performance with "simple" computers
*developed and popularized by Google
*good at processing and analyzing large volumes of multi-structured data in a timely manner
*example tasks: indexing the web for search, graph analysis, text analysis, machine learning, ...
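The programming model can be sketched with the classic word-count example: a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase aggregates each group. In real Hadoop each phase runs in parallel across the cluster; here everything runs on one machine:

```python
from collections import defaultdict

def map_phase(line):
    # emit a (word, 1) pair for every word in the input line
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    # aggregate all values emitted for one key
    return key, sum(values)

lines = ["big data big analytics", "big value"]

# shuffle: group the mapped pairs by key
groups = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        groups[key].append(value)

counts = dict(reduce_phase(k, v) for k, v in groups.items())
# counts == {"big": 3, "data": 1, "analytics": 1, "value": 1}
```

The point of the model is that map and reduce are independent per key, so the framework can split the work across many cheap machines.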

Appendix I NOSQL DATABASES

*NOSQL databases
-a database that is not based on the relational model
-does not use SQL as a query language
-data does not have to be stored in structures described by an organized database schema
-used with Big Data

OLAP

*online transaction processing (OLTP) - updating (i.e. inserting, modifying, and deleting), querying, and presenting data from databases for operational purposes
*online analytical processing (OLAP) - querying and presenting data from data warehouses and/or data marts for analytical purposes

Big Data and Data Warehousing

*What is the impact of Big Data on DW?
-Big Data and RDBMS do not go nicely together
-will Hadoop replace data warehousing/RDBMS?
*use cases for Hadoop:
-Hadoop as the repository and refinery
-Hadoop as the active archive
*use cases for data warehousing:
-data warehouse performance
-integrating data that provides business value
-interactive BI tools

database administration

*access privileges
-DCL commands GRANT and REVOKE
*encryption
-encryption key - used by the algorithm to scramble the information
-decryption key - reverts the information to its original state

database administration

*authentication
-login procedure using a user ID and password
*access privileges
-assigned to the database user account
-determine the user's privileges on database columns, relations, and views
-include the following actions:
*SELECT, UPDATE, ALTER, DELETE, INSERT

fundamentals of big data analytics

*big data by itself, regardless of the size, type, or speed, is worthless
*big data + "big" analytics = value
*with the value proposition, big data also brought about big challenges
-effectively and efficiently capturing, storing, and analyzing Big Data
-a new breed of technologies is needed (developed, purchased, hired, or outsourced...)

big data - definition and concepts

*big data is more than just "big"
*the Vs that define big data (HBR, 2012):
-volume: data quantity. data volume is increasing exponentially
-variety: data types. structured data and unstructured data of all varieties
-velocity: data speed. data are generated fast and need to be processed fast to maximize the data's business value
-*veracity: data messiness. the data quality is uncertain (*not in the HBR article)

big data - definition and concepts

*big data means different things to people with different backgrounds and interests
*traditionally, "big data" = massive volumes of data
*where does big data come from?
-everywhere! web logs, RFID, GPS systems, sensor networks

data scientist

*data scientist = big data guru
-one with the skills to investigate big data
*very high salaries, very high expectations
*where do data scientists come from?
-M.S./Ph.D. in MIS, CS, IE, ... and/or Analytics
-PE, PML, ... DSP (Data Science Professional)

OLAP

*drill down
-makes the granularity of the data in the query result finer
*drill up
-makes the granularity of the data in the query result coarser
*drilling up and drilling down do not filter the overall data; the overall total does not change

OLAP

*drill hierarchy
-set of attributes within a dimension where an attribute is related to one or more attributes at a lower level but only to one item at a higher level
*for example: StoreRegionName -> StoreZip -> StoreID
-used for drill down/drill up operations
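Drilling along the StoreRegionName -> StoreZip -> StoreID hierarchy can be sketched by re-aggregating the same facts at a coarser or finer attribute (the data rows are hypothetical); note the overall total never changes:

```python
from collections import defaultdict

facts = [
    {"region": "East", "zip": "01002", "store": "S1", "sales": 10},
    {"region": "East", "zip": "01002", "store": "S2", "sales": 20},
    {"region": "West", "zip": "90210", "store": "S3", "sales": 30},
]

def totals_by(attr):
    # aggregate sales at the granularity of one hierarchy attribute
    out = defaultdict(int)
    for f in facts:
        out[f[attr]] += f["sales"]
    return dict(out)

by_store = totals_by("store")     # finest granularity (drilled down)
by_region = totals_by("region")   # coarsest granularity (drilled up)
# the grand total is the same at every level of the hierarchy
```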

Data warehouse/data mart front-end (BI) applications

*executive dashboard
-intended for use by higher-level decision makers within an organization
-contains an organized, easy-to-read display of critically important queries describing organizational performance
-in general, the use of executive dashboards should require little or no effort or training
-executive dashboards can be web-based

big data technologies - hadoop

*Hadoop is an open source framework for storing and analyzing massive amounts of distributed, unstructured data
*originally created by Doug Cutting at Yahoo!
*Hadoop clusters run on inexpensive commodity hardware so projects can scale out inexpensively
*Hadoop is part of the Apache Software Foundation
*Hadoop + MapReduce = Big Data core technology

OLAP

*pivot (rotate)
-reorganizes the values displayed in the original query result by moving the values of a dimension column from one axis to another
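A pivot can be sketched by rotating one dimension of a flat query result onto the column axis (the result rows below are hypothetical):

```python
# flat query result: (region, year, sales) rows
rows = [
    ("East", 2023, 100), ("East", 2024, 120),
    ("West", 2023, 90),  ("West", 2024, 80),
]

# pivot: move the year dimension from the row axis to the column axis
pivoted = {}
for region, year, sales in rows:
    pivoted.setdefault(region, {})[year] = sales
# pivoted == {"East": {2023: 100, 2024: 120}, "West": {2023: 90, 2024: 80}}
```

The same values are displayed; only their arrangement changes.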

Creating a data warehouse

-use database management system (DBMS) software to implement the DW model as a collection of physically created and connected tables
-DWs are most often modeled as relational databases, and thus are
*implemented using a relational DBMS

big data

-massive volumes of diverse and rapidly growing data that are not formally modeled
-mostly unstructured, semi-structured, or lightly structured
-heterogeneous

when do you need Big Data Technology?

1. you can't process the amount and variety of data that you want to because of the limitations of your current platform
2. you need to (or want to) integrate data as quickly as possible to be current in your analysis
3. the data is arriving so fast at your organization's doorstep that your traditional analytics platform cannot handle it

Big Data Technologies

Hadoop (open source framework)
-MapReduce (a model for distributed processing)
-Hive (Hadoop-based data warehouse)
-NoSQL ("not only SQL" database, a type of database)
-HBase (a specific non-relational database)
-Pig (Hadoop-based query language)

