Database Exam 2
journalizing facilities of a database
Ability to store Transaction and Database Logs
statistical profiling
Analyze data patterns (distribution, outliers, frequencies) - specialized data profiling tools: IBM InfoSphere, SAP BusinessObjects Data Services, Informatica Data Profiling Solution
cold backup
Entire database is shut down during backup - usually a full backup
Hot backup
Entire system is not brought down - a selected portion of the system is shut down and backed up at a given time
"Heap" type of storage sorts the data before it saves it to the database.
False
The distributed file system in which Hadoop saves its data is called MapReduce
False - it's called "HDFS": Hadoop Distributed File System - MapReduce is an algorithm for massive parallel processing of various types of computing tasks
Key-value store NoSQL schemas support put, get, update, and delete operations on the database
False - no update (key-value stores have no knowledge of the structure of the values)
Database security
Protection of the data against accidental or intentional loss, destruction, or misuse (breach in security)
%
SQL wildcard - LIKE allows you to compare strings using wildcards
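Example (a minimal sketch; the customer_t table and its columns are hypothetical):

SELECT customer_name
FROM customer_t
WHERE customer_name LIKE 'Sm%';   -- % matches any string of zero or more characters

SELECT customer_name
FROM customer_t
WHERE customer_name LIKE 'Sm_th'; -- _ matches exactly one character (Smith, Smyth)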
physical access controls
Swipe cards, equipment locking, check-out procedures, screen placement, laptop protection
checkpoint facility
Synchronization between database and transaction logs
One of the biggest design considerations for Big Data storage design is "schema on read" instead of "schema on write".
True - while a structured (relational) database is designed based on how data is written, for big data the schema is designed not based on how the data should be stored, but on how the data needs to be read.
The traffic cop in Hadoop that allocates resources across the system is called ____
YARN
restricted update
a customer ID can only be deleted if it is not found in the order table
Journalizing facilities
audit trail of transactions and database updates - transaction log: record of essential data for each transaction processed against the database - database change log: images of updated data, before and after modification
Map
divide a request into part for local processing
Reduce
integrate the results of the individual map processes to produce a final result
JSON stands for
JavaScript Object Notation - the data format used by the most popular Document Store databases - schema on read
3 areas of SOX audits
- IT change management: processes by which changes to operational systems and databases are authorized - logical access to data: security procedures to prevent unauthorized access; personnel controls and physical access controls - IT operations: policies and procedures for day-to-day management of infrastructure, apps, and databases
SQL Environment
- catalog: set of schemas that constitute the description of a database - schema: structure that contains descriptions of objects created by a user (base tables, views, constraints) - data definition language (DDL): commands that define a database, including creating, dropping tables, and establishing constraints - data manipulation language (DML): commands that maintain and query a database - data control language (DCL): commands that control a database, including administering privileges and committing data
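One statement per language category, as a minimal sketch (table, column, and user names are hypothetical):

-- DDL: define the database
CREATE TABLE customer_t (
    customer_id   INTEGER PRIMARY KEY,
    customer_name VARCHAR(50) NOT NULL
);

-- DML: maintain and query the database
INSERT INTO customer_t VALUES (1, 'Ace Hardware');
SELECT customer_name FROM customer_t WHERE customer_id = 1;

-- DCL: administer privileges
GRANT SELECT ON customer_t TO sales_clerk;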
authentication schemes
- single factor: something you know (password) - 2 factor: something you know + something you have (smart card + PIN) - 3 factor: something you know + something you have + something you are (e.g., smart card + biometric + PIN)
authorization matrix- incorporated in the data management system
- subjects: users or groups of users (who needs access) - objects: entities in the database (access to what) - actions: DML commands permitted for a given subject on a given object - CRUD (Create, Read, Update, Delete - most sensitive) - constraints: further constraints on actions
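A minimal sketch of one cell of the matrix expressed in SQL DCL (the sales_clerk subject and order_t object are hypothetical):

-- subject: sales_clerk; object: order_t; permitted actions: read and create
GRANT SELECT, INSERT ON order_t TO sales_clerk;

-- withhold the most sensitive action
REVOKE DELETE ON order_t FROM sales_clerk;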
Data Lake
-A large integrated repository for internal and external data that does not follow a predefined schema - Capture everything, dive in anywhere, flexible access
Transaction ACID Properties
-Atomic: Transaction cannot be subdivided -Consistent: Constraints don't change from before transaction to after transaction -Isolated: Database changes not revealed to users until after transaction has completed -Durable: Database changes are permanent
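A minimal sketch of atomicity and durability (account_t is hypothetical; transaction-start syntax varies by DBMS):

START TRANSACTION;  -- BEGIN in some DBMSs
UPDATE account_t SET balance = balance - 100 WHERE account_id = 1;
UPDATE account_t SET balance = balance + 100 WHERE account_id = 2;
COMMIT;  -- both updates become permanent together; ROLLBACK would undo both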
Improving data capture processes
-Automate data entry as much as possible -Manual data entry should be selected from preset options (drop-down list, checkbox) -Use trained operators when possible -Follow good user interface design principles -Immediate data validation for entered data - referential integrity check
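Immediate validation can be declared in the schema itself, as in this minimal sketch (tables and columns are hypothetical):

CREATE TABLE order_t (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer_t (customer_id), -- referential integrity check
    quantity    INTEGER CHECK (quantity > 0)                          -- rejects invalid values at entry time
);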
non-unique (secondary) index
-Done for fields that are often used to group individual entities (e.g., zip code, product category, phone, last name)
Internet security
-Firewall - IP whitelisting -IDS - Intrusion Detection Systems -Encryption/decryption
purposes of data quality:
-Minimize IT project risk -Make timely business decisions -Ensure regulatory compliance (SOX) -Expand customer base - competitive advantage
select statement clauses:
-SELECT:List the columns (and expressions) to be returned from the query -FROM: Indicate the table(s) or view(s) from which data will be obtained -WHERE: Indicate the conditions under which a row will be included in the result -GROUP BY: Indicate categorization of results -HAVING: Indicate the conditions under which a category (group) will be included - ORDER BY: Sorts the result according to specified criteria
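A minimal sketch using every clause, in the required order (order_t and its columns are hypothetical):

SELECT   customer_id, SUM(order_total) AS total_sales
FROM     order_t
WHERE    order_date >= '2024-01-01'
GROUP BY customer_id
HAVING   SUM(order_total) > 1000
ORDER BY total_sales DESC;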
Characteristics of Big Data
-Schema on Write - preexisting data model; how traditional databases are designed (relational databases) -Schema on Read - data model determined later, depends on how you want to use it (XML, JSON) - capture and store data, worry about how you want to use it later
unique (primary) index
-Typically done for primary keys, but could also apply to other unique fields
Causes of poor data quality
-external data sources -redundant data storage and inconsistent metadata -data entry -lack of organizational commitment
8 parameters of data quality
1. uniqueness 2. accuracy 3. consistency 4. completeness 5. timeliness 6. currency 7. conformance 8. referential integrity (see chapter 12 slides 5 and 6 for definitions)
4 basic facilities for backup and recovery:
1.Backup 2.Journalizing - audit trail of transactions and database updates 3.Checkpoint - automatic recovery points; if the database fails it can be recovered back to the last checkpoint 4.Recovery Manager
steps in data quality improvement
1.Establish Data Governance and data stewardship program 2.Perform periodic data quality audit 3.Improve data capture processes 4.Apply modern data management principles and technology
types of file organizations
1.Heap - stored in no particular order as data comes in; not ideal for quick access; fast write, not optimized for read 2.Sequential - stored sorted sequentially by Primary Key; data is inserted at a position that keeps it sorted (write); binary search splits the list in half until the record is found (read) 3.Indexed - stored sorted or unsorted, by primary or secondary index 4.Hashed - address of each record determined by a hashing algorithm
when to use indexes
1.Use on larger tables 2.Index the primary key of each table 3.Index search fields (fields frequently in WHERE clause) 4.Fields in SQL ORDER BY and GROUP BY commands 5.When there are >100 values but not when there are <30 values 6.Avoid use of indexes for fields with long values; perhaps compress values first 7.If key to index is used to determine location of record, use surrogate (like sequence number) to allow even spread in storage area 8.DBMS may have limit on number of indexes per table and number of bytes per indexed field(s) 9.Be careful of indexing attributes with null values; many DBMSs will not recognize null values in an index search 10.Use a query optimizer
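A minimal sketch of both index types (index and table names are hypothetical; many DBMSs index the primary key automatically):

CREATE UNIQUE INDEX customer_pk_idx ON customer_t (customer_id); -- unique (primary) index
CREATE INDEX customer_zip_idx ON customer_t (zip_code);          -- non-unique (secondary) index on a frequent search/grouping field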
authentication
ensuring that a user is who he or she claims to be
authorization
rules to specify who has what access rights to what data elements
Data Administration - data stewards for each function
A high-level function that is responsible for the overall management of data resources in an organization, including maintaining corporate-wide definitions and standards
data steward
A person responsible for ensuring that organizational applications properly support the organization's data quality goals - Data Steward Coordinating Committee: coordinates and standardizes between different business units
Database Administration
A technical function that is responsible for physical database design and for dealing with technical issues such as security enforcement, database performance, and backup and recovery - managing day-to-day operations
Pig
A tool that integrates a scripting language and an execution environment intended to simplify the use of MapReduce - useful development tool
Back-up facilities: stored in secure, off-site location
DBMS copy utility that produces a backup copy of the entire database or a subset - periodic backup (nightly, weekly) - cold backup: database is shut down during backup, usually full - hot backup: selected portion is shut down and backed up at a given time
Find the total sales for the Delaware division of my company for the last quarter.
Descriptive analytics - describes the past status of the domain of interest using a variety of tools through techniques such as reporting, data visualization, dashboards, and scorecards
Hive
Integrates HiveQL, an SQL-like language, into the Hadoop ecosystem, providing a declarative interface for managing data stored in Hadoop - useful for ETL tasks
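A minimal HiveQL sketch (the web_logs table is hypothetical):

-- declare a table over delimited files stored in Hadoop
CREATE TABLE web_logs (ip STRING, url STRING, hits INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- declarative query; Hive compiles it to MapReduce-style jobs
SELECT url, SUM(hits) FROM web_logs GROUP BY url;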
Hadoop Distributed File System (HDFS)- most talked about Big-Data data management product
a file system designed for managing a large number of potentially very large files in a highly distributed environment - MapReduce is an algorithm for massive parallel processing of various types of computing tasks - Hadoop is an open source implementation framework of MapReduce
cascaded update
changing a customer ID in the customer table will result in that value changing in the order table to match
Solution to a lost-update problem
concurrency control: process of managing simultaneous operations against a database so that data integrity is maintained, and the operations do not interfere with each other in a multi-user environment
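A minimal sketch of one common mechanism, pessimistic row locking (account_t is hypothetical; FOR UPDATE support varies by DBMS):

START TRANSACTION;
SELECT balance FROM account_t WHERE account_id = 1 FOR UPDATE; -- row locked until COMMIT
UPDATE account_t SET balance = balance - 100 WHERE account_id = 1;
COMMIT;  -- a second user's conflicting update must wait, so no update is lost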
Two-factor authentication adds "something you _____" in addition to one factor authentication which uses "something you know"
have
vector aggregate
multiple values returned from SQL query with aggregate function (via GROUP BY)
NoSQL
not only SQL - a category of recently introduced data storage and retrieval technologies not based on the relational model
What would be the likely total sales for the Delaware division of my company for the next quarter?
predictive analytics - applies statistical and computational methods and models to data about past and current events to predict what might happen in the future
What would my company need to do to increase the total sales for the Delaware division of my company for the next quarter by 15%?
prescriptive analytics - uses results of predictive analytics along with optimization and simulation tools to recommend actions that will lead to a desired outcome
documentation
set the benchmark by documenting data definition, controls and business rules - relationships, constraints, required and possible values for fields
Scalar aggregate
single value returned from SQL query with aggregate function
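A minimal sketch contrasting the two aggregate types (order_t is hypothetical):

SELECT AVG(order_total) FROM order_t;   -- scalar aggregate: one value for the whole table

SELECT customer_id, AVG(order_total) AS avg_total
FROM order_t
GROUP BY customer_id;                   -- vector aggregate: one value per group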
set default update
when a customer ID is changed, any customer ID in the order table that matches the old customer ID is set to a predefined default value
set null update
when a customer ID is changed, any customer ID in the order table that matches the old customer ID is set to null
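The four referential actions (restrict, cascade, set default, set null) map to SQL foreign-key clauses, as in this minimal sketch (tables are hypothetical; not every DBMS supports every action):

CREATE TABLE order_t (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER DEFAULT 0,
    FOREIGN KEY (customer_id) REFERENCES customer_t (customer_id)
        ON DELETE RESTRICT   -- restricted: delete blocked while matching orders exist
        ON UPDATE CASCADE    -- cascaded: matching order rows change to the new value
        -- alternatives: ON UPDATE SET NULL / ON UPDATE SET DEFAULT
);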
key value stores
- A simple pair of a key and an associated collection of values - Key is usually a string; serves a similar purpose to a primary key - Database has no knowledge of the structure or meaning of the values - 3 basic operations: "put", "get", and "delete" - no update (the structure of the data is unknown)
Hbase
- A wide-column store database that runs on top of HDFS - Modelled after Google's BigTable - Not as popular as Cassandra
Big data
Data that exist in very large volumes and many different varieties (data types) and that need to be processed at a very high velocity (speed)
data dictionary
Documents data elements of a database
Factors for selecting file organization
- Fast data retrieval and throughput - Efficient storage space utilization - Protection from failure and data loss - Minimizing need for reorganization - Accommodating growth - Security from unauthorized use
personnel controls
Hiring practices, employee monitoring, security training, SOD (Segregation of Duties)
graph oriented database (AI, NN)
- Maintain information regarding the relationships between data items - Nodes with properties - Connections between nodes (relationships) can also have properties
wide column stores- amazon web services
- Semi-structured: rows and columns - Each row has a different structure and length - Distribution of data based on both key values (records) and columns, using "column groups/families" - Column groups/families are columns that are best stored together
information repository
Stores metadata describing data and data processing resources
system catalog
System-created database that describes all database objects
analytics
Systematic analysis and interpretation of data - typically using mathematical, statistical, and computational tools - to improve our understanding of a real-world domain
analytics encompasses more than BI:
- Umbrella term that includes BI - MIS -> DSS -> EIS (most long-term decisions) - Transform data to useful form - Infrastructure for analysis - Data cleanup processes - User interfaces
The Five Vs of Big Data
- Volume - much larger quantity of data than typical for relational databases - Variety - lots of different data types and formats - Velocity - data comes at a very fast rate (e.g., mobile sensors, web click stream) - Veracity - accuracy; traditional data quality methods don't apply - how to judge the data's accuracy and relevance? - Value - big data is valuable to the bottom line, and for fostering good organizational actions and decisions
5 goals of data governance
- Centralized maintenance - Transparency across the organization - Measure and improve data quality and data availability - Mitigate business risks from bad data - Data security
cloud-based data management services
Cloud computing - provisioning/acquiring computing services on demand using centralized resources accessed through public Internet or private networks
recovery manager
DBMS module that restores the database to a correct condition when a failure occurs and then resumes processing user requests. Recovery and restart procedures: - Disk Mirroring (failover): switch between identical copies of the database; requires two copies to be maintained - Backward Recovery (Rollback): apply before images; the recovery manager restores the database to its state before the transaction began - the database contains no changes from the transaction but is consistent, and the transaction is repeated manually - Restore/Rerun: restore the backup and reprocess transactions against it manually (only done as a last resort) - Forward Recovery (Roll Forward): apply after images (preferable to restore/rerun)
Checkpoint facilities
- DBMS periodically refuses to accept new transactions - system is in a quiet state - Database and transaction logs are synchronized - A DBMS may perform checkpoints automatically (which is preferred) or in response to commands in user application programs - this allows the recovery manager to resume processing from a recent checkpoint, instead of repeating an entire day of transactions
MapReduce Algorithm
- Enables parallelization of data storage and computational problem solving in an environment consisting of a large number of commodity servers - Programmers don't have to be experts at parallel processing (YARN takes care of this) - Core idea: divide a computing task so that multiple nodes can work on it at the same time - Each node works on local data doing local processing - 2 stages: map (divide for local processing) and reduce (integrate the results of the individual map processes to produce a final result)
downside of cloud-based data management services
- Existing systems do not yet provide capacity using a model that would automatically adapt to the changing requirements placed on the system - Current systems are not yet providing full consistency guarantees in a highly distributed environment - Live migration is still a challenging task that requires manual planning, initiation, and management - It is challenging to monitor the extent to which cloud providers are maintaining their Service Level Agreement (SLA) commitments - DBaaS solutions are still struggling to find fully scalable models for providing ACID support for transactions
document stores (most common)
- Like a key-value store, but "document" goes further than "value" - Document is structured so specific elements can be manipulated separately (the limitation of key-value stores) - Uses JSON (JavaScript Object Notation) - Document is accessed using the "key" - Individual values in the document can be accessed using JSON notation and updated
benefits of cloud based management services
- No need for initial investments in hardware, physical facilities, and systems software - Significantly lower need for internal expertise in the management of the database infrastructure - Better visibility of overall costs of data management - Increased level of flexibility (elasticity) in situations when capacity needs to fluctuate significantly - scale up or down - Allows organizations to explore new data management technologies more easily - Mature cloud service providers have expertise to provide a high level of availability, reliability, and security
Hadoop Distributed File System (HDFS)
- Not a DBMS, not relational - but a file system that can handle very large files - Breaks data into blocks and distributes them on various computers (nodes) throughout a Hadoop cluster - Each cluster consists of a master server (NameNode) and some slaves (DataNodes) - The NameNode manages the file system namespace and the access of data from the DataNodes - Assumes hardware failure: by default HDFS maintains 3 copies of the data, avoiding a single point of failure - YARN ("yet another resource negotiator") - traffic cop; resource allocator across the distributed system (similar to a DBMS) - No updates to existing data in files, just appending to files - "Move computation to the data", not vice versa
SOX
- Requires companies to audit the access to sensitive data - Designed to ensure integrity of public companies' financial statements
NoSQL characteristics
- Scaling out rather than scaling up - commodity servers - A single point of failure will not lead to failure of the entire system - natural for a cloud environment - Supports schema on read - Largely developed from open source - Not ACID compliant (Atomicity, Consistency, Isolation, Durability)! Stress is on other criteria - BASE - BASE: Basically Available, Soft state, Eventually consistent
business intelligence
a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information
index
a table or other data structure used to determine in a file the location of records that satisfy some condition - PKs are automatically indexed