FINAL REVIEW Chapter 1 - MONGO DB
Optional attribute:
Attribute does not require a value and can be left empty
Transitive Dependency
Attribute is dependent on a non-key attribute
MapReduce + Hadoop =
Big Data core technology
Business Rules
Brief, Precise description of a policy, procedure, or principle of a business.
Atomicity
Characteristic of an atomic attribute
Production database
Contains the data you are using for production tasks
Operational database:
Designed to support a company's day to day operations
MongoDB (Drop Command)
Drops a collection
Partial Dependency
Functional dependence in which the determinant is only part of the primary key
Business Intelligence
Information collected from multiple sources such as suppliers, customers, competitors, partners, and industries that analyzes patterns, trends, and relationships for strategic decision making
___________ allows information to be intelligently combined from two or more tables
Join
___________ consist of one or more attributes that determine other attributes
Keys
Granularity
Level of detail represented by the values stored in a table's row
Semi-Structured Data:
Lies between the lines of both data types Generated during a web transaction → Web applications JSON Data
Natural join
Links tables by selecting only the rows with common values in their common attributes
OLAP (online analytical processing)
Manipulation of information to create business intelligence in support of strategic decision making
Data Integrity Management:
Minimizes redundancy and maximizes consistency
Enterprise database
More than 50 people, supports many users across many departments
inner join
Most common type of join; includes rows in the query only when the joined field matches records in both tables.
Class
Represents an object or a set of objects that share a common structure and behavior.
INTERSECT
Returning only rows that appear in both sets query INTERSECT query
JOIN with a Lack of Common Attributes
Returns a products of the two tables
Alter Table
The SQL command used to make changes to table structure. When the command is followed by a keyword (ADD or MODIFY), it adds a column or changes column characteristics.
COMMIT statement
To commit changes made within a manual transaction to the database.
Online Analytical Processing (OLAP)
Tools for retrieving, processing, and modeling data from the data warehouse
MongoDB
an open source, document-oriented, non-relational DBMS
ROLLBACK statement
is used restore the database to its previous condition
Object-oriented database management system
A group of programs that manipulate an object-oriented database and provide a user interface and connections to other application programs.
Hadoop Distributed File System (HDFS)
A highly distributed, fault-tolerant file storage system designed to manage large amounts of data at high speeds.
JSON (JavaScript Object Notation)
A human-readable text format for data interchange that defines attributes and values in a document.
Composite Key
A primary key that consists of more than one attribute.
Database Management System (DBMS)
A program/service that creates, reads, updates, and deletes data in a database while controlling access and security
Subquery
A query (Select Statement) inside another query inner query is executed first
outer join
A relational algebra JOIN operation that produces a table in which all unmatched pairs are retained; unmatched values in the related table are left null.
Natural Join
A relational operation that yields a new table composed of only the rows with common values in their common attribute(s). -Eliminates duplicate columns
Crow's Foot notation
A representation of the entity relationship diagram that uses a three-pronged symbol to represent the "many" sides of the relationship.
Atomicity, Consistency, Isolation, Durability
A set of best practices for programmers to seek in all application or data base design
relational algebra
A set of mathematical principles that form the basis for manipulating relational table contents; the eight main functions are SELECT, PROJECT, JOIN, INTERSECT, UNION, DIFFERENCE, PRODUCT, and DIVIDE.
relational database management system
A software program in which data is organized as a collection of tables, and relationships between tables are formed through a common field.
Data warehouse:
Stores data in a format optimized for decision support
Data Dictionary:
Stores definition of data elements and their relationship Not the data itself, it describes the structure of the data.
use demo
Switch current database to demo
database instance
The actual data stored in a database at a particular moment in time
FROM clause
The clause that indicates the table from which to retrieve the specified columns
Client Node
The client node acts as the interface between the user application and the HDFS.
Transaction
is a sequence of database operations that access the database have four main properties: atomicity, consistency, isolation, and durability
1NF Characterisitcs
- All attributes are defined - There are no repeating groups in the table - All attributes are dependent on the primary key *There are some partial dependencies*
3NF characteristics
- Contains no transitive dependencies - Is in 2NF
2NF characteristics
-1NF has a composite primary key (dual) OR a single-attributed primary key - Has no partial Dependencies
Workgroup databases
50 people or less, support small number of users within a specific department
column family database
A NoSQL database model that organizes data into key-value pairs, in which the value component is composed of a set of columns that vary by row.
key-value database
A NoSQL database model that stores data as a collection of key-value pairs in which the value component is impossible to understand to the DBMS
Document Database
A NoSQL database model that stores data in key-value pairs in which the value component is composed of a tag-encoded document. documents encapsulate and convert data (or information) in some standard formats or encodings
concurrent backup
A backup that takes place while one or more users are working on a database.
Job Tracker
A central control program used to accept, distribute, monitor, and report on MapReduce processing jobs in a Hadoop environment.
MapReduce program
A composite program that consists of a Map procedure that performs filtering and sorting and a Reduce method that performs a summary operation.
BSON (Binary JSON)
A computer-readable format for data interchange that expands the JSON format to include additional data types including binary objects.
logical independence
A condition that exists when the internal model can be changed without affecting the conceptual model.
Centralized database:
A database located at a single site.
single-user database
A database that supports only one user at a time.
Conceptual Schema
A detailed, hardware independent specification of the overall structure of organizational data.
class diagram
A diagram consisting of classes (i.e., sets of objects) and relationship among the classes, usually in UML object notation
graph database
A graph database offers an alternative way to track relationships; its structure resembles sociograms with their interlinked nodes
Database schema
A graphic that documents the data model and shows the tables, attributes, keys, and logical relationships for a database.
Equijoin
A join operator that links tables based on an equality condition that compares specified columns of the tables.
Key Attribute
A key attribute is the unique characteristic of the entity. For ex. Name and hire date are attributes of the entity Employee.
prime attribute
A key attribute; that is, an attribute that is part of a key or is the whole key.
Unified Modeling Language (UML)
A language based on object-oriented concepts that provides tools such as diagrams and symbols to graphically model a system.
Structured Query Language (SQL)
A language used to create and manipulate databases.
data warehouse
A logical collection of information - gathered from many different operational databases - that supports business analysis activities and decision-making tasks
Distributed Database:
A logically related database that is stored over two or more physically independent sites.
Entity Relationship Diagram/Model
A methodology for documenting databases illustrating the relationship between various entities in the database.
Cascading "Order" Sequence
A nested ordering sequence for a set of rows, such as a list in which all last names are alphabetically ordered and, within the last names, all first names are ordered.
Recursive Query
A nested query that joins a table to itself. EX. Identifying managers of employees in EMP table
NoSQL
A nonrelational database that is better tuned for accessing large data sets. Advantages •High scalability, availability, and fault •Uses low-cost commodity hardware •Supports Big Data Disadvantages •Complex programming is required •There is no relationship support •There is no transaction integrity support
Existence independence
A strong entity that exists apart from all of its related entities
Database Security
A subset of information security that focuses on the assessment and protection of information stored in data repositories like database management systems and storage media.
Existence dependence
A weak entity that exists in the database only when it is associated with another related entity
performance tuning
Activities that make a database perform more efficiently in terms of storage and access speed.
WHERE clause
Adds conditional restrictions to the SELECT statement
Insert Statement
Adds one or more records to any single table in a relational database
Unary relationship (recursive relationship)
An association is maintained within a single entity type
primary key (PK)
An attribute or a combination of attributes that uniquely identifies an instance in a table --> cant have a null value *Derivied from candidate keys
Superkey
An attribute or attributes that uniquely identify each entity in a table. --> Sets the basis for identifying candidate keys
atomic attribute
An attribute that cannot be further subdivided to produce meaningful components. EX: Last Name
nonprime attribute
An attribute that is not part of a key.
Derived attribute:
An attribute whose values can be calculated from related attribute values. Age Attribute value is derived from Birthdate attribute
Candidate Key
An attribute(s) that is derived from the Superkey that bears no redundant attributes.
composite/associative entity
An entity with 1:M relationship with parent entities composed of primary key attributes of each parent entity
Neo4j
An organization that requires a graph database that is highly scalable would select the ________ database management system.
Single-valued attribute:
Attribute that has only a single value • e.g.) SSN, a part's serial number
Key attribute:
Attribute that is a part of a key
Multivalued attributes:
Attributes that have many values • e.g.) degrees, a car's color
incremental backup
Backup that copies only the changed data since the last backup. -To prevent redundancy since why would anyone want to back up the same data that's not changed recently.
EXCEPT (Minus)
Combines rows from two queries and returns only the rows that appear in the first set query EXCEPT query or query MINUS query
Heartbeat
Communication between nodes within HDFS to ensure a node is active
Data Definition Language (DDL)
DBMS language that builds the data dictionary, creates the database, describes logical views, and specifies record or field security constraints
Data Manipulation Language (DML)
DBMS language that changes database content, including data element creations, updates, insertions, and deletions
DML
Data Manipulation Language--Used to query and update data stored in the tables.
Isolation
Data used during transaction cannot be used by second transaction until the first is completed
NoSQL
Database management systems that handle non-relational data like documents, increasingly used for maintaining and querying the large amounts of data from the Internet. Queries can be made using "Not only Structured Query Language."
very large databases (VLDBs)
Database that contains huge amounts of data—gigabyte, terabyte, and petabyte ranges are not unusual.
ANSI
Defines the standard data types that are supported by RDBMS vendors
Transaction Managment
Definition based on transaction isolation levels. Transaction isolation levels refer to the degree to which transaction data is "protected or isolated" from other concurrent transactions
strong (identifying) relationship
Exists when an entity pulls a primary key from its parent entity and uses it in its primary key as a foreign key.
JOIN ON
Express a join when the tables have no common attribute names Query returns only the rows that meet the indicated join condition
ETL
Extract, Transform, and Load. Used to standardize data across systems, that allow it to be queried.
foreign key (FK)
Foreign keys are the primary keys of other tables placed in the current table to support the link between the two tables.
Structured Data
Formatted Unstructured Data Categorized pieces of information for one instance
database dump
If your database is large or takes too long to pack, dumping a database is the preferred backup method. This method creates a dump (.dmp) file containing only the database metadata, instead of producing a pack file that contains the file system data as well as the metadata.
Chen notation
In ____________________, relationships are represented by a diamond connected to the related entities through a relationship line.
logical design
Logical design is used to translate the conceptual design into the internal model for a selected database management system
schema-less databases,
NoSQL tools create schema-less databases, allowing data structures such as fields to be added
Unstructured Data:
Not well-defined or under a specific format Like emojis, letters, numbers --> Social media
Optional participation
One entity occurrence does not require a corresponding entity occurrence in a particular relationship --> No instances of entity are required
Mandatory participation
One entity occurrence requires a corresponding entity occurrence in a particular relationship --> Atleast 1 instance of an entity
Project
Only views specific columns in a table
Select
Only views specific rows in a table
Physical model
Operates at lowest level of abstraction •Describes the way data are saved on storage media such as magnetic, solid state, or optical media
transactional database
Optimized for running production systems — everything from websites to banks to retail stores. They excel at reading and writing individual rows of data very quickly while maintaining data integrity.
Index
Orderly arrangement to logically access rows in a table
Consistency
Permanence of database's consistent state
show databases
Print a list of all available databases (that could not be accessed on the server as well)
show collections;
Print a list of all collections for current database.
show dbs
Print a list of all databases on the server.
show roles
Print a list of all roles, both user-defined and built-in, for the current database.
show users
Print a list of users for current database.
Data Anomalies
Refers to data inconsistencies caused by redundancy
INNER JOINS
Returns only rows from the tables that match on a common value
OUTER JOIN
Returns the same matched rows as inner join as well as UNMATCHED rows
Rules of Precendence
Rules that establish the order in which computations are completed
JOIN USING
SELECT * FROM T1 JOIN T2 USING (Common Column) returns only the rows with matching values in the columns indicated in the USING clause
attribute domain
Set of allowable values for one or more attributes within a table
Arithmetic Operators
Symbols to compute mathematical processes
schema-less
Systems that can absorb any type of data, structured or not, from any type of source.
UPDATE Statement
The UPDATE statement is used to update existing records in a table. UPDATE table_name SET column1=value1,column2=value2,... WHERE some_column=some_value; Notice the WHERE clause in the SQL UPDATE statement! The WHERE clause specifies which record or records that should be updated. If you omit the WHERE clause, all records will be updated!
Data Node
The data node stores fixed-size data blocks (that could be replicated to other data nodes).
Extensible Markup Language (XML)
The markup language designed to transport and store data on the Web.
Name Node
The name node stores all the metadata about the file system.
Normalization
The process of reducing redundancy in database tables to prevent errors. However, it reduces database performance since it creates lots of tables --> which requires more joins -Refines primary keys
Denormalization
The process of splitting or combining normalized relations into physical tables based on affinity of use of rows and fields.
Disaster Management
The set of DBA activities dedicated to securing data availability following a physical disaster or a database integrity failure.
Atomicity
The transaction property that requires all parts of a transaction to be treated as a single, indivisible, logical unit of work. All parts of a transaction must be completed or the entire transaction is aborted.
batch processing
Transaction processing system (TPS) that processes data in batches at fixed periodic intervals.
constraint
Used to limit the type of data that can go into a table to ensure that the table is accurate
UNION all
Used to produce a relation that retains the duplicate rows
CREATE VIEW command
Views are built using the CREATE VIEW command. Within this command, you specify the base table(s) on which the view is to be based and the attributes and rows of the table(s) that are to be included in the view.
5 Vs of Big Data
Volume: Data comes from different sources Variety: Data comes in different formats Velocity: Data is transferred at optimal speeds Veracity: Data should be accurate Value: Data contains lots of valuable information
Product
Yields all possible pairs of rows from two tables
Difference
Yields all rows in one table that are not found in the other table
Big Data
a broad term for datasets so large or complex that traditional data processing applications are inadequate. Volume, Velocity, Variety, Veracity (accurate), Value
Database
a collection of organized data that allows access, retrieval, and use of data
stored procedure
a computer program stored in the database that is used to enforce business rules
data mart
a data collection, smaller than the data warehouse, that addresses the needs of a particular department or functional area of the business
aggregation function
a function where the values of multiple rows are grouped together to form a single summary value. SUM, MIN, AVG etc
conceptual model
a verbal or graphical explanation for how a system works or is organized
Alias
are used to give a table, or a column in a table, a temporary name
Required attribute:
attribute must have a value and cannot be left empty
Composite attribute:
attribute that can be subdivided to yield additional attributes -->EX: Body Paint --> subframe paint, rear paint
Audit log:
automatically records description of database operations performed by all users
Physical independence
changes in physical model do not affect internal model
Entity Integrity
condition in which each row in the table has its own unique identity Primary key must have a unique value in each row
Metadata
data that describes data
Create Collection / Rename Collection
db.createCollection("newproducts"); db.newproducts.renameCollection("products"); self explanatory
Update and Remove (Mongo DB)
db.products.update({name: "regular desk chair"},{name: "basic desk chair", price: 100, brand: "RoughRider", type: "chair"}); db.products.remove({name: "basic desk chair"}); Update changes a record in a collection whereas remove drops a record in the colleciton
Data Visualization
describes technologies that allow users to see or visualize data to transform information into a business perspective
bucket
equivalent of tables
dirty data
erroneous or flawed data
full backup
exact copy of an entire database
Cardinality
expresses the minimum and maximum number of entity occurrences associated with one occurrence of related entity
Predictive Analytics
extracts information from data and uses it to predict future trends and identify behavioral patterns
Lock:
guarantees exclusive use of a data item to a current transaction
relationship degree
indicates the number of entities or participants associated with a relationship --> unary etc.
Secondary key
key used strictly for data retrieval purposes
business intelligence
leverages software and services to transform data into insights that influences an organization's decisions.
Theta join
links tables using an inequality comparison operator
data model
logical data structures that detail the relationships among data elements using graphics or pictures --> entities and attributes
Inheritance
object inherits methods and attributes of classes above it
high availability
occurs when a system is continuously operational at all times
Authorization management:
procedures to protect database security and integrity: -User access management -View definition -DBMS access control -DBMS usage monitoring
Task tracker:
reduces tasks on a node
hierarchical model
resembles an upside-down tree in which each class has only one parent
find()
retrieves objects from a collection that match the restrictions provided
Mapper Method
takes a collection of data and sorts and filters it into a set of key-value pairs
Data Administrator (DA)
the person in charge of the structure and integrity of the data; manages important aspects of data definition and database design including: data standards, data use, data quality Has higher authority than dataBASE admin.
Database Administrator (DBA)
the person responsible for coordinating, controlling, and managing the database
data mining
the process of analyzing data to extract information not offered by the raw data alone
Database Design
the process of converting a data model into tables, relationships, and data constraints Well-designed database: facilitates data management and generates accurate and valuable information Poorly designed database: causes difficult-to-trace eros that may lead to poor decision making
Ternary relationship
three entities are associated
HAVING clause
to filter the results grouped by a GROUP BY clause. Can only reference aggregate functions or column that are also specified in the GROUP BY clause.
•Binary relationship
two entities are associated
CREATE INDEX statement
used to define a clustered or non clustered index.
DISTINCT
used to return only distinct (different) values a column often contains many duplicate values; and sometimes you only want to list the different (distinct) values.
Wildcard Character
used to search for words that the user is not sure how to spell or for word variations
EXISTS
used to test for the existence of any record in a subquery. The operator returns TRUE if the subquery returns one or more records.
relational model
uses a two-dimensional table of rows and columns of data. Rows are records (also called tuples), and columns are fields (also referred to as attributes).
functional dependence
value of one or more attributes determines the value of one or more other attributes
ROLLUP Extension
•Used with GROUP BY clause to generate aggregates by different dimensions •Enables subtotal for each column listed except for the last one, which gets a grand total
Simple attribute:
Attribute that cannot be subdivided • e.g.) Age, gender, marital status
Collections (MongoDB)
Can be thought of as the relation (table) in DBMS, but with many differences.
Online Transaction Processing (OLTP)
Capturing of transaction and event information using technology to process it in conformity to business rules, storing it, and update current information.
ORDER BY
Clause used to sort the output of a SELECT statement
Union
Combines all rows from two tables, excluding duplicate rows
UNION
Combines rows from two or more queries without including duplicate rows query UNION query
Concurrency Control
Coordination of the simultaneous transactions executed in a multiuser database system Objective: ensures serializability of transactions in a multiuser database environment
Durability
Ensures that once transactions are committed they cannot be undone or lost
Referential Integrity
Every reference to an entity instance by another entity instance is valid --> referred entity must exist
Weak (non-identifying) relationship
Exists when an entity does not bear any primary keys from its parent entity.
GROUP BY
Groups rows that have the same values into summary rows, like "find the number of customers in each country" often used with aggregate functions to group the result-set by one
New Technologies of Big Data
Hadoop Hadoop Distributed File System MapReduce, NoSQL
Embedded SQL
Hard-coded SQL statements included in a program written in another language, such as C or Java.
Intersect
Identifies what rows are identical in both tables.
Two Main characteristics of Data
Ubiquitous (Global, everywhere) Pervasive (Unescapable, persistent)
CUBE extension
Used with GROUP BY clause to generate aggregates by the listed columns Enables you to get a subtotal for each column listed in the expression, in addition to a grand total for the last column listed
Divide
Uses one double-column table as the dividend and one single-column table as the divisor Output is a single column that contains all values from the second column of the dividend that are associated with every row in the divisor
Date Aritmetic
Values are stored as a number of days;