DSBA 6160 Final


A columnar database

(NoSQL) A database management system (DBMS) that stores data by column instead of by row. The goal of a columnar database is to write and read data to and from hard disk storage efficiently, speeding up the time it takes to return a query.

OLAP

(Online Analytical Processing) is the technology behind many Business Intelligence (BI) applications. OLAP is a powerful technology for data discovery, including capabilities for limitless report viewing, complex analytical calculations, and predictive "what if" scenario (budget, forecast) planning.

Peter Chen

*E-R modeling for databases* Developed entity-relationship modeling (1976 paper, "The Entity-Relationship Model: Toward a Unified View of Data").

Data Dictionary/System Catalogue

- The DBMS stores definitions of the data elements and their relationships (metadata) in a data dictionary.
- All programs that access the data in the database work through the DBMS.
- The DBMS uses the data dictionary to look up the required data component structures and relationships, which relieves you from coding such complex relationships in each program.
- Any changes made to a database structure are automatically recorded in the data dictionary, freeing you from having to modify all of the programs that access the changed structure.
- In other words, the DBMS provides data abstraction, and it removes structural and data dependence from the system.

Data marts

- Subset of the data warehouse
- Summarized or highly focused portion of the firm's data for use by a specific population of users
- Typically focuses on a single subject or line of business
- Segments of the data organized into subsets that focus on specific subjects; e.g., may contain specialized information about a single department

Clustering

A common mining algorithm. Used when you want an algorithm to find the rules of grouping and the number of groups. The main difference from classification tasks is that you don't actually know what the groups and the principles of their division are; for instance, this usually happens when you need to segment your customers and tailor a specific approach to each segment depending on its qualities. Divides a data set into groups; also called an unsupervised learning technique. Members of each group are as similar to each other as possible, and different groups are as far apart/different from each other as possible.
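The assign/update loop behind this idea can be sketched in pure Python; the 1-D toy data and starting centroids below are made up for illustration:

```python
# Minimal k-means sketch: alternate between assigning each point to its
# nearest centroid and moving each centroid to the mean of its cluster.
def kmeans_1d(points, centroids, iters=10):
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids

clusters, centroids = kmeans_1d([1, 2, 3, 10, 11, 12], centroids=[1.0, 12.0])
print(clusters)    # [[1, 2, 3], [10, 11, 12]]
print(centroids)   # [2.0, 11.0]
```

No labels were given; the two groups emerge from the data alone, which is what makes this unsupervised.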

Distributed database system

A database system in which the data used by the system is located on multiple computers connected via a network. The data is physically stored across several sites, and each site is typically managed by a DBMS capable of running independently of the other sites. In contrast to parallel databases, the distribution of data is governed by factors such as local ownership and increased availability.

Dimension Table

A dimension table contains the dimensions of a fact and is joined to the fact table via a foreign key. Dimension tables are de-normalized tables. The dimension attributes are the various columns in a dimension table; dimensions offer descriptive characteristics of the facts through their attributes. There is no set limit on the number of dimensions, and a dimension can contain one or more hierarchical relationships.

Fact Table

A fact table is the primary table in a dimensional model. A fact table contains measurements/facts (e.g., averages, sums) and foreign keys to the dimension tables.

Foreign key

A primary key of one table that appears as an attribute in another table, providing a logical relationship between the two tables. Two foreign keys can be used together as a key when an entity lacks a primary key of its own.

Second Normal Form

A relation in first normal form in which every nonkey attribute is fully functionally dependent on the primary key. That is, every nonkey attribute needs the full primary key for unique identification.

Third Normal Form (3NF)

A relation is said to be in Third Normal Form if there is no transitive functional dependency between nonkey attributes. When one nonkey attribute can be determined by one or more other nonkey attributes, there is said to be a transitive functional dependency. Example: the side effect column in the Surgery table is determined by the drug administered; side effect is transitively functionally dependent on drug, so Surgery is not in 3NF.
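Whether one attribute functionally determines another can be checked mechanically over sample rows; the Surgery rows below are hypothetical, echoing the drug/side-effect example:

```python
# Check whether attribute x functionally determines attribute y in some rows:
# x -> y holds if no two rows share the same x value but differ on y.
def holds(rows, x, y):
    seen = {}
    for r in rows:
        if r[x] in seen and seen[r[x]] != r[y]:
            return False
        seen[r[x]] = r[y]
    return True

surgery = [
    {"patient": "P1", "drug": "penicillin", "side_effect": "rash"},
    {"patient": "P2", "drug": "penicillin", "side_effect": "rash"},
    {"patient": "P3", "drug": "codeine",    "side_effect": "nausea"},
]
print(holds(surgery, "drug", "side_effect"))  # True: drug -> side_effect
```

Since `drug` (a nonkey attribute) determines `side_effect`, the dependency is transitive through `drug`, and the 3NF fix is to move drug/side_effect into their own table.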

First Normal Form (1NF)

A relation that has a primary key and in which there are no repeating groups; it must contain only atomic values at each row and column (e.g., if the column is Color, a single row cannot hold both blue and green).

Transitive Functional Dependency

A transitive dependency can only occur in a relation of three or more attributes. This dependency helps us normalize the database into 3NF (Third Normal Form). {Book} -> {Author} (if we know the book, we know the author's name); {Author} does not determine {Book}; {Author} -> {Author_age}. https://beginnersbook.com/2015/04/transitive-dependency-in-dbms/

Stages in Database Design

1. Analyze the user environment
2. Develop a conceptual data model
3. Choose a DBMS
4. Develop the logical model
5. Develop the physical model
6. Evaluate the physical model
7. Perform tuning if indicated by the evaluation
8. Implement the physical model

transaction time

Another type of time that is important in some environments is the date and time that changes were made to the database, known as transaction time

Cloud computing: 3 parts

Application, platform, infrastructure.

ACID Relational Database

Atomicity: Operations executed by the database will be atomic / "all or nothing." For example, if there are 2 operations, the database ensures that either both of them happen or none of them happens. Consistency: Anyone accessing the database should see consistent results. Isolation: If there are multiple clients trying to access the database, there will be multiple transactions happening simultaneously. The database needs to be able to isolate these transactions. Durability: When writing a result into the database, we should be guaranteed that it won't go away.
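Atomicity can be seen directly with SQLite: the hypothetical transfer below debits one account, then "crashes" before the matching credit, and rollback undoes the half-finished work:

```python
import sqlite3

# All-or-nothing: both halves of a transfer commit, or neither does.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 0)])
con.commit()

try:
    con.execute("UPDATE account SET balance = balance - 50 WHERE id = 1")
    raise RuntimeError("simulated crash before the matching credit runs")
except RuntimeError:
    con.rollback()   # the lone debit is undone

print(con.execute("SELECT balance FROM account ORDER BY id").fetchall())
# [(100,), (0,)] -- no reader ever sees a half-finished transfer
```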

ACID

Atomicity: operations executed by the database will be atomic ("all or nothing"). Consistency: anyone accessing the database should see consistent results. Isolation: if multiple clients are accessing the database, there will be multiple transactions happening simultaneously; the database needs to isolate these transactions. Durability: when a result is written into the database, we are guaranteed that it won't go away.

Normalization

Based on the observation that relations with certain properties are more effective for inserting, updating, and deleting data than other sets of relations containing the same data. It is a multi-step process beginning with an "unnormalized" relation.

Advantages of a relational database

Can handle lots of complex queries, database transactions, and routine analysis. ACID (Atomicity, Consistency, Isolation, Durability): a set of properties that ensure reliable database transactions.

Disadvantages of relational database

Cannot store complex or very large images, numbers, designs, and multimedia products. Can become very costly with maintenance and new servers.

Big Data Challenges

Capture, curation, storage, search, sharing, transfer, analysis, visualization

Boyce-Codd Normal Form (BCNF)

Boyce-Codd normal form (BCNF): a special form of third normal form (3NF) in which every determinant is a candidate key. A table that is in BCNF must be in 3NF (see also: determinant), and most 3NF relations are also BCNF relations. A 3NF relation is NOT in BCNF if: the candidate keys in the relation are composite keys (they are not single attributes); there is more than one candidate key in the relation; and the keys are not disjoint, that is, some attributes in the keys are common.

Association Rules

A common mining algorithm that analyzes and predicts customer behavior using if/then statements. Example: if a customer buys bread, there is an 80% chance they buy butter. Good for product promotion and product pricing.
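The two standard measures behind such a rule, support and confidence, are simple counts; the market-basket data below is invented for the bread/butter example:

```python
# Rule {bread} -> {butter}: support is how often both appear together,
# confidence is how often butter appears given that bread did.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

n     = len(baskets)
both  = sum(1 for b in baskets if {"bread", "butter"} <= b)
bread = sum(1 for b in baskets if "bread" in b)

support    = both / n        # fraction of all baskets with bread AND butter
confidence = both / bread    # of bread baskets, the fraction also with butter
print(support, confidence)   # 0.6 0.75
```

A mining algorithm such as Apriori keeps only rules whose support and confidence clear user-chosen thresholds.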

Classification

A common mining algorithm. Used when you want an algorithm to answer binary yes-or-no questions (cats or dogs, good or bad, sheep or goats) or to make a multiclass classification (grass, trees, or bushes; cats, dogs, or birds; etc.). You also need the right answers labeled, so the algorithm can learn from them. A learning phase is followed by a classification phase: predicting the value of a categorical variable (e.g., whether or not a loan will be defaulted, yes/no).

Regression

A common mining algorithm. Used when you want an algorithm to yield a numeric value (the dependent variable) based on one or more predictor variables. Example: the price of a house given square footage, number of rooms, etc.
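The simplest instance, one predictor fit by least squares, needs only a few lines; the house-size/price numbers below are made up (and chosen to lie exactly on a line):

```python
# Ordinary least squares for one predictor: slope = cov(x, y) / var(x).
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
          / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx   # slope, intercept

# Hypothetical sizes (100s of sq ft) vs prices ($1000s); here y = 50x + 10.
xs, ys = [1, 2, 3, 4], [60, 110, 160, 210]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 50.0 10.0
```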

UTC Time

Coordinated Universal Time; used because a database can be accessed and used across different time zones.

Benefits of Cloud

- Cost and management: economies of scale, "outsourced" resource management
- Reduced time to deployment: ease of assembly, works "out of the box"
- Scaling: on-demand provisioning, co-locate data and compute
- Reliability: massive, redundant, shared resources
- Sustainability: hardware not owned

DDL

Data Definition Languages (DDL) are used to define the database structure. Any CREATE, DROP and ALTER commands are examples of DDL SQL statements.
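All three DDL statements can be run against an in-memory SQLite database; the `student` table is hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT)")  # define
con.execute("ALTER TABLE student ADD COLUMN gpa REAL")                   # change
cols = [row[1] for row in con.execute("PRAGMA table_info(student)")]
print(cols)                             # ['id', 'name', 'gpa']
con.execute("DROP TABLE student")       # remove the structure entirely
```

Note that DDL manipulates the structure (the schema), while the rows themselves are the business of DML.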

Tasks performed by DBMS

Data dictionary management, data storage management, data transformation and presentation, security management, multi-user access control, backup and recovery management, data integrity management, database access languages and application programming interfaces, and database communication interfaces.

Key Characteristics of a Data Warehouse

Data is structured for simplicity of access and high-speed query performance. End users are time-sensitive and desire speed-of-thought response times. Large amounts of historical data are used. Queries often retrieve large amounts of data, perhaps many thousands of rows. Both predefined and ad hoc queries are common. The data load involves multiple sources and transformations.

Why use distributed database

- Data is too large
- Applications are by nature distributed: a bank with many branches, a chain of retail stores with many locations, a library with many branches
- Get the benefit of distributed and parallel processing: faster response time for queries

Data Formats for Data Mining

Data mining applications should be considered in the original design of the warehouse. Requires summarized data as well as raw data taken from the original data sources, plus knowledge of the domain and of the data mining process. The best data format may be a "flat file" or vector, where all data for each case of observed values appears as a single record. Data values may be either numerical or categorical; some categorical values may be ordinal, while others may be nominal.

Attributes

- Defining properties or qualities of an entity type
- Represented by an oval on the E-R diagram
- Domain: the set of allowable values for the attribute; the attribute maps the entity set to the domain
- May have null values for some entity instances (no mapping to the domain for those instances)
- May be multi-valued: use a double oval on the E-R diagram
- May be composite: use an oval for the composite attribute, with ovals for its components
- May be derived: use a dashed oval

Data Mining Process Model

Developed from CRISP-DM (Cross-Industry Standard Process for Data Mining):
- Business Understanding: identify the problem
- Data Understanding: gain insight, use visualization
- Data Preparation: select, clean, and format data; identify outliers
- Modeling: identify and construct the type of model needed, predictor and target variables, and the training set
- Evaluation: test and validate the model
- Deployment: put the results to use

Why Extend the E-R Model?

E-R is suitable for traditional business applications but not semantically rich enough for advanced applications. Applications where E-R is inadequate: geographical information systems, search engines, data mining, multimedia, CAD/CAM, software development, engineering design, financial services, digital publishing, telecommunications, and others.

Cardinality

Entity relationship modeling; Refers to the maximum number of times an instance in one entity can relate to instances of another entity

Purpose of E.R design

- Facilitates database design
- Expresses the logical properties of the mini-world of interest within the enterprise (the Universe of Discourse)
- A conceptual-level model, not limited to any particular DBMS
- E-R diagrams are used as design tools
- A semantic model: it captures meanings

Why use non-relational

Great at storing large amounts of data with little structure. Companies growing at a rapid pace, like start-ups, use non-relational databases for their scalability and flexibility. Paired with the cloud, they can save a lot of money.

Transactions - Log

In case of disk failure, the backup can be brought up to date using a log of transactions. The recovery log contains records of each transaction showing:
- The start of the transaction
- The write operations of the transaction
- The end of the transaction
If the system fails, the log is examined to see which transactions to redo and/or which to undo. Redo means redoing writes, not re-executing the transaction; undo means rolling back writes. Several different protocols are used.

Super Key

An attribute or set of attributes that uniquely identifies an entity.

Other NoSQL Types

Key/value (Dynamo) Columnar/tabular (HBase) Document (mongoDB)

Data mining languages

R: released in the 1990s as a free substitute for expensive statistical software such as MATLAB or SAS. Python: firm data mining abilities plus more practical abilities to build a product. Also: Java, Julia, Hadoop and Hive, Kafka and Storm.

Advantages of Non-relational databases

Handle large volumes of structured, semi-structured, and unstructured data. Object-oriented programming that is easy to use and flexible. Efficient, scale-out architecture instead of an expensive, monolithic architecture.

Relational Database

Largely transactional; it's been around a long time, and SQL is the language for communication.

Disadvantages of Non-relational databases

Less support, since NoSQL databases are usually open-source. Administration: NoSQL databases require technical skill to install and maintain. Less mature: NoSQL databases are still growing, and many features are still being implemented.

Methods to ensure Serializability

Locking and timestamping. The concurrency control subsystem is "part of the package" and not directly controllable by either the users or the DBA. A scheduler is used to allow operations to be executed immediately, delayed, or rejected:
- If an operation is delayed, it can be done later by the same transaction
- If an operation is rejected, the transaction is aborted, but it may be restarted later

(Min, Max) Notation for Cardinality and Participation

Min: the least number of relationship instances an entity instance must participate in; can be 0 (partial participation) or 1 or more (total participation). Max: the greatest number of relationship instances the entity can participate in; can be 1, many (written M, N, or *), or some constant integer. Written on the line connecting the entity rectangle to the relationship diamond.

What is a non-relational database or NoSQL?

NOT only SQL. Such databases have existed since the 1960s, but the term wasn't used until 1998, by Carlo Strozzi (who led development of the NoSQL database). A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.

Key-value stores

A NoSQL database built on a simple pair of a key and an associated collection of values. The key is usually a string, and the database has no knowledge of the structure or meaning of the values. Uses an associative array (map or dictionary) as its fundamental data model: data is represented as a collection of key-value pairs in which each key appears at most once, and you store a value along with the key used to reference it. Fast lookup (key -> value) and massive scalability; good for simple associative data and big data, bad for complex, highly relational data. Examples: Redis, Amazon DynamoDB.
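The model is essentially a dictionary; the keys and values below are invented to show the idiom of opaque values behind string keys:

```python
# A key-value store reduced to its essence: each key appears at most once,
# and the store knows nothing about the structure of the values.
store = {}
store["user:42"] = '{"name": "Ana", "cart": ["bread", "butter"]}'  # opaque blob
store["page:home:hits"] = 1024

print(store["user:42"])          # O(1) lookup by key
store["page:home:hits"] += 1     # overwrite: the key still appears once
del store["user:42"]
print("user:42" in store)        # False
```

Real stores such as Redis add persistence, replication, and expiry on top of exactly this interface.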

Graph

A NoSQL database that utilizes graph structures to represent and store data. This gives users the ability to traverse quickly among all the connected values and find insights in the relationships. Examples: Neo4j, OrientDB, Titan.

Document Oriented Database

A NoSQL database, also called a document store: a computer program designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data. Examples: MongoDB, Couchbase.

Column Store

A NoSQL database that uses tables, rows, and columns, but unlike a relational database, the names and format of the columns can vary from row to row in the same table. A wide column store can be interpreted as a two-dimensional key-value store. Uses the concept of a keyspace, which contains all the column families that contain the rows and columns used to store and organize data. Examples: Apache HBase, Cassandra.

Entity

- An object that exists and can be distinguished from other objects
- Can be a person, place, event, object, or concept in the real world; can be a physical object or an abstraction
- An entity instance is a particular person, place, etc.; an entity type is a category of entities
- An entity set is a collection of entities of the same type and must be well-defined
- The entity type forms the intension of the entity (the permanent definition part); the entity instances form the extension (all instances that fulfill the definition at the moment)
- In an E-R diagram, a rectangle represents the entity set, not individual entities

Types of Cardinality

One-to-one, one-to-many, many-to-many

OLTP

Online Transaction Processing: records all business transactions as they occur and acts as a monitor; detects process aborts; restarts aborted processes; backs out failed transactions; allows distribution of multiple copies of application servers; and performs dynamic load balancing across DBMSs.

OLAP

Online analytical processing: the manipulation of information to create business intelligence in support of strategic decision making. Used in data warehousing.

Difference Between parallel and distributed

Parallel databases:
- Machines are physically close to each other, e.g., in the same server room
- Machines connect with dedicated high-speed LANs and switches
- Communication cost is assumed to be small
- Can use a shared-memory, shared-disk, or shared-nothing architecture
Distributed databases (need not be parallel systems, but can be):
- Machines can be far from each other, e.g., on different continents
- Can be connected using a general-purpose network, e.g., the Internet
- Communication cost and problems cannot be ignored
- Usually a shared-nothing architecture

Data modeling: 3 Levels of Abstraction

Physical layer: how the data is stored on hardware. Logical layer: how the data is stored in the database (types of records, relationships, etc.). View layer: how applications access the data.

Goals of data mining:

Predict the future behavior of attributes Classify items, placing them in the proper categories Identify the existence of an activity or an event Optimize the use of the organization's resources

Motivations of MongoDB

Problems with SQL:
- Rigid schema
- Not easily scalable (designed for 90's technology or worse)
- Requires unintuitive joins
Perks of MongoDB:
- Easy interface with common languages (Java, JavaScript, PHP, etc.)
- DB tech should run anywhere (VMs, cloud, etc.)
- Keeps essential features of RDBMSs while learning from key-value NoSQL systems

RTAP

Real-time analytics processing (big data architecture and technology). Old model: a few companies generate data, all others consume it. New model: everybody creates data, and everybody consumes it.

Applications of Data Mining

- Retailing: customer relationship management (CRM), advertising campaign management
- Banking and finance: credit scoring, fraud detection and prevention
- Manufacturing: optimizing use of resources, manufacturing process optimization, product design
- Science and medicine: determining effectiveness of treatments, analyzing effects of drugs, finding relationships between patient care and outcomes
- Astronomy, weather prediction, bioinformatics
- Homeland security: identify and track terrorist activities, identify individual terrorists
- Search engines

Role (E.R. M)

Role: the function that an entity plays in a relationship. Naming the role of each entity is optional, but helpful in cases of: a recursive relationship (an entity set relates to itself), or multiple relationships between the same entity sets.

Deadlock

Often, a transaction cannot specify in advance exactly what records it will need to access in either its read set or its write set. Deadlock: two or more transactions wait for locks held by each other. Deadlock detection uses a wait-for graph to identify deadlock: draw a node for each transaction; if transaction S is waiting for a lock held by T, draw an edge from S to T; a cycle in the graph shows deadlock. Deadlock is resolved by choosing a victim, the newest transaction or the one holding the least resources. Avoid always choosing the same transaction as the victim, because that transaction will never complete; this is called starvation.

Steps in a transaction

Simple update of one record:
1. Locate the record to be updated
2. Bring the page into the buffer
3. Write the update to the buffer
4. Write the modified page out to disk
More complicated transactions may involve several updates. The modified buffer page might not be written to disk immediately after the transaction terminates; we must assume there is a delay before the actual disk write is done.

Data Mining vs querying and OLAP

Standard database querying can only tell users what is in the database, reporting facts already stored. With OLAP, an analyst can use the database to test hypotheses about relationships or patterns in the data, but the analyst has to formulate the hypothesis first and then study the data to verify it. Data mining can study the data without formulating a hypothesis first: it uncovers relationships or patterns by induction, exploring existing data and finding important factors that an analyst might never have included in a hypothesis.

Candidate Key

A superkey that is minimal: no proper subset of its attributes is also a superkey.

Keys

- Superkey: attribute or set of attributes that uniquely identifies an entity
- Composite key: key with more than one attribute
- Candidate key: superkey such that no proper subset of its attributes is also a superkey (a minimal superkey, with no unnecessary attributes)
- Primary key: candidate key actually used for identifying entities and accessing records
- Alternate key: candidate key not used as the primary key
- Secondary key: attribute or set of attributes used for accessing records, but not necessarily unique
- Foreign key: term used in the relational model (but not in the E-R model) for an attribute that is the primary key of one table and is used to establish a relationship, usually with another table, where it also appears as an attribute

Steps in Database Design

The design process consists of the following steps:
1. Determine the purpose of your database
2. Find and organize the information required
3. Divide the information into tables
4. Turn information items into columns
5. Specify primary keys
6. Set up the table relationships
7. Refine your design
8. Apply the normalization rules

Relationship Participation Constraints

Total participation: every member of the entity set must participate in the relationship; represented by a double line from the entity rectangle to the relationship diamond. Partial participation: not every entity instance must participate; represented by a single line from the entity rectangle to the relationship diamond.

Deadlocks

Two-phase locking guarantees serializability, but it does not prevent deadlocks. Often, a transaction cannot specify in advance exactly what records it will need to access in either its read set or its write set. Deadlock: two or more transactions wait for locks held by each other. Deadlock detection uses a wait-for graph:
- Draw a node for each transaction
- If transaction S is waiting for a lock held by T, draw an edge from S to T
- A cycle in the graph shows deadlock
Deadlock is resolved by choosing a victim: the newest transaction or the one holding the least resources. Avoid always choosing the same transaction as the victim, because that transaction will never complete (called starvation).
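The wait-for graph check is just cycle detection on a directed graph; the transactions T1-T3 below are hypothetical:

```python
# Deadlock detection: an edge S -> T means S waits for a lock held by T.
# A cycle in this wait-for graph means the transactions are deadlocked.
def has_cycle(graph):
    visiting, done = set(), set()

    def dfs(node):
        if node in visiting:          # back edge: we returned to an ancestor
            return True
        if node in done:
            return False
        visiting.add(node)
        if any(dfs(nxt) for nxt in graph.get(node, [])):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(dfs(n) for n in graph)

waits_for = {"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]}   # T1 -> T2 -> T3 -> T1
print(has_cycle(waits_for))                               # True: deadlock
print(has_cycle({"T1": ["T2"], "T2": []}))                # False: just waiting
```

Once the cycle is found, the DBMS aborts one transaction on it (the victim) to break the deadlock.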

Purpose of Data Mining

Usually the ultimate purpose is to: give a company a competitive advantage, enabling it to earn a greater profit; provide better service; advance scientific knowledge; or make better use of resources.

Big Data 5 Vs

Volume, velocity, variety, veracity (conformity to facts; accuracy), and value.

Discriminator (E.R.M)

A weak entity may have a partial key, called a discriminator, that distinguishes instances of the weak entity that are related to the same strong entity.

Preparing data:

You may need to:
- Remove highly correlated variables
- Bin values (helps reduce high dimensionality and can transform numeric variables to categorical)
- Remove missing values and outliers
- Create new features
- Join datasets
- Reduce the dimensions of the data by selecting interesting subsets
- Remove duplicates
- Normalize data
- Create flag variables
- And more

Primary key

a candidate key selected as the primary means of identifying rows in a relation

E.F. Codd

Defined the relational database. A relational database is a digital database based on the relational model of data: the data is stored in rows (records) and columns, and relationships are established through primary and foreign keys.

SQL injection

A code injection technique that might destroy your database: the placement of malicious code in SQL statements via web page input.
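The attack and its standard defense fit in a few lines of SQLite; the `users` table and the payload string are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
con.execute("INSERT INTO users VALUES ('alice', 0)")

evil = "x' OR '1'='1"   # classic injection payload typed into a web form

# Vulnerable: string concatenation lets the input rewrite the query itself.
rows = con.execute("SELECT * FROM users WHERE name = '" + evil + "'").fetchall()
print(len(rows))   # 1 -- OR '1'='1' matched every row in the table

# Safe: the ? placeholder treats the input strictly as data, never as SQL.
rows = con.execute("SELECT * FROM users WHERE name = ?", (evil,)).fetchall()
print(len(rows))   # 0 -- no user is literally named "x' OR '1'='1"
```

Parameterized queries (and never building SQL by concatenating user input) are the first line of defense.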

Data Provenance

is information about the origin and creation process of data. Such information is useful for debugging data and transformations, auditing, evaluating the quality of and trust in data, modeling authenticity, and implementing access control for derived data.

Concurrency control

The procedure in a DBMS for managing simultaneous operations without their conflicting with each other.

MongoDB History/General

mongoDB = "Humongous DB" Open-source Document-based "High performance, high availability" Automatic scaling C-P on CAP (CAP Theorem is a concept that a distributed database system can only have 2 of the 3: Consistency, Availability and Partition Tolerance)

CRISP DM

The most comprehensive, common, and standardized data mining process: business understanding, data understanding, data preparation, model building, testing and evaluation, and deployment. Cross-Industry Standard Process for Data Mining (CRISP-DM) is an open standard process model that describes common approaches used by data mining experts; it is the most widely used analytics model.

Types of Cloud Computing

public, private, hybrid, and community

Parallel database system

Seeks to improve performance through parallelization of various operations, such as data loading, index building, and query evaluation. Although data may be stored in a distributed fashion in such a system, the distribution is governed solely by performance considerations. Machines are often in close proximity to each other.

DML

Stands for Data Manipulation Language. The SQL statements in the DML class are INSERT, UPDATE, and DELETE.
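The DML trio can be exercised against an in-memory SQLite database; the `item` table is hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, price REAL)")  # DDL sets up

con.execute("INSERT INTO item VALUES (1, 9.99)")           # add a row
con.execute("UPDATE item SET price = 7.99 WHERE id = 1")   # change it
print(con.execute("SELECT price FROM item WHERE id = 1").fetchone())  # (7.99,)
con.execute("DELETE FROM item WHERE id = 1")               # remove it
print(con.execute("SELECT COUNT(*) FROM item").fetchone())            # (0,)
```

In contrast to DDL, none of these statements alter the table's structure, only its contents.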

Relational Database Engine: Storage manager

The interface between the database and the operating system. It is responsible for authorization, interaction with the OS file system (accessing storage and organizing files), and efficient data storage/modification (indexing, hashing, buffer management). One very important piece of the storage manager is the transaction manager: it ensures the database stays consistent and atomic even if a failure occurs, and it performs concurrency control to make sure multiple operations result in a consistent database.

Data Mining

the process of analyzing data to extract information not offered by the raw data alone Important process in BI Discovering new information from very large data sets Knowledge discovered is usually in the form of patterns or rules Uses techniques from statistics and artificial intelligence Need a large database or a data warehouse

Classifications of Cloud Computing:

- Infrastructure as a Service (IaaS): offering hardware-related services using the principles of cloud computing; these could include storage services (database or disk storage) or virtual servers.
- Platform as a Service (PaaS): offering a development platform on the cloud.
- Software as a Service (SaaS): a complete software offering on the cloud; users access a software application hosted by the cloud vendor on a pay-per-use basis. This is a well-established sector.

Serial Schedules:

Serial execution: execute one transaction at a time, with no interleaving of operations (e.g., A, then B). There can be more than one possible serial execution for two or more transactions (e.g., A,B or B,A); for n transactions, there are n! possible serial executions. They may not all produce the same results, but the DB considers all serial executions to be correct. A schedule shows the timing and order of the operations of one or more transactions. A schedule is serializable if it produces the same results as if the transactions were performed serially in some order. The objective is to find serializable schedules that maximize concurrency while maintaining correctness.

Locks

- A transaction can ask the DBMS to place locks on data items; a lock prevents another transaction from modifying the object
- Transactions may be made to wait until locks are released before their lock requests can be granted
- Objects of various sizes (DB, table, page, record, data item) can be locked; the size determines the fineness, or granularity, of the lock
- A lock is implemented by inserting a flag in the object or by keeping a list of locked parts of the database
- Locks can be exclusive or shared by transactions: shared locks are sufficient for read-only access, while exclusive locks are necessary for write access

