DBMS


Many database applications require data from a variety of preexisting databases located in a heterogeneous collection of hardware and software platforms. Data models may differ (hierarchical, relational, etc.). Transaction commit protocols may be incompatible. Concurrency control may be based on different techniques (locking, timestamping, etc.). System-level details are almost certainly incompatible. A multidatabase system is a software layer on top of existing database systems, designed to manipulate information in heterogeneous databases. It creates an illusion of logical database integration without any physical database integration.

Homogeneous vs heterogeneous

Map: extract the desired information. Take a set of (key, value) pairs and generate a set of intermediate (key, value) pairs by applying some function f to all of these pairs. Reduce: merge all pairs with the same key by applying a reduction function R to the values. f and R are user-defined functions.

How MapReduce works
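The Map and Reduce steps can be sketched in a few lines of single-machine Python. This is only an illustration of the data flow, not a distributed implementation; the names `map_reduce`, `word_count_map`, and `word_count_reduce` and the sample documents are assumptions made up for the example.

```python
from collections import defaultdict

def map_reduce(pairs, f, R):
    """Apply f to every input (key, value) pair, group the intermediate
    pairs by key, then reduce each group's values with R."""
    groups = defaultdict(list)
    for key, value in pairs:
        for ikey, ivalue in f(key, value):
            groups[ikey].append(ivalue)
    return {ikey: R(ikey, values) for ikey, values in groups.items()}

# User-defined f: emit (word, 1) for every word in a document.
def word_count_map(doc_id, text):
    return [(word.lower(), 1) for word in text.split()]

# User-defined R: add up the counts collected for one word.
def word_count_reduce(word, counts):
    return sum(counts)

docs = [("d1", "the quick fox"), ("d2", "the lazy dog the end")]
counts = map_reduce(docs, word_count_map, word_count_reduce)
print(counts)
```

In a real framework the grouping step (the "shuffle") and the distribution of work across machines are handled by the library; only f and R are written by the user.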

• Example systems: MongoDB, SimpleDB, CouchDB • Document: encapsulates and encodes data (or information) in some standard formats or encodings (XML, JSON, BSON, etc.) • A document is addressed by its key: - key-document • DBMS offers APIs/query language to retrieve document contents • Different implementations have different ways to organize documents (e.g. collection, tag, etc.) • Documents within a collection can have different fields (unlike records in a relational table)

How MongoDB works (modern knowledge)

Hyperlink components: destination page and anchor text. Hub: a Web page or Website that links to a collection of prominent sites (authorities) on a common topic.

How web search is used. Different methods.

Average precision: useful for computing a single precision value to compare different retrieval algorithms. Recall/precision curve: usually has a negative slope, indicating an inverse relationship between precision and recall. F-score (or F1-score): a single measure that combines precision and recall to compare different result sets; F = 2pr/(p+r).

Measuring IR query results

• Network DBMS - Using the network data model: a graph structure - Data are represented by collections of records - Relationships between data are represented by links - Records are organized as arbitrary graphs.

Network model, hybrid, relational, OO, object-relational, XML, NoSQL (evolution)

key-value (Oracle NoSQL), document (MongoDB), column-based (HBase), graph-based (Neo4j)

Types of NoSQL data models. Example for each model

• Volume: same as before • Velocity: same as "speed" • Variety: same as "diversity" • Veracity: data in doubt - you do not know exactly what you have

Understand the principle (page rank)

Information retrieval: the process of retrieving documents from a collection in response to a query by a user

What is i.r. and web search system.

• Volume: same as before • Velocity: same as "speed" • Variety: same as "diversity" • Veracity: data in doubt - you do not know exactly what you have

characteristics of big data (4 v's)

A serializable schedule of n concurrent transactions: equivalent to some serial schedule of the same n transactions

concept of serializability
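One common way to test a schedule is conflict serializability, which is a sufficient condition for the definition above: build a precedence graph and check that it has no cycle. The sketch below assumes a schedule is given as a list of (transaction, operation, item) triples; that encoding and the function name are illustrative choices, not part of the card.

```python
def is_conflict_serializable(schedule):
    """schedule: list of (txn, op, item) with op in {"r", "w"}.
    Returns True if the precedence graph of the schedule is acyclic."""
    # Add edge Ti -> Tj when an operation of Ti conflicts with a LATER
    # operation of Tj (same item, different transactions, at least one write).
    edges = set()
    for i, (ti, opi, xi) in enumerate(schedule):
        for tj, opj, xj in schedule[i + 1:]:
            if ti != tj and xi == xj and "w" in (opi, opj):
                edges.add((ti, tj))

    # Repeatedly remove transactions with no incoming edge from the
    # remaining ones; if none can be removed, the graph has a cycle.
    nodes = {t for t, _, _ in schedule}
    while nodes:
        free = [n for n in nodes
                if not any(b == n and a in nodes for a, b in edges)]
        if not free:
            return False
        nodes -= set(free)
    return True

serial_like = [("T1", "r", "A"), ("T1", "w", "A"), ("T2", "r", "A"), ("T2", "w", "A")]
interleaved = [("T1", "r", "A"), ("T2", "w", "A"), ("T1", "w", "A")]
print(is_conflict_serializable(serial_like))   # equivalent to serial order T1, T2
print(is_conflict_serializable(interleaved))   # T1 -> T2 and T2 -> T1 form a cycle
```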

Assumes fail-stop model - failed sites simply stop working, and do not cause any other harm, such as sending incorrect messages to other sites. Execution of the protocol is initiated by the coordinator after the last step of the transaction has been reached. The protocol involves all the local sites at which the transaction executed. Let T be a transaction initiated at site Si, and let the transaction coordinator at Si be Ci.

2-phase commit (2PC), commit protocol
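The card only gives the setup, so the following is a rough sketch of the usual textbook flow under the fail-stop assumption: the coordinator asks every site to prepare, collects votes, and commits only if all sites are ready. Failures, timeouts, and stable-storage logging are deliberately left out, and the site names and message strings are hypothetical.

```python
def two_phase_commit(participants):
    """participants: dict mapping site name -> callable that returns True
    if the site is ready to commit T, False otherwise."""
    # Phase 1: coordinator asks every site to prepare and collects votes.
    log = ["C: <prepare T>"]
    votes = {}
    for site, can_commit in participants.items():
        votes[site] = bool(can_commit())
        log.append(f"{site}: <{'ready' if votes[site] else 'no'} T>")

    # Phase 2: commit only if every site voted ready, otherwise abort.
    decision = "commit" if all(votes.values()) else "abort"
    log.append(f"C: <{decision} T>")
    for site in participants:
        log.append(f"{site}: {decision} T locally")
    return decision, log

all_ready = {"S1": lambda: True, "S2": lambda: True}
one_refuses = {"S1": lambda: True, "S2": lambda: False}
print(two_phase_commit(all_ready)[0])
print(two_phase_commit(one_refuses)[0])
```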

Ensures serializability - Each trans issues lock and unlock requests in 2 phases • Growing phase: A trans may obtain locks but may not release any lock • Shrinking phase: A trans may release locks but may not obtain any new locks This technique: does not ensure freedom from deadlock

2 phase locking
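A small checker makes the two phases concrete: once a transaction has released any lock, it must not acquire another. The operation encoding below, ("lock" or "unlock", item) for a single transaction, is an assumption made for the example.

```python
def obeys_two_phase_locking(ops):
    """ops: sequence of ("lock", item) / ("unlock", item) issued by ONE
    transaction. Returns True if no lock is obtained after the first unlock."""
    shrinking = False
    for action, _item in ops:
        if action == "unlock":
            shrinking = True          # the growing phase is over
        elif action == "lock" and shrinking:
            return False              # new lock during the shrinking phase
    return True

two_phase = [("lock", "A"), ("lock", "B"), ("unlock", "A"), ("unlock", "B")]
not_two_phase = [("lock", "A"), ("unlock", "A"), ("lock", "B"), ("unlock", "B")]
print(obeys_two_phase_locking(two_phase))
print(obeys_two_phase_locking(not_two_phase))
```

The check says nothing about deadlock: two transactions that each follow the protocol can still wait on each other's locks.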

Advantages: MapReduce is a useful abstraction. Cluster issues (failures, network problems, slow machines) are handled by the library, so you can focus on the problem. Greatly simplifies large-scale computations. Disadvantages: common operations (join, filter, projection, aggregates, sorting, distinct) must be coded by hand. Low-level work is done by hand. Extremely rigid data flow. Code is hard to understand, maintain, extend, and optimize.

Advantages and disadvantages of MapReduce

• Advantages: - Strong mathematical foundation - Supports SQL, a simple query language • Disadvantages: - Atomic attributes => cannot store complex values - Fixed predefined data types => limited - Semantic overloading => cannot distinguish relationship representation from data representation - Normalization => creation of relations that may not represent real-world entities => may need join operations to get back to the real-world entities => inefficient - Fixed operations - Impedance mismatch • Query in SQL, but applications are in other languages (C, Java, etc.) • Need conversion to map different data types from SQL and app languages => inefficient - Other problems • Short transactions • Schema change is difficult

Advantages & Disadvantages of Relational DBMS

• Complex (later) database applications - CAD engineering drawings • Components • Versions • Relationships between components and versions - Software engineering • Source code • Module specifications • Relationships between modules • Definitions and usage of variables/parameters • Development history - Multimedia data • Text, audio, video, image - Hypertext data • Web database • Want to retrieve documents via links and structures - Social media data • Graph data • Want to retrieve data via links and weights on links

Applications

Keyword search query

Difference between i.r. query and db query

A query is a single SQL statement that does Select, Update, Insert or Delete of rows. A transaction is a consecutive sequence of SQL statements (from the application viewpoint) that have the "ACID" properties

Difference between transaction and query

• Simple interface - get(key) - put(key, value) // two flavors: insert and/or replace • Principles - consistent hashing, allows to extend / shrink "ring" - "finger tables": support logarithmic access times - chain replication: support fault tolerance • Examples - late 90s (research): Chord, Pastry, ... - mid 2000s (applied research, product): Cassandra, Dynamo, ... • Virtualizes storage layer: fault tolerant, elastic, fast - decouples machine from "storage service"

Distributed hashing, how they use this in no sql.
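To make the "ring" idea concrete, here is a minimal consistent-hashing sketch. It covers only key placement; finger tables and chain replication are omitted, and the class name `Ring`, the use of MD5 as the hash, and the node names are assumptions for illustration.

```python
import bisect
import hashlib

def _h(key):
    """Map a string to a position on the ring (a large integer)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes=()):
        self._points = []                     # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        bisect.insort(self._points, (_h(node), node))

    def remove(self, node):
        self._points.remove((_h(node), node))

    def lookup(self, key):
        """Return the first node clockwise from the key's position."""
        i = bisect.bisect_right(self._points, (_h(key), ""))
        return self._points[i % len(self._points)][1]

ring = Ring(["n1", "n2", "n3"])
keys = [f"k{i}" for i in range(100)]
before = {k: ring.lookup(k) for k in keys}
ring.add("n4")                                # extend the ring by one node
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys moved, all of them to n4")
```

The point of the design is visible in the last lines: adding a node only takes over keys from its neighbor on the ring, so the other nodes keep everything they already stored.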

Object-relational DBMS - Using the relational data model with added complex attributes and other OO features to accommodate complex objects and operations

ORDB

ER: - Set of entities - Set of relationships between entities - Entity: has attributes • OODB: - Classes: • Attributes: simple/complex/relationship • Operations/methods

Object relation model. (convert ER to object relational)

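• PR(A) = (1-d)/N + d * (PR(T1)/C(T1)+...+PR(Tn)/C(Tn)), where T1...Tn are the pages that link to A, C(Ti) is the number of outgoing links on page Ti, N is the total number of pages, and d is the damping factor (commonly set to 0.85)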

Page rank
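The formula can be evaluated by simple iteration: start every page at 1/N and reapply the formula until the values settle. The sketch below assumes every page has at least one outgoing link and uses a made-up three-page graph; the default d = 0.85 and the fixed iteration count are illustrative choices.

```python
def pagerank(links, d=0.85, iterations=50):
    """links: dict page -> list of pages it links to.
    Assumes every page has at least one outgoing link."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # uniform starting values
    for _ in range(iterations):
        new = {}
        for a in pages:
            # Sum PR(T)/C(T) over every page T that links to A.
            incoming = sum(pr[t] / len(links[t]) for t in pages if a in links[t])
            new[a] = (1 - d) / n + d * incoming
        pr = new
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # hypothetical link graph
ranks = pagerank(graph)
print(ranks)
```

In this graph C is linked from both A and B while B is linked only from A, so C ends up with the higher rank.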

- Easier to keep track of variables - Easier to understand where you are in the process of analyzing data - High-level primitives (group), traditional database optimizations.

Pig Latin (why we use it instead of a programming language)

Recall (r): number of relevant documents retrieved by a search / total number of existing relevant documents. Precision (p): number of relevant documents retrieved by a search / total number of documents retrieved by that search.

Precision, recall. Etc
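As a worked example, the function below computes p and r from a result list and a set of relevant documents, and also the F-score F = 2pr/(p+r). The document ids are invented for illustration.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-score) for one query result."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant documents retrieved
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# 4 documents retrieved, 5 relevant documents exist, 2 appear in both.
p, r, f = precision_recall_f(["d1", "d2", "d3", "d4"],
                             ["d1", "d3", "d5", "d6", "d7"])
print(p, r, round(f, 3))
```

Here p = 2/4 and r = 2/5, so F = 2(0.5)(0.4)/(0.9), roughly 0.444.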

• Reduce Cost - utilization of hardware and software (ideal is ~100%) - pay-as-you-go & efficient (no overheads) - no vendor lock-in • Reduce Time to Market - focus on business problem (not IT) - no configuration, no provisioning, automatic security etc. - development framework (for enterprise Web apps) • Operating & Support (-> cost + time-to-market) - SLAs: availability, guaranteed response times, ... - security - elasticity: scale-out and down with workload - multi-tenancy (support for SaaS)

advantage of having cloud computing

Given support threshold s = 30% and confidence threshold a = 50%: Support of {Bread, PeanutButter} is 60% > s, so it is a large itemset. Let l = {Bread, PeanutButter}, x = {Bread}, l-x = {PeanutButter}. Support(l)/Support(x) = Support(Bread, PeanutButter)/Support(Bread) = 60/80 > a, so x => (l-x) is a rule satisfying s and a, i.e., we have one correct rule: Bread => PeanutButter. Question: is PeanutButter => Bread a correct rule?

association rule mining, how it works
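The support and confidence calculation can be reproduced in code. The card does not list the transactions, so the five baskets below are an assumed data set chosen only so that Support(Bread) = 80% and Support(Bread, PeanutButter) = 60%, matching the numbers above; the function names are likewise illustrative.

```python
# Hypothetical baskets, picked to match the 80% / 60% figures on the card.
transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions)

def support(itemset):
    return count(itemset) / len(transactions)

def confidence(x, l):
    """Confidence of the rule x => (l - x), i.e. Support(l) / Support(x)."""
    return count(l) / count(x)

s, a = 0.30, 0.50                               # thresholds from the card
l = {"Bread", "PeanutButter"}
print(support(l) > s)                           # l is a large itemset
print(confidence({"Bread"}, l))                 # Bread => PeanutButter
print(confidence({"PeanutButter"}, l))          # PeanutButter => Bread
```

With this assumed data the reverse rule has confidence 3/3, so it would also pass both thresholds; a different data set with the same two support values could give a different answer.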

For centralized systems, the primary criterion for measuring the cost of a particular strategy is the number of disk accesses. In a distributed system, other issues must be taken into account: The cost of a data transmission over the network. The potential gain in performance from having several sites process parts of the query in parallel. Semijoin

factors to consider to process query in distributed system

Object-oriented DBMS - Using an object-oriented (OO) data model: set of classes - Class: • Attributes: simple/complex/relationships • Methods/operations • Encapsulation - Class hierarchy: inheritance • Superclass • Subclass

definition of OODB,

-A distributed database: a logically interrelated collection of shared data physically distributed over a computer network -Distributed DBMS: the software system that permits the management of the distributed database and makes the distribution transparent to users.

difference between distributed databases and centralized DB

Hierarchical, Network, Relational DBMS, Object-oriented, Object-relational

evolution of dbms model

Horizontal: allows parallel processing on fragments of a relation; allows a relation to be split so that tuples are located where they are most frequently accessed. Vertical: allows tuples to be split so that each part of the tuple is stored where it is most frequently accessed; the tuple-id attribute allows efficient joining of vertical fragments; allows parallel processing on a relation. Vertical and horizontal fragmentation can be mixed. Fragments may be successively fragmented to an arbitrary depth.

fragmentation
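Both kinds of fragmentation can be shown on a tiny in-memory relation. The `account` tuples, attribute names, and helper functions below are made up for the example; the only idea taken from the card is that vertical fragments keep a tuple-id so they can be joined back together.

```python
# Hypothetical relation: one dict per tuple, with an explicit tuple id.
account = [
    {"tuple_id": 1, "branch": "Hillside", "number": "A-305", "balance": 500},
    {"tuple_id": 2, "branch": "Valleyview", "number": "A-177", "balance": 205},
    {"tuple_id": 3, "branch": "Hillside", "number": "A-226", "balance": 336},
]

def horizontal(relation, predicate):
    """Horizontal fragment: the subset of tuples satisfying a predicate."""
    return [t for t in relation if predicate(t)]

def vertical(relation, attrs):
    """Vertical fragment: a subset of attributes, always keeping tuple_id."""
    return [{a: t[a] for a in ("tuple_id", *attrs)} for t in relation]

def rejoin(left, right):
    """Reconstruct tuples from two vertical fragments by joining on tuple_id."""
    by_id = {t["tuple_id"]: t for t in right}
    return [{**t, **by_id[t["tuple_id"]]} for t in left]

hillside = horizontal(account, lambda t: t["branch"] == "Hillside")
valleyview = horizontal(account, lambda t: t["branch"] == "Valleyview")
v1 = vertical(account, ["branch", "number"])
v2 = vertical(account, ["balance"])
print(rejoin(v1, v2) == account)
```

The original relation is recovered by a union of the horizontal fragments or a join of the vertical ones.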

With pseudocode: using the pseudocode, convert an ER diagram to an object-oriented model.

given diagram, convert ER to object database

picture

how they connect to each other

picture

key components of dbms

• Each machine can store several partitions • Partitions should have even load -> machines have even load • Overload in one partition -> split that partition • Split does not affect other partitions • Clients can send requests to any node in ring • Nodes serve as routers within ring (logarithmic time and space) • Failure of one/two machines does not result in loss of data • Many protocols to keep finger table, replication, etc. up to date

Key-value stores: fault tolerance and fast retrieval with hashing, how it works

- Allow a trans to access a data item only if it is currently holding a lock on that item. • Examples: 2-phase locking, graph-based locking. Locks: 2 types - Shared Lock: S • If trans T holds an S lock on data item Q => T can read Q but cannot write Q - Exclusive Lock: X • If trans T holds an X lock on data item Q => T can read Q and can write Q

lock based techniques
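A toy lock table shows how S and X locks interact. The compatibility rule used here (S is compatible only with S; X is compatible with nothing) is the usual textbook matrix rather than something stated on the card, and the class and method names are hypothetical. A refused request simply returns False instead of making the transaction wait.

```python
# Usual compatibility matrix: (held mode, requested mode) -> compatible?
COMPATIBLE = {
    ("S", "S"): True,  ("S", "X"): False,
    ("X", "S"): False, ("X", "X"): False,
}

class LockTable:
    def __init__(self):
        self.held = {}                         # item -> list of (txn, mode)

    def request(self, txn, item, mode):
        """Grant the lock if it is compatible with locks held by others."""
        holders = self.held.setdefault(item, [])
        if all(COMPATIBLE[(m, mode)] for t, m in holders if t != txn):
            holders.append((txn, mode))
            return True
        return False                           # caller would have to wait

    def release(self, txn, item):
        self.held[item] = [(t, m) for t, m in self.held.get(item, []) if t != txn]

table = LockTable()
print(table.request("T1", "Q", "S"))           # first reader
print(table.request("T2", "Q", "S"))           # second reader shares the lock
print(table.request("T3", "Q", "X"))           # writer is refused while readers hold S
```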

Deferred - Record DB updates in a log file - Do not update DB until trans commits - When trans commits, use the log info to update DB - Log info is ignored if a failure occurs before the trans commits => NO UNDO operation is needed Immediate - DB may be updated by some operations of a trans before the trans commits. These updates are called UNCOMMITTED DATA or DIRTY DATA - Record both old values and new values of updated items in the log: when executing write(X, x1), create a log record for X:

major techniques for recovery
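For the deferred technique, recovery after a crash amounts to replaying the logged writes of committed transactions and ignoring everything else, so no undo is needed. The log record shapes below, ("start", T), ("write", T, item, new_value), and ("commit", T), are an assumed encoding for the sketch.

```python
def recover_deferred(db, log):
    """Redo logged writes of committed transactions; ignore the rest.
    db: dict item -> value as found on disk after the crash."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, _txn, item, new_value = rec
            db[item] = new_value               # redo in log order
    return db

db = {"A": 100, "B": 200}
log = [
    ("start", "T1"), ("write", "T1", "A", 50), ("commit", "T1"),
    ("start", "T2"), ("write", "T2", "B", 999),   # crash before T2 commits
]
print(recover_deferred(db, log))
```

T1's write to A is applied because T1 committed; T2's write to B never reaches the database because its log records are ignored.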

• Originally motivated by Web 2.0 applications • Goal is to scale simple OLTP (OnLine Transaction Processing)-style workloads to thousands or millions of users • Users are doing both updates and reads

motivation

Validation-Based Techniques - No conflict checking is done during trans execution - Each trans goes through 2 or 3 phases: • 1-READ PHASE: During trans execution, a trans can read values to variables local to trans and perform update on these local variables - At the end of trans execution: • 2-VALIDATION PHASE: Check if any of trans updates violates serializability. If a trans fails the validation test, the system aborts the transaction. • 3-WRITE PHASE: If no violations, - Commit trans and apply updates to the database - Otherwise, discard updates, abort trans

optimistic techniques

Ship copies of all three relations to site S1 and choose a strategy for processing the entire query locally at site S1. Ship a copy of the account relation to site S2 and compute temp1 = account ⋈ depositor at S2. Ship temp1 from S2 to S3, and compute temp2 = temp1 ⋈ branch at S3. Ship the result temp2 to S1. Devise similar strategies, exchanging the roles of S1, S2, S3.

how to do distributed query processing

Advantages of Replication: Availability: failure of a site containing relation r does not result in unavailability of r if replicas exist. Parallelism: queries on r may be processed by several nodes in parallel. Reduced data transfer: relation r is available locally at each site containing a replica of r. Disadvantages of Replication: Increased cost of updates: each replica of relation r must be updated. Increased complexity of concurrency control: concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented. One solution: choose one copy as the primary copy and apply concurrency control operations on the primary copy.

replication

Timestamp-Ordering Technique: - Orders trans based on their timestamps - A serializable schedule: equivalent to a serial schedule that corresponds to the order of trans timestamps - Ensures that any conflicting READ and WRITE operations are executed in the timestamp order as follows:

time stamp techniques

• Six key features: 1. Scale horizontally "simple operations" - key lookups, reads and writes of one record or a small number of records, simple selections 2. Replicate/distribute data over many servers 3. Less powerful query language 4. Weaker concurrency model than ACID 5. Efficient use of distributed indexes and RAM 6. Flexible schema

understand the principle

Idea: Decouple software from hardware - provide a sandbox with OS and any software stack - move sandbox (=VM) if hardware is overloaded • Advantage - better utilization of hardware • History - idea pioneered in the 60s for IBM mainframes - also applicable to desktops • Hypervisor - dispatch VM calls to CPUs - monitor usage of VMs - move VMs if overloaded

virtual machine

• Principles - share resources, dynamic provisioning, migration • Apply these principles at different levels - software service: map URL to virtual machine - machines: map virtual machine to physical machine - storage: map key to block on physical machine (KVS) • Advantages of Virtualization - increase utilization - improve fault tolerance - improve manageability

virtualization

Vertical search engines: topic-specific search engines. Metasearch engines: query different search engines simultaneously. Digital libraries: collections of electronic resources and services.

web search.

• Use of computing resources as a service - resources = software, platform, infrastructure • e.g., word processor, database system, CPU, disk - service: automate deployment of resource • e.g., start and end time, availability, SLAs (Service Level Agreements), etc. • Resources can be remote or local - you care about the what and when: • what kind of resource is used at what point in time - you do not care about the how and where • unless you have legal / compliance issues

what is cloud computing

recover the database contents to a state that ensures database consistency, transaction atomicity, and durability

what is recovery

• Want a data management system that is - Elastic and highly scalable - Flexible (different records have different schemas) • To achieve the above goals, willing to give up - Complex queries: e.g., give up on joins - Multi-object transactions - ACID guarantees: e.g., eventual consistency is OK - Not all NoSQL systems give up all these properties

what is the meaning of nosql

a logical unit of work, sequence of several db operations that perform a single logical function in a DB application

what is transaction

terminologies

what is web db,terminologies

naive users, application programmers, sophisticated users, database administrators

what type users

• Scaling a relational DBMS is hard • Much more difficult to scale transactions • Because we need to ensure ACID properties - Hard to do beyond a single machine • Current DBMS technology does not provide adequate tools to scale-out a database from a small number of machines to a large number of machines.

when we need no sql.

to prevent transactions from destroying database consistency

why we need concurrency control

