Chapter 12: Distributed Database Management Systems
transaction transparency
- allows a transaction to update data at more than one network site - either entirely completed or entirely aborted
distributed global schema
- database description - common database schema used by local TPs to translate user requests into subqueries (remote requests) that will be processed by different DPs
remote request
lets a single SQL statement access the data that are to be processed by a single remote database processor
database fragments
parts of a distributed database system
BASE
- a data consistency model in which data changes are not immediate but propagate slowly through the system until all replicas are eventually consistent - trade-off between consistency and availability - basically available, soft state, eventually consistent
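The eventual-consistency behavior described above can be sketched in a few lines of Python. All class and method names here are illustrative assumptions, not a real store's API:

```python
# Sketch of eventual consistency (the "E" in BASE): writes land on one
# replica first and propagate asynchronously; replicas may briefly
# disagree ("soft state") but converge once propagation completes.

class Replica:
    def __init__(self):
        self.data = {}

class EventuallyConsistentStore:
    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []                  # updates not yet propagated

    def write(self, key, value):
        self.replicas[0].data[key] = value  # primary accepts the write
        self.pending.append((key, value))   # propagation is deferred

    def propagate(self):
        for key, value in self.pending:
            for r in self.replicas[1:]:
                r.data[key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("x", 1)
# Immediately after the write, replicas disagree (stale reads possible):
assert store.replicas[0].data.get("x") == 1
assert store.replicas[1].data.get("x") is None
store.propagate()
# After propagation, all replicas agree: eventually consistent.
assert all(r.data["x"] == 1 for r in store.replicas)
```

The window between `write` and `propagate` is the "soft state" the definition refers to: the system trades momentary consistency for availability.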
distributed processing
- a database's logical processing is shared among two or more physically independent sites that are connected through a network - does not require a distributed database (can be a single-site DB)
multiple-site processing, multiple-site data
- a fully distributed DBMS with support for multiple data processors and transaction processors at multiple sites - MPMD - classified as either homogeneous or heterogeneous
types of distributed query costs
- access time (I/O) costs from multiple remote sites - communication costs with data transmission among nodes - CPU time costs from processing overhead of managing distributed transactions
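A toy cost model makes the three components concrete. The unit costs, weights, and parameter names below are made-up assumptions, not real optimizer figures:

```python
# Toy cost model for a distributed query plan:
# total cost = remote I/O (access) + network communication + CPU overhead.

def query_cost(pages_read, bytes_shipped, msgs_exchanged,
               io_cost_per_page=1.0, net_cost_per_kb=0.5,
               cpu_cost_per_msg=0.1):
    """Estimated distributed-query cost as the sum of access (I/O),
    communication, and CPU (coordination) components."""
    access = pages_read * io_cost_per_page
    communication = (bytes_shipped / 1024) * net_cost_per_kb
    cpu = msgs_exchanged * cpu_cost_per_msg
    return access + communication + cpu

# Shipping fewer bytes (e.g., filtering at the remote DP before sending
# results back) lowers the communication term, often the dominant one:
assert query_cost(100, 1024 * 1024, 10) > query_cost(100, 10 * 1024, 10)
```

This is why distributed optimizers push selections and projections to the remote site: reducing `bytes_shipped` usually matters more than the local I/O saved.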
single-site processing, single-site data
- all processing is done on a single host computer and all data are stored on the host computer's local disk system - can have multiple end user dumb terminals - SPSD
distribution transparency
- allows a distributed database to be managed as a single logical database - 3 levels: fragmentation, location, & local mapping transparency
performance transparency
- allows the system to perform as if it were a centralized DBMS - no system degradation due to use of network - ensures most cost-effective path to remote data - able to "scale out" transparently
data fragmentation
- allows you to break a single object into two or more segments, or fragments - each fragment can be stored at any site on the network
functions of a DDBMS
- application interfaces that interact w/users, app prgms, and other DBMSs w/in distributed DB - validation of syntax for data requests - transformation of complex requests into atomic data request components - query optimization over distributed database fragments - mapping to find data locations - I/O interface to/from permanent local storage - formatting data for user, app prgm - security at both local & remote DBs - backup & recovery - DB administration features - concurrency control to manage simultaneous data access - transaction management to move data from one consistent state to another
distributed transaction
- can reference several different local or remote DP sites - each single request only references one local DP site - transaction as a whole can reference multiple DP sites
Hadoop node types
- client node: makes requests to the file system (reads/writes) - name node: contains the metadata for the file system - data node: stores the actual data files
disadvantages of DDBMS
- complexity of management and control - technological difficulty - security - lack of standards - increased storage & infrastructure requirements - increased training costs - costs
remote transaction
- composed of several requests - accesses data at a single remote site
components of DDBMS
- computer workstations or remote devices (sites or nodes) - network hardware & software components in each workstation or device - communication media that carries data from one node to another - transaction processor - data processor
distributed database dictionary
- contains the description of the entire database as seen by the database administrator - DDD - aka distribute data catalog (DDC)
factors for DDBMS to resolve data requests
- data distribution (which fragment to access) - data replication (all copies kept consistent, providing replica transparency) - network & node availability (despite network latency or partitioning)
advantages of DDBMS
- data located near site of greatest demand - faster data access (using subset of data) - faster data processing (spread out work) - growth facilitation (add new sites easily) - improved communications - reduced operating costs - user-friendly interface - less danger of single-point failure - processor independence
Hadoop Distributed File System
- distributes data based on key assumptions of * high volume * write-once, read-many * streaming access * move computations to the data * fault tolerant - de facto standard for Big Data storage and processing - HDFS
DDBMS transparency features
- distribution transparency - transaction transparency - failure transparency
failure transparency
- ensures that the system will continue to operate in the event of a node or network failure - lost functions picked up by other nodes in network
write-ahead protocol
- forces the log entry to be written to permanent storage before the actual operation takes place - ensures that DO, UNDO, and REDO operations can survive a system crash that occurs while they are being executed
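A minimal Python sketch of the write-ahead rule, assuming a toy key-value store and an in-memory list standing in for the permanent transaction log (all names are illustrative):

```python
# Write-ahead protocol sketch: every change is logged with before/after
# images BEFORE it is applied, so the log alone suffices to undo
# incomplete work (or redo committed work) after a crash.

log = []                    # stand-in for the permanent transaction log
db = {"balance": 100}       # stand-in for the database

def write(key, new_value):
    # Write-ahead rule: the log record reaches stable storage first...
    log.append({"key": key, "before": db.get(key), "after": new_value})
    # ...and only then is the actual operation performed.
    db[key] = new_value

def undo_all():
    # Roll back by replaying before-images in reverse order.
    for entry in reversed(log):
        db[entry["key"]] = entry["before"]

write("balance", 150)
assert db["balance"] == 150 and log[0]["before"] == 100
undo_all()                  # e.g., the transaction failed to commit
assert db["balance"] == 100
```

Because the log entry exists before the data changes, a crash between the two steps leaves enough information to restore a consistent state.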
distributed database management system
- governs the storage and processing of logically related data over interconnected computer systems in which both data and processing are distributed among several sites - DDBMS
two-phase commit protocol
- guarantees that if a portion of a transaction operation cannot be committed, all changes made at the other sites participating in the transaction will be undone to maintain a consistent database state - each DP maintains its own transaction log - 2PC
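The protocol's two phases can be sketched as follows. The `Participant` class and its vote logic are simulated assumptions, not a real DBMS interface:

```python
# Two-phase commit (2PC) sketch: the coordinator first asks every
# subordinate DP to PREPARE (phase 1, voting); only if all vote "ready"
# does it send COMMIT, otherwise it sends ABORT so all sites roll back.

class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):          # phase 1: vote ready / not ready
        return self.can_commit

    def commit(self):           # phase 2: make changes permanent
        self.state = "committed"

    def abort(self):            # phase 2: undo changes
        self.state = "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]     # phase 1
    if all(votes):
        for p in participants:                      # phase 2: commit all
            p.commit()
        return "committed"
    for p in participants:                          # phase 2: abort all
        p.abort()
    return "aborted"

# If any one site cannot commit, every site aborts:
sites = [Participant(), Participant(can_commit=False), Participant()]
assert two_phase_commit(sites) == "aborted"
assert all(p.state == "aborted" for p in sites)
```

The all-or-nothing outcome is exactly the guarantee in the definition: no site ends up committed while another has rolled back.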
fragmentation transparency
- highest level of transparency - end user or programmer does not need to know that a database is partitioned - doesn't specify fragment names or locations
distributed request
- lets a single SQL statement reference data located at several different local or remote DP sites - therefore the transaction can access several sites - provides for fully distributed database processing
multiple-site processing, single-site data
- multiple processes run on different computers that share a single data repository - MPSD
problems of centralized DBMS
- performance degradation as the number of remote sites and the distances to them grow - high costs of maintaining and operating the central mainframe - reliability problems created by dependence on a single site - scalability problems imposed by a single location - organizational rigidity imposed by the database
data allocation
- process of deciding where to locate data - 3 strategies: centralized, partitioned or replicated
styles of replication
- push replication (after a data update, the originating DP node sends the changes to the replicas to ensure they are immediately updated) - pull replication (after a data update, the originating DP node sends messages to the replicas to notify them of the update -- each replica node decides when to apply it)
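The two styles can be contrasted in a short sketch; `Node` and the update functions are illustrative names, not a real replication API:

```python
# Push vs. pull replication: in push mode the originating node applies
# changes to replicas immediately; in pull mode it only notifies them,
# and each replica applies the change when it chooses.

class Node:
    def __init__(self, value=0):
        self.value = value
        self.inbox = []                 # pending change notifications

    def apply_pending(self):            # pull: replica decides when
        while self.inbox:
            self.value = self.inbox.pop(0)

def push_update(origin, replicas, new_value):
    origin.value = new_value
    for r in replicas:                  # replicas updated immediately
        r.value = new_value

def pull_update(origin, replicas, new_value):
    origin.value = new_value
    for r in replicas:                  # replicas merely notified
        r.inbox.append(new_value)

a, b = Node(), Node()
push_update(a, [b], 7)
assert b.value == 7                     # push: immediately consistent

c, d = Node(), Node()
pull_update(c, [d], 9)
assert d.value == 0                     # pull: replica is briefly stale
d.apply_pending()
assert d.value == 9                     # consistent after the pull
```

Push favors consistency at the cost of update latency at the origin; pull favors origin availability at the cost of temporarily stale replicas.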
heartbeat report
- sent by data node every 3 seconds - lets the name node know that the data node is still available
block report
- sent by data node every 6 hours - informs the name node of which blocks are on that data node
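Putting the two report types together, a name node could track data-node liveness and block placement roughly as below. The 10-second timeout and all names are illustrative assumptions; real HDFS uses different intervals and internals:

```python
# Sketch of an HDFS-style name node consuming heartbeats (liveness) and
# block reports (which blocks each data node holds). Times are passed in
# explicitly to keep the example deterministic.

class NameNode:
    def __init__(self, timeout=10.0):
        self.timeout = timeout
        self.last_heartbeat = {}    # data-node id -> last heartbeat time
        self.block_map = {}         # data-node id -> set of block ids

    def heartbeat(self, node_id, now):
        self.last_heartbeat[node_id] = now

    def block_report(self, node_id, blocks):
        self.block_map[node_id] = set(blocks)

    def live_nodes(self, now):
        # A node is live if its last heartbeat is within the timeout.
        return {n for n, t in self.last_heartbeat.items()
                if now - t <= self.timeout}

nn = NameNode(timeout=10.0)
nn.heartbeat("dn1", now=0.0)
nn.heartbeat("dn2", now=0.0)
nn.block_report("dn1", ["blk_1", "blk_2"])
nn.heartbeat("dn1", now=9.0)    # dn1 keeps reporting; dn2 goes silent
assert nn.live_nodes(now=12.0) == {"dn1"}
assert "blk_2" in nn.block_map["dn1"]
```

When a data node misses heartbeats, the name node can consult its block map to re-replicate that node's blocks elsewhere.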
client/server architecture
- similar to that of the network file server except that all database processing is done at the server site, thus reducing network traffic - variation of MPSD
distributed database
- stores a logically related database over two or more physically independent sites - requires distributed processing
unreplicated database
- stores each database fragment at a single site - no duplicate data fragments
fully replicated database
- stores multiple copies of each database fragment at multiple sites - all database fragments are replicated
transaction processor
- the software component found in each computer or device that receives & processes application's remote & local data requests - TP - aka application processor (AP) - aka transaction manager (TM)
data processor
- the software component residing on each computer or device that stores and retrieves data located at the site - DP - aka data manager (DM)
DO-UNDO-REDO protocol
- used by the DP to roll transactions back and forward with the help of the system's transaction log entries
CAP Theorem
- 3 desired properties of a DDBMS: Consistency, Availability, Partition tolerance - impossible for a system to provide all 3 at the same time - formulated by Dr. Eric Brewer, 2000
heterogeneity transparency
allows the integration of several different local DBMSs (relational, network and hierarchical) under a common, or global, schema
unique fragment
condition that indicates each row is unique, regardless of the fragment in which it is located
replicated data allocation
copies of one or more database fragments are stored at several sites
vertical fragmentation
data fragmentation strategy that refers to the division of a relation into attribute (column) subsets - equivalent to PROJECT stmt
horizontal fragmentation
data fragmentation strategy that refers to the division of a relation into subsets (fragments) of tuples (rows) - equivalent to SELECT stmt w/WHERE clause of single attribute
mixed fragmentation
data fragmentation strategy that refers to the division of a relation using a combination of horizontal and vertical strategies
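The three strategies above can be mimicked on a table held as a list of row dicts. The sample data and helper names are illustrative:

```python
# Horizontal fragmentation keeps whole rows satisfying a predicate
# (like SELECT ... WHERE); vertical keeps a subset of columns (like
# PROJECT); mixed composes the two.

customers = [
    {"id": 1, "name": "Ann",  "state": "FL", "balance": 250},
    {"id": 2, "name": "Bob",  "state": "TN", "balance": 100},
    {"id": 3, "name": "Cara", "state": "FL", "balance": 0},
]

def horizontal(rows, predicate):
    return [r for r in rows if predicate(r)]

def vertical(rows, columns):
    return [{c: r[c] for c in columns} for r in rows]

def mixed(rows, predicate, columns):
    return vertical(horizontal(rows, predicate), columns)

# Horizontal fragment: Florida customers only.
fl_fragment = horizontal(customers, lambda r: r["state"] == "FL")
assert [r["id"] for r in fl_fragment] == [1, 3]

# Vertical fragment: just the id/name columns.
name_fragment = vertical(customers, ["id", "name"])
assert name_fragment[0] == {"id": 1, "name": "Ann"}

# Mixed fragment: id/name of Florida customers.
fl_names = mixed(customers, lambda r: r["state"] == "FL", ["id", "name"])
assert fl_names == [{"id": 1, "name": "Ann"}, {"id": 3, "name": "Cara"}]
```

Each fragment could then be allocated to the site where it is used most, per the data allocation strategies below.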
partitioned data allocation
database is divided into two or more disjointed parts (fragments) and stored at two or more sites
centralized data allocation
entire database is stored at one site
local mapping transparency
exists when the end user or programmer must specify both the fragment names and their locations
location transparency
exists when the end user or programmer must specify the database fragment names but does not need to specify where those fragments are located
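The three levels of distribution transparency differ only in how much the query must spell out. A sketch using pseudo-SQL strings: the `NODE` clause and the fragment names `E1`/`E2` follow classic textbook examples and are not standard SQL:

```python
# The same logical query written at each distribution-transparency level
# for a CUSTOMER table horizontally fragmented by region (illustrative).

# Fragmentation transparency: neither fragments nor sites are named.
q_fragmentation = "SELECT * FROM CUSTOMER WHERE CUS_STATE = 'FL';"

# Location transparency: fragment names required, sites not.
q_location = (
    "SELECT * FROM E1 WHERE CUS_STATE = 'FL' "
    "UNION SELECT * FROM E2 WHERE CUS_STATE = 'FL';"
)

# Local mapping transparency: both fragment names and sites required.
q_local_mapping = (
    "SELECT * FROM E1 NODE NYC WHERE CUS_STATE = 'FL' "
    "UNION SELECT * FROM E2 NODE ATL WHERE CUS_STATE = 'FL';"
)

# The higher the transparency, the less distribution detail in the query:
assert "E1" not in q_fragmentation
assert "NODE" not in q_location and "E1" in q_location
assert "NODE" in q_local_mapping
```

Fragmentation transparency (the highest level) lets the query read exactly as it would against a centralized database.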
subordinates
in a two-phase commit protocol, the cohort nodes
coordinator
in a two-phase commit protocol, the role assigned to the node that initiates the transaction
heterogeneous DDBMS
integrates different types of DBMSs over a network, but all support the same data model
homogeneous DDBMS
integrates multiple instances of the same DBMS over a network, which can be on different platforms
partition key
one or more attributes in a table that determine the fragment in which a row will be stored - used in range partitioning
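Range partitioning on a partition key can be sketched as a lookup against range boundaries; the boundary values and names are illustrative assumptions:

```python
# Range partitioning: the partition key's value is compared against
# range boundaries to pick the fragment (and hence site) for the row.

def fragment_for(row, key, boundaries):
    """Return the index of the fragment whose range contains row[key].
    boundaries = [b0, b1, ...] defines ranges (-inf, b0), [b0, b1), ...,
    with a final open-ended range for values >= the last boundary."""
    value = row[key]
    for i, bound in enumerate(boundaries):
        if value < bound:
            return i
    return len(boundaries)

# Partition customer rows on the key "cus_num" into three fragments:
boundaries = [1000, 2000]   # <1000 -> 0, 1000-1999 -> 1, >=2000 -> 2
assert fragment_for({"cus_num": 250},  "cus_num", boundaries) == 0
assert fragment_for({"cus_num": 1500}, "cus_num", boundaries) == 1
assert fragment_for({"cus_num": 9999}, "cus_num", boundaries) == 2
```

Because the same function decides placement at insert time and lookup time, the TP can route a request for a given key straight to the owning fragment.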
replica transparency
refers to the DDBMS's ability to hide multiple copies of data from the user
mutual consistency rule
requires that all copies of data fragments be identical
distributed database design
same design principles as centralized DB, plus issues of: - data fragmentation (how to partition) - data replication (which fragments to replicate) - data allocation (where to locate fragments)
partially replicated database
stores multiple copies of some database fragments at multiple sites
fully heterogeneous DDBMS
supports different types of DBMSs, each one with a different data model, running under different computer systems
network latency
the delay imposed by the amount of time required for a data packet to make a round trip from point A to point B
network partitioning
the delay imposed when nodes become suddenly unavailable due to a network failure
data replication
the storage of data copies at multiple sites served by a computer network - subject to mutual consistency rule