Ch 2 Data Models
some of the most frequently used Big Data technologies are
- Hadoop
- MapReduce
- NoSQL
The use of external views that represent subsets of the database has some important advantages:
- It is easy to identify specific data required to support each business unit's operations.
- It makes the designer's job easy by providing feedback about the model's adequacy. Specifically, the model can be checked to ensure that it supports all processes as defined by their external models, as well as all operational requirements and constraints.
- It helps to ensure security constraints in the database design. Damaging an entire database is more difficult when each business unit works with only a subset of data.
- It makes application program development much simpler.
an implementation-ready data model should contain at least the following components:
- a description of the data structure that will store the end-user data
- a set of enforceable rules to guarantee the integrity of the data
- a data manipulation methodology to support the real-world data transformations
from the end user's perspective, any SQL-based relational database application involves three parts:
- end-user interface
- a collection of tables stored in the database
- SQL engine
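The three parts can be seen in a small sketch that uses Python's built-in sqlite3 module. The `customer` table, its columns, and the `find_customer` function are hypothetical names chosen for illustration: the function stands in for the end-user interface, the table is the stored data, and SQLite plays the role of the SQL engine.

```python
import sqlite3

# The collection of tables: one hypothetical table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (cus_id INTEGER PRIMARY KEY, cus_name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO customer (cus_id, cus_name) VALUES (?, ?)",
    [(1, "Ramas"), (2, "Dunne")],
)


def find_customer(cus_id):
    """Stand-in for the end-user interface: takes a request, returns a result."""
    # The SQL engine (SQLite here) decides how to locate the row.
    row = conn.execute(
        "SELECT cus_name FROM customer WHERE cus_id = ?", (cus_id,)
    ).fetchone()
    return row[0] if row else None


print(find_customer(2))
```

The caller never says how the row is found; it only states what it wants, and the engine does the rest.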
the basic building blocks of data models include
- entities
- attributes
- relationships
- constraints
the ANSI/SPARC architecture defines three levels of data abstraction
- external
- conceptual
- internal
- physical (added in the extended version of the framework)
HDFS uses three types of nodes
- name nodes
- data nodes
- client nodes
database models use three types of relationships
- one to many
- many to many
- one to one
NoSQL (in Ch2) refers to the new generation of databases that address the specific challenges of the Big Data era and have the following general characteristics:
- they are not based on the relational model and SQL, hence the name NoSQL
- they support highly distributed database architectures
- they provide high scalability, high availability, and fault tolerance
- they support very large amounts of sparse data (data with a large number of attributes but a low number of actual data instances)
- they are geared toward performance rather than transaction consistency
the basic characteristics of Big Data databases
- volume
- velocity
- variety
The conceptual model yields some important advantages:
First, it provides a bird's-eye (macro-level) view of the data environment that is relatively easy to understand. Second, the conceptual model is independent of both software and hardware.
object/relational database management system (O/R DBMS)
a DBMS based on the extended relational data model (ERDM). The ERDM, championed by many relational database researchers, constitutes the relational model's response to the OODM. This model includes many of the object-oriented model's best features within an inherently simpler relational database structure
attribute
a characteristic of an entity or object. An attribute has a name and a data type. Attributes are the equivalent of fields in file systems
a problem domain is
a clearly defined area within the real-world environment, with a well-defined scope and boundaries, that will be systematically addressed
entity set
a collection of like entities
relational database management system (RDBMS)
a collection of programs that manages a relational database. The RDBMS software translates a user's logical requests (queries) into commands that physically locate and retrieve the requested data
class
a collection of similar objects with shared structure (attributes) and behavior (methods). A class encapsulates an object's data representation and a method's implementation. Objects that share similar characteristics are grouped in classes
hardware independence
a condition in which a model does not depend on the hardware used in the model's implementation. Therefore, changes in the hardware will have no effect on the database design at the conceptual level
logical independence
a condition in which the internal model can be changed without affecting the conceptual model. (The internal model is hardware independent because it is unaffected by the computer on which the software is installed. Therefore, a change in storage devices or operating systems will not affect the internal model.)
physical independence
a condition in which the physical model can be changed without affecting the internal model
entity relationship (ER) model (ERM)
a data model that describes relationships (1:1, 1:M, and M:N) among entities at the conceptual level with the help of ER diagrams
object-oriented data model (OODM)
a data model whose basic modeling structure is an object. Both data and its relationships are contained in a single structure known as an object
degrees of abstraction
a database designer starts with an abstract view of the overall data environment and adds details as the design comes closer to implementation
business rule
a description of a policy, procedure, or principle within an organization. Ex: a pilot cannot be on duty for more than 10 hours during a 24-hour period, or a professor may teach up to four classes during a semester
entity relationship diagram (ERD)
a diagram that depicts an entity relationship model's entities, attributes, and relationships
class diagram
a diagram used to represent data and their relationships in UML object notation
relational diagram
a graphical representation of a relational database's entities, the attributes within those entities, and the relationships among the entities
Hadoop Distributed File System (HDFS)
a highly distributed, fault-tolerant file storage system designed to manage large amounts of data at high speeds. This is achieved through the use of a write-once, read-many model
unified modeling language (UML)
a language based on object-oriented concepts that provides tools such as diagrams and symbols to graphically model a system
table (relation)
a logical construct perceived to be a two-dimensional structure composed of intersecting rows (entities) and columns (attributes) that represents an entity set in the relational model
schema
a logical grouping of database objects, such as tables, indexes, views, and queries, that are related to each other. The schema is the conceptual organization of the entire database as viewed by the database administrator
Extensible Markup Language (XML)
a metalanguage used to represent and manipulate data elements. Unlike other markup languages, XML permits the manipulation of a document's data elements. XML facilitates the exchange of structured documents such as orders and invoices over the internet. It emerged as the de facto standard for the efficient and effective exchange of structured, semistructured, and unstructured data
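A minimal sketch of what "manipulating a document's data elements" can look like, using Python's standard xml.etree.ElementTree module. The order document, its tag names, and its attributes are invented for illustration and do not follow any real invoice standard.

```python
import xml.etree.ElementTree as ET

# A hypothetical order document; the element and attribute names are made up.
doc = """
<order id="1001">
  <customer>Acme</customer>
  <line sku="A1" qty="2" price="9.50"/>
  <line sku="B2" qty="1" price="20.00"/>
</order>
"""

root = ET.fromstring(doc)

# Read individual data elements out of the document.
order_id = root.get("id")
customer = root.findtext("customer")

# Compute a value from the line items.
total = sum(
    int(line.get("qty")) * float(line.get("price"))
    for line in root.findall("line")
)

print(order_id, customer, total)
```

Because the tags describe the data they carry, a receiving program can pick out the customer or the line items without knowing anything about how the sender stores them.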
physical model
a model in which physical characteristics such as location, path, and format are described for the data. The physical model is both hardware and software (the DBMS and operating system) dependent
extended relational data model (ERDM)
a model that includes the object-oriented model's best features in an inherently simpler relational database structural environment. See extended entity relationship model
big data
a movement to find new and better ways to manage large amounts of web-generated data and derive business insights from it, while simultaneously providing high performance and scalability at a reasonable cost
NoSQL
a new generation of database management systems that is not based on the traditional relational database model. NoSQL takes a nonrelational approach to the storage and processing of data and provides distributed, fault-tolerant databases for processing nonstructured data
entity
a person, place, thing, concept, or event for which data can be collected and stored. See also attribute
software independence
a property of any model or application that does not depend on the software used to implement it
internal schema
a representation of an internal model using the database constructs supported by the chosen database
conceptual schema
a representation of the conceptual model, usually expressed graphically. The most widely used conceptual model is the ER model
crow's foot notation
a representation of the entity relationship diagram that uses a three-pronged symbol to represent the "many" sides of the relationship
data model
a representation, usually graphic, of a complex "real world" data structure. Data models are used in the database design phase of the Database Life Cycle
constraint
a restriction placed on data, usually expressed in the form of rules. Ex: a student's GPA must be between 0.00 and 4.00
entity instance (entity occurrence)
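The GPA rule can be turned into an enforceable constraint. The sketch below uses Python's sqlite3 module with a CHECK clause; the `student` table and its column names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The CHECK clause encodes the rule "GPA must be between 0.00 and 4.00".
conn.execute(
    """
    CREATE TABLE student (
        stu_id  INTEGER PRIMARY KEY,
        stu_gpa REAL NOT NULL CHECK (stu_gpa BETWEEN 0.00 AND 4.00)
    )
    """
)

# A valid row is accepted.
conn.execute("INSERT INTO student VALUES (1, 3.25)")

# An out-of-range GPA is rejected by the DBMS itself.
try:
    conn.execute("INSERT INTO student VALUES (2, 4.70)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]
print(rejected, stored)
```

Putting the rule in the table definition means every application that writes to the table is held to it, not just the ones that remember to check.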
a row in a relational table
logical design
a stage in the design phase that matches the conceptual design to the requirements of the selected DBMS and is, therefore, software dependent. Logical design is used to translate the conceptual design into the internal model for a selected database management system, such as DB2, SQL Server, Oracle, IMS, Informix, Access, or Ingres. More loosely, the term generally refers to the task of creating a conceptual data model that could be implemented in any DBMS
object
an abstract representation of a real-world entity that has a unique identity, embedded properties, and the ability to interact with other objects and itself. An object can be considered the equivalent of an ER model's entity
relationship
an association between entities
network model
an early data model that represented data as a collection of record types in 1:M relationships. The network model allows a record to have more than one parent
hierarchical model
an early database model whose basic concepts and characteristics formed the basis for subsequent database development. This model is based on an upside-down tree structure in which each record is called a segment. The top record is the root segment. Each segment has a 1:M relationship to the segment directly below it
MapReduce
an open-source application programming interface (API) that provides fast data analytics services; one of the main Big Data technologies, it allows organizations to process massive data stores by distributing the processing among thousands of nodes in parallel
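The idea behind MapReduce can be sketched in plain Python as a word count. This is not the Hadoop API: it runs in a single process, and the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical. In a real cluster, each chunk would be mapped on a different node and the reducers would also run in parallel.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Each string stands in for a block of data held on a separate node.
chunks = ["big data big", "data models", "big models"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

print(counts)
```

The map step needs only its own chunk, which is what makes it possible to spread the work across many nodes.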
many-to-many (M:N or *..*) relationship
association among two or more entities in which one occurrence of an entity is associated with many occurrences of the related entity and one occurrence of the related entity is associated with many occurrences of the entity
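In a relational database, an M:N relationship is commonly implemented with a bridge table that holds one row per pairing. The sketch below assumes hypothetical `student`, `course`, and `enroll` tables and uses Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE student (stu_id INTEGER PRIMARY KEY, stu_name TEXT);
    CREATE TABLE course  (crs_id INTEGER PRIMARY KEY, crs_title TEXT);

    -- Bridge table: each row pairs one student with one course.
    CREATE TABLE enroll (
        stu_id INTEGER REFERENCES student(stu_id),
        crs_id INTEGER REFERENCES course(crs_id),
        PRIMARY KEY (stu_id, crs_id)
    );

    INSERT INTO student VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO course  VALUES (10, 'Databases'), (20, 'Networks');
    INSERT INTO enroll  VALUES (1, 10), (1, 20), (2, 10);
    """
)

# One student is associated with many courses...
courses_for_ana = [
    row[0]
    for row in conn.execute(
        "SELECT c.crs_title FROM course c "
        "JOIN enroll e ON e.crs_id = c.crs_id "
        "WHERE e.stu_id = 1 ORDER BY c.crs_title"
    )
]

# ...and one course is associated with many students.
students_in_databases = [
    row[0]
    for row in conn.execute(
        "SELECT s.stu_name FROM student s "
        "JOIN enroll e ON e.stu_id = s.stu_id "
        "WHERE e.crs_id = 10 ORDER BY s.stu_name"
    )
]

print(courses_for_ana, students_in_databases)
```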
one-to-one (1:1 or 1..1) relationship
associations among two or more entities that are used by data models. In a 1:1 relationship, one entity instance is associated with only one instance of the related entity
one-to-many (1:M or 1..*) relationship
associations among two or more entities that are used by data models. In a 1:M relationship, one entity instance is associated with many instances of the related entity
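A 1:M relationship is usually implemented by placing a foreign key on the "many" side. The sketch below assumes hypothetical `painter` and `painting` tables. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON` is issued on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default

conn.execute(
    "CREATE TABLE painter (painter_id INTEGER PRIMARY KEY, painter_name TEXT)"
)
# The foreign key sits on the "many" side of the relationship.
conn.execute(
    """
    CREATE TABLE painting (
        painting_id INTEGER PRIMARY KEY,
        title       TEXT,
        painter_id  INTEGER NOT NULL REFERENCES painter(painter_id)
    )
    """
)

conn.execute("INSERT INTO painter VALUES (1, 'Rivera')")
conn.executemany(
    "INSERT INTO painting VALUES (?, ?, ?)",
    [(100, "Dawn", 1), (101, "Dusk", 1)],
)

# One painter instance is associated with many painting instances.
painting_count = conn.execute(
    "SELECT COUNT(*) FROM painting WHERE painter_id = 1"
).fetchone()[0]

# A painting that points at a nonexistent painter is rejected.
try:
    conn.execute("INSERT INTO painting VALUES (102, 'Orphan', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(painting_count, orphan_rejected)
```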
the success of the O/R DBMS can be attributed to the model's
conceptual simplicity, data integrity, easy-to-use query language, high transaction performance, high availability, security, scalability, and expandability. Most relational DB products can be classified as object/relational and serve, for example, OLTP and OLAP DB applications
object-oriented database management system (OODBMS)
data management software used to manage data in an object-oriented database model
the physical model is
dependent on the DBMS, methods of accessing files, and types of hardware storage devices supported by the operating system
attributes in OODM models
describe the properties of an object
relational model
developed by E. F. Codd of IBM in 1970, the relational model is based on mathematical set theory and represents data as independent relations. Each relation (table) is conceptually represented as a two-dimensional structure of intersecting rows and columns. The relations are related to each other through the sharing of common entity characteristics (values in columns)
internal model (software dependent, hardware independent)
in database modeling, a level of data abstraction that adapts the conceptual model to a specific DBMS model for implementation. The internal model is the representation of a database as "seen" by the DBMS. In other words, the internal model requires a designer to match the conceptual model's characteristics and constraints to those of the selected implementation model
segment
in the hierarchical data model, the equivalent of a file system's record type
method
in the object-oriented data model, a named set of instructions to perform an action. Methods represent real-world actions and are the equivalent of procedures in traditional programming languages
tuple
in the relational model, a table row
Hadoop
a Java-based, open source, high-speed, fault-tolerant distributed storage and computational framework. It uses low-cost hardware to create clusters of thousands of computer nodes to store and process data. Hadoop is neither a database nor a data model; it is a distributed file storage and processing model
inheritance
in the object-oriented data model, the ability of an object within the class hierarchy to inherit the data structure (attributes) and methods of the classes above it
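Inheritance is easy to see in an object-oriented language. In the sketch below, the classes `Person`, `Employee`, and `Pilot` are hypothetical; each subclass adds its own attribute while reusing the attributes and methods defined above it in the class hierarchy.

```python
class Person:
    """Superclass at the top of the hierarchy."""

    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"{self.name} is a person"


class Employee(Person):
    """Subclass of Person: inherits name and describe(), adds salary."""

    def __init__(self, name, salary):
        super().__init__(name)
        self.salary = salary


class Pilot(Employee):
    """Subclass of Employee: inherits everything above, adds license_no."""

    def __init__(self, name, salary, license_no):
        super().__init__(name, salary)
        self.license_no = license_no


pilot = Pilot("Ortega", 90000, "ATP-123")

# describe() was defined only on Person, yet a Pilot object can call it.
print(pilot.describe())
print(pilot.salary, pilot.license_no)
```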
OO DBMS is popular in
niche markets such as computer-aided drawing/computer-aided manufacturing (CAD/CAM), geographic information systems (GIS), telecommunications, and multimedia, which require support for more complex objects
velocity
refers not only to the speed with which data grows but also to the need to process this data quickly in order to generate information and insight
volume
refers to the amounts of data stored
variety
refers to the fact that the data being collected comes in multiple different data formats
Chen notation
see entity relationship (ER) model
external model
the application programmer's view of the data environment. Given its business focus, an external model works with a data subset of the global database schema
semantic data model
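One common way to give a business unit its own subset of the database is a view. The sketch below uses Python's sqlite3 module; the `employee` table and the `sales_directory` view are hypothetical, and a view is only one possible way to realize an external model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- The global table holds every employee and every column.
    CREATE TABLE employee (
        emp_id     INTEGER PRIMARY KEY,
        emp_name   TEXT,
        emp_dept   TEXT,
        emp_salary REAL
    );
    INSERT INTO employee VALUES
        (1, 'Lee',  'Sales',   52000),
        (2, 'Cruz', 'Payroll', 61000);

    -- The external view: Sales sees only its own people, and no salaries.
    CREATE VIEW sales_directory AS
        SELECT emp_id, emp_name FROM employee WHERE emp_dept = 'Sales';
    """
)

cursor = conn.execute("SELECT * FROM sales_directory")
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(columns, rows)
```

An application written against `sales_directory` cannot read or depend on data outside that subset.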
the first of a series of data models that models both data and their relationships in a single structure known as an object
data modeling
the first step in database design; the process of creating a specific data model for a determined problem domain
American National Standards Institute (ANSI)
the group that accepted the DBTG recommendations and augmented database standards in 1975 through its SPARC committee. The Standards Planning and Requirements Committee (SPARC) defined a framework for data modeling based on degrees of data abstraction
data definition language (DDL)
the language that allows a database administrator to define the database structure, schema, and subschema
class hierarchy
the organization of classes in a hierarchical tree in which each parent is a superclass and each child class is a subclass
conceptual model
the output of the conceptual design process. The conceptual model provides a global view of an entire database and describes the main data objects, avoiding details
subschema
the portion of the database that interacts with application programs
connectivity
the classification of the relationship between entities. Classifications include 1:1, 1:M, and M:N
data manipulation language (DML)
the set of commands that allows an end user to manipulate the data in the database, such as SELECT, INSERT, UPDATE, DELETE, COMMIT, and ROLLBACK. DML defines the environment in which data can be managed and is used to work with the data in the database
class diagram notation, which is part of the Unified Modeling Language (UML)
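The commands listed above can be tried out with Python's sqlite3 module. The `product` table is hypothetical. The sketch relies on the module's default behavior of opening a transaction implicitly before INSERT, UPDATE, and DELETE, so `commit()` and `rollback()` stand in for the SQL COMMIT and ROLLBACK commands.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (p_code TEXT PRIMARY KEY, p_price REAL)")

# INSERT two rows, then COMMIT to make them permanent.
conn.execute("INSERT INTO product VALUES ('A1', 10.0)")
conn.execute("INSERT INTO product VALUES ('B2', 20.0)")
conn.commit()

# UPDATE and DELETE inside a new transaction...
conn.execute("UPDATE product SET p_price = 12.5 WHERE p_code = 'A1'")
conn.execute("DELETE FROM product WHERE p_code = 'B2'")

# ...then ROLLBACK, which undoes both uncommitted changes.
conn.rollback()

# SELECT shows the data as it was at the last COMMIT.
rows = conn.execute(
    "SELECT p_code, p_price FROM product ORDER BY p_code"
).fetchall()

print(rows)
```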
the set of symbols used in the creation of class diagrams
external schema
the specific representation of an external view; the end user's view of the data environment
client node
used in the HDFS. The client node acts as the interface between the user application and the HDFS
name node
used in the HDFS. The name node stores all the metadata about the file system
data node
used in the HDFS. The data nodes store fixed-size data blocks (that could be replicated to other data nodes)
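The division of labor among the three node types can be illustrated with a toy, in-memory sketch. This is not how HDFS is implemented: the block size, replica count, round-robin placement, and the names `put`, `get`, `name_node`, and `data_nodes` are all assumptions made for the example.

```python
BLOCK_SIZE = 4   # bytes per block; tiny on purpose, real blocks are far larger
REPLICAS = 2     # copies kept of each block (assumed value)

# Data nodes: each one stores blocks keyed by block id.
data_nodes = {"dn1": {}, "dn2": {}, "dn3": {}}

# Name node: metadata only, mapping a file name to its blocks and locations.
name_node = {}


def put(file_name, payload):
    """Client-side write: split into fixed-size blocks and replicate them."""
    node_names = sorted(data_nodes)
    placements = []
    for offset in range(0, len(payload), BLOCK_SIZE):
        index = offset // BLOCK_SIZE
        block_id = f"{file_name}#{index}"
        block = payload[offset:offset + BLOCK_SIZE]
        # Simple round-robin placement across the data nodes.
        targets = [
            node_names[(index + r) % len(node_names)] for r in range(REPLICAS)
        ]
        for target in targets:
            data_nodes[target][block_id] = block
        placements.append((block_id, targets))
    name_node[file_name] = placements


def get(file_name):
    """Client-side read: ask the name node where blocks live, then fetch them."""
    return b"".join(
        data_nodes[targets[0]][block_id]
        for block_id, targets in name_node[file_name]
    )


put("log.txt", b"write once read many")
print(name_node["log.txt"])
print(get("log.txt"))
```

The name node never holds file contents, only the map of which data nodes hold which blocks.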
The process of identifying and documenting business rules is essential to database design for several reasons:
• It helps to standardize the company's view of data.
• It can be a communication tool between users and designers.
• It allows the designer to understand the nature, role, and scope of the data.
• It allows the designer to understand business processes.
• It allows the designer to develop appropriate relationship participation rules and constraints and to create an accurate data model.