Chapter 4: Advanced database systems
Dbms
Since the DBMS and the programming language provide different data types, this mismatch must be resolved by translation code. Additional type checking must be carried out, since the database itself can no longer enforce all rules regarding data types. This may also mean that integrity constraints must be enforced in the embedding language. Restrictions may also be necessary to prevent database access through other applications, since they might occur without appropriate type checking. Management of checkpoints and transactions may have to be moved to the programming language. • Turn an OO programming language into an OODBMS. Use the object system of an existing programming language, but make variables persistent (saved to permanent storage), permit concurrent access and so on. • Create an OO DBMS from scratch. All of these approaches, and others, have been put into practice from the 1980s onward.
Other models for data handling systems
Some alternative database models originate in an academic environment, where the concern may be either the search for alternative theoretical groundings for database systems or better alignment with other programming paradigms. Other models may originate in industry, and are often more closely associated with a product or the particular type of data that they are designed to store. Performance, behaviour and robustness will often change as the product becomes better understood, and as the range of users becomes greater. However, there are often connections between the ideas of these two communities - and each will usually know the work of the other. For the quick survey that follows, we start with a brief introduction to two paradigms that were particularly favoured by academic communities in the 1990s and then, moving into more current, usually industry-led, approaches. We shall see, in the process, how these newer models relate to the older ones.
Problems with scaling
Tables in RDBMS implementations are usually indexed in ways that are very fast, no matter how large the dataset is. Despite this, there are other aspects of the RDBMS approach that do not perform so well at scale. Often these aspects are those concerned with ensuring that the database remains correct and consistent. In some cases, alternative strategies can be devised that are optimised for specific requirements sets. For example, applications that are either less concerned about the continuous correctness of the data involved or where data will be updated only very rarely can be treated in a different way. In some cases, the reason for using a different paradigm is that, given sufficiently good programmers, and enough of them, a specialised solution will perform better than a general-purpose database system. This is especially likely to be the case for industrial users with very large amounts of data and tight performance requirements. In other cases, if no current RDBMS system supports a particular set of requirements for data processing or for distribution, then it may be considered easier for a company to either use a different approach where it has been implemented or to hire programmers to write a library in their programming language of choice. Often tools evolve from relatively local, specialised projects to larger, more data-intensive ones, in which case, transferring the functionality from an existing framework to one involving a DBMS may be considered too costly.
Big Data and MapReduce
Big Data is a much-used term, but has no precise, generally-agreed meaning. Usually, Big Data implies some or all of: • data that is too large to hold in the volatile memory of a single machine • data that necessitates distributed storage • data that will be processed as a whole, rather than by sampling a subset. We will generally focus on data generated by web-based companies in the discussion below, but Big Data is also important in scientific research, with significant contributions from fields such as astronomy, particle physics and molecular biology. Big Data must be stored somewhere, and the first component of any system storing large amounts of data is the file storage system. We shall not consider how these work, but note that, while Google developed their own proprietary system (GFS) for use in-house, Yahoo! adopted and extended an existing opensource file system called HDFS, the Hadoop Distributed File System. HDFS is a distributed file system that uses TCP/IP for communication between nodes. It is optimised for data that is written once, or at least relatively rarely, but read very often. Data is broken down into 'chunks' which will each be replicated several times across the network. HDFS is extremely fast for its main usage patterns, but this speed is partly achieved by not implementing some features that a desktop file system would implement as standard.
SQL does not support...
Commonly, specific types of application are raised as presenting situations that are impossible or difficult within the RDBMS paradigm. In many cases, these are true, and we will consider some in more detail separately in this chapter; however, some cases are actually observations about a lack of tools rather than a failure of the model. For example, Connolly and Begg describe geographical information systems (GIS) as one such application. They argue that a query for all shops within a two-mile radius of a given location is problematic for an RDBMS. Position and navigational distance are functions of multi-dimensional information. Location, for example, must be specified in at least two dimensions and, for real accuracy, those dimensions must be mapped onto a globe, since they describe places on an approximately spherical world. The dimensions taken separately will not give sufficient information to solve the distance problem helpfully - a clause along the lines of WHERE Latitude<y+5 AND Latitude>y-5 AND Longitude<x+5 AND Longitude>x-5 will not give a particularly useful answer. What this means, is that for a database application to support GIS, it must have spatial data types, so that there is no need for the queries to handle the complexities of the domain. For a given DBMS implementation, the question is whether they support or can be made to support such types; for example, through domains and user-defined functions. PostgreSQL, for example, is extensible in such a way that GIS extensions can and have been made. Since the problem was not a fundamental one about the model itself, they can be remedied by choice of DBMS, provided that the tools or extensions required have already been developed.
No ad hoc structures
Databases are based on tables that can be created at any time, with any combination of columns, but if every data object being stored is unique in structure and hard to predict in advance, then there would be a risk of an overly complicated model evolving, with very many tables, often containing only one row or with most columns being NULL, and querying would be proportionately complicatedIn more general terms, the relational model is based around highly-structured information with a good degree of consistency and overlap in the documents being stored. Web documents in HTML are good examples of information that has some structure and organisation, but little that is guaranteed, nor a very clear pre-defined form. Many other documents have such issues. Such documents can be stored in relational databases, but either there is a requirement for a complex and evolving data model, or one may simply place documents into a field and process them using external routines, which reduces their usefulness.
Object Relational Database Management Systems (ORDBMS
Functions can be defined to operate on particular domains in a way that resembles method definition. This sometimes gives rise to some relational database management systems being referred to as Object Relational Database Management Systems (ORDBMS), but this seems unnecessary, since the technologies required for this level of object-oriented processing are already part of the relational model. Nonetheless, the alignment between the relational model and SQL on the one hand and OO as it is most commonly encountered in programming languages is not a perfect one, and some would prefer a closer match.
Deductive database systems
In the relational approach, the database consists of a set of extensionallydefined base relations. All tuples for each base relation are explicitly provided. We have seen one of the effects of this in the family tree example above. The notion of an ancestor (as opposed to a direct parent) was not in a tuple, and was thus not available to the system for querying. One of the solutions for addressing that shortcoming was to make an explicit base relation for representing the concept of ancestor, even if that duplicated information. Now consider a system where we could define what we mean by ancestor. It might look something like this: ancestor(c, a) ← parent(c, a) ancestor(c, a) ← parent(c, a') AND ancestor(a', a) What these lines say is that a is an ancestor of c if either a is the parent of c or if a is the ancestor of the parent of c. In the second of these statements, ancestor(a', a) will expand out to find either a parent or a more remote ancestor. In such a system, a query would be a case of asking the database to confirm whether two individuals were related according to predefined logic definitions of relatedness. These systems are called deductive, because they deduce the truth of statements by reasoning using logical definitions applied to 'axioms' (the equivalent of base relations), rather than needing to be provided with all the explicit information they need and the steps required to turn that information into an answer. The system of logic used here is called first-order logic, which is only one class of formal logic. We will encounter another class, description logic, later in this chapter.
No sql concept
NoSQL databases can be defined as systems for which the logical modelling of real-world systems is not carried out in terms of the relational or E/R models. The name is unhelpful - it is not the SQL language that is rejected in most cases, but the relational model. Some NoSQL systems are implemented using SQL databases at the physical level. Although this is irrelevant for modelling, it has led some commentators to describe NoSQL as 'not only SQL'. This expansion of NoSQL also makes the occasional use of an SQL-like language, for querying NoSQL databases, look like less of a contradiction. NoSQL approaches to information management and retrieval are often prompted by frustrations with some aspect of relational database systems. There are practical and theoretical shortcomings in each of the three elements - the relational model, SQL and RDBMS implementations - and there are certainly cases where the tools are not the most appropriate ones for a specific family of tasks.
Object-oriented database systems
Object orientation (OO) is an approach to software development which brings aspects of good programming practice to the fore and allows programmers to make explicit particular models of the data being handled. Advantages of an OO approach include the following. • The use of modelling concepts that have closer correspondence with real-life systems. Objects have attributes that refer to the attributes of the real-life 'thing' being modelled. • Class-based object orientation, which is the most common form, groups and generalises those objects in a way that provides rules for their behaviour in a way that can also match how real-life systems are understood. • Objects provide a strong mechanism for modular programming. • Libraries of objects provide an intuitive mechanism for reusable components. While all of these advantages are aspects of programming that can be available without an OO approach, they are easier and are, to some extent, built in when object systems are used.There are several ways to bring the OO approach to database systems. • Implement classes as domains, and user-defined functions as methods on those classes. This is available in most current DBMS, although how powerful this is depends on the implementation. • Embed SQL into an OO programming language. This requires more integration than the usual database libraries, since to take advantage of the OO aspects of the language, the library must perform extra operations usually reserved for the database system:
Introduction to alternative implementations
Over the course of this subject guide, three distinct concepts have been introduced - the relational model, SQL and RDBMS implementations. The relational model is a theoretical model for organising data that has provable properties and a formal algebra. The model provides the foundations on which SQL was constructed. SQL itself is a standard that describes the way that a user or administrator should be able to interact with a conformant database, including some rules for how information should be processed. Finally, RDBMS implementations are products that implement the SQL standard, to a greater or lesser extent. This carries with it some relationship to the principles of relational theory, although there are differences between pure relational theory and RDBMS implications.
Updating operations
The equivalent to the INSERT operator in SQL is assert, and works in a similar way: assert(person('Padmé Amidala', 1/3/29, f)) assert(parent('Padmé Amidala', 'Leia Organa')) assert(parent('Padmé Amidala', 'Luke Skywalker')) To delete an axiom from the database, the retract operator is used: retract(parent('Luke Skywalker', 'Kylo Ren')) We have not touched on integrity constraints here, but the axiomatic approach continues here as well - statements about how data objects relate to one another can be made as general rules and the database can then police the applicability of those rules as data changes.
Objects orientated databse
The result of this simplicity of model is that many applications that need to apply reasoning that depends on the type of object being stored will perform those operations outside of the database, using it primarily as a data store. This tends to mean that more data is transferred than should be and less data management happens within the DBMS than might be desirable. Features of the object-oriented approach are available in most RDBMS through domains, although how much this is the case depends on the implementation. Since domains can be defined with arbitrary complexity, they can easily stand in for classes, given a sufficiently powerful implementation.
Key-value databases
The simplest form of data storage is probably the closest to that of a file system. Since it is simple, it can be fast and its behaviour and consistency easier to predict. At its most basic form, a key-value database is just a persistent associative array - the database simply stores a data object or a list of data objects, associated with a key by which that object or list can be retrieved. If the object is a file path or a URI, it can appear several times in the database, each time associated with a different key. This works well for storing the results of indexing a set of documents, and underpins document search technologies such as Apache Solr. Each document to be indexed has keywords (or the roots of keywords) extracted, and each keyword is then added as a key to the database, with the document path added to the values list. Performing a textual search on the database is then a simple case of looking up all entries with the combination of keywords that have been entered into the search field. Retrieving the set of documents for each keyword is likely to be a constant-time operation, and then the set operation to find the documents that contain all keywords is carried out on a much smaller number of documents. In the coming sections, we shall explore some of the technologies and models that have been built around key-value concepts.
Querying
There are two types of queries in logic-based systems - those that retrieve Boolean True/False statements about the query, and those that return values or sets of values. The first type of query might be used to answer questions such as: • Is Anakin Skywalker Luke's father? ← father('Anakin Skywalker', 'Luke Skywalker') • Is Anakin Skywalker an ancestor of Kylo Ren? ← ancestor('Anakin Skywalker', 'Kylo Ren') Note that in neither case was a tuple provided for the predicate being queried - we have stated only parent relationships and the sex of the people involved. The facts that they are a father in the first case and an ancestor in the other has been deduced. For the second type of queries, where we only wish to see a particular value, the query is like those above, but substituting a variable for the information we want to see: • When was Leia Organa born? ← person('Leia Organa', DoB, _) • Who was Leia Organa's brother? ← brother(Bro, 'Leia Organa') Where we require a set of results, deductive databases provide a forall predicate that finds all values that satisfy the formula: • Who are the parents of Kylo Ren? ← forall(parent(Parents, 'Kylo Ren')) • What is the name and date of birth of all parents of at least one male child? ← forall(person(Parent,DoB,_) AND parent(Parent, Son) AND person(Son,_,m))
Difficulty with recursive queries
This is a fairly easy task in most programming languages - a tree is an easy structure to traverse, and iteration or recursion are usually provided as standard. However, none of the basic relational operators will help. Since trees and graphs are very common structures in real-life applications, it is a serious shortcoming to be incapable of handling them effectively. If the application must retrieve the entire table in order to solve the problem, or must send queries to the database repeatedly, to retrieve the tree one node at a time, then the DBMS has failed to justify its use.
Data definition
To continue with the family tree example above. Information can be added to the database by adding ground axioms. Let us say that we add people to the database by specifying name, date of birth and sex: person('Leia Organa', 21/10/56, f) person('Luke Skywalker', 21/10/56, m) person('Han Solo', 13/7/42, m) person('Anakin Skywalker', 14/7/34, m) person('Kylo Ren', 3/10/82, m) ... parent('Anakin Skywalker', 'Leia Organa') parent('Anakin Skywalker', 'Luke Skywalker') parent('Leia Organa', 'Kylo Ren') parent('Han Solo', 'Kylo Ren') We can also add some more deductive axioms about the concepts underlying the system: father(X, Y) ← parent(X,Y) AND person(X, _, m) mother(X, Y) ← parent(X, Y) AND person(X, _, f) sibling(X, Y) ← parent(Z, X) AND parent(Z, Y) AND NOT equal(X, Y)
Summary of oop databases
To summarise, the OO DBMS approach can offer advantages such as: • richer modelling capabilities • easier extensibility • closer integration into existing programming paradigms • removal of the mismatch between data structures in code and database • expressive query language. Disadvantages depend on the implementation, but may include: • limited standardisation • tension between database management tasks such as query optimisation and OO goals such as encapsulation • locking and transaction control may be more complex.
No sql
While it was never the case that all database systems were relational or SQL systems, either in theory or practice, the dominance of the SQL standard has meant that the DBMS and RDBMS have often been treated as interchangeable terms; while courses and books on database systems could focus exclusively on the relational model and SQL. As a result of this, the increase in the diversity of models that arose in the late 1990s and early 2000s were often defined as a reaction against SQL. These models can be very different from one another, but are often grouped together under the category NoSQL.
Mismatches between application programs and DBMS
database will usually be encountered via at least one piece of software written in some other language. Often there will be a chain of such software, culminating in either a web page or a standalone application. Many programming languages interact well with each other because they shareprogramming paradigms, data structures and interface elements. There are fewer commonalities between such languages and the relational model implemented in SQL-based databases. This usually means extra program layers to translate between the concepts and operations of the database and those of the application itself. These are very practical concerns. Database systems are very different to most other programming environments and large organisations will usually hire database developers and programmers separately. On the one hand, a company will need to consider whether to hire a separate database developer or whether they will do better using an option that integrates better with their other programming skills. On the other hand, a risk of developing a database application that looks like any other application and is written by a programmer without database expertise is that it may perform poorly, lose data or pose a security risk. In other words, the relational database paradigm is different partly because the problems it is designed to tackle are often different.
Recursive queries
fact, SQL has had the ability to perform recursive queries since 1999, and most DBMS implementations (excluding MySQL, at the time of writing) have implemented these. The syntax, called common table expressions, allows a recursive query using the keyword WITH (or WITH RECURSIVE in PostgreSQL). Whether systems are well optimised for this and whether these sorts of queries feel idiomatic within SQL is arguably a matter for investigating at the point where a decision is being made about the system to use.
Relational theory and SQL predate
the rise of object-oriented (OO) programming. In an OO paradigm, the nature of a data object - its behaviour and its interactions with other objects - is available to the system either as part of its declared class or as attributes of the object itself. In SQL, on the other hand, all data objects are rows in tables, each value of which is a simple scalar.