Chapter 2: Databases - basic concepts
Data can be classified into two categories, namely:
1. Primary data - the fundamental information necessary to provide the database service, stored on permanent support, such as hard disks. 2. Derived data - information that can be inferred or calculated from primary data (and may be recalculated at any time).
Database structure and content
A database has a structure and content. The structure is represented in this example by the table headings; the content by the body of the table. The content changes in time - it is dynamic in nature. The structure can change, but it is far less changeable than the content. For instance, you could add a new column to this table - the type of trainer or the activity it might be associated with - but you would not expect to make such changes that often. The structure of the database is called its intension and the content is called its extension (we will return to this in more detail later in this chapter).
Hardware
As we have seen, the DBMS allows both the developer of a database and the database users to operate without knowing the details of the hardware being used. This does not remove from the system administrator the need to select hardware and operating systems that, firstly, are capable of running the chosen software; and secondly that can cope with the demands that will be placed upon it by the database and associated systems. The system administrator should be satisfied that: 1. There is enough permanent storage space, for instance disk space, to store the data and any indexes and cached derived data. 2. There is enough temporary storage space, for instance RAM, to hold intermediate results and computations. 3. There is enough computational power to manipulate the data at the rate that will be required. 4. There is fast enough communication between components of the system for moving the data between them. This is only usually an issue for particularly data-heavy applications or systems with a very large user base.
The components of a database system
At the highest, most general level, a database environment consists of: • data, representing the information needed for an organisation; • software, serving two purposes: the management of the stored data, and further processing of the data to the users' needs; • hardware, supporting both the stored data and the software components; • users, broadly divided into two categories: developers of the database system, and users of the system.
Disadvantages
Complexity. In the database approach the information needed by an organisation is modelled and implemented as a whole. Where the file-based approach can often be achieved piece by piece as individual departments develop a need and budget, the process of developing a database system is by its nature a single, unifying and more complex process, which will include: • data acquisition • data modelling and design • database implementation • database maintenance. The greater complexity of this process may mean that errors in implementation, design and data acquisition may occur, and be harder, within the organisation, to get fixed. Depending on the organisation and its data need, the database approach may require extra hardware and IT infrastructure, along with new maintenance contracts. Depending on the system being replaced, this can make a database approach more expensive, in terms of either initial or ongoing costs. The DBMS software itself may cost no money, since there are many free and open source options, but the system built around it will require developer time and may also incorporate other, paid-for software. In some cases, the integration of several systems may represent a reduction in costs, as separate contracts and IT structures are rationalised and unified. Higher impact of failure. The database system is at the core of the information system of an organisation. All data is stored centrally, in the database. As a result, most applications rely on this data. If the DBMS fails, the whole organisation is paralysed, unlike a decentralised system, where a failure in one system will only directly affect the department that uses it. Performance. DBMS software is heavily optimised for its core functionality, but it is still a generic piece of software. A database application may be slower for an individual user than a bespoke, perhaps local, file-based solution.
Data administrator
Data administrator. The data administrator (DA) is a user who properly understands the data requirements of the organisation and is in charge of administering the organisation's data. This user:• decides which data is relevant and which is not;• is in charge of applying the organisation's policy and standards;• decides on the security policy, and so on.The DA does not need to be a technical expert or a manager. Rather, the DA is somewhere in between, liaising with the management on one hand, and with the technical team, on the other.
Data definitions
Data definition. The DBMS must provide support for defining or modifying the database schema. Schema definition includes specifying data types, structures, constraints and security restrictions. This is achieved by means of a data definition language (DDL). The statements made in DDL for a specific database system represent the system's catalogue (or data dictionary7). In theory, there should be a DDL at each level of abstraction (i.e. external, conceptual and internal), but in practice there usually exists a single DDL that allows definitions at any level.
Data manipulation
Data manipulation. The DBMS must provide support for data manipulation. In particular it has to support: • retrieval of existing data, contained in the database • deletion of old data from the database • insertion of new data • modification of out-of-date data.
Schemas and mappings
Database schema. The description of the database is called the database schema or the database intension. This is specified at the creation of the database. It is not expected to change very often. Database instance. The raw data that populates a database at a particular point in time is called a database instance or the extension of the database.
Databases and database management systems
Database. 'A database is a (shared) collection of logically-related persistent data (including its description) as part of the information system of an organisation.'
File-based systems
Definition: A file-based system is a collection of application programs, each managing its own data.
Separation and isolation
Separation and isolation. Data is scattered among different files, each file belonging to a certain department. A department has access to its own files, but no access to the files of the other departments. Files belonging to different departments cannot be used together in order to create more complex data or analysis. Often, because they are based on different infrastructures (platforms, development software, etc.) files belonging to different departments cannot be transferred (copied) across.
Database approach
So far, a database system is, for us, nothing more than a system that manages data. But is any system that manages data a database system? Is there anything that all database systems have in common, that distinguishes them from other software systems? The answer is obviously yes. In order to understand the 'database approach', we shall first have a brief look at file-based systems. In appearance (behaviour) they are similar to database systems, but they are conceptually (qualitatively) different. We shall identify the drawbacks of the file-based approach to data management and then introduce the database approach as a solution to most of these drawbacks.
These aims were formalised in the early 1970s and codified and adopted as a standard in 1975 as the ANSI/SPARC three-level architecture. The architecture forms a basis for most modern DBMS.
The ANSI/SPARC architecture consists of three levels of abstraction (see Figure 2.6). The external level represents the way data is viewed by individual users. The conceptual level represents the way the organisational data (i.e. all data that is relevant for the organisation) is structured. The internal level represents the way data is physically stored, although the very lowest-level aspects of that are likely to be handled by the operating system itself.
The features of the DBMS will be considered in detail over the course of this chapter. Briefly, the DBMS provides support for schema definition, data manipulation, data security and data integrity.
The application programs can be of two kinds: 1. user developed; 2. provided together with the DBMS by its developer. The former class of applications will generally be written in a high-level programming language, such as C, Java or Python. Support for database access in such languages is provided by means of a data sub-language, embedded within the host language. Statements written in the embedded sub-language are processed and passed on to the DBMS using the appropriate routines.
Physical level concept for data
The concept of a physical level for data is one to which we will return later. The structure we describe is not purely physical, but we use the term to indicate that it is to some extent platform dependent, because access to files is made through the primitives (built-in functionality) of the operating system.
Conceptual level
The conceptual level. This represents the logical structure of the database (of all the data required by the organisation). It can be seen as the union of the whole set of views at the external level. Conversely, any view should be derivable from the conceptual level. The conceptual level represents the information stored in the database about the real life system's entities (objects) and the relationships between them. The representation of data at this level is still independent from any physical considerations - it specifies what is stored, rather than how it is stored
DBMSs and database languages
The database management system is the software through which all access to the database is made. This is a concise but limited definition. In reality the DBMS is responsible for much more. Some of its important features are presented below.
External level
The external level. This incorporates each user's external view of the database. A user's view consists only of the data needed by that user. Other data may exist in the database, but the user does not need to be aware of it.
What is a database?
The first - and most obvious - question to ask when you take up this subject is the simplest - 'What is a database?' Certainly, you will have dealt with them, indirectly, almost daily. Whether you are in a shop in person or whether you are exploring its catalogue on the internet, when you check whether a product is in stock, it is likely that a database will be used somewhere within the system. Amazon and Facebook, YouTube and iTunes all use databases to deliver products and services to their users. The database and its structure may be quite obvious to the user for a library catalogue or an online retailer, but it may also be serving a less direct purpose, allowing the company to keep track of its employees and suppliers, or helping an advertiser track visitors to web pages across different sites, tailoring their adverts to match a browser's activity.
Hardware
The hardware and the OS are often grouped together and called the platform. There is considerable variation between platforms, which is one reason for having the DBMS software handle this variation and present a more abstracted interface to higher-level components. This provides a platform independence that shields the application programs from unnecessary physical details, and means that we need not concern ourselves with details of hardware or OS for the remainder of these subject guides. Instead, we focus on the features provided for the application programs by the DBMS.
Internal level
The internal level. This describes the physical representation of data. The internal level specifies how data is stored. It is at this level where the physical data structure and file organisation are defined. The internal level is situated at the interface between the DBMS and the Operating System (OS). It is quite common that the internal level of the DBMS uses the file management primitives of the OS. However, there is no clearly defined boundary between the OS and the internal level. Because of this, there often exists another level below the internal level, namely the actual physical support (cylinders, blocks, clusters, etc.) (this extra layer is shown in Figure 2.6).
The other mapping - between the Switchboard's view and the conceptual schema - is self-explanatory.
The internal schema consists of the data structures that are used to represent (implement) the conceptual schema. For the above example, this is struct STAFF, in a C-like hypothetical language. It also can include other structures (i.e. not derived from the logical level), used for pragmatic reasons (e.g. efficiency). In the above example, an index was defined, INDEXES, in order to make the retrieval operations (from Table_Employees) faster. The conceptual/internal mapping links the definition of data at the conceptual level with the way it is actually represented - it links what data is represented with how it is represented. The table Employees is implemented as an array of records of type struct STAFF. Note that the index, Index_Employees, is used purely at the internal level (i.e. it is not mapped to the conceptual level). This is because the index does not describe the data as such - it is a way of making access to the data faster or more efficient.
Software
The software component can be seen as consisting of three layers (Figure 2.11): • the operating system (OS), positioned at the base, provides the necessary routines for accessing the hardware resources (such as file handling or memory management routines); • the database management system (DBMS), placed above the OS - and using the routines that the OS makes available - provides all the necessary primitives for data management, including languages for defining schemas, manipulating and reading data and so on; • application programs, above the DBMS - and using the routines made available by the DBMS - provide data formats and computations beyond the capabilities of the DBMS
Snapshot od database
explored in this guide. You might choose to store intermediate results that are called for often (using snapshots) or to ensure the atomicity of a set of operations (transactions). In these cases, the redundant data is intentionally included to achieve something extra, and even so requires special treatment.
Raw data and dbms
the raw data is integrated in a common database for all applications. The data is managed by a database management system (DBMS), which provides shared access to it, for all the applications in the system.
The database schema consists of three types of schemas, one for each level of abstraction
• External schemas describe the external level. There is one schema for each view. • The conceptual schema (one only) describes the conceptual level. All the definitions should only take the logical structure of the data into consideration. Implementation aspects, and user views, should be disregarded. • The internal schema describes the internal level. It defines the physical records, methods of representation, index implementation, etc. As a result of this separation of concerns, mappings are needed to allow navigation between the schemas. Since there are three types of schema, there are two types of mapping: • External/conceptual defines the correspondence between an external schema and the conceptual schema. • Conceptual/internal defines the correspondence between the conceptual and the internal level.
Application programmer
Application programmer. The application programmer writes programs that perform more complex processing of data (either computations or formatting). For this, they use either a third-generation language, embedded with a database language, or a fourth-generation tool. The resulting programs are for use by end users.
Database management system
A database management system automatically performs a lot of the housekeeping tasks that would otherwise be the responsibility of the application programmer. As a result, the user - i.e. the person who defines and uses the database - is presented with a clean and powerful set of tools for database development and exploitation. A more detailed description of both database systems and database management systems is provided in the following sections. At this stage, it is important that you broadly understand why database systems are needed and what their main benefits are. This definition, with its distinction between file-based and database approaches is quite high-level and functional. From the discussion above, many organisations that use database software would still be defined as having a file-based system, if different departments use different database implementations for storing similar data.
Database management approach
A database management system is a software system that provides a set of primitives (built-in functionality) for defining, accessing and maintaining a database. A database stores both the raw data and its description. We say that the information3 stored in a database is self-describing. The description of the raw data is known as the system dictionary, data dictionary or metadata. The consequence of this approach is program-data independence. This means that the structure of data may change without affecting the application programs that use it. This basic definition is going to be refined and better explained later in this chapter. This approach - separating the data definition from application programs - is similar to data abstraction in programming, where the internal definition of an object is kept separate from its external definition. An outside system can only see the exterior of the object. As far as the external definition remains unchanged, any changes in the object's internal definition do not affect the outside system.
Database definition
A database system is a system that stores data. To qualify as a database system, there are some features that it would have to offer: • find (retrieve) data • add (insert) new data • delete unwanted data • change (update) data.
Dervied data
Derived data may be the output of the application programs - the result of processing the primary data - in a form suitable for the users' needs, but it can also be the input from users that will then be processed by the application to be stored as primary data. The focus of a database system is on primary data. This has to be appropriately identified, described and implemented. The primary data has three important characteristics. It is: • integrated, rather than existing in separate systems - it has been gathered together into a single system5 • shared, with all the applications belonging to the information system having common access to (at least parts of) it • extensive, in that database systems are usually developed for data intensive applications, where their benefits are more clearly felt. Stored data, as we have already seen, does not include only the raw data, but also its description - the metadata, system dictionary or catalogue.
Drawbacks of file based duplication
Different applications might have to make use of the same information. Because each application has its own files, data is duplicated (e.g. the 'identification' information in our example). This aspect has at least two negative consequences. Firstly, duplication is wasteful.1 Secondly, data can become inconsistent - it can have different values in different files (belonging to different applications), even though it is supposed to give the same piece of information. For example, the address of an owner, Mr. J. Morris, might be updated in the Owner file belonging to the Sales Department, while the Contracts Department might still have Mr. Morris's old address.
End users
End user. The end users are the 'beneficiaries' of the database system. They may range from technically naïve to extremely sophisticated. A technically naïve user, for example a bank employee, may interact with the system using application programs developed for specific tasks. A naïve user does not have to be aware of the functionality of the DBMS. All they need is reliable and easy to use programs that they can use with minimal fuss. A sophisticated user, on the other hand, will know how to access the database directly, through the database language supported by the DBMS. Sometimes a sophisticated user might even develop applications, and so become an application programmer.
Conceptual and internal mapping
For instance, in the example above, the Employees table can be implemented using a linked list. If the conceptual/internal mapping is modified appropriately, and as long as the conceptual schema stays the same, the application programs (situated above) remain unaffected. Logical data independence may be more difficult to achieve since application programs typically rely heavily on the logical structure of data. However, suppose another view is needed, for the Personnel department, which requires information about the address and the family status - such as marital status, dependants and next of kin - for each employee. The conceptual schema can be extended as necessary, without affecting the other two views.
Example
For instance, suppose that the database of a software company includes information about its employees. The Personnel Department's view of the employees - the data that is relevant to them - might consist of: name, address, sex, date of birth, qualifications, department for which the employee works, current salary, job contract details, and details about previous jobs. The Personnel Department needs to be able to access this data about any employee in the company. On the other hand, the Development Department's view of the employees might consist of: departmental ID, name, telephone number, timetable, the projects in which the employee is involved including the employee's role in each project, the objectives and their deadlines. Only the data about its own employees is relevant for the Development Department.The whole information about the company's employees is stored in the database. However, the two departments need access to different projections of data. The data relevant to each department represents the department's external view of the database
File based system characteristics
In a file-based system, permanent data is stored in various files of ad-hoc structures. Each application program defines and handles its own data files independently of the others. This approach is called the de-centralised approach. Each application program works with its data at the physical level, manipulating records as they are organised in persistent memory. Sharing of data between applications is likely to be limited.
Example of database
Let us now explain this definition. A database is a large repository of data, in which data is defined once and stored once. Data that was scattered in different files - with different formats and owners - in the file-based approach, is now integrated with minimum redundancy (duplication), as a single resource. Different application programs will share this common resource, usually concurrently (at the same time).
Data models
Physical data independence is one of the main advantages of database systems. This is achieved based on the conceptual level. Users work with data - both defining and manipulating it at the conceptual level, while the DBMS takes care of the physical details. We have also mentioned that at the conceptual level, data is described purely in terms of its intrinsic characteristics - its logical structure. The definition and manipulation of data happens at the conceptual level by reference to a modelling theory. The theory we will primarily be referring to in this course is called relational theory or the relational model. This remains the most common data modelling theory for database systems, although there is increasing competition from other models. The relational theory consists of: • concepts, relational data objects by means of which data is modelled • operators, which support the manipulation of the objects in the model • rules, specifying how the concepts and operators are allowed to be combined. Relational theory provides us with the components that we need to model information and the relationships between parts of the information. It allows us to define the types of information - such as numbers and text - that will be stored and to define constraints on it - for instance to indicate that a date in a particular context must be a past rather than a future date. These are all concepts that will be considered in more detail in the next chapter. Once a suitable data model has been defined, it must be implemented before it can be used. A DBMS that implements relational theory to the extent that it supports the implementation of models defined using relational theory is called a relational DBMS. The result of implementing an abstract data model using a DBMS is a database system (see Figure 2.14). In practice, DBMSs do not fully implement a formal theory, and a restricted subset only will be available. This means that some data models need to be adjusted before they can be implemented as database systems.
As a final point, the issue of data independence can now be reconsidered. One of the main advantages provided by the three-level architecture of a database system is the provision of data independence. There are two types of data independence
Physical data independence is the immunity of application programs to changes at the internal/physical level. Logical data independence is the immunity of application programs to changes at the conceptual level.
Program data dependence
Program-data dependence. Each file belongs to a certain application program. The (physical) structure of data is defined inside the application program. This could easily - and usually does - lead to incompatible file formats between applications, meaning that it becomes impossible to share data between them. Another aspect is that data definition is embedded in the application program. That means that if the physical structure of data is to be changed - for instance, if instead of representing a year with two digits, it is to be represented with four2 - then the application program itself must be changed. Not only that, but the methods of access and manipulation of data are also embedded in the application program (for instance, in previously-defined queries); to change them, the application program must be modified.
The three-level ANSI/SPARC architecture of a database environment
Program-data independence is one of the most important advantages offered by the database approach. This independence can be achieved if the system is abstracted into two or more levels. A low-level abstraction deals with how data is organised on the physical support.4 Meanwhile, a high-level abstraction describes the logical structure of data, irrespective of its physical representation. This separation allows the separation of the design of a database from the details of its implementation.
Dbms
Programs provided by the DBMS developer allow the rapid development of user applications, without the user writing any conventional code. Programming tools abstract away or remove so much functionality in order to allow often application-specific software to be constructed quickly; these are known generically as fourth-generation tools. Home or small business database systems - such as Microsoft Access or OpenOffice Base - provide graphical fourth-generation tools for this purpose. The DBMS can also be referred to as server or backend (server), whereas the application programs are referred to as clients, or front-ends. Clients use the services provided by a server for data management. The division between client and server makes it possible for the server and client to run on different machines, giving rise to the idea of distributed processing, an issue discussed in the 'Database architectures' section and elsewhere in these subject guides.
Advantages and disadvantages of database systems
Reduced redundancy. In a file-based system each application has its own private files. This often leads to data being duplicated in different files, wasting storage space. In a database approach, all data is integrated, reducing or removing unwanted redundancy. There are various reasons why eliminating redundancy completely is often not possible or desirable in a DBMS - and we shall return to these in later chapters. However, where the file-based system forces redundancy in an ad-hoc way, a DBMS should provide mechanisms for specifying redundant data and for controlling it (to maintain the consistency of the database). Avoiding inconsistency. This is largely as a result of the reduced redundancy. A database is in an inconsistent state if the same item of information is stored in at least two places in the database, but with different values. The database approach dramatically reduces that sort of repetition, making the risk of inconsistent data smaller. Even where redundant information is stored, the repetition can be made known to the DBMS, so that the system automatically enforces consistency, so that whenever some changes are made to one set of data, the same changes are propagated to the same version that is duplicated elsewhere. The support provided by most current DBMSs for preventing inconsistencies is limited to a relatively small number of categories, but the mechanism is present. Improved data sharing. Since all data is centralised, the restrictions on which applications and users can see it are ones of security constraints rather than those of system and network architecture. In contrast to having a set of separate file-based systems, here all data is integrated, meaning that more information can be derived from the same amount of data. Both aspects considerably improve the accessibility of data. Data independence. As we have seen in earlier sections, a database approach provides protection for applications from changes in both the physical and - at least to some extent - the logical structure of the data (physical and logical data independence).
Function on dml
This is achieved by means of a data manipulation language (DML). There can be a DML at each level of abstraction. At the external and conceptual level, the DML is concise, comprehensive and easy to use; in other words, the emphasis is on its expressive power - on these levels, efficiency is a secondary goal. On the other hand, at the internal level, the emphasis is placed on the DML's efficiency. This means that its statements are complex - and probably not that straightforwardly expressible - but quite efficient. These languages (DDLs and DMLs) are called data sub-languages because they do not include constructs for the control of flow - they are computationally incomplete (meaning they cannot be used as general purpose programming languages). Users can use them directly in order to define and access the database. However, for applications that require more complex data processing (and formatting) they are usually embedded into a full high-level programming language.
Users
Users, as a component of a database environment, can be classified in four categories, according to the role they play.6
Architectures of database systems
We have seen that when the data of an organisation is integrated in a single database, it can be shared between many applications. Accordingly, a 'natural' organisation of a database system is the client-server architecture (Figure 2.12). The DBMS is the server and the application programs are clients. A server can also be referred to as back-end and a client as front-end.In the client-server architecture, the DBMS (including the database) runs on a dedicated machine - the server machine. The server machine is tailored to support the DBMS, both in terms of storage space and computational power. In high-demand situations, it has to provide: • extensive and fast external (persistent) memory • powerful processing capabilities (fast processors), combined with sufficient internal memory. The main requirement for the server machine is to provide the resources that the DBMS needs to respond efficiently to the requests received from the clients (i.e. to provide what they need at an appropriate speed). Although we speak of the server as a separate machine, it is becoming increasingly common for virtual servers to be used by organisations on a subscription basis. These run on remote servers and may offer advantages, such as easier upgrades and less in-house maintenance, and disadvantages, including questions about data security and privacy. The applications would normally run on different client machines, with each client machine specified to meet the needs of its application or applications. For instance, if an application program performs complex graphical processing, then a more powerful graphical workstation might be required, whereas if an application only performs simple data entry, then a cheaper, less powerful machine might be enough. If these needs change, it is only the client machine that has to be 'modified', making thus the client-server architecture quite flexible.The communication between an application and the DBMS is accomplished through the link between the client machine and the server machine, via a communication network. Distributed database systems can be developed using various different architectures (a chapter in Volume 2 of this subject guide considers them in more detail). Distributed systems may follow the client-server architecture. Another possible way of organising this is to have the database itself distributed on several machines (Figure 2.13).
Users and spplication
• Users and application programs should be freed from considering the aspects of the system related to the physical representation of data, such as storage and accessing details. Instead, they should be able to take into consideration only the logical structure of data. Rather than having to deal with such aspects in each application program it would be much better if these problems were to become the responsibility of the system (DBMS) that manages data. • It should be possible to change the physical representation of data without affecting users, as long as its logical structure is preserved. • As we have seen, the database integrates all the information required within an organisation. Individual users will often only need (or be allowed) access to certain parts of this 'pool of information'. Each user, then, needs to have a customised view of the database and it should be possible to change that view without affecting other users