Database Management


Understand the reasons for data warehousing and the need for time-variant data in a data warehouse

- A business requires an integrated, company-wide view of high-quality information.
- The information systems department must separate informational from operational systems to dramatically improve performance in managing company data.
- Data in the data warehouse contain a time dimension so that they may be used to study trends and changes.

Describe three-tier architectures

- A client/server configuration that includes three layers: a client layer and two server layers. Although the nature of the server layers differs, a common configuration contains an application server and a database server.

Procedure

- A collection of procedural and SQL statements that are assigned a unique name within the schema and stored in the database.

Data mart

- A data warehouse that is limited in scope, whose data are obtained by selecting and summarizing data from a data warehouse or from separate extract, transform, and load processes from source data systems

Default Value

- A default value is the value a field will assume unless a user enters an explicit value for an instance of that field. Assigning a default value to a field can reduce data entry time because entry of a value can be skipped. It can also help to reduce data entry errors for the most common value.
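
As a sketch of a default value in action, using Python's built-in `sqlite3` module (the table and column names here are made up for illustration):

```python
import sqlite3

# A DEFAULT clause supplies a value whenever the user skips the field.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        state       TEXT DEFAULT 'NY'   -- default value for the most common case
    )
""")
# No state supplied, so the default 'NY' is assumed.
conn.execute("INSERT INTO customer (name) VALUES ('Alice')")
state = conn.execute("SELECT state FROM customer").fetchone()[0]
print(state)  # NY
```

Note that the `NOT NULL` constraint on `name` is the null-value control described in the next card: it forces an explicit value where the business rule requires one.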

Null Value

- A null value is an empty value. Each primary key must have an integrity control that prohibits a null value. Any other required field may also have a null value control placed on it if that is the policy of the organization. Many fields legitimately may have a null value, so this control should be used only when truly required by business rules.

Physical data model

- A physical data model defines all of the logical database components and services required to build a database, or it can describe the layout of an existing database.
- A physical data model consists of the tables' structures, the column names and values, the primary and foreign keys, and the relationships among the tables.

Star Schema

- A simple database design in which dimensional data are separated from fact or event data. A dimensional model is another name for a star schema.

Data warehouse

- A subject-oriented, integrated, time-variant, non-updateable collection of data used in support of management decision-making processes. An integrated decision support database whose content is derived from the various operational databases.

What is file organization?

- A technique for physically arranging the records of a file on secondary storage devices.

Thin client

- An application where the client (PC) accessing the application primarily provides the user interfaces and some application processing, usually with no or limited local data storage.

Characteristics of real-time data warehouse

- An enterprise data warehouse that accepts near-real-time feeds of transactional data from the systems of record, analyzes warehouse data, and in near-real-time relays business rules to the data warehouse and systems of record so that immediate action can be taken in response to business events.
- The purpose of real-time data warehousing is to know what is happening, when it is happening, and to make desirable things happen through the operational systems.

Big data

- An ill-defined term applied to databases whose volume, velocity, and variety strain the ability of commonly used relational DBMSs to capture, manage, and process the data within a tolerable elapsed time.

Key decisions and descriptions in the physical database design process

- Choosing the storage format (called data type) for each attribute from the logical data model. The format and associated parameters are chosen to maximize data integrity and to minimize storage space.
- Giving the database management system guidance regarding how to group attributes from the logical data model into physical records. You will discover that although the columns of a relational table as specified in the logical design are a natural definition for the contents of a physical record, this does not always form the foundation for the most desirable grouping of attributes in the physical design.
- Giving the database management system guidance regarding how to arrange similarly structured records in secondary memory (primarily hard disks), using a structure (called a file organization) so that individual records and groups of records can be stored, retrieved, and updated rapidly. Consideration must also be given to protecting data and recovering data if errors are found.
- Selecting structures (including indexes and the overall database architecture) for storing and connecting files to make retrieving related data more efficient.
- Preparing strategies for handling queries against the database that will optimize performance and take advantage of the file organizations and indexes that you have specified. Efficient database structures will be beneficial only if queries and the database management systems that handle those queries are tuned to intelligently use those structures.

Data Definition Language (DDL)

- Commands used to define a database, including those for creating, altering, and dropping tables and establishing constraints.
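
A minimal sketch of the core DDL commands, run through Python's built-in `sqlite3` module (the `product` table is a made-up example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# CREATE defines a table, including a CHECK constraint.
conn.execute("CREATE TABLE product ("
             "product_id INTEGER PRIMARY KEY, "
             "price REAL CHECK (price >= 0))")
# ALTER changes an existing table's definition.
conn.execute("ALTER TABLE product ADD COLUMN description TEXT")
cols = [row[1] for row in conn.execute("PRAGMA table_info(product)")]
# DROP removes the table entirely.
conn.execute("DROP TABLE product")
print(cols)  # ['product_id', 'price', 'description']
```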

Characteristics of data-mining techniques

- Data mining: knowledge discovery, using a sophisticated blend of techniques from traditional statistics, artificial intelligence, and computer graphics.
- The choice of an appropriate technique depends on the nature of the data to be analyzed, as well as the size of the data set.
- Data mining can be performed against either the data marts or the enterprise data warehouse (or both).
- Test, discover, derive, search, compress.

Describe two components of star schema

- Fact tables: contain factual or quantitative data (measurements that are numerical, continuously valued, and additive) about a business, such as units sold, orders booked, and so on.
- Dimension tables: hold descriptive data (context) about the subjects of the business. The dimension tables are usually the source of attributes used to qualify, categorize, or summarize facts in queries, reports, or graphs; thus, dimension data are usually textual and discrete (even if numeric).
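
The two components above can be sketched as a tiny star schema in Python's built-in `sqlite3` module (all table, column, and product names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: descriptive, textual attributes.
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category     TEXT
    );
    -- Fact table: numeric, additive measures keyed to the dimension.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        units_sold  INTEGER,
        revenue     REAL
    );
    INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
    INSERT INTO fact_sales VALUES (1, 10, 50.0), (1, 5, 25.0), (2, 3, 30.0);
""")
# Dimension attributes qualify and summarize the facts:
rows = conn.execute("""
    SELECT d.product_name, SUM(f.units_sold)
    FROM fact_sales f JOIN dim_product d USING (product_key)
    GROUP BY d.product_name ORDER BY d.product_name
""").fetchall()
print(rows)  # [('Gadget', 3), ('Widget', 15)]
```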

What factors need to be considered when choosing a file organization?

- Fast data retrieval
- High throughput for processing data input and maintenance transactions
- Efficient use of storage space
- Protection from failures or data loss
- Minimizing need for reorganization
- Accommodating growth
- Security from unauthorized use

Embedded SQL

- Hard-coded SQL statements included in a program written in another language, such as C or Java.

Describe a rule of thumb for choosing indexes.

- Indexes are most useful on larger tables.
- Specify a unique index for the primary key of each table.
- Indexes are most useful for columns that frequently appear in WHERE clauses of SQL commands, either to qualify the rows to select or to link (join) tables.
- Use an index for attributes referenced in ORDER BY (sorting) and GROUP BY (categorizing) clauses. You do have to be careful, though, about these clauses. Be sure that the DBMS will, in fact, use indexes on attributes listed in these clauses.
- Use an index when there is significant variety in the values of an attribute.
- Before creating an index on a field with long values, consider first creating a compressed version of the values (coding the field with a surrogate key) and then indexing on the coded version. Large indexes, created from long index fields, can be slower to process than small indexes.
- If the key for the index is going to be used for determining the location where the record will be stored, then the key for this index should be a surrogate key so that the values cause records to be evenly spread across the storage space. Many DBMSs create a sequence number so that each new row added to a table is assigned the next number in sequence; this is usually sufficient for creating a surrogate key.
- Check your DBMS for the limit, if any, on the number of indexes allowable per table. Some systems permit no more than 16 indexes and may limit the size of an index key value. If there is such a limit in your system, you will have to choose those secondary keys that will most likely lead to improved performance.
- Be careful of indexing attributes that have null values. For many DBMSs, rows with a null value will not be referenced in the index (so they cannot be found from an index search based on the attribute value NULL). Such a search will have to be done by scanning the file.
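
A quick sketch of the WHERE-clause rule of thumb, using Python's built-in `sqlite3` module (the `orders` table and index name are invented): creating a secondary index on a frequently filtered column lets the DBMS search the index instead of scanning the table, which SQLite's EXPLAIN QUERY PLAN makes visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders ("
             "order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, i % 100, i * 1.5) for i in range(1000)])
# Secondary index on a column that frequently appears in WHERE clauses.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
# The plan's detail column shows whether the index is actually used.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 7"
).fetchone()[3]
print(plan)
```

The printed plan should mention `idx_orders_customer`, confirming an index search rather than a full scan.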

Define and use three types of joins

- Join: a relational operation that causes two tables with a common domain to be combined into a single table or view.
- Equi-join: a join in which the joining condition is based on equality between values in the common columns. Common columns appear (redundantly) in the result table.
- Natural join: a join that is the same as an equi-join except that one of the duplicate columns is eliminated in the result table.
- Outer join: a join in which rows that do not have matching values in common columns are nevertheless included in the result table.
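
The three join types can be demonstrated with Python's built-in `sqlite3` module (the `customer` and `orders` tables are made up for this sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER, cust_name TEXT);
    CREATE TABLE orders   (order_id INTEGER, cust_id INTEGER);
    INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders   VALUES (10, 1);
""")
# Equi-join: equality on the common column; cust_id appears twice (4 columns).
equi = conn.execute(
    "SELECT * FROM customer c JOIN orders o ON c.cust_id = o.cust_id").fetchall()
# Natural join: the duplicate common column is eliminated (3 columns).
natural = conn.execute("SELECT * FROM customer NATURAL JOIN orders").fetchall()
# Outer join: Bob is kept even though he has no matching order.
outer = conn.execute(
    "SELECT c.cust_name, o.order_id FROM customer c "
    "LEFT OUTER JOIN orders o ON c.cust_id = o.cust_id").fetchall()
print(equi, natural, outer)
```

Running this shows the redundant column in the equi-join result, its removal in the natural join, and the `('Bob', None)` row that only the outer join preserves.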

Characteristics of cloud technologies

- On-demand self-service
- Broad network access
- Resource pooling
- Rapid elasticity
- Measured service

How to handle changing dimensions

- Overwrite the current value with the new value, but this is unacceptable because it eliminates the description of the past that we need to interpret historical facts. Kimball calls this the Type 1 method.
- For each dimension attribute that changes, create a current value field and as many old value fields as we wish (i.e., a multivalued attribute with a fixed number of occurrences for a limited historical view). This schema might work if there were a predictable number of changes over the length of history retained in the data warehouse (e.g., if we need to keep only 24 months of history and an attribute changes value monthly). However, this works only under this kind of restrictive assumption and cannot be generalized to any slowly changing dimension attribute. Further, queries can become quite complex because which column is needed may have to be determined within the query. Kimball calls this the Type 3 method.
- Create a new dimension table row (with a new surrogate key) each time the dimension object changes; this new row contains all the dimension characteristics at the time of the change; the new surrogate key is the original surrogate key plus the start date for the period when these dimension values are in effect. A fact row is associated with the surrogate key whose attributes apply at the date/time of the fact (i.e., the fact date/time falls between the start and end dates of a dimension row for the same original surrogate key). We likely also want to store in a dimension row the date/time the change ceases being in effect (which will be the maximum possible date or null for the current row for each dimension object) and a reason code for the change. This approach allows us to record as many dimensional object changes as necessary. However, it becomes unwieldy if rows frequently change or if the rows are very long. Kimball calls this the Type 2 method, and it is the one most often used.
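
The Type 2 method can be sketched with Python's built-in `sqlite3` module. Everything here is illustrative: the `dim_customer` table, the `scd_type2_update` helper, and the use of a NULL `end_date` to mark the current row are one common convention, not the only way to implement Type 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        surrogate_key INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id   INTEGER,   -- original business key
        city          TEXT,      -- slowly changing attribute
        start_date    TEXT,
        end_date      TEXT       -- NULL marks the current row
    )
""")

def scd_type2_update(conn, customer_id, new_city, change_date):
    """Close the current row, then open a new row with a new surrogate key."""
    conn.execute("UPDATE dim_customer SET end_date = ? "
                 "WHERE customer_id = ? AND end_date IS NULL",
                 (change_date, customer_id))
    conn.execute("INSERT INTO dim_customer (customer_id, city, start_date, end_date) "
                 "VALUES (?, ?, ?, NULL)", (customer_id, new_city, change_date))

conn.execute("INSERT INTO dim_customer (customer_id, city, start_date, end_date) "
             "VALUES (1, 'Boston', '2023-01-01', NULL)")
scd_type2_update(conn, 1, 'Denver', '2024-06-01')
history = conn.execute("SELECT city, start_date, end_date FROM dim_customer "
                       "ORDER BY surrogate_key").fetchall()
print(history)
```

After the change, both rows survive: the Boston row now carries an end date, and the Denver row is current, so facts can still be joined to the city that applied on their date.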

Describe advantages of stored procedures

- Performance improves for compiled SQL statements.
- Network traffic decreases as processing moves from the client to the server.
- Security improves if the stored procedure rather than the data is accessed and code is moved to the server, away from direct end-user access.
- Data integrity improves as multiple applications access the same stored procedure.
- Stored procedures result in a thinner client and a fatter database server.

Describe three types of file organization and their advantages, respectively

- Sequential file organization: the storage of records in a file in sequence according to a primary key value.
- Indexed file organization: the storage of records either sequentially or nonsequentially with an index that allows software to locate individual records. An index can point to unique records or to potentially more than one record; an index that allows each entry to point to more than one record is called a secondary key index.
- Hashed file organization: a storage system in which the address for each record is determined using a hashing algorithm.
- Join index: an index on columns from two or more tables that come from the same domain of values. A join index is created as rows are loaded into a database, so the index, like all other indexes discussed, is always up to date. It saves query processing time by finding data meeting a prespecified qualification, at the expense of the extra storage space and maintenance of the index.
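
The hashed organization is the least intuitive of the three, so here is a toy Python sketch of the idea: a hashing algorithm maps each primary key to a bucket (a stand-in for a storage address), so a record can be fetched by recomputing its address rather than scanning the whole file. The bucket count and helper names are invented for illustration.

```python
NUM_BUCKETS = 8
buckets = [[] for _ in range(NUM_BUCKETS)]  # stand-in for disk addresses

def bucket_for(key):
    # Hashing algorithm: primary key value -> storage address.
    return hash(key) % NUM_BUCKETS

def store(key, record):
    buckets[bucket_for(key)].append((key, record))

def fetch(key):
    # Direct access: compute the address, then scan only that one bucket.
    for k, record in buckets[bucket_for(key)]:
        if k == key:
            return record

store(1001, "Alice")
store(1002, "Bob")
print(fetch(1002))  # Bob
```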

Application program interface (API)

- Sets of routines that an application program uses to direct the performance of procedures by the computer's operating system.

Middleware

- Software that allows an application to interoperate with other software without requiring the user to understand and code the low-level operations necessary to achieve interoperability.

How to handle missing data?

- Substitute an estimate of the missing value. Such estimates must be marked so that users know that these are not actual values.
- Track missing data so that special reports and other system elements cause people to resolve unknown values quickly. This can be done by setting up a trigger in the database definition.
- Perform sensitivity testing so that missing data are ignored unless knowing a value might significantly change results. This is the most complex of the methods mentioned and hence requires the most sophisticated programming. Such routines for handling missing data may be written in application programs.
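
The trigger approach to tracking missing data can be sketched with Python's built-in `sqlite3` module. The `sales` table, `missing_data_log` table, and trigger name are all made up; the point is that the DBMS itself records every row that arrives with an unknown value so someone can resolve it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    CREATE TABLE missing_data_log (table_name TEXT, row_id INTEGER, column_name TEXT);
    -- Trigger fires on every insert with an unknown region,
    -- feeding a log that exception reports can draw on.
    CREATE TRIGGER log_missing_region AFTER INSERT ON sales
    WHEN NEW.region IS NULL
    BEGIN
        INSERT INTO missing_data_log VALUES ('sales', NEW.sale_id, 'region');
    END;
""")
conn.execute("INSERT INTO sales (region, amount) VALUES ('East', 100.0)")
conn.execute("INSERT INTO sales (region, amount) VALUES (NULL, 50.0)")
log = conn.execute("SELECT * FROM missing_data_log").fetchall()
print(log)  # only the second insert is logged
```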

Field

- The smallest unit of application data recognized by system software

Purposes of the SQL standard

- To specify the syntax and semantics of SQL data definition and manipulation languages
- To define the data structures and basic operations for designing, accessing, maintaining, controlling, and protecting an SQL database
- To provide a vehicle for portability of database definition and application modules between conforming DBMSs
- To specify both minimal (Level 1) and complete (Level 2) standards, which permit different degrees of adoption in products
- To provide an initial standard, although incomplete, that will be enhanced later to include specifications for handling such topics as referential integrity, transaction management, user-defined functions, join operators beyond the equi-join, and national character sets

Characteristics of quality data

- Uniqueness
- Accuracy
- Consistency
- Completeness
- Timeliness
- Conformance
- Referential integrity

Advantages of user-defined transactions

- User-defined transactions can improve system performance because transactions will be processed as sets rather than as individual transactions, thus reducing system overhead.
- Further, some SQL systems have concurrency controls that handle the updating of a shared database by concurrent users. These can journalize database changes so that a database can be recovered after abnormal terminations in the middle of a transaction. They can also undo erroneous transactions.
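
A sketch of undoing an erroneous transaction, using Python's built-in `sqlite3` module (the `account` table and the transfer scenario are invented): both updates of a transfer belong to one transaction, so when the transfer turns out to be erroneous, a rollback undoes the partial work as a unit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 0.0)])
conn.commit()

# Both updates form one user-defined transaction: they commit together,
# or the erroneous attempt is undone together.
try:
    conn.execute("UPDATE account SET balance = balance - 30 WHERE account_id = 1")
    conn.execute("UPDATE account SET balance = balance + 30 WHERE account_id = 99")
    if conn.execute("SELECT changes()").fetchone()[0] == 0:
        raise ValueError("unknown destination account")
    conn.commit()
except ValueError:
    conn.rollback()   # undo the partial transfer, including the debit

balances = dict(conn.execute("SELECT account_id, balance FROM account"))
print(balances)  # {1: 100.0, 2: 0.0} -- the debit was undone too
```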

High-quality data

- Data that are accurate, consistent, and available in a timely fashion.

Service-oriented architecture (SOA)

- Defines a way to make software components reusable via service interfaces. These interfaces utilize common communication standards in such a way that they can be rapidly incorporated into new applications without having to perform deep integration each time.

Cloud computing

- The practice of using a network of remote servers hosted on the Internet to store, manage, and process data, rather than a local server or a personal computer.

Fat client

- A client PC that is responsible for processing presentation logic, extensive application and business rules logic, and many DBMS functions.

Data type

- A detailed coding scheme recognized by system software, such as a DBMS, for representing organizational data.

Pointer

- A field of data indicating a target address that can be used to locate a related field or record of data.

Data Manipulation Language (DML)

- Commands used to maintain and query a database, including those for updating, inserting, modifying, and querying data.

Dynamic SQL

- Specific SQL code generated on the fly while an application is processing.
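
A sketch of dynamic SQL in Python's built-in `sqlite3` module: the SELECT statement is assembled at run time from whatever filters the user happens to pick, something no statically embedded statement could anticipate. The `employee` table and `build_query` helper are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER, name TEXT, dept TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [(1, 'Ann', 'IT'), (2, 'Ben', 'HR')])

def build_query(filters):
    """Generate the SQL text on the fly from user-chosen filters."""
    sql = "SELECT name FROM employee"
    params = []
    if filters:
        clauses = []
        for column, value in filters.items():
            clauses.append(f"{column} = ?")   # placeholders keep values safe
            params.append(value)
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

sql, params = build_query({"dept": "IT"})
names = [row[0] for row in conn.execute(sql, params)]
print(names)  # ['Ann']
```

Note the generated text still uses `?` placeholders for the values, so dynamic statement construction does not open the door to SQL injection.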

Web Service Description Language (WSDL)

- An XML-based interface description language used to describe the functionality offered by a web service.

