Database Management - Exam 1

Réussis tes devoirs et examens dès maintenant avec Quizwiz!

business analytics helps us do valuable things like:

- interactive visualization - sales forecasting - customer churn prediction - fraud anomaly detection

Relational algebra

- language for interacting with data stored in relations - a procedural query language -- defined by a series of computational steps - products a result relation by way of a query expression - uses operators (ie. functions)

Join

- notated by ⨝ - used to combine two relations - considered a convenience operation b/c it can be accomplished through combo of select and cartesian product - join operations can sometimes lead to lost data if corresponding data are not available in adjoining table

Operators

- select (aka restrict) - project - rename - cartesian product - union - difference

Desirable Attributes of Organizational Memory

- shareable - transportable - secure - accurate - timely - relevant

LAPP

- simple way of deploying SQL is to use LAPP (for postrgressSQL) acronym for: - Linux (operating system) - Apache (server) - PostgreSQL (RDBMS) - Perl/PHP/Python (programming languages) - LAPP stacks are used to run many major websites

Repeating Group =

- situation that arises when an entity has two ore more attributes that refer to same underlying concept (eg. course 1, course 2 etc)

Four perspectives of Business Analytics:

1. descriptive - what is happening now? 2. inferential - why is it happening? 3. predictive - what will happen in the future? 4. prescriptive - what should be done?

Referential Integrity Constraints =

Every value in a foreign key must have a corresponding, valid value in the referenced primary key

overlap rule =

a supertype may have MORE THAN ONE subtype *indicated with an o in circle

Relation =

a two dimensional data table -- corresponds to an entity in ER

candidate identifier =

an attribute or attributes that can potentially be chosen as the primary identifier

Business Intelligence (BI):

an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance

identifier =

attribute (or combo of attributes) that uniquely identifies an individual instance of an entity type

Domain Constraints =

attributes must be of the same type and consistent with domain rules (eg. attribute numberofchildren could not be negative)

Higher Normal Forms =

beyond 3NF: - Elementary key normal form (EKNF) - Boyce- Codd Normal Form (BCNF or 3.5NF) - Fourth Normal Form (4NF) - Essential Tuple Normal Form (ETNF) - Fifth Normal Form (5NF) - Sixth Normal Form (6NF) - Domain/Key Normal Form (DKNF) * typically sufficient to have tables in 3NF -- to avoid most types of anomalies -- but exceptions to dexist

deletion anomaly =

removal of data leads to unintended or unexpected data loss

Division

- another convenience operation - notated using ÷ and can be used to find tuples in one relation that match tuples in another relation - query notation: RELATION1 ÷ RELATION2

derived attribute =

- attribute whose value is not stored but rather is computed from other attributes - attribute ellipse has a dashed line border

Key elements of database development

- conceptual data model eg. entity ~relationship model - logical database design eg. normalized relations - physical database design: specification of storage tech requirements, specification/creation of approp file structures

Timely=

- current and up to date - necessary to be agile in today's dynamic environment (eg. down to the minute)

Benefits of Database approach cont.

- data independence: immunity of applications to change in storage structures and access techniques, capacity to change conceptual scheme w/o changing application programs, capacity to change physical scheme w/o having to change external or conceptual schema - ease of application development - enhanced data accessibility and responsiveness - reduced program maintenance

Weak entity =

- dependent on a strong entity (identifying owner) meaning cannot exist on its own - does not have a unique identifier (only a partial identifier -- underlined w/ a dashed line) - rectangle with double line border

Steps in database design:

- developing an ERD - mapping ERD in relational model - normalizing to avoid anomalies, reduce redundancy, and increase integrity - instantiating the database (eg. via SQL)

transportable =

- easily moved to a decision maker - for instance, insurance agents in the field

interpersonal (skill of successful data analyst)

- effective written and oral communication skills - mutual understanding and trust - team player

Strong Entity =

- exists independently of other types of entities - has its own unique identifier, underlined w/ a single line

Normalization =

- formal method for creating well designed relational databases -- process of reducing redundancy and increasing data integrity - involved series of stages referred to collectively as normal forms --> normal forms are hierarchical - meaning 3NF must satisfy the rules for 2NF and 1NF

Types of functional dependency (FD):

- full dependency - partial dependency - transitive dependency

database interface (DBI)

- key advantage of SQL is can access data many different ways - use DBI to connect to an existing database from many other applciations

Relational constraints:

- key constraints - domain constraints - referential integrity constraints

Defining Big Data

- large, complex data sets that are difficult to process with traditional data processing tools 3Vs Defn: - Volume - Velocity - Variety - some definitions include additional Vs such as Veracity

data management =

- management of organizational memory (according to watson) - involved designing, using and managing memory systems of modern organizations - requires much more than knowing the tech alone - function of data management = can only be partially accomplished by database tech

Costs of Database Approach

- need new, specialized personnel - need for explicit back up: minimal redundancy - interference w/ shared data: access to data must be synchronized, may cause "deadlock" - possibility of organizational conflict: data defn, coding, data length, access rights

Rename

- notated by p(rho) - used to change the name of attributes within relations - b/c result sets are relations, renaming can be important for conducting various operations - query notation: ρold attr. name(s) → new attr. name(s)(RELATION) or ρnew attr. name(s) ← old attr. name(s)(RELATION)

secure =

- protected from destruction and unauthorized use - consider the counter examples of high profile cases of personal and financial data sold by call center employees

Accurate =

- reliable, precise records - consequences of bad data: improper billing, wrong predictions/decisions

SQL

- structured query language - most widely used database language for querying and manipulating data - operates within codds relational model framework and is based on relational algebra (and relational calculus) - consists of data defn language (DDL), data manipulation language (DML) and data control language (DCL)

Types of join:

- theta (θ) join: general type of join involving any condition type(s) - equijoin: particular case of theta join involving only equality (ie. = comparisons) - natural join: special case of equijoin where all attributes of same name are automatically compared without specification and duplicate tuples are automatically removed -general query notation: RELATION1 ⨝ condition RELATION2

Relevant=

- too much data leads to "info overload" - data must be approp to the decision

Disadvantages of file processing systems =

- uncontrolled redundancy: multiple versions of the same data in diff files, wastage of storage space, need for multiple updates - inconsistent data - inflexibility: unanticipated reports are very difficult to generate - limited data sharing: data being accessed by one application cannot be accessed by other applications - poor enforcement of standards: no automatic way of standardizing data names, data formats, and access restrictions --> synonym (eg sales and sales this month referring to same data item), homonym (eg. invoice referring to customer as well as supplier invoice) - excessive program maintenance: any modification to a data file requires that programs accessing the data files also be modified

Analytical (skill of successful data analyst)

- understanding of organizations and the business environment - problem solving skills along with creativity, imaginativeness - system mindset: ability to see organizations and info systems as systems

technical (skill of successful data analyst)

- understanding of the capabilities and limitations of tools techniques, and methodologies

Criteria for identifiers

- will not change in value - will not be null - avoid intelligent identifiers (eg. containing locations/people that change) * in general -- substitute new, simple keys for long, composite keys

Data Privacy Solutions

-Handle personally identifiable information (PII) and personal health information (PHI) with care and foresight -Employ techniques for increasing privacy and avoiding potential inference of private information: -Use anonymized, de-identified, coded, and/or aggregated information; also consider novel solutions such as differential privacy -Consider security risks involving both data storage and data transmission, especially in regard to mobile apps and the cloud -Make use of informed consent and opt-in vs. opt-out strategies

Important Privacy Topics

-Health Insurance Portability and Accountability Act (HIPAA) -Children's Online Privacy Protection Act (COPPA) -Office for Human Research Protections (OHRP) -EU Data Protection Directive (Directive 95/46/EC)

steps in database design

1. developing an ER diagram 2. mapping the ER diagram into the relational model 3. normalizing to avoid anomalies, reduce, redundancy, and increase integrity 4. Instantiating the database (eg. via SQL)

Database Operations (how data stored in relations are manipulated)

1. insert: add new data 2. modify (update): change existing data 3. delete: remove existing data

Data Security

Data security involves protecting databases from misuse, data loss (malicious or unintentional), damage (to software or physical servers), degraded performance, etc.

trivial functional dependencies =

FDS that are self determining ex. course id--> course ID **trivial FD typically omitted from FD sets

File Processing System =

Initial attempt to manage data using computers - focus on data processing needs of individual departments - application - driven, no overall map, plan or model to guide the growth of applications - each application designed with its own set of data files - to meet needs of new applications, existing data files had to be restructured --> thus existing programs had to be modified - it was often much safer to design new files for each application

Attribute =

a column/field in the relation -- corresponds to an attribute in ER

Key constraints =

a relation must have at least one candidate key such that for every tuple, the value must be unique and no value may be NULL (latter point - entity integrity constraint)

Tuple=

a row of data in the relation -- corresponds to an instance in ER

Relationships =

an assoc. b/w the instances of one or more entity types that is of interest to the organization

Key=

an identifier in the relation -- corresponds to an identifier in ER

knowledge =

capacity of a stakeholder to identify and utilize info, understanding of how and why

ability to transform=

change inter-relationships among work-units, changing sequence of activities, enabling parallelisms, etc

total specialization rule =

each supertype MUST have a subtype **indicated w/ a double line to circle

disjoint rule =

each supertype may have ONLY ONE subtype *indicated with d in circle

EAM

electrical accounting machine - associated w/ purpose of IS = reduce cost of paper processing (50s/60s)

EDP

electronic data processing - associated w/ purpose of IS = reduce cost of paper processing (50s/60s)

Ability to automate =

eliminate human labor by producing a more structured process

specialization/generalization =

how enhanced ER models allow for entities to be categorized as super types consisting of various subtypes

data lake =

less structured/ processed variant of a data warehouse designed to cater to data scientists; more agile and flexible with maximal data diversity

Access Control

limiting either physical (e.g., to servers/rooms) or logical access to data, including restrictions on data views or database operations

Identifying Relationship =

links strong entity to weak entity

MIS

management info system - associated w/ info purpose = operations support (70s/80s)

Max cardinality =

party of the symbol CLOSER to entity

natural (or business) key =

type of identifier/keys that come from the real world (eg. SSN)

data warehouse =

unified, central repository of contemporary and historical data accumulated from a wide range of sources within a company for the purposes of BI or management decision making; highly structured and processed

Encryption

using a cryptographic algorithm to transform data in a way that makes it uninterpretable/unusable without decryption

Partial Dependency =

values in a non-key column are functionally dependent on a part of the composite primary key

Some Capabilities of IS =

- Ability to automate - ability to informate - ability to transform

Information Systems are...

- Integrated/coordinated sets of components and activities designed to accomplish: Data storage, data retrieval, data manipulation, data communication - sometimes referred -- hardware/software aspects of computer and telecommunications technologies - broader/more approp interpretation = not only hardware/software but also people, procedures, and data

Difference exmaple

ex. STUDENT - MORESTUDENT MORESTUDENT - STUDENT

degree=

number of entity types that participate in a relationship (unary/recursive, binary, ternary)

Attributes =

properties or characteristics of an entity (or relationship_ that are of interest to the organization - may be a candidate or primary identifiers) *do not use composite attributes in this class *

SQL Injection

SQL injection involves the entering of malicious SQL code into a data entry field (typically online) to either gain access to the contents of a database (i.e., "dump" the database) or to destroy it

Foreign Key =

a key in a relation referenced from another relation* -- corresponds to relationship in ER

Partial Specialization rule =

a supertype MAY have a subtype **indicated w/ a singe line

unnormalized =

data that do not conform to 1NF

disjointness constraint =

determines whether a supertype can have one or more than one subtype

Union Compatibility

to perform union, difference, or intersection, the following must be true: - both relations must have the same number of attributes - domains of the attributes must be identical/compatible for both relations given the sequence of attributes within each relation

Select

- select is notated by sigma (σ) - used to return all tuples in a relation that satisfy specific rules - rules are known as conditions -- defined using a boolean/logical expression constructed from: - comparison operators including, <, >, <=, =>, =, and <> (<> means not equal) - logical operators including (∧), or (∨), not (¬) - pattern matching and string operations (eg. regular expressions) - Query Notation: σcondition(RELATION)

Entity - Relationship (ER) Model =

- a detailed logical representation of data for an organization or business area - expressed in terms of constructs including entities, attributes etc.

managerial (skill of successful data analyst)

- ability to manage projects, stakeholders, resources, risk and change

Composite Attribute =

- an attribute composed of several sub-attributes - sub-attributes branch off the atomic parent attribute

optional attribute=

- an attribute that may contain missing (NULL) values - attribute name includes (O)

multi- valued attribute =

- an attribute that takes on more than one value for a given entity instance - attribute ellipse has a double line border (ex. skill)

Skills of a successful database/data analyst

- analytical - technical - managerial - Interpersonal

Set Operators

- cartesian product is a binary set operator -- b/c its concerned with tuples across two relations, but there are several other important set operators - result of a union of relations R and S are all tuples that are in R along with all those that are in S (including tuples that are in both R and S) without any duplicate tuples: R u S -the result of a difference of two relations R and S are all tuples that are in R excluding those that are also in S: R - S - Additional set operation is intersection - the result of which involves tuples that appear in both relations R and S: R ∩ S

Project

- project is notated by π (pi) - used to return only attributes of interest in a given relation while also removing duplicate tuples - query notation: πattribute list(RELATION)

shareable =

- readily accessed by more than one person at a time - becomes a challenge as data becomes volatile

benefits of database approach

- redundancy can be reduced --> thus inconsistency can be avoided - data can be integrated and shared among applications - standards can be enforced by the DBA: formats, representation, naming, documentation - security restrictions can be applied - data integrity can be maintained: through minimizing inconsistency/having controls to check against incorrect updates, esp. in multi-user context - conflicting requirements of diff depts. can be balanced by the DBA who (presumably) acts according to "what is best for the enterprise": by considering physical data organization, data formats, etc.

Key Aspects of Database Security:

- secrecy/confidentiality: prevent unauthorized data disclosures - integrity: preventing unauthorized data modification - availability: preventing and recovering from hardware/software errors and other obstructions related to data access - privacy: obtaining/recording user consent and protecting user data after it has been intentionally/voluntarily disclosed

Outer join:

- solve data loss issue - left outer join: ⟕ keep all tuples from the left table - right outer join: ⟖ keep all tuples from the right table - full outer join: ⟗ keep all tuples from both tables

Traditional Approach vs. Database Approach

- traditional approach = process oriented, characteristic of file processing approach - database approach = data driven rather than process driven - database approach emphasizes the integration and sharing of data across the organization

Union Example

- using student and novel morestudent relation (note the attribute names do not match) - ex. STUDENT ∪ MORESTUDENT

Cartesian Product

- way of multiplying two relations such that the attributes of the two relations are concatenated and tuples are bound together as rows For Result: - number of columns = sum of number of attributes in each of the two relations - number of rows = number of rows in first relation multiplied x number of rows in second relation - query notation: RELATION1 × RELATION2

Data base security solutions

A major role of a database administer (DBA) is to develop and maintain database security via planning and policies relating to: -Access Control -Authentication -Auditing -Encryption -Integrity -Backup Plans

Intersect example

STUDENT ∩ MORESTUDENT

SQL vs. Relational Algebra Syntax

Select: SELECT (e.g., SELECT * FROM table;) Project: SELECT col1, col2 FROM table; Rename: AS (e.g., SELECT col1 AS 'newName' FROM table;) Cartesian Product: CROSS JOIN (e.g., SELECT * FROM table1 CROSS JOIN table2; /* or */ SELECT * FROM table1, table2;) Union: UNION (e.g., SELECT * FROM table1 UNION SELECT * FROM table2;) Difference: EXCEPT (e.g., SELECT * FROM table1 EXCEPT SELECT * FROM table2;)

Associative entity =

can be used to help map many to many relationships while also serving as its own entity (making it an alternative to a relationship w/ attributes) - assoc. entity could have meaning independent of other entities - assoc. entity should have a unique identifier (independent of other entities that i tis related) and should have other attributes - assoc. entity may participate in other relationships in addition to assoc. relationship

modification anomaly (update anomaly)=

changing data in one location requires changing data in many locations

database management system =

collect of programs that enables users to define, construct, and manipulate database

database system =

collectively, the database model, the DBMS, and the database itself

Business Analytics:

comprised of solutions used to build analysis models and simulations to create scenarios, understand realities and predict future states. Business analytics includes data mining, predictive analytics, applied analytics and statistics, and is delivered as an application suitable for a business user..."

completeness constraint =

determines whether or not a supertype must have a corresponding subtype

Anomalies =

for more complex relations, database operations may cause problems, inconsistencies, or errors = anomalies -- such as: - insertion anomaly -modification anomaly (update anomaly) - deletion anomaly

Integrity

maintaining meaningful and accurate data through methods including constraints (e.g., referential integrity constraint, entity integrity constraint, data type constraint), sanity/reality checks, and well-formed transactions that keep data in a consistent, reliable state (Clark & Wilson, 1987)

Insert=

may add one or more new tuples/records

modify/update=

may change existing tuples

deletion=

may remove one or more tuples

Authentication

methods for verifying/guaranteeing users' identities

Auditing

monitoring users' activities on the database

Insertion Anomaly =

new data cannot be entered as desired

functional dependency =

occurs when values in certain columns can be used to determine values in other columns - notated as x --> y (x and y represent attribute sets - x = determinant set and y = dependent set) ex. StudentID--> name, major, school, year Student ID= determinant set Name, Major, School, Year = dependent set

Typical Problems with Organizational memory =

organizational memory must be seen as a vast disorganized data warehouse -- problems: - redundancy = same data stored in diff memories - data control = data not managed as a valuable resource - interface = difficult to access data - delays = long delays in responding to requests for data - lack of data integration = data dispersed across different systems it may not be known where data are stored

Min cardinality =

part of the symbol FARTHER from entity

Entities =

person, place, object, event or concept in the user environment about which the organization wishes to maintain data

information=

processed data that is meaningful for a given stakeholder - may answer who, what where, when - what is info for a stakeholder can be data for another *terms data and info are often used interchangeably*

data =

raw, unsummarized, unanalyzed facts *terms data and info are often used interchangeably*

Data Definition Language:

syntax for defining data structures in database schemas: Create, drop, alter, truncate

Cardinality Constraints=

reflect how many instances of one entity can be assoc. w/ instances from another entity

ability to informate=

represent info about events, objects, or work processes as electronic text, thus making them more visible, knowable and shareable

Data control language (DCL)

syntax for the DBA to control access to data within the database: grant, revoke

Data manipulation language (DML)

syntax to manipulate data in a database: select, insert, update, delete

Database =

shared collection of logically-related, persistent data, designed to meet the needs of multiple users usually within an organization

data mart=

subset of data warehouse dedicated to a single project, dept, or team

Backups

the process of copying and archiving of data for the purpose of recovery in the event of unexpected or catastrophic data loss - As part of an organization's disaster recovery plan, the physical location of backups should be thoroughly considered

Surrogate key=

type of identifier/key -- those with no business meaning such as 001, 002 etc. -- sometimes referred to as synthetic, artificial, or system - generated keys

Transitive Dependency =

values in a non-key column are functionally dependent on another non-key column

Full Dependency =

values in a non-key column are functionally determined exclusively by the complete primary key


Ensembles d'études connexes

LSCM 3960 Inventory Quiz - Part I

View Set

Management Exam 2 (1-4) (5-9) (10-15)

View Set

PPR Review - Missed Questions & Content

View Set