Database Management - Exam 1
business analytics helps us do valuable things like:
- interactive visualization - sales forecasting - customer churn prediction - fraud anomaly detection
Relational algebra
- language for interacting with data stored in relations - a procedural query language -- defined by a series of computational steps - products a result relation by way of a query expression - uses operators (ie. functions)
Join
- notated by ⨝ - used to combine two relations - considered a convenience operation b/c it can be accomplished through combo of select and cartesian product - join operations can sometimes lead to lost data if corresponding data are not available in adjoining table
Operators
- select (aka restrict) - project - rename - cartesian product - union - difference
Desirable Attributes of Organizational Memory
- shareable - transportable - secure - accurate - timely - relevant
LAPP
- simple way of deploying SQL is to use LAPP (for postrgressSQL) acronym for: - Linux (operating system) - Apache (server) - PostgreSQL (RDBMS) - Perl/PHP/Python (programming languages) - LAPP stacks are used to run many major websites
Repeating Group =
- situation that arises when an entity has two ore more attributes that refer to same underlying concept (eg. course 1, course 2 etc)
Four perspectives of Business Analytics:
1. descriptive - what is happening now? 2. inferential - why is it happening? 3. predictive - what will happen in the future? 4. prescriptive - what should be done?
Referential Integrity Constraints =
Every value in a foreign key must have a corresponding, valid value in the referenced primary key
overlap rule =
a supertype may have MORE THAN ONE subtype *indicated with an o in circle
Relation =
a two dimensional data table -- corresponds to an entity in ER
candidate identifier =
an attribute or attributes that can potentially be chosen as the primary identifier
Business Intelligence (BI):
an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance
identifier =
attribute (or combo of attributes) that uniquely identifies an individual instance of an entity type
Domain Constraints =
attributes must be of the same type and consistent with domain rules (eg. attribute numberofchildren could not be negative)
Higher Normal Forms =
beyond 3NF: - Elementary key normal form (EKNF) - Boyce- Codd Normal Form (BCNF or 3.5NF) - Fourth Normal Form (4NF) - Essential Tuple Normal Form (ETNF) - Fifth Normal Form (5NF) - Sixth Normal Form (6NF) - Domain/Key Normal Form (DKNF) * typically sufficient to have tables in 3NF -- to avoid most types of anomalies -- but exceptions to dexist
deletion anomaly =
removal of data leads to unintended or unexpected data loss
Division
- another convenience operation - notated using ÷ and can be used to find tuples in one relation that match tuples in another relation - query notation: RELATION1 ÷ RELATION2
derived attribute =
- attribute whose value is not stored but rather is computed from other attributes - attribute ellipse has a dashed line border
Key elements of database development
- conceptual data model eg. entity ~relationship model - logical database design eg. normalized relations - physical database design: specification of storage tech requirements, specification/creation of approp file structures
Timely=
- current and up to date - necessary to be agile in today's dynamic environment (eg. down to the minute)
Benefits of Database approach cont.
- data independence: immunity of applications to change in storage structures and access techniques, capacity to change conceptual scheme w/o changing application programs, capacity to change physical scheme w/o having to change external or conceptual schema - ease of application development - enhanced data accessibility and responsiveness - reduced program maintenance
Weak entity =
- dependent on a strong entity (identifying owner) meaning cannot exist on its own - does not have a unique identifier (only a partial identifier -- underlined w/ a dashed line) - rectangle with double line border
Steps in database design:
- developing an ERD - mapping ERD in relational model - normalizing to avoid anomalies, reduce redundancy, and increase integrity - instantiating the database (eg. via SQL)
transportable =
- easily moved to a decision maker - for instance, insurance agents in the field
interpersonal (skill of successful data analyst)
- effective written and oral communication skills - mutual understanding and trust - team player
Strong Entity =
- exists independently of other types of entities - has its own unique identifier, underlined w/ a single line
Normalization =
- formal method for creating well designed relational databases -- process of reducing redundancy and increasing data integrity - involved series of stages referred to collectively as normal forms --> normal forms are hierarchical - meaning 3NF must satisfy the rules for 2NF and 1NF
Types of functional dependency (FD):
- full dependency - partial dependency - transitive dependency
database interface (DBI)
- key advantage of SQL is can access data many different ways - use DBI to connect to an existing database from many other applciations
Relational constraints:
- key constraints - domain constraints - referential integrity constraints
Defining Big Data
- large, complex data sets that are difficult to process with traditional data processing tools 3Vs Defn: - Volume - Velocity - Variety - some definitions include additional Vs such as Veracity
data management =
- management of organizational memory (according to watson) - involved designing, using and managing memory systems of modern organizations - requires much more than knowing the tech alone - function of data management = can only be partially accomplished by database tech
Costs of Database Approach
- need new, specialized personnel - need for explicit back up: minimal redundancy - interference w/ shared data: access to data must be synchronized, may cause "deadlock" - possibility of organizational conflict: data defn, coding, data length, access rights
Rename
- notated by p(rho) - used to change the name of attributes within relations - b/c result sets are relations, renaming can be important for conducting various operations - query notation: ρold attr. name(s) → new attr. name(s)(RELATION) or ρnew attr. name(s) ← old attr. name(s)(RELATION)
secure =
- protected from destruction and unauthorized use - consider the counter examples of high profile cases of personal and financial data sold by call center employees
Accurate =
- reliable, precise records - consequences of bad data: improper billing, wrong predictions/decisions
SQL
- structured query language - most widely used database language for querying and manipulating data - operates within codds relational model framework and is based on relational algebra (and relational calculus) - consists of data defn language (DDL), data manipulation language (DML) and data control language (DCL)
Types of join:
- theta (θ) join: general type of join involving any condition type(s) - equijoin: particular case of theta join involving only equality (ie. = comparisons) - natural join: special case of equijoin where all attributes of same name are automatically compared without specification and duplicate tuples are automatically removed -general query notation: RELATION1 ⨝ condition RELATION2
Relevant=
- too much data leads to "info overload" - data must be approp to the decision
Disadvantages of file processing systems =
- uncontrolled redundancy: multiple versions of the same data in diff files, wastage of storage space, need for multiple updates - inconsistent data - inflexibility: unanticipated reports are very difficult to generate - limited data sharing: data being accessed by one application cannot be accessed by other applications - poor enforcement of standards: no automatic way of standardizing data names, data formats, and access restrictions --> synonym (eg sales and sales this month referring to same data item), homonym (eg. invoice referring to customer as well as supplier invoice) - excessive program maintenance: any modification to a data file requires that programs accessing the data files also be modified
Analytical (skill of successful data analyst)
- understanding of organizations and the business environment - problem solving skills along with creativity, imaginativeness - system mindset: ability to see organizations and info systems as systems
technical (skill of successful data analyst)
- understanding of the capabilities and limitations of tools techniques, and methodologies
Criteria for identifiers
- will not change in value - will not be null - avoid intelligent identifiers (eg. containing locations/people that change) * in general -- substitute new, simple keys for long, composite keys
Data Privacy Solutions
-Handle personally identifiable information (PII) and personal health information (PHI) with care and foresight -Employ techniques for increasing privacy and avoiding potential inference of private information: -Use anonymized, de-identified, coded, and/or aggregated information; also consider novel solutions such as differential privacy -Consider security risks involving both data storage and data transmission, especially in regard to mobile apps and the cloud -Make use of informed consent and opt-in vs. opt-out strategies
Important Privacy Topics
-Health Insurance Portability and Accountability Act (HIPAA) -Children's Online Privacy Protection Act (COPPA) -Office for Human Research Protections (OHRP) -EU Data Protection Directive (Directive 95/46/EC)
steps in database design
1. developing an ER diagram 2. mapping the ER diagram into the relational model 3. normalizing to avoid anomalies, reduce, redundancy, and increase integrity 4. Instantiating the database (eg. via SQL)
Database Operations (how data stored in relations are manipulated)
1. insert: add new data 2. modify (update): change existing data 3. delete: remove existing data
Data Security
Data security involves protecting databases from misuse, data loss (malicious or unintentional), damage (to software or physical servers), degraded performance, etc.
trivial functional dependencies =
FDS that are self determining ex. course id--> course ID **trivial FD typically omitted from FD sets
File Processing System =
Initial attempt to manage data using computers - focus on data processing needs of individual departments - application - driven, no overall map, plan or model to guide the growth of applications - each application designed with its own set of data files - to meet needs of new applications, existing data files had to be restructured --> thus existing programs had to be modified - it was often much safer to design new files for each application
Attribute =
a column/field in the relation -- corresponds to an attribute in ER
Key constraints =
a relation must have at least one candidate key such that for every tuple, the value must be unique and no value may be NULL (latter point - entity integrity constraint)
Tuple=
a row of data in the relation -- corresponds to an instance in ER
Relationships =
an assoc. b/w the instances of one or more entity types that is of interest to the organization
Key=
an identifier in the relation -- corresponds to an identifier in ER
knowledge =
capacity of a stakeholder to identify and utilize info, understanding of how and why
ability to transform=
change inter-relationships among work-units, changing sequence of activities, enabling parallelisms, etc
total specialization rule =
each supertype MUST have a subtype **indicated w/ a double line to circle
disjoint rule =
each supertype may have ONLY ONE subtype *indicated with d in circle
EAM
electrical accounting machine - associated w/ purpose of IS = reduce cost of paper processing (50s/60s)
EDP
electronic data processing - associated w/ purpose of IS = reduce cost of paper processing (50s/60s)
Ability to automate =
eliminate human labor by producing a more structured process
specialization/generalization =
how enhanced ER models allow for entities to be categorized as super types consisting of various subtypes
data lake =
less structured/ processed variant of a data warehouse designed to cater to data scientists; more agile and flexible with maximal data diversity
Access Control
limiting either physical (e.g., to servers/rooms) or logical access to data, including restrictions on data views or database operations
Identifying Relationship =
links strong entity to weak entity
MIS
management info system - associated w/ info purpose = operations support (70s/80s)
Max cardinality =
party of the symbol CLOSER to entity
natural (or business) key =
type of identifier/keys that come from the real world (eg. SSN)
data warehouse =
unified, central repository of contemporary and historical data accumulated from a wide range of sources within a company for the purposes of BI or management decision making; highly structured and processed
Encryption
using a cryptographic algorithm to transform data in a way that makes it uninterpretable/unusable without decryption
Partial Dependency =
values in a non-key column are functionally dependent on a part of the composite primary key
Some Capabilities of IS =
- Ability to automate - ability to informate - ability to transform
Information Systems are...
- Integrated/coordinated sets of components and activities designed to accomplish: Data storage, data retrieval, data manipulation, data communication - sometimes referred -- hardware/software aspects of computer and telecommunications technologies - broader/more approp interpretation = not only hardware/software but also people, procedures, and data
Difference exmaple
ex. STUDENT - MORESTUDENT MORESTUDENT - STUDENT
degree=
number of entity types that participate in a relationship (unary/recursive, binary, ternary)
Attributes =
properties or characteristics of an entity (or relationship_ that are of interest to the organization - may be a candidate or primary identifiers) *do not use composite attributes in this class *
SQL Injection
SQL injection involves the entering of malicious SQL code into a data entry field (typically online) to either gain access to the contents of a database (i.e., "dump" the database) or to destroy it
Foreign Key =
a key in a relation referenced from another relation* -- corresponds to relationship in ER
Partial Specialization rule =
a supertype MAY have a subtype **indicated w/ a singe line
unnormalized =
data that do not conform to 1NF
disjointness constraint =
determines whether a supertype can have one or more than one subtype
Union Compatibility
to perform union, difference, or intersection, the following must be true: - both relations must have the same number of attributes - domains of the attributes must be identical/compatible for both relations given the sequence of attributes within each relation
Select
- select is notated by sigma (σ) - used to return all tuples in a relation that satisfy specific rules - rules are known as conditions -- defined using a boolean/logical expression constructed from: - comparison operators including, <, >, <=, =>, =, and <> (<> means not equal) - logical operators including (∧), or (∨), not (¬) - pattern matching and string operations (eg. regular expressions) - Query Notation: σcondition(RELATION)
Entity - Relationship (ER) Model =
- a detailed logical representation of data for an organization or business area - expressed in terms of constructs including entities, attributes etc.
managerial (skill of successful data analyst)
- ability to manage projects, stakeholders, resources, risk and change
Composite Attribute =
- an attribute composed of several sub-attributes - sub-attributes branch off the atomic parent attribute
optional attribute=
- an attribute that may contain missing (NULL) values - attribute name includes (O)
multi- valued attribute =
- an attribute that takes on more than one value for a given entity instance - attribute ellipse has a double line border (ex. skill)
Skills of a successful database/data analyst
- analytical - technical - managerial - Interpersonal
Set Operators
- cartesian product is a binary set operator -- b/c its concerned with tuples across two relations, but there are several other important set operators - result of a union of relations R and S are all tuples that are in R along with all those that are in S (including tuples that are in both R and S) without any duplicate tuples: R u S -the result of a difference of two relations R and S are all tuples that are in R excluding those that are also in S: R - S - Additional set operation is intersection - the result of which involves tuples that appear in both relations R and S: R ∩ S
Project
- project is notated by π (pi) - used to return only attributes of interest in a given relation while also removing duplicate tuples - query notation: πattribute list(RELATION)
shareable =
- readily accessed by more than one person at a time - becomes a challenge as data becomes volatile
benefits of database approach
- redundancy can be reduced --> thus inconsistency can be avoided - data can be integrated and shared among applications - standards can be enforced by the DBA: formats, representation, naming, documentation - security restrictions can be applied - data integrity can be maintained: through minimizing inconsistency/having controls to check against incorrect updates, esp. in multi-user context - conflicting requirements of diff depts. can be balanced by the DBA who (presumably) acts according to "what is best for the enterprise": by considering physical data organization, data formats, etc.
Key Aspects of Database Security:
- secrecy/confidentiality: prevent unauthorized data disclosures - integrity: preventing unauthorized data modification - availability: preventing and recovering from hardware/software errors and other obstructions related to data access - privacy: obtaining/recording user consent and protecting user data after it has been intentionally/voluntarily disclosed
Outer join:
- solve data loss issue - left outer join: ⟕ keep all tuples from the left table - right outer join: ⟖ keep all tuples from the right table - full outer join: ⟗ keep all tuples from both tables
Traditional Approach vs. Database Approach
- traditional approach = process oriented, characteristic of file processing approach - database approach = data driven rather than process driven - database approach emphasizes the integration and sharing of data across the organization
Union Example
- using student and novel morestudent relation (note the attribute names do not match) - ex. STUDENT ∪ MORESTUDENT
Cartesian Product
- way of multiplying two relations such that the attributes of the two relations are concatenated and tuples are bound together as rows For Result: - number of columns = sum of number of attributes in each of the two relations - number of rows = number of rows in first relation multiplied x number of rows in second relation - query notation: RELATION1 × RELATION2
Data base security solutions
A major role of a database administer (DBA) is to develop and maintain database security via planning and policies relating to: -Access Control -Authentication -Auditing -Encryption -Integrity -Backup Plans
Intersect example
STUDENT ∩ MORESTUDENT
SQL vs. Relational Algebra Syntax
Select: SELECT (e.g., SELECT * FROM table;) Project: SELECT col1, col2 FROM table; Rename: AS (e.g., SELECT col1 AS 'newName' FROM table;) Cartesian Product: CROSS JOIN (e.g., SELECT * FROM table1 CROSS JOIN table2; /* or */ SELECT * FROM table1, table2;) Union: UNION (e.g., SELECT * FROM table1 UNION SELECT * FROM table2;) Difference: EXCEPT (e.g., SELECT * FROM table1 EXCEPT SELECT * FROM table2;)
Associative entity =
can be used to help map many to many relationships while also serving as its own entity (making it an alternative to a relationship w/ attributes) - assoc. entity could have meaning independent of other entities - assoc. entity should have a unique identifier (independent of other entities that i tis related) and should have other attributes - assoc. entity may participate in other relationships in addition to assoc. relationship
modification anomaly (update anomaly)=
changing data in one location requires changing data in many locations
database management system =
collect of programs that enables users to define, construct, and manipulate database
database system =
collectively, the database model, the DBMS, and the database itself
Business Analytics:
comprised of solutions used to build analysis models and simulations to create scenarios, understand realities and predict future states. Business analytics includes data mining, predictive analytics, applied analytics and statistics, and is delivered as an application suitable for a business user..."
completeness constraint =
determines whether or not a supertype must have a corresponding subtype
Anomalies =
for more complex relations, database operations may cause problems, inconsistencies, or errors = anomalies -- such as: - insertion anomaly -modification anomaly (update anomaly) - deletion anomaly
Integrity
maintaining meaningful and accurate data through methods including constraints (e.g., referential integrity constraint, entity integrity constraint, data type constraint), sanity/reality checks, and well-formed transactions that keep data in a consistent, reliable state (Clark & Wilson, 1987)
Insert=
may add one or more new tuples/records
modify/update=
may change existing tuples
deletion=
may remove one or more tuples
Authentication
methods for verifying/guaranteeing users' identities
Auditing
monitoring users' activities on the database
Insertion Anomaly =
new data cannot be entered as desired
functional dependency =
occurs when values in certain columns can be used to determine values in other columns - notated as x --> y (x and y represent attribute sets - x = determinant set and y = dependent set) ex. StudentID--> name, major, school, year Student ID= determinant set Name, Major, School, Year = dependent set
Typical Problems with Organizational memory =
organizational memory must be seen as a vast disorganized data warehouse -- problems: - redundancy = same data stored in diff memories - data control = data not managed as a valuable resource - interface = difficult to access data - delays = long delays in responding to requests for data - lack of data integration = data dispersed across different systems it may not be known where data are stored
Min cardinality =
part of the symbol FARTHER from entity
Entities =
person, place, object, event or concept in the user environment about which the organization wishes to maintain data
information=
processed data that is meaningful for a given stakeholder - may answer who, what where, when - what is info for a stakeholder can be data for another *terms data and info are often used interchangeably*
data =
raw, unsummarized, unanalyzed facts *terms data and info are often used interchangeably*
Data Definition Language:
syntax for defining data structures in database schemas: Create, drop, alter, truncate
Cardinality Constraints=
reflect how many instances of one entity can be assoc. w/ instances from another entity
ability to informate=
represent info about events, objects, or work processes as electronic text, thus making them more visible, knowable and shareable
Data control language (DCL)
syntax for the DBA to control access to data within the database: grant, revoke
Data manipulation language (DML)
syntax to manipulate data in a database: select, insert, update, delete
Database =
shared collection of logically-related, persistent data, designed to meet the needs of multiple users usually within an organization
data mart=
subset of data warehouse dedicated to a single project, dept, or team
Backups
the process of copying and archiving of data for the purpose of recovery in the event of unexpected or catastrophic data loss - As part of an organization's disaster recovery plan, the physical location of backups should be thoroughly considered
Surrogate key=
type of identifier/key -- those with no business meaning such as 001, 002 etc. -- sometimes referred to as synthetic, artificial, or system - generated keys
Transitive Dependency =
values in a non-key column are functionally dependent on another non-key column
Full Dependency =
values in a non-key column are functionally determined exclusively by the complete primary key