DATABASE FINAL

transactions cont

a successful transaction is one in which all of the SQL statements are completed successfully
a CONSISTENT database state is one in which all data integrity constraints are satisfied
a successful transaction changes the database from one consistent state to another

isolation

data used during the execution of a transaction cannot be used by a second transaction until the first one is completed

classification example 2

fraud detection
goal: predict fraudulent cases in credit card transactions
approach: use credit card transactions and the associated account holder information as attributes -when does a customer buy, what does he/she buy, how often does he/she pay on time, etc.
label past transactions as fraud or fair transactions; this forms the class attribute
train the model
use this model to detect fraud by observing credit card transactions on an account

scheduling

methods for scheduling conflicting operations in concurrent transactions: -locking methods -timestamping

creating a transaction in Access with ADO and VBA

Sub CreateTransactionADO()
    Dim conn As ADODB.Connection
    On Error GoTo ErrorHandler

    Set conn = New ADODB.Connection
    With conn
        .Provider = "Microsoft.Jet.OLEDB.4.0"
        ' identify remote database to connect to (won't work with DB containing this macro)
        .ConnectionString = "Data Source=" & CurrentProject.Path & "\class.mdb"
        .Open
        .BeginTrans
        ' insert a new lunch record
        .Execute "INSERT INTO Lunch VALUES (17, #12/15/1998#, 209)"
        ' insert the items for that lunch
        .Execute "INSERT INTO Lunch_item VALUES (17, 1, 1)"
        .Execute "INSERT INTO Lunch_item VALUES (17, 2, 1)"
        .Execute "INSERT INTO Lunch_item VALUES (17, 3, 1)"
        .CommitTrans
        .Close
        MsgBox "All four inserts completed."
    End With

ExitHere:
    Set conn = Nothing
    Exit Sub

ErrorHandler:
    If Err.Number = -2147467259 Then    ' problem with opening database
        MsgBox Err.Description
        Resume ExitHere
    Else
        MsgBox Err.Description
        With conn
            .RollbackTrans
            .Close
        End With
        Resume ExitHere
    End If
End Sub

transaction examples

T2: sell 100 units of product X to customer 238 for $500
UPDATE Product
SET    Prod_QOH = Prod_QOH - 100
WHERE  Prod_Code = 'X';
UPDATE Customer
SET    Cust_Balance = Cust_Balance + 500
WHERE  Cust_Number = 238;
the database is in a consistent state only if both database requests complete successfully
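
a minimal sketch of T2 wrapped in an explicit transaction (the BEGIN TRANSACTION syntax varies by DBMS and is assumed here; see Transaction Management below):

BEGIN TRANSACTION;
UPDATE Product
SET    Prod_QOH = Prod_QOH - 100
WHERE  Prod_Code = 'X';
UPDATE Customer
SET    Cust_Balance = Cust_Balance + 500
WHERE  Cust_Number = 238;
COMMIT;  -- both updates are applied together; a ROLLBACK instead would apply neither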

deadlock

an impasse that may result when two or more transactions are each waiting for locks held by the others to be released
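
a hypothetical interleaving that produces a deadlock, written as SQL comments (the transactions and lock targets are illustrative only):

-- time 1: T1 locks row X                 (UPDATE Product ... WHERE Prod_Code = 'X')
-- time 2: T2 locks row Y                 (UPDATE Customer ... WHERE Cust_Number = 238)
-- time 3: T1 requests the lock on row Y  -> must wait for T2
-- time 4: T2 requests the lock on row X  -> must wait for T1
-- result: each transaction waits on the other indefinitely unless the DBMS intervenes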

controlling deadlocks

deadlock prevention: a transaction is aborted if there is a possibility of a deadlock, and rescheduled for later execution
deadlock detection: the DBMS periodically tests the database for deadlocks; if one is found, one transaction (the "victim") is rolled back
deadlock avoidance: transactions obtain all needed locks before execution

data warehouse data characteristics

integrated- consistent format and meaning
subject oriented- organized and summarized to answer questions
time-variant- captures and represents the flow of data over time
nonvolatile- once data enters the warehouse, it is never removed

illustration of nearest neighbor

the k-nearest neighbors of a record x are the data points that have the k smallest distances to x
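
stated formally (D is the set of stored data points and d is the distance metric):

\[
N_k(x) = \{\, x_i \in D : d(x, x_i) \text{ is among the } k \text{ smallest values of } d(x, x_j) \text{ over all } x_j \in D \,\}
\]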

lock terminology

lock granularity refers to the size of the locked resource, ranging from database level (slow data access) through table level and page level down to row level (significant overhead)
an exclusive lock prohibits other users from reading the locked resource
a shared lock allows other users to read the locked resource, but they cannot update it
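
as a sketch, here is how shared and exclusive locks can be requested explicitly in DBMSs that support it (PostgreSQL-style syntax assumed; MS Access has no such statements):

BEGIN;
LOCK TABLE Product IN SHARE MODE;   -- shared lock: others may read Product but not update it
SELECT * FROM Product
WHERE  Prod_Code = 'X'
FOR UPDATE;                         -- exclusive row lock: others cannot lock or update this row
COMMIT;                             -- all locks held by the transaction are released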

transaction log cont'd

the transaction log is typically implemented as one or more files that are stored separately from the database itself it is automatically updated when the DBMS executes transactions that modify the database

general steps of BI

1. collecting and storing operational data
2. aggregating the operational data into decision support data
3. analyzing the decision support data to generate information
4. presenting the information to the end user to support decision making
5. making business decisions and generating more data
6. monitoring results to evaluate outcomes of the business decisions

transactions in MS access

MS Access only supports multi-step transactions when using code to access the DBMS
-the SQL window within Access only allows one SQL statement to be run at a time
-only one user => no problems with conflicts
you can implement multi-step transactions when:
-connecting from other Office applications via VBA
-connecting from a .NET application
-connecting from a web-based application

transaction log

a DBMS maintains a transaction log to support recovery to a consistent state
the transaction log stores:
-a record for the beginning of the transaction
-a record for each transaction component: type of operation, tables/attributes affected, "before" and "after" values for attributes affected
-a record for the end of the transaction (COMMIT)

document clustering

clustering points: twitter feeds/blog comments
similarity measure: how many words are common in these "documents" (after some word filtering)
applications:
1) identify issues with a product more quickly and with greater detail
2) identify the occurrence of/details about a disaster event as it is in the process of occurring (used for flooding in Oklahoma and North Dakota)

concurrency control

the process of managing simultaneous operations on the database without having them interfere with one another
-prevents interference when two or more users are accessing the database simultaneously and at least one is updating data
-although two transactions may be correct in themselves, interleaving of operations may produce an incorrect result

need for concurrency control

three examples of potential problems caused by concurrency:
-lost updates
-uncommitted data
-inconsistent retrievals

time stamping

timestamp: a unique identifier created by the DBMS that indicates the relative starting time of a transaction
transactions are ordered globally so that older transactions (transactions with smaller timestamps) get priority in the event of conflict
conflict is resolved by rolling back and restarting the associated transaction -no locks, so no deadlocks
demands a lot of system resources (both memory and processing overhead)

business intelligence components

what tools are needed in general to monitor business health and support strategic decision making?
typical major components:
-data extraction, transformation, and loading tools
-data store (data warehouse or data mart)
-data query and analysis tool (OLAP)
-data presentation and visualization tools (dashboard)
major software vendors: Microsoft, Oracle, SAP, IBM

consistency

when a transaction is completed, the database must be in a consistent state

example data mining questions

which of the approximately 30k VT students will buy a new car within the next year?
which of our customers would be interested in paying a premium for the latest technology?
would the residents of VA or NC likely provide more revenue for our business if we chose to relocate?
which of our high risk mortgage applicants are likely to file for bankruptcy?

databases and reporting queries are predicated on a number of assumptions:

1. Predictable Frequency. Data is extracted from source systems at regular intervals, typically measured in days, months, and quarters.
2. Static Sources. Data is sourced from controlled, internal systems supporting established and well-defined back-office processes.
3. Fixed Models. Data structures are known and modeled in advance of analysis. This enables the development of a single schema to accommodate data from all of the source systems, but adds significant time to the upfront design.
4. Defined Queries. Questions to be asked of the data (i.e., the reporting queries) are pre-defined. If not all of the query requirements are known upfront, or requirements change, then the schema has to be modified to accommodate changes.
5. Slow-changing Requirements. Rigorous change control is enforced before the introduction of new data sources or reporting requirements.
6. Limited Users. The consumers of BI reports are typically business managers and senior executives.

two phase locking

2PL guarantees serializability
-one of the most common techniques used to achieve serializability
-transactions are allowed to obtain as many locks as necessary (growing phase)
-once the first lock is released (shrinking phase), no additional locks can be obtained
-two phase locking doesn't prevent deadlocks
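
a sketch of the two phases using explicit row locks (PostgreSQL-style SELECT ... FOR UPDATE assumed; under strict 2PL, as most DBMSs implement it, the shrinking phase happens all at once at COMMIT):

BEGIN;
SELECT * FROM Product  WHERE Prod_Code   = 'X' FOR UPDATE;  -- growing phase: acquire first lock
SELECT * FROM Customer WHERE Cust_Number = 238 FOR UPDATE;  -- growing phase: acquire second lock
UPDATE Product  SET Prod_QOH     = Prod_QOH - 100     WHERE Prod_Code   = 'X';
UPDATE Customer SET Cust_Balance = Cust_Balance + 500 WHERE Cust_Number = 238;
COMMIT;  -- shrinking phase: once any lock is released, no new lock may be acquired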

Transaction Management

ANSI SQL standards for managing transactions:
COMMIT: permanently records all changes in the database; automatically ends the transaction
ROLLBACK: aborts all uncommitted changes; the database is rolled back to its previous consistent state; a rolled back transaction can typically be restarted later
BEGIN TRANSACTION is also used by some DBMSs
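
a minimal sketch of the ANSI statements in use (the exact BEGIN syntax differs across DBMSs):

BEGIN TRANSACTION;
UPDATE Customer
SET    Cust_Balance = Cust_Balance - 500
WHERE  Cust_Number  = 238;
COMMIT;      -- permanently record the change and end the transaction
-- ROLLBACK; -- the alternative ending: abort and restore the previous consistent state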

data warehouse

a database optimized for data analysis and read-only query processing
-the prime objective of a data warehouse is to provide fast and accurate answers to data analysis queries
-data are typically extracted from various sources and then transformed and integrated into the data warehouse
-development of a data warehouse requires time, money, and considerable managerial effort

scheduler

a serializable schedule is a schedule of a transaction's operations in which the interleaved execution of all active transactions yields the same results as if those transactions were executed in serial order
-waiting for one transaction to finish before starting any others may be inefficient
-the built-in scheduler ensures efficient use of the DBMS and CPU by interleaving operations when possible
-if transactions access unrelated data, there is no conflict among individual operations between transactions

data mart

a small, single subject data warehouse subset that provides decision support to a small group of people
-less organizational commitment required
-lower cost
-shorter implementation time (data warehouse: ~1-3 years; data mart: ~6 months-1 year)

how is this different from everyday business decision making

-amount of data available to use
-machine learning
-number of competitors/customers
-speed of data generation/acquisition
-security (with internet)
-multi-national companies: complexity, legal and ethical issues, socio-cultural issues, language barrier

database transaction

any (possibly multi-step) action that reads from and/or writes to a database
-it may consist of a single SQL statement or a collection of related SQL statements
ex: adding a new lunch to the class database requires TWO related INSERT statements, as sketched below
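
the two related INSERT statements, taken from the ADO example in these notes (a lunch record plus one of its items; the #...# date literal is Access syntax):

INSERT INTO Lunch VALUES (17, #12/15/1998#, 209);
INSERT INTO Lunch_item VALUES (17, 1, 1);
-- if only one of the two succeeds, the database is left in an inconsistent state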

properties of a transaction

atomicity, consistency, isolation, durability (ACID)
every transaction MUST exhibit these four properties

classification approach: k-nearest neighbor

basic idea: look at characteristics/attributes
"if it walks like a duck and quacks like a duck, it is probably a duck"

classification: example 3

customer attrition/churn
goal: to predict whether a customer is likely to be lost to a competitor
approach: use detailed records of transactions with each of the past and present customers to find attributes -how often the customer calls, where he/she calls, what time of day he/she calls most, his/her financial status, marital status, etc.
label the customers as loyal or disloyal
develop a model for loyalty

multi dimensional DBMS

data is stored in multi dimensional arrays, typically visualized as being stored as a data cube
-data retrieval is much quicker than with standard relational databases
-provides opportunity to slice and dice data
-foundation for multi dimensional OLAP

classification: example 1

direct marketing
goal: reduce cost of mailing by targeting a set of consumers likely to buy a new cell phone product
approach: use the data for a similar product introduced before
-we know which customers decided to buy and which decided otherwise; this (buy, don't buy) decision (binary attribute) forms the class attribute
-collect various demographic, lifestyle, and company interaction related information about all such customers -type of business, where they stay, how much they earn, etc.
-use this information as input attributes to learn a classifier model

clustering example 2

document clustering:
goal: to find groups of documents that are similar to each other based on the important terms appearing in them
approach: identify frequently occurring terms in each document, form a similarity measure based on the frequencies of different terms, and use it to cluster
gain: information retrieval can utilize the clusters to relate a new document or search term to clustered documents

transaction examples

examine the current account balance:
SELECT Cust_Number, Cust_Balance
FROM   Customer
WHERE  Cust_Number = 238;
transaction: read from the Customer table
no changes are made to the database => the database remains in a consistent state after the transaction

example- multi dimensional DB

fact tables:
-sales: sales_quantity, sales_price, sales_total
-orders: order_quantity, order_price, order_amount
dimension tables:
-time: time_year, time_quarter, time_month, time_day
-product: prod_brand, prod_color, prod_size, prod_package
-location: region_ID, loc_state, loc_city, loc_office
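
a minimal DDL sketch of one dimension table and the sales fact table (column types and key columns are assumptions, not from the notes):

CREATE TABLE Time_Dim (
    time_id      INT PRIMARY KEY,
    time_year    INT,
    time_quarter INT,
    time_month   INT,
    time_day     INT
);

CREATE TABLE Sales_Fact (
    time_id        INT REFERENCES Time_Dim (time_id),  -- one foreign key per dimension
    sales_quantity INT,
    sales_price    DECIMAL(10,2),
    sales_total    DECIMAL(10,2)
    -- product and location foreign keys omitted for brevity
);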

lost update problem

occurs when a successfully completed update is overwritten by another transaction
example:
-T1 withdraws $10 from an account with initial balance = $100
-T2 deposits $100 into the same account
the final balance should be $190, but incorrectly overlapping transactions may lead to a final balance of only $90
result: T2's update is lost
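
the schedule that loses T2's update, written as comments (an Account table with a Bal column is assumed):

-- initial state: Bal = 100
-- time 1: T1 reads Bal              -> 100
-- time 2: T2 reads Bal              -> 100
-- time 3: T2 writes Bal = 100 + 100 -> 200
-- time 4: T1 writes Bal = 100 - 10  -> 90   (overwrites time 3; T2's deposit is lost)
-- correct serial result: 190; this interleaved result: 90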

classification: description

given a collection of records
-each record contains a set of attributes; one of the attributes is the dependent variable/class
find a model to predict the class attribute as a function of the values of the other attributes
goal: previously unseen records should be assigned to a class as accurately as possible
-a test set is used to determine the accuracy of the model; usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it

unsupervised algorithm: clustering

given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
-data points in one cluster are more similar to one another
-data points in separate clusters are less similar to one another
similarity measures:
-euclidean distance (if attributes are continuous)
-other problem specific measures
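
for records x = (x_1, ..., x_n) and y = (y_1, ..., y_n) with continuous attributes, the euclidean distance is:

\[
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
\]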

online analytical processing

-graphical user interface
-analytical processing logic
-data processing logic
advanced functionality for:
-data presentation
-data aggregation
-data modeling and analysis
capacity for multi dimensional analysis
used with both transactional DBs and data warehouses

link to big data and business intelligence document

https://www.czobel.bit.vt.edu/bit4514/handouts/BigData_and_Business_Intelligence.pdf

transactions for MS Access

https://www.czobel.bit.vt.edu/bit4514/handouts/CreateTransactionADO.pdf

SQL Injection handout

https://www.czobel.bit.vt.edu/bit4514/handouts/SQL_injection.pdf

example of a parameterized query as a stored procedure

https://www.czobel.bit.vt.edu/bit4514/handouts/Stored_Procedure.pdf

JSON example 2

https://www.czobel.bit.vt.edu/bit4514/handouts/json-2.pdf

JSON example 1

https://www.czobel.bit.vt.edu/bit4514/handouts/json.pdf

origins of data mining

ideas come from many disciplines including machine learning/AI, pattern recognition, statistics, and database systems
traditional techniques may be unsuitable due to: enormity of data, high dimensionality of data, heterogeneous distributed nature of data

choosing the value of k

if k is too small the model is sensitive to noise if k is too large it may include too many points from other classes

transaction management

improper or incomplete transactions can have a devastating effect on database integrity
ex: inserting items into the Lunch_item table without the corresponding Lunch record
if a DBMS supports transaction management, it will roll back an inconsistent database (i.e. the result of an unsuccessful transaction) to a previous consistent state

what is data mining

many definitions:
-non-trivial extraction of implicit, previously unknown, and potentially useful information from data
-exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

clustering: example 1

market segmentation:
goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix
approach: collect different attributes of customers based on their geographical and lifestyle related information
-find clusters of similar customers
-measure the clustering quality by observing buying patterns of customers in the same cluster vs those from different clusters

managing information

master data management (MDM): provides for a comprehensive and consistent definition of all data in an organization; ensures uniform and consistent views of all data
supports proper governance for controlling and monitoring business health; creates accountability

inconsistent retrievals problem

occurs when a transaction reads several values, but a different transaction updates some of them in the midst of the process
-some data are read before they are changed and others after they are changed, yielding inconsistent results
example: T6 is totaling the balances of account x ($100), account y ($50), and account z ($25)
meanwhile, T5 transfers $10 from balx to balz after x is totaled but before z is, so T6 gets a result that is $10 too high

uncommitted data problem

occurs when one transaction accesses the intermediate results of another transaction before they are committed, and the second transaction is then rolled back
example: T4 updates bal to $200, but there is an error in a later UPDATE, so bal is rolled back to its original value of $100
-after the bal update, but before the rollback, T3 reads the new value of bal ($200) and uses this as the basis of a $10 reduction, giving a final new balance of $190, not $90

durability

once transaction changes are committed (successfully applied to the database) they cannot be undone or lost due to a subsequent failure

different types of data

operational data (transactional databases)
-stored in highly normalized tables in a relational database
-dynamically updated
-focus of traditional information systems
decision support data (data warehouse)
-stored in formats that facilitate data extraction, data analysis, and decision making
-often aggregated
-often with redundancies

optimistic vs pessimistic locking

optimistic locking- assumes that no transaction conflict(s) will occur:
-the DBMS applies the transaction's changes to a temporary file and checks whether a conflict occurred
-if not, the transaction is finished
-if so, the transaction is repeated until there is no conflict
pessimistic locking- assumes that conflict(s) will occur:
-locks are issued before a transaction is processed, and then the locks are released
optimistic locking is acceptable for applications with few update operations

assumptions of traditional BI platforms

predictable frequency- data is extracted from source systems at regular intervals, typically measured in days, months, and quarters
static sources- data is sourced from controlled, internal systems supporting established and well-defined back-office processes
fixed models- data structures are known and modeled in advance of analysis
defined queries- questions to be asked of the data (i.e. reporting queries) are predefined
slow-changing requirements- rigorous change control is enforced before the introduction of new data sources or reporting requirements
limited users- the consumers of BI reports are typically business managers and senior executives

nearest neighbor classifier

requires 3 things:
-the set of stored records
-a distance metric to compute the distance between records
-the value of k, the number of nearest neighbors to retrieve
to classify an unknown record:
-compute its distance to the training records
-identify the k nearest neighbors
-use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g. by taking a majority vote or weighting votes by distance)
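
a sketch of the classification step as a single SQL query (MySQL/PostgreSQL-style; the Training table, its attribute columns x1 and x2, the query point (2.0, 3.5), and k = 5 are all hypothetical):

SELECT class_label, COUNT(*) AS votes
FROM (
    SELECT class_label
    FROM   Training
    ORDER BY POWER(x1 - 2.0, 2) + POWER(x2 - 3.5, 2)  -- squared euclidean distance to the query point
    LIMIT  5                                          -- keep the k = 5 nearest records
) AS neighbors
GROUP BY class_label
ORDER BY votes DESC                                   -- majority vote
LIMIT 1;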

"Big Data" factors impacting traditional BI databases

semi-structured and unstructured data- typical in mobile, social, and sensor driven applications; cannot be efficiently represented as rows and columns in a relational database
rapid evolution of database schema- to support new data sources and rapidly changing data structures is not possible in relational databases
performance- the overhead of JOINs and transaction semantics prevents relational databases from keeping pace with the ingestion of high-velocity data sources
quickly growing data volumes- require scaling databases out across commodity hardware, rather than the scale-up approach typical of most relational databases
all of this leads to growing use of NoSQL databases and data formats like JSON

business intelligence

a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information
data -> information -> decisions -> actions
term coined by the Gartner Group in the early 1990s

monitoring business health

setting goals
defining key performance indicators (KPIs)
-quantifiable measurements that assess a company's effectiveness in reaching its goals
examples
-general: reputation
-financial: net worth, profit
-human resources: number of complaints

multi dimensional DBMS structure

star schema-
fact tables: associated with a particular type of aggregated data (ex: sales table)
dimension tables: attributes provide descriptive information about the facts within a given dimension (ex: product, time, and location tables)

types of data mining algorithms

supervised algorithms (classification):
-learning by example
-use training data which has correct answers (class label attribute)
-create a model by running the algorithm on the training data
-identify a class label for the incoming new data
unsupervised algorithms (clustering):
-do not use training data
-classes may not be known in advance

types of data mining algorithms

supervised (classification)
-decision trees
-regression
-neural networks
-genetic algorithms
-support vector machines
-k nearest neighbors
-association rules
-bayesian classification
unsupervised
-clustering

differences between SQL and NOSQL

tables, schemas, normalization, JOINs, data integrity, transactions, performance, scaling
which is better?
-the old technology (data warehouses)
-the new technology (NoSQL databases)

atomicity

the "all or nothing" property all transaction operations must be completed i.e. a transaction is treated as a single, individisible, logical unit of work

SQL vs NOSQL documents

the difference: https://www.czobel.bit.vt.edu/bit4514/handouts/SQL_vs_NoSQL.pdf
how to choose: https://www.czobel.bit.vt.edu/bit4514/handouts/SQL_vs_NoSQL_2.pdf

