UNT BCIS 4660 FINAL
"The future of data mining lies in predictive analytics," Predictive analysis
Future trends in Data Mining
Keeping Up with Customer Demand Through New BI Deployments New Data Warehousing Solves Problems for Businesses More collaboration and new data warehousing solutions -Data Lakes -Data Fragmented across organizations
Future trends in Data Warehouseing
GUI
Graphical User Interface
Receives input an displays output in 3-tier client-server architecture CLIENTS Web browsers
1. The Presentation Tier
Processes the logic and makes calculations in 3-tier client-server architecture SERVERS Shopping cart code
2. The Application Tier
Stores and manages data in 3-tier client-server architecture SERVERS Shopping and order processing database
3. The Database Tier
Less data entry Ability to create detailed queries Can interface with many 3rd party tools
Advantages of Relational Database
Data visualization tool to use when the audience is you:
Explorational Data Visualizations
ETL
Extract- Get the data Transform- Make it useful Loading- Save it to warehouse
Each cell single valued Entries in a column are same type Rows uniquely identified (no repeats in cells)
First Normal Form
Data visualization tool to use when the audience is your betters:
Formal data visualization
No multi-valued dependencies (Eliminate redundancy in tables)
Fourth Normal Form
The amount of detail in data. Lowest possible is at the individual transaction level) (Very fine)
Grain
____________ break down dimensions into navigational paths, which you can use to get a more granular level of detail in data
Heiarchy
Create tap- Design- Delete query- select field to be deleted- fill in "delete" and "criteria"- RUN!
How to form Delete queries
Create tap- Design- Make Table- Name said table- Right click and show table- Add desired tables to draw information from- format data as desired
How to form Make Table queries
Create- Design- Choose Tables Holding Desired information- Select desired attributes to be displayed
How to form Select queries
Create tap- design- update button- add table- select field to be changed- fill in "update to" and "Criteria" accordingly- RUN!
How to form Update queries
Data visualization tool to use when the audience is your peers:
Informal Data Visualization
Staging area for copies of data taken from Operational Systems Place to make sure all gathered data is in same format (Pounds, Dollars, Pesos all have to interact)
Integration Layer
Turning patterns into knowledge: (user looks at data patterns and makes determinations)
Interpretation
only tuples belonging to A and B, or shared by both A and B are included in the result.
Intersect Operator SELECT dob From A INTERSECT SELECT dob from B
What are the guidelines for a Financial Reporting DW?
Its reports should be as reliable as the existing financial system
denoted by the ⟗ symbol and is used to compound similar tuples from two Relations into single longer tuples. Every row of the first table is joined to every row of the second table. The result is tuples taken from both tables.
Join Operator SELECT A.dob, A.empno from employee JOIN B on B.empno=A.empno
The most often encountered binary (1:N) tabular relationship has minimum cardinalities ______, where the first table in the pair is the parent table
M-O (Mandatory-Optional)
Numerical values that can be added up to provide meaning to demensions (Revenue, Cost, Quantity)
Measures
Some data warehousing trends include:
Need for increasing numbers of highly skilled analysts each year
OLTP
OLTP (online transaction processing) OLTP is a class of software programs capable of supporting transaction-oriented applications on the Internet. Typically, OLTP systems are used for order entry, financial transactions, customer relationship management (CRM) and retail sales.
OLAP
Online Analytical Processing. OLAP performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling.
First Step in data warehouse. (Marketing, Sales, External Data) Already have their own normalized databases
Operational Systems
In relational algebra, you can restrict the output from a join to include only certain columns by using the _____ command (or operation).
PROJECT
250 bytes; 1024 terabytes, or a million gigabytes
Petabyte
Converting target data to Processed data (Use algorithms to remove outliers) (Detect Missing Values)
Pre processing
An attribute or group of attributes that uniquely identifies an instance or entity (Student ID)
Primary Key
used to reorder, select and get rid of attributes from a table. At some point we might want only certain attributes in a relation and eliminate others from our query result.
Project Operator SELECT dob, empno FROM EMPLOYEE
Data visualization tool to use when the audience is everybody:
Public data visualization
Describes how one or more entities interact with each other
Relationships
Student (StudentID, FirstName, LastName, Street, City, State, Zip) Based on the code above, list the complete student table.
SELECT * FROM Student ;
Customer ( CustomerNum, CustomerName, Street, City, State, Zip, Balance, CreditLimit, RepNum )
SELECT CustomerNum, CustomerName, CreditLimit-Balance AS AvailableCredit FROM Customer WHERE CreditLimit>Balance ;
The basic form of an SQL retrieval command is _____.
SELECT-FROM-WHERE
All attributes (Non-Key Columns) dependent on the key (split)
Second Normal Form
used to choose a subset of the tuples(rows) from a relation that satisfies a selection condition, acting as a filter to retain only tuples that fulfills a qualifying requirement.
Select Operator SELECT empno FROM EMPLOYEE WHERE empno=7
Converting Data to Target Data
Selection
Adding a new single column primary key, which is not the same as the source system primary key, to a dimension table is an effective way to address the problem known as _____.
Slowly changing dimensions
DataMarts in global overview of company (Managing director would be interested in this)
Strategic Marts
Low-level DW readiness for management sponsorship is identified by:
Their lack of understanding for what DW development entails
All Fields (columns) can be determined Only by the Key in the table and no other column
Third Normal Form
Data Warehouse is built first, then data mart is created Emphasizes Data Warehouse
Top down design data mart
Converting preprocessed data to transformed data (de-normalize data)
Transformation
Student (StudentID, FirstName, LastName, Street, City, State, Zip) Based on the code above, change the Zip code of the student with ID 11433 to 14455.
UPDATE Student SET Zip='14455' WHERE StudentID='11433' ;
joins or includes all tuples that are in A or in B, eliminating duplicate tuples.
Union Operator SELECT * From A UNION SELECT * From B
Which of the following is a reason for the growth in popularity of data mining? a. Increased volume of data b. Increased awareness of the inadequacy of the human brain to process multi-factorial dependencies or correlations c. Increased affordability of machine learning d. All of the above
d. All of the above
To create the dimensional model from the relatively well normalized transactional database, one must _______. a. determine dimension and facttables c. rebuild the tables b. denormalize the tables d. all of the above
d. all of the above
Uncovering new knowledge, patterns, trends, and rules from the data stored in a data warehouse is known as _____.
data mining
A(n) _____ holds information about the relationships among various entities.
database
A table is in first normal form if _____.
it does not contain repeating groups
On an E-R diagram, the number closest to the rectangle represents _____ cardinality.
maximum
Partial dependencies are dependencies on only a portion of the _____.
primary key
Rows are also called _____.
tuples
A new object added to resolve a many-to-many relationship
Intersection Entity
Represents a person, place or thing you want to track in a data base Will become a table in database ie: students table> full of students (Instances)
Entity
Simple structure of the data - Easy to understand how elements are connected. Simplifies the reporting of the information. Most common - Easy to integrate with another tools. Queries more effective - The queries in these systems are usually simpler since the data doesn't follow some strict rules of normalization. Another reason for this is the lesser number of tables to join. Performance enhancements - The performance has substantial gains due to the de-normalized form of the data. Optimized for large data sets - Due to the best performance of the system and it's queries, the star schema is efficient on data warehouses or data marts with huge data sets. Rapid aggregational actions - Tasks like sum, average, count, and others are performed quickly on this systems. Good for OLAP
Advantages of Star Diagram
EDW/DW/DWH
All mean data Warehouse
The type of table which joins two or more tables and is a real object itself is often referred to as a(n) ______ table.
Association
Which data mining technique utilizes linkage analysis to search operationaltransactions for patterns with a high probability of repetition?
Association analysis
Describes various characteristics about an individual entity Will become "columns" in table
Attribute
Data mart is created first, then they are integrated to get data warehouse Emphasizes Data Marts
Bottom up design data mart
Which end is one, which end is many:
Cardinality
The count of instances that are allowed or necessary between entity relationships
Cardinality Minimum- fewest rows needed for a relation ship Maximum- cannot exceed this number of rows
The product of two tables (or relations) is also called the ____ product.
Cartesian
also referred to as the cross product or cross join, creates a relation that has all the attributes of A and B, allowing all the attainable combinations of tuples from A and B in the result.
Cartesian Product Operator SELECT A.dob, B.empno from A, B
the system maintains several identical replicas of the same relation r in different sites. Data is more available in this scheme. Parallelism is increased when read request is served. Increases overhead on update operations as each site containing the replica needed to be updated in order to maintain consistency.
DDBMS replication
The relation r is fragmented into several relations r1, r2, r3....rn in such a way that the actual relation could be reconstructed from the fragments and then the fragments are scattered to different locations. There are basically two schemes of fragmentation: Horizontal __________ - splits the relation by assigning each tuple of r to one or more _____________. Vertical ____________ - splits the relation by decomposing the schema R of relation r.
DDMBS fragmentation
Finding actionable informations out of big data
Data Mining
Turning transformed data into patterns by applying algorithms
Data Mining
Only way to truly communicate the nature of a distribution
Data Visualization tool
Holds data from across the organization (in second NF)
Data Warehouse
Purpose is to provide aggregate data (suitable format for decision making)
Data Warehouseing
Sub- Sets of the DW Keeps individual users from manipulating data stored in data warehouse Small problems are easier to solve
Datamarts
Different points of view on data (Location/ Supplier)
Dimensions
the current relational database forces developers to recreate logic, to find solutions to mismatches between applications and their database system, to retrofit or debug the existing database because of time and money already invested and to work with a very complex system.
Disadvantages of Relational Database
Poor Data Integrity - Due to the non normalized structure of these tables, information can be replicated, creating several anomalies in the data. Long time loading dimension table - When the data integrity is low and replication values high, loading time of the tables increases. More disk space Additional processing - Usually some controlling processes are added, to avoid the data integrity issue. Harder Complex Queries - Since the data schema is built specifically to analyse a set of data, its de-normalized organization makes it harder to develop new complex queries. No Many-to-Many - This schema has no many-to-many relationships
Disadvantages of Star Diagram
Data stored on a number of computers in different locations but is stored as 1 database faster queries easy to backup higher overhead security issues
Distributed/Enterprise Model Approach to Data Warehouse Design