UNT BCIS 4660 FINAL

Réussis tes devoirs et examens dès maintenant avec Quizwiz!

"The future of data mining lies in predictive analytics," Predictive analysis

Future trends in Data Mining

Keeping Up with Customer Demand Through New BI Deployments New Data Warehousing Solves Problems for Businesses More collaboration and new data warehousing solutions -Data Lakes -Data Fragmented across organizations

Future trends in Data Warehouseing

GUI

Graphical User Interface

Receives input an displays output in 3-tier client-server architecture CLIENTS Web browsers

1. The Presentation Tier

Processes the logic and makes calculations in 3-tier client-server architecture SERVERS Shopping cart code

2. The Application Tier

Stores and manages data in 3-tier client-server architecture SERVERS Shopping and order processing database

3. The Database Tier

Less data entry Ability to create detailed queries Can interface with many 3rd party tools

Advantages of Relational Database

Data visualization tool to use when the audience is you:

Explorational Data Visualizations

ETL

Extract- Get the data Transform- Make it useful Loading- Save it to warehouse

Each cell single valued Entries in a column are same type Rows uniquely identified (no repeats in cells)

First Normal Form

Data visualization tool to use when the audience is your betters:

Formal data visualization

No multi-valued dependencies (Eliminate redundancy in tables)

Fourth Normal Form

The amount of detail in data. Lowest possible is at the individual transaction level) (Very fine)

Grain

____________ break down dimensions into navigational paths, which you can use to get a more granular level of detail in data

Heiarchy

Create tap- Design- Delete query- select field to be deleted- fill in "delete" and "criteria"- RUN!

How to form Delete queries

Create tap- Design- Make Table- Name said table- Right click and show table- Add desired tables to draw information from- format data as desired

How to form Make Table queries

Create- Design- Choose Tables Holding Desired information- Select desired attributes to be displayed

How to form Select queries

Create tap- design- update button- add table- select field to be changed- fill in "update to" and "Criteria" accordingly- RUN!

How to form Update queries

Data visualization tool to use when the audience is your peers:

Informal Data Visualization

Staging area for copies of data taken from Operational Systems Place to make sure all gathered data is in same format (Pounds, Dollars, Pesos all have to interact)

Integration Layer

Turning patterns into knowledge: (user looks at data patterns and makes determinations)

Interpretation

only tuples belonging to A and B, or shared by both A and B are included in the result.

Intersect Operator SELECT dob From A INTERSECT SELECT dob from B

What are the guidelines for a Financial Reporting DW?

Its reports should be as reliable as the existing financial system

denoted by the ⟗ symbol and is used to compound similar tuples from two Relations into single longer tuples. Every row of the first table is joined to every row of the second table. The result is tuples taken from both tables.

Join Operator SELECT A.dob, A.empno from employee JOIN B on B.empno=A.empno

The most often encountered binary (1:N) tabular relationship has minimum cardinalities ______, where the first table in the pair is the parent table

M-O (Mandatory-Optional)

Numerical values that can be added up to provide meaning to demensions (Revenue, Cost, Quantity)

Measures

Some data warehousing trends include:

Need for increasing numbers of highly skilled analysts each year

OLTP

OLTP (online transaction processing) OLTP is a class of software programs capable of supporting transaction-oriented applications on the Internet. Typically, OLTP systems are used for order entry, financial transactions, customer relationship management (CRM) and retail sales.

OLAP

Online Analytical Processing. OLAP performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling.

First Step in data warehouse. (Marketing, Sales, External Data) Already have their own normalized databases

Operational Systems

In relational algebra, you can restrict the output from a join to include only certain columns by using the _____ command (or operation).

PROJECT

250 bytes; 1024 terabytes, or a million gigabytes

Petabyte

Converting target data to Processed data (Use algorithms to remove outliers) (Detect Missing Values)

Pre processing

An attribute or group of attributes that uniquely identifies an instance or entity (Student ID)

Primary Key

used to reorder, select and get rid of attributes from a table. At some point we might want only certain attributes in a relation and eliminate others from our query result.

Project Operator SELECT dob, empno FROM EMPLOYEE

Data visualization tool to use when the audience is everybody:

Public data visualization

Describes how one or more entities interact with each other

Relationships

Student (StudentID, FirstName, LastName, Street, City, State, Zip) Based on the code above, list the complete student table.

SELECT * FROM Student ;

Customer ( CustomerNum, CustomerName, Street, City, State, Zip, Balance, CreditLimit, RepNum )

SELECT CustomerNum, CustomerName, CreditLimit-Balance AS AvailableCredit FROM Customer WHERE CreditLimit>Balance ;

The basic form of an SQL retrieval command is _____.

SELECT-FROM-WHERE

All attributes (Non-Key Columns) dependent on the key (split)

Second Normal Form

used to choose a subset of the tuples(rows) from a relation that satisfies a selection condition, acting as a filter to retain only tuples that fulfills a qualifying requirement.

Select Operator SELECT empno FROM EMPLOYEE WHERE empno=7

Converting Data to Target Data

Selection

Adding a new single column primary key, which is not the same as the source system primary key, to a dimension table is an effective way to address the problem known as _____.

Slowly changing dimensions

DataMarts in global overview of company (Managing director would be interested in this)

Strategic Marts

Low-level DW readiness for management sponsorship is identified by:

Their lack of understanding for what DW development entails

All Fields (columns) can be determined Only by the Key in the table and no other column

Third Normal Form

Data Warehouse is built first, then data mart is created Emphasizes Data Warehouse

Top down design data mart

Converting preprocessed data to transformed data (de-normalize data)

Transformation

Student (StudentID, FirstName, LastName, Street, City, State, Zip) Based on the code above, change the Zip code of the student with ID 11433 to 14455.

UPDATE Student SET Zip='14455' WHERE StudentID='11433' ;

joins or includes all tuples that are in A or in B, eliminating duplicate tuples.

Union Operator SELECT * From A UNION SELECT * From B

Which of the following is a reason for the growth in popularity of data mining? a. Increased volume of data b. Increased awareness of the inadequacy of the human brain to process multi-factorial dependencies or correlations c. Increased affordability of machine learning d. All of the above

d. All of the above

To create the dimensional model from the relatively well normalized transactional database, one must _______. a. determine dimension and facttables c. rebuild the tables b. denormalize the tables d. all of the above

d. all of the above

Uncovering new knowledge, patterns, trends, and rules from the data stored in a data warehouse is known as _____.

data mining

A(n) _____ holds information about the relationships among various entities.

database

A table is in first normal form if _____.

it does not contain repeating groups

On an E-R diagram, the number closest to the rectangle represents _____ cardinality.

maximum

Partial dependencies are dependencies on only a portion of the _____.

primary key

Rows are also called _____.

tuples

A new object added to resolve a many-to-many relationship

Intersection Entity

Represents a person, place or thing you want to track in a data base Will become a table in database ie: students table> full of students (Instances)

Entity

Simple structure of the data - Easy to understand how elements are connected. Simplifies the reporting of the information. Most common - Easy to integrate with another tools. Queries more effective - The queries in these systems are usually simpler since the data doesn't follow some strict rules of normalization. Another reason for this is the lesser number of tables to join. Performance enhancements - The performance has substantial gains due to the de-normalized form of the data. Optimized for large data sets - Due to the best performance of the system and it's queries, the star schema is efficient on data warehouses or data marts with huge data sets. Rapid aggregational actions - Tasks like sum, average, count, and others are performed quickly on this systems. Good for OLAP

Advantages of Star Diagram

EDW/DW/DWH

All mean data Warehouse

The type of table which joins two or more tables and is a real object itself is often referred to as a(n) ______ table.

Association

Which data mining technique utilizes linkage analysis to search operationaltransactions for patterns with a high probability of repetition?

Association analysis

Describes various characteristics about an individual entity Will become "columns" in table

Attribute

Data mart is created first, then they are integrated to get data warehouse Emphasizes Data Marts

Bottom up design data mart

Which end is one, which end is many:

Cardinality

The count of instances that are allowed or necessary between entity relationships

Cardinality Minimum- fewest rows needed for a relation ship Maximum- cannot exceed this number of rows

The product of two tables (or relations) is also called the ____ product.

Cartesian

also referred to as the cross product or cross join, creates a relation that has all the attributes of A and B, allowing all the attainable combinations of tuples from A and B in the result.

Cartesian Product Operator SELECT A.dob, B.empno from A, B

the system maintains several identical replicas of the same relation r in different sites. Data is more available in this scheme. Parallelism is increased when read request is served. Increases overhead on update operations as each site containing the replica needed to be updated in order to maintain consistency.

DDBMS replication

The relation r is fragmented into several relations r1, r2, r3....rn in such a way that the actual relation could be reconstructed from the fragments and then the fragments are scattered to different locations. There are basically two schemes of fragmentation: Horizontal __________ - splits the relation by assigning each tuple of r to one or more _____________. Vertical ____________ - splits the relation by decomposing the schema R of relation r.

DDMBS fragmentation

Finding actionable informations out of big data

Data Mining

Turning transformed data into patterns by applying algorithms

Data Mining

Only way to truly communicate the nature of a distribution

Data Visualization tool

Holds data from across the organization (in second NF)

Data Warehouse

Purpose is to provide aggregate data (suitable format for decision making)

Data Warehouseing

Sub- Sets of the DW Keeps individual users from manipulating data stored in data warehouse Small problems are easier to solve

Datamarts

Different points of view on data (Location/ Supplier)

Dimensions

the current relational database forces developers to recreate logic, to find solutions to mismatches between applications and their database system, to retrofit or debug the existing database because of time and money already invested and to work with a very complex system.

Disadvantages of Relational Database

Poor Data Integrity - Due to the non normalized structure of these tables, information can be replicated, creating several anomalies in the data. Long time loading dimension table - When the data integrity is low and replication values high, loading time of the tables increases. More disk space Additional processing - Usually some controlling processes are added, to avoid the data integrity issue. Harder Complex Queries - Since the data schema is built specifically to analyse a set of data, its de-normalized organization makes it harder to develop new complex queries. No Many-to-Many - This schema has no many-to-many relationships

Disadvantages of Star Diagram

Data stored on a number of computers in different locations but is stored as 1 database faster queries easy to backup higher overhead security issues

Distributed/Enterprise Model Approach to Data Warehouse Design


Ensembles d'études connexes

Ch 32 Assessing the Gastrointestinal System

View Set

Role Transition modules 1-6 self check questions

View Set

AP Comparative Final - AP Classroom Mock Tests

View Set