chapter 6, MIS
Data cleansing
*) Activities, software for detecting and correcting data in a database or file that are incorrect, incomplete, improperly formatted, or redundant. Also known as data scrubbing. *) Enforces consistency among different sets of data from separate information systems.
Select
*) Create a subset of rows that meet certain criteria.
Multidimensional Data Model
*) Cubes are aggregated set data's that helps into rapid analysis. *) OLAP cube can be understand that multidimensional array of datasets. *) For example a company might wish to analyze some financial data by product, by time-period, by city, by type of revenue and cost, and by comparing actual data with a budget. *) The OLAP cube consists of numeric facts called measures which are categorized by dimensions. *) Measures are derived from the records in the fact table and dimensions are derived from the dimension tables.
How a DBMS Solves the problems of the traditional file environment
*) DBMS reduces data redundancy and inconsistency by minimizing isolated files in which the same data are repeated. *) DBMS may not able to eliminate data redundancy but at least it can help control redundancy. *) DBMS can ensure every data has the same value so it can eliminate data inconsistency. *) DBMS uncouples programs and data, enabling data to stand on their own DBMS can increase access and availability of information and reduce the program development and maintenance costs. *) Because users and programmers can perform ad hoc queries of data in the database *) DBMS enables to centrally manage data, their use, and security.
The database approach to data management
*) Database technology solves many of the problems of traditional file organization. *) Database is a collection of data organized to serve many applications efficiently by centralizing the data and controlling redundant data. *) Data store in one location. *) Single database services multiple applications. E.g. create a single common human resources database.
Primary Key
*) Each table in a relational database has one field that is designated as it is a primary key.
Design process identifies
*) Identifies relationships among data elements *) Identifies the efficient way of grouping data elements together. *) Identifies redundant data elements. This is to meet business information requirements and the need of application programs.
Business intelligence
*) Information collected from multiple sources such as suppliers, customers, competitors, partners, and industries that analyzes patterns, trends, and relationships for strategic decision making *) Enhanced reporting system outputs and reports for planning or decision making
Database management systems (DBMS)
*) Is software that permits an organization to centralize data, manage them efficiently and provide access to the stored data by application programs. *) DBMS acts as an interface between application programs and the physical data files.
Foreign Key
*) Primary key used in second table as look-up field to identify records from original table.
Business intelligence - Principle tools include
*) Software for database query and reporting *) Online analytical processing (OLAP) *) Data mining
Object-Oriented DBMS (OODBMS)
*) Store data and procedures as objects. *) Object can be graphics, multimedia, Java applets. *) OODBMS slow compared with relational DBMS for processing large numbers of transactions
Data warehouse
*) Stores current and historical data form many operational transaction systems. *) Data warehousing provide query, analysis and reporting tools. *) Consolidates and standardizes information for use across enterprise, but data cannot be altered.
Online analytical processing (OLAP)
*) Supports multidimensional data analysis - Viewing data using multiple dimensions - Each aspect of information (product, pricing, cost, region, time period) is different dimension - E.g., how many washers sold in the East in June compared with other regions? *) OLAP enables rapid, online answers to ad hoc queries
Designing Databases
*) The database requires both a conceptual design and a physical design.
Key field
*) This key field is the unique identifier for all the information in any row of the table and this primary key cannot be duplicated.
Operations of a relational DBMS
*) Three basic operations used to develop useful sets of data.
Very large databases and systems require special capabilities, tools
*) To analyze large quantities of data *) To access data from multiple systems
Databases in the Cloud
*) Typically less functionality than on-premises DBs *) Amazon Web Services, Microsoft SQL Azure *) You only pay for the exact services you need.
A data dictionary
*) is an automated or manual file that stores definitions of data elements and their characteristics. *) Data dictionaries for large corporate databases may capture additional information, such as usage, ownership (who in the organization is responsible for maintaining the data), authorization; security, and the individuals, business functions, programs, and reports that use each data element.
Data manipulation language
*) is used to add, change, delete, and retrieve the data in the database.
Microsoft Access and other DBMS include capabilities for
*) report generation so that the data of interest can be displayed in a more structured and polished format than would be possible just by querying. *) Access also has capabilities for developing desktop system applications. *) These include tools for creating data entry screens, reports, and developing the logic for processing transactions.
Types of information obtainable from data mining
- Associations - Sequences - Classification - Clustering - Forecasting
Three Key techniques of analyzing large database
- Data warehousing - Data mining - Tools for accessing internal databases through the Web
Before new database in place, need to:
- Identify and correct faulty data - Establish better routines for editing data once database in operation
Data Marts
- Subset of data warehouse - Summarized or highly focused portion of firm's data for use by specific users - Typically focuses on single subject or line of business.
Capabilities of Database Management System
A DBMS includes capabilities and tools for organizing, managing, and accessing the data in the database. The most important are its data definition language, data dictionary, and data manipulation language.
Entity relationship diagram
A diagram that shows how different entities database are connected.
Field
A grouping of characters into a word, a group of words or complete number (person name or age Supplier name, street, city, and zipcode) and related fields can be grouped to form a record.
Data quality audit
A survey and/or sample of files to determine accuracy and completeness of data in an information system - Survey samples from data files, or - Survey end users for perceptions of quality
After normalization, the original relation ORDER has been broken down into four smaller relations. The relation ORDER is left with only two attributes and the relation LINE_ITEM has a combined, or concatenated, key consisting of Order_Number and Part_Number.
An unnormalized relation contains repeating groups. For example, there can be many parts and suppliers for each order. There is only a one-to-one correspondence between Order_Number and Order_Date
Web mining
Analyzes unstructured data (useful patterns and information) associated with websites to identify consumer behavior and website navigation from WWW
Text mining
Analyzes unstructured data to find trends and patterns in words and sentences.
Predictive analysis
Combines historical data with assumptions about future conditions to predict outcomes of events, such as future product sales or the probability that a customer will respond to an offer
Database administration
Defining, organizing, implementing, maintaining database; performed by database design and management group
Sequences
Events linked over time
Establishing an information policy
Firm's rules, procedures, roles for sharing, managing, standardizing data
An order can include more than one part, and it may be that several parts are supplied by the same supplier.
However, using this table, the supplier's name, and other information might need to be stored several times on the order list.
Web content mining
Knowledge extracted from content web pages
Ensuring data quality
More than 25% of critical data in Fortune 1000 company databases are inaccurate or incomplete
Associations
Occurrences linked to single event
6.1 Organization data in a traditional file environment
Poorly organized and maintained IS may result in business don't have timely, accurate, relevant information. So data management is so essential.
Classification
Recognizes patterns that describe group to which item belongs
Data governance
Refers policies and processes for managing availability, usability, integrity, and security of company data especially as it relates to government regulations.
Relational DBMS
Represent data as a two-dimensional table (called relations or files ) each table contains data on an entity and attributes.
Byte
Represents a single character (letter, number or symbol). Bytes can be grouped to form a field.
Fields (columns):
Represents attribute for entity
Bit
Represents the smallest unit of data either a 0 or a 1. Bits can be grouped to form a byte.
Clustering
Similar to classification when no groups have been defined; finds groupings within data
Data definition capability
Specifies structure of database content, used to create tables and define characteristics of fields.
Normalization
The process of creating small, stable, yet flexible and adaptive data structures from complex groups of data.
Web usage mining
User interaction data recorded by web server
Forecasting
Uses series of existing values to forecast what other values will be
File
a group of records of the same type (a group of record) and related files can be organized into a database.
Record
a group of related fields (student's name, a course taken, the date, the grade) related records can be collected to form a file.
Database
a group of related files makes up a database. Ex.) Telephone Book
The conceptual, or logical, design of a database is
an abstract model of the database from a business perspective
Hybrid object-relational DBMS
are now available to provide capabilities of both object-oriented and relational DBMS.
The data hierarchy starts with
bits, bytes, progresses to fields, records, files and databases
Replicated
central database duplicated in entirety at different locations.
Partitioned
different parts of databases store in separate locations
Most data quality problems stem from
faulty input
A computer system organizes data in
hierarchy
The most prominent data manipulation language today
is Structured Query Language, or SQL.
Entity
is a person, place, thing, or event on which we store and maintain information.
Crystal Reports
is a popular report generator for large corporate DBMS, although it can also be used with Access.
Attribute
is describing a particular entity such as student_ID, Course, Date, Grade
Data Mining (Market basket analysis)
is the process of discovering unknown patterns and relationships in large amounts of data to predict future behavior. *) E.g., Finding patterns in customer data for one-to-one marketing campaigns or to identify profitable customers *) - Predictive analysis - Text mining - Web mining (Link mining)
Web structure mining
links to and from web page
Join
operation combines relational tables to provide the user with more information that is available in individual tables.
Project
operation creates a subset consisting of columns in a table, permitting the user to create new tables that contain only the information required.
Rows (tuples):
referred to as records, records for different entities.
Data administration
responsible for specific policies and procedures to manage data
Physical design
shows how the database is actually arranged on direct-access storage devices.
Distributing databases
storing database in more than one place