Chapter 1. Introduction
Scientific Perspective
- Data is collected and stored at enormous speeds (GB/hour)
- Traditional techniques are infeasible for raw data
- Data mining may help scientists (classifying/segmenting data; hypothesis formation)
Concept of Data Mining
- Draws ideas from ML/AI, pattern recognition, stats, and database systems
Commercial Perspective of Data Mining
- Lots of data is being collected and warehoused (web data, purchases, bank/card transactions)
- Computers are cheaper and more powerful
- Competitive pressure is strong (provide better, customized services; make the decision-making process more efficient and effective)
Major Issues in Data Mining
- Mining methodology
- User interaction
- Efficiency and scalability
- Diversity of data types
- Data mining and society
Many of these issues have been addressed in recent research and development to a certain extent and are now considered data mining requirements; others are still at the research stage.
Challenges of Data Mining
- Scalability
- Dimensionality
- Complex and heterogeneous data
- Privacy preservation
- Streaming data
Frequent subsequence (Frequent sequential pattern)
A sequence of events that recurs in the same order in transactional data. Ex.: buying a laptop, then a digital camera, then a memory card.
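The idea of an ordered pattern can be sketched in a few lines of Python. This is only an illustration: the purchase histories are made-up toy data, and the helper names `contains_subsequence` and `sequence_support` are hypothetical, not taken from any library. The pattern counts as present when its items appear in the same order, with other purchases allowed in between.

```python
def contains_subsequence(sequence, pattern):
    """True if the items of `pattern` occur in `sequence` in the same order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so each match must come after the previous one
    return all(item in remaining for item in pattern)


def sequence_support(sequences, pattern):
    """Fraction of sequences that contain `pattern` as an ordered subsequence."""
    hits = sum(contains_subsequence(s, pattern) for s in sequences)
    return hits / len(sequences)


# Hypothetical purchase histories, one list per customer, in time order
purchase_histories = [
    ["laptop", "digital camera", "memory card"],
    ["laptop", "mouse", "digital camera", "memory card"],
    ["memory card", "laptop", "digital camera"],  # same items, wrong order
    ["laptop", "memory card"],                    # camera missing
]
pattern = ["laptop", "digital camera", "memory card"]
support = sequence_support(purchase_histories, pattern)
print(support)
```

A pattern would be called frequent when this support reaches a user-chosen minimum threshold.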
Knowledge Discovery Process
1. Data cleaning - remove noise and inconsistencies
2. Data integration - combine multiple data sources
3. Data selection - retrieve the relevant data from the database
4. Data transformation - transform and consolidate data into forms appropriate for mining (e.g., by summary or aggregation operations)
5. Data mining - apply methods to extract data patterns
6. Pattern evaluation - identify the truly interesting patterns
7. Knowledge presentation - visualize and present the mined knowledge to users
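The seven steps can be pictured as a chain of small functions. The sketch below is a toy illustration only: the sales records, the function names, and the choice of "item support" as the mined pattern are all assumptions made for the example, not a prescribed implementation.

```python
from collections import defaultdict

# Hypothetical raw records from two sources: (customer, item, amount)
store_sales = [("ann", "milk", 3.0), ("bob", "bread", None), ("ann", "bread", 2.5)]
web_sales = [("cat", "milk", 3.0), ("bob", "milk", 3.0), ("cat", "bread", 2.5)]


def clean(records):
    # Step 1: drop records with missing values
    return [r for r in records if None not in r]


def integrate(*sources):
    # Step 2: combine several sources into one collection
    return [r for source in sources for r in source]


def select(records):
    # Step 3: keep only the fields relevant to the task
    return [(customer, item) for customer, item, _amount in records]


def transform(records):
    # Step 4: consolidate into one basket (set of items) per customer
    baskets = defaultdict(set)
    for customer, item in records:
        baskets[customer].add(item)
    return dict(baskets)


def mine(baskets):
    # Step 5: support of each item = fraction of baskets containing it
    counts = defaultdict(int)
    for items in baskets.values():
        for item in items:
            counts[item] += 1
    return {item: n / len(baskets) for item, n in counts.items()}


def evaluate(patterns, min_support):
    # Step 6: keep only patterns that pass the interestingness threshold
    return {p: s for p, s in patterns.items() if s >= min_support}


def present(patterns):
    # Step 7: render the retained patterns in a human-readable form
    return [f"{p}: support {s:.2f}" for p, s in sorted(patterns.items())]


records = integrate(clean(store_sales), clean(web_sales))
baskets = transform(select(records))
patterns = evaluate(mine(baskets), min_support=0.9)
report = present(patterns)
print(report)
```

In practice the steps are iterative: results from evaluation often send the analyst back to selection or transformation.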
Semi-Supervised Learning
A class of machine learning techniques that make use of both labeled and unlabeled examples when learning a model. In one approach, labeled examples are used to learn class models and unlabeled examples are used to refine the boundaries between classes.
Data Discrimination
A comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes.
Interesting
A pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test data with some degree of certainty, (3) potentially useful, and (4) novel. A pattern is also interesting if it validates a hypothesis that the user sought to confirm.
Data Warehouse
A repository of multiple heterogeneous data sources organized under a unified schema at a single site to facilitate management decision making
Statistical Model
A set of mathematical functions that describe the behavior of the objects in a target class in terms of random variables and their associated probability distributions
Frequent substructure
A substructure can refer to different structural forms (graphs, trees, or lattices) that may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a frequent structured pattern.
Data Characterization
A summarization of the general characteristics or features of a target class of data
Multidimensional Data Mining
Also called exploratory multidimensional data mining. Performs data mining in multidimensional space in an OLAP style. It allows the exploration of multiple combinations of dimensions at varying levels of granularity, and thus has greater potential for discovering interesting patterns representing knowledge.
Active Learning
An ML approach that lets users play an active role in the learning process. An active learning approach can ask a user (e.g., a domain expert) to label an example, which may be from a set of unlabeled examples or synthesized by the learning program. The goal is to optimize the model quality by actively acquiring knowledge from human users, given a constraint on how many examples they can be asked to label.
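One common way to choose which example to send to the expert is uncertainty sampling: ask about the unlabeled example the current model is least sure of. The sketch below assumes a deliberately simple setting (1-D data, a threshold classifier, cleanly separable classes); `oracle` is a hypothetical stand-in for the human expert, and all names and data are illustrative.

```python
def fit_threshold(labeled):
    """Fit a 1-D threshold classifier: midpoint between the largest
    negative example and the smallest positive example."""
    largest_negative = max(x for x, y in labeled if y == 0)
    smallest_positive = min(x for x, y in labeled if y == 1)
    return (largest_negative + smallest_positive) / 2


def active_learn(labeled, pool, oracle, budget):
    """Query at most `budget` labels, each time picking the pool example
    closest to the current decision boundary (uncertainty sampling)."""
    labeled, pool = list(labeled), list(pool)
    for _ in range(budget):
        if not pool:
            break
        threshold = fit_threshold(labeled)
        query = min(pool, key=lambda x: abs(x - threshold))
        pool.remove(query)
        labeled.append((query, oracle(query)))  # ask the "expert" for a label
    return fit_threshold(labeled), labeled


def oracle(x):
    # Hypothetical expert: the true (hidden) class boundary is at 6.2
    return int(x >= 6.2)


seed = [(0.0, 0), (10.0, 1)]  # two labeled examples to start from
pool = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
threshold, labeled = active_learn(seed, pool, oracle, budget=3)
print(threshold, labeled)
```

With a budget of only three labels, the queries concentrate near the boundary instead of being spent on examples far from it.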
Knowledge Discovery from Data (KDD)
Another popular term for data mining. Data mining is often treated as a synonym for KDD; others view data mining as one step in the knowledge discovery process.
Why Data Mining?
Commercial Perspective and Scientific Perspective
Discriminant Rules
Discrimination descriptions expressed in the form of rules
Efficiency and Scalability
Efficiency and scalability are always considered when comparing data mining algorithms. As data amounts continue to multiply, these two factors are especially critical.
Supervised Learning
Essentially a synonym for classification. Category of data-mining techniques in which an algorithm learns how to predict or classify an outcome variable of interest. Supervision in the learning comes from the labeled examples in the training data set
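As a small illustration of learning from labeled examples, here is a 1-nearest-neighbor classifier, one of the simplest supervised methods. The training points, labels, and function names are made up for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def predict_1nn(training, point):
    """Predict the label of `point` as the label of its closest training example."""
    _, label = min(training, key=lambda example: squared_distance(example[0], point))
    return label


# Hypothetical labeled training set: (features, class label)
training = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 7.5), "high"),
]
print(predict_1nn(training, (2.0, 1.0)))
print(predict_1nn(training, (8.5, 8.0)))
```

The labels in `training` are the "supervision": without them the same points could only be grouped, not classified.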
Unsupervised Learning
Essentially a synonym for clustering. Category of data-mining techniques in which an algorithm explains relationships without an outcome variable to guide the process.
Description Methods
Find human-interpretable patterns that describe the data.
Types:
- Clustering
- Association Rule Discovery
- Sequential Pattern Discovery
Data Mining and Society
How data mining may impact society and what steps can be taken to preserve the privacy of individuals. These questions raise issues such as: social impacts of data mining; privacy-preserving data mining; invisible data mining
Machine Learning
Investigates how computers can learn (or improve their performance) based on data. There are several types related to data mining:
- Supervised Learning
- Unsupervised Learning
- Semi-Supervised Learning
- Active Learning
What is/isn't data mining?
Is:
- Discovering that certain names are more prevalent in certain locations (O'Brien, O'Rourke, ... in Boston)
- Grouping similar documents returned by a search engine according to their context (Amazon rainforest, Amazon.com)
Isn't:
- Looking up a phone number in a directory
- Querying a web search engine for information about "Amazon"
Mining Methodology
Researchers have been vigorously developing new data mining methodologies. This involves the investigation of new kinds of knowledge, mining in multidimensional space, integrating methods from other disciplines, and the consideration of semantic ties among data objects
Frequent Patterns
Patterns that occur frequently in data. There are many kinds: frequent itemsets, frequent subsequences (aka sequential patterns), and frequent substructures
Typical Data Mining Tasks
Prediction Methods and Description Methods
Statistics
Studies the collection, analysis, interpretation or explanation, and presentation of data. Data mining has an inherent connection with statistics
Data Cube
The multidimensional data structure in which each dimension corresponds to an attribute or set of attributes in the schema, and each cell stores the value of some aggregate measure. Data cubes provide a multidimensional view of data and allow the precomputation and fast access of summarized data.
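A tiny cube can be built with a dictionary keyed by one value per dimension. The sketch below is illustrative only: the sales facts, the `ALL` marker, and the `build_cube` helper are assumptions for the example, and it precomputes every roll-up eagerly, which real systems usually avoid for large data.

```python
from collections import defaultdict
from itertools import product

ALL = "*"  # marker for a dimension that has been aggregated away


def build_cube(facts, dimensions, measure):
    """Precompute the sum of `measure` for every cell, including roll-ups
    in which one or more dimensions are replaced by ALL."""
    cube = defaultdict(float)
    for fact in facts:
        # each dimension contributes either its concrete value or ALL
        choices = [(fact[d], ALL) for d in dimensions]
        for cell in product(*choices):
            cube[cell] += fact[measure]
    return dict(cube)


# Hypothetical sales facts
sales = [
    {"city": "Boston", "item": "milk", "quarter": "Q1", "amount": 10.0},
    {"city": "Boston", "item": "bread", "quarter": "Q1", "amount": 5.0},
    {"city": "Austin", "item": "milk", "quarter": "Q2", "amount": 7.0},
]
cube = build_cube(sales, ["city", "item", "quarter"], "amount")

print(cube[("Boston", ALL, ALL)])  # total for Boston across items and quarters
print(cube[(ALL, "milk", ALL)])    # total for milk across cities and quarters
print(cube[(ALL, ALL, ALL)])       # grand total
```

Once the cube is built, each summary is a single dictionary lookup, which is the "fast access of summarized data" the definition refers to.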
What is Data Mining?
The process of discovering interesting patterns and knowledge from large amounts of data
Information Retrieval
The science of searching for documents or information in documents. Documents can be text or multimedia, and may reside on the web.
User Interaction
The user plays an important role in the data mining process. Interesting areas of research include how to interact with a data mining system, how to incorporate a user's background knowledge in mining, and how to visualize and comprehend data mining results
Diversity of Data Types
The wide diversity of database types brings about challenges to data mining. These include: handling complex types of data; mining dynamic, networked, and global data repositories
Frequent Itemset
Typically refers to a set of items that often appear together in a transactional data set. Ex.: milk and bread.
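"Often appear together" is usually made precise with support: the fraction of transactions that contain the whole itemset. The sketch below counts support by brute force over made-up baskets; the function name and data are hypothetical, and this is plain enumeration suitable only for tiny inputs, not an optimized algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to `max_size` items
    whose support (fraction of transactions containing them) is at least
    `min_support`."""
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # dedupe and fix an order for the tuple keys
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    n = len(transactions)
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}


# Hypothetical toy baskets
transactions = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "butter"],
    ["milk", "bread", "butter"],
]
result = frequent_itemsets(transactions, min_support=0.75)
print(result)
```

Here milk and bread appear together in three of the four baskets, so the pair passes a 75% minimum support threshold.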
Prediction Methods
Use some variables to predict unknown or future values of other variables.
Types:
- Classification
- Regression
- Deviation Detection