Chapter 1. Introduction

Scientific Perspective

- Data is collected and stored at enormous speeds (GB/hour)
- Traditional techniques are infeasible for raw data
- Data mining may help scientists (classifying/segmenting data; hypothesis formation)

Concept of Data Mining

- Draws ideas from ML/AI, pattern recognition, statistics, and database systems

Commercial Perspective of Data Mining

- Lots of data is being collected/warehoused (web data, purchases, bank/card transactions)
- Computers are cheaper and more powerful
- Competitive pressure is strong (provide better, customized services; make the decision-making process more efficient/effective)

Major Issues in Data Mining

- Mining methodology
- User interaction
- Efficiency and scalability
- Diversity of data types
- Data mining and society

Many of these issues have been addressed in recent research and development to a certain extent and are now considered data mining requirements; others are still at the research stage.

Challenges of Data Mining

- Scalability
- Dimensionality
- Complex and Heterogeneous Data
- Privacy Preservation
- Streaming Data

Frequent subsequence (Frequent sequential pattern)

A recurring ordered pattern in transactional data. Ex. buying a laptop, then a digital camera, then a memory card.
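
As a rough illustration, the sketch below measures how often an ordered pattern shows up across customer purchase histories. The `purchases` data and the function names are made up for this example, and gaps between the pattern's items are allowed, which is one common convention rather than the only one.

```python
def contains_subsequence(sequence, pattern):
    """Return True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so later items must appear later.
    return all(item in it for item in pattern)


def sequence_support(sequences, pattern):
    """Fraction of sequences that contain `pattern` as a subsequence."""
    hits = sum(contains_subsequence(seq, pattern) for seq in sequences)
    return hits / len(sequences)


# Hypothetical purchase histories, one ordered list per customer.
purchases = [
    ["laptop", "digital camera", "memory card"],
    ["laptop", "mouse", "digital camera", "memory card"],
    ["digital camera", "laptop"],
    ["laptop", "memory card"],
]

pattern = ["laptop", "digital camera", "memory card"]
print(sequence_support(purchases, pattern))  # 0.5
```

Two of the four histories contain the three purchases in that order, so the support is 0.5. Whether that counts as "frequent" depends on a minimum-support threshold chosen by the analyst.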

Knowledge Discovery Process

1. Data cleaning - remove noise and inconsistencies
2. Data integration - combine data sources
3. Data selection - retrieve relevant data from db
4. Data transformation - data transformed into appropriate forms for mining and aggregation
5. Data mining - methods applied to extract data patterns
6. Pattern evaluation - identify truly interesting patterns
7. Knowledge representation - visualize and transfer new knowledge
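
The ordering of the steps can be sketched as a toy pipeline. Everything here is an assumption made for illustration: the two store record lists, the function names, and especially the "mining" step, which is only a threshold filter standing in for a real mining method.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [
    {"customer": "c1", "item": "milk", "qty": 2},
    {"customer": "c2", "item": "bread", "qty": None},
]
store_b = [
    {"customer": "c1", "item": "bread", "qty": 1},
    {"customer": "c3", "item": "milk", "qty": 3},
]


def clean(records):
    """1. Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """2. Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, fields):
    """3. Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """4. Data transformation: aggregate quantity per item."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += r["qty"]
    return totals


def mine_and_evaluate(totals, min_qty):
    """5-6. Stand-in for mining and pattern evaluation: keep items at or above a threshold."""
    return {item: qty for item, qty in totals.items() if qty >= min_qty}


records = integrate(clean(store_a), clean(store_b))
relevant = select(records, ["item", "qty"])
popular = mine_and_evaluate(transform(relevant), min_qty=2)

# 7. Knowledge representation: present the result to the user.
for item, qty in popular.items():
    print(f"{item}: {qty} units sold")
```

In practice the steps are iterative rather than strictly linear, and each one is far more involved than a single function.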

Semi-Supervised Learning

A class of machine learning techniques that make use of both labeled and unlabeled examples when learning a model. In one approach, labeled examples are used to learn class models and unlabeled examples are used to refine the boundaries between classes.
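
One way to picture the approach described above is self-training with a nearest-centroid classifier. This is a minimal sketch under stated assumptions: the one-dimensional data, the class names, and the choice of self-training are all illustrative, and many other semi-supervised methods exist.

```python
def centroids(labeled):
    """Mean feature value per class, from (value, label) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(model, x):
    """Assign x to the class with the nearest centroid."""
    return min(model, key=lambda y: abs(x - model[y]))


def self_train(labeled, unlabeled, rounds=3):
    """Learn class models from labeled data, then refine them with pseudo-labels."""
    labeled = list(labeled)
    model = centroids(labeled)
    for _ in range(rounds):
        pseudo = [(x, predict(model, x)) for x in unlabeled]
        model = centroids(labeled + pseudo)
    return model


# Hypothetical data: two labeled examples and five unlabeled ones.
labeled = [(1.0, "low"), (9.0, "high")]
unlabeled = [2.0, 3.0, 4.0, 8.0, 10.0]

initial = centroids(labeled)
refined = self_train(labeled, unlabeled)
print(predict(initial, 5.5), predict(refined, 5.5))  # high low
```

With only the labeled examples the class boundary sits at 5.0; after the unlabeled examples pull the "low" centroid to 2.5, the boundary moves to 5.75, so the point 5.5 changes class.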

Data Discrimination

A comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes.

Interesting

A pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test data with some degree of certainty, (3) potentially useful, and (4) novel. A pattern is also interesting if it validates a hypothesis that the user sought to confirm.

Data Warehouse

A repository of multiple heterogeneous data sources organized under a unified schema at a single site to facilitate management decision making

Statistical Model

A set of mathematical functions that describe the behavior of the objects in a target class in terms of random variables and their associated probability distributions

Frequent substructure

A substructure can refer to different structural forms (graphs, trees, or lattices) that may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured pattern.

Data Characterization

A summarization of the general characteristics or features of a target class of data

Multidimensional Data Mining

Also called exploratory multidimensional data mining. Performs data mining in a multidimensional space in an OLAP style. It allows the exploration of multiple combinations of dimensions at varying levels of granularity in data mining, and thus has greater potential for discovering interesting patterns representing knowledge.

Active Learning

An ML approach that lets users play an active role in the learning process. An active learning approach can ask a user (e.g., a domain expert) to label an example, which may be from a set of unlabeled examples or synthesized by the learning program. The goal is to optimize the model quality by actively acquiring knowledge from human users, given a constraint on how many examples they can be asked to label.
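
A small sketch of this idea is uncertainty sampling with a fixed label budget: the learner always asks about the unlabeled example closest to its current decision boundary. The `oracle` function stands in for the human expert, and its rule, the data, and all names are assumptions made for the example.

```python
def oracle(x):
    """Stand-in for the human expert (hypothetical rule: positive if x >= 6.2)."""
    return x >= 6.2


def boundary(labeled):
    """Midpoint between the largest negative and the smallest positive example."""
    largest_negative = max(x for x, y in labeled if not y)
    smallest_positive = min(x for x, y in labeled if y)
    return (largest_negative + smallest_positive) / 2


def active_learn(labeled, pool, budget):
    """Query the most uncertain pool example until the label budget is spent."""
    labeled, pool = list(labeled), list(pool)
    for _ in range(budget):
        if not pool:
            break
        b = boundary(labeled)
        query = min(pool, key=lambda x: abs(x - b))  # closest to the boundary
        pool.remove(query)
        labeled.append((query, oracle(query)))
    return boundary(labeled), labeled


# Hypothetical starting point: one labeled example per class, nine unlabeled.
seed = [(0.0, False), (10.0, True)]
pool = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

estimate, labeled = active_learn(seed, pool, budget=3)
print(estimate)  # 6.5
```

With three queries (5.0, 7.0, then 6.0) the estimated boundary moves from 5.0 to 6.5, close to the expert's hidden rule, without labeling the other six examples.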

Knowledge Discovery from Data (KDD)

Another popular term for data mining. Data mining is often treated as a synonym for KDD; others view data mining as one step in the knowledge discovery process.

Why Data Mining?

Commercial Perspective and Scientific Perspective

Discriminant Rules

Discrimination descriptions expressed in the form of rules

Efficiency and Scalability

Efficiency and scalability are always considered when comparing data mining algorithms. As data amounts continue to multiply, these two factors are especially critical.

Supervised Learning

Essentially a synonym for classification. Category of data-mining techniques in which an algorithm learns how to predict or classify an outcome variable of interest. Supervision in the learning comes from the labeled examples in the training data set
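
For a concrete picture, here is a minimal sketch of classification with a 1-nearest-neighbor rule, chosen only because it is short. The training examples, feature values, and labels are invented for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def classify(training_set, x):
    """Return the label of the labeled training example nearest to x (1-NN)."""
    _, label = min(training_set, key=lambda example: squared_distance(example[0], x))
    return label


# Hypothetical labeled training data: (features, outcome label).
training_set = [
    ((25.0, 1.0), "low risk"),
    ((30.0, 2.0), "low risk"),
    ((60.0, 9.0), "high risk"),
    ((55.0, 8.0), "high risk"),
]

print(classify(training_set, (28.0, 1.5)))  # low risk
print(classify(training_set, (58.0, 8.5)))  # high risk
```

The labels in `training_set` are the "supervision": without them the algorithm would have nothing to predict.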

Unsupervised Learning

Essentially a synonym for clustering. Category of data-mining techniques in which an algorithm explains relationships without an outcome variable to guide the process.
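
As an illustration, the sketch below clusters one-dimensional points with a bare-bones k-means loop. The points, the starting centers, and the fixed iteration count are assumptions for the example; k-means is only one of many clustering methods.

```python
def k_means_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-averaging."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical unlabeled data and starting centers.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = k_means_1d(points, centers=[0.0, 5.0])
print(centers)   # [2.0, 11.0]
print(clusters)  # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```

No outcome variable is supplied; the two groups emerge from the distances between the points alone.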

Description Methods

Find human-interpretable patterns that describe the data.
Types:
- Clustering
- Association Rule Discovery
- Sequential Pattern Discovery

Data Mining and Society

How data mining may impact society and what steps can be taken to preserve the privacy of individuals. These questions raise issues such as: social impacts of data mining; privacy-preserving data mining; invisible data mining

Machine Learning

Investigates how computers can learn (or improve their performance) based on data. There are several types related to data mining:
- Supervised Learning
- Unsupervised Learning
- Semi-Supervised Learning
- Active Learning

What is/isn't data mining?

Is:
- Finding that certain names are more prevalent in certain locations (O'Brien, O'Rourke... in Boston)
- Grouping similar documents returned by a search engine by context (Amazon rainforest, Amazon.com)

Isn't:
- Looking up a phone number in a directory
- Querying a web search engine for information about "Amazon"

Mining Methodology

Researchers have been vigorously developing new data mining methodologies. This involves the investigation of new kinds of knowledge, mining in multidimensional space, integrating methods from other disciplines, and the consideration of semantic ties among data objects

Frequent Patterns

Patterns that occur frequently in data. There are many kinds: frequent itemsets, frequent subsequences (aka sequential patterns), and frequent substructures.

Typical Data Mining Tasks

Prediction Methods and Description Methods

Statistics

Studies the collection, analysis, interpretation or explanation, and presentation of data. Data mining has an inherent connection with statistics

Data Cube

The multidimensional data structure in which each dimension corresponds to an attribute or set of attributes in the schema, and each cell stores the value of some aggregate measure. Data cubes provide a multidimensional view of data and allow the precomputation and fast access of summarized data.
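
A tiny sketch of the idea: precompute a sum for every combination of dimensions so that summaries can be looked up instead of recalculated. The `sales` rows, the dimension names, and the dictionary layout are assumptions for this example, not how any particular OLAP system stores a cube.

```python
from collections import defaultdict
from itertools import combinations


def data_cube(rows, dimensions, measure):
    """Sum `measure` for every subset of `dimensions` (one cuboid per subset)."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            cuboid = defaultdict(int)
            for row in rows:
                cell = tuple(row[d] for d in dims)
                cuboid[cell] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


# Hypothetical sales facts.
sales = [
    {"city": "Boston", "item": "laptop", "units": 5},
    {"city": "Boston", "item": "camera", "units": 3},
    {"city": "Chicago", "item": "laptop", "units": 2},
    {"city": "Chicago", "item": "laptop", "units": 4},
]

cube = data_cube(sales, ["city", "item"], "units")
print(cube[()])                               # {(): 14}
print(cube[("city",)])                        # {('Boston',): 8, ('Chicago',): 6}
print(cube[("city", "item")][("Chicago", "laptop")])  # 6
```

Each key of `cube` is one level of summarization: the empty tuple is the grand total, single dimensions are roll-ups, and the full tuple is the most detailed view.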

What is Data Mining?

The process of discovering interesting patterns and knowledge from large amounts of data

Information Retrieval

The science of searching for documents or information in documents. Documents can be text or multimedia, and may reside on the web.

User Interaction

The user plays an important role in the data mining process. Interesting areas of research include how to interact with a data mining system, how to incorporate a user's background knowledge in mining, and how to visualize and comprehend data mining results

Diversity of Data Types

The wide diversity of database types brings about challenges to data mining. These include: handling complex types of data; mining dynamic, networked, and global data repositories

Frequent Itemset

Typically refers to a set of items that often appear together in a transactional data set. Ex. milk and bread.
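
To make "often appear together" concrete, the sketch below counts the support of every itemset up to size two and keeps those above a threshold. It is a brute-force count, not the Apriori algorithm, and the transactions, the threshold, and the names are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets whose support is at least min_support."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


# Hypothetical market-basket data.
transactions = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "butter"],
    ["milk", "bread", "butter"],
]

frequent = frequent_itemsets(transactions, min_support=0.75)
print(frequent[("bread", "milk")])  # 0.75
```

Milk and bread appear together in three of the four baskets, so the pair has support 0.75 and passes the threshold, while pairs such as bread and butter (support 0.5) do not.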

Prediction Methods

Use some variables to predict unknown or future values of other variables.
Types:
- Classification
- Regression
- Deviation Detection
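
As one example of predicting an unknown value from a known variable, here is a minimal least-squares line fit, a simple form of regression. The data points and function names are made up for the sketch.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_value(model, x):
    """Predict y for a new x using a fitted (slope, intercept) pair."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical observations that happen to lie on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

model = fit_line(xs, ys)
print(model)                      # (2.0, 1.0)
print(predict_value(model, 5.0))  # 11.0
```

Classification would predict a category instead of a number, and deviation detection would flag observations that fall far from what such a model expects.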

