Week 3 - Data Mining Processes, Methods and Algorithms

association

A category of data mining algorithm that establishes relationships about items that occur together in a given record.

Cross-Industry Standard Process of Data Mining (CRISP-DM)

A cross-industry standardized process of conducting data mining projects, which is a sequence of six steps that starts with a good understanding of the business and the need for the data mining project and ends with the deployment of the solution that satisfies the specific business need.

area under the ROC curve

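A graphical assessment technique for binary classification models where the true positive rate is plotted on the y-axis and the false positive rate is plotted on the x-axis. The area under the resulting curve summarizes the model's performance in a single number: 1.0 indicates a perfect classifier and 0.5 is no better than random guessing.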
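
As a rough illustration of this card, the sketch below builds the ROC points by sweeping a threshold over the model scores and then approximates the area with the trapezoid rule. The function names `roc_points` and `auc` are made up for this example, and it assumes labels are coded 1 (positive) and 0 (negative) with both classes present.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) points, one per threshold."""
    pos = sum(y_true)              # number of positive records
    neg = len(y_true) - pos        # number of negative records
    points = [(0.0, 0.0)]          # threshold above every score: nothing flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoid-rule area under a list of (x, y) points sorted by x."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6]
print(auc(roc_points(labels, scores)))  # 0.75
```

Here three of the four positive/negative pairs are ranked correctly, which matches the area of 0.75.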

decision tree

A graphical presentation of a sequence of interrelated decisions to be made under assumed risk. This technique classifies specific entities into particular classes based upon the features of the entities. A root is followed by internal nodes where each node, including the root, is labeled with a question and the arcs associated with each node cover all possible responses.

knowledge discovery in databases (KDD)

A machine-learning process that performs rule induction or a related procedure to establish knowledge from large databases.

Distance measure

A method used to calculate the closeness between pairs of items in most cluster analysis methods. Popular distance measures include Euclidean distance (the ordinary distance between two points that one would measure with a ruler) and Manhattan distance (also called the rectilinear or taxicab distance between two points).
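
The two measures named on this card can be sketched in a few lines. The helper names `euclidean` and `manhattan` are illustrative, and the points are assumed to be equal-length numeric tuples.

```python
import math


def euclidean(p, q):
    """Straight-line ("ruler") distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Rectilinear ("taxicab") distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


a, b = (1, 2), (4, 6)
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7
```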

Gini index

A metric that is used in economics to measure the diversity of the population. The same concept can be used to determine the purity of a specific class as a result of a decision to branch along with a particular attribute or variable.
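
A minimal sketch of the purity calculation, assuming the form commonly used in decision trees (one minus the sum of squared class proportions). The card itself does not give a formula, so both that choice and the name `gini_impurity` are assumptions made for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """1 - sum(p_i^2) over the class proportions p_i; 0 means a pure node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini_impurity(["yes", "yes", "yes"]))  # 0.0 (pure)
print(gini_impurity(["yes", "no"]))          # 0.5 (evenly mixed, two classes)
```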

Entropy

A metric that measures the extent of uncertainty or randomness in a data set. If all the data is a subset belonging to just one class, then there is no uncertainty or randomness in that data set, and therefore the entropy is zero.
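
The single-class case described above can be checked with a short sketch. It assumes Shannon entropy in bits (base-2 logarithm), which the card does not state explicitly; the function name `entropy` is illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1/p) over class proportions p."""
    n = len(labels)
    proportions = [count / n for count in Counter(labels).values()]
    return sum(p * math.log2(1 / p) for p in proportions)


print(entropy(["yes", "yes", "yes", "yes"]))  # 0.0, all records in one class
print(entropy(["yes", "no", "yes", "no"]))    # 1.0, maximum for two classes
```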

sequence mining

A pattern discovery method where relationships among the things are examined in terms of their order of occurrence to identify associations over time.

Weka

A popular, free-of-charge, open-source suite of machine learning software written in Java and developed at the University of Waikato.

RapidMiner

A popular, open-source, and free-of-charge data mining software suite that employs a graphically enhanced user interface, a rather large number of algorithms, and a variety of data visualization features.

bootstrapping

A sampling technique where a fixed number of instances from the original data is sampled (with replacement) for training and the rest of the data set is used for testing.
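
One possible way to sketch this sampling scheme with the standard library is shown below. The function name `bootstrap_split`, its parameters, and the fixed seed are assumptions for the example; records that are never drawn form the test set.

```python
import random


def bootstrap_split(data, n_samples, seed=0):
    """Draw n_samples records with replacement for training; test on the rest."""
    rng = random.Random(seed)
    drawn = [rng.randrange(len(data)) for _ in range(n_samples)]
    drawn_set = set(drawn)
    train = [data[i] for i in drawn]                                   # may contain repeats
    test = [row for i, row in enumerate(data) if i not in drawn_set]   # never drawn
    return train, test


train, test = bootstrap_split(list(range(10)), n_samples=10)
print(len(train), sorted(set(train)), test)
```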

sensitivity analysis

A study of the effect of a change in one or more input variables on a proposed solution.

classification

Supervised induction used to analyze the historical data stored in a database and to automatically generate a model that can predict future behavior.

numeric data

A type of data that represents the numeric values of specific variables. Examples of numerically valued variables include age, number of children, total household income (US dollars), travel distance (in miles), and temperature (in Fahrenheit degrees).

sample, explore, modify, model, and assess (SEMMA)

An alternative process of data mining projects proposed by the SAS Institute.

Konstanz Information Miner (KNIME)

An open-source, free-of-charge, and platform-agnostic analytics software tool.

<blank> is the most commonly used algorithm to discover association rules. It attempts to find the subsets that are common to at least a minimum number of the itemsets.

Apriori algorithm

Data mining requires specialized data analysts to ask ad hoc questions and obtain answers quickly from the system.

False

Ratio data is a type of categorical data.

False

The entire focus of the predictive analytics system in the Visa case was on detecting and handling fraudulent charges for the company's benefit.

False

prediction

The act of telling about the future.

ensembles (model ensembles, ensemble modeling)

The combinations of the outcomes produced by two or more analytics models into a compound output. Ensembles are primarily used for prediction modeling where the scores of two or more models are combined to produce a better prediction.
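
Two simple ways of combining model outputs are sketched here: averaging numeric scores and taking a majority vote over predicted labels. The card does not name a specific combination rule, so both helpers (`average_scores`, `majority_vote`) are illustrative assumptions.

```python
from collections import Counter
from statistics import mean


def average_scores(*model_scores):
    """Average the per-record scores produced by several models."""
    return [mean(scores) for scores in zip(*model_scores)]


def majority_vote(*model_labels):
    """Pick the most frequent predicted label for each record."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*model_labels)]


print(average_scores([0.2, 0.8], [0.4, 0.6]))
print(majority_vote(["a", "b"], ["a", "b"], ["b", "b"]))  # ['a', 'b']
```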

confidence

The conditional probability of finding the RHS of the rule that is present in a list of transactions where the LHS of the rule already exists. It is used in association rules.

simple split

The data are partitioned into two mutually exclusive subsets called a training set and a test set, also known as a holdout set. It is common to designate two-thirds of the data as the training set and the remaining one-third as the test set.
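
The two-thirds / one-third partition can be sketched as follows. The function name `simple_split`, the shuffling step, and the fixed seed are assumptions added for the example.

```python
import random


def simple_split(data, train_fraction=2 / 3, seed=0):
    """Shuffle the records, then cut them into a training set and a holdout set."""
    rng = random.Random(seed)
    shuffled = list(data)          # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = simple_split(range(9))
print(len(train), len(test))  # 6 3
```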

link analysis

The linkage among many objects of interest is discovered automatically, such as the link between web pages and referential relationships among groups of academic publication authors.

support

The measure of how often products and services appear together in the same transaction, that is, the proportion of transactions in the data set that contains all of the products and services mentioned in a specific rule.
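
Support, together with the confidence measure used for association rules, can be computed directly from a list of transactions. The tiny grocery data set and the function names below are invented for illustration; `confidence` assumes the left-hand side appears in at least one transaction.

```python
def support(transactions, itemset):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """P(RHS | LHS): support of LHS and RHS together divided by support of LHS."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(transactions, {"bread", "milk"}))       # 0.5
print(confidence(transactions, {"bread"}, {"milk"}))  # 2 of the 3 bread baskets
```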

Apriori Algorithm

The most commonly used algorithm to discover association rules by recursively identifying frequent itemsets.
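
A compact, unoptimized sketch of the level-wise idea follows: find frequent single items, join them into larger candidates, prune any candidate with an infrequent subset, and repeat until nothing survives. It is a teaching sketch rather than a reference implementation; the function name `apriori` and its `min_support` parameter (a proportion between 0 and 1) are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def keep_frequent(candidates):
        result = {}
        for candidate in candidates:
            s = sum(candidate <= t for t in transactions) / n
            if s >= min_support:
                result[candidate] = s
        return result

    items = {item for t in transactions for item in t}
    current = keep_frequent({frozenset([item]) for item in items})
    all_frequent = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = keep_frequent(candidates)
        all_frequent.update(current)
        k += 1
    return all_frequent


baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
for itemset, s in sorted(apriori(baskets, 0.5).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), s)
```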

clustering

The partitioning of a database into segments in which the members of a segment share similar qualities.

information gain

The splitting mechanism used in ID3, which is a popular decision-tree algorithm.
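
Information gain is usually computed as the entropy of the parent set minus the size-weighted entropy of the subsets produced by a split. The sketch below assumes that definition with base-2 entropy; the record layout (a list of dictionaries) and the names `entropy` and `information_gain` are illustrative.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Parent entropy minus the weighted entropy after splitting on `attribute`."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attribute]].append(label)
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


rows = [
    {"outlook": "sunny", "windy": True},
    {"outlook": "sunny", "windy": False},
    {"outlook": "rain", "windy": True},
    {"outlook": "rain", "windy": False},
]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, "outlook"))  # 1.0, a perfect split
print(information_gain(rows, labels, "windy"))    # 0.0, an uninformative split
```

A tree builder in the style of ID3 would evaluate each candidate attribute this way and branch on the one with the highest gain.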

interval data

The variables that can be measured on interval scales.

If using a mining analogy, "knowledge mining" would be a more appropriate term than "data mining."

True

The cost of data storage has plummeted recently, making data mining feasible for more firms.

True

In data mining, classification models help in prediction.

True - In data mining, classification models help in prediction. Data mining is a process that uses statistical, mathematical, and AI techniques to extract and identify useful information and subsequent knowledge and patterns from a large set of data. Classification is one of the data mining methods.

The Gini index has been used in economics to measure the diversity of a population.

True - The Gini index has been used in economics to measure the diversity of a population. It is also a splitting mechanism used for building decision trees.

Decision trees are most appropriate for categorical data and interval data.

True - a decision tree classifies data into a finite number of classes based on the values of the input variable.

The decision tree approach is useful for problems with many attributes impacting the classification of different patterns.

True - a decision tree recursively divides a training set until each division consists entirely or primarily of examples from one class.

Data preparation, the third step in CRISP-DM, is commonly known as <blank>.

data preprocessing

One way to accomplish privacy and protection of individuals' rights when data mining is by <blank> of the customer records prior to applying data mining applications, so that the records cannot be traced to an individual.

de-identification

A <blank> recursively divides a training set until each division consists entirely or primarily of examples from one class.

decision tree

In the terrorist funding case study, an observed price <blank> may be related to income tax avoidance/evasion, money laundering, or terrorist financing.

deviation

ID3 (iterative dichotomiser 3), the most widely used decision tree algorithm, uses <blank> as the mechanism for splitting.

information gain

<blank> is a method for assessing classification models that randomly splits the complete dataset into k mutually exclusive subsets of approximately equal size.

k-fold cross-validation
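
The splitting step can be sketched as a generator that shuffles the record indices, deals them into k folds, and yields each fold once as the test set while the remaining folds form the training set. The name `k_fold_indices` and the fixed seed are assumptions for the example.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of k mutually exclusive folds."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # fold sizes differ by at most one
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(n=10, k=3):
    print(len(train), len(test))
```

A model would be trained and tested k times, once per fold, and the k accuracy figures averaged.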

Fayyad et al. (1996) defined <blank> in databases as a process of using data mining methods to find useful information and patterns in the data.

knowledge discovery

Association rule mining is also known as the <blank> in the retail industry.

market-basket analysis

The data mining in cancer research case study explains that data mining methods are capable of extracting patterns and <blank> hidden deep in large and complex medical databases.

relationships

