BI Chapter 5

knowledge discovery in databases

A machine learning process that performs rule induction or a related procedure to establish knowledge from large databases

Microsoft Enterprise Consortium

...

Microsoft SQL Server

...

hypothesis-driven data mining

A form of data mining that begins with a proposition by the user, who then seeks to validate the truthfulness of the proposition

Gini index

A metric that is used in economics to measure the diversity of the population. The same concept can be used to determine the purity of a specific class as a result of a decision to branch along a particular attribute/variable
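As a rough illustration of the purity idea, the sketch below computes the Gini impurity of a set of class labels, assuming the common decision-tree form 1 - sum(p_i^2). The function name `gini_impurity` and the sample labels are made up for this example.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the class proportions p_i (assumed form)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# A branch holding a single class is perfectly pure.
pure = gini_impurity(["yes", "yes", "yes", "yes"])
# An even two-class mix is the least pure two-class case.
mixed = gini_impurity(["yes", "yes", "no", "no"])
```

A lower value means a purer branch, so a split that lowers the weighted impurity of the child nodes is preferred.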

prediction

Act of telling about the future

SAS Enterprise Miner

Comprehensive and commercial data mining software tool developed by SAS Institute

SPSS PASW Modeler

...

Association

A category of data mining algorithm that establishes relationships among items that occur together in a given record

regression

A data mining method for real world prediction problems where the predicted values are numeric

discovery-driven data mining

A form of data mining that finds patterns, associations, and relationships among data in order to uncover facts that were previously unknown or not even contemplated by an organization

distance measure

A measure used to calculate the closeness between pairs of items in most cluster analysis methods. Popular distance measures include Euclidean distance (ordinary distance between two points, as measured with a ruler) and Manhattan distance (rectilinear distance or taxicab distance between two points)
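The two distance measures named above can be sketched in a few lines. The function names and the sample points are illustrative only.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Rectilinear (taxicab) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0, 0), (3, 4)
d_euclid = euclidean(p, q)    # hypotenuse of a 3-4-5 triangle
d_taxi = manhattan(p, q)      # 3 blocks across plus 4 blocks up
```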

entropy

A metric that measures the extent of uncertainty or randomness in a data set. If all the data in a subset belongs to just one class, then there is no uncertainty or randomness in that subset, so its entropy is zero
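A minimal sketch of the calculation, assuming the usual Shannon form with base-2 logarithms (so the result is in bits). The function name and sample labels are made up for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: sum of -p_i * log2(p_i) over class proportions."""
    n = len(labels)
    counts = Counter(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

one_class = entropy(["a", "a", "a", "a"])   # no uncertainty
even_mix = entropy(["a", "a", "b", "b"])    # maximum uncertainty for two classes
```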

sequence mining

A pattern discovery method where relationships among items are examined in terms of their order of occurrence to identify associations over time

data mining

A process that uses statistical, mathematical, artificial intelligence and machine learning techniques to extract and identify useful information and subsequent knowledge from large databases

Bootstrapping

A sampling technique where a fixed number of instances from the original data is sampled (with replacement) for training and the rest of the data set is used for testing
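One way to picture this sampling scheme is sketched below. It assumes sampling with replacement, which is the usual meaning of the term, so the same instance can appear in the training sample more than once. The function name, the `seed` parameter, and the toy data are hypothetical.

```python
import random

def bootstrap_split(data, n_train, seed=0):
    """Draw n_train instances with replacement; unsampled instances form the test set."""
    rng = random.Random(seed)
    idx = [rng.randrange(len(data)) for _ in range(n_train)]
    chosen = set(idx)
    train = [data[i] for i in idx]
    test = [data[i] for i in range(len(data)) if i not in chosen]
    return train, test

data = list(range(10))
train, test = bootstrap_split(data, n_train=10)
```

Because of the replacement step, the training sample usually contains duplicates and the test set is whatever was never drawn.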

SEMMA

An alternative process for data mining projects proposed by the SAS Institute. (Sample, explore, modify, model, and assess)

ratio data

Continuous data where both differences and ratios are interpretable. The distinguishing feature is the possession of a non-arbitrary zero value

simple split

Data is partitioned into two mutually exclusive subsets called a training set and a test set. It is common to designate two-thirds of the data as the training set and the rest as the test set

ordinal data

Data that contains codes assigned to objects or events as labels that also represent the rank order among them (low, medium, high)

nominal data

Type of data that contains simple codes assigned to objects as labels; the codes are not measurements (single, married, divorced)

numeric data

Type of data that represents the numeric values of specific variables such as age, number of kids

support

Measure of how often products and/or services appear together in the same transactions in the data set; the proportion of transactions in the data set that contain all of the products mentioned in a specific rule
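As a small worked example, the sketch below counts the share of transactions that contain every item in an itemset. The `support` function and the basket data are invented for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
# Two of the four baskets contain both bread and milk.
s = support(baskets, {"bread", "milk"})
```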

link analysis

The linkage among many objects of interest is discovered automatically, such as the link between Web pages and referential relationships among groups of academic publication authors

Apriori algorithm

The most commonly used algorithm to discover association rules by recursively identifying frequent itemsets
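The level-by-level idea can be sketched as follows: find frequent single items, join them into larger candidates, and discard any candidate that has an infrequent subset. This is a simplified teaching sketch, not an optimized implementation; the function name, the `min_support` threshold, and the basket data are assumptions for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items if sup(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = sup(s)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        level = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
            and sup(c) >= min_support
        }
    return frequent

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
result = apriori(baskets, min_support=0.5)
```

The prune step is what keeps the search manageable: a three-item candidate is never counted if one of its two-item subsets already failed the threshold.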

information gain

The splitting mechanism used in ID3 (a popular decision tree algorithm); it measures the reduction in entropy achieved by splitting the data on a given attribute
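A minimal sketch of the calculation, assuming the standard definition: entropy of the parent set minus the size-weighted entropy of the subsets produced by the split. The attribute names and toy rows are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Parent entropy minus the weighted entropy after splitting on attr."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attr]].append(label)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
gain_outlook = information_gain(rows, labels, "outlook")  # separates the classes fully
gain_windy = information_gain(rows, labels, "windy")      # tells us nothing
```

A tree builder in this style would pick the attribute with the highest gain at each node.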

RapidMiner

Popular open-source, free of charge data mining software suite that employs a graphically enhanced user interface, a rather large number of algorithms and a variety of data visualization features

Weka

Popular, free of charge, open-source suite of machine-learning software written in Java

CRISP-DM

A cross-industry standardized process for conducting data mining projects. It has 6 steps, starting with understanding the business and ending with deploying the solution that satisfies the need

Area under the ROC curve

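A graphical assessment technique for binary classification models where the true positive rate is plotted on the y-axis and the false positive rate is plotted on the x-axis. The area under the resulting curve summarizes model quality: 1.0 indicates a perfect classifier and 0.5 indicates one no better than chance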
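The area can be computed without drawing the curve, because it equals the probability that a randomly chosen positive case is scored higher than a randomly chosen negative one (ties counted as half). The sketch below uses that equivalence; the function name, scores, and labels are made up.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        (1.0 if p > n else (0.5 if p == n else 0.0))
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
# Of the four positive/negative pairs, three rank the positive case higher.
area = auc(scores, labels)
```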

Categorical data

data that represent the labels of multiple classes used to divide a variable into specific groups

decision tree

Graphical presentation of a sequence of interrelated decisions to be made under assumed risk. This technique classifies specific entities into particular classes based upon the features of the entities. A tree consists of a root followed by internal nodes; each node is labeled with a question, and the arcs associated with each node cover all possible responses

Clustering

partitioning a database into segments in which the members of a segment share similar qualities

k-fold cross validation

Popular accuracy assessment technique for prediction models where the complete data set is randomly split into k mutually exclusive subsets (folds) of approximately equal size. The classification model is trained and tested k times. Each time it is trained on all but one fold and then tested on the remaining single fold. The cross-validation estimate of the overall accuracy of a model is calculated by simply averaging the k individual accuracy measures
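The procedure can be sketched as below. The helper names, the `seed` parameter, and the `train_and_score` callback (which stands in for fitting a model on the training indices and returning its accuracy on the test indices) are all hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n, k, train_and_score, seed=0):
    """Train on k-1 folds, test on the held-out fold, and average the k scores."""
    folds = k_fold_indices(n, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k

def dummy_score(train_idx, test_idx):
    # Placeholder "model": reports the share of data used for training.
    return len(train_idx) / (len(train_idx) + len(test_idx))

folds = k_fold_indices(10, 3)
estimate = cross_validate(10, 5, dummy_score)
```

Every instance is used for testing exactly once and for training k-1 times.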

Classification

supervised induction used to analyze the historical data stored in a database and to automatically generate a model that can predict future behavior

Confidence

the conditional probability of finding the RHS of the rule present in a list of transactions where the LHS of the rule already exists
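As a small worked example, the sketch below divides the number of transactions containing both sides of a rule by the number containing the left-hand side. The `confidence` function and the basket data are invented for illustration.

```python
def confidence(transactions, lhs, rhs):
    """Estimate P(rhs | lhs): transactions with both sides over those with lhs."""
    lhs, rhs = set(lhs), set(rhs)
    lhs_count = sum(1 for t in transactions if lhs <= set(t))
    both_count = sum(1 for t in transactions if (lhs | rhs) <= set(t))
    return both_count / lhs_count if lhs_count else 0.0

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
# Bread appears in three baskets; two of those also contain milk.
c = confidence(baskets, {"bread"}, {"milk"})
```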
