Data Mining

What is a centroid

The central point (mean point) of a cluster

How is split info calculated

-(A1/N) x log2(A1/N) - (A2/N) x log2(A2/N) - ... - (An/N) x log2(An/N), where Ai = no. of instances in group i, N = total no. of instances

How is the expected info (entropy) calculated

-( (t/(t+f)) x log2(t/(t+f)) + (f/(t+f)) x log2(f/(t+f)) ), where t = no. of true tuples, f = no. of false tuples
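
The two-class entropy formula can be sketched in Python as below. This is a minimal illustration, not part of the original set; the function name `entropy` is an assumption, and a class with zero tuples is treated as contributing 0 (the usual convention for 0 x log2(0)).

```python
import math

def entropy(t: int, f: int) -> float:
    """Expected information (entropy) for t true tuples and f false tuples."""
    total = t + f
    result = 0.0
    for count in (t, f):
        if count > 0:  # convention: 0 * log2(0) contributes nothing
            p = count / total
            result -= p * math.log2(p)
    return result
```

For example, `entropy(9, 5)` is roughly 0.940, and an evenly split group such as `entropy(7, 7)` gives exactly 1.0.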

For nominal variables how can the distance be calculated

(V - m) / V, where V = no. of variables, m = no. of matches
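
A small sketch of this mismatch-ratio distance, assuming each instance is given as a sequence of nominal values in the same attribute order. The function name `nominal_distance` is illustrative only.

```python
def nominal_distance(a, b):
    """Distance between two instances of nominal values: (V - m) / V."""
    if len(a) != len(b):
        raise ValueError("instances must have the same number of variables")
    v = len(a)                                   # V: number of variables
    m = sum(1 for x, y in zip(a, b) if x == y)   # m: number of matches
    return (v - m) / v
```

Two instances that agree on 2 of 3 variables, e.g. `("red", "small", "round")` and `("red", "large", "round")`, are at distance 1/3.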

What is frequent itemset

An itemset whose support is at least the minimum support threshold (Smin)

How is the info gain calculated

info expected - info needed

What is the goal of mining association rules

To identify strong rules that satisfy both the Smin & Cmin thresholds

What are two popular distance functions

•Manhattan distance •Euclidean distance
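
Both distance functions are easy to sketch for numeric points of equal length. The function names below are illustrative, not taken from the set.

```python
import math

def manhattan(p, q):
    """Sum of absolute differences along each dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

For the points (1, 2) and (4, 6), the Manhattan distance is 7 and the Euclidean distance is 5.0.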

Define sample space and event

-A collection of all possible outcomes of a random experiment -A subset of the sample space

Give 2 examples of objective measures of interesting rules

-Accuracy -Lift -Support -Confidence

What are the data mining tasks (6)

-Concept description -Association analysis -Classification & prediction -Cluster analysis -Outlier detection -Evolution analysis

Give two examples of when outliers could be useful

-Credit card fraud detection -Medical analysis

What are the 6 classification methods

-Decision tree induction -Naive Bayesian classification -Rule-based method -Artificial neural network -Nearest-Neighbour -Logistic regression

Based on subjective measures a rule is interesting if it is

-Easily understood by users -Valid on new or test data with some degree of certainty -Potentially useful -Novel -Actionable

What are the 5 social impacts of data mining

-Important tool for managers to make more informed decisions -ubiquitous - data mining is used everywhere -invisible - data mining functions are built in daily life operations -multiple personal uses -the right to privacy

Define the 3 types of nodes in a decision tree

-One root node - first attribute selected to group samples (e.g. Age) -Intermediate nodes - other attributes selected to further divide samples -Leaf nodes - resulting class labels (e.g. Buy_computer)

What are the 3 steps of binning

-Sort dataset -Partition data into equal buckets (equal depth = same frequency, equal width = same distance) -Replace each value in the bin with an appropriate value (normally the bin mean)
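
The three steps can be sketched for equal-depth bins smoothed by the bin mean. This is a simplified illustration: the function name is hypothetical, and it assumes the number of values divides evenly into the number of bins.

```python
def smooth_by_bin_means(values, n_bins):
    """Equal-depth binning: sort, partition, replace each value by its bin mean."""
    data = sorted(values)                      # step 1: sort
    if len(data) % n_bins != 0:
        raise ValueError("this sketch assumes equally sized bins")
    size = len(data) // n_bins
    smoothed = []
    for start in range(0, len(data), size):    # step 2: partition into bins
        bin_values = data[start:start + size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))  # step 3: replace with mean
    return smoothed
```

With the values 4, 8, 15, 21, 21, 24, 25, 28, 34 and 3 bins, the bin means are 9, 22 and 29.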

What are the 4 approaches to outlier detection

-Statistical -Density based -Distance based -Cluster based

What are the issues with data integration (3), if done correctly (2)

Issues: -schema integration -entity identification problem (soccer/football) -data value conflicts (miles/km) If done correctly: -helps reduce/avoid redundancies & inconsistencies -improves mining speed and quality

What is inconsistent data and what causes it

Data containing discrepancies in naming convention, data domain or format. Causes: -different sources -changes in strategy of data warehousing -functional dependency violations

What are the 2 algorithms used to determine the attribute for splitting (and how do they determine)

-ID3 algorithm - selects the attribute with the highest info gain -C4.5 algorithm - selects the attribute with the highest info gain ratio

What is the Apriori algorithm

An algorithm for mining frequent itemsets level by level, based on the Apriori property: "All nonempty subsets of a frequent itemset must also be frequent"

What is Occam's Razor

"Given two models with the same testing error, the simpler one is preferred."

How to calculate the information needed

(A1/N) x I1 + (A2/N) x I2 + ... + (An/N) x In, where Ai = no. of instances in group i, Ii = info expected of group i, N = total no. of instances
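
A sketch of the weighted sum above, together with the info gain (info expected - info needed). The function names and the example counts are illustrative assumptions; each group is given as its per-class tuple counts.

```python
import math

def entropy(counts):
    """Info expected for a group, given its per-class tuple counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_needed(groups):
    """Weighted average of the groups' entropies: sum of (Ai/N) x Ii."""
    n = sum(sum(g) for g in groups)
    return sum((sum(g) / n) * entropy(g) for g in groups)

def info_gain(class_counts, groups):
    """Info gain = info expected (before the split) - info needed (after it)."""
    return entropy(class_counts) - info_needed(groups)
```

For 14 tuples with class counts (9, 5) split into groups (2, 3), (4, 0) and (3, 2), the info needed is roughly 0.694 and the info gain roughly 0.247.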

What is the formula to calculate a new value in Min-Max Normalisation

((v - min) / (max - min)) x (newMax - newMin) + newMin
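
The formula translates directly into a short function. The name and the default target range of 0 to 1 are assumptions made for this sketch.

```python
def min_max_normalise(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Map v from [old_min, old_max] onto [new_min, new_max]."""
    if old_max == old_min:
        raise ValueError("max and min must differ")
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
```

For example, a value of 15 in the range 10 to 20 maps to 50.0 when the new range is 0 to 100.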

What are 4 OLAP operations and what do they do

-Drill up - summarises data -Drill down - moves from summary to more detailed data -Slice & dice - project & select -Pivot - re-orientates the cube

Advantages of K-means clustering

Easy to understand Relatively efficient Scalable

What are some features of ordinal variables (3)

-Either discrete/continuous -Order is important -Can be treated like interval-scaled

What's the difference between info gain and info gain ratio

-Info gain - biased towards multi-valued attributes and may lead to overfitting -Info gain ratio - tends to prefer unbalanced splits in which one partition is much smaller than the others

How is the info gain ratio calculated

info gain/ split info
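
A sketch of the ratio, assuming the info gain has already been computed and the split is described by the sizes of its groups. Function names are illustrative.

```python
import math

def split_info(group_sizes):
    """Split info: -sum of (Ai/N) x log2(Ai/N) over the groups."""
    n = sum(group_sizes)
    return -sum((a / n) * math.log2(a / n) for a in group_sizes if a > 0)

def gain_ratio(info_gain, group_sizes):
    """Info gain ratio = info gain / split info."""
    return info_gain / split_info(group_sizes)
```

For a split of 14 instances into groups of 5, 4 and 5, the split info is roughly 1.577, so an info gain of 0.247 gives a gain ratio of about 0.157.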

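How is the info expected calculated

From the class proportions: for each class, p = no. of tuples in the class / no. of tuples altogether; these proportions are plugged into the entropy formula, -(sum of p x log2(p) over the classes)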

What is an outlier

An object which doesn't conform to the general behaviour of the dataset

Purpose of data cleaning and what it involves (4)

Purpose: removing noisy data & correcting inconsistencies. Involves: -filling in missing values -smoothing noisy data -identifying and removing outliers -resolving inconsistencies

In decision tree when does partitioning stop

•All samples for a given node belong to the same class •There are no more attributes for further partitioning •There are no samples left

What are the advantages of Decision trees

-Easy to classify a new sample -Learning speed is faster than other methods -Accuracy is comparable to other methods -Robust to noise in dataset -Copes with both nominal and numerical data -Easily converted into a set of classification rules that are simple and easy to understand

Disadvantages of K-means clustering

-Applicable only to numerical data -Need to specify no. of clusters in advance -Sensitive to noisy data and outliers (remember the above, it's easier) -Unable to handle clusters of different sizes and densities -Unclear as to which attributes are more important -Lack of explanation about the nature of the clusters discovered

Disadvantages of Bayes Theorem

-Assumption that attributes are conditionally independent may be untrue -Dependencies among attributes cannot be modelled by Naïve Bayesian Classifier

What are the advantages(3) and disadvantages(2) of a cluster-based detection method

Adv: -Detects outliers without requiring any labelled data -Works for many data types -Clusters can be regarded as summarised data Dis: -Effectiveness depends on the clustering method used -High computational cost

Advantages of Bayes Theorem

Easy to implement Good results obtained Robust to noise and missing values

Define the two popular cluster partitioning methods

K-means - each cluster represented by centroid K-medoids - each cluster is represented by one of the points in the cluster

When producing clusters what is the goal to have

-Maximise intra-class similarity -Minimise inter-class similarity

What are some requirements of clustering

Scalability Ability to deal with different types of attributes Ability to handle dynamic data Discovery of cluster with arbitrary shape Able to deal with noise and outliers Insensitive to order of input records High dimensionality Interpretability and usability

What is the disadvantage and advantage of K-medoids clustering

Adv - More robust in the presence of noise and outliers Dis - More costly, so only suitable for small datasets (not scalable)

To identify anomalies we consider (2)

-No. of attributes used to define the anomaly -Whether it's an anomaly globally or locally

Difference between data mining and OLAP

-OLAP - tells what's happening to the data -DM - tells what's happening to the data, why it's happening, and can predict the future

What is a distance function used for

To measure similarity between two instances

Purpose of data reduction

To reduce data in volume and increase efficiency of mining process

What is overfitting and what causes it

The tree has too many branches, so it fits the training data too closely and performs poorly on new data. Causes: •Small number of representative samples in the training data •Noisy data or outliers •Model complexity

For concept description what do the description methods describe (2)

Data characterisation - summary of characteristics of target class Data discrimination - comparison of characteristics of target class with other classes

What is the method of Z-score Normalisation

Rescales values using the mean & standard deviation of the dataset (helps detect outliers): V' = (V - mean) / standard deviation
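
A minimal sketch of z-score normalisation using only the standard library. It assumes the population standard deviation (`statistics.pstdev`); some courses use the sample standard deviation instead, which would give slightly different scores.

```python
import statistics

def z_scores(values):
    """Return (v - mean) / standard deviation for every value."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        raise ValueError("all values are identical, so z-scores are undefined")
    return [(v - mean) / sd for v in values]
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the standard deviation is 2, so 9 gets a z-score of 2 and 2 gets a z-score of -1.5.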

Define Dispersion

The degree to which numerical data tends to spread

What are the 3 central tendencies

Mean, Median, Mode

What are the steps of K-means clustering

1.Randomly select K points as the initial centroids 2.Loop (steps 3 & 4) 3.Assign each point to the nearest centroid to form a cluster 4.Compute the centroid of each cluster 5.Stop if there is no more new assignment, i.e. all centroids do not change any more
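
The steps above can be sketched in plain Python as follows. This is an illustrative version, not a reference implementation: it assumes points are numeric tuples, uses Euclidean distance, takes a `seed` for repeatable initial centroids, and keeps the old centroid if a cluster ends up empty.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Return (centroids, assignment) where assignment[i] is the cluster of points[i]."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # step 1: random initial centroids
    assignment = None
    for _ in range(max_iter):                    # step 2: loop over steps 3 and 4
        # step 3: assign each point to its nearest centroid
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:         # step 5: stop when nothing changes
            break
        assignment = new_assignment
        # step 4: recompute each centroid as the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment
```

On two well-separated groups such as (1,1), (1,2), (2,1) and (8,8), (9,8), (8,9) with k = 2, the sketch places each group in its own cluster.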

How does a distance-based detection method approach work (1Adv, 1Dis)

Anomalies are objects that are distant from the other objects -Adv: enables multi-dimensional analysis without knowing the data distribution -Dis: cannot handle data with uneven densities

How does a statistical detection method approach work (2 Dis)

Assumes a distribution or probability model for a given dataset and then identifies outliers with respect to the model -Most tests are for single value -Data distribution may be unknown

Aims of tree pruning

Avoid overfitting Reduces model complexity Easier to understand

Define the 4 methods of dealing with noise

Binning - smoothing using values around it Regression - smoothing using regression function Clustering - detect and remove outliers Combined human & computer inspection

Whats the equation for lift

C(A=>B)/S(B)

What is incomplete data and what causes it

Data with missing attribute values or lacking attributes of interest. Causes: -values unknown or not considered important at data collection -equipment malfunction

What is the process of Cross-Validation evaluation and what is its advantage and disadvantage

Divide the dataset into equal-sized subsets; each subset is used once as the test set while the rest are used for training -Adv: statistically more reliable -Dis: more computationally expensive
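
The fold rotation can be sketched as a generator of index lists. The function name is hypothetical, samples are not shuffled, and any leftover samples are placed in the last fold; these are simplifying assumptions for the sketch.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        # the last fold absorbs any remainder
        end = start + fold_size if i < k - 1 else n_samples
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, test
```

With 6 samples and 3 folds, the first split tests on indices 0 and 1 and trains on 2 to 5; across all folds every sample is tested exactly once.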

What is the process of Partition (holdout) evaluation and what are its two disadvantages

Divide dataset into two subsets (training and test) -The number of training samples is reduced -The model may vary depending on how the dataset is partitioned

What are the methods for dealing with missing values, give Adv/Dis

-Ignore tuples with missing values: A - small effect if the no. of affected tuples is small; D - large effect (bias) if the no. of affected tuples is large -Manually fill in missing values: D - time consuming/error prone -Replace missing value with a constant: A - simple, smaller bias than ignoring; D - noise may be introduced -Predict missing value with a model: A - most effective; D - complex, computationally expensive

Why are inexplicable and redundant rules not interesting

Inexplicable - not actionable Redundant - don't meet threshold

How does a density-based detection method approach work (1Adv, 1Dis) and define LOF

Object is an outlier if its density is relatively much lower than that of its neighbours {local outliers} -LOF useful for discovering local outlier -Costly to compute every LOF for each object Local Outlier Factor = the degree to which the object is isolated with respect to its neighbors

How does a cluster-based detection method approach work (1Adv, 1Dis)

An object is an outlier if it: -Does not belong to any cluster -Is a large distance from its closest cluster -Belongs to a small/sparse cluster Adv - Detects outliers without requiring any labelled data Dis - High computational cost

What do OLTP and OLAP stand for, and define them

Online Transaction Processing - day to day transaction Online Analytical Processing - tool provided by data warehouse management system for data analysis and decision making

What are the 2 different methods of decision tree evaluation

Partition (holdout) Cross Validation

What is the process of classification

Process of constructing a model that assigns each object to a predefined class (e.g. target attribute)

Define Visual data mining

Process of discovering implicit but useful knowledge from a dataset using visualisation techniques

How is P{H|X} calculated

P{X|H} * P{H} / P{X}
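
Bayes' theorem divides the product P{X|H} * P{H} by the evidence P{X}. A one-function sketch, with an illustrative name and made-up probabilities in the usage note:

```python
def posterior(p_x_given_h, p_h, p_x):
    """P{H|X} = P{X|H} * P{H} / P{X}."""
    if p_x == 0:
        raise ValueError("P{X} must be non-zero")
    return p_x_given_h * p_h / p_x
```

For instance, with P{X|H} = 0.9, P{H} = 0.01 and P{X} = 0.05, the posterior P{H|X} is 0.18.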

What is noisy data and what causes it

Data containing random errors or outliers. Causes: -faulty data collection equipment -human/computer input error

What are 3 types of data dispersion

Range Variance Standard deviation (square root of variance)

What are the advantages of the Apriori algorithm

Reduces complexity of mining association rules, by reducing the number of -Candidates -Transactions -Comparisons in the database Increase algorithm efficiency
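
The reduction in candidates comes from the Apriori property: a candidate itemset can be discarded without counting it if any of its subsets one item smaller is not frequent. A hedged sketch of that pruning check, with a hypothetical function name and made-up itemsets:

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_smaller):
    """True if any (k-1)-item subset of the k-item candidate is not frequent."""
    k = len(candidate)
    return any(
        frozenset(subset) not in frequent_smaller
        for subset in combinations(candidate, k - 1)
    )

# Frequent 2-itemsets found in an earlier pass (illustrative data).
frequent_2 = {
    frozenset({"A", "B"}),
    frozenset({"A", "C"}),
    frozenset({"B", "C"}),
    frozenset({"B", "D"}),
}
```

Here `{"A", "B", "C"}` survives as a candidate, while `{"A", "B", "D"}` is pruned because `{"A", "D"}` is not frequent.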

What are Support and Confidence

S - % of transactions that contain the itemset C - % of transactions containing itemset A that also contain itemset B
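
Support and confidence, plus the lift measure C(A=>B)/S(B) given earlier in this set, can be sketched over a list of transactions. The function names and the shopping-basket data in the usage note are illustrative assumptions; values are returned as fractions rather than percentages.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def confidence(transactions, a, b):
    """Fraction of transactions containing A that also contain B."""
    return support(transactions, set(a) | set(b)) / support(transactions, a)

def lift(transactions, a, b):
    """Confidence of A => B divided by the support of B."""
    return confidence(transactions, a, b) / support(transactions, b)

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
```

For `baskets`, the support of {bread, milk} is 0.5, the confidence of bread => milk is about 0.667, and the lift is about 0.889.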

What are the issues with threshold (Smin, Cmin)

Smin -If set too high, rare itemsets may be lost -If set too low, many of the itemsets found do not occur frequently Cmin -Too high, only a few rules may be found -Too low, many of the rules found are uncertain

Define 4 methods of data transformation

Smoothing - removing noise from data Aggregation - summarization of data Normalisation - scaling data set to specific range Change data types - numerical -> categorical

Define the two different measures for interesting rules

Subjective - based on user's belief of data Objective - based on the structure of discovered rules and the statistics of the

What's the difference between supervised and unsupervised learning

Supervised - an attribute is specified as the target class before mining Unsupervised - there are no predefined classes

What are the equations for Accuracy, Error Rate, Precision and Recall?

-Accuracy = (TP+TN)/N -Error rate = 1 - Accuracy -Precision = TP/(TP+FP) -Recall = TP/(TP+FN)
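
The four measures can be computed from the confusion-matrix counts, taking N as TP + TN + FP + FN. The function name and the counts in the usage note are illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, error rate, precision and recall as a dict."""
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }
```

With TP = 40, TN = 45, FP = 5 and FN = 10, accuracy is 0.85, error rate 0.15, precision about 0.889 and recall 0.8.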

Define conditional probability

The probability that event A occurs given that event B has occurred

What is the purpose of pre-processing

To deal with -Noisy data -Incomplete data -Inconsistent data

Define a cluster

A collection of data objects that are similar to one another within the cluster and dissimilar to objects outside it

