Data Mining
What is a centroid
The central point (mean point) of a cluster
How is split info calculated
-(A1/N) log2(A1/N) - (A2/N) log2(A2/N) - ... - (An/N) log2(An/N), where Ai = no. of instances in group i, N = total instances
How is the expected info (entropy) calculated
-( (t/(t+f)) log2(t/(t+f)) + (f/(t+f)) log2(f/(t+f)) ), where t = no. of true tuples, f = no. of false tuples
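A minimal Python sketch of the two-class entropy formula above. The function name `binary_entropy` is made up for illustration, and the 9 true / 5 false counts are just example numbers.

```python
import math

def binary_entropy(t: int, f: int) -> float:
    """Expected information (entropy) in bits for t true and f false tuples."""
    total = t + f
    if total == 0:
        raise ValueError("need at least one tuple")
    entropy = 0.0
    for count in (t, f):
        if count > 0:  # a class with no tuples contributes 0 by convention
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

# Example: 9 true and 5 false tuples give roughly 0.940 bits.
print(round(binary_entropy(9, 5), 3))
```

An even split gives the maximum of 1 bit, and a pure group (all true or all false) gives 0.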
For nominal variables how can the distance be calculated
(V - m) / V, where V = no. of variables, m = no. of matches
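As a rough illustration of the matching-based distance above, here is a small Python sketch. The function name and the example attribute values are assumptions chosen for the example.

```python
def nominal_distance(a, b):
    """Distance (V - m) / V between two tuples of nominal values."""
    if len(a) != len(b) or not a:
        raise ValueError("tuples must be non-empty and of equal length")
    v = len(a)                                   # V: number of variables
    m = sum(1 for x, y in zip(a, b) if x == y)   # m: number of matches
    return (v - m) / v

# Two objects that agree on 2 of 3 nominal attributes.
print(nominal_distance(("red", "small", "round"), ("red", "large", "round")))
```

Identical objects have distance 0, and objects with no matching attributes have distance 1.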
What is a frequent itemset
An itemset whose support is at least the minimum support threshold (Smin)
How is the info gain calculated
info expected - info needed
What is the goal of mining association rules
To identify strong rules that satisfy both the Smin & Cmin thresholds
What are two popular distance functions
•Manhattan distance •Euclidean distance
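The two distance functions above can be sketched in Python as follows. The function names are illustrative, and points are assumed to be equal-length tuples of numbers.

```python
import math

def manhattan(p, q):
    """Sum of absolute differences along each dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(manhattan((1, 2), (4, 6)))   # 3 + 4
print(euclidean((1, 2), (4, 6)))   # sqrt(9 + 16)
```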
Define sample space and event
-A collection of all possible outcomes of a random experiment -A subset of the sample space
Give 2 examples of objective measures of interesting rules
-Accuracy -Lift -Support -Confidence
What are the data mining tasks (6)
-Concept description -Association analysis -Classification & prediction -Cluster analysis -Outlier detection -Evolution analysis
Give two examples of when outliers could be useful
-Credit card fraud detection -Medical analysis
What are the 6 classification methods
-Decision tree induction -Naive Bayesian classification -Rule-based method -Artificial neural network -Nearest-Neighbour -Logistic regression
Based on subjective measures a rule is interesting if it is
-Easily understood by users -Valid on new or test data with some degree of certainty -Potentially useful -Novel -Actionable
What are the 5 social impacts of data mining
-Important tool for managers to make more informed decisions -ubiquitous - data mining is used everywhere -invisible - data mining functions are built in daily life operations -multiple personal uses -the right to privacy
Define the 3 types of nodes in a decision tree
-One root node - first attribute selected to group samples (e.g. Age) -Intermediate nodes - other attributes selected to further divide samples -Leaf nodes - resulting class labels (e.g. Buy_computer)
What are the 3 steps of binning
-Sort dataset -Partition data into equal buckets (equal depth = freq, equal width = distance) -Replace each data in bin with appropriate value (normally mean)
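A minimal sketch of the three binning steps above, using equal-depth bins and smoothing by the bin mean. The function name is hypothetical, and the sketch assumes the number of values divides evenly into the bins.

```python
def smooth_by_bin_means(values, n_bins):
    """Equal-depth binning followed by replacing each value with its bin mean."""
    data = sorted(values)  # step 1: sort the dataset
    if n_bins <= 0 or not data or len(data) % n_bins != 0:
        raise ValueError("this sketch assumes the data divides evenly into the bins")
    depth = len(data) // n_bins
    smoothed = []
    for start in range(0, len(data), depth):   # step 2: equal-depth buckets
        bucket = data[start:start + depth]
        mean = sum(bucket) / len(bucket)
        smoothed.extend([mean] * len(bucket))  # step 3: replace with the bin mean
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
```

With three bins of three values each, the bins [4, 8, 15], [21, 21, 24] and [25, 28, 34] are replaced by their means 9, 22 and 29.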
What are the 4 approaches to outlier detection
-Statistical -Density based -Distance based -Cluster based
What are the issues with data integration (3), if done correctly (2)
-schema integration -entity identification problem (soccer/football) -data value conflicts (miles/km) -helps reduce/avoid redundancies & inconsistencies -Improve mining speed and quality
What is inconsistent data and what causes it
Data containing discrepancies in naming convention, data domain or format -different sources -changes in strategy of data warehousing -functional dependencies violation
What are the 2 algorithms used to determine the attribute for splitting (and how do they determine)
ID3 algorithm - selects the attribute with the highest info gain; C4.5 algorithm - selects the attribute with the highest info gain ratio
What is the Apriori property (the basis of the Apriori algorithm)
"All nonempty subsets of a frequent itemset must also be frequent"
What is Occam's Razor
"Given two models with the same testing error, the simpler one is preferred."
How to calculate the information needed
(A1/N) x I1 + (A2/N) x I2 + ... + (An/N) x In, where Ai = no. of instances in group i, Ii = info expected of group i, N = total instances
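A rough Python sketch of the weighted sum above. The helper names and the example class counts (three groups from a 14-tuple dataset with 9 true and 5 false tuples) are assumptions for illustration only.

```python
import math

def entropy(counts):
    """Info expected (entropy) of one group, given its per-class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_needed(groups):
    """Weighted average of each group's entropy, weighted by group size Ai / N."""
    n = sum(sum(g) for g in groups)
    return sum((sum(g) / n) * entropy(g) for g in groups)

# Each inner list holds [true, false] counts for one group after the split.
groups = [[2, 3], [4, 0], [3, 2]]
needed = info_needed(groups)
gain = entropy([9, 5]) - needed   # info gain = info expected - info needed
print(round(needed, 3), round(gain, 3))
```

For these example counts the info needed comes out at about 0.694 and the info gain at about 0.247.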
What is the formula to calculate a new value in Min-Max Normalisation
((v - min) / (max - min)) x (newMax - newMin) + newMin
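The min-max formula above, as a small Python sketch. The function and parameter names are illustrative, and the income figures are example numbers only.

```python
def min_max_normalise(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Map v from the range [old_min, old_max] to [new_min, new_max]."""
    if old_max == old_min:
        raise ValueError("old_min and old_max must differ")
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

# An income of 73,600 in a range of 12,000 to 98,000, mapped onto [0, 1].
print(round(min_max_normalise(73600, 12000, 98000), 3))
```

The old minimum always maps to newMin and the old maximum to newMax.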
What are the 4 OLAP operations and what do they do
Drill up - summarises data; Drill down - moves from summary to more detailed data; Slice & dice - project & select; Pivot - re-orientates the cube
Advantages of K-means clustering
Easy to understand Relatively efficient Scalable
What are some features of ordinal variables (3)
-Either discrete or continuous -Order is important -Can be treated like interval-scaled variables
What's the difference between info gain and info gain ratio
Info gain - biased towards multi-valued attributes and may lead to overfitting. Info gain ratio - tends to prefer unbalanced splits in which one partition is much smaller than the others
How is the info gain ratio calculated
info gain/ split info
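A minimal sketch of the gain ratio calculation above, with split info computed from the group sizes. The function names, the 4/6/4 split and the info gain of 0.029 are assumed example values.

```python
import math

def split_info(group_sizes):
    """-sum of (Ai/N) log2(Ai/N) over the groups produced by the split."""
    n = sum(group_sizes)
    return -sum((a / n) * math.log2(a / n) for a in group_sizes if a > 0)

def gain_ratio(info_gain, group_sizes):
    """Info gain divided by split info."""
    si = split_info(group_sizes)
    if si == 0:
        raise ValueError("split info is zero (all instances fall in one group)")
    return info_gain / si

# A split of 14 instances into groups of 4, 6 and 4.
print(round(split_info([4, 6, 4]), 3))
print(round(gain_ratio(0.029, [4, 6, 4]), 3))
```

Dividing by split info penalises attributes that break the data into many small groups, which is how the ratio offsets the bias of plain info gain.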
How is the info expected calculated
-sum over all classes of p log2(p), where p = no. of tuples in the class / total no. of tuples
What is an outlier
An object that does not conform to the general behaviour of the dataset
Purpose of data cleaning and what it involves (4)
removing noisy data & correcting inconsistencies -filling in missing values -smoothing noisy data -identifying and removing outliers -resolving inconsistencies
In a decision tree, when does partitioning stop
•All samples for a given node belong to the same class •There are no more attributes for further partitioning •There are no samples left
What are the advantages of Decision trees
-Easy to classify a new sample -Learning speed is faster than other methods -Accuracy is comparable to other methods -Robust to noise in the dataset -Copes with both nominal and numerical data -Easily converted into a set of classification rules that are simple and easy to understand
Disadvantages of K-means clustering
-Applicable only to numerical data -Need to specify no. of clusters in advance -Sensitive to noisy data and outliers (remember the above, it's easier) -Unable to handle clusters of different sizes and densities -Unclear as to which attributes are more important -Lack of explanation about the nature of the clusters discovered
Disadvantages of Bayes' Theorem
-The assumption that attributes are conditionally independent may be untrue -Dependencies among attributes cannot be modelled by the Naïve Bayesian Classifier
What are the advantages(3) and disadvantages(2) of a cluster-based detection method
Detect outliers without requiring any labelled data Works for many data types Cluster can be regarded as summarised data Effectiveness depends on clustering method used High computational cost
Advantages of Bayes' Theorem
Easy to implement Good results obtained Robust to noise and missing values
Define the two popular cluster partitioning methods
K-means - each cluster represented by centroid K-medoids - each cluster is represented by one of the points in the cluster
When producing clusters what is the goal to have
Maximise intra-class similarity Minimize interclass similarity
What are some requirements of clustering
Scalability Ability to deal with different types of attributes Ability to handle dynamic data Discovery of cluster with arbitrary shape Able to deal with noise and outliers Insensitive to order of input records High dimensionality Interpretability and usability
What are the advantage and disadvantage of K-medoids clustering
Adv - More robust in the presence of noise and outliers Dis - More costly, so suitable only for small datasets (not scalable)
To identify anomalies we consider (2)
-No. of attributes used to define the anomaly -Whether it's an anomaly globally or locally
Difference between data mining and OLAP
OLAP - tells what's happening to the data DM - tells what's happening to the data, why it's happening, and can predict the future
What is a distance function used for
To measure similarity between two instances
Purpose of data reduction
To reduce data in volume and increase efficiency of mining process
What is overfitting and what causes it
The model fits the training data too closely, e.g. a decision tree with too many branches. Causes: •Small number of representative samples in the training data •Noisy data or outliers •Model complexity
For concept description what do the description methods describe (2)
Data characterisation - summary of characteristics of target class Data discrimination - comparison of characteristics of target class with other classes
What is the method of Z-score Normalisation
Rescales the dataset using its mean & standard deviation, so the new values have mean 0 and standard deviation 1 (helps detect outliers): V' = (V - mean) / standard deviation
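A small Python sketch of z-score normalisation as described above. The function name is illustrative, and using the population standard deviation (`statistics.pstdev`) rather than the sample one is an assumption.

```python
import statistics

def z_scores(values):
    """Return (v - mean) / standard deviation for every value."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation (an assumption)
    if sd == 0:
        raise ValueError("standard deviation is zero, so z-scores are undefined")
    return [(v - mean) / sd for v in values]

# Example data with mean 5 and standard deviation 2.
print(z_scores([2, 4, 4, 4, 5, 5, 7, 9]))
```

Values with a large absolute z-score lie far from the mean, which is why the method can help flag outliers.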
Define Dispersion
degree to which numerical data tends to spread
What are the 3 central tendencies
Mean, Median, Mode
What are the steps of K-means clustering
1.Randomly select K points as the initial centroids 2.Loop (steps 3 & 4) 3.Assign each point to the nearest centroid to form a cluster 4.Compute the centroid of each cluster 5.Stop if there is no more new assignment, i.e. all centroids do not change any more
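The steps above can be sketched in plain Python as follows. This is only an illustration under assumptions: the function name, the optional `initial` and `seed` parameters, the `max_iter` cap and the rule that an empty cluster keeps its old centroid are all choices made for the example, and Euclidean distance is assumed.

```python
import math
import random

def k_means(points, k, initial=None, seed=0, max_iter=100):
    """Return (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick K points as the initial centroids (random unless given).
    centroids = list(initial) if initial is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):  # Step 2: loop over steps 3 and 4
        # Step 3: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop when no assignment changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
print(k_means(pts, 2, initial=[(1, 1), (8, 8)]))
```

Passing `initial` makes the run repeatable. With random starting points the result can differ between runs.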
How does a distance-based detection method approach work (1Adv, 1Dis)
Anomalies are objects that are distant from the other objects Adv - Enables multi-dimensional analysis without knowing the data distribution Dis - Cannot handle data with uneven densities
How does a statistical detection method approach work (2 Dis)
Assumes a distribution or probability model for a given dataset and then identifies outliers with respect to the model -Most tests are for a single value -Data distribution may be unknown
Aims of tree pruning
Avoid overfitting Reduces model complexity Easier to understand
Define the 4 methods of dealing with noise
Binning - smoothing using values around it Regression - smoothing using regression function Clustering - detect and remove outliers Combined human & computer inspection
What's the equation for lift
C(A=>B)/S(B)
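As a rough illustration of the lift equation above, the sketch below computes it directly from a list of transactions. The function name and the shopping-basket data are made up for the example.

```python
def lift(transactions, a, b):
    """Lift of the rule A => B: confidence(A => B) / support(B)."""
    a, b = set(a), set(b)
    n = len(transactions)
    count_a = sum(1 for t in transactions if a <= set(t))
    count_b = sum(1 for t in transactions if b <= set(t))
    count_ab = sum(1 for t in transactions if (a | b) <= set(t))
    if n == 0 or count_a == 0 or count_b == 0:
        raise ValueError("A and B must each appear in at least one transaction")
    confidence = count_ab / count_a   # C(A => B)
    support_b = count_b / n           # S(B)
    return confidence / support_b

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
print(lift(baskets, {"bread"}, {"milk"}))
```

A lift above 1 suggests A and B occur together more often than expected if they were independent, and a lift below 1 suggests the opposite.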
What is incomplete data and what causes it
Data with missing attribute values or lacking attributes of interest. Causes: values unknown or not considered important at data collection, equipment malfunction
What is the process of Cross-Validation evaluation and what is its advantage and disadvantage
Divide the dataset into equal-sized subsets (each subset in turn is used as the test set, the rest as training) -Statistically more reliable -More computationally expensive
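A minimal sketch of how the cross-validation subsets described above could be generated. The function name is hypothetical, and for simplicity the sketch splits indices in order without shuffling.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k subsets."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]                   # this subset is the test set
        train = indices[:start] + indices[start + size:]     # the rest is training data
        yield train, test
        start += size

for train, test in k_fold_splits(10, 5):
    print(test, train)
```

Every sample is used for testing exactly once, which is where the extra reliability and the extra computation both come from.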
What is the process of Partition (holdout) evaluation and what are its two disadvantages
Divide dataset into two subsets (training and test) The number of training samples is reduced The model may vary depending on how the dataset is partitioned
What are the methods for dealing with missing values, give Adv/Dis
-Ignore tuples with missing values: A - small effect if no. of affected tuples is small; D - large effect if no. of affected tuples is large (causing bias) -Manually fill in missing values: D - time consuming/error prone -Replace missing value with a constant: A - simple, smaller bias than ignoring; D - noise may be introduced -Predict missing value with a model: A - most effective; D - complex, computationally expensive
Why are inexplicable and redundant rules not interesting
Inexplicable - not actionable Redundant - don't meet threshold
How does a density-based detection method approach work (1Adv, 1Dis) and define LOF
An object is an outlier if its density is much lower than that of its neighbours (local outliers) Adv - LOF is useful for discovering local outliers Dis - Costly to compute the LOF for every object Local Outlier Factor (LOF) = the degree to which the object is isolated with respect to its neighbours
How does a cluster-based detection method approach work (1Adv, 1Dis)
An object is an outlier if it: -Does not belong to any cluster -Is a large distance from its closest cluster -Belongs to a small/sparse cluster Adv - Detects outliers without requiring any labelled data Dis - High computational cost
What do OLTP and OLAP stand for, and define each
Online Transaction Processing - day to day transaction Online Analytical Processing - tool provided by data warehouse management system for data analysis and decision making
What are the 2 different methods of decision tree evaluation
Partition (holdout) Cross Validation
What is the process of classification
Process of constructing a model that assigns each object to a predefined class (e.g. target attribute)
Define Visual data mining
Process of discovering implicit but useful knowledge from a dataset using visualisation techniques
How is P{H|X} calculated
P{X|H} * P{H} / P{X}
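A tiny Python sketch of Bayes' theorem, P{H|X} = P{X|H} * P{H} / P{X}. The function name and the probabilities in the example are made-up values for illustration.

```python
def posterior(p_x_given_h, p_h, p_x):
    """Posterior probability P(H|X) from the likelihood, prior and evidence."""
    if p_x <= 0:
        raise ValueError("P(X) must be positive")
    return p_x_given_h * p_h / p_x

# Made-up numbers: P(X|H) = 0.9, P(H) = 0.01, P(X) = 0.05.
print(posterior(0.9, 0.01, 0.05))
```

Note that the denominator is the probability of the evidence X, not of the hypothesis H.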
What is noisy data and what causes it
Data containing random errors or outliers. Causes: faulty data collection equipment, human/computer input error
What are 3 types of data dispersion
Range Variance Standard deviation (square root of variance)
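The three dispersion measures above can be computed with Python's standard `statistics` module, as in this short sketch. The function name is illustrative, and the population forms (`pvariance`, `pstdev`) are an assumption. The sample forms `variance` and `stdev` also exist.

```python
import statistics

def dispersion(values):
    """Range, variance and standard deviation of a list of numbers."""
    return {
        "range": max(values) - min(values),
        "variance": statistics.pvariance(values),  # population variance (assumption)
        "std_dev": statistics.pstdev(values),      # square root of the variance
    }

print(dispersion([2, 4, 4, 4, 5, 5, 7, 9]))
```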
What are the advantages of the Apriori algorithm
Reduces complexity of mining association rules, by reducing the number of -Candidates -Transactions -Comparisons in the database Increase algorithm efficiency
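One way the number of candidates gets reduced is by discarding any candidate itemset that has an infrequent subset, since all non-empty subsets of a frequent itemset must be frequent. The sketch below shows only that pruning check, not the full algorithm. The function name and example itemsets are hypothetical.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_smaller):
    """True if any (k-1)-item subset of the k-item candidate is not frequent."""
    k = len(candidate)
    return any(frozenset(subset) not in frequent_smaller
               for subset in combinations(candidate, k - 1))

# Frequent 2-itemsets assumed to have been found in an earlier pass.
frequent_2 = {frozenset({"a", "b"}), frozenset({"a", "c"}),
              frozenset({"b", "c"}), frozenset({"b", "d"})}

print(has_infrequent_subset(("a", "b", "c"), frequent_2))  # all subsets are frequent
print(has_infrequent_subset(("a", "b", "d"), frequent_2))  # {a, d} is not frequent
```

A candidate that fails this check can be dropped without counting it against the database.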
What are Support and Confidence
S - % of transactions that contain the itemset C - % of transactions containing itemset A that also contain itemset B
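A short Python sketch of support and confidence, returned as fractions rather than percentages. The function names and the example transactions are assumptions made for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, a, b):
    """Fraction of the transactions containing A that also contain B."""
    return support(transactions, set(a) | set(b)) / support(transactions, a)

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
```

A rule counts as strong when its support and confidence both reach the chosen Smin and Cmin thresholds.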
What are the issues with threshold (Smin, Cmin)
Smin -If set too high, rare itemsets may be lost -If set too low, many itemsets pass even though they do not occur frequently Cmin -If set too high, only a few rules may be found -If set too low, many of the rules found are uncertain
Define 4 methods of data transformation
Smoothing - removing noise from data Aggregation - summarization of data Normalisation - scaling data set to specific range Change data types - numerical -> categorical
Define the two different measures for interesting rules
Subjective - based on user's belief of data Objective - based on the structure of discovered rules and the statistics of the
What's the difference between supervised and unsupervised learning
Supervised learning - an attribute is specified as the target class before mining. Unsupervised learning - there are no predefined classes
What are the equations for Accuracy, Error Rate, Precision and Recall?
Accuracy = (TP + TN) / N; Error rate = 1 - Accuracy; Precision = TP / (TP + FP); Recall = TP / (TP + FN)
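The four measures above, sketched in Python from the confusion-matrix counts. The function name and the example counts are made up, and N is taken to be TP + TN + FP + FN.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, error rate, precision and recall from confusion-matrix counts."""
    n = tp + tn + fp + fn          # N: total number of samples
    accuracy = (tp + tn) / n
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Made-up counts: 40 true positives, 45 true negatives, 5 false positives, 10 false negatives.
print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
```

The sketch does not guard against a zero denominator, e.g. when the model predicts no positives at all.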
Define conditional probability
The probability that event A occurs given that event B has occurred
What is the purpose of pre-processing
To deal with -Noisy data -Incomplete data -Inconsistent data
Define a cluster
A collection of data objects that are similar to one another within the cluster and dissimilar to objects in other clusters