Data Mining
Ward's method interpretation
Two clusters that are very far apart would produce a high within-cluster variance if merged, which indicates that we shouldn't merge them
Deep Learning
Subset of machine learning; set apart by the use of multiple layers in a neural network
MinPts
User-specified parameter which indicates the density threshold of dense regions; how many neighbors must a point have for it to be considered a core point
Bayesian Belief Networks
A graphical model that uses conditional probability to determine the probability that an event occurs given other (parent) events
DENCLUE and noise
it is really good at handling noise, even when there is a lot of uniformly distributed noise; the only limitation would be biased noise, which would throw the algorithm off
k-means and noise
it is unable to handle noise; outliers distort the cluster means
principal vectors
k orthonormal vectors that provide a basis for the normalized input data
Gabor function
like a Gaussian function in 2D; it can be 3D in multidimensional space
logarithmic normalization
(ln(v) - ln(min)) / (ln(max) - ln(min))
RNN applications
machine translation, speech recognition, handwriting recognition, image captioning
covariance
measure of how much two variables change together
Iterative Self-Organizing Data Analysis Technique (ISODATA)
merges clusters if either the number of members in a cluster is less than a certain threshold or if the centers of two clusters are closer than a certain threshold
Data integration
merges data from multiple sources into a coherent data store, such as a data warehouse
vectors
multiple data points that belong together and are not separable (like a phone number)
how to determine initial clustering
use multiple samples so that you avoid the chance that a single sample is not representative
multivariate and multidimensional data
multiple variables that may or may not have any relationship
lazy learner examples
nearest-neighbor classifiers, case-based reasoning classifiers
preprocessing step for k-nearest neighbors
normalize attribute values (min-max normalization works)
Goal of segmentation
simplify the image into something more meaningful and easier to analyze
Binning
sort data and partition into (equi-depth) bins and then smooth by bin means, bin median, bin boundaries, etc; a local smoothing method because it consults only the local neighborhood of data points
Minkowski Distance
(sum of |x[i] - y[i]|^p)^(1/p); the p-th root (not the square root) of the sum
Euclidean distance
sqr rt(sum of (x[i] - y[i])^2)
Measure for compactness of a cluster (cost function)
sum of dist(p, m[c])
Measure for compactness of clustering (cost function)
sum of the compactness of each cluster C
Manhattan distance
sum of |x[i] - y[i]|; a step-wise function
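As a quick illustration of these three distance functions, here is a minimal Python sketch (the points and the helper name minkowski are just for illustration); Manhattan and Euclidean fall out as the p = 1 and p = 2 cases:

    import numpy as np

    def minkowski(x, y, p):
        # p-th root of the sum of |x[i] - y[i]|^p
        return np.sum(np.abs(np.asarray(x, float) - np.asarray(y, float)) ** p) ** (1.0 / p)

    x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
    print(minkowski(x, y, 1))   # Manhattan: |1-4| + |2-0| + |3-3| = 5.0
    print(minkowski(x, y, 2))   # Euclidean: sqrt(9 + 4 + 0) ~ 3.61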
Unsupervised learning
takes unlabeled data as input and outputs a grouping of data with decision boundaries
feature map
the "hidden layers" of SOMs; the neurons in this are connected to each other and to each input
Centroid
the center point of a cluster, which can be defined in many ways, including as the mean or medoid of the objects assigned to the cluster
convolutional layer
the core building block of the CNN; consists of learnable filters; each filter is convolved across the width and height of the input volume, computing the dot product between the entries of the filter and the input and producing a 2D activation map of that filter
Density estimation
the estimation of an unobservable underlying probability density function based on a set of observed data ; in this context, the unobservable underlying probability density function is the true distribution of the population of all possible objects to be analyzed
difference between neural networks and curve fitting/ regression
the first can model non-linear relationships, the second can't
Lazy learner
the learner waits until the last minute before doing any model construction to classify a given test tuple; only performs a generalization when given a test tuple; also called instance-based learners, even though all learning is essentially based on instances
simple random sampling
the least biased sampling method; the problem is the chance that the sample does not describe the whole population
Margin
the lines (hyperplanes) that pass through the support vectors; the distance between them is what the SVM maximizes
Max pooling
the most common type of pooling; slides a window over its input, and takes the max value in the input
Order 2 neighbors
the neurons that are one step beyond the directly linked neighbors of the neuron of interest in a feature map
Order 1 neighbors
the neurons that are directly linked to the neuron of interest in a feature map
k in k-nearest neighbors
the number of nearest neighbors consulted when classifying or predicting for a new tuple
Association Rules
the patterns that emerge from boolean vectors; considered interesting if they satisfy a minimum support threshold and a minimum confidence threshold
P(H|X)
the posterior probability of hypothesis H conditional on data tuple X example: probability that customer X will buy a computer given that we know his age and income
P(G=T, R=T)
the probability of the joint events G and R
classification error
the probability that a classifier incorrectly classifies an object
classification accuracy
the probability that the classifier correctly classifies an object
P(C[i]|X)
the probability that tuple X belongs to class C[i] given that we know the attribute description of X
Classification
the process of finding a model that describes and distinguishes data classes or concepts; the model is derived based on the analysis of a set of training data
Dimensionality reduction
the process of reducing the number of random variables or attributes under consideration; methods include wavelet transforms and principal components analysis
Knowledge Discovery in Databases (KDD)
the process of semi-automatic extraction of knowledge from databases which is 1) valid, 2) previously unknown, and 3) potentially useful
perceptron
the simplest neural network possible: a computational model of a single neuron
learning constant
the speed of learning; a high learning constant goes with highly separable data, and the inverse is also true; usually between 0.001 and 2
density function (DENCLUE)
the sum of the influence of all data points
Drawback of DBSCAN and OPTICS
their density estimates based on counting the number of objects in a neighborhood defined by a radius parameter epsilon can be highly sensitive to the radius value used
how DENCLUE overcomes the drawback of DBSCAN and OPTICS
through kernel density estimation
time series data
data recorded over time; time perhaps has the widest range of possible values, mostly numerical, but it could also be categorical, as with a day of the week
Market Basket Analysis goal
to develop marketing strategies according to which items are frequently purchased together
main goal of PCA
to explore the statistical correlations among attributes and find the data representation that retains the maximum nonredundant and uncorrelated information
SVM goal
to find the best separating hyperplane for the training data
lagrangian multipliers
used to solve the SVM's constrained optimization (maximizing the margin by minimizing (1/2)||w||^2); introduce the variables a[i], which weight each data point by how much it contributes to the solution (so only the support vectors are included)
drawback of ISODATA
user has to provide several additional parameter values
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
uses compact summaries (clustering features) to describe micro clusters and arranges the summaries in a balanced tree
k-Medoids clustering
uses medoids as the representative element of the cluster
linear interpolation
using a linear equation to find the connection between two points; influenced by outliers
Transfer learning
using a pre-trained model on some task and fine tune it on a new task by removing only the last layer and keeping the basic feature extraction; used to avoid long training times
network data
vertices on a surface are connected to their neighbors via edges
how to calculate a new weight (perceptron)
new weight = weight + Δweight * learning constant, where Δweight = error * output (the output of the unit feeding that weight; for a perceptron this is the input value)
indicator of association rule strength
when both support and confidence criteria are satisfied
EM termination
when the clustering converges or the changes are very small
inverse document frequency
a word's frequency in a document normalized by how frequently it appears across all documents in the database
Hierarchical clustering method
works by grouping data objects into hierarchy or "tree" of clusters
formula to find the hyperplane
y[i](w * x[i] + b ) -1 = 0
Motivation of sampling
you can represent a large dataset by a much smaller subset; you can speed up automatic calculations performed in a later step
problem of k-Means that ISODATA fixes
you don't have to specify k
Silhouette Coefficient
s(o) = [b(o) - a(o)] / max{a(o), b(o)}, where a(o) = distance between object o and its cluster representative, and b(o) = distance between o and its "second best" cluster representative
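A minimal sketch of this representative-based silhouette, assuming Euclidean distance and hypothetical cluster representatives:

    import numpy as np

    def silhouette(o, representatives, own_idx):
        dists = [np.linalg.norm(np.asarray(o) - np.asarray(r)) for r in representatives]
        a = dists[own_idx]                                        # a(o): own representative
        b = min(d for i, d in enumerate(dists) if i != own_idx)   # b(o): "second best" representative
        return (b - a) / max(a, b)

    reps = [[0.0, 0.0], [5.0, 5.0]]
    print(silhouette([1.0, 1.0], reps, own_idx=0))   # ~0.75 -> reasonable cluster structure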
hyperplane
a "decision boundary" separating the tuples of one class from another
Clustering Feature Tree (CF tree)
a balanced tree that hierarchically arranges clustering features; each inner node contains at most a fixed number of [CF, child] entries
DENCLUE (DENsity-based CLUstEring)
a clustering method based on a set of density distribution functions
neural network
a collection of neuron-like processing units with weighted connections between the units
Itemsets
a collection of one or more items
geographic data
a data record which has an implicit or explicit association with a location relative to the Earth
k-distance diagram
a diagram that plots objects in descending order of the density around them (lowest density in the top left, highest in the bottom right); a change in slope marks a new density level for a cluster
GAN discriminator
a discriminative model in a GAN that learns the boundary between real data from the input and fake data from the generator; classifies yes/no fake or not
k-distance function
a heuristic method that considers the distance between each point in the dataset and its k-nearest neighbors; after finding this distance and ordering the points according to it, we can determine an appropriate epsilon and MinPts (minimum density) for DBSCAN
Recall
a measure of the ability of a system to present all relevant items; (number of relevant items retrieved) / (number of relevant items in the collection) = true positives / (true positives + false negatives)
Precision
a measure of the ability of a system to present only relevant items; (number of relevant items retrieved) / (total number of items retrieved) = true positives / (true positives + false positives)
Divisive method
a method of hierarchical clustering that initially lets all the given objects form one cluster, which is iteratively split into smaller clusters
Agglomerative method
a method of hierarchical clustering that starts with individual objects as clusters, which are iteratively merged to form larger clusters
Multiple-phase/ multiphase clustering
a method that tries to improve the clustering quality of hierarchical methods by integrating hierarchical clustering with other clustering techniques; two examples are BIRCH and Chameleon
multilayer feed-forward neural network
a neural network that consists of an input layer, an output layer, and an arbitrary number of hidden layers (usually 1) - feed-forward: none of the weights cycle backwards - fully connected: each unit provides input to each unit in the next layer
directly density reachable
a point that is a neighborhood point to a core point ; if we meet this requirement, then we also meet the density-reachable and density-connected requirement
density reachable
a point that is not a direct neighbor of a core point, but connected to that core point through another point
noise point (density based clustering)
a point that is not density reachable or density connected from another point
step function
a possible activation function f(x) = {0 if 0>x, 1 if x >=0}
sigmoid function
a possible activation function g(x) = 1 / (1 + e^-x)
dynamic regression model
a regression model that uses historical and new data to adapt the model
static regression model
a regression model that uses only historical data to calculate the function
Long Short-Term Memory
a type of RNN capable of learning order dependence in sequence prediction problems
Method of Wishart
a way to deal with the single-linkage problem of linking clusters because of outliers that form a bridge between them; works by identifying and removing points with low density around them before applying the algorithm
Cube Map
a way to implement DENCLUE by putting a grid over all data points, ignoring the empty squares and checking the density of the populated squares
Within-cluster variation
a way to measure the quality of cluster C by calculating the sum of squared error between all objects in C and the centroid
Factors comprising data quality
accuracy, completeness, consistency, timeliness, believability, and interpretability
Center defined cluster
after applying the noise threshold in DENCLUE, we identify a cluster for each local maximum
Self Organizing Maps (SOMs)
aka Kohonen maps; defines a "mesh" that serves as the basic layout for the pictorial representation and then lets it float in the data space such that the data gets well covered; uses a neural network to cluster data
conditional pattern base
all of the transformed prefix paths of item p which are accumulated by traversing the FP-tree by following the link of each frequent item p
how to determine the number of units in the input layer (neural network)
allocate one input for each domain value; for example: marriage status {married, widowed, divorced} => 3 input units
Partitioning algorithm
an algorithm that organizes a given dataset D of n objects into k partitions, where each partition represents a cluster (for example, k-means and k-medoids)
Parallel algorithm
an algorithm which can do multiple operations in a given time; as opposed to a traditional serial algorithm or a sequential algorithm; an algorithm may vary in how parallelizable it is
Single-linkage
an approach in which each cluster is represented by all the objects in the cluster, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters
scalar
an individual number in a data record; could be any data type
composition of a two layer neural network
an input layer, a hidden layer, and an output layer
k-itemset
an itemset that contains k-items
border object (DBSCAN)
an object that is density reachable from another object
core object (DBSCAN)
an object that has at least a predetermined number of neighbors (MinPts) within a predetermined distance epsilon
Clustering
analyzes data objects without consulting class labels; can be used to generate class labels; objects are grouped based on a principle of maximizing the intraclass similarity and minimizing the interclass similarity
EM cluster shapes
any kind of elliptical shape since we use the mean and std dev to define the dimensions of the Gaussian distribution
noise
any random error or variance in a measured variable
How many eigenvectors and values a matrix has
as many as the matrix has dimensions
competitive learning
as opposed to error-correction learning; in this model the neurons compete to assimilate the input; ex: SOMs
classification output for k-nearest neighbors
assigns the most common class among the k nearest neighbors to the tuple
Clustering for data cleaning
cluster the data and remove outliers, either automatically or via human inspection
variance
average of squared differences from the mean
Supervised learning
basically a synonym for classification; the supervision in the learning comes from the labeled examples in the training data set
prediction output for k-nearest neighbors
calculate the average value of the class attribute among the k nearest neighbors
data cleaning
can be applied to remove noise and correct inconsistencies in data
linear regression for smoothing
can be used to fill in missing values, classify and predict numeric values, reduce the amount of data by not storing all data values but just the model function, and smooth data
major drawback of perceptrons
can only solve linearly separable data - so they cannot solve XOR
data reduction
can reduce data size by, for instance, aggregating, eliminating redundant features, or clustering
Nominal
categorical data, no quantitative relationship between variables, classification without ordering
EM initialization
choose the number of clusters
k-means initialization
choose the number of clusters and initial centroids
discrete supervised learning example
classification eg decision tree, CNN, RNN
discrete unsupervised learning example
clustering eg k-means; SOM
Ward's Method
considers how dense the clusters are so that we can see whether merging gives us good variance; cost = sum of D(x, mu)^2, where x = a data point in the cluster and mu = the mean of the cluster
Multi-center defined cluster
consists of a set of center defined clusters that are linked by a path with significance ξ; means that when there are multiple local maxima above the noise threshold, we count them all as a single cluster
Eager learners
construct a generalization model given a set of training tuples before receiving new tuples to classify; decision tree induction, support vector machines
Curve fitting
constructing a curve, or mathematical function, that has the best fit to a series of data points possibly subject to constraints; can involve either interpolation,or smoothing in which a "smooth" function is constructed that approximately fits the data
brute force approach to frequent itemset generation
count the support of each of the M candidate itemsets by scanning the database of N transactions with W items per transaction
Support vectors
data vectors on the margin in an SVM - these are the only vectors that play a role in defining the hyperplane
Ordinal
data with attributes that can be rank-ordered; distances between values do not have any meaning
numeric
data with attributes that can be rank-ordered; distances between values have a meaning; mathematical operations are possible
Stratified Random Sampling
define strata based on some characteristics like education level, then sample within each strata; most effective when variability within strata are minimized, variability between strata are maximized, and the variables upon which the population is stratified are strongly correlated with the desired dependent variable
K-means algorithm
defines the centroid of a cluster as the mean value of the points within the cluster and iteratively improves the within-cluster variation until the cluster assignment is stable
DBSCAN efficiency
depends greatly on input parameters; a low epsilon will take longer to check everything, while a high epsilon will not take as long but will be less sensitive
perceptron's error
desired output - guessed output
continuous unsupervised learning example
dimensionality reduction eg PCA; AE, GAN
equal-width binning
divide the range into N intervals of equal size; if A and B are the lowest and highest values of the attribute, then the width of the intervals is (B - A) / N; outliers may dominate the presentation, and skewed data is not handled well
equal-depth binning
divides the range into N intervals, each containing roughly the same number of samples; skewed data is handled well
Temporal Multi-Dimensional Scaling (TMDS)
does dimensionality reduction while maintaining the relative distances of all the points; works for categorical data; an alternative clustering method
Pooling layers
downsample each feature map independently, reducing the height and width, keeping the depth intact
ways to determine k in k-nearest neighbors
experiment by increasing k and calculating the error rate; a low k is sensitive to outliers; a high k is affected by data points of different clusters/classes
Linear normalization
f(v) = (v - min) / (max - min), where v is an individual value
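A minimal min-max normalization sketch (the sample values are arbitrary); the logarithmic normalization above is the same formula applied to ln(v):

    import numpy as np

    def min_max(values):
        v = np.asarray(values, dtype=float)
        return (v - v.min()) / (v.max() - v.min())   # maps the values into [0, 1]

    print(min_max([10, 20, 25, 50]))   # [0.  0.25  0.375  1.]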
learned feature approach
features are learned from a feature hierarchy; all layers extract features from the output of the previous layer; all layers are trained jointly
difference between transfer learning and fine tuning
first is better when the new dataset is more different from the one the model was trained on; second works better on a very similar dataset
difference between PAM and CLARANS and CLARA
first iterates through every possible clustering combination, second uses sampling to increase algorithmic efficiency, last tries to seek a balance between these two extremes
Support
fraction of transactions that contain an itemset; example: s({Milk, Bread}) = 3/8; support(A => B) = P(A ⋃ B)
support count
frequency of occurrence of an itemset; example: support count({Milk, Bread}) = 3
how to search a hash tree (association rules)
given a tuple (for example: 3 5 9), apply the hash function to each value in the tuple sequentially until you reach a node; then check whether the tuple is present in the node
how to insert value in a hash tree (association rules)
given a tuple, apply the hash function to each value in the tuple sequentially until you reach a node. If the node contains the tuple already or has space for an additional tuple, insert the tuple. If not, split the node by adding another level of hash function and insert the tuple accordingly
FP-growth
grow long patterns from short ones using local frequent items
Convolutional filter
has a specific height and width (e.g., 5x5x3) and is 3D, with the depth matching the depth of the image; slides over the input to perform a convolution at every possible location, aggregating the results in a feature map
Silhouette coefficient interpretation
a high value indicates better clustering; > 0.5 indicates a reasonable cluster structure
Naive Bayes Classification problem
if one of the probabilities in the calculation is 0, then the whole product is 0; solution: add one tiny sample (a Laplace correction) so that every probability is non-zero
cluster ordering
in OPTICS, a linear list of all objects under analysis; objects in a denser cluster are listed closer to each other
core-distance
in OPTICS, the smallest epsilon value such that the epsilon neighborhood of an object p has at least MinPts objects; the minimum threshold that makes p a core object
winning neuron
in SOMs, the neuron that is most similar to the input vector
GAN generator
in a GAN, learns the distribution of the input data and is able to generate new data for the discriminator to evaluate
covariance matrix
in a PCA analysis, the covariance between all dimensions
latent space
in an encoder-decoder system, this captures the essential features necessary for reconstructing the input; for images it is hard for a human to interpret what these "essential features" would be
distance function
in clustering, a function that determines how close together or far apart data points are
clustering features
in the BIRCH algorithm, compact summaries that describe micro clusters by containing centroid and radius information; CF = (N, LS, SS), where N = number of points in C, LS = linear sum of the N points, SS = square sum of the N points
difference between classification and prediction
in the first we have discrete class labels, in the second we have continuous numerical values
postpruning
involves pruning the decision tree after construction by calculating the cost complexity
Prepruning
involves pruning the decision tree during construction by determining a stopping criterion that is based on minimum support or minimum confidence
Predictive analysis
involves the discovery of rules that underlie a phenomenon and form a predictive model that minimizes the error between the actual and predicted outcome, considering all possible inferring factors
confidence
confidence(A => B) = P(B|A) = support(A ⋃ B) / support(A)
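A small sketch of support and confidence over a hypothetical transaction database (items and counts are made up):

    transactions = [
        {"Milk", "Bread"},
        {"Milk", "Bread", "Butter"},
        {"Bread"},
        {"Milk", "Bread", "Eggs"},
    ]

    def support(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent):
        # support(A ∪ B) / support(A)
        return support(antecedent | consequent) / support(antecedent)

    print(support({"Milk", "Bread"}))        # 0.75 (3 of 4 transactions)
    print(confidence({"Milk"}, {"Bread"}))   # 1.0 (every Milk transaction also has Bread)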
Two ways to determine clusters after applying the noise threshold in DENCLUE
- Center defined cluster - Multi-center defined cluster
advantages of decision trees
- Good for applications where there are many attributes of unknown importance - Tolerant toward many correlated or noisy attributes - The structure allows for data clean-up. - Can reveal unexpected dependencies in the data which would be hidden in a more complex model -easy to understand the model - able to handle numerical and categorical data - inexpensive to construct
neural networks advantages
- High tolerance of noisy data - Can classify patterns on which they have not been trained - Can be used when you have little knowledge of the relationships between attributes - Well-suited for continuous inputs and outputs - Use parallelization techniques, which can speed up the computation process
SVMs Disadvantages
- Inefficient model construction - Long training times (~O(n^2)) - Model is hard to interpret - Learn only weights of features - Weights tend to be almost uniformly distributed
The four basic steps of the PCA
- Input data are normalized - PCA computes the principal vectors that provide a basis for the normalized input data; the input data are a linear combination of the principal components - The principal components are sorted in order of decreasing significance - Use only the strongest principal components to reconstruct a good approximation of the original data
K-nearest-neighbor advantages
- Local method: does not have to find a global decision function (decision surface) - High classification accuracy in many applications - Incremental: the classifier can easily be adapted to new training objects - Can also be used for prediction
Neural networks disadvantages
- Long training time - Require a number of parameters that are typically best determined empirically - Poor interpretability
deep learning advantages
- No manual feature extraction - Allows Machine Learning without feature engineering - Complex problems can be solved without much domain knowledge
DBSCAN advantages
- No need to specify the number of clusters in advance - Able to find arbitrarily shaped clusters - Able to detect noise
Differences between CNN and regular NN
- Not fully connected -> neurons in one layer only connect to a small region of the neurons in the following layer - The layers are organized in 3 dimensions: width, height and depth - The output is reduced to a single vector of probability scores
Disadvantages of stratified random sampling
- Requires selection of relevant stratification variable which can be difficult - Is not useful when there are no homogenous subgroups - Can be expensive to implement
SVMs advantages
- Strong mathematical foundation - Find global optimum - Scale well to high-dimensional datasets - High classification accuracy in many challenging applications - less prone to overfitting than other methods
Three dimensions of input for a CNN
- Width - x axis of the image - Height - y axis of the image - Depth - color channels in an image (RGB)
Support Vector Machines (SVMs)
- a method for the classification of both linear and nonlinear data - Uses nonlinear mapping to transform the original training data into a higher dimension - And searches for the linear optimal separating hyperplane - they are very slow but highly accurate, less prone to overfitting
Difference between RNN and regular NN
- first has no predetermined limit on input, second does - first can remember things learned from prior inputs, second only from training - first shares parameters across each layer of the network, second has different weights across each node
Information gain
- for attribute A, the biggest reduction in entropy compared to the original set - can never be negative - no matter what, entropy will decrease
Patterns that occur frequently in data
- itemsets - subsequences - substructures
k-means disadvantages
- need to specify k - unable to handle noisy data - cannot detect clusters with non convex shapes - applicable only when mean is defined - often terminates at a local optimum
steps in training GAN models
- a neural network trains the discriminator D (the "detective") on real input data R - another neural network trains the generator G on noise data I - then D decides whether its input is from R or is fake data - G tries to forge samples that D thinks are from R - G trains on the binary decision of D
k-means advantages
- relatively efficient - simple implementation
disadvantages to decision trees
- repetition of split criteria - replication of subtrees - large trees are difficult to analyze/ understand - overfitting
linkage based clustering algorithms
- single linkage - complete linkage - centroid linkage
how to determine the number of hidden layer units
- there are no clear rules - trial and error - rule of thumb is: 0.5 * (# inputs + # outputs)
neural network advantages
- tolerant against noise - well-suited for continuous data - algorithm is inherently parallel
how to determine the number of hidden layers
- usually only 1
advantages of stratified random sampling
-Focuses on important subpopulations and ignores irrelevant ones - Allows use of different sampling techniques for different subpopulations - Improves the accuracy of estimation
k-means terminates when...
... the cluster assignment is stable
how to determine the number of units in the output layer
1 output unit is sufficient for two classes; for more than 2 classes, use one output unit for each class
steps to Naive Bayes Classification
1) Calculate P(C[i]) for each possible i (ex: probability that buys_computer = yes) 2) Calculate P(X|C[i]) for each possible i by multiplying the probabilities of each attribute (e.g., P(age = youth | buys_computer = yes)) 3) Multiply each P(C[i]) by its corresponding P(X|C[i]) and compare the results
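A minimal sketch of these three steps on a tiny made-up training table (attribute values and labels are illustrative only, and no Laplace correction is applied):

    from collections import Counter

    data = [("youth", "high", "no"), ("youth", "low", "yes"),
            ("senior", "high", "yes"), ("senior", "low", "yes")]

    def naive_bayes(x):
        scores = {}
        classes = Counter(row[-1] for row in data)
        for c, n in classes.items():
            p = n / len(data)                          # 1) P(C[i])
            for j, value in enumerate(x):              # 2) P(X|C[i]) as a product over attributes
                p *= sum(1 for row in data if row[-1] == c and row[j] == value) / n
            scores[c] = p                              # 3) P(C[i]) * P(X|C[i])
        return max(scores, key=scores.get)

    print(naive_bayes(("youth", "low")))   # "yes"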
3 basic requirements of an RNN
1) Can store information for an arbitrary duration 2) Is resistant to noise 3) Its parameters are trainable in reasonable time
FP-tree benefits
1) Completeness 2) compactness
Naive Bayes advantages
1) Fast to train and classify 2) Performance is similar to decision trees and neural networks 3) Easy to implement 4) handles numeric and categorical data 5) useful for very large datasets
two influence functions
1) Gaussian influence function 2) square wave influence function
Structure of Encoder Decoder systems
1) Input layer 2) Encoder 3) Latent space 4) Decoder 5) output
3 distance functions for numeric attributes
1) Minkowski distance 2) Euclidean distance 3) Manhattan distance
Two components of CNNs
1) The hidden layers/ feature extraction part 2) The classification part
ways to evaluate a classification model
1) accuracy 2) speed 3) robustness 4) scalability 5) interpretability
Naive Bayes Classification disadvantages
1) assumes class conditional independence, therefore loss of accuracy 2) model is difficult to interpret
Ways to handle noisy data
1) binning 2) regression 3) clustering
types of splits in a decision tree
1) boolean splits (Married: yes or no) 2) nominal splits: (married: never, divorced, widow) 3) split on continuous attributes (temperature <= 80 or > 80)
steps to identify an attribute to split on
1) calculate entropy for the whole data set (p[i] = yes/ all, p[i] = no/all) 2) calculate Info[A](D) for each attribute A 3) calculate Information gain for each attribute 4) select the attribute with the highest information gain
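A small sketch of steps 1-3 (entropy and information gain) on made-up rows, where the last element of each row is the class label:

    import math
    from collections import Counter

    def entropy(labels):
        total = len(labels)
        return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

    def info_gain(rows, attr_idx):
        base = entropy([r[-1] for r in rows])                 # 1) entropy of the whole data set
        split = {}
        for r in rows:
            split.setdefault(r[attr_idx], []).append(r[-1])
        info_a = sum(len(part) / len(rows) * entropy(part)    # 2) Info[A](D)
                     for part in split.values())
        return base - info_a                                  # 3) information gain for attribute A

    rows = [("sunny", "no"), ("sunny", "no"), ("rain", "yes"), ("rain", "yes")]
    print(info_gain(rows, 0))   # 1.0: this attribute separates the classes perfectly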
steps to training a multilayer feed-forward neural network
1) calculate the input for the hidden layer 2) calculate the output for the hidden layer 3) calculate the input for the output layer 4) calculate the output for the output layer 5) calculate the error for the output layer 6) calculate the error for the hidden layer 7) calculate new weights
steps to k-means
1) choose the number of clusters 2) reassign each object to the cluster to which the object is most similar, based on the mean value of the objects in the cluster 3) update the cluster means (which don't have to be real data points) 4) repeat until no change
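A minimal NumPy sketch of these steps (toy points and a fixed seed; no handling of empty clusters):

    import numpy as np

    def k_means(X, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), k, replace=False)]        # 1) pick initial centroids
        for _ in range(iters):
            dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
            labels = dists.argmin(axis=1)                          # 2) assign each object to the closest mean
            new = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # 3) update the cluster means
            if np.allclose(new, centroids):                        # 4) repeat until no change
                break
            centroids = new
        return labels, centroids

    X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
    print(k_means(X, 2))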
Naive Bayes classifier assumptions
1) class conditional independence
two types of decision trees
1) classification 2) regression
steps to Gini gain calculation
1) compute the impurity of D training data set 2) find the "best" splitting criterion by computing the gini impurity for each attribute
steps to creating a decision tree
1) create a node N 2) fill this node; if all elements in this node are in the same class, then we finish 3) else, apply a selection method to decide which attribute is the best to split on
steps in expectation maximization algorithm
1) create initial clustering by projecting the data onto the Gaussian distributions 2) calculate the probability of each data point being assigned to each cluster 3) calculate new clustering 4) Calculate E(M) and E(M') 5) repeat until maximization
Goals of Cluster Analysis
1) data understanding 2) data class identification 3) data reduction 4) outlier detection 5) noise detection
Two step process to association rule mining
1) find all frequent itemsets (minimum support) 2) Generate strong association rules from the frequent itemsets (minimum support , minimum confidence)
Three gates in an LSTM
1) forget gate 2) input gate 3) output gate
steps to single-linkage
1) form initial clusters each of a single object and compute the distance between each pair of clusters 2) merge the two clusters that have the smallest distance 3) calculate the distances between each cluster 4) if there is only one cluster stop, otherwise return to step 2
Ways to deal with missing values
1) ignore the tuple 2) enter the value manually 3) use a global constant 4) use attribute mean 5) use the most probable value
CNN applications
1) image classification 2) object detection
steps to SOMs
1) initialize weights randomly 2) randomly choose a data input 3) find the most similar neuron 4) update the weights of the winning neuron and all neighbors making the neurons more similar to the input
LSTM structure
1) input layer 2) hidden layer which contains memory cells and corresponding gate units 3) an output layer
Training steps for perceptron (supervised learning)
1) input the training data 2) ask the perceptron to guess an answer 3) compute the error (comparing it to the ground truth) 4) adjust the weights according to the error 5) return to step 1 and repeat
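A minimal sketch of this loop, assuming two inputs plus a bias and the AND function as training data (the learning constant and epoch count are arbitrary choices):

    import random

    random.seed(1)
    weights = [random.uniform(-1, 1) for _ in range(3)]     # random initial weights (incl. bias)
    LEARNING_CONSTANT = 0.5
    data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

    for _ in range(100):                                    # 5) return to step 1 and repeat
        for (x1, x2), target in data:                       # 1) input the training data
            inputs = (x1, x2, 1)                            # the constant 1 feeds the bias weight
            s = sum(w * x for w, x in zip(weights, inputs))
            guess = 1 if s >= 0 else 0                      # 2) the perceptron guesses (step function)
            error = target - guess                          # 3) error = desired - guessed
            weights = [w + error * x * LEARNING_CONSTANT    # 4) adjust the weights by the error
                       for w, x in zip(weights, inputs)]

    print([1 if sum(w * x for w, x in zip(weights, (a, b, 1))) >= 0 else 0
           for (a, b), _ in data])                          # expected: [0, 0, 0, 1]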
Properties of Clusters
1) may have different sizes, shapes, densities 2) may form a hierarchy 3) may be overlapping or disjoint
types of sampling
1) non-probabilistic samples 2) probabilistic samples
5 types of RNN
1) one-to-one 2) one to many 3) many to one 4) many to many 5) many to many
ways to deal with overfitting in decision trees
1) prepruning 2) postpruning
ways to improve the efficiency of frequent itemset generation
1) reduce the number of candidates (M) 2) Reduce the number of transactions (N) 3) reduce the number of comparisons (NM)
Types of data reduction
1) reduction of the number of data points 2) reduction of the number of dimensions
Possible solutions to the introduction of new mins and maxs after normalization
1) rerun the normalization 2) assign a global constant to higher/lower values 3) based on experience, increase the value range
typical data classes
1) scalar 2) multivariate and multidimensional data 3) vectors 4) network data 5) hierarchical data 6) time series data 7) geographic data
Apriori algorithm
1) scan the DB once to get the frequent 1-itemsets 2) generate length (k+1) candidate itemsets from the length-k frequent itemsets 3) test the candidates against the database to see whether they are frequent based on the minimum support; if they are, continue to the next k; if not, they are pruned 4) terminate when no frequent or candidate set can be generated
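A compact sketch of this level-wise loop on a made-up database (the candidate generation here skips the full subset-pruning step for brevity):

    db = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C", "D"}]
    minsup = 2   # minimum support as an absolute count

    def frequent(candidates):
        return {c for c in candidates if sum(c <= t for t in db) >= minsup}

    L = frequent({frozenset([item]) for t in db for item in t})   # 1) frequent 1-itemsets
    all_frequent, k = set(L), 1
    while L:
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}  # 2) generate (k+1)-candidates
        L = frequent(candidates)                                           # 3) test against the database
        all_frequent |= L
        k += 1                                                             # 4) stop when nothing is left

    print(sorted(sorted(s) for s in all_frequent))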
FP-Trees steps
1) scan the database and find the set of frequent items (1-itemsets) and their support counts 2) sort the frequent 1-itemsets in descending order of support, giving a list L 3) scan the database again; each transaction is processed in L order to build the tree
steps to k-medoids
1) select an object as cluster representative 2) assign data points to the closest centroid 3) calculate the cost function 4) repeat until there is no improvement in the cost function
types of association rules
1) single-dimensional 2) multi-dimensional 3) boolean/ binary 4) quantitative
activation functions (neural network)
1) step function 2) sigmoid function
steps for calculating the PCA
1) subtract the mean 2) calculate the covariance matrix 3) calculate the eigenvectors and eigenvalues 4) form a feature vector 5) derive the new data set
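A minimal NumPy sketch of these five steps on made-up 2-D data, keeping only the strongest principal component:

    import numpy as np

    X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

    X_centered = X - X.mean(axis=0)                  # 1) subtract the mean
    cov = np.cov(X_centered, rowvar=False)           # 2) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # 3) eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1]                # sort by decreasing significance
    feature_vector = eigvecs[:, order[:1]]           # 4) form a feature vector (top component)
    new_data = X_centered @ feature_vector           # 5) derive the new, 1-D data set
    print(new_data.ravel())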
three components of a perceptron
1) the input 2) the weighted sum of the inputs 3) the activation function that generates the output
Specification necessary before training a neural network
1) the number of units in the input layer 2) the number of hidden layers 3) the number of units in each hidden layer 4) the number of units in the output layer
Two weaknesses of RNNs that LSTMs overcome
1) vanishing gradients 2) exploding gradients
distance functions for text documents
1) vector of term frequencies per text document 2) term frequency and inverse document frequency 3) cosine similarity
formula used to maximize the width of the margin
maximize 2/||w||, or equivalently minimize (1/2)||w||^2 (to make it a quadratic problem with a unique solution)
confidence interpretation (buys(X, "computer") => buys(X, "software"), confidence = 50%)
50% because if a customer buys a computer, there is a 50% chance that he will buy software as well
calculate P(H|X) given P(H), P(X|H), and P(X)
P(H|X) = P(X|H) * P(H) / P(X)
Recurrent neural network
A class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence; applicable to handwriting recognition or speech recognition
Convolutional Neural Network
A class of deep neural networks, most commonly applied to analyzing visual imagery; convolutional layers make it distinct - prone to overfitting
Generative Adversarial network
A class of machine learning frameworks in which two neural networks contest with each other in a game
Covariance interpretation
A positive value means that both dimensions increase together
squared normalization
(sqrt(v) - sqrt(min)) / (sqrt(max) -sqrt(min))
K-nearest-neighbor disadvantages
- Application of classifier expensive, requires k-nearest neighbor query - Does not generate explicit knowledge about the classes
Steps to DENCLUE
- Apply the influence function to all data points - Sum up the influence functions to obtain the density curve (density function) - Apply a noise threshold - Identify the local maximums and determine clusters accordingly
Dendrogram
A tree of nodes representing clusters, satisfying the following properties: the root represents the whole data set; leaf nodes represent clusters containing a single object; inner nodes represent the union of all objects contained in their corresponding subtrees
Self-organizing map
A type of ANN that is trained using unsupervised learning to produce a low-dimensional, discretized representation of the input space of the training samples, called a map, and is therefore a method to do dimensionality reduction; use competitive learning instead of backpropagation; useful for visualization
Autoencoder neural network
A type of ANN used to learn efficient data codings in an unsupervised manner; learns a representation for a set of data, typically for dimensionality reduction, by training the network to ignore noise
Formula for finding eigenvalues
A v = λ v, where A = a square matrix, v = an eigenvector, λ = the corresponding eigenvalue
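A quick NumPy check of this relation on an arbitrary example matrix:

    import numpy as np

    A = np.array([[2.0, 1.0], [1.0, 2.0]])
    eigenvalues, eigenvectors = np.linalg.eig(A)
    v, lam = eigenvectors[:, 0], eigenvalues[0]
    print(np.allclose(A @ v, lam * v))   # True: A v = λ v holds for each pair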
Clustering LARge Applications based upon RANdomized Search (CLARANS)
A way to implement k-medoids with greater efficiency than PAM and greater effectiveness than CLARA by specifying the number of iterations and the sample
frequent itemset
An itemset whose support is greater than or equal to a minsup threshold
"Encoder" or "Analyze" (in a NN name)
Neural Network to extract meaningful information from noisy input
"Decoder" or "Synthese" (in a NN name)
Neural Network to reconstruct the original input from the extracted features
calculate the probability of the joint events P(G=T, S=T, R=T)
P(G=T | S=T, R=T) * P(S=T | R=T) * P(R=T)
algorithms that implement k-medoids
PAM, CLARA, CLARANS
RNN Advantages
- Possibility of processing input of any length - Model size does not increase with the size of the input - Computation takes into account historical information - Weights are shared across time
difference between random and systematic error
Random noise would fall all over the place, biased would fall in a specific way
Bayes Classification
Classify a given tuple based on the probability of its attributes occurring with the class labels
RNN disadvantages
- Computation is slow - Difficulty accessing information from a long time ago - Cannot consider any future input for the current state
KDD process model
Data -> Target Data -> Preprocessed Data -> Transformed Data -> Patterns -> Knowledge
Market Basket Analysis
Finds associations between the different items that customers place in their "shopping baskets"
neighborhood function options for SOMs
Gaussian function, hat function, and other options
problem with association rule mining
Given a large dataset, the possible number of frequent itemsets is exponential in the number of possible attribute-values
Apriori principle
If an itemset is frequent, then all of its subsets must also be frequent; so, if an itemset is infrequent, its supersets need not be tested
influence function (DENCLUE)
In DENCLUE, a function applied to each data point that describes how much "influence" it has on the other data points in its neighborhood
entropy
Info(D) = - sum of p[i] * log2(p[i]), where p[i] = the probability that an arbitrary tuple in D belongs to class C[i]; for two classes it is a value between 0 and 1, where 1 means the most uncertainty and 0 means we know everything; the average amount of information needed to identify the class label of a tuple in D
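A worked numeric example (the 9-yes / 5-no class split is just an illustrative count): Info(D) = -(9/14) * log2(9/14) - (5/14) * log2(5/14) ≈ 0.940 bits.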
Calculate the centroid in BIRCH
LS / N
cluster random sampling
sampling method where clusters are randomly selected; - a cheap method when it is geographically convenient - easy to increase the sample size - least representative of the population
Principal Component Analysis
searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k <= n; this then allows the original data to be projected onto a much smaller space, resulting in dimensionality reduction
K-nearest-neighbor
searches the patterns space for the k training tuples that are closest to the unknown tuple
LSTM applications
sequence prediction (in general) specifically, Language modeling, speech recognition, and machine translation
steps to DBSCAN
Check an unclassified point; if other points are density-reachable from it, connect them into a cluster and keep expanding until no more reachable points remain; repeat with the next unclassified point
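A minimal usage sketch, assuming scikit-learn is available (the points, eps, and min_samples/MinPts values are made up):

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
                  [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])     # the last point is isolated

    labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
    print(labels)   # two clusters, plus -1 for the noise point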
Complete Linkage
Calculate the distance between two clusters based on their two furthest-apart points; then take the smallest of those distances and merge those two clusters
centroid-linkage
Calculate the mean of each cluster, then find the distances between the means and merge the clusters with the minimum distance
DBSCAN disadvantages
Cannot cluster data sets well with large differences in densities
Fine Tuning
similar to transfer learning; involves taking a pre-trained model and, if we know what a certain layer does for the dataset, freezing it; works in situations in which you have multiple similar datasets
support interpretation (buys(X, "computer")=> buys (X, "software") Support = 1%)
It is 1% because 1% of all transactions under analysis showed that computer and software were purchased together
systematic random sampling
order the data according to a certain characteristic (e.g., income) and then take every k-th element, where k = (the number of data points) / (the number of samples); has the same problem as simple random sampling: there's a chance the sample won't describe the population
problem with decision trees
overfitting
M-fold cross validation
partition dataset O into m same-size subsets; train m different classifiers using the different subsets; combine the evaluation results of the m classifiers
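A minimal sketch of this procedure; the train / evaluate callables and the majority-class dummy model are hypothetical stand-ins:

    import numpy as np

    def m_fold_cross_validation(X, y, m, train, evaluate, seed=0):
        idx = np.random.default_rng(seed).permutation(len(X))
        folds = np.array_split(idx, m)                       # partition into m same-size subsets
        scores = []
        for i in range(m):
            test = folds[i]
            rest = np.concatenate([folds[j] for j in range(m) if j != i])
            model = train(X[rest], y[rest])                  # train on the other m-1 subsets
            scores.append(evaluate(model, X[test], y[test]))
        return np.mean(scores)                               # combine the m evaluation results

    X = np.arange(20, dtype=float).reshape(10, 2)
    y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
    train = lambda Xtr, ytr: np.bincount(ytr).argmax()       # dummy majority-class "model"
    evaluate = lambda model, Xte, yte: float(np.mean(yte == model))
    print(m_fold_cross_validation(X, y, m=5, train=train, evaluate=evaluate))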
Pooling
performed to reduce dimensionality in a CNN; has no parameters, we just specify the window size and stride
Feature map
produced by applying a convolution on the input data with a convolution filter; its size depends on the input size, filter size, stride, and padding; an input will have multiple of these as multiple convolutions are performed on it; when they are stacked together they become the final output of the convolution layer
Ordering Points to Identify the Clustering Structure (OPTICS)
proposed to overcome the difficulty in using one set of global parameters in clustering analysis; determines the clustering for different density-parameters and shows the results in a visual representation
how weights are initialized (neural network)
randomly
least square error minimization
refers to minimizing the sum of the squared differences between the real y values and the estimated y values
density connected
refers to the connection between two points that are both density reachable from a shared object o
continuous supervised learning example
regression eg linear regression; CNN, RNN
hierarchical data
relationships between nodes in a hierarchy can be specified by links
k-means efficiency
relatively efficient