Data Mining

Ward's method interpretation

Two clusters that are very far apart would have a high variance, which would indicate that we shouldn't merge them

Deep Learning

Subset of Machine Learning; set apart by the multiple layers in a neural network

MinPts

User-specified parameter which indicates the density threshold of dense regions; how many neighbors must a point have for it to be considered a core point

Bayesian Belief Networks

Using conditional probability to identify the probability that an event occurs given another event

DENCLUE and noise

it is really good at handling noise, even when there is a lot of uniformly distributed noise; the only limitation would be biased noise, which would throw the algorithm off

k-means and noise

it is unable to handle noisy data

principal vectors

k orthonormal vectors, that provide a basis for the normalized input data

Gabor function

like a Gaussian function in 2D; it can be 3D or higher in multidimensional space

logarithmic normalization

(ln(v) - ln(min)) / (ln(max) - ln(min))

RNN applications

machine translation, speech recognition, handwriting recognition, image captioning

covariance

measure of how much two variables change together

Iterative Self-Organizing Data Analysis Technique (ISODATA)

merges clusters if either the number of members in a cluster is less than a certain threshold or if the centers of two clusters are closer than a certain threshold

Data integration

merges data from multiple sources into a coherent data store, such as a data warehouse

vectors

multiple data points that belong together and are not separable (like a phone number)

how to determine initial clustering

multiple sampling so that you avoid the chance the sample is not representative

multivariate and multidimensional data

multiple variables that may or may not have any relationship

lazy learner examples

nearest-neighbor classifiers, case-based reasoning classifiers

preprocessing step for k-nearest neighbors

normalize values (min-max normalization works)

Goal of segmentation

simplify the image into something more meaningful and easier to analyze

Binning

sort data and partition into (equi-depth) bins and then smooth by bin means, bin median, bin boundaries, etc; a local smoothing method because it consults only the local neighborhood of data points

Minkowski Distance

( sum of |x[i] - y[i]|^p )^(1/p)

Euclidean distance

sqrt( sum of (x[i] - y[i])^2 )

Measure for compactness of a cluster (cost function)

sum over all points p in cluster C of dist(p, m[C]), where m[C] is the representative of cluster C

Measure for compactness of clustering (cost function)

sum of the compactness of each cluster C

Manhattan distance

sum of |x[i] - y[i]|; a step-wise function
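
A minimal Python sketch of the three distance formulas above (Euclidean and Manhattan are just Minkowski with p = 2 and p = 1); the example points are made up for illustration:

```python
# Minkowski distance: (sum of |x[i] - y[i]|^p)^(1/p)
def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def euclidean(x, y):   # special case p = 2
    return minkowski(x, y, 2)

def manhattan(x, y):   # special case p = 1
    return minkowski(x, y, 1)

print(euclidean([1, 2], [4, 6]))  # 5.0
print(manhattan([1, 2], [4, 6]))  # 7.0
```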

Unsupervised learning

takes unlabeled data as input and outputs a grouping of data with decision boundaries

feature map

the "hidden layers" of SOMs; the neurons in this are connected to each other and to each input

Centroid

the center point of a cluster, which can be defined in many ways, including as the mean or medoid of the objects assigned to the cluster

convolutional layer

the core building block of the CNN; consists of learnable filters; each filter is convolved across the width and height of the input volume, computing the dot product between the entries of the filter and the input and producing a 2D activation map of that filter

Density estimation

the estimation of an unobservable underlying probability density function based on a set of observed data ; in this context, the unobservable underlying probability density function is the true distribution of the population of all possible objects to be analyzed

difference between neural networks and curve fitting/ regression

the first can model non-linear relationships, the second can't

Lazy learner

the learner waits until the last minute before doing any model construction to classify a given test tuple; only performs a generalization when given a test tuple; also called instance-based learners, even though all learning is essentially based on instances

simple random sampling

the least biased sampling method; the problem is the chance that the sample does not describe the whole population

Margin

the line that intersects with the support vectors

Max pooling

the most common type of pooling; slides a window over its input, and takes the max value in the input

Order 2 neighbors

the neurons that are one step removed from those directly linked to the neuron of interest in a feature map

Order 1 neighbors

the neurons that are directly linked to the neuron of interest in a feature map

k in k-nearest neighbors

the number of nearest neighbors considered when classifying (or predicting for) a tuple

Association Rules

the patterns that emerge from boolean vectors; considered interesting if they satisfy a minimum support threshold and a minimum confidence threshold

P(H|X)

the posterior probability of hypothesis H conditional on data tuple X example: probability that customer X will buy a computer given that we know his age and income

P(G=T, R=T)

the probability of the joint events G and R

classification error

the probability that a classifier incorrectly classifies an object

classification accuracy

the probability that the classifier correctly classifies an object

P(C[i]|X)

the probability that tuple X belongs to class C[i] given that we know the attribute description of X

Classification

the process of finding a model that describes and distinguishes data classes or concepts; the model is derived based on the analysis of a set of training data

Dimensionality reduction

the process of reducing the number of random variables or attributes under consideration; methods include wavelet transforms and principal components analysis

Knowledge Discovery in Databases (KDD)

the process of semi-automatic extraction of knowledge from databases which is 1) valid, 2) previously unknown, and 3) potentially useful

perceptron

the simplest neural network possible: a computational model of a single neuron

learning constant

the speed of learning; a high value goes with highly separable data, and the inverse is also true; usually between 0.001 and 2

density function (DENCLUE)

the sum of the influence of all data points

Drawback of DBSCAN and OPTICS

their density estimates based on counting the number of objects in a neighborhood defined by a radius parameter epsilon can be highly sensitive to the radius value used

how DENCLUE overcomes the drawback of DBSCAN and OPTICS

through kernel density estimation

time series data

time perhaps has the widest range of possible values; mostly numerical, but could also be categorical like with a day of the week

Market Basket Analysis goal

to develop marketing strategies according to which items are frequently purchased together

main goal of PCA

to explore the statistical correlations among attributes and find the data representation that retains the maximum nonredundant and uncorrelated information

SVM goal

to find the best separating hyperplane for the training data

lagrangian multipliers

used to solve the constrained optimization that maximizes the margin of the SVM; introduce the variable a[i], which weights each of the data points by how much it contributes to the solution (so only the support vectors are included)

drawback of ISODATA

user has to provide several additional parameter values

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)

uses compact summaries to describe micro-clusters and arranges the summaries in a balanced tree

k-Medoids clustering

uses medoids as the representative element of the cluster

linear interpolation

using a linear equation to find the connection between two points; influenced by outliers

Transfer learning

using a pre-trained model on some task and fine-tuning it on a new task by removing only the last layer and keeping the basic feature extraction; used to avoid long training times

network data

vertices on a surface are connected to their neighbors via edges

how to calculate a new weight (perceptron)

new weight = weight + (error * input) * learning constant; i.e., delta weight = error * input

indicator of association rule strength

when both support and confidence criteria are satisfied

EM termination

when the clustering converges or the changes are very small

inverse document frequency

a word's frequency in a document normalized by how frequently the word appears across all documents in the database

Hierarchical clustering method

works by grouping data objects into a hierarchy or "tree" of clusters

formula to find the hyperplane

y[i](w * x[i] + b ) -1 = 0

Motivation of sampling

you can represent a large dataset by a much smaller subset; you can speed up automatic calculations performed in a later step

problem of k-Means that ISODATA fixes

you don't have to specify k

Silhouette Coefficient

[ b(o) - a(o) ] / max{a(o), b(o)}, where a(o) = distance between object o and its cluster representative, and b(o) = distance between object o and its "second best" cluster representative
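
A minimal Python sketch of this representative-based silhouette formula; the point, the representatives, and the distance function are assumptions for illustration:

```python
def silhouette(o, own_rep, second_rep, dist):
    a = dist(o, own_rep)      # a(o): distance to the object's own cluster representative
    b = dist(o, second_rep)   # b(o): distance to the "second best" cluster representative
    return (b - a) / max(a, b)

def euclid(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

print(silhouette((1.0, 1.0), (1.5, 1.0), (5.0, 5.0), euclid))  # ~0.91, i.e. well clustered
```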

hyperplane

a "decision boundary" separating the tuples of one class from another

Clustering Feature Tree (CF tree)

a balanced tree that hierarchically arranges clustering features; each inner node contains at most a fixed number of CF entries, each with a pointer to a child node

DENCLUE (DENsity-based CLUstEring)

a clustering method based on a set of density distribution functions

neural network

a collection of neuron-like processing units with weighted connections between the units

Itemsets

a collection of one or more items

geographic data

a data record which has an implicit or explicit association with a location relative to the Earth

k-distance diagram

a diagram that plots objects in ascending order of the density around them (lowest density in the top left, highest in the bottom right); a change in slope marks a new density level for a cluster

GAN discriminator

a discriminative model in a GAN that learns the boundary between real data from the input and fake data from the generator; classifies yes/no fake or not

k-distance function

a heuristic method that considers the distance between each point in the dataset and its k-nearest neighbors; after finding this distance and ordering the points according to it, we can determine an appropriate epsilon and MinPts (minimum density) for DBSCAN

Recall

a measure of the ability of a system to present all relevant items; (number of relevant items retrieved) / (number of relevant items in collection) = true positives / (true positives + false negatives)

Precision

a measure of the ability of a system to present only relevant items; (number of relevant items retrieved) / (total number of items retrieved) = true positives / (true positives + false positives)
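
A small Python sketch of both measures from raw counts; the counts are invented for illustration:

```python
def precision(tp, fp):
    # relevant items retrieved / total items retrieved
    return tp / (tp + fp)

def recall(tp, fn):
    # relevant items retrieved / relevant items in the collection
    return tp / (tp + fn)

# e.g. 8 relevant items retrieved, 2 irrelevant retrieved, 4 relevant items missed
print(precision(tp=8, fp=2))  # 0.8
print(recall(tp=8, fn=4))     # ~0.67
```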

Divisive method

a method of hierarchical clustering that initially lets all the given objects form one cluster, which is iteratively split into smaller clusters

Agglomerative method

a method of hierarchical clustering that starts with individual objects as clusters, which are iteratively merged to form larger clusters

Multiple-phase/ multiphase clustering

a method that tries to improve the clustering quality of hierarchical methods by integrating hierarchical clustering with other clustering techniques; two examples are BIRCH and Chameleon

multilayer feed-forward neural network

a neural network that consists of an input layer, an output layer, and an arbitrary number of hidden layers (usually 1) - feed-forward: none of the weights cycle backwards - fully connected: each unit provides input to each unit in the next layer

directly density reachable

a point that is a neighborhood point to a core point ; if we meet this requirement, then we also meet the density-reachable and density-connected requirement

density reachable

a point that is not a direct neighbor of a core point, but connected to that core point through another point

noise point (density based clustering)

a point that is not density reachable or density connected from another point

step function

a possible activation function f(x) = {0 if 0>x, 1 if x >=0}

sigmoid function

a possible activation function g(x) = 1 / (1 + e^-x)
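
A quick Python sketch of the two activation functions on these cards:

```python
import math

def step(x):
    return 1 if x >= 0 else 0       # f(x) = {0 if x < 0, 1 if x >= 0}

def sigmoid(x):
    return 1 / (1 + math.exp(-x))   # g(x) = 1 / (1 + e^-x)

print(step(-0.3), step(0.3))  # 0 1
print(sigmoid(0))             # 0.5
```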

dynamic regression model

a regression model that uses historical and new data to adapt the model

static regression model

a regression model that uses only historical data to calculate the function

Long Short-Term Memory

a type of RNN capable of learning order dependence in sequence prediction problems

Method of Wishart

a way to deal with the single-linkage problem of linking clusters because of outliers that form a bridge between clusters; works by identifying and removing points with low density around them before applying the algorithm

Cube Map

a way to implement DENCLUE by putting a grid over all data points, ignoring the empty squares, and checking the density of the populated squares

Within-cluster variation

a way to measure the quality of cluster C by calculating the sum of squared error between all objects in C and the centroid

Factors comprising data quality

accuracy, completeness, consistency, timeliness, believability, and interpretability

Center defined cluster

after applying the noise threshold in DENCLUE, we identify a cluster for each local maximum

Self Organizing Maps (SOMs)

aka Kohonen maps; defines a "mesh" that serves as the basic layout for the pictorial representation and then lets it float in the dataspace such that the data gets well covered; uses a neural network to cluster data

conditional pattern base

all of the transformed prefix paths of item p which are accumulated by traversing the FP-tree by following the link of each frequent item p

how to determine the number of units in the input layer (neural network)

allocate one input for each domain value; for example: Marriage status {married, widowed, divorced} => 3 input units

Partitioning algorithm

an algorithm that organizes a given dataset D of n objects into k partitions, where each partition represents a cluster (for example, k-means and k-medoids)

Parallel algorithm

an algorithm which can do multiple operations in a given time; as opposed to a traditional serial algorithm or a sequential algorithm; an algorithm may vary in how parallelizable it is

Single-linkage

an approach in which each cluster is represented by all the objects in the cluster, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters

scalar

an individual number in a data record; could be any data type

composition of a two layer neural network

an input layer, a hidden layer, and an output layer

k-itemset

an itemset that contains k items

border object (DBSCAN)

an object that is not itself a core object but is density reachable from a core object (it lies within the epsilon-neighborhood of a core object)

core object (DBSCAN)

an object that within a predetermined distance epsilon has a predetermined number of neighbors

Clustering

analyzes data objects without consulting class labels; can be used to generate class labels; objects are grouped based on a principle of maximizing the intraclass similarity and minimizing the interclass similarity

EM cluster shapes

any kind of elliptical shape since we use the mean and std dev to define the dimensions of the Gaussian distribution

noise

any random error or variance in a measured variable; also called bias

How many eigenvectors and values a matrix has

as many as the matrix has dimensions

competitive learning

as opposed to error correction learning; in this model we compete for assimilating the input ex: SOMs

classification output for k-nearest neighbors

assigns the most common class among the k nearest neighbors to the tuple

Clustering for data cleaning

cluster the data, automatically or via human inspection, and remove outliers

variance

average of squared differences from the mean

Supervised learning

basically a synonym for classification; the supervision in the learning comes from the labeled examples in the training data set

prediction output for k-nearest neighbors

calculate the average value of the class attribute

data cleaning

can be applied to remove noise and correct inconsistencies in data

linear regression for smoothing

can be used to fill in missing values, classify and predict numeric values, reduce the amount of data by not storing all data values but just the model function, and smooth data

major drawback of perceptrons

can only solve linearly separable data - so they cannot solve XOR

data reduction

can reduce data size by, for instance, aggregating, eliminating redundant features, or clustering

Nominal

categorical data, no quantitative relationship between variables, classification without ordering

EM initialization

choose the number of clusters

k-means initialization

choose the number of clusters and initial centroids

discrete supervised learning example

classification eg decision tree, CNN, RNN

discrete unsupervised learning example

clustering eg k-means; SOM

Ward's Method

considers how dense the clusters are so that we can see whether merging gives us good variance; sum( D(x, mu)^2 ), where x = data point in the cluster and mu = mean of the cluster

Multi-center defined cluster

consists of a set of center defined clusters, which are linked by a path with significance ξ; means that when there are multiple local maxima above the upper noise threshold, we count them all as a single cluster

Eager learners

construct a generalization model given a set of training tuples before receiving new tuples to classify; decision tree induction, support vector machines

Curve fitting

constructing a curve, or mathematical function, that has the best fit to a series of data points, possibly subject to constraints; can involve either interpolation, or smoothing, in which a "smooth" function is constructed that approximately fits the data

brute force approach to frequent itemset generation

count the support of each of the M candidates by scanning the database of N transactions, with W items in each row

Support vectors

data vectors on the margin in an SVM - these are the only vectors that play a role in defining the hyperplane

Ordinal

data with attributes that can be rank-ordered; distances between values do not have any meaning

numeric

data with attributes that can be rank-ordered; distances between values have a meaning; mathematical operations are possible

Stratified Random Sampling

define strata based on some characteristics like education level, then sample within each strata; most effective when variability within strata are minimized, variability between strata are maximized, and the variables upon which the population is stratified are strongly correlated with the desired dependent variable

K-means algorithm

defines the centroid of a cluster as the mean value of the points within the cluster and iteratively improves the within-cluster variation until the cluster assignment is stable

DBSCAN efficiency

depends greatly on input parameters; a low epsilon will take longer to check everything, while a high epsilon will not take long but will be less sensitive

perceptron's error

desired output - guessed output

continuous unsupervised learning example

dimensionality reduction eg PCA; AE, GAN

equal-width binning

divide the range into N intervals of equal size; if A and B are the lowest and highest values of an attribute, then the width of the intervals is (B - A) / N; outliers may dominate the presentation and skewed data is not handled well

equal-depth binning

divides the range into N intervals, each containing roughly the same number of samples; skewed data is handled well

Temporal Multi-Dimensional Scaling (TMDS)

does dimensionality reduction while maintaining the relative distances of all the points; works for categorical data; an alternative clustering method

Pooling layers

downsample each feature map independently, reducing the height and width, keeping the depth intact

ways to determine k in k-nearest neighbors

experiment by increasing k and calculating the error rate; a low k is sensitive to outliers, a high k is affected by data points of different clusters/classes

Linear normalization

f(v) = (v - min) / (max - min), where v is an individual value
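
A minimal Python sketch of the linear (min-max) and logarithmic normalizations defined on these cards; the sample values are made up:

```python
import math

def linear_norm(v, vmin, vmax):
    return (v - vmin) / (vmax - vmin)

def log_norm(v, vmin, vmax):
    return (math.log(v) - math.log(vmin)) / (math.log(vmax) - math.log(vmin))

values = [10, 100, 1000]
print([linear_norm(v, min(values), max(values)) for v in values])  # [0.0, ~0.09, 1.0]
print([log_norm(v, min(values), max(values)) for v in values])     # [0.0, 0.5, 1.0]
```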

learned feature approach

features are learned from a feature hierarchy; all layers extract features from the output of the previous layer; all layers are trained jointly

difference between transfer learning and fine tuning

first is better when the new dataset is more different from the one the model was trained on; second works better on a very similar dataset

difference between PAM, CLARA, and CLARANS

first iterates through every possible clustering combination, second uses sampling to increase algorithmic efficiency, last tries to seek a balance between these two extremes

Support

fraction of transactions that contain an itemset; example: s({Milk, Bread}) = 3/8; support(A => B) = P(A ⋃ B)

support count

frequency of occurrence of an itemset; example: support count({Milk, Bread}) = 3

how to search a hash tree (association rules)

given a tuple (for example: 3 5 9), apply the hash function to each value in the tuple sequentially until you reach a node; then check whether the tuple is present in the node

how to insert value in a hash tree (association rules)

given a tuple, apply the hash function to each value in the tuple sequentially until you reach a node. If the node contains the tuple already or has space for an additional tuple, insert the tuple. If not, split the node by adding another level of hash function and insert the tuple accordingly

FP-growth

grow long patterns from short ones using local frequent items

Convolutional filter

has a specific height and width (e.g. 5x5x3) and is 3D, with the depth matching the depth of the image; slides over the input in order to perform a convolution at every possible location and aggregate the results in a feature map

Silhouette coefficient interpretation

high value -> better clustering; > 0.5 indicates a reasonable cluster structure

Naive Bayes Classification problem

if one of the probabilities in the calculation is 0, then the whole formula is 0; solution: add one tiny sample so that it is non-zero

cluster ordering

in OPTICS, a linear list of all objects under analysis; objects in a denser cluster are listed closer to each other

core-distance

in OPTICS, the smallest epsilon value such that the epsilon neighborhood of an object p has at least MinPts objects; the minimum threshold that makes p a core object

winning neuron

in SOMs, the neuron that is most similar to the input vector

GAN generator

in a GAN, learns the distribution of the input data and is able to generate new data for the discriminator to evaluate

covariance matrix

in a PCA analysis, the covariance between all dimensions

latent space

in an encoder-decoder system, this captures the essential features necessary for reconstructing the input; for images it is hard for a human to interpret what these "essential features" would be

distance function

in clustering, a function that determines how close together or far apart data points are

clustering features

in the BIRCH algorithm, compact summaries that describe micro-clusters by containing centroid and radius information; CF = (N, LS, SS), where N = number of points in C, LS = linear sum of the N points, SS = square sum of the N points

difference between classification and prediction

in the first we have discrete classes, in the second we have continuous numerical values

postpruning

involves pruning the decision tree after construction by calculating the cost complexity

Prepruning

involves pruning the decision tree during construction by determining a stopping criterion based on minimum support or minimum confidence

Predictive analysis

involves the discovery of rules that underlie a phenomenon and form a predictive model which minimizes the error between the actual and predicted outcome, considering all possible inferring factors

confidence

confidence(A => B) = P(B|A) = support(A ⋃ B) / support(A)
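
A small Python sketch computing support and confidence over a toy transaction database (the transactions are made up for illustration):

```python
transactions = [
    {"milk", "bread"}, {"milk", "bread", "butter"}, {"bread"},
    {"milk", "bread"}, {"butter"}, {"milk"}, {"bread", "butter"}, {"milk", "bread"},
]

def support(itemset):
    # fraction of transactions that contain the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b):
    # confidence(A => B) = support(A ∪ B) / support(A)
    return support(a | b) / support(a)

print(support({"milk", "bread"}))       # 4/8 = 0.5
print(confidence({"milk"}, {"bread"}))  # 0.8
```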

Two ways to determine clusters after applying the noise threshold in DENCLUE

- Center defined cluster - Multi-center defined cluster

advantages of decision trees

- Good for applications where there are many attributes of unknown importance - Tolerant toward many correlated or noisy attributes - The structure allows for data clean-up - Can reveal unexpected dependencies in the data which would be hidden in a more complex model - Easy to understand the model - Able to handle numerical and categorical data - Inexpensive to construct

neural networks advantages

- High tolerance of noisy data - Classify patterns on which they have not been trained - Can be used when you have little knowledge of the relationships between attributes - Well-suited for continuous inputs and outputs - Use parallelization techniques which can speed up the computation process

SVMs Disadvantages

- Inefficient model construction - Long training times (~O(n^2)) - Model is hard to interpret - Learn only weights of features - Weights tend to be almost uniformly distributed

The four basic steps of the PCA

- Input data are normalized - PCA computes the principal vectors that provide a basis for the normalized input data; the input data are a linear combination of the principal components - The principal components are sorted in order of decreasing significance - Use only the strongest principal components to reconstruct a good approximation of the original data

K-nearest-neighbor advantages

- Local method: does not have to find a global decision function (decision surface) - High classification accuracy in many applications - Incremental: the classifier can easily be adapted to new training objects - Can also be used for prediction

Neural networks disadvantages

- Long training time - Require a number of parameters that are typically best determined empirically - Poor interpretability

deep learning advantages

- No manual feature extraction - Allows Machine Learning without feature engineering - Complex problems can be solved without much domain knowledge

DBSCAN advantages

- No need to specify the number of clusters in advance - Able to find arbitrarily shaped clusters - Able to detect noise

Differences between CNN and regular NN

- Not fully connected -> neurons in one layer only connect to a small region of the neurons in the following layer - The layers are organized in 3 dimensions: width, height and depth - The output is reduced to a single vector of probability scores

Disadvantages of stratified random sampling

- Requires selection of relevant stratification variable which can be difficult - Is not useful when there are no homogenous subgroups - Can be expensive to implement

SVMs advantages

- Strong mathematical foundation - Find global optimum - Scale well to high-dimensional datasets - High classification accuracy in many challenging applications - less prone to overfitting than other methods

Three dimensions of input for a CNN

- Width - x axis of the image - Height - y axis of the image - Depth - color channels in an image (RGB)

Support Vector Machines (SVMs)

- a method for the classification of both linear and nonlinear data - Uses nonlinear mapping to transform the original training data into a higher dimension - And searches for the linear optimal separating hyperplane - they are very slow but highly accurate, less prone to overfitting

Difference between RNN and regular NN

- first has no predetermined limit on input, second does - first can remember things learned from prior inputs, second only from training - first shares parameters across each layer of the network, second has different weights across each node

Information gain

- for attribute A, the biggest reduction in entropy compared to the original set - can never be negative - no matter what, entropy will decrease

Patterns that occur frequently in data

- itemsets - subsequences - substructures

k-means disadvantages

- need to specify k - unable to handle noisy data - cannot detect clusters with non convex shapes - applicable only when mean is defined - often terminates at a local optimum

steps in training GAN models

- a neural network trains the discriminator D on input data R - a neural network trains the generator G on noise data I - then D decides whether its input is from R or fake data - G tries to forge samples which D thinks are from R - G trains on the binary decision of D

k-means advantages

- relatively efficient - simple implementation

disadvantages to decision trees

- repetition of split criteria - replication of subtrees - large trees are difficult to analyze/ understand - overfitting

linkage based clustering algorithms

- single linkage - complete linkage - centroid linkage

how to determine the number of hidden layer units

- there are no clear rules - trial and error - rule of thumb is: 0.5 * (# inputs + # outputs)

neural network advantages

- tolerant against noise - well-suited for continuous data - algorithm is inherently parallel

how to determine the number of hidden layers

- usually only 1

advantages of stratified random sampling

-Focuses on important subpopulations and ignores irrelevant ones - Allows use of different sampling techniques for different subpopulations - Improves the accuracy of estimation

k-means terminates when...

... the cluster assignment is stable

how to determine the number of units in the output layer

1 output unit is sufficient for two classes; for > 2 classes, one output unit for each class

steps to Naive Bayes Classification

1) Calculate P(C[i]) for each possible i (ex: Probability that buys a computer = yes) 2) Calculate P(X|C[i]) for each possible i by multiplying the probability of each attribute (P(age= youth| buys_computer = yes) 3) Multiply each P(C[i]) by its corresponding P(X|C[i]) and compare the result
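
A minimal Python sketch of these three steps for categorical attributes; the toy buys_computer data is invented for illustration and no zero-probability correction is applied:

```python
from collections import Counter

data = [  # (attribute values, class label) -- toy training tuples
    ({"age": "youth", "income": "high"}, "no"),
    ({"age": "youth", "income": "medium"}, "yes"),
    ({"age": "senior", "income": "medium"}, "yes"),
    ({"age": "senior", "income": "high"}, "yes"),
    ({"age": "youth", "income": "high"}, "no"),
]

def naive_bayes(x):
    counts = Counter(label for _, label in data)
    scores = {}
    for c, n_c in counts.items():
        score = n_c / len(data)                        # step 1: P(C[i])
        for attr, value in x.items():                  # step 2: P(X|C[i]) as a product over attributes
            matches = sum(1 for t, label in data if label == c and t[attr] == value)
            score *= matches / n_c
        scores[c] = score                              # step 3: P(C[i]) * P(X|C[i])
    return max(scores, key=scores.get), scores

print(naive_bayes({"age": "youth", "income": "high"}))  # ('no', ...)
```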

3 basic requirements of an RNN

1) Can store information for an arbitrary duration 2) Is resistant to noise 3) Its parameters are trainable in reasonable time

FP-tree benefits

1) Completeness 2) compactness

Naive Bayes advantages

1) Fast to train and classify 2) Performance is similar to decision trees and neural networks 3) Easy to implement 4) handles numeric and categorical data 5) useful for very large datasets

two influence functions

1) Gaussian influence function 2) square wave influence function

Structure of Encoder Decoder systems

1) Input layer 2) Encoder 3) Latent space 4) Decoder 5) output

3 distance functions for numeric attributes

1) Minkowski distance 2) Euclidean distance 3) Manhattan distance

Two components of CNNs

1) The hidden layers/ feature extraction part 2) The classification part

ways to evaluate a classification model

1) accuracy 2) speed 3) robustness 4) scalability 5) interpretability

Naive Bayes Classification disadvantages

1) assumes class conditional independence, therefore loss of accuracy 2) model is difficult to interpret

Ways to handle noisy data

1) binning 2) regression 3) clustering

types of splits in a decision tree

1) boolean splits (Married: yes or no) 2) nominal splits: (married: never, divorced, widow) 3) split on continuous attributes (temperature <= 80 or > 80)

steps to identify an attribute to split on

1) calculate entropy for the whole data set (p[i] = yes/ all, p[i] = no/all) 2) calculate Info[A](D) for each attribute A 3) calculate Information gain for each attribute 4) select the attribute with the highest information gain
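
A small Python sketch of these steps, computing entropy, Info[A](D), and the information gain of one candidate attribute; the toy rows and the "wind" attribute are invented:

```python
import math
from collections import Counter

def entropy(labels):                       # Info(D) = -sum(p * log2(p))
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attribute):
    n = len(labels)
    groups = {}                            # partition class labels by the attribute's value
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    info_a = sum(len(g) / n * entropy(g) for g in groups.values())  # Info[A](D)
    return entropy(labels) - info_a        # Gain(A)

rows = [{"wind": "weak"}, {"wind": "strong"}, {"wind": "weak"}, {"wind": "weak"}]
labels = ["yes", "no", "yes", "no"]
print(info_gain(rows, labels, "wind"))     # ~0.31
```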

steps to training a multilayer feed-forward neural network

1) calculate the input for the hidden layer 2) calculate the output for the hidden layer 3) calculate the input for the output layer 4) calculate the output for the output layer 5) calculate the error for the output layer 6) calculate the error for the hidden layer 7) calculate new weights

steps to k-means

1) choose the number of clusters 2) reassign each object to the cluster to which the object is most similar, based on the mean value of the objects in the cluster 3) update the cluster means (which don't have to be real data points) 4) repeat until no change
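
A minimal Python sketch of this loop for 2-D points; k, the iteration cap, and the sample points are assumptions for illustration:

```python
import random

def kmeans(points, k, iters=100):
    centroids = random.sample(points, k)               # 1) choose initial centroids
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                # 2) assign each point to the nearest centroid
            i = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        new = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centroids[i]
               for i, c in enumerate(clusters)]         # 3) update the means (need not be real points)
        if new == centroids:                            # 4) repeat until no change
            break
        centroids = new
    return centroids, clusters

pts = [(1, 1), (1.5, 2), (1, 0.5), (5, 7), (6, 8), (6, 9)]
print(kmeans(pts, 2))
```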

Naive Bayes classifier assumptions

1) class conditional independence

two types of decision trees

1) classification 2) regression

steps to Gini gain calculation

1) compute the impurity of the training data set D 2) find the "best" splitting criterion by computing the Gini impurity for each attribute

steps to creating a decision tree

1) create a node N 2) fill this node - if all elements in this node are in the same class, then we finish 3) else, apply a selection method to decide which attribute best distinguishes the classes for the split

steps in expectation maximization algorithm

1) create initial clustering by projecting the data onto the Gaussian distributions 2) calculate the probability of each data point being assigned to each cluster 3) calculate new clustering 4) Calculate E(M) and E(M') 5) repeat until maximization

Goals of Cluster Analysis

1) data understanding 2) data class identification 3) data reduction 4) outlier detection 5) noise detection

Two step process to association rule mining

1) find all frequent itemsets (minimum support) 2) Generate strong association rules from the frequent itemsets (minimum support , minimum confidence)

Three gates in an LSTM

1) forget gate 2) input gate 3) output gate

steps to single-linkage

1) form initial clusters each of a single object and compute the distance between each pair of clusters 2) merge the two clusters that have the smallest distance 3) calculate the distances between each cluster 4) if there is only one cluster stop, otherwise return to step 2

Ways to deal with missing values

1) ignore the tuple 2) enter the value manually 3) use a global constant 4) use attribute mean 5) use the most probable value

CNN applications

1) image classification 2) object detection

steps to SOMs

1) initialize weights randomly 2) randomly choose a data input 3) find the most similar neuron 4) update the weights of the winning neuron and all neighbors making the neurons more similar to the input

LSTM structure

1) input layer 2) hidden layer which contains memory cells and corresponding gate units 3) an output layer

Training steps for perceptron (supervised learning)

1) input the training data 2) ask the perceptron to guess an answer 3) compute the error (comparing it to the ground truth) 4) adjust the weights according to the error 5) return to step 1 and repeat
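
A minimal Python sketch of this loop for a single perceptron with a step activation; the learning constant, epoch count, and toy AND data are assumptions:

```python
import random

def train_perceptron(data, lr=0.1, epochs=25):
    w = [random.uniform(-1, 1) for _ in range(3)]   # two input weights plus a bias weight
    for _ in range(epochs):
        for (x1, x2), target in data:               # 1) input the training data
            s = w[0] * x1 + w[1] * x2 + w[2]        # weighted sum of the inputs
            guess = 1 if s >= 0 else 0              # 2) the perceptron guesses (step activation)
            error = target - guess                  # 3) error = desired output - guessed output
            w[0] += lr * error * x1                 # 4) adjust weights by error * input * learning constant
            w[1] += lr * error * x2
            w[2] += lr * error
    return w

and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(train_perceptron(and_data))                   # weights of a line separating the AND data
```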

Properties of Clusters

1) may have different sizes, shapes, densities 2) may form a hierarchy 3) may be overlapping or disjoint

types of sampling

1) non-probabilistic samples 2) probabilistic samples

5 types of RNN

1) one-to-one 2) one to many 3) many to one 4) many to many 5) many to many

ways to deal with overfitting in decision trees

1) prepruning 2) postpruning

ways to improve the efficiency of frequent itemset generation

1) reduce the number of candidates (M) 2) Reduce the number of transactions (N) 3) reduce the number of comparisons (NM)

Types of data reduction

1) reduction of the number of data points 2) reduction of the number of dimensions

Possible solutions to the introduction of new mins and maxs after normalization

1) rerun the normalization 2) assign a global constant to higher/lower values 3) based on experience, increase the value range

typical data classes

1) scalar 2) multivariate and multidimensional data 3) vectors 4) network data 5) hierarchical data 6) time series data 7) geographic data

Apriori algorithm

1) scan the DB once to get frequent 1-itemsets 2) generate length (k+1) candidate itemsets from length k frequent itemsets 3) test the candidates against the database to see whether they are frequent or not based on the minimum support; if they are, continue to the next k; if not, they are pruned 4) terminate when no frequent or candidate set can be generated
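
A compact Python sketch of this generate-and-test loop (candidate generation is simplified to pairwise unions rather than the full Apriori join/prune); the transactions and the minimum support count are made up:

```python
transactions = [{"milk", "bread"}, {"milk", "bread", "butter"},
                {"bread", "butter"}, {"milk", "bread"}]
minsup = 2  # minimum support count

def apriori(transactions, minsup):
    items = {i for t in transactions for i in t}
    freq = {frozenset([i]) for i in items                      # 1) frequent 1-itemsets
            if sum(i in t for t in transactions) >= minsup}
    all_frequent, k = set(freq), 1
    while freq:                                                 # 4) stop when nothing new is generated
        candidates = {a | b for a in freq for b in freq         # 2) length k+1 candidates
                      if len(a | b) == k + 1}
        freq = {c for c in candidates                           # 3) test candidates against the database
                if sum(c <= t for t in transactions) >= minsup}
        all_frequent |= freq
        k += 1
    return all_frequent

print(sorted(tuple(sorted(s)) for s in apriori(transactions, minsup)))
```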

FP-Trees steps

1) scan the database and find the set of frequent items (1-itemsets) and their support count 2) sort the frequent 1-itemsets in descending order of support (this ordered list is L) 3) scan the database again; each transaction is processed in L order

steps to k-medoids

1) select an object as cluster representative 2) assign data points to the closest centroid 3) calculate the cost function 4) repeat until there is no improvement in the cost function

types of association rules

1) single-dimensional 2) multi-dimensional 3) boolean/ binary 4) quantitative

activation functions (neural network)

1) step function 2) sigmoid function

steps for calculating the PCA

1) subtract the mean 2) calculate the covariance matrix 3) calculate the eigenvectors and eigenvalues 4) form a feature vector 5) derive the new data set
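
A short sketch of these five steps using NumPy's eigendecomposition (assuming NumPy is available); the toy 2-D data is made up:

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

X_centered = X - X.mean(axis=0)              # 1) subtract the mean
cov = np.cov(X_centered, rowvar=False)       # 2) calculate the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # 3) eigenvalues and eigenvectors (A v = lambda v)
order = np.argsort(eigvals)[::-1]            #    sort by decreasing significance
feature_vector = eigvecs[:, order[:1]]       # 4) form a feature vector from the strongest component
new_data = X_centered @ feature_vector       # 5) derive the new (projected) data set
print(new_data)
```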

three components of a perceptron

1) the input 2) the weighted sum of the inputs 3) the activation function that generates the output

Specification necessary before training a neural network

1) the number of units in the input layer 2) the number of hidden layers 3) the number of units in each hidden layer 4) the number of units in the output layer

Two weaknesses of RNNs that LSTMs overcome

1) vanishing gradients 2) exploding gradients

distance functions for text documents

1) vector of term frequencies per text document 2) term frequency and inverse document frequency 3) cosine similarity

formula used to maximize the width of the margin

maximize 2/||w||, or equivalently minimize (1/2)||w||^2 (to make it a quadratic problem with a unique solution)

confidence interpretation (buys(X, "computer") => buys(X, "software"), confidence = 50%)

50% because if a customer buys a computer, there is a 50% chance that he will buy software as well

calculate P(H|X) given P(H), P(X|H), and P(X)

P(H|X) = P(X|H) * P(H) / P(X)

Recurrent neural network

A class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence; applicable to handwriting recognition or speech recognition

Convolutional Neural Network

A class of deep neural networks, most commonly applied to analyzing visual imagery; convolutional layers make it distinct - prone to overfitting

Generative Adversarial network

A class of machine learning frameworks in which two neural networks contest with each other in a game

Covariance interpretation

A positive value means that both dimensions increase together

squared normalization

(sqrt(v) - sqrt(min)) / (sqrt(max) -sqrt(min))

K-nearest-neighbor disadvantages

- Application of classifier expensive, requires k-nearest neighbor query - Does not generate explicit knowledge about the classes

Steps to DENCLUE

- Apply the influence function to all data points - Sum up the influence functions to obtain the density curve (density function) - Apply a noise threshold - Identify the local maximums and determine clusters accordingly

Dendrogram

A tree of nodes representing clusters, satisfying the following properties: the root represents the whole data set; leaf nodes represent clusters containing a single object; inner nodes represent the union of all objects contained in their corresponding subtrees

Self-organizing map

A type of ANN that is trained using unsupervised learning to produce a low-dimensional, discretized representation of the input space of the training samples, called a map, and is therefore a method to do dimensionality reduction; use competitive learning instead of backpropagation; useful for visualization

Autoencoder neural network

A type of ANN used to learn efficient data codings in an unsupervised manner; learns a representation for a set of data, typically for dimensionality reduction, by training the network to ignore noise

Formula for finding eigenvalues

A v = λ v, where A = a square matrix, v = eigenvector, λ = eigenvalue

Clustering LARge Applications based upon RANdomized Search (CLARANS)

A way to implement k-medoids with greater efficiency than PAM and greater effectiveness than CLARA by specifying the number of iterations and the sample

frequent itemset

An itemset whose support is greater than or equal to a minsup threshold

"Encoder" or "Analyze" (in a NN name)

Neural Network to extract meaningful information from noisy input

"Decoder" or "Synthese" (in a NN name)

Neural Network to reconstruct the original input from the extracted features

calculate the probability of the joint events P(G=T, S=T, R=T)

P(G=T | S=T, R=T) * P(S=T | R=T) * P(R=T)

algorithms that implement k-medoids

PAM, CLARA, CLARANS

RNN Advantages

- Possibility of processing input of any length - Model size not increasing with the size of the input - Computation takes into account historical information - Weights are shared across time

difference between random and systematic error

Random noise would fall all over the place, biased would fall in a specific way

Bayes Classification

Classify a given tuple based on the probability of its attributes occurring with the class labels

RNN disadvantages

- Computation is slow - Difficulty of accessing information from a long time ago - Cannot consider any future input for the current state

KDD process model

Data -> Target Data -> Preprocessed Data -> Transformed Data -> Patterns -> Knowledge

Market Basket Analysis

Finds associations between the different items that customers place in their "shopping baskets"

neighborhood function options for SOMs

Gaussian function, hat function, among other options

problem with association rule mining

Given a large dataset, the possible number of frequent itemsets is exponential in the number of possible attribute-values

Apriori principle

If an itemset is frequent, then all of its subsets must also be frequent; so, if an itemset is infrequent, its supersets need not be tested

influence function (DENCLUE)

In DENCLUE, a function applied to the data that highlights how much "influence" it has on other data points in its neighborhoods by counting how many neighbors are in its neighborhood

entropy

Info(D) = - sum over classes of p * log2(p), where p = the probability that an arbitrary tuple in D belongs to class C; a value between 0 and 1, where 1 means the most uncertainty and 0 means we know everything; the average amount of information needed to identify the class label of a tuple in D

Calculate the centroid in BIRCH

LS / N

cluster random sampling

sampling method where clusters are randomly selected; - cheap method when it is geographically convenient - easy to increase the sample size - least representative of the population

Principal Component Analysis

searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k <= n; this then allows the original data to be projected onto a much smaller space, resulting in dimensionality reduction

K-nearest-neighbor

searches the pattern space for the k training tuples that are closest to the unknown tuple

LSTM applications

sequence prediction in general; specifically, language modeling, speech recognition, and machine translation

steps to DBSCAN

Check an unclassified point; if the unclassified point is density reachable from another point, connect them; continue until no more points can be reached, then repeat with the next unclassified point

Complete Linkage

Calculate the distance between clusters based on the two points that are furthest apart in the clusters; then take the smallest of those distances and merge those two clusters

centroid-linkage

Calculate the means of all the clusters, then find the distances between the means, and merge the clusters with the minimum distance

DBSCAN disadvantages

Cannot cluster data sets well with large differences in densities

Fine Tuning

similar to transfer learning; involves taking a pre-trained model and, if we know what a certain layer does for the dataset, freezing it; works in situations in which you have multiple similar datasets

support interpretation (buys(X, "computer")=> buys (X, "software") Support = 1%)

It is 1% because 1% of all transactions under analysis showed that computer and software were purchased together

systematic random sampling

order the data according to a certain characteristic (e.g. income) and then take every k-th element, where k = (the number of data points) / (the number of samples); has the same problem as simple random sampling - there's a chance the sample won't describe the population

problem with decision trees

overfitting

M-fold cross validation

partition dataset O into m same-size subsets; train m different classifiers using the different subsets; combine the evaluation results of the m classifiers
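
A minimal Python sketch of the procedure; the "classifier" is just a stand-in that predicts the training folds' majority class, and the labels are made up:

```python
from collections import Counter

def m_fold_cv(labels, m=3):
    accuracies = []
    for i in range(m):
        test_idx = [j for j in range(len(labels)) if j % m == i]    # held-out subset
        train_idx = [j for j in range(len(labels)) if j % m != i]   # remaining m-1 subsets
        majority = Counter(labels[j] for j in train_idx).most_common(1)[0][0]
        correct = sum(labels[j] == majority for j in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / m                                      # combine the m evaluations

print(m_fold_cv(["yes", "yes", "no", "yes", "no", "yes"], m=3))     # ~0.67
```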

Pooling

performed to reduce dimensionality in a CNN; has no parameters, we just specify the window size and stride

Feature map

produced by applying a convolution on the input data with a convolution filter; its size depends on the input size, the filter size, the stride, and any padding; there will be multiple of these as multiple convolutions are performed on an input; when they are stacked together they become the final output of the convolution layer

Ordering Points to Identify the Clustering Structure (OPTICS)

proposed to overcome the difficulty in using one set of global parameters in clustering analysis; determines the clustering for different density-parameters and shows the results in a visual representation

how weights are initialized (neural network)

randomly

least square error minimization

refers to minimizing sum of the square of the real y value minus the estimate y value

density connected

refers to the connection between two points that are both density reachable from a shared object o

continuous supervised learning example

regression eg linear regression; CNN, RNN

hierarchical data

relationships between nodes in a hierarchy can be specified by links

k-means efficiency

relatively efficient

