Machine Learning
decision trees: top-down induction
Learning a decision tree means finding the best order in which to test the attributes ("best" means we want a small tree → choose the attribute to test at each node by information gain).
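A minimal sketch of how information gain could be computed for choosing the attribute to test. The function names and the toy data (`wind`, `sky`) are illustrative assumptions, not part of the lecture material.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Entropy reduction obtained by splitting on `attribute`.

    `examples` is a list of dicts mapping attribute name -> value.
    """
    total = len(labels)
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: 'wind' separates the classes perfectly, 'sky' does not.
examples = [
    {"wind": "strong", "sky": "sunny"},
    {"wind": "strong", "sky": "cloudy"},
    {"wind": "weak", "sky": "sunny"},
    {"wind": "weak", "sky": "cloudy"},
]
labels = ["yes", "yes", "no", "no"]

# The attribute with the highest gain would be tested first (here: 'wind').
best = max(["wind", "sky"], key=lambda a: information_gain(examples, labels, a))
```

Testing the high-gain attribute first tends to produce pure subsets early, which is what keeps the tree small.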
What is the idea behind local PCA?
First cluster the data, then apply PCA within each cluster; or better, iteratively improve the positions of the cluster centers and the local PCs. For continuous, non-clustered data: --> no continuous description of the manifold --> different clusterings are possible, leading to entirely different local projections.
What are properties of the basic maximization algorithm?
- may get caught in local optima - depends on the initial partitioning - to escape local optima, 'downhill' steps are accepted with probability e^(beta Delta E) - simulated annealing: an initially small beta allows frequent downhill steps, increasing it makes them more unlikely - for the following optimization criteria, only W needs to be computed, due to A = B + W
Provide pseudocode for the basic maximization algorithm
(note: ~x = vector x)
Initialization: partition the data somehow into clusters C1...Cn
while stop criterion not reached do
    choose an example ~x at random, denote its cluster as C(~x)
    randomly select a target cluster Ct
    compute the change of the goodness function: Delta E = E('~x element Ct') - E('~x element C(~x)')
    if Delta E > 0 then
        put ~x from C(~x) to Ct
    else
        put ~x from C(~x) to Ct with probability e^(beta Delta E)
    end if
    increase beta
end while
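The pseudocode can be sketched in Python as follows. The goodness function used here (negative within-cluster scatter), the fixed step budget as stop criterion, and the geometric schedule for beta are assumptions chosen for illustration; the lecture may use different criteria.

```python
import math
import random

def goodness(points, labels, k):
    """Hypothetical goodness E: negative within-cluster scatter (higher is better)."""
    total = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if not members:
            continue
        dim = len(members[0])
        center = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum(sum((p[d] - center[d]) ** 2 for d in range(dim)) for p in members)
    return -total

def basic_maximization(points, k, steps=2000, beta=0.1, beta_growth=1.01, seed=0):
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]   # initial partition ("somehow")
    current = goodness(points, labels, k)
    for _ in range(steps):                        # stop criterion: fixed step budget
        i = rng.randrange(len(points))            # choose an example at random
        target = rng.randrange(k)                 # randomly select a target cluster
        if target != labels[i]:
            old = labels[i]
            labels[i] = target
            candidate = goodness(points, labels, k)
            delta = candidate - current
            # Uphill moves are always accepted, downhill moves with prob. e^(beta * delta).
            if delta > 0 or rng.random() < math.exp(beta * delta):
                current = candidate
            else:
                labels[i] = old                   # reject: undo the move
        beta *= beta_growth                       # increase beta
    return labels, current

points = [(0.0,), (0.2,), (0.1,), (5.0,), (5.2,), (5.1,)]
labels, final_goodness = basic_maximization(points, k=2)
```

Recomputing the full goodness at every step keeps the sketch short; an efficient version would update only the two affected clusters.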
What are common criteria for choosing the number of dimensions to project the data onto?
- Eigenvalue magnitudes: Find the cut-off depth in the eigenvalue spectrum. This is useful for classification problems, especially for problems to be solved by computers. - Visualization: Choose the number of dimensions that lets you visualize the data in a meaningful way. This choice depends a lot on your problem definition. For printing, 2D is usually a good choice, but maybe your data already looks very nice in 1D. Or maybe you are using a glyph plot (see sheet 06), which can represent high dimensional data. - Classification results: In the Eigenfaces assignment below we figured out that the number of principal components (and thus the number of dimensions) can have a crucial impact on classification rates. It is thus an option to fine-tune the number of dimensions for a good classification result on some training set.
Explain at least one other data visualization technique from the lecture
- Scatterplot matrix: Scatter data into plots for each combination of two attributes. - Glyphs: Map data dimensions onto parameters for geometrical figures, e.g. star glyphs, arrows. Properties could be lengths, widths, orientations, colors, ... - Parallel coordinates: Use feature dimensions as one axis and feature values as another. Plot datapoints as lines. - Projections: Several different techniques to project the data: PCA, scaling strategies, ...
What are the properties of formal neurons?
- the activity communicated via the axon is represented as a real number x (aka spiking frequency) - for a neuron with d inputs --> ~x element of R^d is the input vector - each input xi is weighted by a weight wi - the activation s is the sum of the weighted inputs: s = sum over i (wi * xi) - the output y is then given by feeding the activation into an activation function: y = y(s)
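As a small illustration, a formal neuron is just a weighted sum followed by an activation function. The logistic sigmoid and the example numbers below are assumptions; any activation function could be plugged in.

```python
import math

def sigmoid(s):
    """Logistic activation function (one possible choice)."""
    return 1.0 / (1.0 + math.exp(-s))

def formal_neuron(x, w, activation=sigmoid):
    """Output y = activation(s), with activation s = sum_i w_i * x_i."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return activation(s)

# Hypothetical inputs and weights: s = 0.4 * 1.0 + (-0.8) * 0.5 = 0, so y = 0.5
y = formal_neuron([1.0, 0.5], [0.4, -0.8])
```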
What are two complementary methods in hierarchical clustering?
- agglomerative clustering: start with each data point as a cluster, then merge recursively bottom up - divisive clustering: start with all data points as a single cluster, split clusters recursively top down The result is a dendrogram representing all data in a hierarchy of clusters.
What are the properties of hierarchical clustering?
- any distance measure can be used - we only need the distance matrix (not the data) - no parameters - efficiency: agglomerative O(n^3) in general (O(n^2) with the optimized algorithms SLINK for single linkage and CLINK for complete linkage) / divisive: O(2^n) with exhaustive search - resulting dendrogram offers alternative solutions (but has to be analyzed) - cutting off at different levels of the dendrogram may be necessary to get comparable clusters - outliers are fully incorporated
What are aims of dimension reduction?
- find local dimensionalities of the data manifold - find new coordinate system to represent the data manifold and project the data to it - get rid of empty parts in the space - new parameters may be more meaningful
What is the idea of PCA?
- find the subspace that captures most of the data variance. - Unsupervised. - Given: data set D = {~x1, ~x2,...}, ~xi element R^d with zero mean: <~x> = (1/|D|) * sum over i (~xi) = 0 --> PCA finds m < d orthonormal vectors ~p1...~pm such that the ~pi are the directions of the largest variance of D.
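A minimal sketch of finding the first principal component ~p1 in plain Python. It uses power iteration on the covariance matrix, which is one of several possible methods and not necessarily the one used in the lecture; the start vector and the toy data are assumptions (a start vector exactly orthogonal to ~p1 would not converge).

```python
import math

def center(data):
    """Subtract the mean so that the data set has zero mean."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - mean[j] for j in range(d)] for row in data]

def covariance(centered):
    """Covariance matrix of zero-mean data."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / n for j in range(d)]
            for i in range(d)]

def first_principal_component(data, iterations=200):
    """Direction of largest variance, found by power iteration."""
    c = covariance(center(data))
    d = len(c)
    p = [1.0 / math.sqrt(d)] * d          # arbitrary unit start vector
    for _ in range(iterations):
        q = [sum(c[i][j] * p[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in q))
        if norm == 0.0:                   # degenerate case: no variance along p
            break
        p = [v / norm for v in q]
    return p

# Hypothetical 2D data scattered roughly along the diagonal x2 = x1.
data = [[-2.0, -2.1], [-1.0, -0.9], [0.0, 0.0], [1.0, 1.1], [2.0, 1.9]]
p1 = first_principal_component(data)      # close to (1, 1) / sqrt(2)
```

Further components could be obtained by projecting the data onto the space orthogonal to ~p1 and repeating the procedure.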
Which methods can be used if we have missing attributes in decision trees?
- if node n tests A, assign most common values of A among other examples sorted to n - assign most common value of A among other examples with same target value - assign probability pi to each possible value vi of A (prob estimated from value distribution of examples assigned to n). - Assign fraction pi of examples with missing values to each descendant in tree.
What is the general architecture of an MLP?
- input layer (layer 0): one neuron for each dimension of the input vector (input neurons only represent the input) - at least one hidden layer (L1...LH); hidden layers can have different numbers of neurons - output layer (LH+1): one node per dimension of the output - feed-forward architecture: only connections from layer k to layer i with k < i - more notation: neurons in layer i: 1(i)...N(i) / output of neuron i in layer k: oi(k) / weight from neuron k in layer n to neuron i in layer m: wik(m, n). Sigmoid activation functions σ are used (a succession of linear transforms is itself a single linear transform, so nothing would be gained from purely linear layers). σ enforces a decision; a soft step is required (backpropagation requires differentiability). Squashing maps incoming information to a smaller range.
What can cause outliers?
- measurement and technical errors (a cut-off, e.g., may lead to a high concentration of values at the boundaries) - unexpected true effect (can be modeled by two or more overlaid distributions: P(x) = (1 - p) * Pa(x) + p * Pb(x) with p << 1) - data with inherent high variability (can be modeled by a distribution with broad flanks). There are different types of outliers which can have different causes. They could arise through measurement or technical errors when collecting data. This may be connected to having a sharp cut-off in regard to the range of measurements, which could lead to a high concentration of values at the artificial boundaries of an experiment. However, they may also show us a true underlying effect in our data that we didn't expect or account for. This might be the case when we are treating the measurements as a single distribution, when in reality there are actually two underlying distributions. Lastly, our distribution might naturally have a high variance, which makes outliers or extreme values a natural part of the distribution.
What are properties of k-means?
- number of clusters is the only parameter (implicitly defines scale and shape). - Greedy optimization → local optima, depending on initial conditions. - Fast algorithm.
What to do with outliers?
- removal: simple, but loss of information - weight according to z-values - remove outliers and fill up the gaps with the methods that follow
Name different linkage criteria
- single linkage clustering: uses the minimum cluster distance. Prefers chaining. - complete linkage clustering: uses the maximum cluster distance. Prefers compact clusters. - average linkage clustering (UPGMA): uses the mean cluster distance. - centroid clustering: uses the centroid distance. Clusters are represented by their centroids / real-valued attributes are needed for centroid computation / when joining clusters, the resulting centroid is dominated by the cluster with more members.
Name and explain two ways to extract more principal components ~v from D
1. Successive application of single Hebb neurons: extract ~v1, project D onto the space orthogonal to ~v1, extract ~v2, and so on --> single neurons need fewer steps (each only gets the 'correct' input for it). Good for sequential training. 2. Method by Sanger: chain of laterally coupled neurons, each trained by Hebb's rule. The first neuron gets unfiltered input, the second gets the input minus the direction of the first weight vector, and so on. --> each step requires training of the complete chain; while the first neurons are not yet properly trained, later neurons receive input from which the PCs with larger eigenvalues are not filtered out correctly. Good for parallel training.
What is a perceptron able to do? What is it not able to do?
A perceptron represents a (d-1)-dimensional decision surface, a hyperplane orthogonal to ~w. Necessary condition: a problem is only solvable with a perceptron if the data is linearly separable. This is the case for many basic logical operations, except for XOR (XOR can be solved by distorting the feature space or by adding further input channels like x3 = x1 * x2).
What is the bias in cluster algorithms?
All clustering algorithms have a bias: - the bias prefers a certain cluster model that comprises scale & shape of the clusters - optimally the bias can be chosen explicitly, but usually it is partly built into the algorithm - the adjustable parameters are usually processing parameters, not model parameters - the connection between the parameters and the cluster model has to be inferred from the way the algorithm works - hierarchical clustering solves the problem for the scale parameter by having all different scale solutions present in an ordered way
Given the general boundary G = {<Strong,?,?>} and the specific boundary S = {<Strong,sunny,warm>, <Strong,cloudy,cool>}, which we learned using the Candidate Elimination algorithm: provide the complete version space, including the more-general-than relations. Define the notation you use to display the more-general-than relations.
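VS = {<Strong,?,?>, <Strong,sunny,?>, <Strong,cloudy,?>, <Strong,?,warm>, <Strong,?,cool>, <Strong,sunny,warm>, <Strong,cloudy,cool>}. Notation: h1 >= h2 means h1 is more general than or equal to h2; a hypothesis becomes more specific as '?'s are replaced by values. Relations: <Strong,?,?> >= each of <Strong,sunny,?>, <Strong,cloudy,?>, <Strong,?,warm>, <Strong,?,cool>; <Strong,sunny,?> >= <Strong,sunny,warm>; <Strong,?,warm> >= <Strong,sunny,warm>; <Strong,cloudy,?> >= <Strong,cloudy,cool>; <Strong,?,cool> >= <Strong,cloudy,cool>. The most general hypothesis in this example is <Strong,?,?>.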
What is the Candidate Elimination algorithm?
Candidate Elimination is a learning algorithm that, in each step, tries to generate a description which is consistent with all previously observed examples in a training set. That description could hypothetically then be used to classify examples outside the training set. Idea: compute the whole VS. Like list-then-eliminate, start with the complete VS, but do not enumerate its members explicitly; instead represent the VS by its boundaries. Start with the most general G0 = <?, ?, ?, ?> and the most specific S0 = <∅, ∅, ∅, ∅>. Those delimit the whole VS. Now for each training example specialize G and generalize S until they overlap.
What is the motivation behind clustering?
Clusters are basic structures. Additionally, clusters in some feature spaces may indicate closeness of the data on a semantic level. Clustered data implies rules. Compression can be achieved by transmitting only cluster centers. However, it is not always trivial to define clusters; this depends on the scale and the shape one wants to achieve.
What is the curse of dimensionality and what are its implications for pattern classification? Explain how this phenomenon could be used to one's advantage
The curse of dimensionality describes the phenomenon that in high dimensional vector spaces, two randomly drawn vectors will almost always be close to orthogonal to each other. This is a real problem in data mining, where for a higher number of features, the number of possible combinations and therefore the volume of the resulting feature space increases exponentially. In such a high dimensional space, data vectors from real data sets lie far away from each other (which means dense sampling becomes impossible, as there aren't enough samples close to each other). This also leads to the problem that pairs of data vectors have a high probability of having similar distances and of being close to orthogonal to each other. The result is that clustering becomes really difficult, as the vectors more or less stand on their own and distance measures cannot be applied easily. This is actually an advantage if you want to discriminate between a high number of individuals (see Bertillonage, where using only 11 features results in a feature space big enough to discriminate humans), but if you want to get usable information out of data, such a 'singling out' of samples is a great disadvantage.
What is a scatterplot matrix?
Matrix of 2D plots of all possible combinations of available dimensions. If displaying all axes is infeasible, PCA may yield suitable directions.
What are properties of the EM algorithm?
- EM yields only local optima - computationally much more expensive than K-means - precautions against the collapse of a Gaussian to a single point are necessary - K-means can provide a useful initialization for the ~muk, local PCA for the Ck - there are K! equivalent solutions → parameter identification might be difficult
What is the p-norm?
General norm for measuring distances: d_p(~x, ~y) = (sum over i |xi - yi|^p)^(1/p). E.g. p = 2 --> Euclidean distance, p = 1 --> Manhattan distance, p --> inf: maximum distance.
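A short sketch of the distance induced by the p-norm, covering the three special cases named above. The function name `minkowski_distance` is an illustrative choice.

```python
def minkowski_distance(x, y, p):
    """Distance induced by the p-norm; p = float('inf') gives the maximum distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

x, y = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski_distance(x, y, 1)             # 3 + 4 = 7
euclidean = minkowski_distance(x, y, 2)             # sqrt(9 + 16) = 5
maximum = minkowski_distance(x, y, float("inf"))    # max(3, 4) = 4
```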
What is the inductive bias in ID3?
H is the power set of the instances X → unbiased search? NO: short trees are preferred (corresponds to Occam's razor), and trees with high information gain attributes near the root are preferred. The bias is a preference for some hypotheses rather than a restriction of the hypothesis space.
What is called "the anti Hebb rule"? Explain!
Habituation. ~w has a parallel and an orthogonal component to ~x. For continuous training with the same ~x the parallel component becomes 0, while the orthogonal component does not change. The anti-Hebb rule thus leads to habituation: ~w becomes orthogonal to the repeated stimulus --> ~w acts as a filter that only new stimuli can pass. For several habituated stimuli, only input components orthogonal to the space spanned by those stimuli can pass the filter.
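What is the idea of the Manhattan distance? (aka city block distance)
The Manhattan distance sums the absolute coordinate differences (p-norm with p = 1). The Hamming distance (# of positions where two binary strings differ) is equal to the Manhattan distance for binary vectors.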
Why did Chernoff use faces for his representation? Why not something else, like dogs or houses?
Humans are exceptionally good at face recognition. In face-like images it is much easier for humans to notice that one eye is bigger than the other or that the eyebrows are closer together than, for example, to spot differences in window sizes or changes in roof skewness between houses.
Explain Hypothesis space search by ID3
ID3 searches the space of possible decision trees, growing them from simple to complex. Since this space is complete, the target hypothesis is surely in there. Output is a single hypothesis (not the version space) → no queries to resolve among competing hypotheses. No backtracking implemented → may end up in local optima. Statistically-based search choices → robust to noisy data. Disadvantage: all examples are needed for every choice, so no incremental learning.
Explain the idea behind minimum variance clustering
Idea: For merging, do not only take inter-cluster distance into account but also consider size of the clusters (preferring small ones)
Explain the idea behind Ward's minimum variance clustering. What are its properties?
Idea: merge the two clusters for which the increase in total variance is minimal. This approach is optimization-based; however, it can be implemented via a distance measure. Properties: prefers spherical clusters and clusters of similar size. Robust against noise but not against outliers.
Describe in your own words: How does the EM-algorithm deal with the missing value problem?
In the EM-Algorithm all known values are considered via their likelihood depending on the distribution. In the same way hidden (i.e. missing) values are considered as depending on the probability distribution and additionally on the known values. So the complete distribution can be seen as the product of two probability distributions (known and missing values). The algorithm searches for the parameters that maximize the log-likelihood. As they depend on the missing values, those are averaged out. In an iterative procedure the estimated parameter is improved (M-step) followed by averaging over the missing values using the obtained parameter (E-step). This will lead the estimation of the parameter to converge to a local maximum which hopefully is close to the real parameter value. The principle in handling missing values here is to not try to regain them somehow, but to invent values from a model obtained through the probability distribution. In the best case this does not lead to information loss, although it generally does. However, this at least makes the existing values technically usable.
Provide a pseudocode algorithm for agglomerative hierarchical complete linkage clustering
Initialization: assign each of the n data elements to its own cluster Ci, i = 1...n
while n > 2 do
    find the pair of clusters Ci, Cj, i < j, that optimizes the linkage criterion (for complete linkage: the pair with the smallest maximum pairwise distance)
    merge: Ci ← Ci ∪ Cj
    if j < n then
        Cj ← Cn
    end if
    n ← n - 1
end while
Optimizing the linkage criterion requires a distance measure. This is where the algorithm can be modified.
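The loop above can be sketched in Python as follows. This naive version recomputes all pairwise linkages in every round and takes the desired number of clusters as a parameter (`num_clusters`, an assumption added for illustration) instead of the fixed stop condition in the pseudocode.

```python
def complete_linkage(a, b, dist):
    """Complete linkage: maximum pairwise distance between two clusters."""
    return max(dist(p, q) for p in a for q in b)

def agglomerative(points, num_clusters, dist):
    clusters = [[p] for p in points]          # each data element starts as its own cluster
    merges = []                               # (cluster_a, cluster_b, linkage) per merge step
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = complete_linkage(clusters[i], clusters[j], dist)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]   # merge Cj into Ci
        del clusters[j]                           # drop the merged cluster
    return clusters, merges

# Hypothetical 1D data with two obvious groups.
points = [0.0, 0.1, 0.2, 5.0, 5.1]
clusters, merges = agglomerative(points, 2, dist=lambda a, b: abs(a - b))
```

Swapping `complete_linkage` for a function that returns the minimum or the mean pairwise distance would give single or average linkage; the recorded `merges` contain the information needed to draw a dendrogram.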
Explain in your own words the concepts of descriptive and intrinsic dimensionality
Intrinsic dimensionality exists in contrast to the descriptive dimensionality of data, which is defined by the number of parameters used to produce or represent the raw data (e.g. the number of pixels in an unprocessed image). In addition to this representational dimensionality, there is also a (most of the time smaller) number of independent parameters which is necessary to describe the data, always with regard to a specific problem we want to use the data for. For example: a data set might consist of a number of portraits, all with a size of 1920x1080 pixels, which constitutes their descriptive dimensionality. To do some facial recognition on these portraits, however, we do not need the complete descriptive dimension space (which would be way too big anyway), but only a few independent parameters (which we can get by doing PCA and looking at the eigenfaces). This is possible because the data never fill out the entire high dimensional vector space but instead concentrate along a manifold of a much lower dimensionality.
What are Glyphs?
Glyphs map each dimension onto a parameter of a geometrical figure. Glyphs are normally perceived as a whole → extracting information from glyphs requires training! Examples: star glyphs, parallel coordinates, Chernoff faces.
What are problems with minimization in MLPs? What are ways to avoid local minima?
Minimization here is still computationally expensive, suffers from local optima, and is difficult to terminate (good fit but no overfitting). Ways to avoid local minima: - repeat training with different initial weights to find different basins of attraction - annealing: add noise, e.g. every n learning steps: wji(k + 1, k) = wji(k + 1, k) + T * Gji(k + 1, k), where G is a Gaussian random variable with mu = 0 and T is the temperature. Annealing improves minimization but requires longer learning. - step size adaptation: increase epsilon in flat regions and decrease it in steep terrain - momentum: Delta wji(k + 1, k)(t) = epsilon * deltaj(k + 1) * oi(k) + alpha * Delta wji(k + 1, k)(t - 1), where t is a step counter and alpha is a momentum coefficient, so the direction of step t-1 is kept to some degree. This avoids abrupt changes of direction and thereby counteracts stopping at minor minima and oscillations. - weight decay: overly large weights are problematic, as they lead to binary decisions of single neurons. Therefore an additional quadratic regularization term can be added to the error function to avoid large weights. This leads to a decay part in the learning rule.
How does Backpropagation in MLPs work?
Multilayer perceptrons (MLPs) can be regarded as a simple concatenation (and parallelization) of several perceptrons, each having a specified activation function σ and a set of weights wij. The idea that this can be done was discovered soon after the invention of the perceptron, but people didn't really use it in practice because nobody knew how to figure out the appropriate wij. The solution to this problem was the discovery of the backpropagation algorithm, which consists of two steps: first propagating the input forward through the layers of the MLP and storing the intermediate results, and then propagating the error backwards and adjusting the weights of the units accordingly. An updating rule for the output layer can be derived straightforwardly. The rules for the intermediate layers can be derived very similarly and only require a slight shift in perspective; the mathematics for that are however not in the standard toolkit, so we are going to omit the calculations and refer you to the lecture slides. We take the least-squares approach to derive the updating rule, i.e. we want to minimize the loss function L = 1/2 * (y − t)^2, where t is the given (true) label from the dataset and y is the (single) output produced by the MLP. To find the weights that minimize this expression we take the derivative of L w.r.t. wi, where we now assume that the wi are the ones directly before the output layer: ∂L/∂wi = (y − t) * oi * y(1 − y), where oi is the output of the unit feeding weight wi and y(1 − y) is the derivative of the sigmoid activation.
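The output-layer derivative can be spelled out with the chain rule. The sketch below assumes a logistic sigmoid output unit with activation $s = \sum_i w_i o_i$; the symbol $s$ is introduced here for illustration.

```latex
\begin{align}
L &= \tfrac{1}{2}\,(y - t)^2, \qquad y = \sigma(s), \qquad s = \sum_i w_i\, o_i, \\
\frac{\partial L}{\partial w_i}
  &= \frac{\partial L}{\partial y}\,
     \frac{\partial y}{\partial s}\,
     \frac{\partial s}{\partial w_i}
   = (y - t)\,\sigma'(s)\,o_i
   = (y - t)\,y\,(1 - y)\,o_i ,
\end{align}
```

The last step uses $\sigma'(s) = \sigma(s)\,(1 - \sigma(s))$ for the logistic sigmoid. A gradient descent step then changes each weight by $\Delta w_i = -\epsilon\,\partial L / \partial w_i$.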
What is the problem with outliers?
Outliers can drastically spoil statistics, especially in small data sets. There are some measures that are robust against outliers even without explicit detection (e.g. the median).
What is the difference between PCA and principal curves?
PCA: mean square error function for a data set given by a probability density P(~x): E = 1/2 * Integral((~x - Sum over i=1..m (ai(~x) * ~pi))^2 * P(~x) d~x). Approximate the manifold by m vectors ~pi, find coefficients ai(~x) for each ~x. Principal curves: generalize the approximation to a parameterized surface ~X(a1,...,am | ~w) of m dimensions. The vector ~w element of R^n contains the parameters that determine the shape of the surface, found by minimizing E = 1/2 * Integral((~x - ~X(a1(~x),...,am(~x) | ~w))^2 * P(~x) d~x). The m parameters ai(~x) determine the point on the surface ~X that best matches ~x and have to be computed for each ~x individually (in a 2D example dataset, there is only one parameter, since m = 1 and ~X is a curve). The number n of parameters in ~w determines the ability of ~X to fit a manifold (small n underfits, large n overfits). For a good fit, additional smoothness constraints should be used. ~w can be fitted using e.g. gradient descent (normally stochastic approximation is used, i.e. a downhill step over one single sample, since otherwise each step would require integration over all data).
Provide the most specific hypothesis which is learned according to the Find-S algorithm when using the examples below. Provide the individual learning steps, i.e., provide the most specific hypothesis after each learning step.
Example  a1  a2  a3  Classification
1        s   o   p   true
2        s   r   t   true
3        t   r   p   false
4        t   o   p   false
S0 = {<∅,∅,∅>} S1 = {<s,o,p>} S2 = {<s,?,?>} (example 2 differs from S1 in both a2 and a3) S3 = S4 = {<s,?,?>} (Find-S ignores the negative examples 3 and 4)
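Running Find-S on the four examples from the question can be sketched as follows. The representation (tuples of attribute values, `None` for the most specific hypothesis <∅,∅,∅>) is an illustrative choice.

```python
def find_s(examples):
    """Return the most specific hypothesis after each training example.

    `examples` is a list of (attribute_tuple, is_positive) pairs.
    None stands for the initial hypothesis <∅, ..., ∅>.
    """
    hypothesis = None
    history = []
    for attributes, positive in examples:
        if positive:                       # negative examples are ignored by Find-S
            if hypothesis is None:
                hypothesis = list(attributes)
            else:
                # Generalize minimally: keep matching values, replace the rest by '?'.
                hypothesis = [h if h == a else "?" for h, a in zip(hypothesis, attributes)]
        history.append(tuple(hypothesis) if hypothesis is not None else None)
    return history

examples = [
    (("s", "o", "p"), True),
    (("s", "r", "t"), True),
    (("t", "r", "p"), False),
    (("t", "o", "p"), False),
]
history = find_s(examples)
# history == [('s','o','p'), ('s','?','?'), ('s','?','?'), ('s','?','?')]
```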
Describe the training of a perceptron and describe two modes of learning
The perceptron is trained iteratively using labeled data D = {(~xi, ti)}, ~xi element R^d. Perceptron training rule: Delta ~w = epsilon * (t - y) * ~x. Convergence can be shown if the learning rate epsilon is sufficiently small (and the task is solvable by a perceptron). The learning rule can be derived from minimizing the mean squared error. Two modes of learning: - batch mode: gradient descent with respect to the entire training set: Delta ~w = epsilon * Sum over i ((ti - y(~xi)) * ~xi) - incremental mode: gradient descent with respect to one example (~xi, ti): Delta ~w = epsilon * (ti - y(~xi)) * ~xi. For small epsilon, incremental mode can be shown to approximate batch mode arbitrarily closely.
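A minimal sketch of incremental-mode training with the perceptron rule. It assumes a threshold output unit and a constant 1.0 as first input to act as bias; the learning rate, epoch limit, and the AND/XOR data sets are illustrative choices.

```python
def predict(w, x):
    """Threshold unit; x is expected to contain a constant 1.0 as bias input."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def train_perceptron(data, epsilon=0.1, max_epochs=1000):
    """Incremental mode: update w after every single example."""
    w = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        errors = 0
        for x, t in data:
            y = predict(w, x)
            if y != t:
                errors += 1
                # Perceptron training rule: Delta w = epsilon * (t - y) * x
                w = [wi + epsilon * (t - y) * xi for wi, xi in zip(w, x)]
        if errors == 0:                    # every training example is classified correctly
            break
    return w

# Logical AND is linearly separable, so training converges.
and_data = [([1.0, 0, 0], 0), ([1.0, 0, 1], 0), ([1.0, 1, 0], 0), ([1.0, 1, 1], 1)]
w_and = train_perceptron(and_data)

# XOR is not linearly separable: at least one example stays misclassified.
xor_data = [([1.0, 0, 0], 0), ([1.0, 0, 1], 1), ([1.0, 1, 0], 1), ([1.0, 1, 1], 0)]
w_xor = train_perceptron(xor_data)
```

Batch mode would instead sum the updates over all examples before changing the weights once per epoch.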
What does it mean if some features( x1 and x2) do not appear in the decision trees?
Sepal length and sepal width are not relevant for the classification. This might be either because they are redundant or because they are independent of the class.
What is the difference between single- and complete-linkage clustering?
Single-linkage tends to chain clusters along the data. That is why it combines the points in the center area with those in the bottom right corner. Complete-linkage prefers compact clusters and thus combines each of the point heavy areas individually without merging them.
What is the idea behind nominal scales? What is a possible problem and solution?
So far, all distance measures relied on the topology of R^n. But there are data with other topologies (angular attributes...). Solution: embed the different topologies into R^n. Nominal scales: map nominal attributes to real-valued vectors, e.g. (stone, wood, metal) → ((1, 0, 0), (0, 1, 0), (0, 0, 1)). Problem: for a large number n of attribute values, the dimensionality becomes too high; a solution would be to choose normalized random vectors instead.
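The mapping of nominal values to unit vectors can be sketched in a few lines. The helper name `one_hot` is an illustrative choice; the example values are the ones from the text.

```python
def one_hot(value, categories):
    """Map a nominal value to a unit vector with a 1 at the position of its category."""
    return tuple(1 if value == c else 0 for c in categories)

materials = ["stone", "wood", "metal"]
codes = {m: one_hot(m, materials) for m in materials}
# codes["wood"] == (0, 1, 0)

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# All pairs of distinct values end up equally far apart, so no artificial order is imposed.
pair_distances = {manhattan(codes[a], codes[b])
                  for a in materials for b in materials if a != b}
```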
What is the idea behind soft clustering?
So far: clusters as sets of data points belonging to their respective centers. → disjoint clusters / hard clustering (each point assigned to ONE cluster) Drawback → no way to express uncertainty about an assignment. Soft clustering idea: describe data by a probability distribution P(~x). A data point is assigned to a cluster by probabilities (expressing uncertainty). Clusters have no boundaries and are represented by Gaussians.
What is the Rosner test? Describe its purpose and provide its formal definition?
The Rosner test is an iterative procedure to remove outliers from a data set via a z-test:
while new outliers are found do
    calculate mean mu (or median m) and SD sigma
    find the data point xi* with the largest z-value: i* = argmax over i of zi, with zi = |xi - mu| / sigma
    if xi* is an outlier then
        remove xi*
    end if
end while
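A simplified sketch of this iterative procedure. It uses a fixed z-threshold of 3 as the outlier criterion, which is an assumption for illustration; a full Rosner test would compare against critical values that depend on the sample size. Note that in small samples the largest possible z-value is bounded, so a threshold of 3 may never be exceeded.

```python
import statistics

def iterative_z_removal(data, threshold=3.0):
    """Repeatedly remove the point with the largest z-value while it exceeds the threshold."""
    remaining = list(data)
    removed = []
    while len(remaining) > 2:
        mu = statistics.mean(remaining)
        sigma = statistics.stdev(remaining)
        if sigma == 0:                      # all values identical: nothing to remove
            break
        z, idx = max((abs(x - mu) / sigma, i) for i, x in enumerate(remaining))
        if z <= threshold:                  # no new outlier found: stop
            break
        removed.append(remaining.pop(idx))
    return remaining, removed

# Hypothetical measurements around 10 with one gross outlier.
data = [9.0, 10.0, 11.0, 10.0] * 5 + [100.0]
cleaned, outliers = iterative_z_removal(data)
```

Recomputing mean and standard deviation after every removal matters, because a gross outlier inflates sigma and can mask further outliers.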
Why do we need data visualization techniques and what are techniques to visualize high dimensional data?
Sometimes it is necessary to visualize high dimensional data and a projection via PCA or similar methods might not help enough: We might lose information in a 2D projection. In those cases it is useful to come up with other representations of data which we could potentially print on a sheet of paper. Techniques are usually glyphs, but different kinds of projection might already be enough (taking information loss into account).
How can we avoid overfitting?
Stop growing the tree when a data split is no longer statistically significant, OR grow the full tree & post-prune.
What is the effect of Hebb's rule on a pair of neurons?
The connection between two neurons, represented by the weight of the postsynaptic neuron, will increase during training until an equilibrium with the decay term is reached. Then, the weight is proportional to the correlation of the activity of the neurons.
What are the features of PCA?
The eigenvectors are called principal components, which can be interpreted as features, receptive fields or filter kernels. Expansion in the m < d eigenvectors with the largest eigenvalues is the optimal linear method to minimize the mean square reconstruction error.
Why an inductive bias?
The hypothesis space limits what the algorithm can find. E.g. by choosing conjunctive hypotheses we have biased the learner. However, when representing all concepts (disjunction of all positive examples), we would learn nothing; we would only write the data differently. No generalization would be possible, and convergence would only be achieved once all possible instances had been presented. How is generalization achieved? The 'inductive leap' came about via the independence of the attributes underlying the construction of H. - bias-free learning systems make no a-priori assumptions → they can only collect examples without generalization - inductive learning makes sense only with prior assumptions (bias)! - learning is simply: collect examples, interpolate among the examples according to the inductive bias, actively acquire examples following the suggestions from the version space - for every applied learning system the inductive bias should be clear!
What is the idea of decision trees?
The idea of a decision tree is to classify data by a sequence of interpretable steps. Each node tests an attribute, each branch stands for one of the values of this attribute, and each leaf then provides a classification.
What is an inductive bias?
The inductive bias of a machine learning algorithm is the set of assumptions that must be added to the observed data to get a logical deduction from them. That means that it is some preference of the algorithm for a specific set of hypotheses based on a set of training observations.
Which of the learning algorithms you heard about in the lecture (Candidate Elimination and Find-S) has the stronger bias?
The inductive bias of the Candidate Elimination algorithm is the assumption that the target concept is contained in the given hypothesis space. The inductive bias of the Find-S Algorithm is that the resulting hypothesis labels all new instances as negative instances unless the opposite is entailed by its prior knowledge from the training set. This has a really big impact as negative examples are ignored completely. This means Find-S has the stronger bias.
What is the version space?
The version space VS(H;D) with respect to the hypothesis space H and training examples D is the subset of hypotheses from H consistent with all training examples in D (i.e. version space = all consistent hypotheses). VS(H;D) = {h element H | Consistent(h;D)}
How many principal components are there at most when you apply the PCA with the 20 training face images provided? How many principal components were there for the 16 binary images if we made a PCA on all of them?
There are at most 20 principal components possible for the face images, but only four principal components for the binary images.
Tree 2 only has a 96% accuracy on the training set. Why might this tree still be preferable over tree 1?
Tree 1 is probably overfitted to this specific dataset, i.e. it has captured not only the structure but also the noise in the data. It probably won't generalize as well as the second tree. Another advantage of tree 2 is that it is faster at classifying new data, since fewer computations have to be made. This difference is hardly noticeable, however.
Name and explain the three types of learning in artificial neural networks
Unsupervised learning: no teacher, unlabeled examples. The effect of learning is coded in the learning rules. Supervised learning: teacher/labeled examples. Learning is directed at mapping the input part of the examples to the label as an output. Reinforcement learning: weak teacher; the agent tries to reach a goal and only gets feedback on whether the goal has been reached or not. The agent has to find the way (create the mapping) itself. More realistic forms of learning are becoming popular: weakly supervised and semi-supervised learning (pre-structuring of the data with unsupervised learning, then use of the rare labeled data).
training MLP + Backpropagation pseudocode
Use the same error function as before & minimize by gradient descent. Problem: all derivatives on the path from wik(m,n) to the output layer are required. The backpropagation algorithm provides a scheme for efficient computation of all required derivatives and weight adaptations.
initialize ~w randomly
while stop criterion is not met do
    propagate ~x forward through the net to obtain ~y
    compute the error ti - yi(~x) between desired output and actual output for each output dimension i
    propagate the errors backwards through the net to find the 'responsible' weights
    update the weights
end while
The weight of neuron j is adapted in proportion to the activation of the neuron in the previous layer to which it is connected and to the weighted errors it causes at the output. This complex scheme is necessary since target values are only available for the outputs, not for the hidden layer neurons (so a 'target value' has to be constructed). The BP algorithm performs a stochastic approximation of gradient descent for the error function E.
What do we do with outliers (in general)? Z-value and Rosner test
Usually outliers are detected & removed. But to do this, we first need to define what is regular. Most often we use a normal distribution here; for multivariate data, clustering algorithms with a normal distribution assumed for each cluster can be used. First, we need to detect probable outliers. In order to decide which data points we want to declare as outliers, we have to find a model for regular, meaning "not outlying", data points. What we do most of the time is to assume a normal distribution underlying the data (or a multivariate distribution where each cluster is normally distributed). One option is to calculate the z-value for each data point (a measure of the distance from the mean in terms of the standard deviation); data points with a high z-value would be regarded as outliers, and a common threshold would be a z-value bigger than 3. This can be improved by using the median instead of the mean and tweaking the threshold. The Rosner test takes it one step further by iteratively calculating z-values and removing found outliers until none can be found anymore. This can be done one outlier at a time or k outliers at a time for more efficiency. A different approach would be to not remove the outliers completely, but to weight them according to their z-values. And lastly, an alternative to complete removal would be to fill up the emerging gaps with values that fit the distribution better.
What is the inductive bias of the nearest neighbor classifier?
assume that most of the cases in a small neighborhood in feature space belong to the same class. Given a case for which the class is unknown, guess that it belongs to the same class as the majority in its immediate neighborhood. This is the bias used in the k-nearest neighbors algorithm. The assumption is that cases that are near each other tend to belong to the same class.
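A minimal sketch of the majority vote described above, assuming Euclidean distance and a hypothetical function name knn_classify; the toy examples are made up for illustration.

```python
import math
from collections import Counter

def knn_classify(query, examples, k=3):
    """examples: list of (feature_vector, label); majority vote among the k nearest."""
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

examples = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
    ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"),
]
label_near_origin = knn_classify((0.5, 0.5), examples)
label_far = knn_classify((5.5, 5.5), examples)
```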
What are the two boundaries of the version space?
boundaries of the version space: - general boundary G of VS(H;D) is the set of maximally general members - specific boundary S of VS(H;D) is the set of its maximally specific members every member of the VS lies between (including) these boundaries
What is rule post pruning?
build decision tree (allow overfitting), convert tree to rules (one rule for each path from root to leaf), prune each rule (remove any precondition whose removal improves accuracy on test data), sort the final rules by accuracy on the validation set and apply them in this order when classifying new instances. Why prune rules? Only paths are pruned, not entire subtrees → more sensitive pruning / no hierarchy while pruning, even pruning of the root node is possible / better readability
What is the curse of dimensionality?
combinatorial explosion: n^d (volume of the resulting space) combinations for d variables with n values each, which makes n^d points necessary to sample the whole space --> sampling for large d becomes impossible. Any real data set will not be able to fill a high dimensional space.
What is meant by "Compressing by Clustering"?
idea: suppose we have a data set of d-dimensional vectors that shall be transmitted over some channel, where the bits per data point depend on the dimension d. We can cluster the data, transmit the cluster centers once, and then transmit for each data point only the cluster that represents it best. A small number of clusters gives high compression but bad quality, a high number of clusters vice versa.
How does the z-test work?
detect outliers via z-values: zt = |xt - mu| / sigma. zt is a measure of the distance of xt from the mean mu, normalized by the standard deviation sigma. Normally, data with zt > 3 are considered outliers. Improvement: outliers themselves distort mu, so use the median instead (then commonly with threshold 3.5).
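Both variants can be sketched in a few lines. The median-based version below is the commonly used modified z-score, which scales by the median absolute deviation (MAD) with the constant 0.6745; that scaling and the function names are assumptions, since the notes only mention the median and the threshold 3.5.

```python
import statistics

def z_outliers(data, threshold=3.0):
    """Points whose z-value |x - mu| / sigma exceeds the threshold."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    return [x for x in data if abs(x - mu) / sigma > threshold]

def modified_z_outliers(data, threshold=3.5):
    """Median-based variant: 0.6745 * |x - median| / MAD (assumed scaling)."""
    med = statistics.median(data)
    mad = statistics.median(abs(x - med) for x in data)
    if mad == 0:
        return []  # degenerate case: more than half of the points are identical
    return [x for x in data if 0.6745 * abs(x - med) / mad > threshold]

data = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10, 11, 9, 50]  # toy data, 50 is the outlier
plain = z_outliers(data)
robust = modified_z_outliers(data)
```

The median and MAD barely move when 50 is added, so the robust score of the outlier is far above its threshold, whereas its plain z-value is only slightly above 3.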
Explain what entropy and information gain is used for
The decision which attribute to choose is based on entropy: in many machine learning applications it is crucial to determine which criteria are necessary for a good classification. Decision trees have those criteria close to the root, imposing an order from significant to less significant criteria. One way to select the most important criterion is to compare its information gain or its entropy to that of the others.
When is hypothesis h consistent with training examples D of target concept c?
hypothesis h is consistent with training examples D of target concept c iff h(x) = c(x) (correctly classified) for all training examples (x; c(x)) element D: Consistent(h;D) <==> for all (x; c(x)) element D : h(x) = c(x)
What is the idea of ID3 learning algorithm?
idea: find the best attribute by its distribution over the examples. Put the best one at the root and its possible values as branches. Search for the next best attribute for each of the branches...
What is the idea of k-means clustering?
idea: divide dataset D into clusters C1...CK, which are represented by their K centers of gravity (aka means) ~w1...~wK, and minimize the quadratic error. Iterative K-means: start with randomly chosen reference vectors, assign each data point to its best-matching reference vector, update the reference vectors by shifting them to the mean of their cluster, stop when the cluster centers no longer move.
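The iterative procedure can be sketched as follows, assuming Euclidean distance, initial centers drawn from the data with a fixed seed, and a hypothetical function name kmeans; the toy data are two well-separated blobs.

```python
import math
import random

def kmeans(data, k, seed=0, max_iter=100):
    """Iterative k-means on a list of tuples; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)  # randomly chosen reference vectors
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its best-matching center.
        clusters = [[] for _ in range(k)]
        for x in data:
            best = min(range(k), key=lambda j: math.dist(x, centers[j]))
            clusters[best].append(x)
        # Update step: shift each center to the mean of its cluster.
        new_centers = [
            tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centers[j]
            for j, pts in enumerate(clusters)
        ]
        if new_centers == centers:  # centers no longer move
            break
        centers = new_centers
    return centers, clusters

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers, clusters = kmeans(data, k=2)
```

Like the maximization algorithm discussed earlier, the result can depend on the initial centers; the fixed seed only makes this sketch reproducible.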
What is the idea behind Conceptual clustering?
idea: combines ideas from clustering and decision tree learning for unsupervised classification. Best-known algorithm: COBWEB. Motivated by the drawbacks of ID3 and inspired by cognitive classification: category formation is strongly connected to forming prototypical concepts (basic level category and generalizations and specializations of it). Ideas of COBWEB: - unsupervised learning - incremental learning - probabilistic representation: gradual assignment of objects to categories - no a priori fixed number of categories - realized by a global utility function which determines: # of categories / # of hierarchical levels / assignment of objects to categories
What is the idea behind multidimensional scaling?
idea: find a lower dimensional manifold such that the projection preserves the structure of the data as well as possible. For this we must define structure. Important: the distances between data points. Given R^D and a lower dimensional space R^d, find a mapping R^D --> R^d such that the distances between data points in R^D are well approximated by the distances between the projections of those data points in R^d.
What is the idea behind Find-S algorithm? What are Problems and what are advantages?
idea: find a maximally specific hypothesis. Problems: - learns nothing from negative examples, - cannot tell whether it learned a concept, - cannot tell whether training data is inconsistent. Good: picks a maximally specific h (but depending on H there might be several solutions).
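A minimal sketch of Find-S for conjunctive hypotheses over attribute tuples, where "?" stands for "any value". The function name find_s and the weather-style toy examples are assumptions made for illustration.

```python
def find_s(examples):
    """examples: list of (attribute_tuple, is_positive). Returns the most specific
    conjunctive hypothesis covering all positives, or None if there are none."""
    h = None
    for x, positive in examples:
        if not positive:
            continue  # Find-S ignores negative examples entirely
        if h is None:
            h = list(x)  # first positive example: most specific hypothesis
        else:
            # Generalize only the attributes that disagree with the example.
            h = [hi if hi == xi else "?" for hi, xi in zip(h, x)]
    return h

examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
h = find_s(examples)
```

The negative example is skipped without any consistency check, which is exactly why the algorithm cannot detect inconsistent training data.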
What is the idea of the Chebyshev distance? (aka Maximum/chessboard distance)
idea: minimum number of moves a king needs between two positions on a chessboard
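As a small illustration, the distance is the largest coordinate difference; the function name chebyshev is an assumption.

```python
def chebyshev(a, b):
    """Largest absolute coordinate difference between a and b."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))

# A king moving from square (0, 0) to square (3, 5) needs 5 moves.
king_moves = chebyshev((0, 0), (3, 5))
```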
What is the idea behind projection pursuit? What is interesting? What is the procedure? Name four criteria to measure the deviation of a distribution from the standardized normal distribution. What are the problems?
idea: project onto 2-3 selected directions like PCA, but choose directions that exhibit interesting structure. What is interesting? variance / non-Gaussian distribution (structure) / clusters. Procedure: 1. select 1 to 3 directions (by PCA or simply original dimensions) 2. project onto these directions & get density P(~x) of projected data 3. compute an index of how 'interesting' (according to some criterion) the data is 4. maximize the index by searching for better directions. Criteria: - Friedman-Tukey index: minimized if P is a parabolic function similar to the normal distribution - Hermite/Hall index: minimized when P is a standardized normal distribution - Natural Hermite index: like the Hermite index, only with higher weighting of the center - Entropy index: minimized by the standardized normal distribution. Problems: maximization requires estimation of the density P(~x) of the projected data. Methods are kernel density estimation (Parzen windows) / orthonormal function expansion
What is the idea and what are pro's and con's of the Mahalanobis distance?
idea: scale distances using the covariance matrix C. Pro: scale & translation invariant / if C is the unit matrix → Euclidean distance. Con: scaling might destroy structure within the data. The points of equal Mahalanobis distance to a center form an ellipsoid.
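A sketch for the two-dimensional case, using the standard definition d(x, mu) = sqrt((x - mu)^T C^-1 (x - mu)), which the notes do not spell out. The function name mahalanobis_2d and the example covariance matrices are assumptions.

```python
import math

def mahalanobis_2d(x, mu, C):
    """Mahalanobis distance of x from mu for a 2x2 covariance matrix C."""
    (a, b), (c, d) = C
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # inverse of a 2x2 matrix
    dx = [x[0] - mu[0], x[1] - mu[1]]
    q = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.sqrt(q)

unit = [[1, 0], [0, 1]]
stretched = [[4, 0], [0, 1]]  # variance 4 along the first axis
d_euclid = mahalanobis_2d((3, 4), (0, 0), unit)
d_long_axis = mahalanobis_2d((2, 0), (0, 0), stretched)
d_short_axis = mahalanobis_2d((0, 1), (0, 0), stretched)
```

With the stretched covariance, the points (2, 0) and (0, 1) are equally far from the center, which illustrates the ellipsoid of equal distance.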
What is the idea behind Multilayer Perceptrons?
idea: solve nonlinear separation tasks with nonlinear separatrices & classes covering disjoint parts by combining several perceptrons, and also generalize to several outputs
What is the idea, problems and advantages of the List-then-eliminate algorithm?
idea: start with all hypotheses, then for each example eliminate the inconsistent hypotheses. For a large VS one needs many examples, or a few quite informative ones; if the VS is ∅, there are inconsistencies. Good: computes the complete VS (ideally only one hypothesis remains). Problems: can only be applied to finite H, requires enumerating all hypotheses → impractical for real problems
What is the problem with attributes with many values in classification? What is a solution for this problem?
if an attribute has many values, Gain tends to select it (even if it is nonsense, because it enables perfect classification). However, this prevents good generalization. One solution: use GainRatio instead of Gain: GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A) with SplitInformation(S,A) = - sum_{i=1..n} (|Si|/|S|) log2(|Si|/|S|), where Si is the subset of S for which A has value vi and A has n different values in total. SplitInformation is the entropy with respect to the attribute values ('normalizing the Gain'). GainRatio favors the attribute with fewer values if two attributes yield the same gain. GainRatio is not defined for attributes with the same value for all examples (zero denominator), but they are useless anyway and have to be excluded.
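The effect can be checked on a toy data set. The sketch below assumes the usual definitions (Gain as entropy reduction, SplitInformation as the entropy of the attribute values); the function names and the two example attributes are made up for illustration.

```python
import math
from collections import Counter

def entropy(items):
    n = len(items)
    return -sum(c / n * math.log2(c / n) for c in Counter(items).values())

def gain(values, labels):
    """Information gain of splitting `labels` by the attribute column `values`."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [l for a, l in zip(values, labels) if a == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    split_info = entropy(values)  # entropy with respect to the attribute values
    if split_info == 0:
        raise ValueError("attribute has a single value; GainRatio is undefined")
    return gain(values, labels) / split_info

labels = ["yes", "yes", "no", "no"]
ident = ["d1", "d2", "d3", "d4"]  # ID-like attribute: one value per example
useful = ["a", "a", "b", "b"]     # two-valued attribute
```

Both attributes reach the maximal Gain of 1 bit here, but the ID-like attribute has SplitInformation 2 and the two-valued one has SplitInformation 1, so GainRatio prefers the attribute with fewer values.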
Explain Hebb's rule.
increase weight wi in proportion to the product of the activation of the neuron and the input via this channel i. So far, weights can become arbitrarily large. To prevent this, there are several solutions: - Decay term (including constant forgetting): Delta~w = epsilon * y(~x~w) ~x - gamma ~w - Dynamic normalization to |~w| = 1: Delta~w = epsilon * y(~x~w) ~x - gamma ~w (~w^2 - 1) - Explicit normalization to |~w| = 1: ~w^new = (~w^old + Delta~w) / |~w^old + Delta~w| - Oja's rule, uses weight decay proportional to y^2: Delta~w = epsilon * y(~x~w) (~x - y(~x~w) ~w)
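Oja's rule can be tried on a toy data set. The sketch assumes a linear neuron y = ~x~w, a step size epsilon of 0.05 and a hypothetical function name oja_train; the data lie mostly along the direction (1, 1).

```python
import math

def oja_train(data, w, epsilon=0.05, epochs=200):
    """Apply Oja's rule  Delta w = epsilon * y * (x - y * w)  sample by sample."""
    w = list(w)
    for _ in range(epochs):
        for x in data:
            y = sum(xi * wi for xi, wi in zip(x, w))  # linear activation y = x . w
            w = [wi + epsilon * y * (xi - y * wi) for xi, wi in zip(x, w)]
    return w

# Zero-mean toy data with large variance along (1, 1) and small variance along (1, -1).
data = [(1, 1), (-1, -1), (2, 2), (-2, -2), (0.1, -0.1), (-0.1, 0.1)]
w = oja_train(data, w=(1.0, 0.0))
norm = math.sqrt(sum(wi * wi for wi in w))
```

In this setup the weight vector ends up pointing along (1, 1) with length close to 1, so the decay term keeps the weights bounded without an explicit normalization step.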
What is the problem domain of decision trees?
learning a discrete target function for instances describable as attribute-value pairs. Disjunctive hypothesis required. Possible for noisy data. Decision trees can be written as a set of rules (e.g. as disjunction of conjunction of attribute values: each path is one conjunction, combine all paths with disjunction).
Describe possible disadvantages of the Find-S algorithm.
learns nothing from negative examples, cannot tell whether it learned a concept, cannot tell whether training data is inconsistent
For which data structures is normal PCA appropriate?
linear data structures. Problems occur if the data structure is non-linear. A better solution in that case is local PCA.
PCA: How can we find a suitable number m of eigenvectors?
look for abrupt jumps in the spectrum of eigenvalues, indicating a suitable cut-off. PCA does not generate structure - it only makes existing structure accessible.
What are possible distance measures of clusters?
minimum distance: Dmin(X, Y); maximum distance: Dmax(X, Y); mean of all distances: Dmean(X, Y); distance of centers: Dcenter(X, Y). In MATLAB, D = pdist2(X,Y) returns a matrix D containing the Euclidean distances between each pair of observations in the mx-by-n data matrix X and the my-by-n data matrix Y. Rows of X and Y correspond to observations, columns correspond to variables. D is an mx-by-my matrix, with the (i,j) entry equal to the distance between observation i in X and observation j in Y. The (i,j) entry will be NaN if observation i in X or observation j in Y contains NaNs.
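The four cluster distances can be sketched directly from the pairwise Euclidean distances. The function names (d_min, d_max, d_mean, d_center) and the toy clusters are assumptions.

```python
import math
from itertools import product

def pairwise(X, Y):
    """All Euclidean distances between a point of X and a point of Y."""
    return [math.dist(x, y) for x, y in product(X, Y)]

def d_min(X, Y):
    return min(pairwise(X, Y))

def d_max(X, Y):
    return max(pairwise(X, Y))

def d_mean(X, Y):
    d = pairwise(X, Y)
    return sum(d) / len(d)

def centroid(X):
    return tuple(sum(c) / len(X) for c in zip(*X))

def d_center(X, Y):
    return math.dist(centroid(X), centroid(Y))

X = [(0, 0), (1, 0)]
Y = [(3, 0), (5, 0)]
```

For these two clusters the pairwise distances are 3, 5, 2 and 4, so the minimum is 2, the maximum is 5, and both the mean of all distances and the distance of the centers are 3.5.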
What are principal curves?
principal curves: can handle nonlinear distributions via nonlinear basis functions. Data is projected onto the principal curve. However, an appropriate dimensionality and flexibility parameter have to be found.
What are pro's and con's of the Euclidean distance?
pro: simple and frequent measure con: no individual weighting of components
What are pro's and con's of the Pearson distance?
pro: weights dimensions according to their standard deviation con: correlated vector components are overweighted
What are properties of high dimensional spaces?
random pairs of vectors are likely to have similar distances and be close to perpendicular. Also the volume of the surface layer dramatically increases compared to the volume of the inner part (overall probability that one dimension has an extreme value increases).
What is reduced error pruning?
remove nodes to achieve better generalization on test data. Pruning node n: remove the subtree of n and make n a leaf node, assigning the most common classification of its affiliated training examples. Check all nodes for pruning, then actually remove the one that results in the highest performance increase (greedy). Repeat as long as performance on the test data increases. Properties: produces the smallest version of the most accurate tree / removes nodes produced by noise (the same noise is by definition not present in the test data). If data is scarce, use rule post pruning instead.
How can we obtain binary attributes from continuous-valued attributes?
to include continuous-valued attributes, define thresholds to obtain binary attributes. Thresholds should be chosen to gain information: sort the examples by attribute value, find boundaries where the classification changes, set thresholds at these boundaries.
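A sketch of the threshold search, assuming that a candidate threshold is placed midway between two neighbouring values whose classifications differ; the function name candidate_thresholds and the temperature data are illustrative assumptions.

```python
def candidate_thresholds(values, labels):
    """Midpoints between neighbouring sorted values where the class label changes."""
    pairs = sorted(zip(values, labels))
    return [
        (v1 + v2) / 2
        for (v1, l1), (v2, l2) in zip(pairs, pairs[1:])
        if l1 != l2 and v1 != v2
    ]

temperature = [40, 48, 60, 72, 80, 90]
play = ["no", "no", "yes", "yes", "yes", "no"]
thresholds = candidate_thresholds(temperature, play)
```

Each candidate then defines a binary attribute such as "temperature > 54", and the candidate with the highest information gain would be kept.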
Kohonen Net
too complicated --> deliberately skipped, hoping it does not come up in the exam
Provide pseudocode for the ID3 learning algorithm
while not perfect classification do: A ← decision attribute with highest Gain; node ← A as decision attribute; for each value of A: create a new descendant of node; sort the training examples to the new leaf nodes; iterate over the new nodes; end while
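The pseudocode can be turned into a small recursive sketch. It assumes examples given as dictionaries of discrete attribute values, Gain defined as entropy reduction, and a majority vote when no attributes are left; the names id3 and classify and the toy data are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    n = len(rows)
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        sub = [l for r, l in zip(rows, labels) if r[attr] == v]
        remainder += len(sub) / n * entropy(sub)
    return entropy(labels) - remainder

def id3(rows, labels, attrs):
    """Returns a class label (leaf) or a tuple (attribute, {value: subtree})."""
    if len(set(labels)) == 1:
        return labels[0]  # perfect classification reached
    if not attrs:
        return Counter(labels).most_common(1)[0][0]  # majority vote
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    branches = {}
    for v in {r[best] for r in rows}:
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        branches[v] = id3([rows[i] for i in idx], [labels[i] for i in idx],
                          [a for a in attrs if a != best])
    return (best, branches)

def classify(tree, row):
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree

rows = [
    {"outlook": "sunny", "wind": "weak"},
    {"outlook": "sunny", "wind": "strong"},
    {"outlook": "overcast", "wind": "weak"},
    {"outlook": "rain", "wind": "weak"},
    {"outlook": "rain", "wind": "strong"},
]
labels = ["no", "no", "yes", "yes", "no"]
tree = id3(rows, labels, ["outlook", "wind"])
```

On this toy data "outlook" has the higher Gain and becomes the root; only the "rain" branch needs a second test on "wind".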
What is the effect of Hebb's rule on the weight vector?
~w converges to the eigenvector of C with the largest eigenvalue. Hebbian learning thus finds the largest principal component, similar to PCA. Compared to a winner-takes-all rule, the Hebb-trained neurons are affected by all stimuli: close input has less effect than far input. This is why ~w is adjusted according to the data variance.
Give the sigmoid function and its first derivative. Why is it used for Multi-Layer Perceptrons?
σ(t)=1/(1+e^(−t)) ∂σ/∂t=σ(t)(1−σ(t)) The sigmoid function is commonly used because of its nice analytical properties: its range is (0,1), it is non-linear, strictly monotonic, continuous, differentiable, and the derivative can be expressed in terms of the original function at the given point. This allows us to avoid redundant calculations. The sigmoid function is a special case of the more general logistic function, which can be found in many different fields: biology, chemistry, economics, demography and, recently most prominently, artificial neural networks.
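The derivative identity follows from the chain rule in one line; the short derivation below is added as a worked step and uses only the definition of σ given above.

```latex
\begin{aligned}
\sigma(t) &= \left(1 + e^{-t}\right)^{-1} \\
\frac{\partial \sigma}{\partial t}
  &= \frac{e^{-t}}{\left(1 + e^{-t}\right)^{2}}
   = \frac{1}{1 + e^{-t}} \cdot \frac{e^{-t}}{1 + e^{-t}}
   = \sigma(t)\,\bigl(1 - \sigma(t)\bigr),
\qquad \text{since } 1 - \sigma(t) = \frac{e^{-t}}{1 + e^{-t}} .
\end{aligned}
```

In backpropagation this means the derivative at a neuron can be computed from its already known activation y as y(1 - y), without evaluating the exponential again.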