Data Analysis (CS4100)

Bayesian Network

To obtain the Bayesian network, we need to specify probabilities. In the example (story 2 in slides). P(E) = 0.01 P(H) = 0.1 P(R | E) = 0.4 P(R | ¬E) = 0 P(S | E, H) = 0.9 P(S | ¬E, H) = 0.9 P(S | E, ¬H) = 0.8 P(S | ¬E, ¬H) = 0.1 P(D | S) = 0.7 P(D | ¬S) = 0 P(W | S) = 0.7 P(W | ¬S) = 0.2 In general: 1. Choose some variables of interest. 2. For each variable X decide on the possible values x that X can take. For example, the possible values can be true (can be written as 1) or false (can be written as 0); we call such binary variables event nodes. Let V(X) be the set of possible values of X (always assumed finite). 3. Choose an appropriate causal diagram, which is a directed acyclic graph whose nodes are the chosen variables. 4. For every initial node X and every possible value x of X, specify the probability P(X = x). (For an event variable E, P(E) is a shorthand for P(E = true) and P(¬E) is a shorthand for P(E = false).) 5. For every other node X specify the conditional probabilities given the parents' values. Pros: - We get probabilistic predictions (rather than categorical, as in classification). - The distributed nature of computations. Cons: - Strong assumptions: we need to know the dependencies between the variables, and even the true data-generating mechanism is assumed to be known.
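As a hedged illustration in R (the network structure E → R, {E, H} → S, S → D, S → W is inferred from the conditional probability tables above), the probability of one configuration is just the product of the initial-node probabilities and the conditional probabilities given the parents (cf. the "Probability of a configuration" entry):

```r
# Sketch only: probabilities taken from the entry above; structure assumed from the CPTs.
p_E <- 0.01; p_H <- 0.1
p_R_given_E  <- c(E = 0.4, notE = 0.0)
p_S_given_EH <- c(E_H = 0.9, notE_H = 0.9, E_notH = 0.8, notE_notH = 0.1)
p_D_given_S  <- c(S = 0.7, notS = 0.0)
p_W_given_S  <- c(S = 0.7, notS = 0.2)

# Configuration: E = T, H = F, R = T, S = T, D = T, W = F.
# Probability = product of P(initial nodes) and P(node | parents) for every other node.
prob <- p_E * (1 - p_H) * p_R_given_E["E"] *
        p_S_given_EH["E_notH"] * p_D_given_S["S"] * (1 - p_W_given_S["S"])
unname(prob)   # 0.01 * 0.9 * 0.4 * 0.8 * 0.7 * 0.3 = 0.0006048
```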

Dendrogram

Tree-structured graph used to visualise the result of a hierarchical clustering calculation. Each fuse (merge) of two clusters is drawn as a horizontal line; the y-axis coordinate of that line corresponds to the (dis)similarity of the merged clusters. An alternative representation is based on sets, {{x_1, {x_2, x_3}}, {{{x_4, x_5}, {x_6, x_7}}, x_8}}; however, unlike the dendrogram, sets cannot express quantitative information.

Inversion

Two clusters are fused at a height below either of the individual clusters in the dendrogram. It leads to difficulties in visualisation as well as in the interpretation of the dendrogram

Dirty data

Types: Incomplete (missing), noisy/errors, inconsistent, intentional (disguised missing data; e.g. 01/01 as everyone's birthday). Why? • Incomplete data may come from: - "Not applicable" data value when collected - Different conventions between the time when the data was collected and when it is analysed - Human/hardware/software problems • Noisy data (incorrect values) may come from: - Faulty data collection instruments - Human or computer error at data entry - Errors in data transmission • Inconsistent data may come from: - Different data sources - Functional dependency violation (e.g., modify some linked data) • Duplicate records also need data cleaning

θ

Step function. θ(t) = t > 0 ? 1 : 0;

Semi-Supervised Learning

Supervised Learning where, out of n observations, m observations (m < n) have labels and the rest do not. Algorithms: - Seeded K-means • Labelled data (the seed points) provided by users are used for initialisation: initial centre for cluster j as the mean of the seed points having label j. • The seed points are only used for initialisation, and not in subsequent steps - Constrained K-means: • Labelled data are used to initialise the K-means algorithm • Cluster labels of seed data are kept unchanged in the cluster assignment steps and only the labels of the non-seed data are re-estimated.

Centroid Linkage

The dissimilarity between the centroid for cluster A (the mean vector) and the centroid for cluster B. This linkage can result in undesirable inversions.

Regression ANN

The learning machine works as follows: Z_m := σ(α_m0 + α_m * X), m = 1, ..., M; Y := β_0 + β * Z (where β * Z = ∑m=1:M(β_m * Z_m))

Data matrix

For a training set with n observations x_1, ..., x_n, this matrix has the transposed observations x^T_1, ..., x^T_n as its rows. Its size is n × p, where p is the number of attributes. Each attribute is represented as a column of the data matrix.

Ensemble methods

Use multiple algorithms to obtain better predictive performance than could be obtained from any of the algorithms by itself. Increasing the complexity/flexibility of the model will lead to a reduction in bias. They can help in achieving a low variance and low bias which can't be achieved with a single algorithm. Types: - Bagging - Boosting

Text classification in IR

Uses: - during preprocessing (such as: detecting a document's encoding, true casing, identifying the language of a document) - auto-detection of spam pages - auto-detection of sexually explicit content - sentiment detection - personal email sorting Since each document d is represented as a vector →v(d), we can use almost any prediction algorithm covered so far; KNN is a popular choice (with K = 3 or 5 being the common choices).

2Q/median

Value to the left of which lies 50% of the distribution

3Q

Value to the left of which lies 75% (or as close to 75% as possible in the discrete case) of the distribution

Variance

Var(X) = 𝔼((X − 𝔼X)²) Properties - Var(cX) = c² * Var(X) always - Var(X + Y) = Var(X) + Var(Y) if X and Y are independent In the learning context: the amount by which fˆ would change if we estimated it using a different training set (high variance means that small changes in the training set lead to large changes in fˆ)

Nyquist sampling theorem

The minimum sampling rate is twice the maximum component frequency of the function being sampled: Fs ≥ 2 * fmax, where Fs is the sampling frequency and fmax is the highest frequency component in the signal.

Sum of Squares Errors

The most widely used clustering criterion, summed over the clusters w_j: SSE = ∑j=1:K ∑x∈w_j ||x − μ_j||². - It measures how well the data set X = {x_1, x_2, ..., x_n} is represented by the cluster centres μ = {μ_1, ..., μ_K} (K ≤ n) - Clustering methods that use this criterion are called minimum variance.
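A minimal base-R sketch of this criterion (the function name sse and the simulated data are illustrative, not from the slides):

```r
# Sum-of-squares error of a clustering.
# x: numeric matrix (n x p), cluster: integer vector of cluster indices.
sse <- function(x, cluster) {
  sum(sapply(unique(cluster), function(j) {
    xj <- x[cluster == j, , drop = FALSE]
    mu <- colMeans(xj)                      # cluster centre mu_j
    sum(rowSums(sweep(xj, 2, mu)^2))        # squared distances to the centre
  }))
}

set.seed(1)
x  <- matrix(rnorm(40), ncol = 2)
km <- kmeans(x, centers = 2)
sse(x, km$cluster)        # equals km$tot.withinss
```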

Leave-One-Out Cross-Validation (LOOCV)

The number of folds (of size 1) is equal to the number of available observations. It gives approximately unbiased estimates of the ETE: n − 1 ≈ n (where n stands for the number of observations). It has a higher variance than K-fold CV with K < n. Standard explanation: When we perform LOOCV, we are averaging the outputs of n prediction rules trained on almost identical sets of observations. The outputs are highly positively correlated with each other, and their mean is highly variable. On the other hand, when K is small, there is much less correlation, and the mean tends to be much more stable (remember the law of large numbers).
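A minimal base-R sketch of LOOCV for simple linear regression (the simulated data and variable names are illustrative):

```r
set.seed(1)
n <- 30
x <- runif(n); y <- 2 * x + rnorm(n, sd = 0.3)
d <- data.frame(x = x, y = y)

errs <- sapply(1:n, function(i) {
  fit  <- lm(y ~ x, data = d[-i, ])            # train on all but observation i
  pred <- predict(fit, newdata = d[i, , drop = FALSE])
  (d$y[i] - pred)^2                            # squared error on the held-out point
})
mean(errs)   # LOOCV estimate of the expected test MSE
```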

Conditional probability

The probability of an event (A), given that another (B) has already occurred (i.e. the posterior probability of A after we learn that B has happened). ℙ(A | B) := ℙ(A ∩ B) / ℙ(B) Related: the Markov property (the future depends only on the present; the past is irrelevant) is stated in terms of such conditional probabilities

Brier loss

Version of the RSS used to compute the training error of a multi-class ANN: ∑i=1:n ∑k=1:K (ỹ_i,k − Y_i,k)², where ỹ_i,k is the one-hot encoding of the true label and Y_i,k is the network's output for class k

Out-Of-Bag error estimation

We can estimate the expected test error of a bagged prediction rule without cross-validation or a validation set. It can be shown: on average, each bagged tree makes use of around 2/3rds of the observations. The remaining ⅓ is the out-of-bag (OOB) observations. Procedure: - We can predict the response for the ith observation using each of the trees in which that observation was OOB. We have around B/3 predictions for the ith observation. - To obtain a single prediction for the ith observation, we average these predicted responses (regression) or take a majority vote (classification). - An OOB prediction is obtained in this way for each of the n observations, from which the overall OOB MSE (regression) or number of errors (classification) is computed. - The resulting OOB error is a valid estimate of the ETE for the bagged prediction rule since the response for each observation is predicted using only the trees that were not fit using that observation. - The OOB approach is particularly convenient for large data sets (CV would be computationally onerous).
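A minimal sketch of the OOB procedure for bagged regression trees; it assumes the rpart package for the individual trees, and the data and names are illustrative:

```r
library(rpart)
set.seed(1)
n <- 100
d <- data.frame(x = runif(n))
d$y <- sin(2 * pi * d$x) + rnorm(n, sd = 0.2)

B <- 200
preds <- matrix(NA, nrow = n, ncol = B)      # OOB predictions, one column per tree
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)           # bootstrap sample (~2/3 distinct obs.)
  fit <- rpart(y ~ x, data = d[idx, ])
  oob <- setdiff(1:n, idx)                   # observations not used by this tree
  preds[oob, b] <- predict(fit, newdata = d[oob, , drop = FALSE])
}
oob_pred <- rowMeans(preds, na.rm = TRUE)    # average over trees where obs. i was OOB
mean((d$y - oob_pred)^2)                     # OOB estimate of the expected test MSE
```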

EM for MoGs

We start with an initial guess for the parameters w_j, μ_j, Σ_j. - Then we alternate: • Expectation step (E-step), in which we "complete" the data by estimating (the probabilities of) y_i. • Maximization step (M-step), in which we re-compute the parameters w_j, μ_j, and Σ_j. - In the hard EM, completing the data means that each data point is associated with exactly one Gaussian, taken to be the most likely assignment (in analogy with K-means clustering).

Underfitting

When a model is too simple (e.g. linear when the true relationship is non-linear), both training and test errors are large (i.e. large training/test MSE). This corresponds to high bias and low variance.

Not missing at random (NMAR)

When the distribution of an observation having a missing value for an attribute depends on the missing value. Appropriate for when the weight above 100 kg is never reported (and we have no prior idea about the distribution of weight for humans)

Missing at random (MAR)

When the distribution of an observation having a missing value for an attribute depends on the observed data but does not depend on the missing data. Appropriate for when the probability of disclosing weight is higher for males than for females but does not depend on the weight (and the sex is known for all observations)

Missing completely at random (MCAR)

When the distribution of an observation having a missing value for an attribute does not depend on either the observed data or the missing data. Appropriate for clerical mistakes (made by the person typing the data into the computer from hand-written notes)

p-value

The probability of results of the experiment being attributed to chance. In general, we choose a test statistic T (a function of the outcome) and decide whether large or small values of the statistic are significant (by default, large values are significant). The interpretation of T: it measures the strangeness of the outcome. Let t_0 be the observed value of the test statistic. The p-value is P(T ≥ t_0) (if large values of T are significant) or P(T ≤ t_0) (if small values of T are significant). Let us choose a significance level ε, which is a small positive number (customary values: ε = 5% and 1%). Interpretation: we consider fixed events of probability ε as rare (or unlikely). The probability that the p-value is at most ε does not exceed ε: P(p ≤ ε) ≤ ε ∀ ε (if P(p ≤ ε) = ε, we say the p-value function is exact). ∴ p ≤ ε is a rare event (under the null hypothesis). So when the p-value is ε or less, we have a disjunction: either the null hypothesis is wrong or a rare event has happened. Terminology: - If the p-value is ε or less, we reject the null hypothesis. Otherwise, we retain (but never accept) it. - If we reject the null hypothesis at the significance level 5%, our result is statistically significant. • If we reject the null hypothesis at the significance level 1%, our result is highly statistically significant

Bandwidth selection

The problem of choosing h is crucial in density estimation - h is a scaling factor and controls how wide the probability mass is spread around a point. - A large h (e.g. 0.2) will over-smooth the DE and mask the structure of the data. - A small h (e.g. 0.01) will yield a DE that is spiky and hard to interpret. - We would like to find a value of h that minimizes the difference between the estimated density and the true density. - As usual, this can be analysed in terms of the bias/variance trade-off. The bias-variance dilemma applied to bandwidth selection simply means that: • A large bandwidth will reduce the differences among the estimates of p_KDE(x) for different data sets (the variance), but it will increase the bias of p_KDE(x) with respect to the true density p(x). • A small bandwidth will reduce the bias of p_KDE(x) at the expense of a larger variance in the estimates p_KDE(x).

Curse of dimensionality

The problems associated with multivariate data analysis as the dimensionality increases. In practice, that means: - for a given sample size, there is a maximum number of features above which the performance of our classifier will degrade rather than improve - in most cases, the information that is lost by discarding some features is (more than) compensated by a more accurate prediction rule in the lower-dimensional space Implications: - exponential growth with dimensionality in the number of observations required to accurately estimate a function How to beat it? - By incorporating prior knowledge - By using increasingly smooth prediction rules - By reducing the dimensionality

Clustering

The process of organising objects into groups whose members are similar in some way. - high intra-cluster similarity - low inter-cluster similarity Clusters are usually not "right" or "wrong" - different clusters can reveal different things about the data. Some clustering criteria/algorithms have probabilistic interpretations. Applications: - Biology & Bioinformatics: group genes into gene families. - Medical imaging: differentiate between different types of tissue and blood in a 3D image. - Marketing: perform market segmentation by identifying subgroups of people who might be more likely to purchase a particular product. - Social Network Analysis: recognise communities within large groups of people. - Web user profiling Types of clustering methods: - Parametric clustering • Equivalent to density estimation with a mixture of (Gaussian) components. • We can use Expectation-Maximisation: the identity of the component that originated each data point is treated as a missing feature. - Non-parametric clustering • No density functions are assumed or used. • Instead, we are concerned with finding natural groupings (clusters) in a dataset.

Sampling

The process of selecting representative units from a total population (like a bag/multiset of size n)

Probability of a configuration

The product of the probabilities of the values of the initial nodes and the conditional probabilities of the values of all other nodes given the values of their parents. It's important because it defines the probability space.

(Training) Error Rate

The proportion of mistakes made when fˆ is applied to (training) observations: (1/n) * ∑i=1:n I(y_i ≠ ŷ_i), where ŷ_i = fˆ(x_i) is the predicted label for the ith observation using fˆ, and I(y_i ≠ ŷ_i) is an indicator variable that equals 1 if y_i ≠ ŷ_i and 0 if y_i = ŷ_i

Probability in Bayesian statistics

The strength of our confidence or belief that the event will happen. It's subjective when it's a conditional probability taking as given the information available to us at the time

Residual Sum of Squares

The sum of each squared residual over all the observations in the sample: RSS = ∑i=1:n(e_i)² = ∑i=1:n(y_i − ŷ_i)². It reflects the amount of variation in the dependent variable not explained by the regression equation. In the regression-tree setting, RSS = ∑j ∑i∈Rj (y_i − ŷ_Rj)², where ŷ_Rj is the mean label for the training observations within the jth box.

Law of total probability

The sum of the probabilities of all individual outcomes must equal 1. if B_1, ..., B_m is a partition of the sample space (i.e., a system of pairwise disjoint events one of which is bound to happen), ℙ(A) = Σi=1:m(ℙ(A | B_i)ℙ(B_i)) = Σi=1:m(ℙ(A ∩ B_i))

Nonlinear PCA using kernels

The traditional PCA applies linear transformations (which may be ineffective for nonlinear data). In those cases, apply a nonlinear transformation to map objects to a potentially very high-dimensional space, φ : x → φ(x). I.e., apply the kernel trick, which requires PCA to be rewritten in terms of dot product, K(x_i, x_j) = φ(x_i) ⋅ φ(x_j)

Density estimation (DE)

The true distribution of a random variable is unknown; we model its pdf p = p(x) given a finite set (x_1, x_2, ..., x_n) of observations sampled from p(x). Approaches: - Parameter estimation: Assume a particular form for the density (e.g. Gaussian), so only the parameters (e.g., mean and variance) need to be estimated. In this course: Maximum Likelihood - Non-parametric density estimation: Assume no knowledge about the density. In this course: histogram, Parzen windows and kernel density estimation

Reinforcement Learning

Learning associations between stimuli and reward receipt

Equal-frequency binning

• An equal number of values are placed in each of the N bins. • Disadvantage: different occurrences of the same continuous value could be assigned to different bins.

Objectives of time series analysis

• Description—summary statistics, graphs. • Analysis and Interpretation—Find a model to describe the time dependence in the data. Can we interpret the model? (Seasonal adjustment.) • Forecasting or Prediction—given a sample from the series, forecast the next value, or the next few values (predict sales). • Control—adjust various control parameters to make the series fit closer to a target. (Impact of monetary policy on unemployment.) • Hypothesis testing. Example: Global warming. • Simulation. Example: Estimate probability of catastrophic events.

Equal-interval binning

• Divide the range (the values of a given attribute) into N intervals of equal size: uniform grid. • If A and B are the lowest and highest values of the attribute, the width of intervals will be W = (B − A) / N. • The most straightforward. • But outliers may distort the picture. • Skewed data is not handled well.
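A minimal base-R sketch contrasting the two binning schemes (the skewed sample and N = 4 are illustrative):

```r
set.seed(1)
v <- rexp(100)            # a skewed attribute
N <- 4

# Equal-interval: N bins of equal width W = (B - A) / N
eq_interval <- cut(v, breaks = N)

# Equal-frequency: bin boundaries at the quantiles, so each bin gets ~100/N values
eq_freq <- cut(v, breaks = quantile(v, probs = seq(0, 1, length.out = N + 1)),
               include.lowest = TRUE)

table(eq_interval)   # outliers make some interval bins almost empty
table(eq_freq)       # roughly 25 values per bin
```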

Data transformation

• Logarithm: reduces skewness and is often appropriate for positive variables • Smoothing: removes noise from data • Aggregation: summarization of data • Normalization: scaling to fall within a small, specified range - min-max normalization: v' = (v − min) / (max − min) * (new_max − new_min) + new_min - z-score normalization (aka standardization): for v with mean μ and standard deviation σ, v' = (v − μ) / σ - normalization by decimal scaling v' = v / 10^j where j is the smallest integer such that max(|v'|) < 1 • Attribute/feature construction - new attributes constructed from the given ones It should be done after removing outliers.
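A minimal base-R sketch of the three normalisation formulas (the vector v is illustrative; note that sd() uses the sample standard deviation):

```r
v <- c(200, 300, 400, 600, 1000)

# min-max normalisation to [new_min, new_max] = [0, 1]
minmax <- (v - min(v)) / (max(v) - min(v)) * (1 - 0) + 0

# z-score normalisation (standardisation)
zscore <- (v - mean(v)) / sd(v)

# decimal scaling: divide by 10^j with the smallest j such that max(|v'|) < 1
j <- 0
while (max(abs(v)) / 10^j >= 1) j <- j + 1
decimal <- v / 10^j

rbind(minmax, zscore, decimal)
```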

Entropy-based discretization

• The main idea is to split the attribute's values in a way that makes the bins as "pure" as possible. • We need a measure of "impurity of a bin" such that - a bin with the uniform class distribution has the highest impurity - a bin with all items belonging to the same class has zero impurity - the more skewed the class distribution in the bin, the smaller the impurity • As we know, the entropy is such a measure of impurity: H(S) = −∑k p_k log2(p_k), where p_k is the proportion of objects in S belonging to class k. • If a set of objects S is partitioned into two intervals S1 and S2 using boundary T, the information value I after partitioning is I_T = |S1|/|S| * H(S1) + |S2|/|S| * H(S2) where |S| is the number of objects in bin S. • The boundary that minimizes the information value I over all possible boundaries is selected as the binary discretization. • The process is recursively applied to the partitions obtained until some stopping criterion is met.
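A minimal base-R sketch of choosing the boundary T that minimises I_T (function names and the toy data are illustrative):

```r
# H(S) = -sum_k p_k log2(p_k)
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(ifelse(p > 0, p * log2(p), 0))
}

# I_T = |S1|/|S| * H(S1) + |S2|/|S| * H(S2), splitting at x <= T vs x > T
info_value <- function(x, y, T) {
  left  <- y[x <= T]
  right <- y[x >  T]
  (length(left)  / length(y)) * entropy(left) +
  (length(right) / length(y)) * entropy(right)
}

# Candidate boundaries: midpoints between consecutive sorted values
x <- c(1, 2, 3, 10, 11, 12); y <- c("a", "a", "a", "b", "b", "b")
cands <- head(sort(unique(x)), -1) + diff(sort(unique(x))) / 2
cands[which.min(sapply(cands, info_value, x = x, y = y))]   # boundary 6.5, I_T = 0
```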

Document vector

→V(d) (V with an arrow on top) Representation of a document d with 1 component corresponding to each term in the dictionary, together with a weight for each component that is given by tf-idf_(t,d).

Scree plot

a graph plotting each factor (X-axis) and its associated eigenvalue (Y-axis). It depicts the proportion of variance explained (PVE) by each of the PCs

Error decomposition

cf. image We assume that: x is a fixed object and the training set is random Expected Squared Error: 𝔼[(Y - fˆ(x))²] = 𝔼[(f(x) + ε - fˆ(x))²] = Bias(fˆ(x))² + Var(fˆ(x)) + Var(ε). The ESE is the average SE that we would obtain if we repeatedly estimated f using a large number of training sets, and tested each estimate fˆ at x. The decomposition tells us that in order to minimize the expected test error, we need to select a prediction algorithm that simultaneously achieves low variance and low bias

Uniform distribution

if Y is uniformly distributed on the segment [0, 1] then p_Y(x) = { 0 if x < 0 or x > 1 1 if 0 ≤ x ≤ 1 }

Child node

if there is an arrow going from a vertex A to another vertex B, we say that B is a child of A

Hard assignment problems

- Clusters may overlap - Some clusters may be wider than others

Induction

- The learning machine and its parameters represent the prediction rule that we found from the training data. - The prediction rule is typically much simpler than the training set. - The prediction rule produces predictions when applied to the test data

Non-parametric Clustering

1. Define a measure of (dis)similarity between observations 2. Define an objective function for clustering. 3. Define an algorithm to minimise the objective function

Missing values analysis

1. Identify patterns of, and reasons for, missing values. • such as skip pattern and/or sampling strategy in a survey 2. Understand the distribution of missing values 3. Decide on the best method of analysis • Deletion methods • Imputation methods

Exploratory Data Analysis (EDA)

A preliminary exploration of data to better understand its characteristics. Why do that? - Always look at your data! - If you can't see it, then don't believe it! • visualize distributions and relationships • detect errors • assess assumptions

On-line learning

An algorithm that learns while predicting on individual observations

Double-blind experiment

An experiment in which neither the experimenter nor the participants know which participants received which treatment

Configuration

Assignment of values to all variables of our Bayes net.

Sampling with replacement

At first the sample is empty, do the following m times: copy a random element from the bag and put the copy into the sample

Feature extraction

Creating new features by combining the existing features

Nominal data

Data which consists of names, labels, or categories (no distance and no order).

Bias

Error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model

Inference

Estimating f in order to understand the way Y is affected by changes in X1, ..., Xp. Here, fˆ can't be treated as a black box: we want to know which attributes are associated with the label, what the relationship between the label and each attribute is, and whether the relationship can be summarised by a linear equation

P(p ≤ ε) = ε

Exact p-value

Experimental bias

Favouring certain outcomes over others

Fuse (=merge)

Horizontal line connecting two clusters in a dendrogram

Accuracy

How close a measured value is to the actual or true value

Gini index

G = ∑k p̂_mk(1 − p̂_mk). It takes on a small value if all of the p̂_mk are close to zero or one. In this sense it is a node impurity measure, like the classification error rate, but it is smoother

Truecasing

Restoring the correct capitalisation of tokens; it is needed because string normalisation through lower-casing discards case information.

Chebyshev distance

L_∞ norm

Law of large numbers

Let X_1, X_2, ... be i.i.d. (independent and identically distributed) random variables. - you know what "independent" means - identically distributed means that they have the same distribution function. Suppose 𝔼(X_1) exists (and is finite). Then, for large n, (1/n) * ∑i=1:n(X_i) ≈ 𝔼(X_1) with high probability.

Y

Output variable / *label* / response / dependent variable

Initial node

Parent-less node

Regression problem

Problem with a quantitative label

Non-linear dependence on attributes

Suppose the learning machine we would like to use for predicting weight Y given height X is Y = β_0 + β_1 * X + β_2 * X². We perform multiple linear regression (with 2 attributes), extending each observation by adding height² as another attribute.
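A minimal base-R sketch of this trick (simulated heights and weights; the quadratic ground truth is assumed purely for illustration):

```r
set.seed(1)
height <- runif(50, 1.5, 2.0)
weight <- 20 * height^2 + rnorm(50, sd = 3)      # assumed quadratic ground truth

fit <- lm(weight ~ height + I(height^2))          # X and X^2 as two attributes
coef(fit)                                         # estimates of beta_0, beta_1, beta_2
```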

Qualitative variable

Takes on values in one of K different classes/categories (e.g. gender, product brand) so *discrete values*. It's also known as a categorical variable

Deductive reasoning

The process of applying a general statement to specific facts or situations. It's fact-based (always true only if nothing can falsify its arguments). Example: A cricket ball is round. Our Earth is round. Thus, our Earth is a cricket ball

Dimension Reduction (DR)

The task is to transform data into one with fewer attributes/features. Reasons: - Computational: compressed data → time/space efficiency - Statistical: fewer dimensions → better generalisation and less noise Approaches: - Feature extraction - Feature selection

Quartiles

The values that divide the data into 4 equal parts, denoted as: • 1Q (first quartile, qnorm(0.25)): • Median (or second quartile, qnorm(0.5)) • 3Q (third quartile, qnorm(0.75))

Codification

Transforming nominal data into numbers (using dummy variables). E.g. X_2 := { 1 if sex is male 0 if sex is female } Or with non-binary labels: X_2 := ethnicity == 'Asian' ? 1 : 0; X_3 := ethnicity == 'European' ? 1 : 0; Note: having too many dummy variables would lead to dependent attributes!

Regularisation

Using α > 0 is an example of what is known as regularisation. Instead of minimizing the training error, as recommended by the ERM principle, we minimize the penalized training error (the training error plus a term penalizing for the complexity, or size, or irregularity, of the prediction rule). It increases bias and decreases variance. For ANNs, it uses cJ(λ) as the penalty term, where J(λ) := ∑m(β_m²) + ∑m,l(α_ml²) and c ≥ 0 is a tuning parameter (remember that λ is the complete set of weights β_m and α_m,l).

1Q

Value to the left of which lies 25% (or as close to 25% as possible in the discrete case) of the distribution

Irreducible error

Var(ε): Variance of the error term ε (=𝔼(ε²)). Lowest achievable test MSE among all possible methods (which is always there).

Standard model

Y = f(X) + ε Where: - Y is a label - X = (X1, X2, ..., Xp) is a set of p different attributes - f is a fixed unknown function of X - ε is a random error term which is independent of X and has a mean of 0 (in regression) or is random noise (in classification). Given X_i, the labels are produced by Y_i = f(X_i) + ε_i, where ε_1, ε_2, ... are independent random variables with mean 0 and the X_i are random elements (random variables taking values in ℝ^p).

ith residual

e_i := y_i − ŷ_i

Parent node

if there is an arrow going from a vertex A to another vertex B, we say that A is a parent of B

pdf

probability density function

Propagation for λ

λ(B = b) := P(D^−_B | B = b) λ_C(B = b) := P(D^−_C | B = b)

Propagation for π

π(B = b) := P(B = b | D^+_B) π_B(A = a) = P(A = a | D^+_B)

Standard deviation

σ_X = √(Var(X)) Properties: σ_cX = |c| * σ_X always

Causal diagram

DAG (Directed Acyclic Graph) that connects events. A missing edge between two nodes encodes a conditional independence assumption (given the parents); it does not imply full independence (in the earthquake example, E and W are not directly connected, yet they are not independent).

Time series

A time-ordered sequence of observations taken at regular intervals. It reveals temporal behaviour of the underlying mechanism that produced the data. E.g. stock exchange.

Inductive reasoning

A type of logic in which generalisations are based on a large number of specific observations (e.g. past observations). It's stat-based and is preferred in ML.

Observations

- Rows in a data frame. - Instances/records/entity of the data (i.e. data points). Noted by (x_i, y_i) (in supervised learning) where x_i = {x_i1, x_i2, ..., x_ip}

Tree

A connected acyclic graph

Finding the "nearest" pair of clusters

For two clusters ω_j and ω_k of sizes n_j and n_k: - Minimum distance (single linkage) - Maximum distance (complete linkage) - Average distance (average linkage) - Mean distance (centroid linkage) Complete and average linkages are the most popular types.

Precision

Of all the items that were retrieved (predicted positive), how many were correct (relevant)?

Odds

Ratio at the LHS of: p(X) / (1 - p(X)) = e^(β_0 + β_1 × X_1 + ... + β_p × X_p)

Data acquisition

Sampling of the real world to generate data that can be manipulated by a computer. When sampling a signal, the sampling frequency should be at least twice the highest frequency present in the signal (cf. the Nyquist sampling theorem).

cosine similarity

Geometrically, sim(d_1, d_2) is the cosine of the angle between →V(d_1) and →V(d_2). It's convenient to normalize document vectors: →v(d) := →V(d) / ||→V(d)|| Then sim(d_1, d_2) := →v(d_1) ⋅ →v(d_2) Given a query q, we can score each document d using this formula above: sim(q, d) := →v(q) ⋅ →v(d). This would help to produce a ranked list of the closest documents to the query q Examples on slide 16-19 (lecture 09_1.pdf)
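A minimal base-R sketch (the two vectors stand in for tf-idf document vectors and are illustrative):

```r
normalise <- function(V) V / sqrt(sum(V^2))       # v(d) := V(d) / ||V(d)||

V1 <- c(0.0, 1.2, 3.4, 0.0, 0.5)
V2 <- c(0.7, 0.0, 2.1, 0.0, 0.9)

sim <- sum(normalise(V1) * normalise(V2))         # cosine of the angle between them
sim
```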

F1 Score

Harmonic mean of precision and recall

Principal Component Analysis (PCA)

Informal goals: - reduce the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables - retain most of the information in the data set How this is done: The new variables, called principal components (PCs), are uncorrelated (= orthogonal) and are ordered by their informativeness. Aims: finding a set of M PCs (composite variables) such that: - it is much smaller than the original set of p variables - it accounts for nearly all of the total sample variance. If these two aims can be accomplished, then the M principal components contain almost as much information as the original p variables. The original data set is thereby reduced, from n measurements on p variables to n measurements on M variables. It often reveals relationships between variables that were not previously suspected. Because of such relationships, new interpretations of the data and variables often stem from PCA. It is usually an intermediate step in larger investigations or in other techniques, such as classification/regression or cluster analysis. In a nutshell: x is a vector of p random variables, x ∈ ℝ^p; α_m is a unit vector of p constants, α_m ∈ ℝ^p with α^T_m * α_m = 1 Procedural description: - find a linear function of x, α^T_1 * x, with maximum variance - next find another linear function of x, α^T_2 * x, uncorrelated with α^T_1 * x and with maximum variance - iterate. We hope that most of the variation in x will be accounted for by M PCs, where M << p. Typically we do not know the true covariance matrix Σ for the random object x. We replace it with the sample covariance matrix (computed from the data), which we still denote by Σ. Solution: - For m = 1, 2, ..., p, the mth PC is given by z_m = α^T_m * x, where α_m is an eigenvector of Σ corresponding to its mth largest eigenvalue λ_m. - If α_m is chosen to have unit length (i.e., α^T_m * α_m = 1), which we always do, then Var(z_m) = λ_m. We will assume that the data set {x_1, ..., x_n} is centred, i.e., satisfies x̄ := 1/n * ∑i=1:n(x_i) = 0 (the mean, or centroid, x̄ ∈ ℝ^p is computed component-wise). If this is not true, make it true by redefining x_i := x_i − x̄ (centring). If the data is centred and z_i := α^T * x_i, then z̄ = 1/n * ∑i=1:n[α^T * x_i] = α^T * (1/n * ∑i=1:n(x_i)) = 0 (the z_i are also centred). For derivations: check slides 33-40 of 08_1.pdf The variance of the scores of each PC is equal to the corresponding eigenvalue for that PC. The eigenvalue λ_m represents the variance displayed ("explained" or "extracted") by the mth PC. The sum of the first M eigenvalues is the variance explained by the first M PCs. Algorithm: Let the data be (x_1, x_2, ..., x_n); each x_i is a p-dimensional vector. We wish to use PCA to reduce dimension to M. 1. Find the sample mean x̄ := [∑i=1:n(x_i)] / n 2. Subtract the sample mean from the data: x_i := x_i − x̄, i = 1, ..., n 3. Strongly recommended: further standardise the data by x_ij := x_ij / √(1/n * ∑k=1:n[x²_kj]) (making the standard deviation of each attribute equal to 1). 4. Compute the sample covariance matrix Σ := 1/n * ∑i=1:n[x_i*x_i^T] (The diagonal will be (1, ..., 1) if the previous step was carried out.) 5. Compute the eigenvectors α_1, α_2, ..., α_M corresponding to the M largest eigenvalues of Σ (e.g., using the function eigen in R). 6. Let A be the p × M matrix whose columns are α_1, ..., α_M (in this order). 7. For x ∈ ℝ^p, the desired z ∈ ℝ^M (the closest approximation to x) is z = A^T x.
Remark: instead of the sample covariance matrix Σ, we can use the scatter matrix S := ∑i=1:n[x_i*x_i^T] = X^T*X It has the same eigenvectors as Σ (but its eigenvalues are n times larger). Cf. slides 46-53 for examples Notes: - Since PCA uses the eigenvectors of the covariance matrix Σ, it can find the independent axes of multivariate Gaussian data. • for non-Gaussian or mixed Gaussian data, PCA simply de-correlates the axes - The main limitation of PCA is that it does not consider class separability (since it ignores the label, even if present). • PCA performs a coordinate rotation that aligns the transformed axes with the directions of maximum variance • there is no guarantee that the directions of maximum variance will contain useful features for discrimination - It's also known as the Karhunen-Loève transform (communication theory) How many PCs? - For p original dimensions, the correlation matrix Σ is p × p and has up to p eigenvectors. So p PCs. - We can ignore less significant components. You lose some information, but if the eigenvalues are small, you don't lose much. • p dimensions in original data • calculate p eigenvectors and eigenvalues • choose only the first M eigenvectors, based on their eigenvalues • the final data set has only M dimensions Determining the number of components to retain: - The eigenvalue-one criterion (retain the PCs with λ_m > 1; assumes the original attributes are standardised). - The scree test (informal): look at a graphical display of the variance of each component - The proportion of Variance Explained (PVE): choose M such that [∑m=1:M(λ_m)] / [∑m=1:p(λ_m)] > θ for a given threshold θ (typically 90%). - Interpretability criteria (require domain knowledge).
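A minimal base-R sketch of the algorithm above, using the built-in iris measurements purely for illustration:

```r
X <- as.matrix(iris[, 1:4])
n <- nrow(X); M <- 2

X <- scale(X, center = TRUE, scale = FALSE)              # step 2: centre
X <- scale(X, center = FALSE,
           scale = sqrt(colMeans(X^2)))                  # step 3: unit std. dev. (1/n version)

Sigma <- t(X) %*% X / n                                  # step 4: sample covariance matrix
e     <- eigen(Sigma)                                    # step 5: eigenvectors/eigenvalues
A     <- e$vectors[, 1:M]                                # step 6: p x M matrix

Z <- X %*% A                                             # step 7: scores z = A^T x for each row
e$values[1:M] / sum(e$values)                            # PVE of each of the first M PCs
```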

Transduction

Inference from past to future

X

Input variable(s) / *attribute(s)* / predictor(s) / feature(s) / independent variable(s)

Sibling nodes

Nodes that share a parent

Batch learning

An algorithm is trained on a training set; then tested and deployed

Correlation analysis

Analysis of the degree to which changes in one variable are associated with changes in another

ML goal

Applying a learning method to the training data to estimate the unknown function f. I.e., finding a function fˆ such that Y ≈ fˆ(X) ∀ (X, Y)

P(p ≤ ε) < ε

Conservative p-value

Mixture model

Consider the now familiar problem of modeling a pdf given a dataset X = {x_1, x_2, ..., x_n}. - If the form of the underlying pdf is known (e.g., Gaussian), the problem could be solved using Maximum Likelihood (ML). - If the form of the pdf was unknown, the problem had to be solved with non-parametric DE methods such as Parzen windows. - We will now consider an alternative DE method: modelling the pdf with a mixture of parametric densities. - In particular, we will focus on mixture models of K Gaussian densities p(x|θ) = ∑j=1:K(p(x|θj)P(θj)) • Can interpret the mixing coefficients P(θj) as prior probabilities

Deletion methods

Delete the observations with missing values. - The simplest approach. Allows the use of unmodified data analysis methods. - Only practical if there are few observations with missing values. Otherwise, it can introduce bias.

Transposition

Denoted v^T, it's the process of turning a column vector into a row vector or vice versa.

Interval data

Differences and order between values can be found, but there is no absolute 0 (temperature, time, pH level).

Degrees of freedom

Dimension of the parameter space Λ of our learning machine. But the number of effective degrees of freedom is smaller. As c increases from 0 to ∞, the effective degrees of freedom decrease from n (no regularization) to 2 (straight line)

Mean Squared Error

The average of the squared differences between the forecasted and observed values: MSE = (1/n) * ∑i=1:n(y_i − fˆ(x_i))², where fˆ(x_i) is the prediction that fˆ gives for the ith observation (in the regression setting). It's used to measure the quality of fit (typically on the training data, making it the training MSE)

Log-odds / logit

log[p(X) / (1 - p(X))] = β_0 + β_1 × X_1 + ... + β_p × X_p

Leaf node

Childless node

Ratio data

Continuous value with a natural (true) zero (Kelvins)

Test data

(Previously) Unseen data

Signal representation approaches

- Latent factors (Factor Analysis): Uncover latent factors underlying a set of variables and describe the observed data in terms of these factors rather than in terms of original variables. - Principal Component Analysis (PCA): Better representation of data without losing much information; building more effective data analyses (classification, clustering) on the lower-dimensional space - Multidimensional Scaling (MDS): Represent data in a lower-dimensional space so that distances are preserved (as well as possible)

Regression tree building

1. Use RBS (Recursive Binary Splitting) to grow a large tree 2. Apply Cost Complexity Pruning to obtain a sequence of the best sub-trees as function of α. 3. Use K-fold CV (Cross Validation) to choose α that minimise the average error 4. Return the sub-tree from step 2 that corresponds to the chosen value of α

Stop word

A common word (such as "and," "the," "it," or "by") deemed uninformative and therefore dropped

Decision tree pruning

Cut a very large tree T0 to obtain a sub-tree.

Decision stump

Decision tree with 1 split (d = 1)

Term Frequency (TF)

For a term t in a document d, it's tf_(t,d): the number of occurrences of t in d

Data preprocessing

• Data cleaning - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies (using 0, means/medians or stratified means/medians) • Data transformation and discretization - Normalization and aggregation, discretization in particular for numerical data • Data integration - Combine data from multiple sources into a coherent file • Data reduction - Reduce the dimensionality of data, but still, produce same or similar result of the analysis

Text document representation

• Boolean vector: - A document is a vector where each element is a bit representing the presence/absence of a word (Bag of Words, where the exact ordering of terms is ignored) - A set of documents can be represented as a matrix (sparse): document d (row) and word w (column) are assigned value 1 or 0 • Vector space representation: Each document is represented as a non-Boolean vector, with one component corresponding to each word in the dictionary

Data snooping

Using the same data for training and evaluation.

Non-parametric density estimation

- Attempt to estimate the density directly from the data without assuming a particular form for the underlying distribution. - The simplest form of non-parametric DE is the histogram: • Divide the sample space into a number of bins and approximate the density in each bin using the fraction of points in the training data that fall into that bin p_H(x) = 1/n * (number of x_i in the same bin as x) / binWidth • The histogram requires two parameters to be specified (in 1D): bin width and the starting position of the first bin. - The probability that a vector x, drawn from a distribution p = p(x), will fall in a given region R of the sample space is P = ∫_R p(x') dx'. - Suppose now that n vectors are drawn from the distribution; the probability that k of these n vectors fall in R is given by the binomial distribution P(k) = (n choose k) P^k (1 − P)^(n−k) - We can use MLh estimation to estimate the value of P: P ≈ k/n - As n → ∞, the distribution becomes sharper (the variance gets smaller), so we can expect that a good estimate of the probability P can be obtained from the fraction of the points that fall within R: k/n. - On the other hand, if we assume that R is so small that p(x) does not vary appreciably within it, then ∫_R p(x') dx' ≈ p(x)V, where V is the volume enclosed by the region R. - Merging with the previous result: P ≈ p(x)V and P ≈ k/n, hence p(x) ≈ k / (nV). - This estimate becomes more accurate as we increase the number n of data points and shrink the volume V. In practice, the total number n of observations is fixed. - To improve the accuracy of the estimate p(x) we could let V approach zero but then R would become so small that it would enclose no observations (unless x is an observation). - This means that, in practice, we will have to find a compromise for V: • Large enough to include enough observations within R (≈ controlling variance). • Small enough to support the assumption that p(x) is constant within R (≈ controlling bias). - In conclusion, the general expression for non-parametric density estimation becomes p(x) ≈ k / (nV) where { V: volume surrounding x n: total number of observations k: number of observations inside V } - Applying this result to practical density estimation problems, we are led to kernel density estimation

Attributes

- Columns in a data frame. - Variables/features/dimensions of the data - Properties or characteristics of an object

Parzen windows

- Let the region R be the hypercube with sides of length h (the bandwidth parameter) centred at x. Its volume is V = h^d, where d is the dimension (number of attributes). - To find the number of observations that fall within this region we define a kernel function K, K(u) = { 1 if |u_j| ≤ 1/2 ∀ j = 1, ..., d 0 otherwise } • This kernel corresponds to a unit hypercube centred at the origin and is known as the Parzen window. • The quantity K((x − x_i) / h) is then equal to 1 if x_i is inside a hypercube of side h centred on x, and 0 otherwise. - The total number of points inside the hypercube is then k = ∑i=1:n[K((x − x_i) / h)] - Substituting back into the expression for the density estimate, p_KDE(x) = 1 / (n h^d) * ∑i=1:n[K((x − x_i) / h)] - The Parzen window estimate resembles the histogram, except that the bin locations are determined by the data. The Parzen window has several drawbacks, such as: - It yields density estimates that have discontinuities - It weights equally all points x_i, regardless of their distance to the estimation point x - For these reasons, the Parzen window is commonly replaced with a smooth (or piecewise-smooth) kernel function K
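A minimal base-R sketch of a 1D Parzen-window estimate, compared with R's Gaussian-kernel density() (the bandwidth h = 0.5 and the simulated data are illustrative):

```r
parzen <- function(x, data, h) {
  # K(u) = 1 if |u| <= 1/2, else 0 (unit "hypercube" in 1D)
  k <- sum(abs((x - data) / h) <= 0.5)
  k / (length(data) * h)                       # p(x) ~ k / (n * V), with V = h in 1D
}

set.seed(1)
data <- rnorm(500)
xs   <- seq(-3, 3, by = 0.1)
est  <- sapply(xs, parzen, data = data, h = 0.5)

plot(xs, est, type = "s", ylab = "density")    # Parzen-window estimate (discontinuous)
lines(density(data, bw = 0.5), col = "blue")   # smooth-kernel (Gaussian) estimate
lines(xs, dnorm(xs), col = "red")              # the true density
```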

MoG as a latent variable model for clustering

- The discrete indicator variable y_i = j means that data point x_i is assigned to cluster j. - The prior probability of being assigned to cluster j is w_j : P (y_i = j|w) = w_j. - Given that data point x_i is assigned to cluster j, its density is Gaussian with mean μ_j and covariance Σ_j: p(x_i|y_i = j, μ, Σ) = N(x_i | μ_j, Σ_j).

Bayes prediction rule

- The expected test error is ℙ(Y ≠ fˆ(X)). Our goal is to minimize it. - Now suppose we know the true data-generating distribution. - It is possible to show (easily follows from the law of total probability in the case of finite X) that the expected test error rate is minimized by the Bayes prediction rule (or Bayes classifier), which assigns each object to the most likely class given the attributes. fˆ(x) = j where j is a class for which ℙ(Y = j | X = x) is largest. E.g. in a 2-class problem: - predicts class 1 for object x_0 if ℙ(Y = 1 | X = x_0) > 0.5 - predicts class 2 if ℙ(Y = 2 | X = x_0 ) > 0.5

Maximum Likelihood (MLh)

- The parameters θ are assumed to be fixed (i.e., not random variables) but unknown. - Its solution seeks the "best" explanation θ for the dataset X: θ̂ = arg max_θ p(X|θ). - Assume we seek to estimate a density p(x) that is known to depend on a number of parameters θ = (θ_1, ..., θ_m)^T. • For a Gaussian pdf, θ_1 = μ, θ_2 = σ² and p(x|θ) = N(μ, σ²). • To make the dependence on x explicit, we write p(x|θ) - Assume we have a dataset X = (x_1, x_2, ..., x_n) drawn independently from a distribution p(x|θ) (an i.i.d. sequence). • Then we can write p(X|θ) = ∏i=1:n[p(x_i | θ)] • The ML estimate of θ is the value that maximises the likelihood p(X|θ): θ̂ := arg max_θ p(X|θ). • This corresponds to the intuitive idea of choosing the value of θ that is most likely to give rise to the data. For convenience, we will work with the log-likelihood. - Since log is a monotonic function: θ̂ = arg max_θ p(X|θ) ≡ arg max_θ log p(X|θ). - Hence, the ML estimate (MLE) of θ can be written as: θ̂ = arg max_θ log ∏i=1:n[p(x_i|θ)] ≡ arg max_θ ∑i=1:n[log p(x_i|θ)] • This simplifies the problem: a sum is easier than a product. • Logs are especially convenient in the Gaussian case, as we will see Cf. slides 11-14 of 07_1.pdf for examples
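A minimal base-R sketch for the Gaussian case: the closed-form MLEs and, as a cross-check, a numerical maximisation of the log-likelihood (simulated data; the parametrisation passed to optim is an assumption of this sketch):

```r
set.seed(1)
x <- rnorm(1000, mean = 5, sd = 2)
n <- length(x)

mu_hat     <- mean(x)                  # MLE of the mean
sigma2_hat <- sum((x - mu_hat)^2) / n  # MLE of the variance (note 1/n, not 1/(n-1))

# Same estimates by maximising the log-likelihood numerically (theta = (mu, sigma^2)):
negloglik <- function(theta)
  -sum(dnorm(x, mean = theta[1], sd = sqrt(theta[2]), log = TRUE))
optim(c(0, 1), negloglik, method = "L-BFGS-B",
      lower = c(-Inf, 1e-6))$par       # ~ (5, 4), agreeing with the closed form
```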

Agglomerative clustering

1. Begin with n observations and a measure (such as Euclidean distance) of all the n(n − 1)/2 pairwise dissimilarities. Treat each observation as its own cluster. 2. For i = n, n − 1, ..., 2: (a) Examine all pairwise inter-cluster dissimilarities among the i clusters and identify the pair of clusters that are the most similar. Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed. (b) Compute the new pairwise inter-cluster dissimilarities among the i − 1 remaining clusters.
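A minimal base-R sketch of the procedure (simulated data; complete linkage chosen for illustration):

```r
set.seed(1)
x <- matrix(rnorm(30 * 2), ncol = 2)

d  <- dist(x)                          # the n(n-1)/2 pairwise (Euclidean) dissimilarities
hc <- hclust(d, method = "complete")   # "average", "single", "centroid" also available

plot(hc)                               # dendrogram: fusion heights = inter-cluster dissimilarities
cutree(hc, k = 3)                      # cut the tree to obtain a flat partition into 3 clusters
```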

Comparison of 2 prediction algorithms

1. Get the contingency table with the results of Alg 0 and Alg 1 2. Compute the error rates of both algorithms 3. Form a null hypothesis (e.g. "Is it possible that the probability of error is in fact the same for the 2 prediction rules?") Let the probabilities for a test observation to get into each of the cells (A-D) be (rows: Alg 1 correct/wrong; columns: Alg 0 correct/wrong): cell A = both correct (p_A), cell B = Alg 1 correct and Alg 0 wrong (p_B), cell C = Alg 1 wrong and Alg 0 correct (p_C), cell D = both wrong (p_D), where p_A + p_B + p_C + p_D = 1 The null hypothesis is p_C + p_D = p_B + p_D, which is equivalent to p_C = p_B ∴ the null hypothesis can be restated as: "the conditional probability that a test observation belongs to cell B given that it belongs to B or C is 1/2" 4. Test the null hypothesis In the example above, we'll focus on cells B (9 obs.) and C (17 obs.), where the 2 prediction rules produce different results. The question can be restated as: "Can we get only 9 heads in 26 tosses of a fair coin?" (possible but unlikely) 5. Compute the p-value p-value for the observation of 9/26: pbinom(9, 26, 0.5) = .08431877 (probability that we will observe 9 or even fewer heads in 26 tosses) Note: pbinom is the distribution function of the binomial distribution such that pbinom(k, n, p) is the probability P(Y ≤ k) where Y is the number of heads in n tosses of a coin (perhaps biased) with probability p of a head. (pbinom(9, 26, 0.5, lower.tail=TRUE) does the same, whereas lower.tail=FALSE gives P(Y > k), i.e. upper-tail probabilities. The lower tail refers to the LHS region of the curve.) 6. If the p-value is less than the significance level (e.g. 5%) then reject the null hypothesis; otherwise, retain it. So far we have assumed that one of the two prediction algorithms (Alg. 0) is the base one, and we are only looking for a deviation from the null hypothesis in one direction. Our alternative hypothesis is one-sided: namely, it is that Alg 1 produces a better prediction rule than Alg 0. If there is no such asymmetry and we are just interested in which prediction rule is better, we can simply multiply the p-value that we get as described above by 2. In this case, our alternative hypothesis is two-sided: the first algorithm produces a better prediction rule than the second or vice versa
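A minimal base-R sketch of steps 4-6 for the 9-vs-17 example quoted above:

```r
# 9 "heads" among the 26 observations on which the two prediction rules disagree
pbinom(9, 26, 0.5)                                  # one-sided p-value: P(Y <= 9) ~ 0.084
2 * pbinom(9, 26, 0.5)                              # two-sided version (no preferred algorithm)

binom.test(9, 26, p = 0.5, alternative = "less")    # the same one-sided test, packaged
```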

K-means algorithm

1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations. 2. Iterate until the cluster assignments stop changing: (a) For each of the K clusters, compute the cluster centroid. The jth cluster centroid is the vector of the d attribute means for the observations in the jth cluster. (b) Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance). Since the K-means algorithm finds a local rather than a global optimum, the results obtained will depend on the initial (random) cluster assignment of each observation in Step 1. How to choose K? - Try different values and apply some criterion for each clustering. - If there is a large gap in the criterion values, it suggests a "natural" number of clusters Pro: - Fast way to partition the data into K clusters - No underlying model Con/Pro: - Different clusterings can result from different initialisations
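A minimal base-R sketch of K-means and the "large gap" heuristic for choosing K (the two simulated groups are illustrative):

```r
set.seed(1)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 4), ncol = 2))   # two well-separated groups

km <- kmeans(x, centers = 2, nstart = 20)           # nstart: several random initialisations
km$cluster                                          # cluster assignments
km$tot.withinss                                     # within-cluster sum of squares

# Criterion value for several K; a large drop suggests a "natural" number of clusters
sapply(1:6, function(K) kmeans(x, centers = K, nstart = 20)$tot.withinss)
```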

Validation set error rate estimation

1. Randomly divide the available set of observations into two parts, a training set and a validation set (or hold-out set) 2. train the prediction algorithm on the training set 3. use the resulting prediction rule to predict the labels for the observations in the validation set 4. use the resulting validation set error rate (or the validation MSE in the case of regression) as an estimate of the expected test error rate (or the expected test MSE). Pros: - Conceptually simple - Easy to implement Cons: - The Validation estimate of the expected test error can be highly variable. - Only a subset of the observations is used to train the algorithm. Thus, the validation set error tends to overestimate the ETE (Expected Test Err) for the algorithm trained on the entire data set.

Data frame

2d matrix with attributes/features/variables and observations/instances and sometimes with an output column

Contingency table

A data matrix that displays the frequency of some combination of possible responses to multiple variables; cross-tabulation results

Polytree

A digraph that becomes a tree when we erase the directions of the arrows

Arborescence

A directed tree which is a digraph obtained from a tree by choosing a vertex (making it the root) and directing all edges away from the root

Metric

A function d(x, y) offered for measuring the distance between two vectors x and y is a metric if it satisfies the following properties: d(x, y) ≥ 0 d(x, y) = 0 iff x = y d(x, y) = d(y, x) d(x, y) ≤ d(x, z) + d(z, y) In vector spaces (where subtraction is allowed), we often define d(x, y) = ||x − y|| using a norm ||...||

U-shape phenomenon

A fundamental property of statistical learning that holds regardless of the particular data set at hand and regardless of the prediction algorithm being used, which consist of: - a monotone decrease in the training MSE - a U-shape in the test MSE It's the result of 2 competing properties of prediction algorithms, bias and variance

Classification ANN

A general K is needed to perform K-class classification. The kth unit, at the right, models the probability of class k (the possible labels are encoded as k = 1, ..., K). It's convenient to encode each true label y_i, i = 1, ..., n, in the training set by ỹ_i,k := k == y_i ? 1 : 0 The complete set λ of weights now consists of: α_m,0 and α_m,l for m = 1, ..., M and l = 1, ..., p; β_k,0 and β_k,m for k = 1, ..., K and m = 1, ..., M. Therefore, we have M(p + 1) + K(M + 1) weights overall. The Learning Machine now works as follows: Z_m = σ(α_m0 + α_m * X) m = 1, ..., M T_k = β_k0 + β_k * Z k = 1, ..., K Y_k = e^(T_k) / ∑l=1:K(e^(T_l)) k = 1, ..., K (the last ratio is the softmax function) The 2 standard ways to measure the training error are the Brier loss and the deviance

Expectation-Maximization (EM)

A general approach for solving problems involving hidden or latent variables Y = (y_1, y_2, ..., y_n) (assumed discrete, for simplicity). - Goal: learn the parameter vector θ of a model P(X, Y|θ) from observations X = (x_1, ..., x_n), i.e., Y is hidden - Maximum likelihood estimate using X: θ̂ = arg max_θ L(θ) = arg max_θ ∑_Y[P(X, Y|θ)] EM is useful when directly optimising L is intractable, but computing the MLE from fully-observed data (x_1, y_1), ..., (x_n, y_n) is easy In some cases, we can analytically maximise the likelihood function (e.g. this works for the binomial and 1D Gaussian cases). But it does not always work: the distribution may not be well-behaved or may have too many parameters, so direct maximization may not be feasible. Solution: introduce hidden variables to simplify the likelihood function (hidden variables are sometimes used to account for actual missing data). Note: a few (potentially important) slides were omitted from this set Pros: - Most useful when the model distributions are easy to maximize (as for mixtures of Gaussians) Cons: - local maxima - need to bootstrap the training process (pick an initial θ)

Rocchio

A general classification algorithm but it's particularly popular in IR. 1. For each class k, compute the (component-wise) arithmetic mean μ_k of the normalised document vectors in class k in the training set: μ_k := 1 / ( |{i : y_i = k}| ) * ∑i:y_i=k(x_i) where x_i := →v(d_i) and y_i is the label of document d_i in our collection d_1, ..., d_n. 2. Classify a test document as k for its nearest μ_k

Connected graph

A graph where we can travel from any vertex to any other vertex via edges

Recursive Binary Splitting

A greedy top-down approach to split the object space where the best split is made at a particular step of the tree-building process. Steps: 1. Select an attribute Xj and a cutpoint s such that splitting the object space into the regions { X | Xj < s } and { X | Xj ≥ s } leads to the greatest possible reduction in RSS. 2. Repeat (this time splitting one of 2 previously identified regions) until a stopping criterion is met (e.g. no regions contain 6+ observations)

Decision tree

A hierarchical arrangement of criteria that predict a classification or a value. It has excellent interpretability (except when boosted) but is less predictively accurate than some other algorithms. Main steps: - Stratifying, or segmenting (tree branches), the object space (the set of possible values) into simple non-overlapping regions - To predict a given result for an object, we typically use the mean/mode of the training labels in the region to which it belongs (e.g. R1 = { X | Years < 4.5 }, R2 = { X | Years ≥ 4.5 && Hits < 117.5 }, R3 = { X | Years ≥ 4.5 && Hits ≥ 117.5 }). The goal is to find regions R1, ..., RJ that minimize the RSS Pros: - very easy to explain - believed to mirror closely human decision-making - can be displayed graphically and are easily interpreted by non-experts - can easily handle qualitative attributes Cons: - Doesn't have the same level of predictive accuracy as some other algorithms

Semi-structured data

A hybrid data format which consists of structured and unstructured data. E.g. emails, blood pressure at different times of the day, X-Ray/MRI, specialist's comments, relationship hierarchy between patient/doctors/hospitals

Entropy

A measure of disorder or randomness: D = −∑k p̂_mk log(p̂_mk). Since 0 ≤ p̂_mk ≤ 1, D is non-negative. The entropy will take on a value near zero if the p̂_mk are all near 0 or 1. Therefore, entropy is also a node impurity measure

Scatter-Gather

A method that clusters the whole collection to get groups of documents that the user can select or gather. The selected groups are merged and the resulting set is again clustered. This process is repeated until a cluster of interest is found. It's useful when the user is unsure about which search terms to use. But automatically finding descriptive labels for clusters is a difficult problem. Clustering for speeding up search: - Searching in the vector space model amounts to finding the nearest neighbours to the query (using cosine similarity). - Computing the similarity of the query to every document is slow. - An alternative is to find the clusters that are closest to the query and only consider documents from these clusters. - Within this much smaller set, we can compute similarities exhaustively and rank documents in the usual way

Back-propagation

A process by which learning can occur in a network, in which an error signal is transmitted back through the network. This backwards-transmitted error signal provides the information needed to adjust the weights in the network to achieve the correct output signal for a stimulus. It's an instance of the Gradient Descent method. It slowly moves against the gradient of the training MSE until some stopping condition is met. Namely: at each step, we add −η times the gradient, where η (the learning rate) is a small positive constant. Algorithm: Start from random weights β_m (m = 0, 1, ..., M) and α_ml (m = 1, ..., M, l = 0, 1, ..., p). 1. Forward pass: Compute the variables Z_m = Z_im (m = 1, ..., M) and Y = Y_i for the ith training observation (i = 1, ..., n) following the Regression ANN formulas. 2. Backward pass: Compute the "errors" δ_i := 2(y_i − Y_i) and back-propagate them to s_im := β_m * σ'(α_m0 + α_m * x_i) * δ_i for all i = 1, ..., n and m = 1, ..., M 3. Weight update: update the weights (including β_0 and α_m0): β_m := β_m + η/n * ∑i=1:n(δ_i * Z_im) α_ml := α_ml + η/n * ∑i=1:n(s_im * x_il) 4. Go to 1 (starting a new epoch) unless a stopping condition is met Note: - it's important to scale the inputs (i.e. z-score normalising them). - With standardized inputs, it is typical to take the starting weights distributed uniformly on [−0.7, 0.7]. - Starting with large weights often leads to poor solutions. - To prevent the NN from getting stuck in local minima, run the algorithm several times with different starting weights - Possible ways to stop: • Divide the available data into a training set and a validation set. • Train the neural net on the training set until the error (such as MSE) on the validation set starts increasing. • Train the neural net on all data using the number of epochs giving the minimal error on the validation set (found at the previous step). • Use cross-validation.

Discrete random variable

A random variable Y that can take one of a finite number of distinct outcomes. The distribution of a discrete Y taking values x_1, x_2, ..., x_m can be described by probabilities p_i = P(Y = x_i), i = 1, 2, ..., m

Continuous random variable

A random variable Y that may assume any numerical value in an interval or collection of intervals. Y has a density p_Y(x) such that P(a < Y ≤ b) = integral from a to b of p_Y(x) dx

Hierarchical clustering

A set of nested clusters organized as a hierarchical tree. Methods: - Agglomerative (i.e., bottom-up) - Divisive (i.e. top-down) Main issues: - What is a good inter-cluster distance? - How many clusters are there? Hierarchical methods actually produce several partitions; one for each level of the tree. However, for many applications, we will want to extract a set of disjoint clusters. In order to turn the nested partitions into a single flat partitioning, we cut the dendrogram. A cutting criterion can be defined as a threshold Pros: - Tree-based organization - Several choices of distance measure and linkage criterion (the latter being for how the (dis)similarity is calculated). - Trees can be cut off at some level to generate a flat partition of the data - no underlying model Cons: - Slow (feasible for thousands of observations)
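A minimal sketch of agglomerative (bottom-up) clustering with scipy; the toy data, the "average" linkage and the cut height of 2.0 are arbitrary choices for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).normal(size=(30, 2))    # 30 observations, 2 features
Z = linkage(X, method="average")                     # build the full merge hierarchy
dendrogram(Z, no_plot=True)                          # set no_plot=False to draw it (needs matplotlib)
labels = fcluster(Z, t=2.0, criterion="distance")    # cut the dendrogram at height 2.0 -> flat partition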

Probabilistic modelling with joint distribution

- A problem domain is modelled by a list of random variables X_1, X_2, ..., X_n. - Knowledge about the problem domain is represented by a joint probability P(X_1, X_2, ..., X_n). - Advantages: • Probability theory is well-established and well-understood. • In theory, we can perform arbitrary inference among the variables given a joint probability. This is because the joint probability contains information on all aspects of the relationships among the variables. • All inference is sanctioned by the laws of probability and hence has clear semantics. - Difficulty: complexity in model construction and inference - In general, • P(X_1, X_2, ..., X_n) needs at least 2^n − 1 numbers to specify the joint probability. • knowledge acquisition is difficult (complex and unnatural) • storage and inference are exponential

Logistic Regression

A statistical analysis which determines an individual's risk of the outcome as a function of a risk factor. The outcome of interest has two categories. Rather than modelling the label Y directly, it models the probability that Y is, say "Yes" (where Yes is encoded as 1 and No as 0). It models p(X) := ℙ(Y = 1 | X) where X = {X_1, ..., X_p} by p(X) = σ(β_0 + β_1 * X_1 + ... + β_p * X_p). σ is the sigmoid/logistic function: σ(x) = 1 / (1+e^-x) = e^x / (1 + e^x) Notice: - 0 < σ(x) < 1 and lim(x→∞)(σ(x)) = 1, lim(x→-∞)(σ(x)) = 0 - σ(x) + σ(−x) = 1 (symmetry around (0, .5))
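A minimal sketch of the logistic-regression prediction rule; the coefficient values are made up for illustration (in practice they are fitted by maximum likelihood):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

beta0, beta = -1.0, np.array([0.5, 2.0])   # hypothetical fitted parameters
x = np.array([1.2, 0.3])                   # one object with p = 2 attributes
p_yes = sigmoid(beta0 + beta @ x)          # estimated P(Y = 1 | X = x)
y_hat = int(p_yes > 0.5)                   # classify as "Yes" if the probability exceeds 0.5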

Linear Regression

A statistical method used to fit a linear model to a given data set. In simple LR, parameters are vectors (β_0, β_1) ∈ ℝ^2 and the learning machine is F(x, β) = β_0 + β_1 * x In multiple LR, we have p (p > 1) attributes, parameters are vectors (β_0, β_1, ..., β_p) ∈ ℝ^(p+1), and the learning machine is F(x, β) = β_0 + β_1 * x_1 + ... + β_p * x_p

Stratified sampling

A type of probability sampling in which the population is divided into groups with a common attribute and a random sample is chosen within each group, i.e. the proportions per group/class after the sampling remain the same.

Hidden variable

A variable that influences the data but is not easy (in some cases even impossible) to measure (e.g. the phonemes that produce a given speech recording, or whether the smoke alarm is malfunctioning or not). In MoG models, the hidden variables are the (cluster) labels.

Observed variable

A variable that is directly measured from the data (e.g. waveform values of a speech recording, smoke alarm going off or not)

Neuron

ANN node which has: - k weights w_1, w_2, ..., w_k where k is the number of arrows entering that neuron - one threshold b If the neuron receives signals s_1, s_2, ..., s_k from the neurons (or input variables) below it, it sends the signal θ(w_1*s_1 + w_2*s_2 + ··· + w_k*s_k − b) to the neuron after it (i.e. on its right)

Expectation (expected value)

Also known as the mean (value): - for a discrete random variable Y, 𝔼Y = ∑ x_i * ℙ(Y = x_i) - for a continuous variable Y, 𝔼Y = integral from -∞ to ∞ of x*p_Y(x)dx Properties: - 𝔼(X + Y) = 𝔼X + 𝔼Y always - 𝔼(cX) = c * 𝔼X always - 𝔼(XY) = 𝔼X*𝔼Y if X and Y are independent Remark: 𝔼(X²) = 𝔼(XX) does not have to be (𝔼X)² because X is not independent of itself!

Cost-complexity pruning

Also known as weakest link pruning, gives a small set of sub-trees (the best can then be selected based on test performance). We consider a sequence of trees indexed by a non-negative tuning parameter α. With each value of α corresponds a subtree T ⊆ T0 such that the penalized RSS ∑m=1:|T|[∑(x_i ∈ Rm)((y_i − ŷ_Rm)²)] + α|T| is as small as possible. Here |T| indicates the number of terminal nodes of the tree T, Rm is the rectangle corresponding to the mth terminal node, and ŷ_Rm is the predicted label associated with Rm (i.e., the mean of the training observations in Rm). The tuning parameter α controls the trade-off between the subtree's complexity and its fit to the training data. The bigger α is, the smaller the resulting tree will be (if α = ∞ the tree would just be the root).
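A minimal sketch of cost-complexity pruning with scikit-learn (assuming a version ≥ 0.22, which exposes this via ccp_alpha; X and y are placeholder training data):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=200)

path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
# path.ccp_alphas is the sequence of alpha values; each one corresponds to a subtree T of T0
for alpha in path.ccp_alphas:
    tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(alpha, tree.tree_.node_count)   # larger alpha -> smaller tree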

m-dimensional vector

An array of m numbers v = (v_1, v_2, ..., v_m). - vectors are added component-wise if u = (u_1, u_2, ..., u_m) and v = (v_1, v_2, ..., v_m), then u + v = (u_1 + v_1, u_2 + v_2 , ..., u_m + v_m) - vector can be multiplied by a scalar (= number): if v = (v_1, v_2, ..., v_m) and c is a number, then cv = (cv_1, cv_2, ..., cv_m) - the dot product of u = (u_1, u_2, ..., u_m) and v = (v_1, v_2, ..., v_m) is u · v = u_1*v_1 + u_2*v_2 + ··· + u_m*v_m = ∑i=1:m(u_i * v_i) - u and v are called orthogonal if u · v = 0 - the norm of a vector v is ||v|| = √(v · v) if v = 0 = (0, ..., 0) then ||v|| = 0 - vector distance: d(u, v) = ||u − v||
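A minimal sketch of these vector operations with numpy (toy vectors chosen arbitrarily):

import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, -2.0, 0.0])

s = u + v                      # component-wise addition
c = 2.5 * v                    # multiplication by a scalar
dot = u @ v                    # dot product: 1*4 + 2*(-2) + 3*0 = 0, so u and v are orthogonal
norm = np.sqrt(v @ v)          # norm ||v|| (same as np.linalg.norm(v))
dist = np.linalg.norm(u - v)   # vector distance d(u, v)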

Random variables

Numerical outcome of a random phenomenon. It can be discrete or continuous

Quantitative variable

Numerical/continuous values (e.g. person's age, height, income, stock price)

Causal networks

Often called Bayes networks, these are graphical representations of uncertainty. The problem with the full joint distribution is that it needs to store all combinations, i.e. numOfValues^numOfVars of them (e.g. 2^5 for 5 variables which can each take 2 different values); the opposite extreme of assuming all variables are independent is the naïve Bayes approach. For 5 nodes, causal networks allow us to reduce the number of probabilities from 31 (2^5 − 1) to 11 (1 for A (P(A)) + 2 for B (P(B | A) and P(B | ¬A)) + 2 for C + 4 for D + 2 for E (P(E | C) and P(E | ¬C))), due to the fact that the probabilities depend only on the parents (so 2 per node with a single parent, if there are parent nodes).

Chain rule

P(A ∩ B ∩ C ∩ D) = P(A) P(B | A) P(C | A, B) P(D | A, B, C), where a comma stands for ∩

Responsibilities

Posterior probabilities in Mixture of Gaussian models. Intuitively, γ_j(x) tells us how much the jth Gaussian component is responsible for the data point x.

Desiderata for learning algorithms

Predictive efficiency, computational efficiency, scalability, interpretability

Classification problem

Problem with a qualitative label

Experimental design

Process of planning a study to meet specific objectives. Informal principles: • Think about the experiment - why are you doing the experiment? - what do you want from it? • Examine the factors you are interested in - can you unambiguously estimate them? - optimise the precision • Remove effects of unwanted factors • Plan at the outset how the data will be analysed

Cleaning data

Process of removing unnecessary data (duplicates, redundant variables, outliers) and reducing the data

Imputation methods

Process of replacing missing data with substituted values. • Assign a value to the missing one, based on the rest of the dataset. • As these methods extract a model from the dataset to perform the imputation, they are suitable under MCAR and, to a lesser extent, MAR types of missing values. • Not suitable for NMAR type of missing data or if there's more than 10% of the data missing. - It would be necessary in this case to go back to the source of the data to obtain more information. Examples: - Most Common value: • If the missing value is continuous: replace it with the mean value (or median for noise detection, since it's more robust) of the attribute for the dataset • If the missing value is discrete: replace it with the most frequent value of the attribute for the dataset • Simple and fast to compute • Assumes that each attribute has a regular distribution (such as normal) - Regression imputation: replace the missing values by the predicted value from a regression equation • Advantage: - uses information from observed data • Disadvantages: - overestimates model fit and distorts correlations - distorts variance
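A minimal sketch of most-common-value imputation with scikit-learn's SimpleImputer; the toy array is an assumption for illustration:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 7.0],
              [np.nan, 5.0],
              [3.0, np.nan]])

mean_imp = SimpleImputer(strategy="mean")            # continuous attributes: column means
X_mean = mean_imp.fit_transform(X)

mode_imp = SimpleImputer(strategy="most_frequent")   # discrete attributes: most frequent value
X_mode = mode_imp.fit_transform(X)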

Independence

Random variables X and Y are independent if ℙ(X ∈ A and Y ∈ B) = ℙ(X ∈ A) × ℙ(Y ∈ B) ∀ sets A and B. Observing an outcome of X gives no information concerning the outcome of Y. Remark: 2 random variables may have the same distribution and still be independent. Random variables X_1, X_2, ... are independent if, for any n, X_1, X_2, ..., X_n are independent (i.e. if ℙ(X_1 ∈ A_1 and X_2 ∈ A_2 and ... and X_n ∈ A_n) = ℙ(X_1 ∈ A_1) * ℙ(X_2 ∈ A_2) * ··· * ℙ(X_n ∈ A_n)). Two events A and B are conditionally independent given C if P(A, B | C) = P(A | C) P(B | C) (i.e., independence inside C). Also, in that case, if P(B, C) ≠ 0 then P(A | B, C) = P(A | C)

Classification tree building

Same as for regression trees, except, instead of using the mean label (of training objects that belong to the same terminal node) for predictions, we use the most commonly occurring class of training objects in the region to which it belongs. When building a classification tree, either Gini index or entropy are typically used to evaluate the quality of a particular split. Any of the 3 approaches might be used when pruning the tree, but the classification error rate is preferable if prediction accuracy of the final pruned tree is the goal

Empirical Risk Minimization (ERM)

Selecting the prediction rule that minimises the training MSE when no test observations are available

Digraph cycle

Sequence of vertices v_1, ..., v_m such that there are arrows from v_1 to v_2, from v_2 to v_3, ..., from v_m−1 to v_m, and from v_m to v_1; values as small as m = 1 are allowed (unlike in undirected graphs).

Graph cycle

Sequence of vertices v_1, ..., v_m such that v_1 is connected to v_2, v_2 to v_3, .., v_m−1 to v_m, and v_m to v_1; the value m = 3 is allowed (but not m = 2 or m = 1).

Λ

Set of parameters (i.e. parameter space) for learning machines

Decision tree methods

Set of splitting rules used to segment the object space which starts at the top of the tree. Also known as a set of internal nodes

Expected Test Error

Sometimes also called the generalization error 𝔼((Y − fˆ(X))²) in the case of regression ℙ(Y ≠ fˆ(X)) in the case of classification By the law of large numbers, it can be interpreted as the MSE (in the case of regression): ∑i=1:m((Y_i - fˆ(X_i))²) / m. or TER (in the case of classification): ∑i=1:m(I(Y_i ≠ fˆ(X_i))) / m for a huge test set of size m >> 1

Average linkage

Average inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.

Feature selection - filtering

Basic idea: assign a heuristic score to each feature to filter out the "obviously" useless ones. Pros: - Very fast - Simple to apply Cons: - It doesn't take into account interactions between features: apparently useless features can be useful when grouped with others

Feature selection

Choosing a subset of all the features (the ones that are more informative, typically the ones with the highest variance). Motivation: - Features may be expensive to obtain (e.g. blood samples) - Selected features may possess simple meanings (interpretability): we may directly derive an understanding of our problem from the classifier - We often try to find a simple, "parsimonious" model: Occam's razor (the simplest explanation that accounts for the data is best) Methods: - Filter method: selects a subset of features independently of the model that shall subsequently use them. - Wrapper method: selects a subset of features taking into account the model that shall use them. - Embedded method: the feature selection method is built in the learning model itself (e.g. decision trees). Pros: - Removing features: equivalent to projecting data onto a lower-dimensional linear subspace perpendicular to the feature removed - It can be faster (than the other approach) at test time

Experiment

Controlled process or study that results in the collection of data

Structured data

Data already in a database or a spreadsheet. The data format is well-known (relations and tuples). Every tuple conforms to a known schema

Discretization

Divide the range of a continuous attribute into intervals. • Some classification algorithms only accept categorical (non-numerical) attributes (e.g., C4.5 and Bayesian belief networks) • Reduce data size by discretization • Prepare for further analysis Methods: • Unsupervised discretization 1. Equal-interval binning 2. Equal-frequency binning - Labels are ignored - The best number of bins is determined experimentally • Supervised discretization - Entropy-based discretization - It tries to maximise the "purity" of the intervals (the purest intervals are those containing only one class label)
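A minimal sketch of the two unsupervised binning methods with pandas (the ages are toy values):

import pandas as pd

ages = pd.Series([23, 35, 29, 61, 44, 52, 19, 38])

equal_interval = pd.cut(ages, bins=4)   # equal-interval binning: 4 bins of equal width
equal_freq = pd.qcut(ages, q=4)         # equal-frequency binning: 4 bins with roughly equal counts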

Smoothing spline

Elegant non-linear regression method with ℝ as the object space, thus training points are (x_i, y_i) ∈ ℝ², i = 1, ..., n and our goal is to fit a function fˆ(x) to them. For a given tuning parameter c ≥ 0, the corresponding smoothing spline is the (twice differentiable) function g that minimizes ∑i=1:n((y_i − g(x_i))²) + c * integral of g''(t)² dt. The penalty term, the integral of g''(t)² dt, is minimized by a straight line. c is responsible for the bias/variance trade-off (if it's very big, the bias is huge; conversely, if it's 0 there's no bias). Properties: - it is a piecewise cubic polynomial with knots at x_1, ..., x_n (i.e., it is a cubic polynomial between any pair of adjacent knots) - it has a continuous second derivative (at each knot) - it is linear in the region outside of the extreme knots. The goals are: - the points should be as close as possible to the actual ones - the spline should be as straight as possible. During the minimisation step, the points before/after the ones we have will lie on a straight line obtained by minimising the penalty term (the 2nd one in the equation, the one with the derivative), which is done by nullifying some parameters.

Prediction

Estimating f when a set of inputs X are readily available but the output Y can't be easily obtained. Ŷ = fˆ(X) where fˆ (f hat and prediction rule) is an estimate of f and Ŷ is the resulting prediction for Y. fˆ is often treated as a black box.

Distance measure between document vectors

Euclidean distance could be used; however, the Euclidean distance between 2 documents with very similar content can be very large because one is much longer than the other. The standard way is then to use a similarity measure between documents d_1 and d_2, namely the cosine similarity of their vector representations →V(d_1) and →V(d_2): sim(d_1, d_2) := (→V(d_1) ⋅ →V(d_2)) / (||→V(d_1)|| * ||→V(d_2)||)
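A minimal sketch of cosine similarity between two document vectors (toy term-count vectors chosen for illustration):

import numpy as np

def cosine_similarity(v1, v2):
    return (v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

d1 = np.array([2.0, 0.0, 1.0, 3.0])   # term counts of document 1
d2 = np.array([4.0, 0.0, 2.0, 6.0])   # document 2: same content, twice as long
print(cosine_similarity(d1, d2))      # 1.0 -- the length difference does not matter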

Iterative cluster optimization

Exhaustive enumeration of all partitions, which guarantees the optimal solution, is infeasible (e.g. a problem with 5 clusters and 100 observations yields about 5^100/5! ≈ 10^67 partitions). - The common approach is to proceed in an iterative fashion: 1. Find some reasonable initial partition. 2. Move observations from one cluster to another in order to reduce the objective function. - Such iterative methods produce sub-optimal solutions but are computationally tractable. Groups of iterative methods: - Flat clustering algorithms: • these algorithms produce a set of disjoint clusters • they are the most widely used algorithms and include K-means. - Hierarchical clustering algorithms: • the result is a hierarchy of nested clusterings • these algorithms can be broadly divided into agglomerative and divisive approaches

Distribution function

F_Y(x) = P(Y ≤ x)

Covariance matrix

Features: - Covariance and variance are measures of the "spread" of a set of points around their centre of mass (mean) - The covariance between two variables measures the degree of their linear relationship (a large/small value indicates high/low redundancy) - C_X is a square symmetric matrix. - The diagonal of C_X consists of the variances of the vectors in (the columns of) X. - The other elements of C_X are the covariances between different columns of X
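A minimal sketch of computing the covariance matrix C_X of a data matrix X whose columns are variables (toy data; rowvar=False tells numpy that columns are variables):

import numpy as np

X = np.random.default_rng(0).normal(size=(100, 3))   # 100 observations, 3 variables
C_X = np.cov(X, rowvar=False)                        # 3 x 3 square symmetric matrix
print(np.diag(C_X))                                  # variances of the 3 columns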

Information Retrieval (IR)

Finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). - Formerly: librarians, lawyers, scientists in libraries - nowadays: web search, searching your email We are given a collection of documents, each document consisting of terms (essentially words). Given a query (a set of terms), the goal is to find the closest document to the query. More generally, it's to rank documents according to their distance to the query. The terms are usually what's left after removing stopwords and which are standardized (i.e. lowercased and using consistent spelling such as no hyphens/diacritics/...). BoW (Bag of Words) can help with IR (as it turns words into vectors). Word2Vector embedding is another approach which is more advanced. TF-IDF can be used to determine the discriminating power of a word in a document based on how many times it appears in that document. So instead of using the BoW, we can use the TF-IDF in the word matrix (vector space model where a document is a point in that space).

Graph

Finite set of vertices (or nodes), pairs of which are connected by undirected edges (in the case of undirected graphs) or directed arrows (in the case of directed graphs). A graph is cyclic if it has a cycle; otherwise it is called acyclic.

Overfitting

Fitting a model too closely to training data, resulting in a model that does not accurately generalise the true function f (i.e. small training MSE, large test MSE). This leads to high variance and low bias.

Eigenvector

For a square matrix A, it's a non-zero vector v such that Av = λv for a constant λ (called the eigenvalue). Commonly, v is chosen to be a unit vector (v^T*v = 1). Interpretation: the operation of A in direction v is a scaling by λ. Example: let A be the 2×2 matrix with rows (3, 1) and (1, 3), and v = (4, −4). Then Av = (8, −8) = 2v, so v is an eigenvector with eigenvalue 2.
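A minimal sketch verifying the example above with numpy:

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])
v = np.array([4.0, -4.0])

print(A @ v)                        # [ 8. -8.] = 2 * v, so v is an eigenvector with eigenvalue 2
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)                      # eigenvalues of A: 4 and 2 (columns of eigvecs are unit eigenvectors)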

Document Frequency (DF)

For a term t, it's df_t: the number of documents in the collection that contain the term t

Inverse Document Frequency (IDF)

For a term t, it's idf_t := log(N / df_t) where N is the total number of documents in the collection

Least Squares prediction rule

For a training set (x_1, y_1), ..., (x_n, y_n), the ERM principle recommends using β_0, β_1 that attain: ∑i=1:n[(y_i - ŷ_i)²] → min where ŷ_i := β_0 + β_1 * x_i
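A minimal numpy sketch of this least-squares fit: find β_0, β_1 minimising ∑i=1:n((y_i − (β_0 + β_1 * x_i))²) on toy data chosen for illustration:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.5 + 2.0 * x + rng.normal(scale=0.5, size=50)

A = np.column_stack([np.ones_like(x), x])            # design matrix with columns (1, x_i)
(beta0, beta1), *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = beta0 + beta1 * x                            # fitted values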

Recall

From all the possible correct (truly positive) values, how many were recalled? Recall = TP / (TP + FN)

Learning machine

Function F : X × Λ → Y which outputs a prediction given an object and a parameter. It's trained by finding a suitable parameter λ ∈ Λ. After the machine has been trained and we have settled for some λ ∈ Λ, we can construct a prediction for a new object x_0: ŷ = F(x_0, λ). The prediction rule we are using now is fˆ : X → Y defined by fˆ(x) := F(x, λ)

Smooth kernel

Function K such that the integral of K(x) dx over ℝ^d equals 1. - Usually, but not always, K will be a radially symmetric and uni-modal pdf, such as the Gaussian K(x) = (2π)^(-d/2)*e^(-0.5*x^T*x) - This leads to the density estimate: p_KDE(x) = 1 / (n*h^d) * ∑i=1:n[K((x - x_i) / h)] - Just as the Parzen window estimate can be seen as a sum of boxes centred at the data, the smooth kernel estimate is a sum of "bumps". - The kernel function K determines the shape of the bumps. • it should be a proper pdf, usually chosen to be unimodal and symmetric about zero • the influence of each data point is spread about its neighbourhood - The parameter h, also called the smoothing parameter or bandwidth, determines their width.
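A minimal sketch of a kernel density estimate with a Gaussian kernel in one dimension (d = 1); the sample and the bandwidth h are toy choices:

import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x, sample, h):
    # p_KDE(x) = 1 / (n * h) * sum_i K((x - x_i) / h)
    return gaussian_kernel((x - sample[:, None]) / h).sum(axis=0) / (len(sample) * h)

sample = np.random.default_rng(0).normal(size=200)
grid = np.linspace(-4, 4, 100)
density = kde(grid, sample, h=0.4)   # smaller h -> bumpier estimate, larger h -> smoother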

GLM

Generalized Linear Model

Feature extraction

Given the object space ℝ^p, with x_i ∈ ℝ^p, find a mapping f : ℝ^p → ℝ^M with M < p such that the transformed feature vector z_i := f(x_i) ∈ ℝ^M preserves (most of) the information in x_i ∈ ℝ^p. An optimal mapping f is one that results in no increase in the minimum probability of error. In general, the optimal mapping f will be a nonlinear function, and there is no systematic way to generate nonlinear functions. This approach is commonly limited to linear projections f. Two informal criteria can be used to find the "optimal" feature extraction mapping f. - Signal representation: the goal is to represent the objects accurately in a lower-dimensional space (e.g. PCA) - Classification: The goal is to enhance the class-discriminatory information in the lower-dimensional space (e.g. LDA)

Discrete attribute

Has a finite or countably infinite set of values. It's sometimes represented as an integer variable. There are nominal, ordinal, and binary attributes.

Continuous attribute

Has real numbers as values, e.g., temperature, height, or weight

Classification Error Rate

E = 1 − max_k(p̂_mk). Here p̂_mk represents the proportion of training objects in the mth region that are from the kth class.

Bayesian propagation mechanism

Here, we're trying to determine how the influence of the new information (data) can spread through the network. First, we find new λ(B = b) ∀ B and b, and then new π(B = b) ∀ B and b. Let D^−_C be the part of D^−_B in the subtree rooted in a child C of B (including C if it has been instantiated). Message exchange: λ going up and π going down the tree

Random Forest

Idea: Tweak bagged trees by making the trees less dependent. As in bagging, build a number of decision trees on bootstrapped training sets. But when building these decision trees, each time a split in a tree is considered, choose a random sample of m attributes as split candidates from the full set of p attributes. (The split is allowed to use only one of those m attributes.). A fresh sample of m attributes is taken at each split. If m = p, it's just bagging so typically m ≈ √p. The main difference between that and boosting is: - in RF, decision trees are produced independently - in boosting, decision trees are produced in a sequential manner
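A minimal sketch with scikit-learn; max_features="sqrt" gives the typical m ≈ √p random sample of attributes at each split (toy data for illustration):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)
print(rf.predict(X[:5]))   # majority vote over the 100 trees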

Bootstrap

Idea: instead of generating independent training sets of size n, generate independent samples of size n (with replacement) from the given training set of size n. The bootstrap is often used for estimating the variability of various statistics (functions of a training set)
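A minimal sketch of using the bootstrap to estimate the variability (standard error) of the sample mean; the data are toy values:

import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=100)   # original training set of size n = 100

B = 1000
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()   # sample of size n with replacement
    for _ in range(B)
])
print(boot_means.std())   # bootstrap estimate of the standard error of the mean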

Conditioning

In general, it is possible that we have learned that some of the variables took specific values (in Story 1: Dr Watson's call). This is our data (or evidence) D; we also say these variables have been instantiated. We are interested in Bel(B = b) := P(B = b | D) ∀ B and b ∈ V(B) (our belief in B = b given D). Split the data D into 2 parts: - D^−_B the part of D which is below B in the graph - D^+_B the part of D which is above B in the graph. If the value of B is already known (i.e., B has been instantiated), we include B in both D^−_B and D^+_B. Since our network is a tree, we have: Bel(B = b) = P(B = b | D^+_B, D^−_B) = α * P(B = b | D^+_B) * P(D^−_B | B = b, D^+_B) = α * P(B = b | D^+_B) * P(D^−_B | B = b) := απ(B = b) * λ(B = b) where α is a normalizing constant (constant = independent of b) - second equality: the simplest version of the Bayes formula applied inside the event D^+_B - third equality: conditional independence - π and λ are the main components of our data structure allowing efficient computations. Fusion: π(B = b) := P(B = b | D^+_B) λ(B = b) := P(D^−_B | B = b) When there is no evidence (D is empty): - all λ(B = b) = 1 - all π(B = b) = P(B = b) in general, Bel will be obtained by fusion of π and λ: Bel(B = b) = απ(B = b) * λ(B = b)

Deviance

It's motivated by maximum likelihood estimation; the deviance (cross-entropy) is −∑i=1:n[∑k=1:K(Y_i,k * log(Ŷ_i,k))], where Y_i,k is the value of Y_k for the ith training observation and Ŷ_i,k is the corresponding predicted probability

Clustering in IR

It's widely used in IR, one of the typical applications: helping in grouping coherent groups of search results (particularly useful if the search term has different meanings, e.g. jaguar, apple) The popular algorithms for this are: - K-means - EM - Hierarchical clustering

Gradient Descent

Iterative optimization algorithm for finding the input to a function that produces the optimal value. It consists of many steps called epochs. At each step, the algorithm makes a small step against the gradient of the training error (perhaps regularized) Pros: Simple, local nature of the algorithm: each hidden unit exchanges information only with the units that are connected to it (so it can be implemented efficiently on a parallel computer) Cons: It can be slow, and nowadays other (more complicated) algorithms are often used

TF-IDF

Its weighting scheme assigns to term t a weight in document d given by tf-idf_(t,d) := tf_(t,d) × idf_t I.e., tf-idf_(t,d) assigns to term t a weight in document d that is: - high when the term t (a) occurs many times in d and (b) occurs within a small number of documents (thus lending high discriminating power to those documents) - lower when the term occurs fewer times in document d or occurs in many documents - lowest when the term t occurs in virtually all documents (like stopwords)
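A minimal sketch of this weighting for a toy collection of 3 documents; here tf is taken to be the raw term count (other weighting variants exist):

import numpy as np

docs = [["cat", "sat", "mat"],
        ["cat", "cat", "dog"],
        ["dog", "barked"]]
vocab = sorted({t for d in docs for t in d})
N = len(docs)

df = {t: sum(t in d for d in docs) for t in vocab}    # document frequency df_t
idf = {t: np.log(N / df[t]) for t in vocab}           # idf_t = log(N / df_t)
tfidf = [{t: d.count(t) * idf[t] for t in set(d)} for d in docs]
print(tfidf[1])   # "cat" occurs twice in doc 1 but appears in 2 of the 3 documents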

Model flexibility

Known as the bias-variance trade-off. As we use more flexible methods, the variance will increase and the bias will decrease. More precisely: - As we increase the flexibility of a class of methods, the (squared) bias tends to initially decrease faster than the variance increases. The expected test MSE declines. - At some point, increasing flexibility has little impact on the bias but starts to significantly increase the variance. When this happens the test MSE increases. In the context of decision trees: - A high value of α leads to a low variance but perhaps a high bias. - A low value of α leads to a low bias but perhaps a high variance.

Manhattan (city block) distance

L_1 norm A measure of travel through a grid system like navigating around the buildings and blocks of Manhattan, NYC.

Euclidean distance

L_2 norm Straight line distance

Minkowski distance

L_k norm The choice of an appropriate value of k depends on the amount of emphasis that you would like to give to the larger differences between the components.
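A minimal sketch of the L_1 (Manhattan), L_2 (Euclidean) and general L_k (Minkowski) distances between two toy points:

import numpy as np

def minkowski(u, v, k):
    return np.sum(np.abs(u - v) ** k) ** (1.0 / k)

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 0.0, 3.0])
print(minkowski(u, v, 1))   # Manhattan: 3 + 2 + 0 = 5
print(minkowski(u, v, 2))   # Euclidean: sqrt(9 + 4) = sqrt(13)
print(minkowski(u, v, 4))   # larger k emphasises the largest component difference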

Supervised Learning

Machine learning type where labels (outputs) are available. Kinds: - Batch learning - On-line learning Formally, for each object x_i, i = 1, ..., n there is an associated label y_i. We wish to fit a prediction rule that relates the label to an object

Unsupervised Learning

Machine learning type with no available labels (outputs). Training without a teacher. Formally, we are given objects x_i, i = 1, ..., n, but no associated labels y_i.

Complete linkage

Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.

Single linkage

Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities. Single linkage can result in extended, trailing clusters in which single observations are fused one-at-a-time (so it leads to the least balanced dendrograms); this effect is called diversion.

K-Nearest Neighbours

Attempt to approximate the "Bayes algorithm" classification setting in a transductive way. Natural idea: estimate the conditional distribution of Y given X and then classify a given object to the class with the highest estimated probability. Classification: - Given a test object x_0, the KNN classifier first identifies the K objects N_0 in the training set that are closest to x_0. It then estimates the conditional probability for class j as P̂(Y = j | X = x_0) = [∑(x_i ∈ N_0)(I(y_i = j))] / K (the fraction of objects in N_0 whose label is j). - Finally, KNN classifies the test object x_0 to the class j with the largest probability P̂(Y = j | X = x_0). Regression: Given a test object x_0, KNN regression first identifies the K training observations N_0 that are closest to x_0. - It then estimates f(x_0) using the average of all the training labels in N_0; i.e. fˆ(x_0) := [∑(x_i ∈ N_0)(y_i)] / K. A small K (e.g. K=1) leads to overfitting and a large one (e.g. K=100) leads to underfitting.
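A minimal sketch with scikit-learn (toy data); the choice of K controls the bias/variance trade-off:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)   # K = 5
print(knn.predict([[0.0, 0.0]]))                      # class with the largest estimated probability
print(knn.predict_proba([[0.0, 0.0]]))                # fraction of the K neighbours in each class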

Gradient computation

As usual, the training set is (x_1, y_1), ..., (x_n, y_n); it is fixed. - The training MSE is ∑i=1:n(Ri) / n where Ri := (y_i - Y_i)² - For some given parameters, we can compute the values of the nodes Z_m and Y for each observation i = 1, ..., n; denote them Z_im and Y_i, respectively. - Set x_i0 := 1 and Z_i0 := 1 for all observations δ and s are defined by δ_i := 2(y_i − Y_i) and s_im := β_m × σ'(α_m0 + αm × x_i) × δ_i (the latter is the back-propagation equation)

Mixtures of Gaussians (MoG)

Assumption: each class will have a Gaussian distribution with its own mean and its own covariance matrix. Ideally: - component distributions have high "peaks" (data in one cluster is tight) - the MoG "covers" the data well (dominant patterns in the data are captured by component distributions). - Think of the individual components in the mixture as kernels, except that there are only a few of them, as opposed to one per data point. The weighted sum of a number of Gaussians (K), where the weights are determined by a distribution w: p(x) = w_1 * N(x|μ_1, Σ_1) + ... + w_K * N(x|μ_K, Σ_K), where ∑j=1:K(w_j) = 1, i.e. p(x) = ∑j=1:K(w_j * N(x|μ_j, Σ_j)). Any continuous density can be approximated to arbitrary accuracy by using a sufficient number of Gaussians. Sampling from the MoG: - To generate a data point: 1. Pick one of the components j with probability w_j 2. Draw a sample x_i from that component. 3. Repeat steps 1-2 for each new data point. Fitting a MoG: We wish to invert this process - given the dataset, find the corresponding parameters: mixing coefficients w_j, means μ_j and covariances Σ_j - If we knew which component generated each data point, the maximum likelihood solution would fit each component to the corresponding cluster. - Problem: the data set is unlabelled. How to implement it? - We can apply the method of Gradient Ascent, similarly to what we did for neural networks and logistic regression: taking small steps in the direction of the gradient. - The method is applied to the log-likelihood rather than to the likelihood in order to obtain a simpler expression for the gradient and to improve numerical stability. Pro: - Statistical model for the data generating process Con: - Very slow (feasible for hundreds of observations)
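A minimal sketch of sampling from a two-component MoG in one dimension (the weights, means and standard deviations are toy values):

import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.3, 0.7])       # mixing coefficients, sum to 1
mu = np.array([-2.0, 3.0])
sigma = np.array([0.5, 1.0])

def sample_mog(n):
    comps = rng.choice(len(w), size=n, p=w)        # step 1: pick component j with probability w_j
    return rng.normal(mu[comps], sigma[comps])     # step 2: draw x_i from that component

x = sample_mog(1000)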

Sampling without replacement

At first the sample is empty; do the following m times: remove a random element from the bag and put it into the sample. It cannot be done unless m ≤ n

Ri

Decision tree region/box i also known as terminal nodes or leaves in a tree

Bagging

Bootstrap aggregation: a general-purpose procedure for reducing the variance of a prediction algorithm. Idea: Given a set of n independent real-valued observations Z_1, ..., Z_n, each with variance σ², the variance of the mean Z̄ := [∑i=1:n(Z_i)]/n of the observations is given by σ²/n ∴ averaging a set of observations reduces variance. Ideally, the way to reduce the variance of a prediction algorithm is to take many training sets from the population, construct a separate prediction rule using each training set and average the resulting predictions, i.e., calculating fˆ1(x), fˆ2(x), ..., fˆB(x) using B separate training sets and averaging them: fˆavg(x) := 1/B * ∑b=1:B(fˆb(x)). Realistically, we generally don't have access to multiple training sets, so we: 1. Generate B different bootstrapped training sets 2. Train our prediction algorithm on the bth bootstrapped training set in order to get fˆ^(∗b)(x) 3. Average all the predictions to obtain fˆbag(x) := 1/B * ∑b=1:B(fˆ^(∗b)(x)). It's particularly useful for decision trees (which aren't pruned). In classification trees, one way to do that is, for a given test object, record the class predicted by each of the B trees and then take a majority vote (the overall prediction is the most commonly occurring class among the B predictions). The B parameter isn't critical but it should be sufficiently large.
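A minimal sketch of bagging a regression tree by hand: B bootstrapped training sets, one fitted tree each, predictions averaged (toy data; scikit-learn's BaggingRegressor wraps the same idea):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

B, trees = 50, []
for b in range(B):
    idx = rng.integers(0, len(X), size=len(X))      # bootstrap sample (with replacement)
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

f_bag = np.mean([t.predict(X[:5]) for t in trees], axis=0)   # averaged predictions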

Ordinal data

Data exists in categories that are ordered but differences cannot be determined or they are meaningless. (Example: 1st, 2nd, 3rd)

K-fold Cross Validation

Data is randomly split into K subsets/groups/folds; the 1st fold is used as the validation set; the method is then fitted on the remaining K − 1 folds. This approach is used for estimating the test error. Steps: 0. Randomly split the data into K groups/folds of approx. equal size 1. Treat the kth fold as a validation set (each time, a different group of observations is treated as the validation set) 2. Fit the method on the remaining K − 1 folds 3. Compute the MSEk for the validation set 4. Repeat the above K times (i.e. for k ∈ [1, K], do 1, ..., 3) 5. Compute the K-fold CV estimate (average of the MSE values). Extreme case: LOOCV (leave-one-out). In practice, usually K = 5 or 10 (which leads to an intermediate level of bias, since each training set contains n(K-1)/K obs.). Pros: - Much lower variability than the validation set approach - Non-excessive bias/variance (for K ∈ {5, 10}) - Often computationally more efficient than using large values of K (such as LOOCV) Cons: - No bias but huge variance (for LOOCV) When we perform CV, we may only be interested in the location of the minimum point in the estimated test MSE curve. For example, we might be performing cross-validation in order to identify the method, or the value of a parameter, that results in the lowest ETE. It is often true that CV curves come close to identifying the correct level of flexibility, even if the estimated MSE is not quite right. In general, the bias decreases and the variance increases as K increases
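A minimal sketch of K-fold CV with scikit-learn (K = 5, toy data); the scoring is the negative MSE, so the CV estimate is the average with the sign flipped:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.2, size=100)

scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error")
cv_estimate = -scores.mean()   # K-fold CV estimate of the test MSE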

Incomplete data

Data lacking attribute values, lacking certain attributes of interest or containing only aggregate data (e.g. occupation=" "). Category of missing values: - Missing completely at random (MCAR) - Missing at random (MAR) - Not missing at random (NMAR)

Outliers

Data objects with characteristics that are considerably different from most of the other data objects in the data set. It may be removed depending on the algorithm used (unless it's for intrusion/fraud/anomaly detection)

Noisy data

Data that contains errors or outliers (e.g. salary=-10). The following can handle it: - Clustering (detect and remove outliers) - Combined computer and human inspection (detect suspicious values and have them checked by humans) - Regression (smooth by fitting a regression function to the data)

Unstructured data

Data with no inherent structure and there is no additional annotation. E.g. plain text documents, images, videos

Pearl's algorithm

It allows us to compute probabilities P(X = x) for all nodes X; unconditional and conditional. The main data structure: with each node B are associated 2 sets of numbers, - π(B = b), ∀ b ∈ V(B) (causal, or anticipatory, support for the possible values b of B) - λ(B = b) (diagnostic, or retrospective, support for values b ∈ B). First, we ignore all λ(B = b) and describe how Pearl's algorithm computes the marginal (i.e., not conditional) probabilities for all variables. Note: This is just another representation of the way of computing P(X = x) (in this case also denoted π(X = x)) described above This is done as follows: 1. The algorithm sets π(B = b) := P(B = b) for all possible values b of every initial node B. Every child C of B is activated and given the message π_C(B = b) = π(B = b) (∀ b ∈ V(B)). 2. When a node B is activated, it inspects the π_B(X = x) messages communicated by its parent X (let us assume for now that B has only one parent), sets π(B = b) = ∑x∈V(X)[P(B = b | X = x) * π_B(X = x)] and activates each of its children C giving it the message π_C(B = b) = π(B = b) (∀ b ∈ V (B)). if B has 2 parents, say X and Y, the last equation should be changed to π(B = b) = ∑x∈V(X),y∈V(Y)[P(B = b | X = x, Y = y) * π_B(X = x) * π_B(Y = y)]. This computation is only valid in polytrees, where the values taken by the parents are independent. In the example (story 1, cf. image), the probabilities are: P(H) = 0.1 P(S | H) = 0.9 P(S | ¬H) = 0.2 P(D | S) = 0.7 P(D | ¬S) = 0 P(W | S) = 0.7 P(W | ¬S) = 0.2 Piecing all the things together, the algorithm is as follows: 1. Start from the leaves and compute the λs moving up to the root ("bottom-up propagation") 2. After reaching the root, compute the πs moving down to the leaves ("top-down propagation") 3. After λ and π are computed, fuse them into Bel

Divisive clustering

It begins with the entire data set as a single cluster and recursively divides one of the existing clusters into 2 daughter clusters at each iteration in a top-down fashion. - Specify N_obs (no. of observations); initialise N_cl (no. of clusters). 1. Start with one large cluster, N_cl := 1. 2. Find the "worst" cluster. 3. Split it; N_cl := N_cl + 1. 4. If N_cl < N_obs, go to 2. Choosing the "worst" cluster: the largest number of observations, largest variance, largest sum-squared-error, ... Cons: - More computationally intensive than the other hierarchical clustering method

ANN training

It consists of setting the weights and thresholds for all the neurons and possibly changing the topology of the network. The parameters for the LM consist of all weights and thresholds, and the topology if it is variable. The complete set λ of weights consists of: - α_(m,0) and α_(m,l) for m = 1, ..., M and l = 1, ..., p - β_0 and β_m for m = 1, ..., M. - α_(m,0) and β_0 play the role of thresholds - there are M(p + 1) + (M + 1) weights overall. Notation: α_m := {α_(m,1), ..., α_(m,p)} X := {X_1, ..., X_p} β := {β_1, ..., β_M} Z := {Z_1, ..., Z_M}

Histogram

It is a bar graph depicting a frequency distribution. An elementary form of density estimation, but has several drawbacks: - The density estimate depends on the starting position of the bins. - For multivariate data, the density estimate is also affected by the orientation of the bins. - The discontinuities of the estimate are not due to the underlying density; they are only an artefact of the chosen bin locations. • These discontinuities make it very difficult to grasp the structure of the data. - A much more serious problem is the curse of dimensionality: the number of bins grows exponentially with the number of dimensions. • In high dimensions, we would require a very large number of examples or else most of the bins would be empty. - These issues make the histogram unsuitable for most practical applications except for quick visualizations in one or two dimensions.

Neural Network (ANN)

Learning machine of a particular kind that attempts to emulate the way the human brain works. It was motivated by neurophysiology (the McCulloch-Pitts model). - At the left level, we have input variables, each of them taking values 0 or 1 (denoted X_1, ..., X_p). - The following levels contain units (not quite neurons); they form the hidden layer (denoted Z_1, ..., Z_M). - The last level contains one or more neurons (the output neurons, denoted Y_1, ..., Y_K). Each neuron has several inputs and an output. It works as follows: - the input variables send their values to the level 1 neurons - the level 1 neurons then compute their output formula θ(w_1 × s_1 + w_2 × s_2 + ··· + w_k × s_k − b) and send it to the next level of neurons - ... - the output neurons compute their outputs and the vector of outputs is the overall output of the NN. Weights can have several types: - α for those between X (input) neurons and Z (1st-layer) neurons - β for those between Z neurons and Y (output) neurons. Usage: - vehicle/process control - radar systems - face identification - gesture and speech recognition - automated trading - e-mail spam filtering

Boosting

Like bagging, it's a general approach, applicable to many prediction algorithms. In the context of regression trees: 1. Set fˆ(x) := 0 and r_i := y_i for all i = 1, ..., n. 2. For b = 1, 2, ..., B, repeat: (a) fit a tree fˆb with d splits (i.e., d + 1 terminal nodes) to the training data (x_i, r_i), i = 1, ..., n (b) update fˆ by fˆ(x) := fˆ(x) + η * fˆb(x) (c) update the residuals: r_i := r_i − η * fˆb(x_i), i = 1, ..., n 3. Output the boosted prediction rule: fˆ(x) := ∑b=1:B(η * fˆb(x)). Ideas behind it: - Unlike fitting a single large decision tree to the data, boosting learns slowly. - Given the current prediction rule, we fit a decision tree to its residuals. - We then add a bit of this new decision tree into the prediction rule in order to improve the residuals. - Each of the new trees can be rather small, with their size determined by the parameter d. - This way we slowly improve fˆ in areas where it does not perform well. The parameters are: - B: number of trees (unlike bagging and random forests, boosting can overfit if B is too large) - η: learning rate (a small positive number typically 0.01 or 0.001), a very small η can require a very large B. - d: number of splits in each tree (often d=1 works well)
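A minimal sketch of this boosting loop for regression trees (toy data; the choices d = 1, η = 0.01 and B = 500 are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

B, eta, d = 500, 0.01, 1
r = y.copy()                                 # residuals, initially r_i = y_i
trees = []
for b in range(B):
    tree = DecisionTreeRegressor(max_depth=d).fit(X, r)   # fit a small tree to the residuals
    trees.append(tree)
    r -= eta * tree.predict(X)               # update the residuals

def f_hat(X_new):                            # boosted prediction rule: sum of shrunken trees
    return eta * sum(t.predict(X_new) for t in trees)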

Abductive reasoning

Logic in which the likeliest possible scenario is inferred from incomplete information. It follows causal relationships.

