Machine Learning

True or false: If you have full data you still need prior knowledge.

False.

What is a discriminant in supervised learning problems?

A function that separates the examples of different classes.

What is a stopping criterion?

A function whose value determines when a leaf node should not be expanded into subtrees.

Why is the complete Bayes classifier impractical?

A lot of data is required to estimate the conditional probabilities.

Why has ID3 been known for its bias?

A many valued attribute will partition the examples into many subsets that are very small and are likely to contain a high percentage of one class by chance alone.

How can learning be viewed in respect to hypothesis classes?

A search of the hypothesis class for the model 𝐡 ∈ 𝐇 maximizing the performance measure.

How can decision trees be used for feature extraction?

By building a tree and taking only the features used as inputs to another learning method.

What two types of learning can decision trees be used for?

Classification and regression

What is the major practical limitation of Bayesian methods?

Combinatorial explosion: if there are n possible attributes to consider, we would need all the conditional probabilities, which are as many as the possible combinations of evidence (2^n for binary attributes).
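
To get a feel for the growth, the short sketch below counts evidence combinations per class for binary attributes. The function names are made up for illustration, and the comparison with a per-attribute count assumes the conditional-independence simplification used by Naïve Bayes.

```python
def full_bayes_entries(n_binary_attributes):
    """Evidence combinations a complete Bayes classifier must cover, per class."""
    return 2 ** n_binary_attributes

def naive_bayes_entries(n_binary_attributes):
    """Per-attribute probabilities P(a_i | c) needed per class under independence."""
    return n_binary_attributes

# Print how quickly the two counts diverge as n grows.
for n in (5, 10, 30):
    print(n, full_bayes_entries(n), naive_bayes_entries(n))
```

Already at 30 binary attributes the full table has over a billion combinations per class, which is why estimating it from data is impractical.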

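What do Naïve Bayes classifiers assume?

Conditional independence of the attributes given the class: P(a1 ⋂ ... ⋂ an | cj) = ∏ P(ai | cj). Example, using Bayes' theorem: P(spam | w1 ⋂ ... ⋂ wn) = P(w1 ⋂ ... ⋂ wn | spam) × [P(spam) / P(w1 ⋂ ... ⋂ wn)], where under the assumption P(w1 ⋂ ... ⋂ wn | spam) = ∏ P(wi | spam).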
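
A minimal sketch of how the independence assumption is used in practice. The probabilities, class names, and the helper `naive_bayes_posterior` are hypothetical and chosen only for illustration.

```python
from math import prod

def naive_bayes_posterior(priors, likelihoods, observed):
    """Return P(class | observed) under the conditional-independence assumption.

    priors:      {class: P(class)}
    likelihoods: {class: {attribute: P(attribute | class)}}
    observed:    attributes seen in the example
    """
    # Unnormalized score for each class: P(c) * product of P(a_i | c)
    scores = {
        c: priors[c] * prod(likelihoods[c][a] for a in observed)
        for c in priors
    }
    total = sum(scores.values())  # plays the role of P(a_1, ..., a_n)
    return {c: s / total for c, s in scores.items()}

# Hypothetical numbers, not estimated from any real data.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"offer": 0.5, "meeting": 0.1},
    "ham": {"offer": 0.1, "meeting": 0.4},
}
posterior = naive_bayes_posterior(priors, likelihoods, ["offer"])
```

With these numbers the message containing "offer" is scored 0.4 × 0.5 = 0.2 for spam against 0.6 × 0.1 = 0.06 for ham before normalization.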

What are 3 interpretations of noise?

1. Imprecision in recording the input attributes. 2. Errors in labeling the data points. 3. Additional attributes (hidden or latent) that have not been taken into account.

Learning programs that can accept new training instances after learning has begun, without starting again are said to be _______________.

Incremental

What is the set of assumptions needed in order to have a unique hypothesis?

Inductive bias

What is the Most General Hypothesis (G)?

It is the largest rectangle we can draw that includes all the positive examples and none of the negative examples.

What is the Most Specific Hypothesis (S) ?

It is the tightest rectangle that includes all the positive examples and none of the negative examples. Which gives us one hypothesis, h= S. C may be larger than S but is never smaller.
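
As a rough illustration for two numeric attributes, S can be computed as the bounding box of the positive examples. This sketch assumes axis-aligned rectangles and that no negative example falls inside the box; the data and function names are hypothetical.

```python
def most_specific_rectangle(positives):
    """Tightest axis-aligned rectangle (x_min, x_max, y_min, y_max) around the positives."""
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    return min(xs), max(xs), min(ys), max(ys)

def predict(rect, point):
    """Label a point positive if it falls inside the rectangle."""
    x_min, x_max, y_min, y_max = rect
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max

# Hypothetical 2-D data, for illustration only.
positives = [(2, 3), (4, 5), (3, 4)]
negatives = [(0, 0), (6, 6)]
S = most_specific_rectangle(positives)
```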

What is the "No Free Lunch" theorem?

It states that if we average over all possible problems, no learning algorithm is better than another one, not even random guessing.

How does regression work? What kind of learning is this?

It uses attributes as inputs and the output is a characteristic, supervised learning. Example: Autonomous cars-- input: sensors, output: angle of the wheel.

What is the general smoothness assumption?

Small changes in the input are unlikely to cause big changes in the output.

What are the assumptions and inductive bias for IBL?

Smoothness: given a meaningful distance measure between training examples, cases that are near each other tend to belong to the same class. Given an unseen case, IBL classifies it as the majority class in its immediate neighborhood.

How do Bayesian belief networks specify which independence assumptions are valid and provide sets of conditional probabilities to specify the joint probability distribution wherever dependencies exist?

Specifying the conditional independence assumptions in a directed acyclic graph. A conditional probability table is provided for each node.

What is the best attribute?

The attribute that best discriminates the examples with respect to their classes, i.e., the one that gives the most information gain.
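
One common way to make "most information gain" concrete is to subtract the weighted entropy after the split from the entropy before it. The sketch below assumes entropy as the information measure; the toy rows and attribute names are invented for illustration.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(k / n) * log2(k / n) for k in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder

def best_attribute(rows, attributes, target):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical toy data set.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
chosen = best_attribute(rows, ["outlook", "windy"], "play")
```

Here "outlook" separates the classes perfectly (gain of 1 bit) while "windy" tells us nothing (gain of 0), so "outlook" is chosen.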

True or False: If we choose h halfway between S and G it increases the margin.

True

True or False: if we are not able to explain our expertise, we cannot write a computer program (algorithm)

True

As the amount of training data increases, generalization error _________.

decreases

As the complexity of the model class H increases, the generalization error _____________ first, then starts to ____________.

decreases, increase

For classification trees the goodness of the split is quantified by the __________ measure.

impurity, a split is pure if for all branches, all the instances choosing a branch belong to the same class. We look for a split that minimizes impurity because we want to generate the smallest tree.
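
As an illustration, the sketch below scores a split by the weighted entropy of its branches. Entropy is only one possible impurity measure and is an assumption here; the function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits of the class labels reaching one branch."""
    n = len(labels)
    return sum(-(k / n) * log2(k / n) for k in Counter(labels).values())

def split_impurity(branches):
    """Weighted average entropy over the branches of a split (0 means pure)."""
    total = sum(len(b) for b in branches)
    return sum(len(b) / total * entropy(b) for b in branches)

pure = split_impurity([["a", "a"], ["b", "b"]])    # every branch holds one class
mixed = split_impurity([["a", "b"], ["a", "b"]])   # every branch is a 50/50 mix
```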

What is the assumed model for regression and classification?

y= g(x| θ) , where g(•) is the model, θ is its parameters and y is a number (regression) or class (classification).

What happens when numerical attributes are involved?

• Discretization (treat them as categorical attributes). • Assume a distribution (e.g., Gaussian); the distribution can then be used to compute the probabilities.

What is the niche of machine learning?

Detect patterns or regularities to understand a process and/or make predictions.

In a K-class problem, we have _____ hypotheses to learn.

K

A K-class classification problem is _________________ problems.

K two-class

What does machine learning use?

Knowledge from the algorithm and data from a lookup table (prior knowledge).

What is the probability that event X will happen given Y has occurred?

P( X|Y)= P(X ⋂ Y)/ P(Y)

What are the key elements of a machine learner?

Task, associated performance measure, set of examples, set of assumptions, family of possible models (hypothesis class H)

What does Bayes theorem state?

That there is a relationship between P(X|Y) and P(Y|X). P(X ⋂ Y)= P(X|Y)P(Y)= P(Y|X)P(X)
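
Rearranging the identity gives P(Y|X) = P(X|Y)P(Y) / P(X). The sketch below applies it to made-up numbers; the scenario and variable names are assumptions for illustration only.

```python
def bayes_rule(p_x_given_y, p_y, p_x):
    """P(Y | X) = P(X | Y) * P(Y) / P(X)."""
    return p_x_given_y * p_y / p_x

# Hypothetical numbers: Y = "has condition", X = "test is positive".
p_y = 0.01
p_x_given_y = 0.9
p_x_given_not_y = 0.05

# Total probability: P(X) = P(X|Y)P(Y) + P(X|not Y)P(not Y)
p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)
p_y_given_x = bayes_rule(p_x_given_y, p_y, p_x)
```

Both sides of the identity can be checked numerically: `p_y_given_x * p_x` and `p_x_given_y * p_y` both equal P(X ⋂ Y).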

What is the margin?

The distance between the boundary and the instances closest to it.

What is prior knowledge?

information about a problem available in addition to the training data.

What approaches can you take when dealing with missing values?

• Discard training example. • Guess the value (Imputation)(if numerical ⇒ average, if categorical ⇒ most common). • Assign a probability to each possible value.
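
The imputation option can be sketched as follows. Representing a missing value as `None` is an assumption of this example, as is the helper name `impute`.

```python
from collections import Counter
from statistics import mean

def impute(values):
    """Fill None entries with the mean (numerical) or the most common value (categorical)."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

numeric = impute([1, None, 3])
categorical = impute(["a", None, "a", "b"])
```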

Why does using a simpler model make more sense (e.g., a rectangle instead of a wiggly box)?

• The model is simpler • It is simpler to train, the model has less variance but more bias. Finding the optimal model involves minimizing both the bias and the variance. • It is simple to explain. • A rectangle will be less affected by single instances, and will be a better discriminator. A simple model would generalize better than a complex model

Explain Shannon's information function

• When the outcomes are equally likely: for N equiprobable outcomes, Information = log2(N) bits = -log2(p) bits, where p = 1/N. • Non-equiprobable outcomes: being told the outcome usually gives you less information than being told the outcome of an experiment with two equiprobable outcomes.
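
A small numeric sketch of the information function. The second helper computes the expected information (entropy) of a distribution, which is one way to compare the equiprobable and non-equiprobable cases; the names and the example probabilities are chosen for illustration.

```python
from math import log2

def information_bits(p):
    """Information gained from an outcome of probability p, in bits."""
    return -log2(p)

def expected_information(probabilities):
    """Average information over all outcomes of an experiment."""
    return sum(p * information_bits(p) for p in probabilities if p > 0)

eight_way = information_bits(1 / 8)            # 8 equiprobable outcomes
fair = expected_information([0.5, 0.5])        # two equiprobable outcomes
skewed = expected_information([0.9, 0.1])      # two non-equiprobable outcomes
```

Eight equiprobable outcomes carry log2(8) = 3 bits, a fair binary outcome carries 1 bit, and the skewed binary outcome carries less than 1 bit on average.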

How can we make a machine extract an algorithm automatically? Two steps.

1. Collect a set of examples. 2. Build computer algorithms able to learn task-related algorithms from the data.

What are the three factors in "Triple Tradeoff"?

1. The complexity of the hypothesis we fit to data, namely, the capacity of the hypothesis class, 2. The amount of training data, and 3. The generalization error on new examples.

What percentage does overfitting reduce the accuracy?

20%

What is instance based learning?

A simple way of learning, save some examples and their categories and find the most similar stored example to predict where a candidate belongs to.
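
A minimal sketch of this idea as a k-nearest-neighbour classifier. Using Euclidean distance as the similarity measure and a majority vote for k > 1 are assumptions of the example; the stored points are made up.

```python
from collections import Counter
from math import dist

def knn_classify(stored, query, k=1):
    """Predict the majority label among the k stored examples closest to query."""
    nearest = sorted(stored, key=lambda example: dist(example[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical stored examples: (attributes, category).
stored = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"),
    ((5.5, 4.5), "B"),
]
near_a = knn_classify(stored, (1.1, 1.0), k=1)
near_b = knn_classify(stored, (5.0, 4.8), k=3)
```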

What is decision tree induction?

A system that takes data, learns a set of rules and attempts to build a decision tree that will correctly predict the class of any unclassified example.

What is supervised learning?

A teacher/supervisor provides a label/target for each example in the training set X = {(x^t, r^t)}, t = 1, ..., N. Using labelled examples, we learn a model which provides outputs (y) that are close enough to the observed (r) ones in the training set. We expect the model outputs to be close to the true process outputs on unseen data.

Finding a description (characteristic) shared by all positive examples (objects that fit the description we're looking for) and none of the negative examples to make a prediction or knowledge extraction.

Class learning

What is the solution used to build decision trees for numerical values?

Dividing value sets into smaller contiguous subranges and treating each membership as a categorical variable.
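
For example, a numeric temperature could be mapped to named subranges. The cut points and labels below are invented for illustration, and how the boundaries are chosen is left open.

```python
from bisect import bisect_right

def discretize(value, cut_points, labels):
    """Map a number to the label of the contiguous subrange it falls into.

    cut_points must be sorted, and labels needs len(cut_points) + 1 entries.
    """
    return labels[bisect_right(cut_points, value)]

# Hypothetical subranges: below 15, 15 up to 25, and 25 or above.
cut_points = [15, 25]
labels = ["cold", "mild", "hot"]
categories = [discretize(t, cut_points, labels) for t in (10, 15, 20, 30)]
```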

In an instance where an error is very costly, we can say any instance between S and G is a case of ________, which we cannot label with certainty due to lack of data.

Doubt

The proportion of training instances where predictions of h do not match the required values given in X.

Empirical error

The intuitive notion of the length of the vector x = (x1, x2, ..., xn). It gives the ordinary distance from the origin to the point x, a consequence of the Pythagorean theorem.

Euclidean norm
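
Written out, the norm is the square root of the sum of squared components. A short sketch, with a hypothetical function name:

```python
from math import sqrt

def euclidean_norm(x):
    """Length of the vector x: sqrt(x1^2 + x2^2 + ... + xn^2)."""
    return sqrt(sum(xi * xi for xi in x))

length = euclidean_norm((3, 4))  # the classic 3-4-5 right triangle
```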

Initial information is calculated using the class you will use in the terminal node.

Example: If you want to build a system that predicts precipitation from other attributes rain/ dry will be the terminal node.

True or false: a machine learning algorithm always learns what we want it to learn.

False

True or False: Any h ∈ H outside S and G is a valid hypothesis.

False, any h ∈ H between S and G is a valid hypothesis.

True or False: It is impossible to prune a term from one rule without touching other rules.

False.

What is reinforcement learning?

Feedback is only gotten after a complete sequence of actions, there is no explicit scoring of individual actions, the goodness of each action depends on the actions that follow.

A task may change with time or situation, we need a ___________ framework to perform the task

Flexible

How well a model trained on the training set predicts the right output for new instances. For best results, we should match the complexity of the hypothesis class H with the complexity of the function underlying the data.

Generalization

How well our hypothesis will correctly classify future examples.

Generalization

What is clustering?

Given a set of unlabelled data points, the algorithm forms groups so that points within each cluster are similar to each other and points from different clusters are different. This requires a meaningful similarity or distance measure.

What kind of problems does data mining address?

Handling large quantities of data and extracting potentially valuable information.

What are the two established approaches when some data is absent for some variables but the network structure is known?

Hill climbing along the line of steepest ascent, and the EM algorithm.

What are the differences between Machine Learning and Data Mining?

MACHINE LEARNING: Output: an algorithm. The information extracted is used by a machine to make autonomous decisions in the future. Learning can be automated. DATA MINING: Output: new information. The information is used by humans to make processes more effective and profitable.

Is data itself enough to extract algorithms?

No, we still need an expert's knowledge to design the ML system and provide prior knowledge

Any unwanted anomaly in the data, this makes the class more difficult to learn and zero error cannot be accomplished with a simple hypothesis.

Noise

Input is divided into local regions and defined by a distance measure like the Euclidean norm. We do not assume any parametric form for the class densities, the structure is not fixed, it depends on the complexity of the problem (more complex, more branches and leaves)

Nonparametric estimations

If there is noise, an overcomplex hypothesis may learn not only the underlying function but also the noise in the data and make a bad fit.

Overfitting

X and Y are said to be independent if...

P(X⋂Y)=P(X)P(Y) No statistical relationship between X and Y Knowing the value of one doesn't help in predicting the other

Define a model using the inputs and learn its parameters from the training data, you can then use a model for tests.

Parametric estimation

What is rule support?

The percentage of training data covered by the rule. Rules with high support show the main characteristics of the dataset, its important features and split positions.

What are data mining tools?

Specialized packages that provide procedures that can be applied to data, such as visualization tools (visual display of data points that enables the user to exploit the pattern recognition capabilities of the visual system) and analytic tools (which provide a range of machine learning and statistical techniques and often use graphics to enable the user to construct sequences of operations using icons).

What are association rules?

Statements that certain groups of items or events tend to occur together. We learn a conditional probability P(Y|X), where Y is the product we want to condition on X.

What is Occam's Razor?

States that simpler explanations are more plausible and any unnecessary complexity should be shaved off.

What is prepruning?

Stopping tree construction early, before it is full.

When building a decision tree, which attribute should you use to begin the tree?

The best attribute, which provides the most information gain.

What is the difference between Bayes classifiers and Naïve Bayes classifiers?

Naïve Bayes assumes the attributes are conditionally independent given the class, so the number of conditional probabilities required is drastically reduced.

What is the empirical error?

The empirical error is the proportion of training instances where predictions of h do not match the required values given in X.
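
The definition translates almost directly into code. In this sketch the hypothesis `h` is any callable and the training set is a list of (x, r) pairs; the threshold rule and the data are hypothetical.

```python
def empirical_error(h, training_set):
    """Proportion of training instances (x, r) where h(x) differs from r."""
    mismatches = sum(1 for x, r in training_set if h(x) != r)
    return mismatches / len(training_set)

# Hypothetical training set and a simple threshold hypothesis.
X = [(1, 0), (2, 0), (3, 1), (4, 1), (5, 0)]
h = lambda x: int(x >= 3)
error = empirical_error(h, X)  # only the instance (5, 0) is misclassified
```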

What is data mining?

The field is new and evolving, there is no clear definition of it. It is loosely the development and application of computer tools to assist in the discovery of useful information in large databases.

What are the assumptions for machine learning?

The process explains the data we observe, a general model is learned from a limited amount of data and prior knowledge, we can build a good and useful approximation even if we can't identify the process completely, a learned model should be able to make a sufficiently accurate prediction in previously unseen cases.

Conditional independence...

There is no statistical relationship between two variables in the case that a third variable has a particular value. P(X|Y⋂Z)= P(X|Z)

What is the loss function used for?

To compute the difference between the desired output r^t and our approximation. The approximation error or loss is the sum of losses over the individual instances.
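
A sketch of the summation with two commonly used per-instance losses. The choice of squared loss for regression and 0/1 loss for classification is an assumption of this example, not something the card specifies.

```python
def squared_loss(r, y):
    """Per-instance loss often used for regression."""
    return (r - y) ** 2

def zero_one_loss(r, y):
    """Per-instance loss often used for classification."""
    return 0 if r == y else 1

def total_loss(loss, desired, predicted):
    """Sum of per-instance losses over the whole set."""
    return sum(loss(r, y) for r, y in zip(desired, predicted))

regression_loss = total_loss(squared_loss, [1, 2, 3], [1, 2, 5])
classification_loss = total_loss(zero_one_loss, ["a", "b"], ["a", "c"])
```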

What is unsupervised learning?

We don't have labels for the examples in the training set X = {x^t}, t = 1, ..., N. The system forms clusters (density estimation), finds patterns and hidden structure in unlabeled data. There is no explicit error or reward to evaluate a potential solution. There is no supervisor, just input data.

What is postpruning?

We grow the tree fully until all leaves are pure and have no error, then find subtrees that cause overfitting and get rid of them. We replace the subtree with a leaf node.

Why is prior knowledge important in Machine Learning?

Without prior knowledge, learning an ML model from finite and incomplete data is an ill-posed problem. Using prior information, one can improve performance by favoring solutions that are known to perform better in similar situations.

Each terminal node (leaf) is associated with _________.

a class

Each non-terminal node is associated with one ______________.

attribute

A decision tree is a hierarchical data structure implementing the _____________ strategy.

divide-and-conquer

If x is outside the range of x^t in the training set, predicting its output is called ____________.

extrapolation

Maximum A Posteriori hypothesis...

h_MAP = arg max P(E|h)P(h), taken over h ∈ H, where E could be the ⋂ of two events.
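
A minimal sketch of the arg max over a finite set of hypotheses. The priors P(h) and likelihoods P(E|h) are hypothetical numbers, and the dictionary-based representation is an assumption of this example.

```python
def map_hypothesis(priors, likelihoods):
    """Return the hypothesis h that maximizes P(E | h) * P(h)."""
    return max(priors, key=lambda h: likelihoods[h] * priors[h])

# Hypothetical values for two candidate hypotheses.
priors = {"h1": 0.7, "h2": 0.3}        # P(h)
likelihoods = {"h1": 0.2, "h2": 0.6}   # P(E | h)
h_map = map_hypothesis(priors, likelihoods)
```

Although h1 has the larger prior, h2 wins because 0.6 × 0.3 = 0.18 exceeds 0.2 × 0.7 = 0.14.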

What is knowledge extraction?

learning a rule from data, explanation for the underlying process.

In ____________ _______________ we find the N-1st degree polynomial that we can use to predict the output for any x.

polynomial interpolation
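
One way to evaluate the polynomial of degree N-1 through N points is the Lagrange form, sketched below. The choice of the Lagrange form and the sample points are assumptions for illustration; the x values must be distinct.

```python
def lagrange_interpolate(points, x):
    """Evaluate at x the degree N-1 polynomial passing through the N given points."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if j != i:
                # Basis factor that is 1 at xi and 0 at xj
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Three points sampled from y = x^2, so the fitted polynomial has degree 2.
points = [(0, 0), (1, 1), (2, 4)]
prediction = lagrange_interpolate(points, 3)
```

Because the three points lie on y = x^2, the interpolating polynomial reproduces it and predicts 9 at x = 3.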

What is "Transformation invariance"?

the output can be assumed to be invariant to some transformations in the input pattern.

If H is less complex than the function we have _____________.

underfitting

Each branch is associated with a particular ________________ that the attribute of its parent node can take.

value

The scalar/dot/inner product of two vectors is ______ if the vectors are orthogonal.

zero
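
A quick numeric check, with a hypothetical helper name and example vectors:

```python
def dot(u, v):
    """Scalar (dot) product: sum of the pairwise products of the components."""
    return sum(ui * vi for ui, vi in zip(u, v))

orthogonal = dot((1, 2), (-2, 1))      # perpendicular vectors
not_orthogonal = dot((1, 2), (2, 1))   # vectors at an acute angle
```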

What are some limitations of ID3?

• Dealing with missing values • Inconsistent Data • Incrementality • Handling numerical attributes • Attribute value grouping • Alternative attribute selection criteria • Overfitting problem

