ML exam
Suppose a single perceptron with three inputs x1, x2, and x3. Can this perceptron be trained (or assigned weights and a threshold) to learn the functions y1 and/or y2 below? y1 = At least two out of three of x1, x2, and x3. y2 = Exactly two out of three of x1, x2, and x3.
A single perceptron can learn only y1
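A quick check (a minimal sketch; the weights w = (1, 1, 1) and threshold 1.5 are one illustrative choice, not part of the question): this unit fires exactly when at least two inputs are 1, which is y1. No weights/threshold can produce y2, because it would have to fire on every two-of-three input yet stay off for (1, 1, 1), and that output pattern is not linearly separable.

    from itertools import product

    w, threshold = (1, 1, 1), 1.5   # one workable choice for y1

    for x in product([0, 1], repeat=3):
        fires = sum(wi * xi for wi, xi in zip(w, x)) > threshold
        y1 = sum(x) >= 2            # at least two of three
        y2 = sum(x) == 2            # exactly two of three
        # matches y1 on every row, fails y2 on (1, 1, 1)
        print(x, fires, fires == y1, fires == y2)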
Ensemble Learning
•Combine simple (ineffective) rules into one effective complex rule
Bagging & Boosting
•Combining simple learners into a more complex learner is very effective
We say that a learning problem is _____ if the hypothesis space contains the true function
realizable
hypothesis space
set of all possible hypotheses
Training Set
subset of a dataset from which the machine learning algorithm uncovers or "learns" relationships between the features and the target variable
supervised learning
the agent observes some example input-output pairs and learns a function that maps from input to output
feature selection
the process of selecting attributes which are most predictive of the class we are predicting
Bagging
•Generate K data sets by sampling with replacement from the original training set, each of size N
•Run our learning algorithm on each individual data set
•We now have K learners/hypotheses
•Run new inputs through all of the learners
•Classification: Take majority vote across all K learners
•Regression: Take mean across all K learners
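A minimal sketch of that procedure (assuming a scikit-learn-style learner factory with fit/predict; base_learner, X, y are placeholder names and class labels are assumed to be small non-negative integers):

    import numpy as np

    def bagging_fit(base_learner, X, y, K):
        # Train K learners, each on a bootstrap sample of size N (with replacement).
        N = len(X)
        learners = []
        for _ in range(K):
            idx = np.random.randint(0, N, size=N)       # sample with replacement
            learners.append(base_learner().fit(X[idx], y[idx]))
        return learners

    def bagging_predict(learners, X_new, regression=False):
        preds = np.array([h.predict(X_new) for h in learners])   # K x n_new
        if regression:
            return preds.mean(axis=0)                   # mean across the K learners
        # classification: majority vote per input
        return np.apply_along_axis(
            lambda votes: np.bincount(votes.astype(int)).argmax(), 0, preds)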
Random Forest
•Specifically for decision trees
•With bagging, we get a lot of similar trees because information gain will be similar
•Random forest allows for a more diverse collection of trees
•For K trees: at each split, pick a random subset of attributes, and choose the attribute from that subset with the highest information gain
•Take majority vote across trees
•Resistant to overfitting, no need for pruning
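A minimal sketch of the split step that differs from plain bagging (attributes is a list of candidate attributes, gain is a caller-supplied information-gain function, and subset_size is often around the square root of the number of attributes; all of these names are placeholders):

    import random

    def choose_split_attribute(attributes, gain, subset_size):
        # Consider only a random subset of attributes at this split,
        # then pick the one with the highest information gain.
        subset = random.sample(attributes, min(subset_size, len(attributes)))
        return max(subset, key=gain)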
VC dimension
•The size of the largest set of inputs that the hypothesis class can shatter (i.e., label in all possible combinations). For example, linear separators in the plane can shatter 3 points but not 4, so their VC dimension is 3.
Which of these best describes what Haussler's theorem provides?
A lower bound on the number of samples required for a learner to PAC-learn the hypothesis space.
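The bound usually quoted with Haussler's theorem for a finite hypothesis space H, with error tolerance epsilon and failure probability delta, is m >= (1/epsilon) * (ln|H| + ln(1/delta)); a minimal sketch:

    from math import ceil, log

    def haussler_samples(H_size, epsilon, delta):
        # m >= (1/epsilon) * (ln|H| + ln(1/delta))
        return ceil((log(H_size) + log(1.0 / delta)) / epsilon)

    print(haussler_samples(H_size=10**6, epsilon=0.1, delta=0.05))   # 169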
Supervised learning is best described as
learning an approximation of a function from known inputs to known outputs, so that it can be applied to new inputs
Eager Learners
Decision Tree and Neural Networks
An attribute should never appear in a decision tree more than once.
False
When we have no domain knowledge about a classification problem, which method is considered the best first algorithm to employ?
Random forests
Regression
When y is a number (such as tomorrow's temperature), the learning problem is called regression
Boosting
Examples the current learner gets wrong are given more weight for the next learner, and examples it gets right are given less weight
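A minimal sketch of one AdaBoost-style round of that reweighting (labels and predictions are assumed to be +1/-1 numpy arrays; names are placeholders, and the sketch assumes 0 < error < 1):

    import numpy as np

    def boosting_round(weights, y_true, y_pred):
        # Weighted error of the current learner.
        err = np.sum(weights[y_true != y_pred]) / np.sum(weights)
        alpha = 0.5 * np.log((1 - err) / err)   # this learner's vote weight
        # Misclassified examples (y_true * y_pred == -1) get heavier,
        # correctly classified examples get lighter.
        weights = weights * np.exp(-alpha * y_true * y_pred)
        return weights / weights.sum(), alpha   # renormalize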
model selection
model selection defines the hypothesis space and then optimization finds the best hypothesis within that space.
Consider a learning problem where each instance has 3 inputs and a single output. All inputs and outputs are continuous. How many columns would be in the X matrix (representing the inputs) when trying to find the best fit quadratic (degree 2) function?
7
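One way to see the count (a sketch assuming this course's convention of a bias column plus linear and squared terms, with no cross terms): 1 bias + 3 linear + 3 squared = 7 columns.

    import numpy as np

    def quadratic_design_matrix(X):
        # Columns: 1, x1, x2, x3, x1^2, x2^2, x3^2  ->  7 for 3 inputs
        ones = np.ones((X.shape[0], 1))
        return np.hstack([ones, X, X ** 2])

    X = np.random.rand(5, 3)
    print(quadratic_design_matrix(X).shape)   # (5, 7)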
entropy
A measure of disorder or randomness.
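For a set of class proportions p_i, entropy is -sum(p_i * log2(p_i)); a minimal sketch:

    from math import log2

    def entropy(proportions):
        # 0 bits for a pure node, 1 bit for a 50/50 binary split
        return -sum(p * log2(p) for p in proportions if p > 0)

    print(entropy([0.5, 0.5]))   # 1.0
    print(entropy([1.0]))        # 0.0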
Which of the following best describes the process of building a single tree during the random forest approach?
At each new node, select a subset of features to consider.
In class, we have discussed the preference and restriction bias of many algorithms we have discussed. For example, we discussed that our algorithm prefers shorter decision trees, and we've discussed that a neural network with one hidden layer is restricted to representing continuous functions (and therefore cannot represent discontinuous functions). We did not discuss the preference and restriction bias of boosting. What are the biases of boosting?
Boosting's biases are whatever the biases of the underlying learners are. (If we use decision trees, we inherit the biases of decision trees. If we use SVMs, we inherit the biases of SVMs, and so on.)
decision tree pruning
Combats overfitting by eliminating nodes that are not clearly relevant.
validation set
subset of a dataset held out from training and used to check how accurately the learned model predicts the known values of the target variable from the dataset's other features
Suppose that I run a business, where the sales amounts change drastically day-to-day. Which of the following is the best reason that I might use a decision tree (rather than other supervised learning techniques we have covered) to predict how much my customers will spend? (Select all answers that apply.)
Decision trees will ignore potentially irrelevant features.
Sample complexity is the only valuable measure for comparing the complexity of training different learners.
False
When the size of the hypothesis space is infinite, an infinite number of samples are required to PAC-learn the hypothesis space.
False
What do we do when multiple nearest neighbors are tied (e.g., the same distance)?
Include both of them
Lazy learner
KNN
When to use KNN
Works well with lots of training data and few features; as the number of features grows, the amount of data needed increases drastically (the curse of dimensionality)
What is the purpose of the kernel trick?
Make data linearly separable by mapping the data to a higher dimension.
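A small illustration of the mapping idea (a sketch of the explicit mapping only; the kernel trick itself avoids ever computing this mapping): 1-D points that are not separable by a single threshold become linearly separable after adding x^2 as a second dimension.

    import numpy as np

    x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
    y = np.array([1, 0, 0, 0, 1])        # outer points vs. inner points

    # No threshold on x alone separates the classes, but after mapping
    # x -> (x, x^2) the line x^2 = 2 does.
    phi = np.column_stack([x, x ** 2])
    print(phi)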
Which of the following best describes the calculation of error for linear regression?
The sum of the vertical distances between each data point and the line of best fit
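A minimal sketch of that error (in practice the vertical distances are squared before summing, giving the sum of squared errors):

    import numpy as np

    def sse(y_true, y_pred):
        # Sum of squared vertical distances between points and the fitted line
        return np.sum((y_true - y_pred) ** 2)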
Issues that may be present in decision trees
Missing data; multivalued attributes; continuous and integer-valued input attributes (these would give infinitely many branches, so instead find the split point that gives the highest information gain); and continuous-valued output attributes (more applicable in regression)
Suppose that I run a business, where the sales amounts change drastically day-to-day. Which of the following is the best reason that I might use a neural network (rather than other supervised learning techniques we have covered) to predict how much my customers will spend? (Select all answers that apply.)
Neural networks are a better match for regression problems.
reinforcement learning
Perform an action, then learn from the resulting reward or punishment
KNN with classification
Plurality vote
KNN with regression
Return the mean of the neighbors' y values
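A minimal sketch covering both cases (plurality vote for classification, mean for regression; X, y are placeholder training arrays and labels are assumed to be plain ints or floats):

    import numpy as np
    from collections import Counter

    def knn_predict(X, y, query, k, regression=False):
        dists = np.linalg.norm(X - query, axis=1)    # distance to every training point
        nearest = y[np.argsort(dists)[:k]]           # labels of the k closest
        if regression:
            return nearest.mean()                    # mean of the neighbors' y values
        return Counter(nearest.tolist()).most_common(1)[0][0]   # plurality vote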
stationarity assumption
States that transition probabilities do not change with time
overfitting
The process of fitting a model too closely to the training data for the model to be effective on other data.
Which of the following best describes bagging?
Train n learners, with each one trained on a subset of the data (which may overlap).
Which of these is NOT a bias of decision trees? (a) Trees with high information gain attributes at the root are preferred; (b) Correct trees are preferred; (c) Shorter trees are preferred; (d) Trees that use all attributes are preferred
Trees that use all attributes are preferred
A hypothesis space H is PAC-learnable if and only if the VC dimension is finite
True
information gain
a measure of the predictive power of one or more attributes
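A minimal sketch (entropy here is computed from label counts; parent is the label list before a split, children the label lists after it):

    from collections import Counter
    from math import log2

    def entropy_of_labels(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(parent, children):
        # Entropy of the parent minus the weighted entropy of the children
        n = len(parent)
        weighted = sum(len(c) / n * entropy_of_labels(c) for c in children)
        return entropy_of_labels(parent) - weighted

    print(information_gain([0, 0, 1, 1], [[0, 0], [1, 1]]))   # 1.0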
regularization
This process of explicitly penalizing complex hypotheses is called regularization (because it looks for a function that is more regular, or less complex).
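A minimal sketch of a regularized (ridge-style) loss, where lam controls how strongly complexity is penalized (names are placeholders):

    import numpy as np

    def ridge_loss(w, X, y, lam):
        # Squared error plus a penalty on the size of the weights
        residuals = X @ w - y
        return residuals @ residuals + lam * (w @ w)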
neural networks
networks of interconnected neuron-like units; with experience, a network can learn, as feedback strengthens or inhibits the connections that produce certain results. Artificial neural networks are computer simulations of this kind of learning.
Ockham's razor
prefer the simplest hypothesis consistent with the data