Machine Learning Final

What are Bayesian networks (BNs)? List BN components and importance

A Bayesian network (also known as a Bayes network, belief network, graphical network, or decision network) is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG).
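A minimal sketch of how a DAG encodes a joint distribution as a product of per-node conditional probabilities. The three-variable network (Rain -> WetGrass <- Sprinkler) and every probability below are made-up assumptions for illustration, not taken from the course material.

```python
from itertools import product

# Hypothetical DAG: Rain -> WetGrass <- Sprinkler. All numbers are invented.
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER = {True: 0.1, False: 0.9}
# P(WetGrass = True | Rain, Sprinkler)
P_WET_GIVEN = {
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.80,
    (False, False): 0.05,
}

def joint(rain, sprinkler, wet):
    """Joint probability as the product of each node's conditional probability."""
    p_wet = P_WET_GIVEN[(rain, sprinkler)]
    return P_RAIN[rain] * P_SPRINKLER[sprinkler] * (p_wet if wet else 1.0 - p_wet)

def prob_rain_given_wet():
    """P(Rain = True | WetGrass = True), obtained by summing out Sprinkler."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    denominator = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
    return numerator / denominator
```

Observing wet grass raises the probability of rain above its prior of 0.2, which is the kind of inference the graph structure makes cheap to compute.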

Define dendrograms, then illustrate how dendrograms work with a diagram.

A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering. The main use of a dendrogram is to work out the best way to allocate objects to clusters. A clustering tree works as follows: it is a binary tree that shows how clusters are merged or split hierarchically. Each node of the tree is a cluster, and each leaf node is a singleton cluster.

What is machine learning (ML)? What are the main ML types? Which ML algorithms did you study after the midterm exam? Which is more important to you, model accuracy or model performance? Support your answer with an example.

Machine learning is a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data or to perform other kinds of decision making under uncertainty (such as planning how to collect more data).
The main types are:
· Supervised Learning
· Unsupervised Learning
· Semi-Supervised Learning
· Reinforcement Learning
The machine learning algorithms we have studied include:
· Decision Trees (DTs)
· Recommender Systems
· Large Scale and Online Learning
· Ensemble Learning
· Autoencoders
· k-Nearest Neighbors (kNNs)
· Principal Components Analysis (PCA)
· K-Means Clustering
· Hierarchical Clustering
· Bayesian Networks
· Reinforcement Learning
· Hidden Markov Model (HMM)
In my opinion, model accuracy and model performance are both very important. For example, when a vehicle in front of an autonomous vehicle slams on its brakes, we want the model to quickly decide how much brake to apply so that we neither hit the car in front nor get rear-ended by the car behind. In this case the model must decide both accurately and quickly, which makes both accuracy and performance (speed) important.

What are the advantages and disadvantages of hierarchical clustering?

Advantages · Hierarchical clustering outputs a hierarchy, i.e. a structure that is more informative than the unstructured set of flat clusters returned by k-means. Therefore, it is easier to decide on the number of clusters by looking at the dendrogram · Easy to implement Disadvantages · It is not possible to undo the previous step: once the instances have been assigned to a cluster, they can no longer be moved around · Time complexity: not suitable for large datasets · Initial seeds have a strong impact on the final results · The order of the data has an impact on the final results · Very sensitive to outliers

List advantages and disadvantages of k-Means

Advantages · Super easy to implement · Works with big data; k-means may be computationally faster than other methods · k-Means may produce tighter clusters than hierarchical clustering · An instance can change cluster (move to another cluster) when the centroids are recomputed Disadvantages · Difficult to predict the number of clusters (the value of K) · Initial seeds have a strong impact on the final results · The order of the data has an impact on the final results · Sensitive to scale: rescaling your dataset (normalization or standardization) will completely change the results. While this in itself is not bad, not realizing that you have to spend extra time scaling your data might be

List advantages and disadvantages of decision trees

Advantages and disadvantages of decision trees are as follows: · Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression! · Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters. · Trees can be displayed graphically and are easily interpreted even by a non-expert (especially if they are small). · Trees can easily handle qualitative predictors without the need to create dummy variables. · Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches. However, aggregating many decision trees can improve the predictive performance of trees substantially.

List Advantages and Disadvantages of k-NNs.

Advantages of KNN are:
· Simple technique that is easily implemented
· Building the model is inexpensive
· Extremely flexible classification scheme
  o Does not involve preprocessing
· Well suited for
  o Multi-modal classes (classes of multiple forms)
  o Records with multiple class labels
· Can sometimes be the best method
  o K nearest neighbor outperformed SVM for protein function prediction using expression profiles
Disadvantages of KNN are:
· Classifying unknown records is relatively expensive
  o Requires distance computation of k-nearest neighbors
  o Computationally intensive, especially when the size of the training set grows
· Accuracy can be severely degraded by the presence of noisy or irrelevant features
· NN classification expects the class conditional probability to be locally constant

List advantages and disadvantages of both collaborative filtering and content-based recommenders.

Advantages of content-based recommender systems are: · Work even when a product has no user reviews Disadvantages of content-based recommender systems are: · Need descriptive data for every product you want to recommend · Difficult to implement for many kinds of large product databases Advantages of collaborative filtering recommenders are: · Do not require any knowledge of the products themselves Disadvantages of collaborative filtering recommenders are: · Cannot recommend a product that has no user reviews · Difficult to make recommendations for brand new users · Tend to favor popular products with lots of reviews

What are autoencoders? List the general types of autoencoders based on size of hidden layer?

An autoencoder neural network is an unsupervised machine learning algorithm that applies backpropagation, setting the target (output) values to be equal to the inputs. Autoencoders are used to reduce the size of the inputs into a smaller representation. If anyone needs the original data, they can reconstruct it (approximately) from the compressed data. The latent space is simply a representation of compressed data in which similar data points are closer together in space. Latent space is useful for learning data features and for finding simpler representations of data for analysis. General types of autoencoders based on hidden layer size are: · Undercomplete · Overcomplete

What is cluster analysis? What are the typical applications of cluster analysis? List, then define the two approaches to hierarchical clustering.

Cluster analysis (or clustering, data segmentation, ...) is finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters. Typical applications: · As a stand-alone tool to get insight into data distribution · As a preprocessing step for another algorithm Agglomerative: a bottom-up strategy · Initially each data object is in its own (atomic) cluster · Then merge these atomic clusters into larger and larger clusters Divisive: a top-down strategy · Initially all objects are in one single cluster · Then the cluster is subdivided into smaller and smaller clusters

List, then define the common clustering algorithms

Common clustering algorithms include: · K-Means clustering: partitions data into k distinct clusters based on distance to the centroid of a cluster · Hierarchical clustering: builds a multilevel hierarchy of clusters by creating a cluster tree · Gaussian mixture models: model clusters as a mixture of multivariate normal density components · Self-organizing maps: use neural networks that learn the topology and distribution of the data · Hidden Markov models: use observed data to recover the sequence of states

Define decision trees. List the main types of decision trees and the ensemble methods that construct more than one decision tree in a single application.

A decision tree (DT) is a function that takes a vector of values as input and returns a decision, i.e. a single output value. The main types of trees are: · ID3 (Iterative Dichotomiser 3) · C4.5 (successor of ID3) · CART (Classification And Regression Tree) · CHAID (CHi-squared Automatic Interaction Detector) · MARS: extends decision trees to handle numerical data better · Conditional Inference Tree The ensemble trees include: · Bagging decision tree · Random forest · Boosted tree · Rotation forest · Decision list: a special case

What is the decision tree algorithm?

The decision tree algorithm is as follows:
1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.
3. Use K-fold cross-validation to choose α. For each k = 1, ..., K:
  o Repeat Steps 1 and 2 on all but the kth fold of the training data.
  o Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of α.
  Average the results for each value of α, and pick α to minimize the average error.
4. Return the subtree from Step 2 that corresponds to the chosen value of α.

How can we derive the new dataset in the PCA process (step 5)?

Deriving the new data · Final Data = Row Feature Vector × Row Zero Mean Data · Row Feature Vector is the matrix with the eigenvectors in the columns, transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top · Row Zero Mean Data is the mean-adjusted data transposed, i.e. the data items are in each column, with each row holding a separate dimension · Final Data is the final dataset, with data items in columns and dimensions along rows
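The product above can be sketched in plain Python for the two-dimensional case. This is an illustration under assumptions: the function names are hypothetical, the covariance uses the n - 1 denominator, and the eigenvectors come from the closed-form principal-axis angle of a symmetric 2x2 matrix rather than a general eigen-solver.

```python
import math

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def covariance_2d(zero_mean_rows):
    """Entries (sxx, sxy, syy) of the covariance matrix of mean-adjusted 2-D data."""
    n = len(zero_mean_rows)
    sxx = sum(x * x for x, _ in zero_mean_rows) / (n - 1)
    sxy = sum(x * y for x, y in zero_mean_rows) / (n - 1)
    syy = sum(y * y for _, y in zero_mean_rows) / (n - 1)
    return sxx, sxy, syy

def row_feature_vector_2d(zero_mean_rows):
    """Unit eigenvectors as rows, most significant eigenvector at the top."""
    sxx, sxy, syy = covariance_2d(zero_mean_rows)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # angle of the principal axis
    return [[math.cos(theta), math.sin(theta)],
            [-math.sin(theta), math.cos(theta)]]

def derive_final_data(zero_mean_rows):
    """Final Data = Row Feature Vector x Row Zero Mean Data (items in columns)."""
    return matmul(row_feature_vector_2d(zero_mean_rows), transpose(zero_mean_rows))
```

For points lying exactly on the line y = x, the second row of the result is all zeros, so dropping it loses nothing; that is the dimensionality-reduction step.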

What are the different methods for changing training data? List them, then illustrate the working mechanism of each method, supporting each with an illustration diagram.

Different methods for changing training data are:
· Bagging (Bootstrap Aggregation): resample training data
  o Create ensembles by "bootstrap aggregation", i.e. repeatedly randomly resampling the training data (Breiman, 1996).
  o Bootstrap: draw N samples from the training set X with replacement
  o Bagging
    § Train M learners on M bootstrap samples
    § Combine outputs by voting (e.g., majority vote)
  o Decreases error by decreasing the variance in the results due to unstable learners, i.e. algorithms (like decision trees and neural networks) whose output can change dramatically when the training data is slightly changed
· Boosting: reweight training data
  o Originally developed by computational learning theorists to guarantee performance improvements on fitting training data for a weak learner that only needs to generate a hypothesis with a training accuracy greater than 0.5 (Schapire, 1990).
  o Revised to be a practical algorithm, AdaBoost, for building ensembles that empirically improves generalization performance (Freund & Schapire, 1996).
  o Key insights
    § Instead of sampling (as in bagging), reweight examples!
    § Examples are given weights. At each iteration, a new hypothesis is learned (weak learner), and the examples are reweighted to focus the system on examples that the most recently learned classifier got wrong.
    § The final classification is based on the weighted vote of the weak classifiers
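A minimal sketch of the bagging half of this answer: draw bootstrap samples, train one learner per sample, and combine by majority vote. The weak learner here (a one-dimensional threshold "stump" that assumes class 1 has the larger feature values) and all function names are hypothetical choices for illustration.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Hypothetical weak learner on (x, label) pairs with labels 0/1.

    Assumes class 1 tends to have larger x; the threshold is the midpoint
    between the two class means.
    """
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:  # degenerate sample containing a single class
        only = 1 if xs1 else 0
        return lambda x: only
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1 if x > threshold else 0

def bagging_fit(data, m, seed=0):
    """Train m learners on m bootstrap samples."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(data, rng)) for _ in range(m)]

def bagging_predict(learners, x):
    """Combine the learners' outputs by majority vote."""
    votes = Counter(learner(x) for learner in learners)
    return votes.most_common(1)[0][0]
```

Using an odd number of learners avoids ties in a two-class vote. Boosting would differ by keeping all examples and reweighting them between rounds instead of resampling.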

Define ensemble learning, illustrate its key motivation, then draw the general idea diagram of ensemble learning.

Ensemble learning is a machine learning paradigm where multiple learners (algorithms) are trained to solve the same problem. In contrast to ordinary machine learning approaches, which try to learn one hypothesis from the training data, ensemble methods try to construct a set of hypotheses and combine them for use. The key motivation for ensemble learning is to reduce the error rate: the hope is that it becomes much less likely that the ensemble will misclassify an example.

List the ensemble methods that minimize variance and bias.

Ensemble methods that minimize variance: · Bagging · Random forest Ensemble methods that minimize bias: · Functional Gradient Descent · Boosting · Ensemble Selection

List the key elements AND components of autoencoders? Then illustrate the components

Key elements of autoencoders are: · It is an unsupervised ML algorithm similar to PCA · It minimizes the same objective function as PCA · It is a neural network · The neural network's target output is its input Key components of autoencoders are: · Encoder: this part of the neural network compresses the input into a latent space representation · Code: this part represents the compressed input that is fed into the decoder · Decoder: this part aims to reconstruct input from latent space representation

List the k-Nearest Neighbors (k-NNs) Main Steps.

Main steps of KNN are:
· For a given instance T, get the top k dataset instances that are "nearest" to T
  o Select a reasonable distance measure
· Inspect the categories of these k instances and choose the category C that represents the most instances
· Conclude that T belongs to category C
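These steps can be sketched in a few lines. The function name and the choice of Euclidean distance (via the standard library's `math.dist`) are assumptions for illustration; any reasonable distance measure could be substituted.

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """Classify query from training, a list of (feature_tuple, label) pairs."""
    # Step 1: the k training instances nearest to the query.
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    # Step 2: the category that represents the most of those k instances.
    votes = Counter(label for _, label in nearest)
    # Step 3: conclude that the query belongs to that category.
    return votes.most_common(1)[0][0]
```

Note that all the work happens at query time: there is no training step beyond storing the data, which is why building the model is cheap and classification is expensive.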

Define principal components analysis (PCA), then list the 3 main things it can be used for and 3 application examples.

Principal components analysis (PCA) is a technique that can be used to simplify a dataset. It is a linear transformation that chooses a new coordinate system for the dataset such that: · The greatest variance by any projection of the dataset comes to lie on the first axis (then called the first principal component). · The second greatest variance lies on the second axis, and so on. PCA can be used for reducing dimensionality by eliminating the later principal components. Dimensionality reduction (or dimension reduction) can be used to: · Reduce the number of dimensions in data · Find patterns in high-dimensional data · Visualize data of high dimensionality Applications: · Face recognition · Image compression · Gene expression analysis

What are the pros and cons of decision trees?

Pros and cons of decision trees are as follows: · Tree-based methods are simple and useful for interpretation. · However, they typically are not competitive with the best supervised learning approaches in terms of prediction accuracy. · Advanced methods such as bagging, random forests, and boosting grow multiple trees, which are then combined to yield a single consensus prediction. · Combining many trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss of interpretability.

Define recommendation systems. Why use recommender systems?

Recommender systems (RSs) are software agents that elicit the interests and preferences of individual consumers and make recommendations accordingly. They have the potential to support and improve the quality of the decisions consumers make while searching for and selecting products online. We should use recommender systems because they provide the following value for the customer and the provider.
· Value for the customer
  o Find things that are interesting
  o Narrow down the set of choices
  o Help to explore the space of options
  o Discover new things
  o Entertainment
· Value for the provider
  o Additional and probably unique personalized service for the customer
  o Increase trust and customer loyalty
  o Increase sales, click-through rates, conversion, etc.
  o Opportunities for promotion, persuasion
  o Obtain more knowledge about customers

What are the three things required to implement k-NN? How do you classify an unknown instance (sample) using k-NN? What are the two common distance metrics used for k-NN?

Requires three things: · Feature space (training data) · Distance metric (to compute the distance between instances) · The value of k (the number of nearest neighbors to retrieve, from which to get the majority class) To classify an unknown instance: · Compute the distance to the other training instances · Identify the k nearest neighbors (k-NNs) · Use the class labels of the nearest neighbors to determine the class label of the unknown instance Common distance metrics: · Euclidean distance (continuous distribution): the square root of the sum of the squared differences between a new point (x) and an existing point (y) · Manhattan distance: the distance between real vectors, computed as the sum of their absolute differences
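The two metrics can be written out directly from their definitions; a minimal sketch with hypothetical function names:

```python
import math

def euclidean_distance(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan_distance(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5 and the Manhattan distance is 7, so the choice of metric can change which neighbors count as nearest.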

What are the idea, algorithm, and types of instance-based learning?

The idea of instance-based learning is: · Similar examples have similar labels. · Classify new examples like similar training examples. The algorithm of instance-based learning is: · Given some new example x for which we need to predict its class y · Find the most similar training examples · Classify x "like" these most similar examples Types of instance-based learning are: · Rote-learner: memorizes the entire training data and performs classification only if the attributes of the record match one of the training examples exactly · Nearest neighbor: uses the k "closest" points (nearest neighbors) to perform classification

What are the main differences between PCA and autoencoders?

The main differences between PCA and autoencoders are: · An autoencoder can learn non-linear transformations with a non-linear activation function and multiple layers. · An autoencoder does not have to use dense layers. It can use convolutional layers, which are better suited to video, image, and series data. · It is more efficient to learn several layers with an autoencoder than to learn one huge transformation with PCA. · An autoencoder provides a representation of each layer as the output. · An autoencoder can make use of pre-trained layers from another model, applying transfer learning to enhance the encoder/decoder.

How does the k-means algorithm work?

The k-means algorithm works as follows:
· Specify the number of clusters K.
· Initialize the centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.
· Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters is not changing.
  o Compute the sum of the squared distances between data points and all centroids.
  o Assign each data point to the closest cluster (centroid).
  o Compute the centroids for the clusters by taking the average of all data points that belong to each cluster.
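The loop described above can be sketched in plain Python. The function names, the `seed` and `max_iter` parameters, and the rule of keeping the old centroid when a cluster becomes empty are assumptions added for illustration.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Initialize: K distinct data points chosen without replacement.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:  # assignments stopped changing
            break
        assignment = new_assignment
        # Recompute each centroid as the average of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return centroids, assignment
```

Different seeds select different starting centroids, which is the sensitivity to initial seeds that is usually listed as a drawback of k-means.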

What are the types of recommendation systems? List them, then draw diagrams showing the working mechanism of each.

There are two main types of recommender systems: · Content-based recommenders (Characteristic information) · Collaborative filtering recommenders (User-item interaction)

List, then explain the 3 main properties AND 4 hyperparameters of autoencoders.

Three main properties of autoencoders are: · Data-specific: Autoencoders are only able to compress data like what they have been trained on. · Lossy: The decompressed outputs will be degraded compared to the original inputs. · Learned automatically from examples: It is easy to train specialized instances of the algorithm that will perform well on a specific type of input. Four main hyperparameters of autoencoders are: · Code size: It represents the number of nodes in the middle layer. A smaller size results in more compression. · Number of layers: The autoencoder can consist of as many layers as we want. · Number of nodes per layer: The number of nodes per layer decreases with each subsequent layer of the encoder and increases back in the decoder. The decoder is symmetric to the encoder in terms of the layer structure. · Loss function: We use either mean squared error or binary cross-entropy. If the input values are in the range [0, 1] then we typically use cross-entropy; otherwise, we use the mean squared error.

What are the two main steps of the K-means algorithm? Write the pseudocode of the K-means algorithm.

Two main steps of the K-means algorithm: · Assign (clusters/classes/groups) · Optimize (cost function/distortion/minimize errors)

List the 8 types AND 5 applications of autoencoders.

Types of autoencoders are: · Stacking · Convolution · Deep · Denoising · Sparse · Contractive · Variational · Generative adversarial networks Applications of autoencoders are: · Image Coloring · Feature variation · Dimensionality Reduction · Denoising Image · Watermark Removal

List types of probabilistic relationships, then provide 7 real-world Bayesian network applications

Types of probabilistic relationships are: · Direct cause · Indirect cause · Common cause · Common effect Real-world Bayesian network applications include: · Gene Regulatory Network · Medicine · Biomonitoring · Document Classification · Information Retrieval · Semantic Search · Image Processing · Spam Filter · Turbo Code · Systems Biology · Medical Diagnosis · Ventilator-associated Pneumonia (VAP) · ROC (Receiver Operating Characteristic)

What do we mean by variance and covariance? List the differences between variance and covariance.

Variance: a measure of the deviation from the mean for points in one dimension. Covariance: a measure of how much each of the dimensions varies from the mean with respect to each other.
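A small sketch makes the difference concrete: variance involves one dimension, covariance involves a pair. The function names are hypothetical, and the sample form with an n - 1 denominator is an assumption; the population form divides by n instead.

```python
def mean(values):
    return sum(values) / len(values)

def variance(xs):
    """Sample variance: spread of one dimension around its own mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def covariance(xs, ys):
    """Sample covariance: how two dimensions vary from their means together."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
```

The covariance of a dimension with itself is its variance. A positive covariance means the two dimensions tend to increase together, and a negative one means one tends to decrease as the other increases.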

Can a set of weak learners create a single strong learner?

Yes, a single strong learner can be created from a set of weak learners. An ensemble method takes the hypotheses of the weak learners and combines them into one single hypothesis.
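One way to see why combining helps is to compute the accuracy of a majority vote. The sketch below rests on a strong simplifying assumption that real ensembles only approximate: every weak learner is correct with the same probability p, independently of the others. The function name is hypothetical.

```python
from math import comb

def majority_vote_accuracy(n, p):
    """P(majority of n independent voters is correct), for odd n.

    Assumes each voter is correct with probability p, independently.
    """
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )
```

With p = 0.6, three voters already reach 0.648, and the figure keeps rising as more independent voters are added. Correlated errors reduce this gain, which is why methods such as bagging and boosting work to make the learners differ.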

What are the main features of the random forest method?

A random forest is an ensemble of decision trees. Random forests are known to run efficiently on large datasets, are easy to implement, can obtain higher accuracy than a single decision tree, and can handle a large number of features.

List all steps of the agglomerative (bottom-up) approach to hierarchical clustering.

1. Make each data point a single-point cluster. That forms N clusters.
2. Take the two closest data points and make them one cluster. That forms N - 1 clusters.
3. Take the two closest clusters and make them one cluster. That forms N - 2 clusters.
4. Repeat Step 3 until there is only one cluster.
5. Finish.
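A minimal sketch of this procedure on one-dimensional points. The single-link distance between clusters and the function names are assumptions chosen for illustration; the returned merge history is the information a dendrogram draws.

```python
def single_link(c1, c2):
    """Smallest distance between a point in c1 and a point in c2 (1-D points)."""
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerative(points):
    """Return the merges in order, each as a pair (cluster_a, cluster_b)."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-link distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges
```

N points always produce N - 1 merges. Scanning every pair of clusters at every step is what makes the naive approach too slow for large datasets.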

List, then define all possible methods of merging clusters that depend on distance measures.

· Single link: smallest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = min{d(xip, xjq)}
· Complete link: largest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = max{d(xip, xjq)}
· Average: average distance between elements in one cluster and elements in the other, i.e., d(Ci, Cj) = avg{d(xip, xjq)}
· Centroid: distance between the centroids of the two clusters
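The four cluster distances can be sketched as follows. The function names are hypothetical, and Euclidean distance (via `math.dist`) is assumed as the underlying point-to-point distance d.

```python
import math

def pairwise(c1, c2):
    """All point-to-point distances between two clusters of tuples."""
    return [math.dist(p, q) for p in c1 for q in c2]

def single_link_distance(c1, c2):
    return min(pairwise(c1, c2))

def complete_link_distance(c1, c2):
    return max(pairwise(c1, c2))

def average_link_distance(c1, c2):
    distances = pairwise(c1, c2)
    return sum(distances) / len(distances)

def centroid_distance(c1, c2):
    """Distance between the coordinate-wise means of the two clusters."""
    centroid1 = [sum(coords) / len(c1) for coords in zip(*c1)]
    centroid2 = [sum(coords) / len(c2) for coords in zip(*c2)]
    return math.dist(centroid1, centroid2)
```

For any two clusters, the single-link distance is at most the average, which is at most the complete-link distance, so the choice of linkage changes which pair of clusters gets merged next.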

Illustrate the main tasks of the PCA process (step 1)

· Subtract the mean from each of the data dimensions. · All the x values have x̄ (the mean of x) subtracted, and all the y values have ȳ (the mean of y) subtracted from them. · This produces a dataset whose mean is zero. · Subtracting the mean makes the variance and covariance calculations easier by simplifying their equations. The variance and covariance values are not affected by the mean value.
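A minimal sketch of the mean-subtraction step, with data items as rows and dimensions as columns. The function name is hypothetical.

```python
def subtract_mean(rows):
    """Return (centered_rows, means) with each dimension's mean removed."""
    n = len(rows)
    means = [sum(column) / n for column in zip(*rows)]
    centered = [[value - m for value, m in zip(row, means)] for row in rows]
    return centered, means
```

After this step every column sums to zero. Keeping the means allows the original data to be recovered later by adding them back.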

