Machine Learning

Ace your homework & exams now with Quizwiz!

Random Forest

Random forest is a collection of many decision trees from random subsets of the data, resulting in a combination of trees that may be more accurate in prediction than a single decision tree.

Random Forest

Random forests are an ensemble learning technique that builds off of decision trees. Random forests involve creating multiple decision trees using bootstrapped datasets of the original data and randomly selecting a subset of variables at each step of the decision tree. The model then selects the mode of all of the predictions of each decision tree. What's the point of this? By relying on a "majority wins" model, it reduces the risk of error from an individual tree. For example, if we created one decision tree, the third one, it would predict 0. But if we relied on the mode of all 4 decision trees, the predicted value would be 1. This is the power of random forests. StatQuest does an amazing job walking through this in greater detail. See here.

What is Regression in Machine Learning?

Regression in data science and machine learning is a statistical method that enables predicting outcomes based on a set of input variables. The outcome is often a variable that depends on a combination of the input variables. A linear regression model performed on the Databricks Lakehouse. Source: https://www.databricks.com/blog/2015/06/04/simplify-machine-learning-on-spark-with-databricks.html

SVM

SVM, or Support Vector Machines create coordinates for each object in an n-dimensional space and uses a hyperplane to group objects by common features

Supervised Learning

Supervised learning involves learning a function that maps an input to an output based on example input-output pairs [1]. For example, if I had a dataset with two variables, age (input) and height (output), I could implement a supervised learning model to predict the height of a person based on their age. To re-iterate, within supervised learning, there are two sub-categories: regression and classification.

K-Means

The K-Means algorithm finds similarities between objects and groups them into K different clusters.

What are the different machine learning models?

There are many machine learning models, and almost all of them are based on certain machine learning algorithms. Popular classification and regression algorithms fall under supervised machine learning, and clustering algorithms are generally deployed in unsupervised machine learning scenarios.

Decision Tree, Random Forest, Neural Network

These models follow the same logic as previously explained. The only difference is that that output is discrete rather than continuous.

What is a Decision Tree in Machine Learning (ML)?

A Decision Tree is a predictive approach in ML to determine what class an object belongs to. As the name suggests, a decision tree is a tree-like flow chart where the class of an object is determined step-by-step using certain known conditions.

Neural Network

A Neural Network is essentially a network of mathematical equations. It takes one or more input variables, and by going through a network of equations, results in one or more output variables. You can also say that a neural network takes in a vector of inputs and returns a vector of outputs, but I won't get into matrices in this article. The blue circles represent the input layer, the black circles represent the hidden layers, and the green circles represent the output layer. Each node in the hidden layers represents both a linear function and an activation function that the nodes in the previous layer go through, ultimately leading to an output in the green circles. If you would like to learn more about it, check out my beginner-friendly explanation on neural networks.

Support Vector Machine

A Support Vector Machine is a supervised classification technique that can actually get pretty complicated but is pretty intuitive at the most fundamental level. Let's assume that there are two classes of data. A support vector machine will find a hyperplane or a boundary between the two classes of data that maximizes the margin between the two classes (see below). There are many planes that can separate the two classes, but only one plane can maximize the margin or distance between the classes. If you want to get into greater detail, Savan wrote a great article on Support Vector Machines here.

What is a Classifier in Machine Learning?

A classifier is a machine learning algorithm that assigns an object as a member of a category or group. For example, classifiers are used to detect if an email is spam, or if a transaction is fraudulent.

What is a machine learning Algorithm?

A machine learning algorithm is a mathematical method to find patterns in a set of data. Machine Learning algorithms are often drawn from statistics, calculus, and linear algebra. Some popular examples of machine learning algorithms include linear regression, decision trees, random forest, and XGBoost.

What is a machine learning Model?

A machine learning model is a program that can find patterns or make decisions from a previously unseen dataset. For example, in natural language processing, machine learning models can parse and correctly recognize the intent behind previously unheard sentences or combinations of words. In image recognition, a machine learning model can be taught to recognize objects - such as cars or dogs. A machine learning model can perform such tasks by having it 'trained' with a large dataset. During training, the machine learning algorithm is optimized to find certain patterns or outputs from the dataset, depending on the task. The output of this process - often a computer program with specific rules and data structures - is called a machine learning model.

What is Time Series Machine Learning?

A time-series machine learning model is one in which one of the independent variables is a successive length of time minutes, days, years etc.), and has a bearing on the dependent or predicted variable. Time series machine learning models are used to predict time-bound events, for example - the weather in a future week, expected number of customers in a future month, revenue guidance for a future year, and so on.

Boosting algorithms

Boosting algorithms, such as Gradient Boosting Machine, XGBoost, and LightGBM, use ensemble learning. They combine the predictions from multiple algorithms (such as decision trees) while taking into account the error from the previous algorithm.

Clustering

Clustering is an unsupervised technique that involves the grouping, or clustering, of data points. It's frequently used for customer segmentation, fraud detection, and document classification. Common clustering techniques include k-means clustering, hierarchical clustering, mean shift clustering, and density-based clustering. While each technique has a different method in finding clusters, they all aim to achieve the same thing.

Dimensionality Reduction

Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables [2]. In simpler terms, its the process of reducing the dimension of your feature set (in even simpler terms, reducing the number of features). Most dimensionality reduction techniques can be categorized as either feature elimination or feature extraction. A popular method of dimensionality reduction is called principal component analysis.

Decision Tree

Decision trees are a popular model, used in operations research, strategic planning, and machine learning. Each square above is called a node, and the more nodes you have, the more accurate your decision tree will be (generally). The last nodes of the decision tree, where a decision is made, are called the leaves of the tree. Decision trees are intuitive and easy to build but fall short when it comes to accuracy.

Decision Trees

Decision trees are also classifiers that are used to determine what category an input falls into by traversing the leaf's and nodes of a tree

What are Deep Learning Models?

Deep learning models are a class of ML models that imitate the way humans process information. The model consists of several layers of processing (hence the term 'deep') to extract high-level features from the data provided. Each processing layer passes on a more abstract representation of the data to the next layer, with the final layer providing a more human-like insight. Unlike traditional ML models which require data to be labeled, deep learning models can ingest large amounts of unstructured data. They are used to perform more human-like functions such as facial recognition and natural language processing. A simplified representation of deep learning. Source: https://www.databricks.com/discover/pages/the-democratization-of-artificial-intelligence-and-deep-learning

Hierarchical Clustering

Hierarchical clustering builds a tree of nested clusters without having to specify the number of clusters.

Classification

In classification models, the output is discrete. Below are some of the most common types of classification models.

What are the different types of Machine Learning?

In general, most machine learning techniques can be classified into supervised learning, unsupervised learning, and reinforcement learning.

Regression

In regression models, the output is continuous. Below are some of the most common types of regression models.

What is Reinforcement Learning?

In reinforcement learning, the algorithm is made to train itself using many trial and error experiments. Reinforcement learning happens when the algorithm interacts continually with the environment, rather than relying on training data. One of the most popular examples of reinforcement learning is autonomous driving.

What is Supervised Machine Learning?

In supervised machine learning, the algorithm is provided an input dataset, and is rewarded or optimized to meet a set of specific outputs. For example, supervised machine learning is widely deployed in image recognition, utilizing a technique called classification. Supervised machine learning is also used in predicting demographics such as population growth or health metrics, utilizing a technique called regression.

Principal Component Analysis (PCA)

In the simplest sense, PCA involves project higher dimensional data (eg. 3 dimensions) to a smaller space (eg. 2 dimensions). This results in a lower dimension of data, (2 dimensions instead of 3 dimensions) while keeping all original variables in the model. There is quite a bit of math involved with this. If you want to learn more about it... Check out this awesome article on PCA here. If you'd rather watch a video, StatQuest explains PCA in 5 minutes here.

What is Unsupervised Machine Learning?

In unsupervised machine learning, the algorithm is provided an input dataset, but not rewarded or optimized to specific outputs, and instead trained to group objects by common characteristics. For example, recommendation engines on online stores rely on unsupervised machine learning, specifically a technique called clustering.

Linear Regression

Linear regression is used to identify relationships between the variable of interest and the inputs, and predict its values based on the values of the input variables.

Logistic Regression:

Logistic Regression is used to determine if an input belongs to a certain group or not

Logistic Regression

Logistic regression is similar to linear regression but is used to model the probability of a finite number of outcomes, typically two. There are a number of reasons why logistic regression is used over linear regression when modeling probabilities of outcomes (see here). In essence, a logistic equation is created in such a way that the output values can only be between 0 and 1 (see below).

How many models are there in machine learning?

Many! Machine learning is an evolving field and there are always more machine learning models being developed.

What is model deployment in Machine Learning (ML)?

Model deployment is the process of making a machine learning model available for use on a target environment—for testing or production. The model is usually integrated with other applications in the environment (such as databases and UI) through APIs. Deployment is the stage after which an organization can actually make a return on the heavy investment made in model development. A full machine learning model lifecycle on the Databricks Lakehouse. Source: https://www.databricks.com/blog/2019/09/18/productionizing-machine-learning-from-deployment-to-drift-detection.html

Naive Bayes

Naive Bayes is an algorithm that assumes independence among variables and uses probability to classify objects based on features

Naive Bayes

Naive Bayes is another popular classifier used in Data Science. The idea behind it is driven by Bayes Theorem: In plain English, this equation is used to answer the following question. "What is the probability of y (my output variable) given X? And because of the naive assumption that variables are independent given the class, you can say that: As well, by removing the denominator, we can then say that P(y|X) is proportional to the right-hand side. Therefore, the goal is to find the class y with the maximum proportional probability. Check out my article "A Mathematical Explanation of Naive Bayes" if you want a more in-depth explanation!

Linear Regression

The idea of linear regression is simply finding a line that best fits the data. Extensions of linear regression include multiple linear regression (eg. finding a plane of best fit) and polynomial regression (eg. finding a curve of best fit). You can learn more about linear regression in my previous article.

kNN

The k Nearest Neighbors technique involves grouping the closest objects in a dataset and finding the most frequent or average characteristics among the objects.

What is the best model for machine learning?

The machine learning model most suited for a specific situation depends on the desired outcome. For example, to predict the number of vehicle purchases in a city from historical data, a supervised learning technique such as linear regression might be most useful. On the other hand, to identify if a potential customer in that city would purchase a vehicle, given their income and commuting history, a decision tree might work best.

What is Model Training in machine learning?

The process of running a machine learning algorithm on a dataset (called training data) and optimizing the algorithm to find certain patterns or outputs is called model training. The resulting function with rules and data structures is called the trained machine learning model.

Unsupervised Learning

Unlike supervised learning, unsupervised learning is used to draw inferences and find patterns from input data without references to labeled outcomes. Two main methods used in unsupervised learning include clustering and dimensionality reduction.


Related study sets

Child and Adolescents Psych Final

View Set

Respiratory and Reproductive Patho Q&A's from study guide and thepoint nclex q&a

View Set

systematic innovation - I-TRIZ history, principle of ideality, abstract desc

View Set

Chapter 29: The Child with Musculoskeletal or Articular Dysfunction

View Set

Micro chapter 3 from readings and quizzes

View Set

Chapter 12: Eukaryotic Microorganisms

View Set

Conceptual Physics more Module 1 answers .

View Set

Velázquez La búsqueda de la luz-Cierta o falsa

View Set