QMB3302
Y and y-hat are a little different. Y is our target vector, and y-hat is an output in our model that is a.....
estimate or prediction of y
The elbow method provides an exact number of clusters for a kmeans algorithm.
False. The elbow method only suggests a reasonable number of clusters; it does not give an exact answer.
Regressions Making Predictions Video
# here is the entire code to create that model again
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression

a = [1, 2, 3, 4, 5, 6]
d = [1, 4, 6, 12, 9, 13]

# creating our dataframe
df = pd.DataFrame()
df['x1'] = a
df['y'] = d
df
Chp.10 In Depth: K Means Clustering
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()  # for plot styling
import numpy as np
When looking at the code in the videos, we sometimes used a variable to hold our model. What is the significance of the word "model" in the below code? model = LinearRegression(fit_intercept=True)
'model' is a named variable and is just holding our linear regression model. It could be renamed anything. The word itself is not important. It is just a container.
Regressions Ames Example
***Come Back To***
Random Forest Disadvantages
- Not easily interpretable
- Difficult to draw conclusions about the meaning of the classification model
According to the documentation, a silhouette score of 1 is the best score and -1 is the worst score (choose the best answer)
1 = the best score -1 = the worst score
What is scikit-learn?
A machine learning package in Python that has built-in machine learning algorithms we can use on our dataset
WRONG ANSWER. What type of algorithm would you use to segment customers into groups? Assume the groups are already labeled. Choices: All of the Above; Decision Trees; Cluster; Regression; Random Forest
All of the Above!
The LinearRegression estimator is only capable of simple straight line fits.
False. The text and the videos have several examples that are not straight-line fits (see the sketch below).
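A quick sketch of one way this works (not the exact course code): feed polynomial basis features into a plain LinearRegression so the straight-line machinery produces a curved fit.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)              # clearly not a straight line

# degree-7 polynomial features feeding an ordinary LinearRegression
poly_model = make_pipeline(PolynomialFeatures(7), LinearRegression())
poly_model.fit(x[:, np.newaxis], y)

xfit = np.linspace(0, 10, 1000)
yfit = poly_model.predict(xfit[:, np.newaxis])   # a curved fit from a "linear" model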
The basic idea of a regression is very simple. We have some X values (we called these features) and some Y value (this is the variable we are trying to predict). We could have multiple Y values, but that is not something we have covered.
Features (X) and Predict (Y)
Week 8: Supervised Learning (In Depth: Linear Regression//Textbook):
Linear Regressions are: easily interpretable and can fit very quickly.
Code for a standard import:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
The example we walked through was from a fairly famous dataset for learning about machine learning. The dataset is called:
MNIST
Multinomial Naive Bayes
Modeling the data distribution with a best-fit multinomial distribution
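A minimal sketch (toy word counts I made up, not from the course): MultinomialNB models count-style features with a best-fit multinomial distribution per class.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# rows = documents, columns = counts of three hypothetical words
X = np.array([[3, 0, 1],
              [2, 0, 0],
              [0, 4, 2],
              [0, 3, 3]])
y = np.array(['sports', 'sports', 'cooking', 'cooking'])

model = MultinomialNB()
model.fit(X, y)
print(model.predict([[1, 0, 0], [0, 2, 1]]))   # likely ['sports' 'cooking']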
An example this week was done in a Jupyter-like environment called Google Colab. What was the language that was demonstrated in the videos? (One cool thing about this is that it looks just like any other package! Installing this on your own is tricky)
TensorFlow
Gaussian Naive Bayes
The assumption that the data from each label is drawn from a simple Gaussian Distribution
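A minimal sketch of that idea (blob data generated on the fly, not the course notebook): GaussianNB fits a simple Gaussian to each feature within each label.

from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

X, y = make_blobs(n_samples=200, centers=2, random_state=2, cluster_std=1.5)

model = GaussianNB()
model.fit(X, y)              # fits a per-class Gaussian to each feature
print(model.theta_)          # per-class feature means
print(model.predict(X[:5]))  # predicted labels for the first five points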
In five-fold cross-validation
The data set less the holdout is split into five folds
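A short sketch of five-fold cross-validation in code (the iris data and GaussianNB here are just convenient stand-ins):

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

iris = load_iris()
X, y = iris.data, iris.target

scores = cross_val_score(GaussianNB(), X, y, cv=5)   # cv=5 -> five folds
print(scores)          # one accuracy score per fold
print(scores.mean())   # the average across the five folds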
Which of the following machine learning models utilizes supervised learning? regression precipitous algorithm k-means hierarchical clustering
The following machine learning models utilize supervised learning: -Linear regression
What is ONE reason the textbook lists for why a Linear regression is a good starting point in a modeling task.
They are interpretable
When to use Naive Bayes
- When the naive assumptions actually match the data (very rare in practice)
- For very well-separated categories, when model complexity is less important
- For very high-dimensional data, when model complexity is less important
K-Means Algorithm: Expectation Maximization
comes up in a variety of contexts within data science
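A rough sketch of that expectation-maximization loop (my own simplified version, ignoring edge cases like empty clusters): guess centers, assign each point to the nearest center (E-step), then move each center to the mean of its points (M-step), and repeat.

import numpy as np

def simple_kmeans(X, n_clusters, seed=0, n_iters=10):
    rng = np.random.RandomState(seed)
    # start from randomly chosen data points as the initial centers
    centers = X[rng.permutation(X.shape[0])[:n_clusters]]
    for _ in range(n_iters):
        # E-step: assign each point to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: recompute each center as the mean of its assigned points
        centers = np.array([X[labels == k].mean(axis=0) for k in range(n_clusters)])
    return centers, labels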
Data partitions and/or cross-validation help _________
generalize the ML model
Data partitions allow for model _________.
Assessment(s); Which model should I use?
(Data Partitions) Why not choose the model with the best fit to the data?
Because it is then harder to predict any future data points (the piecewise, highest-fit model is overfit to the existing data)
Regression Pipelines
Takes all the data coming in at the front, fixes and cleans it (for example, fills missing values) in a streamlined way, and pushes it out the other side ready to model (a sketch follows below). *Refer to Supervised Learning Notes*
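A hedged sketch of such a pipeline (the column names and toy values are made up): impute missing values, scale, and fit the model in one streamlined object.

import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'x1': [1, 2, np.nan, 4, 5, 6],
                   'x2': [2, 1, 3, np.nan, 5, 4],
                   'y':  [1, 4, 6, 12, 9, 13]})

pipe = make_pipeline(SimpleImputer(strategy='median'),  # fix missing values
                     StandardScaler(),                  # clean/standardize
                     LinearRegression())                # the model on the other side
pipe.fit(df[['x1', 'x2']], df['y'])
print(pipe.predict(df[['x1', 'x2']]))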
random forest
an ensemble of randomized decision trees
Chp. 7 What is Scikit Learn Pt.1
Scikit-Learn is a library package that builds on NumPy. Data is arranged as a features matrix X with shape (n_samples, n_features) and a target vector y with shape (n_samples,).
Steps in using Scikit-Learn: Most commonly, the steps in using the Scikit-Learn estimator API are as follows (we will step through a handful of detailed examples in the sections that follow).
1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
2. Choose model hyperparameters by instantiating this class with desired values.
3. Arrange data into a features matrix and target vector following the discussion above.
4. Fit the model to your data by calling the fit() method of the model instance.
5. Apply the model to new data:
-> For supervised learning, we often predict labels for unknown data using the predict() method.
-> For unsupervised learning, we often transform or infer properties of the data using the transform() or predict() method.
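A short sketch walking those five steps with a simple regression (toy data in the spirit of the course examples):

import numpy as np
from sklearn.linear_model import LinearRegression   # 1. choose a class of model

model = LinearRegression(fit_intercept=True)        # 2. choose hyperparameters

rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
X = x[:, np.newaxis]                                # 3. features matrix (n_samples, n_features)

model.fit(X, y)                                     # 4. fit the model to the data

xnew = np.linspace(-1, 11, 50)[:, np.newaxis]
ynew = model.predict(xnew)                          # 5. predict for new data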
Decision trees have a few problems, you should probably review those for the final exam! The problem we talked about the most is:
Overfitting
What is the purpose of the following code?
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
rfm_std = scale.fit_transform(df)
to standardize the data
The features in a model....
----
NLP stands for....
Natural Language Processing
Recommenders come in many flavors. 2 of the most common, often used together and discussed in the lecture are:
User-based
Item-based
In class we walked through 5 steps to building a machine learning model. The textbook also goes over in some depth the 5 steps. What are they? (note that is probably a harder question than I would give you on the final... but understanding these steps and reviewing this concept is helpful... hint hint.)
First Step - Choosing a class of model
Second Step - Choose hyperparameters
Third Step - Arrange data
Fourth Step - Fit the model
Fifth Step - Predict
In Depth: Naive Bayes Classification
Naive Bayes models: a group of extremely fast and simple classification algorithms that are often suitable for very high-dimensional datasets - fast, with few tunable parameters -> very useful as a baseline for a classification problem
Creating a decision tree
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow');
Imagine you have a dataset with 2 columns, both filled with continuous numbers. You believe the first column is a predictor of the second column. Which of the model approaches below could work when building a model?
- Regression: This is probably the obvious choice that comes to mind first!
- Random Forest: Sure! Random forests can be a good thing to try. We did one in the videos to predict home prices.
- Decision Trees: Maybe not the BEST solution; decision trees have some problems like overfitting that we discussed. But you could try it, so it is a viable answer.
Tokenization, as defined in the lecture is....
A computer turning letters and/or words into something it can read and understand, like numbers
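A tiny sketch of tokenization (two made-up sentences), using scikit-learn's CountVectorizer to turn words into numbers:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat on the cat"]

vec = CountVectorizer()
X = vec.fit_transform(docs)          # each distinct word (token) becomes a column
print(vec.get_feature_names_out())   # ['cat' 'dog' 'on' 'sat' 'the']
print(X.toarray())                   # word counts per document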
Deep Neural Networks have only 1 hidden layer and multiple input layers.
False
When viewing a diagram of a neural network there are several layers. Match the layer to the description below.
Input Layer: These are the X's, or inputs from your data.
Output Layer: These are the Y (the target variable you are interested in).
Hidden Layer: Something you don't see; here there is some computation to transform the X's into the Y.
Bagging
Makes use of an ensemble of parallel estimators, each of which over-fits the data, and averages the results to find a better classification. Code: BaggingClassifier (a sketch follows below)
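A hedged sketch of bagging on the blob data used elsewhere in these notes (the settings here are illustrative, not the course's exact ones):

from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

X, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)

tree = DecisionTreeClassifier()
bag = BaggingClassifier(tree, n_estimators=100, max_samples=0.8, random_state=1)
bag.fit(X, y)            # many trees, each fit on a random 80% of the data
print(bag.score(X, y))   # accuracy of the bagged ensemble on the training data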
Regressions Basic intro (video)
model = LinearRegression(fit_intercept=True)
- Holds the linear regression code that we need to run the regression; acts as a placeholder.
- Array shape is important!
Model prediction for linear regression:
*Coefficient [2.] = every time x goes up by 1, y goes up by 2
*Intercept 0.0
Chp. 7 What is a Model(?)
N features (the x's) are used to build a prediction, ŷ (y-hat).
- Built by using: ŷ = f(X) = f(x1, x2, ..., xn)
- Capital X is the matrix of all the x's used to predict ŷ; the main goal is to learn f from the matrix of all the x's and the y's.
Which of the below were discussed as being problems with the hold out method for validation?
The model is not trained on all of the data Outliers can skew the result
What is a good model fit value?
Unknowable without knowing/understanding the context and the domain.
Chp. 7 Data Validation
Validation: Is my model good, or right? Should I use this model?
- Depends on the context of the problem that you are trying to solve
- How well does the model generalize from the data it was built on to new data?
Polynomials model disadvantage _____
It is very easy to overfit the data, and an overfit model will not predict future (y) data very well.
Introducing K-Means
- Searches for a pre-determined number of clusters within an unlabeled multidimensional dataset.
- The "cluster center" is the arithmetic mean of all the points belonging to the cluster.
- Each point is closer to its own cluster center than to other cluster centers.
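A minimal sketch of running k-means in scikit-learn (standard usage on made-up blob data; you choose the number of clusters up front):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, y_true = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=0.60)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(X)
labels = kmeans.predict(X)        # which cluster each point was assigned to
print(kmeans.cluster_centers_)    # the arithmetic mean of each cluster's points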
The correct number of clusters in Hierarchical clustering can be determined precisely using approaches such as silhouette scores
False
We want the R-squared value for our regression model to be 100% True False
False.
R-squared (R²) is a statistical measure that represents the proportion of variance in the dependent variable (i.e., the target variable) that can be explained by the independent variables (i.e., the features) in a regression model. The R-squared value ranges from 0 to 1, where 0 indicates that the model does not explain any of the variance in the dependent variable, and 1 indicates that the model explains all of the variance in the dependent variable.
While a high R-squared value is desirable, it is not necessary or even possible to achieve a value of 100% in most cases. In fact, a model with an R-squared value of 100% may indicate overfitting, which means that the model is too complex and is fitting the training data too closely, resulting in poor performance on new, unseen data.
Instead, we want to balance the complexity of the model with its ability to accurately predict new, unseen data. A good regression model will have a high R-squared value (e.g., above 0.7 or 0.8), but also a low mean squared error (MSE) or root mean squared error (RMSE) on the test data.
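A small sketch (toy data) of where an R-squared number comes from in code, via the model's .score() method or r2_score:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
X = 10 * rng.rand(100, 1)
y = 2 * X.ravel() - 5 + rng.randn(100)   # a noisy line, so R^2 is high but not 1.0

model = LinearRegression().fit(X, y)
print(model.score(X, y))                 # R^2 on the data the model was fit to
print(r2_score(y, model.predict(X)))     # the same number, computed explicitly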
Hierarchical clustering is more powerful than Kmeans, as it allows the researcher to determine the exact number of clusters to use in the analysis.
false
## getting the data into the right shape
Xtrain = np.array(df['x1'])[:, None]
y = np.array(df['y'])
print("X shape is:", Xtrain.shape)
print("Y shape is:", y.shape)

# creating the model
model = LinearRegression(fit_intercept=True)
Validation theory
"In principle, model validation is very simple: after choosing a model and its hyperparameters, we can estimate how effective it is by applying it to some of the training data and comparing the prediction to the known value." -Vanderplas 2016 import pandas as pd import numpy as np from matplotlib import pyplot as plt import seaborn as sns #make the plots show up inline %matplotlib inline from sklearn.datasets import load_iris iris = load_iris() #iris = sns.load_dataset('iris') iris X = iris.data y = iris.target
Select all that apply. Imagine you have a dataset with the following columns (inputs) for a set of customers. Column 1 = Customer ID Column 2 = Distance to Store Column 3= Yearly spend Column 4 = Likelihood to return (a survey response that indicates a customer is likely to shop again) What kind of approaches could you use to understand more about these customers?
- Clustering- to develop groups of customers that have similar patterns - Regression- to understand the effect of one or more variables on the others.
Select all that apply. Imagine you have a data set with columns/inputs for customers: Column 1 = Customer ID (a number) Column 2 = Sales (a dollar value) Column 3= Frequency (a number) Column 4 = Satisfaction (a number) You would like to understand the impact of Frequency on customer Satisfaction. What types of approaches could you use? Note that the type of data is brackets () after the column name. Choose the best answer(s) from the available choices below.
- Linear Regression - Random Forest - Decision tree
Bayesian Classification
- Naïve Bayes classifiers are built on Bayesian classification.
- If we make naive assumptions about the generative model for each label, we can find a rough approximation of the generative model for each class, and proceed with the Bayesian classification.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
Pipelines are useful (in the analytics with Python sense) for the following reasons? (choose all that apply)
- Pipelines help organize the code you used to clean and treat your data - Pipelines make it very easy to change small things in your model, like which variables to include. - Pipelines make it easy to repeat/replicate steps and run multiple models
Chp.9 Random Forest
- Supervised learning
- Non-parametric
- An example of an ensemble method: relies on aggregating the results of an ensemble of simple estimators
- The result can be greater than the parts
- You start with a standard import:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
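A hedged sketch of a random forest classifier on the same style of blob data as the text (the parameters here are illustrative, not the course's exact settings):

from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

X, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)                    # an ensemble of randomized decision trees
print(model.predict(X[:5]))        # majority-vote predictions
print(model.predict_proba(X[:5]))  # probabilistic classification per class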
The Hold out Method
- The simplest kind of cross-validation.
- The data set is separated into 2 sets, called the training set and the testing set.
- The model fits a function using the training set only.
- Then the model is asked to predict the output values for the data in the testing set.
- On the graph, each x has some observed value of y; with new data, a new x is run through the fitted function to get a prediction for y.
- Build the model on the training data & test on the testing data.
- 2 weaknesses: the final model is sensitive to how the split is made, and the model is built on only a portion of the data, not all of it.
- Cross-validation: taking our data and dividing it into K subsets (folds), repeated K times so that every fold gets a turn as the test set.
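A minimal sketch of the hold-out method with train_test_split (iris and GaussianNB are just convenient stand-ins):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

iris = load_iris()
Xtrain, Xtest, ytrain, ytest = train_test_split(iris.data, iris.target,
                                                test_size=0.3, random_state=0)

model = GaussianNB()
model.fit(Xtrain, ytrain)       # build the model on the training set only
ypred = model.predict(Xtest)    # then predict the held-out testing set
print(accuracy_score(ytest, ypred))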
Simple Linear Regression Model
-> y = ax + b
ex:
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = 2 * x - 5 + rng.randn(50)
plt.scatter(x, y);

The data is scattered about a line with a slope of 2 and intercept of -5.
*We can use Scikit-Learn's LinearRegression estimator to fit this data:

from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)
model.fit(x[:, np.newaxis], y)

xfit = np.linspace(0, 10, 1000)
yfit = model.predict(xfit[:, np.newaxis])

plt.scatter(x, y)
plt.plot(xfit, yfit);
All the nodes prior to the output nodes essentially 'guess' at the correct weights. Then the algorithm checks to see if the initial guess is correct (usually not). When it is wrong...
.... it tries again (runs another epoch)
Imagine X in the below is a missing value. If I were to run a median imputer on this set of data what would the returned value be? 50, 60, 70, 80, 100, 60, 5000, X (It's okay to have to look up how to do this!)
70
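A sketch of that median-imputer arithmetic in code, so you can check the hand calculation (the missing value comes back as 70):

import numpy as np
from sklearn.impute import SimpleImputer

data = np.array([[50], [60], [70], [80], [100], [60], [5000], [np.nan]])

imputer = SimpleImputer(strategy='median')
filled = imputer.fit_transform(data)
print(imputer.statistics_)   # [70.] -- the median of the non-missing values
print(filled.ravel())        # the X (np.nan) is replaced with 70.0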
Each of the nodes is linked to other nodes by connections; each of those connections has a ________
Activation Function
Which of the following statements best describes an ensemble method in machine learning? Group of answer choices An algorithm that learns to find patterns and relationships in data without being explicitly programmed A model that predicts the value of a dependent variable based on the values of one or more independent variables A technique that combines the results of multiple models to improve overall predictive accuracy A method that automatically groups similar data points into clusters based on their characteristics
An ensemble method in machine learning is a technique that combines the results of multiple models to improve overall predictive accuracy. The idea behind ensemble methods is to create a diverse set of models that are trained on different subsets of the data or using different algorithms. Then, the predictions of these models are combined in a way that reduces errors and increases accuracy. Ensemble methods can be used for both classification and regression tasks and have been shown to improve the performance of models significantly in many real-world applications.
In kmeans- the algorithm has multiple iterations. If we have a simple 2d problem, and a k =2, it begins by assigning the first centroids to [ Select ] ["a random initial starting point", "the euclidean optimized node", "the furthest edge of the state space"] , and then [ Select ] ["iteratively removing the worst fit", "measuring the distance", "removing the center", "reducing the distance"] of each point or record to the centroid.
Answer 1: a random initial starting point
Answer 2: measuring the distance
Which of the following statements best describes classification in machine learning? A type of supervised learning where the goal is to assign input data points to predefined categories or classes A type of reinforcement learning where the goal is to learn an optimal policy for making decisions in an environment. A type of unsupervised learning where the goal is to group similar data points into clusters A type of supervised learning where the goal is to predict a continuous target variable based on input features
Classification in machine learning is a type of supervised learning where the goal is to assign input data points to predefined categories or classes. In classification, the algorithm is trained on a labeled dataset, where each data point is assigned to a specific class or category. The goal of the algorithm is to learn a mapping between the input features and the correct class labels, so that it can accurately predict the class of new, unseen input data. Examples of classification include spam detection, image recognition, and sentiment analysis. The performance of a classification algorithm is evaluated using metrics such as accuracy, precision, recall, and F1 score, which measure how well the algorithm can correctly classify new input data.
What is the role of cluster centers clustering, and how are they determined during the algorithm? Cluster centers are the number of clusters chosen prior to running the algorithm, and they are determined based on the desired output of the clustering Cluster centers are the outliers or anomalies in the dataset, and they are determined through a clustering technique known as DBSCAN. Cluster centers are the mean or median value of all data points in a cluster, and they are determined through a distance metric such as Euclidean distance Cluster centers are the initial data points chosen randomly to begin clustering, and they are updated iteratively to minimize the within-cluster sum of squares
Cluster centers are the initial data points chosen randomly to begin clustering, and they are updated iteratively to minimize the within-cluster sum of squares.

In clustering, cluster centers represent the centroids of clusters, and they play a crucial role in determining the boundaries of the clusters. Cluster centers are usually represented as a vector of feature values that represent the center of mass of the data points within the cluster.

The process of determining cluster centers varies depending on the algorithm used for clustering. In k-means clustering, for example, cluster centers are initialized randomly and then iteratively updated to minimize the within-cluster sum of squares. The algorithm alternates between assigning data points to the nearest cluster center and updating the cluster centers to the mean of the data points assigned to the cluster. This process continues until the cluster centers converge or a maximum number of iterations is reached.

In other clustering algorithms, such as hierarchical clustering, cluster centers are determined by merging or splitting clusters based on a distance metric between the clusters. In DBSCAN, cluster centers are determined based on a density threshold, and data points are assigned to clusters based on their proximity to the cluster centers.

The number of clusters and their respective centers may also be determined based on the desired output of the clustering. For example, in some cases, domain knowledge may be used to determine the appropriate number of clusters, or the desired output may be determined through visual inspection of the data.
Which of the following is true about data validation and cross-validation in machine learning? Data validation and cross-validation are used to evaluate a model's performance and prevent overfitting Data validation and cross-validation are the same thing Data validation involves splitting the data into three sets: training, validation, and testing Cross-validation involves using the same data for both training and testing
Data validation and cross-validation are used to evaluate a model's performance and prevent overfitting in machine learning.

Data validation involves splitting the dataset into separate training and validation sets. The training set is used to train the model, while the validation set is used to evaluate the performance of the model and tune its hyperparameters. Data validation helps prevent overfitting by evaluating the model's performance on data that it has not seen during training.

Cross-validation involves splitting the dataset into multiple folds and performing training and validation multiple times, each time using a different subset of the data for validation. Cross-validation helps prevent overfitting and provides a more reliable estimate of a model's performance than a single validation set. It can also help to reduce the impact of data variability and ensure that the model generalizes well to new data.

It is important to note that data validation and cross-validation are not the same thing, as data validation typically involves splitting the data into training and validation sets, while cross-validation involves repeatedly splitting the data into different training and validation sets. Additionally, data validation does not usually involve a testing set, while cross-validation can use a separate testing set to evaluate the final model performance.
In K Means clustering, the analyst does not need to determine the number of clusters (K), these are always derived analytically using the kmeans algorithm.
False
Neural Networks in computing are exactly the same as the neural networks from biology.
False
Neural networks are an unsupervised technique, because there is no target variable.
False
Errors in validation:
How far the actual Y values are away from ŷ (the predictions); we compare these gaps (errors) across models to determine the best model.
One weakness of cross-validation discussed is that information can sometimes ____ across different periods. A common situation in which this happens is when we are looking at stock data. underfit overfit not leak leak
Leak.

In cross-validation, the dataset is divided into multiple subsets, or "folds," and the model is trained and tested on different combinations of these folds. However, when working with time series data like stock prices, the order of the data points is important, and information from one time period may affect the values in subsequent periods. In this case, if we use cross-validation, there is a risk of "leaking" information from the test set into the training set, or vice versa, because the folds are not strictly independent of each other.

To address this issue, we can use a time-series-specific technique called "rolling window cross-validation," where we train the model on a sliding window of data, and test it on the next period. This way, the model is only trained on data that occurred before the test period, and we avoid leaking information across different time periods.
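A hedged sketch of one way to avoid that leak in scikit-learn: TimeSeriesSplit only ever tests on observations that come after the ones used for training (an expanding/rolling window).

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

prices = np.arange(12).reshape(-1, 1)   # stand-in for 12 time-ordered observations

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(prices):
    # every test index is later in time than every train index, so no leak
    print("train:", train_idx, "test:", test_idx)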
Which of the following best describes supervised learning? Group of answer choices A machine learning approach where the algorithm learns to optimize a performance metric by adjusting its internal parameters A machine learning approach where the algorithm automatically groups similar data points into clusters A machine learning approach where the algorithm receives labeled data and learns to map inputs to outputs based on those labels A machine learning approach where the algorithm learns to find patterns and relationships in data without being explicitly programmed
Supervised learning is a machine learning approach where the algorithm receives labeled data and learns to map inputs to outputs based on those labels. In supervised learning, the algorithm is trained on a labeled dataset, where each input data point is associated with a known output value. The goal of the algorithm is to learn a mapping between the inputs and outputs so that it can accurately predict the output for new, unseen inputs. Examples of supervised learning include image classification, speech recognition, and sentiment analysis. The performance of the algorithm is evaluated using metrics such as accuracy, precision, recall, and F1 score, which measure how well the algorithm can predict the correct output for new input data.
Which of the following is a common use case for the random forest algorithm in machine learning?
The random forest algorithm is a powerful and widely used machine learning algorithm that can be used for various tasks.
- One common use case for the random forest algorithm is for classification tasks, where it can be used to predict the class of a new input based on a set of features. For example, it can be used to classify whether an email is spam or not based on features like the sender's address, the subject line, and the content of the email.
- It can also be used for regression tasks, where it can predict a continuous output variable based on a set of input features.
- Another common use case is for feature selection, where the algorithm can be used to identify the most important features for a particular task.
- Random forests are also robust to overfitting, which makes them suitable for high-dimensional data with many features.
*Overall, the random forest algorithm is a versatile and powerful tool for various machine learning tasks, including classification, regression, and feature selection.*
Which of the following is a potential benefit of using decision trees in machine learning?
There are several potential benefits of using decision trees in machine learning. Some of the most significant ones are:
- Easy to interpret: Decision trees are easy to interpret and understand, even for people who are not familiar with machine learning. The structure of a decision tree is intuitive, and the decisions made at each node can be easily explained.
- Able to handle both categorical and numerical data: Decision trees can handle both categorical and numerical data, making them useful for a wide range of applications.
- Can handle non-linear relationships: Decision trees can capture non-linear relationships between features and the target variable, which is useful when the relationship between the features and the target variable is complex.
- Can be used for feature selection: Decision trees can be used for feature selection by identifying the most important features for the task at hand. This can lead to simpler and more efficient models.
- Computationally efficient: Decision trees can be trained relatively quickly, making them suitable for large datasets.
*Overall, decision trees are a powerful and flexible tool in machine learning that can offer several benefits, including easy interpretability, handling both categorical and numerical data, capturing non-linear relationships, and being computationally efficient.*
(from our possibly overly simplistic explanation) In the attempt to fit values from the input layer to the output layer, the hidden layer applies some weights to the input values.
True
One big difference between the unsupervised approaches in this module and the supervised approaches in prior modules: unsupervised models do not have a target variable (Y). This makes it difficult to know when they are "right" or correct.
True
Choosing a class of model Your dataset consists of details about customer traits, such as "number of items in the basket at checkout" and "time of day of checkout". Your task is to group customers that are like each other together. You don't already have labeled customer types. What kind of model are you building?
Unsupervised model (such as K means) Correct. If you have a bunch of X's, but no Y's-- the problem is unsupervised. Remember when you are building a supervised model you have an X and a Y. Here... there is no Y. The hint there is "You don't already have labeled customer types". Without these labels, the Y, you can't have any supervision. Good work. Think of several more of these types of scenarios...
In which of these situations would you want to use a clustering algorithm? You have a dataset containing 2023 Charlotte, NC housing data and you want to predict 2024 housing prices You have a dataset containing customer data for Cheesecake Factory and you want to look at customer spending at the restaurant in order to find patterns among customers who share similar characteristics You have a dataset containing past crimes of current defendants and you want to determine the likelihood that they will commit another crime You were given the financial data for the Federal Reserve of New York in 2023 and want to determine where the discrepancy in accumulated depreciation came from before you submit the financial statements
You would want to use a clustering algorithm if you have a dataset containing customer data for Cheesecake Factory and you want to look at customer spending at the restaurant in order to find patterns among customers who share similar characteristics.

Clustering is an unsupervised learning technique that is used to group similar data points together based on their characteristics. In this case, we have customer data for Cheesecake Factory, and we want to identify patterns among customers who share similar characteristics, such as spending habits, age, gender, etc. Clustering can help us identify these groups of customers and provide insights into their behavior and preferences, which can be used to inform marketing and business strategies.

The other situations mentioned in the answer choices would require different machine learning techniques:
- For predicting 2024 housing prices, a regression algorithm would be more appropriate.
- For determining the likelihood of repeat crimes, a classification algorithm such as logistic regression or a decision tree would be more appropriate.
- For identifying discrepancies in financial data, data analysis and visualization techniques may be more appropriate.
Utility function to help visualize the output of the classifier
def visualize_classifier(model, X, y, ax=None, cmap='rainbow'):
    ax = ax or plt.gca()

    # Plot the training points
    ax.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=cmap,
               clim=(y.min(), y.max()), zorder=3)
    ax.axis('tight')
    ax.axis('off')
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    # fit the estimator
    model.fit(X, y)
    xx, yy = np.meshgrid(np.linspace(*xlim, num=200),
                         np.linspace(*ylim, num=200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    # Create a color plot with the results
    n_classes = len(np.unique(y))
    contours = ax.contourf(xx, yy, Z, alpha=0.3,
                           levels=np.arange(n_classes + 1) - 0.5,
                           cmap=cmap, clim=(y.min(), y.max()),
                           zorder=1)

    ax.set(xlim=xlim, ylim=ylim)
Process of fitting a decision tree in Scikit-Learn:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier().fit(X, y)
Basic Regression
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)

"What will fit my model best?"
What is the purpose of the below code? Note that this is probably EASIER than similar questions on the final exam. But I will ask you why/purpose/what-for questions on the code I have had you run. It is useful to make notes on your notebooks about why a certain chunk of code is run.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
Import the Python packages (matplotlib, seaborn, numpy)
Random forest advantages
- Both training and prediction are very fast, because of the simplicity of the underlying decision trees. In addition, both tasks can be straightforwardly parallelized, because the individual trees are entirely independent entities.
- The multiple trees allow for a probabilistic classification: a majority vote among estimators gives an estimate of the probability (accessed in Scikit-Learn with the predict_proba() method).
- The nonparametric model is extremely flexible, and can thus perform well on tasks that are under-fit by other estimators.
What is a potential downside of using linear regression models in machine learning? They are prone to overfitting the data They are too complex and difficult to interpret They can only handle numerical data They are not suitable for predicting continuous target variables
They are prone to overfitting the data. Overfitting occurs when a model is too complex and fits the training data too closely, resulting in poor performance on new, unseen data. Linear regression models are susceptible to overfitting because they try to fit a linear relationship to the data, which may not accurately capture the underlying pattern. To mitigate the risk of overfitting, it is important to use techniques such as regularization, cross-validation, and feature selection to ensure that the model is not too complex and is able to generalize well to new data. Additionally, other machine learning algorithms, such as decision trees, random forests, and neural networks, may be more suitable for certain types of data or problems.
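A brief sketch (not from the course notebooks) of regularization as one guard against overfitting a linear model: Ridge adds a penalty on large coefficients.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.rand(30, 10)                 # more features than the signal really needs
y = 3 * X[:, 0] + 0.5 * rng.randn(30)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # alpha controls the strength of the penalty

print(np.abs(plain.coef_).round(2))  # unpenalized coefficients
print(np.abs(ridge.coef_).round(2))  # shrunk toward zero by the penalty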
This should look familiar: "We will start with the most familiar linear regression, a straight-line fit to data. A straight-line fit is a model of the form y=ax+b" (from the text in week 8) where a is commonly known as the , and b is commonly known as the
a = the slope
b = the intercept
Advantages of Naive Bayes
- They are extremely fast for both training and prediction
- They provide straightforward probabilistic prediction
- They are often very easily interpretable
- They have very few (if any) tunable parameters
- Often a good choice for an initial baseline classification
Data Partitions gives a measure of the quality of the _________
ultimately chosen model