#5 Week Le Wagon (ML)
What are you doing if you evaluate your model on the same data you trained it with?
You are scoring the model on data it has already seen, so the result is over-optimistic and hides any overfitting! A model's performance should be evaluated on data points it has not seen during training.
What is an AR process? How could you characterize its behavior? Give one example.
- AR stands for "Auto-Regressive"
- A process whose values are a linear combination of its own past values (plus noise)
- In such processes, a one-time "shock" propagates far into the future
- e.g. the atmospheric CO2 concentration (when a giant forest fire suddenly increases it, the CO2 concentration stays raised for decades)
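To make the "shock propagates far into the future" idea concrete, here is a minimal sketch of an AR(1) process, y(t) = phi * y(t−1) + e(t), hit by a single shock at t = 0. The function name and the value phi = 0.9 are assumptions chosen for illustration.

```python
def ar1_impulse_response(phi, n_steps, shock=1.0):
    """Response of y(t) = phi * y(t-1) + e(t) to a single shock at t = 0."""
    response = []
    y = 0.0
    for t in range(n_steps):
        e = shock if t == 0 else 0.0  # one-time shock, no noise afterwards
        y = phi * y + e
        response.append(y)
    return response


response = ar1_impulse_response(phi=0.9, n_steps=20)
print([round(v, 3) for v in response[:5]])  # [1.0, 0.9, 0.81, 0.729, 0.656]
```

The effect decays geometrically (phi ** t) but never drops to exactly zero: after 19 steps, more than 10% of the initial shock is still present.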
What is the fundamental reason we need to stationarize a time series before modeling it?
- The statistical properties of a stationary series are, by definition, constant over time, so we can safely extrapolate them into the future!
- We are never safe from unexpected changes (black swans), but we can quantify how likely these properties are to stay constant (uncertainty intervals)
- AR/MA modeling is one way to do so. Many other methods exist.
Name 4 methods to stationarize a time series.
- Detrending (e.g. taking the log, removing a linear increase, etc.)
- Deseasonalizing (e.g. using statsmodels seasonal_decompose)
- Differencing (e.g. y(t)−y(t−1))
- Seasonal differencing (e.g. y(t)−y(t−12))
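As an illustration of the two differencing methods, here is a small sketch on a made-up series (a linear trend plus a repeating 4-period pattern, so the seasonal lag is 4 rather than 12). The helper `difference` and the data are hypothetical.

```python
def difference(series, lag=1):
    """Return y(t) - y(t - lag) for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical series: linear trend (2 * t) plus a repeating 4-period pattern.
pattern = [0, 5, 2, -3]
series = [2 * t + pattern[t % 4] for t in range(16)]

first_diff = difference(series, lag=1)     # removes the trend, keeps the seasonality
seasonal_diff = difference(series, lag=4)  # removes the seasonal pattern as well

print(first_diff)
print(seasonal_diff)
```

On this toy series the seasonal difference is constant, which is what a stationarized series with no noise looks like.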
What is a MA process? How could you characterize its behavior? Give one example.
- MA stands for "Moving-Average"
- A process whose values are a linear combination of its past shocks (error terms)
- In such processes, any "shock" will have a limited time effect (rebound/elastic behavior)
- e.g. a country's GDP growth in % (when a pandemic shock lowers it to -10%, it may bounce back to +5% the following year)
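The "limited time effect" can be checked with a minimal sketch of an MA(q) process, y(t) = e(t) + theta_1 * e(t−1) + ... + theta_q * e(t−q), hit by a single shock at t = 0. The function name and the coefficient −0.5 are assumptions for illustration.

```python
def ma_impulse_response(thetas, n_steps, shock=1.0):
    """Response of an MA(q) process to a single shock at t = 0 (no other noise)."""
    shocks = [shock] + [0.0] * (n_steps - 1)
    response = []
    for t in range(n_steps):
        y = shocks[t]
        for i, theta in enumerate(thetas, start=1):
            if t - i >= 0:
                y += theta * shocks[t - i]  # contribution of the shock i steps back
        response.append(y)
    return response


response = ma_impulse_response(thetas=[-0.5], n_steps=6)
print(response)  # [1.0, -0.5, 0.0, 0.0, 0.0, 0.0]
```

With one MA term, the shock is felt for exactly two periods (including a rebound in the opposite direction) and then vanishes completely.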
List some advantages of using autoregressive models like ARIMA over conventional ML models for time series.
- Only one single ARIMA model is required to forecast any time horizon ahead.
- Indeed, it models the recursive behavior of the data, forecasting one data point after the other
- You can compute 95% confidence intervals on your forecasts
What are the iterative steps of the K-means algorithm?
1. Specify the number of clusters K.
2. Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.
3. Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters isn't changing. In each iteration, assign every data point to its closest centroid, then recompute each centroid as the mean of the points assigned to it.
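The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the function name, the seed, the toy points, and the choice to keep the old centroid when a cluster ends up empty are all assumptions.

```python
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Minimal K-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # K data points, drawn without replacement
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # assumption: keep the centroid of an empty cluster
        if new_centroids == centroids:  # no change: converged
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

On these two well-separated groups the algorithm ends with three points per cluster. On harder data the result depends on the random initialization, which is why it is usually run several times.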
If you are training an ensemble of bootstrap aggregated decision trees, what model are you actually using?
A Random Forest model! Random Forest is the bagged (bootstrap aggregated) version of decision trees, with the extra twist that each split only considers a random subset of the features.
What is a bag-of-words representation of text?
A bag-of-words represents a text as a count of occurrences of each of its words.
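A minimal sketch of the idea, using only the standard library. The tokenizer (lowercasing and keeping runs of letters) is a simplifying assumption; real vectorizers offer more options.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count the occurrences of each word, ignoring case and punctuation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


bow = bag_of_words("The cat sat on the mat. The mat was flat.")
print(bow["the"], bow["mat"], bow["cat"])  # 3 2 1

# Word order is lost: both sentences get the same representation.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))  # True
```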
What do ACF and PACF measure?
ACF: measures the simple correlation coefficients between Y(t) and each of its lagged values Y(t−i)
PACF: measures the partial correlation coefficients between Y(t) and each of its lagged values Y(t−i), i.e. with the effect of the intermediate lags removed
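For the ACF part, the usual sample estimator can be written in a few lines. This is a sketch on a made-up periodic series; the function name and data are assumptions, and the PACF (which needs a regression on the intermediate lags) is not covered here.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation between y(t) and y(t - lag)."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((y - mean) ** 2 for y in series)
    covariance = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance / variance


# Hypothetical series with a period of 4.
series = [1, 0, -1, 0] * 6

for lag in (0, 2, 4):
    print(lag, round(autocorrelation(series, lag), 3))
```

The series correlates strongly with itself at lag 4 (one full period) and negatively at lag 2 (half a period), which is the kind of pattern an ACF plot reveals.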
Name two methods of seasonal decomposition?
Additive Decomposition (y = Trend + Seasonal + Residuals)
Multiplicative Decomposition (y = Trend * Seasonal * Residuals)
What are the eigenvectors of Principal Component Analysis?
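They are the eigenvectors of the covariance matrix of the data: each one gives the direction of a principal component. In general, an eigenvector is a vector whose direction remains unchanged when a linear transformation is applied to it.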
What is the name of the statistical test used to test for stationarity?
The Augmented Dickey-Fuller (ADF) test. Consider the time series stationary when the test's p-value is below 0.05.
To which family of ensemble methods do Bagging and Boosting belong? Sequential or Parallel?
Bagging is a parallel ensemble method. Boosting is a sequential ensemble method.
Give 3 statistical properties of stationary time series that remain constant over time.
Constant mean, constant variance, and constant autocorrelation
How do you choose ARIMA hyperparameters?
Diff (d): minimum number of differencing steps needed to achieve stationarity
AR term numbers (p): number of significant (non-null) lags in the PACF plot of the stationary TS
MA term numbers (q): number of significant (non-null) lags in the ACF plot of the stationary TS
The Boosting ensemble technique can only be applied to decision trees. True or False?
False! Both bagging and boosting can be applied to any base algorithm.
Clustering is a supervised learning method that groups data points. True or False?
False! Clustering is an UNSUPERVISED learning method that groups data points.
Your pipeline is made up of a scaler and a linear model. If the pipeline's predict method is called, the scaler would be fitted to the data before transforming it. True or False?
False! When the pipeline's predict method is called, only the transformers' transform methods are called, using the parameters learnt during the original fit. The final model's predict is then applied to the transformed data.
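This behavior can be checked with a short scikit-learn sketch. The toy data and step names are assumptions; the point is only that the scaler's learnt mean does not change when predict is called on very different data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data following y = 2 * x.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0])
X_new = np.array([[100.0], [200.0]])  # on a very different scale

pipe = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])
pipe.fit(X_train, y_train)  # fits the scaler AND the model
mean_after_fit = pipe.named_steps["scaler"].mean_.copy()

predictions = pipe.predict(X_new)  # only transforms, then predicts
mean_after_predict = pipe.named_steps["scaler"].mean_

print(mean_after_fit, mean_after_predict)  # both are the training mean
print(predictions)
```

If the scaler were refitted at predict time, the new data would be centered on its own mean and the predictions would be meaningless.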
What is the difference between a grid search and a random search?
In a grid search, the data scientist specifies which hyperparameter values to test, and each possible combination is tested. In a random search, a defined number of random samples are drawn from a hyperparameter space and tested for performance. Around 50 sampled iterations is often cited as a good tradeoff between time and efficiency.
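The difference in how candidates are generated can be sketched without any ML library. The hyperparameter names, ranges, and seed below are hypothetical; in practice each candidate would then be scored with cross-validation.

```python
import itertools
import random

# Hypothetical hyperparameter grid.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0], "max_depth": [2, 4, 6]}

# Grid search: every combination of the listed values.
grid_candidates = [
    dict(zip(param_grid, values))
    for values in itertools.product(*param_grid.values())
]

# Random search: a fixed number of draws from a (possibly continuous) space.
rng = random.Random(42)
n_iter = 5
random_candidates = [
    {"alpha": 10 ** rng.uniform(-2, 1), "max_depth": rng.randint(2, 6)}
    for _ in range(n_iter)
]

print(len(grid_candidates))    # 12 combinations (4 * 3)
print(len(random_candidates))  # 5 samples, whatever the size of the space
```

The grid grows multiplicatively with every added hyperparameter, while the cost of the random search is fixed by n_iter.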
What does LDA stand for and what would you use it for?
In natural language processing, Latent Dirichlet Allocation (LDA) is a generative statistical model. It locates similarities between texts in the form of key words that belong to potential topics. The algorithm can be used to generate topics and label text data.
If you are "ensembling" multiple models for a classification task, how would you combine the predictions of the different models?
In the case of classification, you can perform a majority vote (hard voting) across the predictions of the models.
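A majority vote takes only a few lines. This is a sketch with made-up predictions from three hypothetical models; ties are not handled specially here (the first label encountered wins), so an odd number of models is assumed for binary tasks.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample."""
    n_samples = len(predictions_per_model[0])
    final = []
    for i in range(n_samples):
        sample_votes = [preds[i] for preds in predictions_per_model]
        final.append(Counter(sample_votes).most_common(1)[0][0])
    return final


# Hypothetical predictions of three models on four samples.
model_predictions = [
    ["spam", "ham", "spam", "ham"],
    ["spam", "spam", "spam", "ham"],
    ["ham", "ham", "spam", "spam"],
]
final = majority_vote(model_predictions)
print(final)  # ['spam', 'ham', 'spam', 'ham']
```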
What does the Sklearn tool ColumnTransformer do?
It allows you to perform parallel preprocessing operations on specific columns and package them as a single transformer. For example, scaling a numerical feature while encoding a categorical one.
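The scaling-plus-encoding example can be sketched as follows. The DataFrame, column names, and step names are made up for illustration; sparse_threshold=0 is set only so the result is a dense array that is easy to print.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical dataset with one numerical and one categorical column.
df = pd.DataFrame({"age": [20.0, 30.0, 40.0], "city": ["Paris", "Lyon", "Paris"]})

preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age"]),                          # scale the numbers
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),  # encode the categories
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocessor.fit_transform(df)
print(X.shape)  # (3, 3): one scaled column + two one-hot columns
print(X)
```

The resulting object behaves like any other transformer, so it can be used as the first step of a Pipeline.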
How does PCA function in mathematical terms?
It works by finding the eigenvectors and eigenvalues of the covariance matrix. Eigenvectors give the direction and eigenvalues indicate the variance of the data explained by a component.
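The mechanics can be reproduced with NumPy alone. This is a minimal sketch on randomly generated 2-D data (the data, seed, and variable names are assumptions); library implementations typically rely on an SVD instead, but the result is equivalent.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D data, stretched along one direction.
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 1.0]])

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # sort by decreasing variance
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

explained_variance_ratio = eigenvalues / eigenvalues.sum()
X_projected = X_centered @ eigenvectors[:, :1]  # keep the first principal component

print(explained_variance_ratio)
print(X_projected.var(ddof=1), eigenvalues[0])  # the two values match
```

The variance of the data projected on an eigenvector equals the corresponding eigenvalue, which is why eigenvalues are read as "variance explained".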
What is lemmatization?
Lemmatizing consists of reducing inflected or derived word forms down to their dictionary form, the lemma (e.g. "mice" becomes "mouse").
Name a few potential causes for overfitting, as well as a few approaches to combat it.
Overfitting happens when the model learns the training data too closely, almost by heart. When that happens, it is unable to generalize well to unseen data. You could add more data for training, use regularization (Ridge or Lasso), tune the model parameters, or try a simpler model.
What does PCA stand for and what is it?
PCA stands for Principal Component Analysis. It's a way to explain our matrix X by finding a K-dimensional orthogonal projection that preserves the greatest variance.
What are POS tags?
Part of Speech tags are attributed to each word in a text specifying its grammatical role within the sentence. They can be treated as a feature or used to improve certain algorithms like the WordNetLemmatizer.
Name a dimensionality reduction technique you learnt today?
Principal Component Analysis.
What is the difference between lemmatization and stemming?
Stemming algorithms work by cutting off the end or the beginning of the word using fixed rules, so the result is not always a real word. Lemmatization takes into consideration the morphological analysis of the words and returns their dictionary form.
What does TfIdf stand for and what does it calculate?
TfIdf stands for Term Frequency Inverse Document Frequency. It computes an importance value for each word, based on its frequency in its own text and its rarity across the entire corpus. That value is the product of the TF and the IDF.
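Here is a minimal sketch of that product using the textbook definitions, TF = count / document length and IDF = log(N / document frequency). The function name and the tiny corpus are assumptions, and libraries such as scikit-learn use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {word: tf * idf} dictionary per document."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


corpus = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(corpus)
print(scores[0])
```

In the first document, "the" scores 0 because it appears in every document, while "sat" (unique to that document) scores higher than "cat" (shared with another one).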
What is the k-means algorithm?
The K-means algorithm is an iterative unsupervised learning algorithm that tries to partition the dataset into K pre-defined distinct non-overlapping subgroups (clusters) where each data point belongs to only one group.
What text vectorization derives from the bag-of-words and captures some context?
The n-gram representation. An n-gram is simply a sequence of N words considered as a single token.
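A short sketch of how n-grams are built from a text, assuming a naive whitespace tokenizer (the function name is hypothetical):

```python
def ngrams(text, n):
    """Return the list of n-grams (as strings) found in a text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams("not good at all", 2)
print(bigrams)  # ['not good', 'good at', 'at all']
```

The bigram "not good" keeps the negation attached to the adjective, a piece of context that single-word counts would lose.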
The bag-of-words representation of a text is not able to capture context. True or False?
True! The bag-of-words representation disregards the order of words completely. It is only a representation of content.
In bagging, all weak learners have the same weight in the final prediction votes. True or False?
True! In bagging, all weak learners are attributed the same importance in the final vote. In boosting however, the best weak learners are attributed more weight in the final vote.
The parameters of a vectorizer can only be fine tuned in relation to a model. True or False?
True! The parameters of a vectorizer dictate the transformations applied to the text. Those transformations will then impact the performance of a model. As such, a vectorizer must be fine tuned in relation to the modelling objective.
Name a few potential causes for underfitting, as well as a few approaches to combat it.
Underfitting occurs when the model is unable to capture and learn a structure in the data. The model could be too simplistic, or the data not sufficiently informative. In that case, you could choose a more complex model or do some feature engineering.
What is vectorization and why must text be vectorized before modelling?
Vectorization transforms raw text into a numerical representation. It is necessary because machine learning algorithms cannot ingest raw text data.
How would you go about finding out the optimal number of clusters for your K-means algorithm?
You can use the elbow method: plot a distance metric of the points to their assigned cluster centers (e.g. the sum of squared distances, also called inertia) against K, and pick the K where the curve bends.
What technique would you use to reduce the dimensionality of an image?
You can use PCA to reduce the dimensionality of an image. Another possibility is to compress the image with a K-means algorithm applied to its pixel colors. The number of clusters K then corresponds to the number of colors you want your original image to be compressed to.
How would you visualize the output of a hierarchical clustering model?
You could use a dendrogram. A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering.
You are doing a binary classification task, testing whether a patient has a virus (e.g. COVID-19), and you don't want to miss any COVID-19 case. It is ok to have some false alarms. Which performance metric should you look at?
You should be looking at the recall score: the ability to detect occurrences of the class of interest.
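Recall is the share of actual positives that the model catches, TP / (TP + FN). A minimal sketch on made-up labels (1 = infected) follows; the function assumes at least one actual positive is present.

```python
def recall(y_true, y_pred, positive=1):
    """Share of actual positives that were predicted as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp / (tp + fn)


# Hypothetical test results: 4 infected patients, 4 healthy ones.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

print(recall(y_true, y_pred))  # 0.75: one infected patient out of four was missed
```

The two false alarms among healthy patients do not affect recall at all; they would lower the precision instead.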