ML w/ Databricks
Naive Bayes
A probability model based on the Bayes Theorem that assumes independence between features to make predictions.
What is Hyperopt?
A Python library for automating the process of hyperparameter tuning for ML models.
Dropout
A regularization technique for neural networks that randomly deactivates a fraction of neurons at each training step, which helps prevent overfitting.
Gradient Descent
A simple but effective iterative optimization method that can be used to find the minimum of a function. Spark ML uses it to minimize the error of a linear regression model.
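A minimal sketch of the idea, assuming a one-feature linear model and a mean squared error loss. The function name, learning rate, and step count are illustrative choices, not anything Spark ML prescribes:

```python
def fit_line_gradient_descent(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y ~ w * x + b by minimizing mean squared error with gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Prediction errors under the current parameters.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of MSE = (1/n) * sum(error^2) with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = fit_line_gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))
```

If the learning rate is too large the updates diverge; if it is too small, convergence is slow.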
Target Encoding
A technique used to encode categorical variables based on the target variable in a regression or classification problem. It replaces each category with a numerical value that represents the average of the target variable for that category. It provides the model with valuable information about the relationship between the categorical feature and the target, which can improve predictive performance.
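The replacement of each category by its target mean can be sketched in plain Python. The column values below are made up for illustration. In practice the means would be learned on the training split only, so that target information does not leak into validation data:

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category with the mean target value observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {category: sums[category] / counts[category] for category in sums}
    return [means[category] for category in categories], means


cities = ["A", "B", "A", "C", "B", "A"]          # hypothetical categorical feature
prices = [10.0, 20.0, 30.0, 40.0, 40.0, 20.0]    # hypothetical target
encoded, mapping = target_encode(cities, prices)
print(mapping)
print(encoded)
```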
Supervised Learning
A type of machine learning where algorithms are trained by providing explicit examples of results sought, like defective versus error-free, or stock price.
Unsupervised Learning
A type of model creation, derived from the field of machine learning, that does not have a defined target variable.
Classification Evaluation Metrics
Accuracy, Precision, Recall, F1-score, Log loss, ROC/AUC
applyInPandas
A method on grouped data (called after a groupBy) that allows you to apply a function that takes a pandas DataFrame and returns another pandas DataFrame, treating each group as a whole.
Tree of Parzen Estimators (TPE) Optimization
An approach that belongs to the family of Bayesian optimization techniques for hyperparameter tuning.
Random Forest
An ensemble model that combines multiple decision trees to make predictions. It aggregates predictions of individual trees which reduces overfitting.
Apache Arrow
An in-memory columnar data format used in Apache Spark to efficiently transfer data between JVM and Python processes. This format enables efficient data processing and transformation for Python developers working with pandas and NumPy data.
Linear Regression
Assumes a linear relationship between the input features and the target variable. It fits a formula that calculates the target as a weighted sum of the input features plus an intercept.
RFormula
Automatically transforms categorical data by string indexing or one hot encoding.
What are the ways to combat underfitting?
Can be corrected by selecting a more powerful model, including better features, and reducing constraints.
What are the ways to combat overfitting?
Can be corrected by simplifying the model, gathering more training data, and reducing the noise in the training data.
Nonlinear Regression
Captures nonlinear relationships between input features and the target variable.
StringIndexer
Commonly used when you want the machine learning algorithm to identify a column as a categorical variable or when you want to convert textual data to numeric data while preserving the categorical context (e.g., converting days of the week to numeric representation).
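A plain-Python sketch of the string-to-index mapping, assuming the frequency-descending ordering that Spark's StringIndexer uses by default (most frequent label gets index 0, ties broken alphabetically). The function name is hypothetical and this is only an imitation of the behavior, not the Spark implementation:

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


days = ["Mon", "Tue", "Mon", "Wed", "Mon", "Tue"]
mapping = fit_string_index(days)
indexed = [mapping[day] for day in days]
print(mapping)
print(indexed)
```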
What are the key aspects of feature engineering?
Creating Features: This involves transforming raw data into features that better represent the underlying problem to the predictive models. Examples include extracting parts of dates (like day of the week from a date column), creating interaction terms, or aggregating data. Feature Transformation: This includes normalization or standardization of features, converting categorical variables into numerical ones (e.g., using one-hot encoding), or applying transformations like log or square root to change the data distribution. Feature Selection: Identifying the most relevant features to use in model building. It helps in reducing the dimensionality of the data, improving model performance, and reducing overfitting. Domain Knowledge Incorporation: Using expert knowledge about the domain to create features that machine learning algorithms wouldn't be able to create on their own.
Neural Networks
Deep learning models consisting of multiple layers of interconnected nodes (neurons) that can learn complex patterns and relationships in data.
Model Parallelism
Dividing the model into smaller parts and processing each part on a separate node or processor.
OHE Gotcha
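One-hot encoding a high-cardinality categorical feature produces many sparse dummy variables. Tree-based algorithms then tend to give continuous variables more importance than the dummy variables when choosing splits, which obscures the order of feature importance. Because of this, OHE is generally best avoided with random forest models.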
Benefits of Using Hyperopt
Efficient Hyperparameter Tuning: Automates the often tedious and time-consuming task of hyperparameter tuning, making it more efficient and systematic. Improved Model Performance: By systematically searching through the hyperparameter space, Hyperopt can help in finding the optimal settings that improve the model's performance. Ease of Use: Integration with Databricks simplifies the usage of Hyperopt, especially for users already working within the Databricks ecosystem. Scalability and Speed: Leveraging Databricks' cloud infrastructure and Hyperopt's support for parallelization, the hyperparameter tuning process can be significantly accelerated. Experimentation and Tracking: With MLflow integration, every aspect of the hyperparameter tuning process can be tracked and analyzed, aiding in reproducibility and analysis.
Spark ML Decision Tree
Environment: Spark MLlib is designed for distributed computing and can run on a cluster of machines. It's built on top of Apache Spark, a big data processing framework. Use Case: Suitable for large-scale data processing where the data is too large to fit in the memory of a single machine. Implementation: Written to work within the Spark framework, typically using Scala, Python, or Java. It leverages Spark's distributed computing capabilities. Performance: Optimized for distributed computing, which can handle very large datasets efficiently by splitting the work across multiple nodes. Scalability: Highly scalable and can handle big data processing tasks. The performance scales effectively with the addition of more computing resources. Community and Ecosystem: Part of the Apache Spark ecosystem, which is renowned for big data processing and analytics.
Single Node scikit-learn Decision Tree
Environment: scikit-learn is designed for single-node, in-memory computation. It doesn't natively support distributed computing across multiple nodes. Use Case: Ideal for small to medium-sized data that fits into the memory of a single machine. Implementation: Written in Python, and it's widely used for its simplicity and ease of integration with other Python libraries. Performance: Works efficiently on datasets that can be accommodated in the memory of a single machine. For very large datasets, performance can be a bottleneck due to memory and computational constraints. Scalability: Not suitable for big data applications where the data cannot be loaded into a single machine's memory. Community and Ecosystem: Part of the broader scikit-learn ecosystem, which is very popular in the data science community for prototyping and development of machine learning models.
Sampling Noise
Error associated with sampling a small dataset.
Type II Error
False negative
Type I Error
False positive
Early Stopping of Epochs
Form of regularization used while training a model with an iterative method, such as gradient descent: training is halted once the validation error stops improving.
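One common way to implement this is a "patience" counter over per-epoch validation losses. The sketch below assumes the losses are already available as a list. The function name and the patience value are illustrative:

```python
def early_stopping(validation_losses, patience=2):
    """Return (best_epoch, stopped_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = None
    stopped_epoch = None
    waited = 0  # epochs since the last improvement
    for epoch, loss in enumerate(validation_losses):
        stopped_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch, stopped_epoch


losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.5]  # hypothetical validation losses
best_epoch, stopped_epoch = early_stopping(losses, patience=2)
print(best_epoch, stopped_epoch)
```

With a patience of 2, training stops at epoch 4 and the later improvement at epoch 5 is never seen, which is the trade-off a larger patience value addresses.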
Transformer
Generates a new df with appended columns based on rules.
F1-score
Harmonic mean of precision and recall. Should be favored when there is a substantial imbalance between the positive and negative classes and both false positives and false negatives matter, since a low value of either precision or recall pulls the score down.
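A small sketch of the harmonic mean, with made-up precision and recall values, showing how it penalizes an imbalance between the two more than an arithmetic mean would:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(0.5, 0.5)
lopsided = f1_score(0.9, 0.1)       # arithmetic mean would also be 0.5
print(balanced, round(lopsided, 2))
```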
Data Snooping Bias
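Occurs when looking at the test data influences modeling decisions, for example when characteristics noticed in the data cause a gravitation toward a specific type of model. The resulting estimate of generalization error is overly optimistic.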
How to access the metrics of an MLflow run?
In Databricks MLflow, once you have retrieved the most recent run from an experiment using the MLflow client, you can access the metrics of that run by using runs[0].data.metrics. This will return a dictionary containing the metric keys and their respective values.
OHE requires StringIndexer?
In Spark ML, StringIndexer should be used prior to OneHotEncoder when dealing with string-valued categorical features. The StringIndexer converts each string in its input column to an index. It handles unseen labels during transformation by assigning them the index of a special additional bucket, by skipping the affected rows, or by throwing an exception. These indices can then be effectively one-hot encoded.
How to impute a column in a spark df?
In Spark ML, to use an imputer, you first need to fit it to your data to create an 'ImputerModel'. This model learns the imputation values (for example, the median values of the columns) from the input dataset. After the ImputerModel is fitted, you can then use the 'transform' method to impute the missing values in your dataset.
Train-Validation Split
In the training-validation split method, the dataset is divided into two parts: Training Set: A larger portion of the dataset (often around 70-80%) used to train the model. Validation Set: The remaining portion (around 20-30%) used to evaluate the model's performance. Process: The model is trained on the training set. The trained model is then tested on the validation set to evaluate its performance. When to Use: Quick Evaluation: When a rapid, initial assessment of the model's performance is needed. Large Datasets: In cases where the dataset is large enough that a single held-out validation set gives a reliable estimate. Less Computational Resources: It requires fewer computational resources compared to k-fold cross-validation, as the model is trained fewer times.
Precision / Recall Tradeoff
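Increasing precision typically reduces recall, and vice versa. For example, raising the decision threshold yields fewer false positives but more false negatives.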
Random Search Optimization
Involves randomly selecting combinations of hyperparameters to train the model.
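A minimal sketch of random search in plain Python. The objective here is a toy function standing in for "train a model and return its validation loss", and the search space bounds are arbitrary assumptions:

```python
import random


def random_search(objective, space, n_trials, seed=0):
    """Sample hyperparameters uniformly at random and keep the best trial."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        # Draw one value per hyperparameter from its (low, high) range.
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


def toy_objective(params):
    # Stand-in for a validation loss; minimized at x = 2, y = -1.
    return (params["x"] - 2.0) ** 2 + (params["y"] + 1.0) ** 2


space = {"x": (-5.0, 5.0), "y": (-5.0, 5.0)}
best_params, best_loss = random_search(toy_objective, space, n_trials=500)
print(best_params, best_loss)
```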
Feature Engineering
Involves selection and extraction to produce the most useful features to train.
The Kernel Trick
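Uses a kernel function to compute similarities as if the data had been transformed into a higher-dimensional space that has a clear dividing margin between classes of data, without computing that transformation explicitly. It allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space.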
VectorAssembler
Is a type of transformer that combines a given list of columns into a single vector column. It can handle various types of input columns - real numbers, booleans, vectors, etc., and simplifies the workflow.
Estimator
Is an algorithm which can be fit on a df to produce a transformer.
K-fold Cross-Validation
K-fold cross-validation is a more systematic and thorough method of evaluating model performance: Divide the Dataset: The dataset is divided into 'k' equal-sized subsets. Training and Validation Cycles: For each of the 'k' folds: a fold is kept as the validation set; the remaining 'k-1' folds are combined to form the training set; the model is trained on the 'k-1' folds and validated on the remaining fold. Average Performance: The performance measure reported is the average of the performance on each of the 'k' folds. When to Use: Robust Evaluation: When a more robust assessment of the model's performance is needed. Small to Medium Datasets: Effective when data is limited, as it makes more efficient use of data for both training and validation. Reducing Bias: Helps in reducing bias and variance in the model evaluation, as every data point gets to be in a validation set exactly once and in a training set 'k-1' times. Hyperparameter Tuning: Ideal for hyperparameter tuning as it provides a thorough understanding of how the model performs across different subsets of the dataset.
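The fold construction described above can be sketched with index lists. This version does not shuffle, which real usage normally would, and the function name is illustrative:

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) once per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print(len(train), validation)
```

The reported score would then be the average of the metric computed on each validation fold.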
Unsupervised Learning Algorithms
K-means clustering, DBSCAN, anomaly detection, novelty detection, visualization and dimensionality reduction
Supervised Learning Algorithms
K-nearest neighbors, linear regression, logistic regression, support vector machines, decision trees and random forests, neural networks
Regression Evaluation Metrics
Mean Squared Error (MSE), Root Mean Squared Error (RMSE; note: reported on a log scale when the target was log-transformed), Mean Absolute Error (MAE), R-squared (coefficient of determination)
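For reference, the four metrics can be computed from their textbook definitions in a few lines. The sample values are made up:

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R-squared from their definitions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)        # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(r) for r in residuals) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 11.0])
print(metrics)
```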
MCAR
Mean imputation is most appropriate for handling missing values when the data is missing completely at random (MCAR). In this case, missing values are not related to any other variable, and mean imputation does not introduce significant bias. For other types of missingness, more advanced techniques may be necessary to avoid bias and maintain the integrity of the dataset.
fmin
Minimizes the objective function by searching for optimal hyperparameter combinations. It takes an objective function, a search space definition, a search algorithm, and a maximum number of evaluations as input and returns the best-found hyperparameter combination that minimizes the objective function.
Accuracy
Number of correct predictions / total number of predictions. Does not account for class imbalance.
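A short illustration, on made-up labels, of why accuracy alone can mislead on imbalanced classes: a model that always predicts the majority class still scores highly while finding none of the positives.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


y_true = [0] * 95 + [1] * 5      # 5% positive class
always_negative = [0] * 100      # trivial model: never predicts the positive class
acc = accuracy(y_true, always_negative)
positives_found = sum(1 for t, p in zip(y_true, always_negative) if t == 1 and p == 1)
print(acc, positives_found)
```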
Underfitting
Occurs when a machine learning model has poor predictive abilities because it did not learn the complexity in the training data.
One Hot Encoding (OHE)
Process used to convert a categorical feature into a set of binary (dummy) variables, one per category, suitable for machine processing.
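A plain-Python sketch of the encoding. This version keeps one column per category, whereas Spark's OneHotEncoder drops the last category by default. The function name is illustrative:

```python
def one_hot_encode(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1   # exactly one column is "hot" per row
        rows.append(row)
    return categories, rows


categories, rows = one_hot_encode(["red", "green", "blue", "green"])
print(categories)
print(rows)
```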
When there is a risk of data leakage from earlier steps in the pipeline...
Putting the Pipeline inside the CrossValidator is the safest approach when there is a risk of data leakage from earlier steps in the pipeline. By doing this, the CrossValidator first splits the data and then fits the pipeline, ensuring that there is no leakage of information from the hold-out set to the train set.
Impute
Replacing missing values with something intentional like the median or mean.
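A minimal sketch of median imputation on a single column, assuming missing values are represented as None:

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values], fill


imputed, fill = impute_median([1.0, None, 3.0, 10.0, None])
print(fill, imputed)
```

The median is less sensitive to outliers (such as the 10.0 here) than the mean, which is one reason it is a common choice.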
Model Training Common Errors
Sampling Noise, Sampling Bias, Overfitting, Underfitting, Data Snooping Bias
Spark ML DT vs sklearn DT
Scalability and Data Size: The most significant difference is scalability. scikit-learn's Decision Tree is limited by the memory of a single machine, whereas Spark ML's Decision Tree can handle larger datasets by distributing the computation. Performance: For smaller datasets, scikit-learn might be faster since it doesn't have the overhead of distributed computing. For larger datasets, Spark ML is more efficient. Ease of Use: scikit-learn is generally easier to use and more accessible for beginners or for projects with smaller datasets. Integration: scikit-learn integrates well with the Python ecosystem, making it suitable for projects that require a variety of Python libraries. Spark ML is better for projects already within the Spark ecosystem or those that require integration with other big data tools. Deployment: scikit-learn models are easier to deploy in small-scale or standalone applications. Spark ML models are suited for deployment in distributed environments.
Scatter Plots on Databricks
Scatter plots in Databricks are based on the top N dataset, not the entire dataset. Scatter plots are useful for visualizing the relationship between two continuous variables, but they may not capture the full extent of the data if the entire dataset is too large.
Hyperopt Usage
Search Algorithms: Supports various algorithms for searching the hyperparameter space, such as: Random Search: Randomly samples hyperparameters from the defined search space. Tree of Parzen Estimators (TPE): A Bayesian approach that models the function and uses past trial information to choose the next hyperparameters to evaluate. Adaptive TPE: An advanced version of TPE that adapts the search based on previous results. Search Space Specification: Users can define the hyperparameter space flexibly, specifying ranges, distributions, and discrete choices for hyperparameters. Parallelization: It can leverage parallel computing resources to speed up the search process.
What are scenarios to use Naive Bayes?
Spam Detection: Classifying emails as spam or not spam. Document Categorization: Classifying news articles into different categories. Sentiment Analysis: Determining whether a given text expresses positive, negative, or neutral sentiments.
Train-Test-Split
Splitting the Dataset: The dataset is divided into two parts: Training Set: A subset of the data used to train the model. Typically, this is a larger portion of the dataset (commonly 70-80%). Test Set: The remaining part of the dataset used to test the model. This set is not used in any way to train the model (commonly 20-30%). Training the Model: The model is trained exclusively on the training set. This process involves the model learning the relationships between the input features and the target variable. Testing the Model: After training, the model is used to make predictions on the test set. Since the test set contains data that the model hasn't seen before, it serves as a proxy for new, real-world data. Evaluating Performance: The model's performance is then evaluated based on how accurately it predicts the target variable in the test set. Common metrics include accuracy, precision, recall, F1 score for classification tasks, and mean squared error, mean absolute error for regression tasks. Importance: Model Assessment: It provides a straightforward way to evaluate the performance and generalization ability of the model. Overfitting Detection: Helps in detecting overfitting, where a model performs well on training data but poorly on unseen data. Bias-Variance Trade-off: Assists in understanding the balance between underfitting (high bias) and overfitting (high variance).
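The splitting step can be sketched as a seeded shuffle followed by a slice. The 80/20 ratio and the seed are illustrative defaults:

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows reproducibly and split them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(range(10), test_fraction=0.2)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so the test set stays unseen across repeated runs.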
Regularization
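Technique to reduce the complexity of the model by constraining it (for example, by penalizing large weights), which lowers the risk of overfitting.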
Specificity
The true negative rate: True Negatives / (True Negatives + False Positives). The classification metric to use when false positives are unacceptable, i.e., when it is better to have false negatives than false positives.
Boosting
The ensemble process of training machine learning models sequentially with each model learning from the errors of the preceding models.
Pandas API vs native Spark df
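Plain pandas uses eager evaluation of all operations on a single machine, which can cause it to be slower than native Spark DataFrames for larger datasets. Eager evaluation means that operations are executed immediately as they are called, as opposed to the lazy evaluation used in native Spark DataFrames, where operations are optimized and computed only when needed. The pandas API on Spark runs on top of Spark DataFrames but carries additional metadata (such as an index), and this overhead can also make it slower than native Spark DataFrames.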
Bagging vs Boosting
The primary difference between bagging and boosting in the context of ensemble methods is that bagging trains weak learners in parallel, while boosting trains them sequentially. This difference in training approach leads to different characteristics and strengths for each method.
Binning
The process of converting numeric into categorical data by grouping continuous data into discrete bins or intervals.
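A small sketch using the standard library's bisect module. The cut points and labels are hypothetical. Each value falls into the interval whose lower edge is inclusive:

```python
from bisect import bisect_right


def bin_values(values, edges, labels):
    """Assign each value to a labeled bin; needs len(labels) == len(edges) + 1."""
    return [labels[bisect_right(edges, value)] for value in values]


ages = [5, 17, 18, 42, 65, 80]
edges = [18, 65]                       # bins: below 18, 18 to 64, 65 and above
labels = ["minor", "adult", "senior"]
binned = bin_values(ages, edges, labels)
print(binned)
```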
Overfitting
The process of fitting a model too closely to the training data, which ultimately hurts its performance on new, unseen data.
Validating Number of Bins in a DT
To verify that the number of bins (maxBins) in a Databricks Decision Tree is sufficient, you can check that it is equal to or greater than the number of different category values in each categorical column. This can be done by performing a group by operation on the column to count the distinct category values.
Classification Model
Trained to predict the category or class label of an input data point based on its features.
Data Parallelism
Training a single model on multiple subsets of data simultaneously.
Recall (Sensitivity, True Positive Rate)
True Positives / (True Positives + False Negatives)
Precision
True Positives / (True Positives + False Positives) Does not account for the missed positive cases.
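Both formulas can be checked on a small set of made-up binary labels by first counting the confusion-matrix cells:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
print(tp, fp, fn, tn, precision, recall)
```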
Validation Best Practices
Use train-test split for large datasets, limited computational resources, or for an initial quick evaluation. Opt for k-fold cross-validation for small datasets, complex models, or when you require a more robust estimate of model performance. Consider stratified k-fold cross-validation for imbalanced datasets. Use time series specific methods for time-series data.
Regression Model
Used for predictions of continuous numerical values based on input features. It aims to estimate a continuous target variable.
How to calculate the total number of models that can be trained simultaneously during a 3 fold cross validation?
When conducting a grid search with 3-fold cross-validation, the total number of models that can be trained in parallel is determined by the number of distinct hyperparameter combinations and the number of folds. For example, a hyperparameter grid with 3 options for Hyperparameter 1 and 2 options for Hyperparameter 2 gives a total of 3 x 2 = 6 unique combinations. With 3 folds for each combination, the total number of models that can be trained simultaneously during this process is 6 x 3 = 18 (number of combinations x number of folds).
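The count can be reproduced by enumerating the grid. The hyperparameter names and values below are placeholders:

```python
from itertools import product

# Hypothetical grid: 3 options for the first hyperparameter, 2 for the second.
param_grid = {
    "hyperparameter_1": [1, 2, 3],
    "hyperparameter_2": ["a", "b"],
}
n_folds = 3

combinations = list(product(*param_grid.values()))
n_model_fits = len(combinations) * n_folds   # one fit per combination per fold
print(len(combinations), n_model_fits)
```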
Classification/Regression Frameworks
lightgbm, sklearn, xgboost
DBR for ML
- built-in popular ML libs like TensorFlow, PyTorch, Keras, and XGBoost - built-in distributed training libs like Horovod - GPU support - faster cluster creation
When are single node clusters ideal for ML?
- when only using Spark to read and save data - when performing lightweight exploratory analysis - when working with a dataset that fits into a single node
Z-Ordering
A Delta Lake write optimization technique that colocates related information in the same set of files, improving query performance by minimizing the amount of data that needs to be read.
Feature Store
A component of Databricks Machine Learning that provides data teams with the ability to streamline their work by creating new features to use in their machine learning models, exploring and reusing existing features, and building training data sets.
Sampling Bias
A flawed sampling process that produces an unrepresentative sample.
Logistic Regression
A linear model used for binary classification that calculates the probability of an instance belonging to a particular class.
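The probability calculation is a linear score passed through the sigmoid function. In this sketch the weights and bias are arbitrary stand-ins for values a trained model would have learned:

```python
import math


def predict_probability(features, weights, bias):
    """Probability of the positive class: sigmoid of the linear score."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))


weights = [1.5, -2.0]   # hypothetical learned coefficients
bias = 0.5              # hypothetical learned intercept
p_boundary = predict_probability([1.0, 1.0], weights, bias)   # score = 0
p_positive = predict_probability([2.0, 0.0], weights, bias)   # score = 3.5
print(p_boundary, round(p_positive, 4))
```

A score of zero lies exactly on the decision boundary, which corresponds to a probability of 0.5.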
Grid Search Optimization
A methodical approach to hyperparameter tuning that involves exhaustively searching through a manually specified subset of the hyperparameter space of a learning algorithm.
Support Vector Machine
A model that finds an optimal hyperplane to separate different classes in the feature space. It is particularly well-suited for complex but small- or medium-sized datasets.
Decision Tree
A model that partitions the feature space based on a series of if-else conditions, creating a flowchart-like structure. Can be used for both classification and regression tasks.
Kernel Density Estimation (KDE) Plot
A non-parametric way to estimate the probability density function (PDF) of a random variable. It is used in statistics and data analysis to create a smooth, continuous representation of the distribution of a dataset, particularly useful for visualizing the underlying distribution of the data.
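A minimal sketch of a KDE with a Gaussian kernel, assuming a fixed, hand-picked bandwidth (libraries usually choose one automatically). The estimate is the average of one small Gaussian bump centered on each sample:

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density at x with a Gaussian kernel."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum the Gaussian bump contributed by every sample at point x.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


samples = [1.0, 2.0, 2.5, 4.0]   # made-up observations
density = gaussian_kde(samples, bandwidth=0.5)
print(density(2.0), density(6.0))
```

A smaller bandwidth gives a spikier curve that follows individual points, while a larger one gives a smoother but less detailed curve.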
