Interview Prep
Convergence. [E] How do we know when a model has converged?
A model is considered to have converged when its performance or the values of its parameters stabilize, and further training iterations result in negligible improvement. This can be observed by monitoring metrics such as loss or accuracy on a validation set, where stability indicates that the model has reached a satisfactory state and additional training may not yield substantial benefits.
Train, valid, test splits. [E] Why do we need a validation set on top of a train set and a test set?
A validation set lets you tune hyperparameters and compare candidate models during development without touching the test set. Because model-selection decisions are based on validation performance, the test set remains an unbiased estimate of how the final model will generalize to new, unseen data when deployed in real-world scenarios.
Sample duplication [M] What happens if we accidentally duplicate every data point in your train set or in your test set?
Duplicating every training example adds no new information: it is roughly equivalent to making an extra pass over the same data, and for regularized models it weakens the penalty term relative to the data term. It can also distort cross-validation if copies of the same example land in both the training and validation folds. Duplicating every test example leaves average metrics such as accuracy or F1 unchanged (each prediction is simply counted twice) but doubles evaluation cost. The more dangerous case is when the same example ends up in both the train and test splits, which inflates test performance and misleads about true generalization.
Bagging and boosting are two popular ensembling methods. Random forest is a bagging example while XGBoost is a boosting example. [M] How are they used in deep learning?
Bagging and boosting are ensemble learning techniques, but they differ in their approach. Bagging, as exemplified by random forests, trains multiple independent models in parallel on different bootstrap samples of the data and averages their predictions, primarily to reduce variance. Boosting, as in AdaBoost and gradient boosting, builds models sequentially, with each new learner focusing on the errors of the current ensemble, primarily to reduce bias. In deep learning, bagging-style ideas are the more common: independently trained networks (different initializations, seeds, or data subsets) are averaged at prediction time, and techniques such as dropout can be viewed as an implicit ensemble of subnetworks. Boosting is used less often, since a single deep network is already a high-capacity learner and training many networks sequentially is expensive.
[M] What problems might we run into when deploying large machine learning models?
Challenges in deploying large ML models:
Resource intensiveness: high computational demands and increased infrastructure costs.
Latency: difficulty meeting low-latency requirements for real-time applications.
Memory constraints: operational issues on devices with limited memory.
Network bandwidth: large models are slow to transmit, especially over constrained links.
Scalability: inefficiencies when scaling large models across distributed systems.
Maintenance and updates: versioning and regularly updating large models is complex.
Interpretability: interpretability decreases as model complexity increases.
Regulatory compliance: added intricacy in complying with regulatory frameworks, especially for sensitive data.
Energy consumption: significant cost and environmental impact.
Overfitting: very large models can capture noise in the training data, reducing generalization.
Class imbalance. [E] Why is it hard for ML models to perform well on data with class imbalance?
Challenges of class imbalance for ML models:
Bias toward the majority class: models tend to prioritize predicting the majority class, leading to biased outcomes.
Limited exposure to the minority class: too few minority examples during training hinders learning its patterns.
Misleading metrics: accuracy can be high while performance on the minority class is poor.
Generalization: difficulty generalizing from imbalanced data hurts performance on new instances.
k-means and GMM are both powerful clustering algorithms. [M] When would you choose one over the other?
Choose k-means when computational efficiency and simplicity in cluster shapes are priorities, especially with large datasets. Opt for Gaussian Mixture Model (GMM) when dealing with datasets where clusters have non-spherical shapes or when a probabilistic framework for cluster assignments is preferred. GMM is more flexible and accommodating of diverse cluster structures but may require more computation.
Parametric vs. non-parametric methods. [H] When should we use one and when should we use the other?
Choosing between parametric and non-parametric methods:
Parametric methods: use when there is prior knowledge about the underlying structure and the assumptions align with the data distribution. They are typically computationally efficient and need less data to estimate their fixed set of parameters.
Non-parametric methods: use when little is known about the underlying structure or when parametric assumptions are unlikely to hold. They are more flexible and can capture complex relationships across diverse data distributions, usually at a higher computational and data cost.
Classification vs. regression. [E] Can a classification problem be turned into a regression problem and vice versa?
Classification to regression: assign numerical values to the classes, e.g. predicting an average weight for each animal class instead of the class label itself.
Regression to classification: discretize the continuous output into classes, e.g. categorizing house prices as "low," "medium," or "high."
Feature selection. [M] What are some of the algorithms for feature selection? Pros and cons of each.
Common feature selection algorithms include Recursive Feature Elimination (RFE), LASSO (L1 regularization), and Tree-based methods like Random Forest feature importance. RFE systematically removes the least important features, LASSO enforces sparsity by penalizing less informative features, and Tree-based methods measure feature importance based on how much they improve model performance. Each method has pros and cons, with considerations such as interpretability, computational efficiency, and robustness to different data types influencing the choice of algorithm for a specific task.
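A minimal scikit-learn sketch of the three approaches on synthetic data (the choice of base estimator, penalty strength, and forest size are illustrative assumptions, not recommendations):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# RFE: recursively drop the least important features according to a base estimator
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print("RFE keeps features:", list(rfe.support_.nonzero()[0]))

# LASSO: the L1 penalty drives coefficients of uninformative features to exactly zero
# (here the 0/1 labels are treated as a regression target purely for illustration)
lasso = Lasso(alpha=0.05).fit(X, y)
print("Non-zero Lasso coefficients:", int((lasso.coef_ != 0).sum()))

# Tree-based importance: impurity-based importances from a random forest
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("Top 5 features by importance:", list(rf.feature_importances_.argsort()[::-1][:5]))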
[M] How to determine whether two sets of samples (e.g. train and test splits) come from the same distribution?
Comparing sample distributions:
Visual inspection: plot histograms or kernel density estimates of each feature for both splits.
Quantitative tests: use two-sample tests such as Kolmogorov-Smirnov or Anderson-Darling.
Feature comparison: compare summary statistics and key feature characteristics across the splits.
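For the quantitative route, a two-sample Kolmogorov-Smirnov test from SciPy can be applied feature by feature; the arrays here are synthetic stand-ins for one feature taken from each split:

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)   # e.g. one feature from the train split
test_feature = rng.normal(loc=0.3, scale=1.0, size=500)     # the same feature from the test split

stat, p_value = ks_2samp(train_feature, test_feature)
# A small p-value suggests the two samples are unlikely to come from the same distribution
print(f"KS statistic={stat:.3f}, p-value={p_value:.4f}")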
Production Cycle. ML model development
Create training datasets, label them, generate features, train models, optimize the models, and evaluate them.
[M] How does data sparsity affect your models?
Data sparsity, characterized by limited or missing values, poses challenges for machine learning models, affecting their ability to generalize, increasing the risk of overfitting, and complicating feature importance identification. Addressing data sparsity requires careful consideration of model complexity, regularization, and appropriate handling of missing values to ensure effective learning and generalization.
Empirical risk minimization. [E] Why is it empirical?
Empirical Risk Minimization (ERM): "Empirical": Stresses reliance on observed data.
Feature leakage [E] What are some causes of feature leakage?
Feature leakage occurs when information from the target variable is inadvertently included in model features, leading to inflated training performance but poor generalization. Causes include using future information, preprocessing errors, target-related statistics, time-based issues, proxy variables, data transformation problems, and pitfalls in feature engineering. Preventing leakage involves careful data preprocessing, feature creation, and ensuring separation between training and validation sets.
Feature selection. [E] Why do we use feature selection?
Feature selection is used to enhance model performance and interpretability by identifying and retaining the most relevant features while discarding irrelevant or redundant ones. It helps mitigate the curse of dimensionality, reduces computational complexity, and can prevent overfitting, leading to more efficient and effective machine learning models.
[E] Suppose you want to build a model to predict the price of a stock in the next 8 hours and that the predicted price should never be off more than 10% from the actual price. Which metric would you use?
For a stock price prediction model with a requirement that the predicted price should not deviate by more than 10% from the actual price, you would use a metric like Mean Absolute Percentage Error (MAPE). MAPE calculates the average percentage difference between predicted and actual values, providing a measure that aligns with the specified constraint on prediction accuracy for financial forecasting.
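A minimal sketch of the metric itself (the numbers are made up; in practice you would also check the worst single-prediction percentage error against the 10% requirement, not just the average):

import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

actual = np.array([100.0, 102.0, 98.0])
predicted = np.array([104.0, 99.0, 105.0])
print(f"MAPE: {mape(actual, predicted):.2f}%")
# The 10% constraint can be verified per prediction as well:
print("All within 10%:", bool(np.all(np.abs((actual - predicted) / actual) <= 0.10)))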
[M] You're building a neural network and you want to use both numerical and textual features. How would you process those different features?
For numerical features in a neural network, standardize or normalize the data to ensure consistent scales. For textual features, use techniques like tokenization and embedding layers to convert text into numerical representations. These representations can then be combined with the numerical features as inputs to the neural network for joint training.
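A minimal PyTorch sketch of one way to combine the two feature types. The dimensions and the bag-of-embeddings text encoder are illustrative assumptions; it presumes the text has already been tokenized to integer ids and the numeric features have already been standardized:

import torch
import torch.nn as nn

class TextAndNumericModel(nn.Module):
    """Embed token ids, average them, and concatenate with standardized numeric features."""
    def __init__(self, vocab_size=10000, embed_dim=32, num_numeric=8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.head = nn.Sequential(
            nn.Linear(embed_dim + num_numeric, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, token_ids, numeric):
        text_vec = self.embedding(token_ids).mean(dim=1)   # (batch, embed_dim) bag of embeddings
        combined = torch.cat([text_vec, numeric], dim=1)   # join the two feature types
        return self.head(combined)

model = TextAndNumericModel()
tokens = torch.randint(0, 10000, (4, 20))   # 4 examples, 20 token ids each
numeric = torch.randn(4, 8)                 # 4 examples, 8 standardized numeric features
print(model(tokens, numeric).shape)         # torch.Size([4, 1])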
Bias-variance trade-off [M] How do you know that your model is low variance, high bias? What would you do in this case?
If your model is low variance and high bias, it may perform poorly on both training and validation/test data. Signs include a consistent gap between the model's performance and the desired outcome. To mitigate this, consider increasing model complexity, using more expressive models, or incorporating additional relevant features to reduce bias and improve the model's ability to capture underlying patterns.
Class imbalance. [E] How would class imbalance affect your model?
Impact of class imbalance on a model:
Bias toward the majority class: the model favors predictions of the majority class.
Poor generalization: suboptimal performance on the minority class.
Misleading metrics: overall accuracy may look high while minority-class performance is poor.
Addressing imbalance: use resampling techniques such as SMOTE, class weights, or evaluation metrics (precision, recall, F1, AUC) that reflect minority-class performance.
[M] Why is randomization important when designing experiments (experimental design)?
Importance of randomization in experimental design:
Confounding variables: randomization distributes known and unknown confounders equally across treatment groups.
Causal inference: minimizes the impact of extraneous factors, strengthening causal conclusions.
Statistical validity: satisfies the assumptions behind standard statistical tests, improving the generalizability of results.
Fair comparison: ensures treatment and control groups are comparable, providing a fair basis for evaluating treatment effects.
Class imbalance. [M] Imagine you want to build a model to detect skin lesions from images. In your training dataset, only 1% of your images show signs of lesions. After training, your model seems to make a lot more false negatives than false positives. What are some of the techniques you'd use to improve your model?
Improving the skin lesion detector under class imbalance:
Resampling: oversample the minority (lesion) class or undersample the majority class to balance the training distribution (see the sketch below).
Data augmentation: augment minority-class images to increase variability and improve generalization.
Loss and threshold adjustments: use a class-weighted loss and lower the decision threshold, since the model currently errs heavily toward false negatives.
Ensembles: combine models trained on diverse subsets of the imbalanced data.
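One hedged sketch of the resampling idea in PyTorch, using a WeightedRandomSampler so each training batch sees far more lesion examples than the raw 1% prevalence (the feature array is a synthetic stand-in for images or image embeddings):

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Hypothetical labels: 1% positive (lesion), 99% negative
labels = np.zeros(10000, dtype=np.int64)
labels[:100] = 1
features = np.random.default_rng(0).normal(size=(10000, 32)).astype(np.float32)

# Give each sample a weight inversely proportional to its class frequency
class_counts = np.bincount(labels)
sample_weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(torch.as_tensor(sample_weights, dtype=torch.double),
                                num_samples=len(labels), replacement=True)

dataset = TensorDataset(torch.from_numpy(features), torch.from_numpy(labels))
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
batch_x, batch_y = next(iter(loader))
print("Positives in a resampled batch:", int(batch_y.sum()))   # roughly half the batch, instead of ~1%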
Classification vs. regression. [E] What makes a classification problem different from a regression problem?
In a classification problem, the goal is to predict the categorical class or label of an input, assigning it to one of several predefined classes. In contrast, a regression problem involves predicting a continuous numerical value.
k-means clustering. [M] How would you do it if the labels aren't known?
In the absence of known labels, you can use internal validation metrics for k-means clustering, such as the silhouette score or Davies-Bouldin index. The silhouette score measures the cohesion and separation of clusters, while the Davies-Bouldin index assesses the compactness and separation of clusters. Higher silhouette scores and lower Davies-Bouldin values indicate better-defined and more separated clusters, providing an internal evaluation of clustering performance.
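A small scikit-learn sketch comparing both internal metrics across candidate cluster counts on synthetic data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)   # true labels ignored on purpose
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3), round(davies_bouldin_score(X, labels), 3))
# Prefer the k with a high silhouette score and a low Davies-Bouldin index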
[E] What are the basic assumptions to be made for linear regression?
Linear regression assumes a linear relationship between the independent and dependent variables, implying that changes in the independent variables result in proportional changes in the dependent variable. It also assumes homoscedasticity (constant variance of errors) and independence of errors, meaning that the residuals exhibit consistent variability and are not correlated. Additionally, linear regression assumes that the errors are normally distributed.
[M] For logistic regression, why is log loss recommended over MSE (mean squared error)?
Log loss is recommended over mean squared error (MSE) for logistic regression because it is the natural loss for a model that outputs probabilities: it is the negative log-likelihood of the Bernoulli model and penalizes confidently wrong predictions heavily. In addition, MSE combined with a sigmoid output yields a non-convex objective whose gradients vanish for confident but wrong predictions, which slows learning, whereas log loss is convex in the parameters and keeps gradients large in exactly those cases.
Hyperparameters. [E] What are the differences between parameters and hyperparameters?
Parameters vs. Hyperparameters: Parameters: Learned internal variables influencing model predictions. Hyperparameters: External settings guiding the learning process, set before training.
Parametric vs. non-parametric methods. [E] What's the difference between parametric methods and non-parametric methods? Give an example of each method.
Parametric methods: assume a specific form for the underlying function or distribution with a fixed number of parameters. Example: linear regression, which assumes a linear relationship between the input features and the target.
Non-parametric methods: make no strong assumptions about the form of the underlying function, and the effective number of parameters can grow with the amount of data. Example: k-nearest neighbors (k-NN), where predictions are based on the majority class of the k nearest data points.
[M] Suppose you want to build a model to classify whether a tweet spreads misinformation. You have 100K labeled tweets over the last 24 months. You decide to randomly shuffle your data and pick 80% to be the train split, 10% to be the valid split, and 10% to be the test split. What might be the problem with this way of partitioning?
Randomly shuffling and partitioning tweets into training, validation, and test sets without considering temporal order may introduce information leakage in a misinformation classification model. To address this, it's crucial to ensure a temporal split where the training set contains tweets from earlier periods, and the validation and test sets include tweets from later periods to better reflect real-world scenarios and prevent biases from learning future information during training.
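A sketch of a time-based split with pandas on a synthetic frame (the column names are placeholders): sort by timestamp, then take the oldest 80% for training and the more recent slices for validation and test.

import numpy as np
import pandas as pd

# Hypothetical labeled tweets spread over roughly 24 months
df = pd.DataFrame({
    "created_at": pd.date_range("2022-01-01", periods=730, freq="D"),
    "label": np.random.default_rng(0).integers(0, 2, size=730),
})

df = df.sort_values("created_at")             # order by time, never shuffle across it
n = len(df)
train = df.iloc[: int(0.8 * n)]               # oldest 80%
valid = df.iloc[int(0.8 * n): int(0.9 * n)]   # next 10%
test = df.iloc[int(0.9 * n):]                 # most recent 10%
print(train["created_at"].max() < valid["created_at"].min() < test["created_at"].max())  # True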
Naive Bayes classifier. [E] How is Naive Bayes classifier naive?
The Naive Bayes classifier is considered "naive" because it makes the assumption that the features used to describe an instance are conditionally independent, given the class label. This simplifying assumption allows for efficient and straightforward probabilistic calculations, although it may not always hold true in real-world scenarios where features might be correlated.
Bias-variance trade-off [E] What's the bias-variance trade-off?
The bias-variance trade-off is a fundamental concept in machine learning that involves balancing the model's ability to capture complex patterns (low bias) with its ability to generalize well to new, unseen data (low variance). Increasing model complexity typically reduces bias but may lead to overfitting and increased variance. Striking the right balance is crucial for optimal model performance.
Bias-variance trade-off [M] How's this tradeoff related to overfitting and underfitting?
The bias-variance trade-off is closely tied to overfitting and underfitting in machine learning. Overfitting occurs when a model is too complex, capturing noise in the training data and leading to high variance but low bias. Underfitting, on the other hand, arises from a model that is too simplistic, resulting in high bias but low variance. Finding the right balance in model complexity is essential to mitigate both overfitting and underfitting and achieve optimal generalization performance.
Train, valid, test splits. [E] What's wrong with training and testing a model on the same data?
Training and testing a model on the same data can lead to overfitting, where the model performs well on the training set but fails to generalize to new, unseen data. This approach may not accurately reflect the model's true performance, as the model has essentially memorized the training data rather than learning underlying patterns.
[M] When should we use RMSE (Root Mean Squared Error) over MAE (Mean Absolute Error) and vice versa?
Use RMSE (Root Mean Squared Error) when you want to penalize large errors more heavily, which is suitable for situations where outliers have a significant impact on model performance. Use MAE (Mean Absolute Error) when you want a metric that is less sensitive to outliers, providing a more robust measure of average error. The choice between RMSE and MAE depends on the specific characteristics and requirements of the problem at hand.
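A tiny NumPy demonstration of the difference: a single large error moves RMSE far more than MAE (the numbers are made up):

import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 10.5, 50.0])   # last point is an outlier
y_pred = np.array([10.5, 11.5, 11.0, 10.0, 12.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))
print(f"MAE={mae:.2f}, RMSE={rmse:.2f}")   # RMSE is pulled up far more by the one large error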
[E] What are the algorithms you'd use when developing the prototype of a fraud detection model?
When developing a prototype for a fraud detection model, consider using algorithms like logistic regression, decision trees, random forests, or gradient boosting. These algorithms are suitable for binary classification tasks, offer interpretability, and can capture complex patterns in the data, providing a solid foundation for initial model development in fraud detection.
Production Cycle. Deployment
Deployment covers how to set up your infrastructure, or help your customers set up theirs, to run the ML application, which is data-, memory-, and compute-intensive. It may also require ML work to compress models and optimize inference latency, unless you can push these costs elsewhere.
k-means and GMM are both powerful clustering algorithms. [M] Compare the two.
k-means and Gaussian Mixture Model (GMM) are both clustering algorithms, but they differ in their underlying assumptions and flexibility. k-means assumes spherical clusters and assigns each point to the nearest cluster center, making it computationally efficient but sensitive to outliers. GMM, in contrast, models clusters as ellipsoids and provides a probabilistic framework, allowing for more flexibility in capturing complex cluster shapes and accommodating data with mixed distributions. The choice between the two depends on the nature of the data and the desired level of complexity in cluster representation.
Empirical risk minimization. [E] What's the risk in empirical risk minimization?
Empirical Risk Minimization (ERM): the "risk" is the expected loss over the true (unknown) data distribution. Because that distribution is unobservable, ERM minimizes the empirical risk, i.e. the average loss on the training data; the danger is that a model with low training error can still have high true risk, overfitting the training set and generalizing poorly.
Your model performs really well on the test set but poorly in production. [M] Imagine your hypotheses about the causes are correct. What would you do to address them?
Addressing model deployment issues (assuming the corresponding hypothesis is correct):
Data distribution shift: regularly refresh the training data with recent production examples and retrain so the model matches the current distribution.
Concept drift: implement drift detection and retrain on fresh data when the feature-target relationship changes.
Feature drift: monitor feature statistics in production and update the model or the feature engineering when significant drift is detected.
Model overfitting: simplify the model, adjust hyperparameters, and regularize so it generalizes beyond the test set.
Environmental differences: validate the model in an environment that mirrors production and make the deployment environment reflect operational conditions.
Lack of real-time feedback: add real-time monitoring and a feedback loop, and update the model based on it.
Data quality issues: fix inconsistencies in production data and add data validation checks.
Deployment bugs: test thoroughly before deployment and harden the release process.
Inadequate monitoring: strengthen monitoring and alerting, and review the monitoring strategy regularly.
Convergence. [E] When we say an algorithm converges, what does convergence mean?
Convergence in the context of an algorithm refers to the point where the algorithm has reached a stable or optimal state. Specifically, it indicates that the iterative updates or adjustments made by the algorithm have sufficiently minimized the objective function or achieved a satisfactory solution, and further iterations are unlikely to significantly improve the results.
k-means clustering. [E] How would you choose the value of k?
Choosing the value of k in k-means clustering is commonly done with the elbow method: compute the within-cluster sum of squares (inertia) for a range of k values and pick the point where its rate of decrease levels off. Internal metrics such as the silhouette score can also be compared across candidate values of k.
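A short scikit-learn sketch of the elbow computation on synthetic data; in practice you would plot inertia against k and look for the bend:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
for k in range(1, 9):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))   # look for the "elbow" where the drop in inertia flattens out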
k-nearest neighbor classification. [E] How would you choose the value of k?
Choosing the value of k in k-nearest neighbor classification involves using methods like cross-validation and grid search. Cross-validation helps assess the model's performance for different k values on various subsets of the data, enabling the selection of a value that balances bias and variance. Grid search systematically tests multiple k values, allowing the identification of the optimal k that maximizes classification accuracy or other relevant metrics.
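A hedged scikit-learn sketch using 5-fold cross-validated grid search to pick k (the iris dataset and the range of odd k values are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": list(range(1, 31, 2))},  # odd k avoids ties
                      cv=5)
search.fit(X, y)
print("Best k:", search.best_params_["n_neighbors"],
      "| CV accuracy:", round(search.best_score_, 3))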
[E] What are the conditions that allowed deep learning to gain popularity in the last decade?
Conditions that allowed deep learning to gain popularity:
Data availability: abundance of large labeled datasets.
Computational power: increased GPU capabilities.
Algorithmic improvements: better optimization methods and architectures.
Transfer learning: leveraging pre-trained models.
Open-source frameworks: accessibility of tools like TensorFlow and PyTorch.
Research community collaboration: global knowledge exchange.
Industry adoption: success in applications led to widespread use.
[M] For classification tasks with more than two labels (e.g. MNIST with 10 labels), why is cross-entropy a better loss function than MSE?
Cross-entropy is a better loss function than mean squared error (MSE) for classification tasks with multiple labels because cross-entropy is specifically designed for probability distributions and naturally handles multi-class scenarios. Cross-entropy penalizes the model more strongly for incorrect predictions, providing better optimization for classification problems with multiple labels. MSE, designed for regression tasks, may not capture the nuances of class probabilities in the same way, making cross-entropy more suitable for tasks like MNIST digit classification with 10 labels.
Cross-validation. [E] Explain different methods for cross-validation.
Cross-validation involves partitioning a dataset into subsets to assess a model's performance. Common methods include k-fold cross-validation, where the data is divided into k subsets for training and testing, and stratified cross-validation, which ensures each subset maintains the class distribution of the original data. Leave-One-Out cross-validation uses individual data points as test sets, providing a robust evaluation but can be computationally expensive, while holdout validation involves randomly splitting the data into training and test sets, useful for large datasets.
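A brief scikit-learn sketch running the same model through k-fold, stratified k-fold, and leave-one-out cross-validation (the dataset and model choices are illustrative):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

for name, cv in [("k-fold", KFold(n_splits=5, shuffle=True, random_state=0)),
                 ("stratified k-fold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0)),
                 ("leave-one-out", LeaveOneOut())]:
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: mean accuracy={scores.mean():.3f} over {len(scores)} folds")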
Cross-validation. [M] Why don't we see more cross-validation in deep learning?
Deep learning models often require large amounts of data, making traditional cross-validation computationally expensive. Additionally, deep learning models may have millions of parameters, and training them is resource-intensive, making it impractical to repeatedly train and evaluate the model across multiple folds. Instead, techniques like holdout validation or using a single validation set are commonly employed in deep learning.
Feature leakage [M] How do you detect feature leakage?
Detecting feature leakage involves thorough examination of features, timestamps, and model performance. Strategies include domain knowledge, cross-validation analysis, correlation checks, and visualizations to ensure that the model is not inadvertently learning from information not available at the time of prediction. Identifying proxy variables and auditing feature engineering processes can further help prevent leakage.
Training data leakage. [M] Imagine you're working with a binary task where the positive class accounts for only 1% of your data. You decide to oversample the rare class then split your data into train and test splits. Your model performs well on the test split but poorly in production. What might have happened?
Oversampling before splitting is the likely culprit: copies (or near-copies) of the same rare-class examples end up in both the train and test splits, so the test set leaks training information and no longer reflects the original 1% prevalence. The test score is therefore inflated, while production data, which is still heavily imbalanced and contains no duplicates of training examples, exposes the model's true performance. The fix is to split first and oversample only the training split, keeping the test set representative of the original distribution (see the sketch below).
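A sketch of the safe ordering with scikit-learn and imbalanced-learn (assumed installed as the imblearn package); synthetic data stands in for the real problem:

from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)

# Split FIRST so duplicated minority rows can never appear in both splits,
# then oversample only the training portion; the test split keeps the original ~1% prevalence.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)
X_train_bal, y_train_bal = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)
print("Train positives after oversampling:", int(y_train_bal.sum()),
      "| test positives:", int(y_test.sum()))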
[H] How do you know if you've gathered enough observations for an ML model?
Determining whether you have enough observations:
Model complexity: more complex models generally need more data.
Cross-validation performance: check whether generalization performance is stable across folds.
Learning curves: look for diminishing returns as more data is added.
Statistical significance: use confidence intervals and hypothesis tests to ensure results are significant.
Domain knowledge: judge whether the critical variations in the domain are adequately represented.
Data quality: prioritize high-quality, representative data over sheer quantity.
Resource constraints: consider time, budget, and computing limits.
Monitoring: track performance on new data to spot degradation that signals the need for more observations.
[M] How to determine outliers in your data samples? What to do with them?
Determining outliers:
Visual inspection: box plots or histograms.
Statistical methods: Z-scores, the IQR rule, or other measures of deviation.
Machine learning models: anomaly detection models that flag outlying observations automatically.
Handling outliers:
Removal: exclude outliers cautiously, considering the impact on data representativeness.
Transformation: apply transformations to reduce their influence on statistical analyses.
Winsorizing: cap extreme values instead of removing them.
Modeling approaches: use robust models or ensembles that are less sensitive to outliers.
Investigation: understand the context, since outliers may carry insight or indicate data quality issues.
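A NumPy sketch of the two most common rules, on synthetic data with two injected outliers:

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(50, 5, size=200), [120.0, -30.0]])   # two injected outliers

# Z-score rule: flag points more than 3 standard deviations from the mean
z = (x - x.mean()) / x.std()
print("Z-score outliers:", x[np.abs(z) > 3])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print("IQR outliers:", x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])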
[M] Why does ensembling independently trained models generally improve performance?
Why ensembling independently trained models helps:
Diversity: independently trained models capture different patterns.
Error compensation: combining predictions averages out individual models' errors.
Variance reduction: the ensemble is less sensitive to variations in the training data.
Improved generalization: better performance on new, unseen data.
Robustness to noise: consistent patterns across models dominate over noise.
[M] If we have a wide NN and a deep NN with the same number of parameters, which one is more expressive and why?
With the same number of parameters, the deep network is generally more expressive. Composing layers lets it learn hierarchical features, capturing complex patterns at multiple levels of abstraction, and there are functions a deep network can represent compactly that a shallow, wide network would need far more units to approximate. This compositional reuse of intermediate representations also tends to help generalization, though deeper networks can be harder to optimize.
F1 score. [E] What's the benefit of F1 over the accuracy?
F1 score is beneficial over accuracy when dealing with imbalanced datasets, such as in medical diagnosis. Unlike accuracy, which may be misleading in the presence of imbalanced classes, F1 score considers both precision and recall, providing a balanced measure that accounts for false positives and false negatives and is particularly useful when one class is rare.
Missing data [H] In your dataset, two out of 20 variables have more than 30% missing values. What would you do?
Handling variables with more than 30% missing values:
Assessment: evaluate how important these variables are for the analysis and modeling goals.
Exclude: if the variables are deemed unimportant, consider dropping them.
Impute: if the variables are crucial, use imputation methods such as mean, median, or machine-learning-based imputation.
Production Cycle. Project scoping
Define goals and objectives, constraints, and evaluation criteria. Stakeholders should be identified and involved. Resources should be estimated and allocated.
Two popular algorithms for winning Kaggle solutions are Light GBM and XGBoost. They are both gradient boosting algorithms. [E] What is gradient boosting?
Gradient boosting is an ensemble learning technique that sequentially builds a series of weak learners, typically decision trees, each one correcting the errors of the ensemble built so far. Rather than reweighting misclassified instances (as AdaBoost does), each new tree is fit to the negative gradient of the loss function with respect to the current ensemble's predictions, so the additive model steps downhill on the loss. The result is a strong predictive model with improved accuracy and the ability to handle complex relationships in the data.
Two popular algorithms for winning Kaggle solutions are Light GBM and XGBoost. They are both gradient boosting algorithms. [M] What problems is gradient boosting good for?
Gradient boosting is effective for a wide range of machine learning problems, especially when dealing with complex, non-linear relationships in the data. It excels in tasks such as regression, classification, and ranking, and is particularly useful when the dataset has a mix of categorical and numerical features, as well as in scenarios where interpretability is important.
Missing data [M] How might techniques that handle missing data make selection bias worse?
Techniques for handling missing data can make selection bias worse: imputing missing values from the observed data implicitly assumes the data are missing completely at random, so when missingness is related to the outcome, the imputed values pull the distribution toward the respondents. For example, if high earners are less likely to report income, mean imputation drags income estimates toward the observed, lower-earning group, amplifying the original selection bias.
Hyperparameters. [M] Explain algorithm for tuning hyperparameters.
Hyperparameter tuning algorithm:
1. Define the search space: identify the hyperparameters and their value ranges.
2. Choose a technique: grid search or random search.
3. Split the data into training and validation sets.
4. Train models with different hyperparameter combinations.
5. Evaluate each model's performance on the validation set.
6. Update the search space based on performance.
7. Repeat steps 4-6 until good hyperparameters are found.
8. Test the final model on a separate, unseen test set.
Hyperparameters. [E] Why is hyperparameter tuning important?
Hyperparameter tuning is essential for optimizing a machine learning model's performance. It allows for the adjustment of external configuration settings to improve accuracy, prevent overfitting, and enhance the model's robustness across different datasets and scenarios.
Your model performs really well on the test set but poorly in production.[M] What are your hypotheses about the causes?
Hypotheses for the test-set vs. production discrepancy:
Data distribution shift: differences between test and production data hurt generalization.
Concept drift: the relationship between features and target has changed in production.
Feature drift: the distributions of input features have evolved.
Model overfitting: the model overfit the test set and does not generalize to real-world scenarios.
Environmental differences: the production environment differs from the evaluation setup.
Lack of real-time feedback: delayed feedback hides issues in production.
Data quality issues: poor-quality or inconsistent production data.
Deployment bugs: issues in the deployment process cause unexpected behavior.
Inadequate monitoring: performance degradation goes undetected.
[E] Your team is building a system to aid doctors in predicting whether a patient has cancer or not from their X-ray scan. Your colleague announces that the problem is solved now that they've built a system that can predict with 99.99% accuracy. How would you respond to that claim?
I would express caution and inquire about the validation and testing procedures used, as achieving extremely high accuracy might be indicative of overfitting to the training data. It's essential to ensure the model's performance generalizes well to new, unseen X-ray scans and is validated using appropriate evaluation metrics, considering factors like sensitivity and specificity in the context of cancer prediction.
[E] What happens if we don't apply feature scaling to logistic regression?
If feature scaling is not applied to logistic regression, variables with larger scales may dominate the optimization process, leading to biased parameter estimates. This can result in slower convergence, an inefficient optimization process, and suboptimal model performance. Feature scaling, such as normalization or standardization, is crucial to ensure fair contributions from all features and enhance the efficiency of logistic regression training.
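A small scikit-learn sketch: wrapping the scaler and the classifier in a pipeline both addresses the scaling issue and keeps the scaler from leaking test statistics during cross-validation (the dataset choice is illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # features with very different scales

unscaled = LogisticRegression(max_iter=200)
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
print("Unscaled CV accuracy:", cross_val_score(unscaled, X, y, cv=5).mean().round(3))
print("Scaled CV accuracy:  ", cross_val_score(scaled, X, y, cv=5).mean().round(3))
# The unscaled fit also tends to emit convergence warnings because the optimizer struggles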
Bias-variance trade-off [M] How do you know that your model is high variance, low bias? What would you do in this case?
If your model is high variance and low bias, it might show excellent performance on the training data but struggle to generalize to new data. Signs include a large gap between training and validation/test performance. To address this, consider reducing model complexity, using regularization techniques, or increasing the amount of training data to help the model generalize better.
[E] Occam's razor states that when the simple explanation and complex explanation both work equally well, the simple explanation is usually correct. How do we apply this principle in ML?
In machine learning, applying Occam's razor involves favoring simpler models when they perform comparably to more complex models. This principle helps prevent overfitting and encourages model generalization. The goal is to find a balance between model simplicity and performance on training and validation data, ensuring that the chosen model is both effective and generalizable.
[M] Why does L1 regularization tend to lead to sparsity while L2 regularization pushes weights closer to 0?
L1 regularization (lasso) adds the absolute values of the coefficients as a penalty, while L2 (ridge) adds their squared values. The gradient of the L1 penalty has constant magnitude regardless of a weight's size, so it can push small weights all the way to exactly zero, producing sparse solutions; geometrically, the corners of the L1 constraint region make zero coefficients likely. The gradient of the L2 penalty is proportional to the weight itself, so the shrinkage force fades as a weight approaches zero: weights become small but rarely exactly zero.
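A quick scikit-learn illustration of the difference on synthetic regression data (the penalty strength is an arbitrary illustrative value):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=50, n_informative=5, noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("Lasso: exactly-zero coefficients:", int(np.sum(lasso.coef_ == 0)), "out of 50")
print("Ridge: exactly-zero coefficients:", int(np.sum(ridge.coef_ == 0)), "out of 50")
# Ridge shrinks all weights toward zero but almost never makes them exactly zero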
k-nearest neighbor classification. [M] How does the value of k impact the bias and variance?
Larger values of k result in higher bias and lower variance, smoothing the decision boundary and making the model less sensitive to noise. Smaller values of k lead to lower bias and higher variance, creating a more intricate decision boundary that can capture fine patterns but may be sensitive to noise in the data. The choice of k involves a trade-off between bias and variance, and it should be selected based on the characteristics of the dataset for optimal model performance.
SVM. [E] What's linear separation? Why is it desirable when we use SVM?
Linear separation refers to the ability to separate classes in a dataset using a straight line (hyperplane) in the feature space. In Support Vector Machines (SVM), linear separation is desirable because it simplifies the decision boundary and makes the optimization process more efficient. SVM aims to find the hyperplane that maximally separates classes, and linear separation allows for a clear and well-defined decision boundary, often resulting in better generalization to new, unseen data.
[E] Why does an ML model's performance degrade in production?
Why ML model performance degrades in production:
Data distribution shift, feature drift, and concept drift between training and production data, plus lack of real-time feedback.
Environmental impact: changes in the deployment environment can affect performance.
Model aging: effectiveness decreases over time as data patterns change.
Data quality issues: inconsistent or poor-quality production data negatively impacts the model.
Empirical risk minimization. [E] How do we minimize that risk?
Minimizing the empirical risk means choosing, within the hypothesis class, the model with the lowest average loss on the training set, usually via an optimization algorithm such as gradient descent, with regularization used to keep the minimizer from overfitting.
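In symbols (standard notation, added here for completeness): with training samples (x_i, y_i), loss L, and hypothesis class H, ERM selects

\hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} L\big(h(x_i), y_i\big), \qquad \hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}_n(h)

so the quantity being minimized is the average training loss, serving as a surrogate for the true risk E[L(h(x), y)] under the data distribution.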
Production Cycle. Business analysis
Model performance needs to be evaluated against business goals and analyzed to generate business insights. These insights can then be used to eliminate unproductive projects or to scope out new projects.
Production Cycle. Monitoring and maintenance
Models need to be monitored for performance decay and maintained or updated so they stay adaptive to changing environments and changing requirements.
Feature leakage [E] Why does normalization help prevent feature leakage?
Normalization helps prevent feature leakage by ensuring consistent scaling of features between training and test sets. It avoids information leakage by normalizing based on statistics (e.g., mean and standard deviation) computed solely from the training set, promoting generalization to new data and maintaining data integrity during the modeling process.
Training data leakage. [M] You want to build a model to classify whether a comment is spam or not spam. You have a dataset of a million comments over the period of 7 days. You decide to randomly split all your data into the train and test splits. Your co-worker points out that this can lead to data leakage. How?
Randomly splitting a dataset containing comments over 7 days for spam classification without considering temporal order can lead to data leakage. This is because it may introduce future information into the training set, causing the model to learn patterns that do not generalize well to new, unseen data.
Sample duplication [M] When should you remove duplicate training samples? When shouldn't you?
Remove duplicate training samples when they introduce bias or skew the model, for example through redundancy or inflated feature importance. Do not remove them when the duplication is intentional, capturing essential patterns or emphasizing specific cases, or when repetition is inherent to the data and removing duplicates would lose information.
[E] What are saddle points and local minima? Which are thought to cause more problems for training large NNs?
Saddle point: a point where the gradient is zero but which is neither a minimum nor a maximum; the surrounding region is flat along some directions, which is challenging for optimization algorithms.
Local minimum: a point where the function value is lower than at its neighbors but not necessarily the lowest possible value; optimizers can get stuck in such suboptimal solutions.
For training large neural networks, saddle points are thought to cause more problems: in high-dimensional loss surfaces they are far more common than bad local minima, and gradients become very small near them, slowing progress. Local minima, once considered the main obstacle, are less problematic in practice with modern optimizers and high-dimensional networks.
[M] What is the difference between sampling with vs. without replacement? Name an example of when you would use one rather than the other?
Sampling with replacement: each selected item is returned to the population before the next draw, so the same item can be chosen more than once. Example: bootstrap resampling (as in bagging), or drawing cards from a deck and replacing each card before the next draw.
Sampling without replacement: a selected item is not returned, so it can be chosen at most once. Example: drawing winners from a raffle, where each person can win only once.
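A one-line contrast in NumPy (the population and sample sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
population = np.arange(10)

with_replacement = rng.choice(population, size=8, replace=True)      # bootstrap-style; repeats allowed
without_replacement = rng.choice(population, size=8, replace=False)  # raffle-style; each item at most once
print(with_replacement)      # may contain duplicates
print(without_replacement)   # all values distinct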
Production Cycle. Data management
Sources, data formats, data processing, data control, data storage, scalability
[E] Explain supervised, unsupervised, weakly supervised, semi-supervised, and active learning.
Supervised learning: trained on labeled data; maps inputs to outputs (classification, regression).
Unsupervised learning: finds patterns in unlabeled data without specific guidance (clustering, dimensionality reduction).
Weakly supervised learning: uses noisy, imprecise, or higher-level sources of labels, providing less precise supervision.
Semi-supervised learning: trained on a mix of a small amount of labeled data and a larger amount of unlabeled data, leveraging the limited labeled examples.
Active learning: interactively queries an oracle (e.g. a human annotator) to label the most informative data points, improving the model with targeted queries.
Your model performs really well on the test set but poorly in production. [H] How do you validate whether your hypotheses are correct?
Validating hypotheses about the test-set vs. production discrepancy:
Logging: capture input data and predictions in detail during deployment.
Data monitoring: compare production data distributions with the training and test data.
Concept drift detection: check whether the feature-target relationship has changed in production.
Feature drift analysis: compare production feature statistics against the test set.
Model evaluation metrics: continuously monitor production performance with appropriate metrics.
A/B testing: deploy different model versions concurrently and compare their performance.
Feedback loop: establish continuous monitoring and retraining.
Cross-functional collaboration: work with domain experts and stakeholders to understand environmental differences and potential issues.
k-means clustering. [E] If the labels are known, how would you evaluate the performance of your k-means clustering algorithm?
When the true labels are known, you can evaluate the performance of a k-means clustering algorithm using metrics like Adjusted Rand Index (ARI) or normalized mutual information. These metrics compare the agreement between the true labels and the cluster assignments, providing a quantitative measure of clustering accuracy. Higher values indicate better alignment between the true and predicted clusters.
[H] Your model has been performing fairly well using just a subset of features available in your data. Your boss decided that you should use all the features available instead. What might happen to the training error? What might happen to the test error?
When transitioning from a subset to all available features, the training error might decrease as the model can potentially capture more complex patterns. However, the risk of overfitting increases, leading to a potential rise in test error as the model may struggle to generalize well to new, unseen data due to the increased complexity and noise from additional features.
k-nearest neighbor classification. [E] What happens when you increase or decrease the value of k?
When you increase the value of k in k-nearest neighbor classification, the decision boundary becomes smoother, making the model less sensitive to noise but potentially missing finer patterns in the data. On the other hand, decreasing the value of k results in a more complex and sensitive decision boundary, which might capture intricate patterns but is susceptible to noise. The choice of k should be carefully considered based on the characteristics of the dataset to achieve a balance between bias and variance.
F1 score. [M] Can we still use F1 for a problem with more than two classes. How?
Yes, F1 score can be extended to problems with more than two classes by computing a weighted average of the F1 scores for each class. This can be achieved through micro-average (aggregating counts of true positives, false positives, and false negatives across all classes) or macro-average (computing the average F1 score for each class independently), depending on the specific requirements of the multi-class classification problem.
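A short scikit-learn sketch of the three common averaging modes on made-up labels:

from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0, 2, 1, 0, 2]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1, 0, 2]

print("Micro F1:   ", round(f1_score(y_true, y_pred, average="micro"), 3))     # pools TP/FP/FN over all classes
print("Macro F1:   ", round(f1_score(y_true, y_pred, average="macro"), 3))     # unweighted mean of per-class F1
print("Weighted F1:", round(f1_score(y_true, y_pred, average="weighted"), 3))  # mean weighted by class support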
[E] Is feature scaling necessary for kernel methods?
Yes, feature scaling is necessary for kernel methods like Support Vector Machines (SVM) and kernelized versions of algorithms. Kernel methods rely on the calculation of pairwise distances or inner products between data points, and unnormalized features can lead to biased contributions, affecting the model's performance. Feature scaling ensures that all features contribute equally and is essential for achieving optimal results with kernel methods.