ML Fundamental Questions Qualcomm


How do ensemble methods (bagging, boosting) reduce variance and/or bias in machine learning?

Bagging (e.g., Random Forests): Averages multiple independently trained weak learners (usually decision trees) to reduce variance. Each learner is trained on a bootstrap sample, making them decorrelated. Boosting (e.g., XGBoost): Sequentially trains weak learners, each focusing on the errors of the previous ones, reducing bias and potentially variance. Boosting often achieves strong performance but can be more prone to overfitting if not properly regularized.
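For illustration, a minimal sketch contrasting the two with scikit-learn; the dataset, model choices, and hyperparameters are illustrative, not prescriptive:

```python
# Bagging vs. boosting on a synthetic classification problem (illustrative settings).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Bagging: independently trained trees on bootstrap samples, averaged -> lower variance.
bagging = RandomForestClassifier(n_estimators=200, random_state=0)

# Boosting: trees fit sequentially on the ensemble's remaining errors -> lower bias.
boosting = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```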

Can you describe and contrast batch and layer normalization and how they affect training stability?

Batch normalization and layer normalization are both techniques designed to stabilize and accelerate the training of deep neural networks by normalizing activations, but they operate differently and are suited to distinct scenarios. Batch normalization (BN) normalizes the activations across the entire mini-batch for each feature, which helps reduce internal covariate shift, allowing for higher learning rates and faster convergence. BN leverages batch statistics (mean and variance) during training, making it highly effective in convolutional neural networks where large, consistent batch sizes are common. However, BN introduces dependencies between samples within a batch, which can be problematic with small or variable batch sizes and less effective in sequential models like RNNs. On the other hand, layer normalization (LN) normalizes the activations across the features within each individual data sample, independent of the batch size. This makes LN particularly suitable for models with variable batch sizes or sequential data, such as transformers and recurrent neural networks, where maintaining consistency across different inputs is crucial. While both BN and LN enhance training stability by ensuring consistent activation distributions, BN is preferred for large-scale, parallel architectures, whereas LN offers greater flexibility and robustness in diverse and sequential settings, providing a more sample-independent normalization approach.
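A minimal PyTorch sketch of the different normalization axes; the tensor shapes are illustrative:

```python
# BatchNorm normalizes each channel over the batch; LayerNorm normalizes each sample.
import torch
import torch.nn as nn

x_img = torch.randn(8, 16, 32, 32)   # (batch, channels, H, W)
x_seq = torch.randn(8, 10, 64)       # (batch, sequence length, features)

bn = nn.BatchNorm2d(16)   # statistics computed per channel over (batch, H, W)
ln = nn.LayerNorm(64)     # statistics computed per token over its feature dimension

y_img = bn(x_img)  # depends on the whole mini-batch
y_seq = ln(x_seq)  # independent of batch size, suitable for variable-length sequences
print(y_img.shape, y_seq.shape)
```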

What is DAgger (Dataset Aggregation), and how does it address the limitations of behavior cloning?

DAgger iteratively refines the policy by collecting new data from the current policy's rollouts. The expert corrects the policy's actions in off-distribution states, adding these transitions to the training set. This reduces compounding errors and narrows the domain gap between training and testing.
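A schematic sketch of the DAgger loop; the env, expert, policy, and train objects are hypothetical placeholders rather than any specific library API:

```python
# DAgger: roll out the learner's policy, but label visited states with the expert.
def dagger(env, expert, policy, train, n_iters=10, horizon=200):
    dataset = []
    for it in range(n_iters):
        state = env.reset()
        for t in range(horizon):
            # Roll out the *current* policy so we visit its own state distribution...
            action = policy.act(state)
            # ...but record the expert's corrective action for every visited state.
            dataset.append((state, expert.act(state)))
            state, done = env.step(action)  # hypothetical (state, done) interface
            if done:
                break
        # Retrain the policy on the aggregated, expert-labeled dataset.
        policy = train(dataset)
    return policy
```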

How does MAMQL derive an agent's policy from the learned marginal Q-function?

MAMQL assumes Boltzmann (softmax) policies with respect to the marginal Q. Concretely, an agent's policy πi(ai | s) is proportional to exp(λ Qi(s, ai)). This ensures each agent's strategy is a generalized Boltzmann policy, reflecting both an entropy bonus (encouraging exploration) and the agent's estimated return under others' strategies.

What is MAMQL (Multi-Agent Marginal Q-Learning from Demonstrations) at a high level?

MAMQL is a framework to learn both reward functions and policies in a general-sum Markov game. Each agent learns a marginal Q-function that averages over the other agents' actions. From these marginal Q-functions, MAMQL constructs Boltzmann-type policies. It thereby captures the cooperative/competitive dynamics seen in expert demonstrations and recovers rewards consistent with those behaviors.

How does MAMQL compare to earlier multi-agent IRL methods such as MA-AIRL or IQ-Learn MA?

MAMQL typically converges faster, achieves higher returns, and recovers more accurate reward functions. It outperforms MA-AIRL and IQ-Learn MA in diverse tasks (e.g., a gridworld with cooperative/competitive gems, Overcooked, multi-agent Highway driving). By leveraging marginal Q-learning and a direct optimization objective, MAMQL exhibits stronger sample efficiency and more robust reward recovery.

Why do marginal Q-functions matter in MAMQL?

Marginal Q-functions estimate an agent's expected return after marginalizing over other agents' actions. This sidesteps the need to store or update a separate Q-value for each joint action combination, which can become intractable with many agents. By integrating over other agents' actions, MAMQL remains more tractable and can apply single-agent "soft Q-learning" principles (like in IQ-Learn) to a multi-agent context.

How would you gather expert demonstrations for imitation learning in autonomous driving?

Methods include manually driving routes with instrumented vehicles, using simulation with a human-in-the-loop, or leveraging data from large-scale real-world driving logs. Ensuring coverage of diverse scenarios is critical for robust policy learning.

What metrics and evaluation methods are used to assess the performance of Behavioral Foundation Models?

Metrics include trajectory accuracy (average and final displacement error, ADE/FDE), collision rate, comfort metrics (jerk, acceleration), and compliance with traffic rules. Evaluations range from offline dataset-based benchmarks (forecasting or planning tasks) to closed-loop simulation or real-world trials. Realism of generated behaviors can also be assessed via human-in-the-loop evaluations.

How would you measure the success of an imitation learning policy in an autonomous driving context?

Metrics include trajectory similarity to expert (e.g., deviation, collision avoidance), success in navigation tasks, compliance with traffic rules, comfort metrics (jerk, acceleration), and generalization to scenarios not in the training set.

How do you approach optimizing GPU usage and memory constraints for large-scale training jobs?

Mixed Precision Training: utilizing lower-precision data types (e.g., float16) reduces memory usage and accelerates computations by leveraging specialized hardware capabilities, such as NVIDIA's Tensor Cores.
Gradient Accumulation: simulating larger batch sizes by accumulating gradients over multiple smaller batches allows effective training without exceeding GPU memory limits, retaining the benefits of large batches while operating within memory constraints.
Model Parallelism: distributing different parts of the model across multiple GPUs helps manage memory usage for extremely large models that cannot fit on a single device.
Data Parallelism with Efficient Memory Management: implementing data parallelism with optimized memory allocation and gradient synchronization maximizes GPU utilization and minimizes idle time.
Gradient Checkpointing: saving memory by recomputing certain activations during the backward pass instead of storing all intermediate results; this trade-off between computation and memory allows training deeper networks within limited memory.
Memory-Efficient Architectures: designing or selecting architectures that inherently require less memory, such as smaller or more compact layers, helps fit models within GPU constraints.
Efficient Data Loading and Preprocessing: optimized data pipelines (prefetching, parallel data loading) keep GPUs fed without bottlenecks.
Pruning and Quantization: reducing model size by pruning unnecessary connections and quantizing weights lowers memory usage and improves computational efficiency without significantly sacrificing performance.
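A minimal sketch of the first two techniques (mixed precision plus gradient accumulation) in PyTorch, assuming a CUDA-capable GPU; the model, data, and hyperparameters are toy placeholders:

```python
# Mixed-precision training with gradient accumulation (toy model and data).
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()
accum_steps = 4  # effective batch size = per-step batch size * accum_steps

for step in range(100):
    inputs = torch.randn(32, 128, device="cuda")
    targets = torch.randint(0, 10, (32,), device="cuda")
    with torch.cuda.amp.autocast():               # forward pass in reduced precision
        loss = nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss / accum_steps).backward()   # accumulate scaled gradients
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)                    # unscale gradients, then apply the update
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```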

In motion planning, what is the difference between model-based and learning-based approaches?

Model-based approaches: Rely on explicit models of vehicle dynamics and constraints (e.g., kinematic/dynamic models). They solve optimization or graph search problems for safe, feasible paths. Learning-based approaches: Learn either direct policy outputs (actions) or cost functions from data, often using neural networks or reinforcement learning, which can adapt to complex environments but require comprehensive training data.

How does model-free RL differ from model-based RL in the context of self-driving cars?

Model-free RL: Learns a policy or value function directly from interactions without an explicit world model. Simpler to implement but data-inefficient. Model-based RL: Learns or has access to an environment model (e.g., vehicle dynamics, traffic rules). Can plan or simulate future rollouts, potentially achieving better sample efficiency and interpretability but requiring accurate modeling.

Describe an example use case of DiT for trajectory generation in autonomous driving.

A DiT could encode the past and present states of a vehicle as input tokens, add positional or temporal embeddings, and diffuse from noisy future trajectory hypotheses. Each step refines these hypotheses through attention across the context tokens, yielding diverse and realistic future trajectories.

Explain the concept of a "behavioral planner" in autonomous driving.

A behavioral planner decides high-level maneuvers (e.g., lane changes, overtaking) based on contextual cues such as traffic rules, other road users, and route objectives. It bridges the gap between route-level navigation (global planner) and low-level trajectory generation (trajectory planner).

How is the loss function typically formulated in diffusion models?

A common approach is to predict the noise added at each timestep using an MSE loss between predicted noise and actual noise. This direct noise-prediction objective simplifies training, ensuring stable convergence while preserving image or trajectory fidelity.
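A minimal sketch of this noise-prediction objective, assuming a hypothetical model(x_t, t) that predicts the added noise and a precomputed alphas_cumprod noise schedule:

```python
# Standard epsilon-prediction loss: corrupt x0, ask the model to recover the noise.
import torch

def diffusion_loss(model, x0, alphas_cumprod):
    # Sample a random timestep and Gaussian noise for each example in the batch.
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],))
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    # Forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    # MSE between the predicted and true noise.
    return torch.nn.functional.mse_loss(model(x_t, t), noise)
```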

How does convolution differ from a fully connected layer, and why is it well-suited for image data?

A fully connected layer connects every neuron in one layer to every neuron in the next, resulting in a dense connectivity pattern. This structure allows the model to learn global patterns but leads to a large number of parameters, especially with high-dimensional inputs like images, making it computationally expensive and prone to overfitting. In contrast, a convolutional layer applies a set of learnable filters (kernels) that slide across the input data, performing local operations. Each filter detects specific features such as edges, textures, or shapes by focusing on small, localized regions of the input. This local connectivity and parameter sharing significantly reduce the number of parameters, enhancing computational efficiency and reducing the risk of overfitting. Additionally, convolutional layers inherently capture spatial hierarchies and translation-invariant features, making them exceptionally well-suited for processing image data where local patterns and spatial relationships are crucial.
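A quick parameter-count comparison in PyTorch for a 3x32x32 input; the layer sizes are illustrative:

```python
# Dense connectivity vs. shared local filters: compare parameter counts.
import torch.nn as nn

fc = nn.Linear(3 * 32 * 32, 64)         # one weight per (input pixel, output unit) pair
conv = nn.Conv2d(3, 64, kernel_size=3)  # 64 shared 3x3x3 filters slid over the image

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(fc))    # 3*32*32*64 + 64 = 196,672 parameters
print(count(conv))  # 3*3*3*64   + 64 =   1,792 parameters
```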

Discuss the potential pitfalls of imitation learning, such as distributional shift and covariate shift.

A policy trained via behavior cloning observes states from expert behavior only. At test time, errors compound, leading the policy into unfamiliar states not in the training distribution (distributional shift). Domain adaptation, DAgger (Dataset Aggregation), or robust policy training can mitigate this issue.

What are the primary components of an autonomous driving stack, and how do they interact?

A typical stack includes perception (object detection, tracking, sensor fusion), prediction (forecasting the motion of other agents), planning (high-level route selection, trajectory generation), and control (vehicle actuation). These modules form a pipeline, sharing processed sensor data and predicted behaviors to generate safe, feasible trajectories.

What is the purpose of activation functions in neural networks, and why is ReLU commonly used?

Activation functions introduce non-linearity into neural networks, allowing them to model complex functions. ReLU (Rectified Linear Unit) is commonly used for several reasons. Computational Efficiency: ReLU is computationally simple, involving only a thresholding operation, which speeds up the training process. Mitigation of Vanishing Gradients: unlike sigmoid or tanh activations, ReLU does not saturate for positive inputs, reducing the likelihood of vanishing gradients and facilitating the training of deeper networks. Sparse Activation: ReLU outputs zero for negative inputs, leading to sparsity in activations, which can improve model efficiency and reduce overfitting. However, "dying ReLUs" (units stuck at zero for all inputs) are a known drawback, often mitigated with variants such as Leaky ReLU or ELU.

Discuss common exploration strategies in RL and their relevance to safety-critical domains like autonomous driving.

Approaches include ε-greedy, Boltzmann exploration, or parameter noise. In safety-critical domains, naive exploration can be disastrous. Methods like safe exploration or adding risk-sensitive constraints ensure that the agent respects safety boundaries while learning.

Compare behavior cloning (BC) and generative adversarial imitation learning (GAIL) in terms of training procedure and outcomes.

BC: Straightforward supervised approach, matching expert actions state-by-state. It's simpler but prone to compounding errors. GAIL: Trains a policy to mimic expert behavior in a distribution sense by adversarially distinguishing expert from policy rollouts. It can handle off-distribution states better but is more complex and computationally expensive.

Compare BERT-style masked language modeling with autoregressive Transformers (e.g., GPT).

BERT (bidirectional): Uses masked language modeling to learn context from both directions, suitable for encoding tasks (e.g., classification). GPT (autoregressive): Predicts the next token given previous tokens, focusing on unidirectional context, which is beneficial for generative tasks.

What is the difference between data parallelism and model parallelism in distributed ML training?

Data Parallelism involves replicating the entire model across multiple devices (e.g., GPUs) and distributing different subsets of the training data to each replica. Each device performs forward and backward passes on its data subset, computes gradients, and then aggregates these gradients to update the shared model parameters synchronously or asynchronously. Data parallelism is well-suited for scenarios where the model fits entirely within the memory of a single device but the dataset is too large to be processed by one device alone. It is commonly used in large-batch training and can effectively leverage the parallel processing capabilities of multiple GPUs. Model Parallelism splits the model itself across multiple devices, assigning different layers or parts of the model to different hardware units. This approach is necessary when the model is too large to fit into the memory of a single device. During training, each device handles its assigned portion of the model, passing intermediate activations between devices as needed. Model parallelism is particularly useful for extremely large models, such as those used in natural language processing or complex autonomous driving systems, where individual layers or modules may require significant computational resources. In practice, a combination of both data and model parallelism, known as hybrid parallelism, can be employed to optimize training efficiency and scalability, especially for very large models and datasets.
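A minimal sketch of data parallelism with PyTorch's DistributedDataParallel, assuming the script is launched with torchrun so that rank and world size are already set in the environment:

```python
# Data parallelism: one process per GPU, each holding a full model replica.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")             # process group created from torchrun env vars
rank = dist.get_rank()
model = nn.Linear(128, 10).to(rank)
ddp_model = DDP(model, device_ids=[rank])   # each rank keeps a full copy of the model
optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-3)

# Toy batch; in practice a DistributedSampler gives each rank a different data shard.
inputs = torch.randn(32, 128, device=rank)
loss = ddp_model(inputs).sum()
loss.backward()                             # gradients are all-reduced across replicas here
optimizer.step()
```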

What are the primary challenges in deploying deep learning models on embedded systems, and how can they be addressed?

Deploying deep learning models on embedded systems requires addressing technical constraints through techniques like hardware-aware neural architecture search (NAS), model quantization, pruning, and knowledge distillation to optimize for limited computational resources, while leveraging cross-platform frameworks and specialized hardware accelerators such as FPGAs to ensure compatibility across different embedded targets. These optimizations must balance power efficiency through dynamic voltage and frequency scaling (DVFS), memory usage via weight sharing and compressed representations, and latency requirements using efficient architectures like MobileNets, all while maintaining model robustness and reliability through error detection mechanisms and redundant pathways.

How is sensor fusion (e.g., LiDAR, camera, radar) beneficial in perception for autonomous vehicles?

Different sensors complement each other: LiDAR provides precise 3D range data, cameras offer rich semantic information, and radar is robust under adverse weather. Fusing these modalities yields a more accurate and robust environmental representation, which improves object detection, localization, and tracking.

What are Diffusion Transformers (DiT), and how do they combine diffusion processes with Transformer architectures?

Diffusion Transformers incorporate the transformer-based attention mechanism into each diffusion step or denoising block, leveraging the global context capture of attention while iteratively refining noisy inputs. They often treat each timestep's hidden state as a sequence, enabling flexible context modeling at each denoising phase.

Discuss techniques to reduce the computational complexity of self-attention for very long sequences.

Efficient Transformers include sparse attention (e.g., Longformer, Big Bird), low-rank approximations, or memory-based approaches (e.g., Compressive Transformers). These reduce the O(n²) complexity to O(n) or O(n log n) by restricting or approximating attention patterns.

What are the key architectural components needed to adapt DiTs to handle spatiotemporal data in autonomous driving?

Essential components include spatiotemporal embeddings (e.g., 2D for location + 1D for time), multi-head attention for capturing agent-agent and agent-environment interactions, and specialized conditioning modules that incorporate sensor or map information. Residual blocks can help with stable denoising.

How do you evaluate safety and reliability when testing autonomous driving systems?

Evaluation combines simulation-based testing (covering edge cases, large-scale scenarios), closed-course testing, and real-world pilot drives. Metrics include collision rates, time-to-collision, intervention counts, and compliance with traffic rules. Formal verification techniques may also be applied to critical modules.

Discuss the significance of interpretability and explainability in Behavioral Foundation Models for safety-critical applications.

Explainability is crucial for diagnosing failures, building trust, and satisfying regulatory requirements. Techniques like attention visualization, saliency maps, or surrogate explainable models can shed light on the model's decision-making process, ensuring that any anomalous behaviors are detectable and correctable.

What are GANs?

Generative Adversarial Networks (GANs) consist of two neural networks—the generator and the discriminator—in a minimax game framework. During training, the generator architecture (often deep convolutional or transformer-based) synthesizes data samples to mimic the target distribution, while the discriminator (typically a binary classifier) evaluates the authenticity of these samples against real data. The training pipeline involves alternating gradient updates: the discriminator is optimized to maximize the probability of correctly distinguishing real from fake samples, using loss functions like binary cross-entropy or Wasserstein loss, while the generator is simultaneously trained to minimize the discriminator's ability to identify its outputs as fake. Architectural innovations such as DCGAN's convolutional layers, spectral normalization, and techniques like gradient penalty are employed to stabilize training and mitigate issues like mode collapse. This adversarial interplay drives both networks to improve iteratively, resulting in the generator producing increasingly realistic data.

What is the difference between generative and discriminative models in ML?

Generative Models: Model the joint probability p(x,y), aiming to learn how data is generated. Examples include Gaussian Mixture Models and VAEs. Discriminative Models: Model the conditional probability p(y∣x) or directly learn a decision boundary. Examples include SVMs, logistic regression, and typical neural classifiers.

What is gradient checkpointing, and how does it help with memory constraints?

Gradient Checkpointing is a technique used to reduce memory consumption during the training of deep neural networks by strategically storing a subset of intermediate activations (checkpoints) and recomputing the missing activations during the backward pass. Instead of saving all activations required for gradient computation, only selected checkpoints are stored, which significantly lowers memory usage. When a backward pass is performed, the activations that were not saved as checkpoints are recalculated by re-executing the forward pass from the last stored checkpoint. This trade-off between increased computational overhead and reduced memory footprint allows training of larger models or using larger batch sizes on memory-constrained hardware, such as GPUs with limited memory capacity. Gradient checkpointing is particularly beneficial for very deep networks or models with high memory demands, where storing all intermediate activations would otherwise exceed available memory. By enabling the training of more complex models within memory limits, gradient checkpointing enhances the scalability and flexibility of deep learning workflows, making it a valuable tool for efficient model optimization.
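A minimal PyTorch sketch using torch.utils.checkpoint; the layer sizes are illustrative:

```python
# Gradient checkpointing: discard the block's intermediate activations on the forward
# pass and recompute them during the backward pass.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
head = nn.Linear(512, 10)

x = torch.randn(64, 512, requires_grad=True)
h = checkpoint(block, x, use_reentrant=False)  # forward runs, intermediates are not stored
loss = head(h).sum()
loss.backward()                                # block's forward is re-run here for gradients
```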

Compare SGD, GD, and Adam optimizers. Also discuss loss functions for unbalanced classes and the most suitable loss function for diffusion models.

Gradient Descent (GD): uses the entire dataset for each update, which can be slow for large datasets.
Stochastic Gradient Descent (SGD): updates the model's weights in the direction of the negative gradient of the loss, computed on mini-batches of data. SGD can be slow to converge and may oscillate in the presence of high curvature or saddle points, but it is effective for large-scale and sparse data.
Adam (Adaptive Moment Estimation): combines the benefits of two extensions of SGD, momentum and adaptive learning rates. Adam maintains running averages of both the gradients (first moment) and their squared values (second moment), using these to adapt the learning rate for each parameter individually. This allows Adam to handle sparse gradients and varying data distributions more effectively, often leading to faster and more stable convergence than plain SGD.
Loss Functions for Unbalanced Classes: focal loss or weighted cross-entropy address imbalance by focusing on harder examples or weighting minority classes more heavily.
Loss Function for Diffusion Models: a common choice is the mean-squared error (MSE) on the noise prediction (i.e., the ℓ2 distance between predicted and true noise), as it aligns well with the diffusion process framework.
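A minimal sketch of the two class-imbalance losses mentioned above in PyTorch; the class weights and focusing parameter gamma are illustrative:

```python
# Weighted cross-entropy vs. a simple focal loss for an imbalanced binary problem.
import torch
import torch.nn.functional as F

logits = torch.randn(16, 2)                 # class 1 assumed to be the rare class
targets = torch.randint(0, 2, (16,))

# Weighted cross-entropy: up-weight the minority class.
weights = torch.tensor([0.2, 0.8])
wce = F.cross_entropy(logits, targets, weight=weights)

# Focal loss: down-weight easy, well-classified examples (gamma controls the focus).
gamma = 2.0
log_p = F.log_softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
focal = (-(1 - log_p.exp()) ** gamma * log_p).mean()
print(wce.item(), focal.item())
```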

How does regularization (L1, L2, dropout) help in preventing overfitting?

L1 Regularization: Encourages sparsity, zeroing out less-important parameters. L2 Regularization: Penalizes large weights, distributing weights more evenly. Dropout: Randomly zeroes neurons during training, reducing inter-neuron co-adaptations and improving generalization; it also induces an implicit ensemble effect, since each stochastic forward pass trains a different sub-network.

What role do latent variable models play in capturing multimodal driver or pedestrian behaviors?

Latent variable models, like VAEs or diffusion-based architectures, learn a probabilistic representation of behaviors, allowing multiple plausible futures to be generated or predicted. They capture inherent uncertainty and diversity in real-world driving scenarios, essential for robust planning and simulation.

Explain the concept of hardware-software co-design in the context of deep learning for autonomous driving.

Hardware-software co-design in autonomous driving systems involves the simultaneous optimization of neural network architectures alongside specialized hardware accelerators, instruction sets, and memory hierarchies, enabling enhanced speed and power efficiency through techniques like parallel processing and optimized data flows, while custom accelerator architectures specifically designed for deep learning operations like matrix multiplications ensure maximum resource utilization within embedded constraints.

How does hierarchical RL benefit autonomous driving systems?

Hierarchical RL decomposes tasks into sub-policies for simpler subtasks (e.g., lane following, merging) governed by a high-level policy. This modular approach improves sample efficiency, interpretability, and reusability of sub-policies while handling complex, long-horizon tasks like driving.

What is inverse reinforcement learning (IRL), and how might it be applied to autonomous driving?

IRL infers a reward function from expert demonstrations. In autonomous driving, it can capture complex driving preferences (comfort, safety, compliance) implicitly encoded in expert behavior. The learned reward can then guide a planning or RL policy that generalizes beyond the demonstration set.

What are the main steps involved in the MAMQL training procedure?

1. Initialize each agent's marginal Q-function Qi and reward function Ri.
2. Collect transitions via the current joint policy and store them in a buffer.
3. Update critics by enforcing a soft Bellman consistency for Qi (similar to IQ-Learn but marginalized over other agents).
4. Recover rewards by matching the marginal Q-critic to the true (but unknown) reward, using expert transitions to guide the fitting.
5. Derive policies as Boltzmann policies from Qi.
6. Repeat until convergence.

What are some key insights and limitations discussed in the MAMQL paper?

Insights: Marginalizing over other agents' actions reduces environment non-stationarity and simplifies multi-agent IRL. A direct, non-adversarial objective (building on IQ-Learn) offers greater stability and efficiency.
Limitations: Experiments rely on simulated experts, which may differ from human or suboptimal experts. Large discrete action spaces can still be computationally heavy. Additional research is needed to handle real-world human biases, safety, or continuous actions at scale.

What is Inverse Q Learning?

Inverse Q-learning is an extension of inverse reinforcement learning that seeks to infer the underlying action-value function Q∗(s,a) from observed expert behavior without explicit reward signals. The algorithm operates by formulating an optimization problem where the estimated Q-function assigns higher values to the actions demonstrated by the expert compared to alternative actions, typically enforcing constraints derived from the Bellman optimality conditions. By leveraging techniques such as maximum likelihood estimation or entropy regularization, Inverse Q-learning iteratively adjusts the Q-values to best explain the observed trajectories, thereby enabling the derivation of policies that replicate expert performance based solely on behavioral data.

Explain the backpropagation algorithm and why it is essential for training neural networks.

It leverages the chain rule of calculus to propagate error signals backward through the network, from the output layer to the input layer. The process involves two main phases: the forward pass, where input data is passed through the network to compute predictions and the loss, and the backward pass, where gradients of the loss with respect to each parameter are calculated. Backpropagation is essential because it enables gradient-based optimization methods to update the network's weights in a direction that minimizes the loss function.

What is multi-head attention, and how does it differ from single-head attention?

Multi-head attention applies self-attention multiple times with different learned projections. Each head captures different aspects of the pairwise relations. The outputs are concatenated and linearly projected, improving the model's capacity to attend to diverse features and correlations simultaneously.
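A minimal sketch using PyTorch's built-in nn.MultiheadAttention; the dimensions are illustrative:

```python
# Multi-head self-attention: queries, keys, and values all come from the same sequence.
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)            # (batch, sequence length, embedding dim)
out, attn_weights = mha(x, x, x)      # each head uses its own learned projections
print(out.shape, attn_weights.shape)  # (2, 10, 64) and (2, 10, 10), averaged over heads
```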

What is neural architecture search (NAS)?

Neural Architecture Search (NAS) is an automated process for designing neural network architectures tailored to specific tasks and constraints. NAS algorithms explore a predefined search space of possible architectures using strategies like reinforcement learning, evolutionary algorithms, or gradient-based methods to identify optimal or near-optimal configurations that maximize performance metrics while adhering to resource limitations.

What is "non-stationarity" in multi-agent RL, and how does MAMQL address it?

Non-stationarity arises because each agent's learning process changes the environment for the others—i.e., each agent's policy influences all agents' outcomes. MAMQL tackles this by learning marginalized Q-functions that average (or marginalize) over the other agents' actions, effectively smoothing out their contributions. This enables each agent's critic to reflect a stable approximation of its own expected returns even as other policies evolve.

Compare on-policy methods (e.g., PPO) with off-policy methods (e.g., DQN, SAC) for autonomous driving tasks.

On-policy (PPO): Updates a policy based on trajectories sampled from the current policy, ensuring stable but slower learning. Good for continuous control (with stable objectives). Off-policy (DQN, SAC): Learns from a replay buffer, allowing reuse of past experiences, often more sample-efficient. SAC handles continuous action spaces and can learn robust behaviors but might require more careful hyperparameter tuning.

How does online RL differ from offline RL in the context of self-driving cars?

Online RL: Learns in real-time by interacting with the environment, enabling adaptability to new scenarios but posing safety risks and requiring significant computational resources. Ideal for simulation-based training or controlled environments. Offline RL: Trains on pre-collected datasets without live interaction, ensuring safety and efficiency but limited by dataset quality and generalization to unseen scenarios. Suitable for pre-training deployable self-driving policies.

Explain principal component analysis (PCA) and how you might use it in a high-dimensional setting.

PCA projects data onto orthogonal components of maximal variance, effectively reducing dimensionality. In high-dimensional settings, PCA can help remove redundancy, mitigate noise, and reduce computational complexity. One should carefully choose the number of principal components to preserve relevant information while dropping noise.
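A minimal scikit-learn sketch that keeps enough components to explain roughly 95% of the variance; the data here is synthetic:

```python
# PCA for dimensionality reduction, selecting components by explained variance.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(1000, 300)    # high-dimensional toy data
pca = PCA(n_components=0.95)      # keep components until ~95% of variance is explained
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```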

How does Proximal Policy Optimization (PPO) differ from MAMQL?

PPO is a single-agent (or centralized) policy gradient method that updates a policy by clipping the objective to avoid overly large steps. In contrast, the paper's multi-agent IRL approaches (e.g., MAMQL) aim to learn reward functions and policies simultaneously from demonstrations, while addressing non-stationarity due to multiple agents optimizing different objectives. PPO directly maximizes cumulative reward under a known reward function, whereas multi-agent IRL infers those objectives first, then learns the policies accordingly.

What is policy lag?

Policy lag is a mismatch between how quickly the actor (policy) and the critic (value function) are updated in actor-critic methods such as SAC.
Critic Too Fast: if the critic updates too quickly relative to the actor, the policy may not effectively follow the optimal value gradient because the "goalposts" keep shifting.
Actor Too Fast: conversely, if the actor updates too quickly without an adequately trained critic, it may act on noisy or inaccurate value estimates, leading to suboptimal or unstable behavior.
Resulting Instability: this mismatch can cause oscillations in policy updates, inefficient exploration, or even divergence in the learning process.
Causes of policy-critic mismatch:
High Variance in Value Estimates: if the critic's Q-function isn't well-regularized, it may produce noisy gradients that destabilize the actor's learning.
Update Frequency Differences: SAC updates the critic more frequently than the actor (e.g., multiple critic updates per policy step); if this balance is mismanaged, it can exacerbate the lag.
Suboptimal Hyperparameters: poorly tuned learning rates, target smoothing coefficients, or entropy coefficients can amplify the misalignment.
Delayed Target Networks: SAC uses target networks for stability, but overly delayed target updates can contribute to the lag between the actor and critic.

What are positional encodings in Transformers, and how do they compensate for the lack of recurrence or convolution?

Positional encodings inject information about the relative or absolute position of tokens in a sequence. Since Transformers rely purely on attention, these encodings preserve ordering information. Common implementations include sinusoidal encodings or learnable embeddings.
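A minimal sketch of the sinusoidal encodings from the original Transformer paper; the dimensions are illustrative:

```python
# Sinusoidal positional encodings: each position gets a unique pattern of sines/cosines.
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    pos = torch.arange(seq_len).unsqueeze(1).float()                   # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-torch.log(torch.tensor(10000.0)) / d_model))   # per-dim frequencies
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # added to token embeddings before the first attention layer

print(sinusoidal_positional_encoding(50, 128).shape)  # (50, 128)
```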

What metrics would you use to evaluate a model on a highly imbalanced classification problem?

Precision: the proportion of true positive predictions among all positive predictions, indicating the model's accuracy in identifying the minority class.
Recall (Sensitivity): the proportion of actual positives correctly identified, reflecting the model's ability to capture all relevant instances.
F1-Score: the harmonic mean of precision and recall, providing a single metric that balances both aspects; particularly useful when seeking a balance between precision and recall.
Precision-Recall AUC (Area Under the Curve): evaluates the trade-off between precision and recall across different threshold settings, offering a comprehensive view of performance on the minority class.
ROC AUC: although useful, it can be less informative under extreme imbalance, since it considers the true positive rate and false positive rate without focusing solely on the minority class.

What are the main differences between PyTorch and TensorFlow in terms of computational graphs and debugging?

PyTorch uses a dynamic computation graph, which is more intuitive for debugging and flexible for research. TensorFlow's static-graph approach (especially in TF1.x) can optimize execution efficiently but is less straightforward to debug. TensorFlow 2.x introduced eager mode to narrow this gap, but PyTorch remains the more "pythonic" experience for many researchers.

What is a VQ-GAN?

VQ-GANs (Vector Quantized Generative Adversarial Networks) integrate a convolutional encoder and decoder with a discrete latent space achieved via vector quantization, enabling high-fidelity image generation. The training pipeline involves encoding input data into latent vectors, quantizing these vectors against a learned codebook, and decoding them back to reconstruct the original input, while simultaneously training a discriminator to enforce realism through an adversarial loss. The loss functions typically include a reconstruction loss (e.g., L1 or perceptual loss) to ensure fidelity, an adversarial loss to enhance realism, and a commitment loss that encourages the encoder outputs to commit to the nearest codebook entries. Additionally, the architecture leverages residual blocks and attention mechanisms to capture complex dependencies, while the adversarial component ensures that the generated outputs are indistinguishable from real data. This combination allows VQ-GANs to efficiently learn rich, discrete representations of the data.

What is Q Learning?

Q-learning is a model-free reinforcement learning algorithm that iteratively estimates the optimal action-value function Q*(s, a), which represents the maximum expected cumulative reward achievable from state s by taking action a and following the optimal policy thereafter. The algorithm updates Q-values using the Bellman equation:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t) ]
where α is the learning rate and γ is the discount factor. By employing exploration-exploitation strategies, such as ε-greedy, Q-learning progressively converges to the optimal policy without requiring a model of the environment, making it effective for solving a wide range of sequential decision-making problems.
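A minimal tabular Q-learning sketch with ε-greedy exploration, assuming a hypothetical discrete env with a reset/step interface returning (next state, reward, done):

```python
# Tabular Q-learning with the TD update from the Bellman equation above.
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection.
            a = np.random.randint(n_actions) if np.random.rand() < eps else Q[s].argmax()
            s_next, r, done = env.step(a)   # hypothetical environment interface
            # TD update toward r + gamma * max_a' Q(s', a').
            Q[s, a] += alpha * (r + gamma * (0 if done else Q[s_next].max()) - Q[s, a])
            s = s_next
    return Q
```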

What approaches can be used to speed up hyperparameter tuning in large-scale ML tasks?

Random Search: instead of exhaustively searching all hyperparameter combinations, random search samples a subset randomly, often finding good configurations more efficiently.
Bayesian Optimization: uses probabilistic models to predict the performance of hyperparameter combinations and intelligently explore the search space, focusing on promising regions to find optimal settings with fewer evaluations.
Hyperband: combines random search with adaptive resource allocation and early stopping, allocating more resources to promising configurations while quickly discarding underperforming ones.
Grid Search: systematically explores a predefined set of hyperparameter values; while exhaustive, it can be parallelized to speed up the search, though it remains inefficient for large spaces.
Gradient-Based Optimization: techniques like hypergradient descent leverage gradient information to adjust hyperparameters dynamically during training, reducing the need for separate evaluations.
Early Stopping: terminates training of poor-performing configurations early, saving computational resources on models unlikely to perform well.
Parallel and Distributed Computing: evaluating multiple configurations simultaneously across processors or machines significantly reduces overall tuning time.
Transfer Learning and Warm Starting: using knowledge from previous tuning tasks to inform the search in new tasks, potentially reducing the number of required evaluations.

When would you prefer using ReLU over other activation functions like sigmoid or tanh?

ReLU avoids saturation in the positive domain, mitigating vanishing gradients and enabling faster training. Sigmoid or tanh may be preferable in certain cases (e.g., binary output, gating mechanisms in RNNs), but ReLU is widely used due to its simplicity and effectiveness in deep CNNs and MLPs.

What methods can be used to predict the trajectories of surrounding vehicles and pedestrians?

These include physics-based models (constant velocity or acceleration), maneuver-based models (classifying probable maneuvers), and deep-learning-based approaches (LSTM, Transformer, GNN) that capture spatial-temporal interactions. Hybrid methods might combine model-based constraints with learned representations.

How do skip or residual connections in architectures like ResNet help address the vanishing gradient problem?

Residual connections allow gradients to flow directly across layers, bypassing some nonlinear transformations. This direct path facilitates training deeper networks by preserving gradient magnitude, thus alleviating vanishing gradients and enhancing representational power.
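A minimal PyTorch sketch of a residual block; the channel count is illustrative:

```python
# Residual block: the identity shortcut lets gradients bypass the convolutions.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1, self.bn2 = nn.BatchNorm2d(channels), nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # skip connection: add the input back in

print(ResidualBlock(16)(torch.randn(1, 16, 32, 32)).shape)  # (1, 16, 32, 32)
```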

What is reward shaping, and how does it affect the training of an RL agent for autonomous driving tasks?

Reward shaping modifies or augments the reward function to guide the agent's exploration towards desired behaviors (e.g., staying in lane, avoiding collisions). Proper shaping accelerates learning but must be carefully designed to avoid suboptimal local maxima or overshadowing the primary goal (safe and efficient driving).

What is score-based generative modeling, and how is it related to diffusion models?

Score-based models learn the gradient (score) of the log probability density, guiding denoising from noisy samples. Diffusion models can be framed as score-based methods, where the "score" is learned at various noise levels. Both approaches revolve around iteratively refining noisy samples.

Explain the concept of "self-attention" and why it is crucial in Transformers.

Self-attention computes a weighted aggregation of all positions in a sequence for each position, capturing contextual relationships dynamically. It enables the model to selectively focus on relevant parts of the sequence at each step, improving representational efficiency and interpretability.

How do single-agent IRL methods differ from multi-agent IRL methods conceptually?

Single-agent IRL methods assume one agent faces a fixed reward function in an environment whose dynamics are unaffected by other learners. Multi-agent IRL must account for multiple rewards (one per agent) and joint policies that can be both cooperative and competitive. Hence, multi-agent IRL solutions incorporate equilibrium concepts, adapt to other agents' strategies, and must estimate more complex interactions from demonstrations.

What is Soft Actor-Critic (SAC)?

Soft Actor-Critic (SAC) is an off-policy, model-free reinforcement learning algorithm designed for continuous action spaces that integrates maximum entropy principles to enhance exploration and robustness. Architecturally, SAC employs dual Q-networks to reduce overestimation bias and a separate stochastic policy network that outputs a probability distribution over actions, thereby encouraging diverse action selection. The training pipeline involves simultaneously updating the Q-functions to minimize the soft Bellman residual, optimizing the policy to maximize both expected rewards and entropy, and dynamically tuning the temperature parameter to balance exploration and exploitation. This combination of entropy regularization and stable actor-critic updates enables SAC to achieve high sample efficiency and reliable performance across a wide range of continuous control tasks.

How do you train DiTs efficiently given their potentially high computational demands?

Strategies include mixed-precision training, gradient checkpointing, distributed training over multiple GPUs, and careful hyperparameter selection (learning rate scheduling, smaller batch sizes with accumulative gradients). Structured or sparse attention may also reduce overhead.

Discuss strengths and weaknesses of diffusion models in terms of training stability, mode coverage, and inference time.

Strengths: Stable training (no adversarial min-max), good mode coverage (fewer collapse issues), and high-quality generations. Weaknesses: Inference is slow due to multi-step sampling, requiring sequential denoising iterations. Speed-up strategies like Denoising Diffusion Implicit Models (DDIM) or learned sampler schedules can mitigate this.

What are potential strategies to reduce inference time in diffusion models?

Techniques include fewer reverse steps (DDIM), learned sampling schedules, or distillation approaches that approximate multi-step diffusion with fewer or single steps. These strategies trade off sampling speed for some reduction in sample fidelity or diversity.

What is the Bellman Equation?

The Bellman equation is a foundational recursive relationship in dynamic programming and reinforcement learning that defines the value of a state as the maximum expected return achievable by taking an optimal action and subsequently following the optimal policy. Formally, for the optimal value function V*(s), it is expressed as:
V*(s) = max_a [ R(s, a) + γ Σ_{s'} P(s' | s, a) V*(s') ]
where R(s, a) is the immediate reward, γ is the discount factor, and P(s' | s, a) represents the transition probabilities. In algorithms like Q-learning and value iteration, the Bellman equation is iteratively applied to update value estimates, ensuring convergence to the optimal value function by enforcing the principle of optimality. This recursive decomposition enables efficient computation of optimal policies by breaking down complex decision-making processes into manageable subproblems.

What is the Boltzmann policy?

The Boltzmann policy, often referred to as the softmax policy, is a probabilistic strategy used in reinforcement learning to select actions based on their estimated values. It assigns higher probabilities to actions with greater expected rewards while still allowing less optimal actions to be chosen with some likelihood, thereby balancing exploration and exploitation. By adjusting a temperature parameter, the Boltzmann policy can control the degree of randomness in action selection, enabling more flexible and adaptive learning processes. This approach is integral to algorithms that benefit from controlled stochasticity, enhancing their ability to efficiently learn and robustly perform in complex environments.
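A minimal sketch of sampling from a Boltzmann (softmax) policy over estimated Q-values; the values and temperature are illustrative:

```python
# Boltzmann policy: higher-valued actions get higher probability, controlled by temperature.
import numpy as np

def boltzmann_policy(q_values, temperature=1.0):
    z = q_values / temperature
    z = z - z.max()                      # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()  # softmax over action values
    return np.random.choice(len(q_values), p=probs)

q = np.array([1.0, 2.0, 0.5])
print(boltzmann_policy(q, temperature=0.5))   # low temperature -> mostly picks action 1
```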

What is the TD update?

The Temporal Difference (TD) update is a fundamental technique in reinforcement learning used to estimate the value of states from experience. Unlike methods that wait until the end of an episode to make updates, TD learning updates value estimates continuously as the agent interacts with the environment. The update moves the current estimate toward a bootstrapped target built from the observed reward and the estimated value of the next state, e.g. V(s) ← V(s) + α [ r + γ V(s') − V(s) ], where the bracketed term is the TD error. This allows the agent to gradually improve its understanding of which actions lead to better outcomes, enabling efficient learning without a complete model of the environment.

What is the bias-variance trade-off, and why is it critical in machine learning model development?

The bias-variance trade-off balances underfitting (high bias) and overfitting (high variance). A model with high bias makes simplistic assumptions, leading to underfitting, whereas a model with high variance overfits the training data and fails to generalize. The optimal trade-off is found by minimizing both errors, typically through regularization, careful hyperparameter tuning, or model selection.

Can you outline the forward and reverse diffusion processes in diffusion models?

The forward process gradually pushes a sample off the data manifold, turning it into noise; the reverse process is trained to produce a trajectory back to the data manifold. A classical result (Feller, 1949) shows that, in the limit of infinitesimal step sizes, the true reverse process has the same functional form as the forward process. Diffusion models leverage this observation by parameterizing each learned reverse step as a unimodal diagonal Gaussian, allowing the model to undo the forward steps one at a time as a Markov chain.

How would you integrate a Behavioral Foundation Model into a planning pipeline for autonomous driving?

The model provides high-level representations (e.g., embedding of the environment, traffic participants, and driver intention). A planning module (possibly optimization-based or RL-based) queries these representations to generate feasible, socially compliant trajectories. Knowledge distillation or direct feature sharing can facilitate integration.

How do diffusion models handle multimodality in data, such as multiple plausible futures in driving scenarios?

They inherently capture probability distributions over data via the noise-to-sample mapping, allowing multiple plausible generations. This is particularly suitable for uncertain or multimodal domains (e.g., different possible driver intentions), as each reverse diffusion path can yield distinct outcomes.

How do DiTs handle conditional inputs (e.g., road geometry, traffic lights) for realistic driving simulation?

They typically concatenate or cross-attend to conditional tokens (maps, traffic signals) at each diffusion timestep. Alternatively, they can incorporate condition-specific encoders or latent embeddings, ensuring that domain context guides the denoising and generation process.

How do you train a Variational Autoencoder (VAE), and how does it differ from a standard autoencoder?

Training a Variational Autoencoder (VAE) involves optimizing a loss function that combines a reconstruction loss with a regularization term. The VAE architecture consists of an encoder that maps input data to a latent space defined by a mean and variance, and a decoder that reconstructs the data from sampled latent variables. The loss function includes: Reconstruction Loss: Measures how well the decoder reconstructs the input data, typically using Mean Squared Error (MSE) or Binary Cross-Entropy. Kullback-Leibler (KL) Divergence: Regularizes the latent space by ensuring that the learned distribution of latent variables approximates a prior distribution, usually a standard normal distribution. In contrast, a standard autoencoder solely focuses on minimizing the reconstruction loss without imposing any constraints on the latent space. This means that while standard autoencoders can effectively compress and reconstruct data, they do not encourage a structured latent space, making it difficult to generate new data samples.
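A minimal sketch of the VAE objective and the reparameterization trick in PyTorch; the encoder and decoder networks themselves are omitted and assumed to produce x_recon, mu, and logvar:

```python
# VAE loss: reconstruction term plus KL divergence to a standard normal prior.
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    recon = F.mse_loss(x_recon, x, reduction="sum")                 # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())    # KL(q(z|x) || N(0, I))
    return recon + kl

def reparameterize(mu, logvar):
    # Sample z = mu + sigma * eps so gradients can flow through the sampling step.
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
```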

Discuss common approaches to regularizing deep models to prevent overfitting.

Typical methods include L2 weight decay, dropout (randomly dropping neurons), early stopping, data augmentation (for image or audio tasks), and batch normalization. Additionally, advanced techniques like mixup or cutmix can further enhance generalization by artificially blending examples.

How do you handle missing or noisy data in a machine learning pipeline?

Typical strategies involve data imputation (mean, median, or model-based methods), outlier detection (using statistical thresholds or robust models), and noise-robust loss functions. Feature engineering (e.g., using embeddings or domain-specific transformations) can also mitigate the impact of missing or noisy inputs.

Can you discuss the differences between a "U-Net style" diffusion backbone and a Transformer-based diffusion backbone?

U-Net Backbone: Uses convolutional downsampling/upsampling to learn hierarchical feature maps with skip connections. Transformer Backbone: Relies on global self-attention across sequence tokens for representation learning. Transformers can be more flexible in capturing large-scale context but demand more compute and memory.

How can you incorporate uncertainty or multimodality into an imitation learning policy for driving?

Use stochastic policy outputs (e.g., mixture models or latent variable policies) and train them to match the distribution of expert data. Alternatively, incorporate trajectory-based IL, where the model can produce multiple candidate actions conditioned on an internal latent variable capturing driver intent diversity.

Explain the concepts of value functions, policies, and Q-functions in RL.

Value Function Vπ(s): Expected return when starting in state s and following policy π. Policy π: Mapping from states to actions, potentially stochastic. Q-function Q(s,a): Expected return from taking action a in state s and thereafter following π. Learning and iterating these functions underpins many RL algorithms.

When training a deep network, how would you identify and mitigate the vanishing or exploding gradient problem?

Vanishing or exploding gradients can be identified through abnormally small or large weight updates. Mitigation strategies include careful weight initialization (He or Xavier), batch normalization, residual connections, gradient clipping (for exploding gradients), and appropriate architecture choices (like LSTM or GRU for sequence data).

Can imitation learning be combined with reinforcement learning for autonomous driving, and if so, how?

Yes. One can initialize a policy via imitation learning for fast convergence and then refine it with RL to optimize task-specific rewards. This synergy leverages expert knowledge (IL) and self-exploration (RL) for better performance and robustness.

Can foundation models be specialized to specific driving tasks (e.g., overtaking, merging) without losing generality?

Yes. Using a multi-task architecture or conditional inputs (e.g., scenario descriptors) can specialize to specific tasks while retaining a core, shared backbone. Task-specific heads or adapters then build on a shared foundation, maintaining generalizability while achieving high performance on specialized tasks.

