CAP4611 Final Review
What is a decision tree, and how is it used in data classification tasks?
A decision tree is a flowchart-like structure used for decision-making and classification tasks. It is used in data classification tasks by dividing the dataset into smaller subsets based on feature values.
What is the Gini index, and how is it used in building decision trees?
A metric measuring node impurity: the probability of misclassifying a randomly chosen data point if it were labeled according to the node's class distribution. When building a decision tree, the candidate split with the lowest weighted Gini impurity in its child nodes is chosen as the best split (see the sketch below).
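A minimal sketch of the computation in plain Python (the toy labels are made up for illustration):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# Toy node with 4 "yes" and 2 "no" labels
node = ["yes"] * 4 + ["no"] * 2
print(gini(node))  # 1 - (4/6)^2 - (2/6)^2 ≈ 0.444

# Weighted Gini of a candidate split: the split with the lowest value wins
left, right = ["yes", "yes", "yes"], ["yes", "no", "no"]
n = len(left) + len(right)
print(len(left) / n * gini(left) + len(right) / n * gini(right))  # ≈ 0.222
```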
What is the K-Nearest Neighbors (KNN) algorithm, and in which scenarios is it commonly used?
A non-parametric algorithm that classifies a data point based on the majority label of its k nearest neighbors. It is useful for tasks where the relationships between data points are complex and the decision boundary is non-linear.
Explain the components of a confusion matrix.
A table comparing predicted labels against actual labels. Its four cells are true positives (TP: correctly predicted positives), false positives (FP: negatives predicted as positive), true negatives (TN: correctly predicted negatives), and false negatives (FN: positives predicted as negative).
Explain the concept of generalization and its importance in machine learning models.
Ability of a model to perform well on unseen data. This is a crucial aspect of model evaluation because the ultimate goal of most machine learning models is to make accurate predictions or decisions on new, real-world data.
How does PCA achieve dimensionality reduction? Why is dimensionality reduction important?
PCA projects the data onto a small number of orthogonal directions (principal components) that capture the highest variance. Dimensionality reduction is important because it cuts computation and storage costs, removes noise and redundancy, mitigates the curse of dimensionality, and makes data easier to visualize.
What is the purpose of regularization, and how does it help in managing overfitting?
Regularization adds penalty terms to the loss function (e.g., L1/L2 norms of the weights). By discouraging large or unnecessary weights, it limits model complexity and helps prevent overfitting.
List and explain the advantages and limitations of using decision trees.
Advantages: simple, interpretable, handles both categorical and continuous data. Limitations: prone to overfitting and unstable (small changes in the data can produce a very different tree).
What is hierarchical clustering? Differentiate between agglomerative (bottom-up) and divisive (top-down) approaches.
Agglomerative (bottom-up): starts with each point as its own cluster and repeatedly merges the closest clusters. Divisive (top-down): starts with all points in one cluster and recursively splits it.
Explain the primary goals of clustering. What is meant by maximizing intra-cluster similarity and minimizing inter-cluster similarity?
Aims to group data points into clusters. Maximizing Intra-Cluster Similarity: The goal is to ensure that points within a cluster are homogeneous or closely related. Minimizing Inter-Cluster Similarity: The goal is to maximize the distance or dissimilarity between clusters, ensuring that each cluster is distinct.
What is the purpose of evaluation metrics in machine learning?
Assessing model performance, comparing models, handling trade-offs, ensuring generalization.
What are the limitations of PCA? In what situations might PCA not perform well?
Assumes linear relationships, so it misses non-linear structure (curved manifolds). Sensitive to outliers and to feature scaling, and its components can be hard to interpret.
How are attributes selected for splitting at each node in a decision tree?
Attributes are selected based on metrics like information gain or Gini index.
Compare and contrast bagging and boosting. Highlight their objectives, how they combine models, and their impact on bias and variance.
Bagging: trains models in parallel on bootstrap samples and combines them by voting/averaging; its objective is to reduce variance (e.g., Random Forest). Boosting: trains models sequentially, each focusing on points the previous ones misclassified, and combines them in a weighted sum; its objective is to reduce bias (e.g., AdaBoost).
Discuss the trade-off between bias and variance in machine learning models.
Simple models have high bias (underfitting); complex models have high variance (overfitting). The trade-off is balancing complexity so that total error (bias² + variance + irreducible noise) is minimized.
Describe the ID3 algorithm and its role in decision tree construction.
Builds decision trees by recursively selecting attributes that maximize information gain.
What are ensemble models in machine learning? Explain how combining multiple weak learners can lead to a strong learner.
Ensemble models combine the predictions of multiple base models. Individually weak learners make different errors on different inputs; aggregating their outputs (by voting, averaging, or weighting) cancels many of those errors, so the ensemble acts as a strong learner that generalizes better to unseen data.
What is the curse of dimensionality, and how does PCA address this issue?
As dimensionality grows, data become sparse and distances between points lose meaning, making learning harder. PCA mitigates this by reducing irrelevant features and focusing on the directions of key variance.
List the steps involved in the KNN algorithm for classification. Use an example to illustrate these steps.
Compute distances between the query and all training points. Select k nearest neighbors. Assign the majority label (classification) or average value (regression).
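A minimal sketch of these steps in plain Python, with made-up 2-D toy points:

```python
import math
from collections import Counter

def knn_classify(query, points, labels, k=3):
    # 1. Compute distances between the query and all training points
    dists = [math.dist(query, p) for p in points]
    # 2. Select the indices of the k nearest neighbors
    nearest = sorted(range(len(points)), key=lambda i: dists[i])[:k]
    # 3. Assign the majority label among those neighbors
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Toy example: two clusters in 2-D
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_classify((2, 2), points, labels))  # "A": the 3 nearest points are all A
```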
How do you decide how many principal components to retain after performing PCA?
Compute the explained variance ratio of each component and retain enough components to reach a target cumulative variance (e.g., 90-95%), or look for the elbow in a scree plot.
What factors should be considered when using evaluation metrics to compare different machine learning models?
Consider metrics, computation time, and domain-specific requirements.
How does a decision tree algorithm handle continuous and categorical variables differently?
Continuous variables are split into ranges; categorical variables are split by category.
What is the softmax function, and why is it commonly used in multi-class classification tasks?
Converts a vector of raw scores (logits) into a probability distribution: softmax(z_i) = e^{z_i} / Σ_j e^{z_j}. It is used in multi-class classification because the outputs are non-negative and sum to 1, so each can be read as a class probability.
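A sketch in NumPy; subtracting the maximum logit before exponentiating is a standard numerical-stability trick assumed here, not something the definition requires:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged
    # because softmax is invariant to adding a constant to all inputs.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # raw scores for three classes
probs = softmax(logits)
print(probs)            # ≈ [0.659, 0.242, 0.099]
print(probs.sum())      # 1.0, a valid probability distribution
```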
Explain what leave-one-out cross-validation is and in what scenario it might be particularly useful.
Each data point serves once as the test set while the rest form the training set, so a dataset of n points yields n train/test rounds. Particularly useful for very small datasets where every sample is needed for training, e.g., a small clinical study predicting a rare disease.
What roles do eigenvalues and eigenvectors play in PCA? How do they determine the principal components?
Eigenvalues indicate variance explained by components. Eigenvectors define the directions (principal components).
How does cross-validation help prevent overfitting?
By evaluating the model on multiple held-out splits rather than the data it was trained on, cross-validation exposes overfitting: a model that memorizes the training data will score well on training folds but poorly on validation folds.
Define entropy and information gain in the context of decision trees and explain how they influence the construction of the tree.
Entropy: Measure of impurity or disorder. Information Gain: Reduction in entropy after a dataset split. These concepts influence the tree construction process by helping to identify the best feature to split on.
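A minimal sketch of both quantities in plain Python, using made-up labels:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy in bits: -sum p * log2(p) over class proportions p
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    # Entropy of the parent minus the size-weighted entropy of the children
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

parent = ["yes"] * 4 + ["no"] * 2
split = [["yes", "yes", "yes"], ["yes", "no", "no"]]
print(entropy(parent))                  # ≈ 0.918 bits
print(information_gain(parent, split))  # ≈ 0.459 bits
```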
Name and explain three common distance metrics used in clustering algorithms. In what scenarios would each metric be preferred?
Euclidean: For geometric distances. Manhattan: For grid-like data. Cosine Similarity: For textual data.
Name and describe at least three distance metrics used in KNN. In what situations would you use each one?
Euclidean: Straight-line distance. Manhattan: Sum of absolute differences. Cosine Similarity: Measures angle between vectors (for text data).
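A sketch of the three metrics in NumPy (the vectors are arbitrary illustrations):

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))      # straight-line distance

def manhattan(a, b):
    return np.sum(np.abs(a - b))              # sum of absolute differences

def cosine_similarity(a, b):
    # Measures the angle between vectors, ignoring magnitude
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(euclidean(a, b))          # ≈ 3.742
print(manhattan(a, b))          # 6.0
print(cosine_similarity(a, b))  # 1.0: same direction despite different magnitudes
```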
KNN is often computationally intensive for large datasets. Why is this the case, and what strategies can be used to improve its efficiency?
Expensive for large datasets due to distance computations. Optimize with KD-trees or approximate methods.
You are given a dataset that does not follow a normal distribution and contains many outliers. Which method—parametric or non-parametric—would you choose for analysis? Justify your choice with reasoning.
For a dataset with outliers and no clear distribution, use a non-parametric method like KNN or Decision Trees.
You are tasked with grouping customers based on purchasing behavior. How would you apply clustering to solve this problem? Discuss the preprocessing steps and the choice of algorithm.
For customer segmentation, preprocess data (normalize, remove noise), and use K-Means or DBSCAN.
Suppose you are tasked with classifying emails as spam or not spam using KNN. Describe how you would preprocess the data and select an appropriate K value for the task.
For spam classification, preprocess by vectorizing emails, scaling features, and choosing k using cross-validation.
Provide examples of real-world applications where decision trees are effectively used.
Fraud detection, medical diagnosis, and customer segmentation.
What is the F1 score and how does it combine precision and recall?
The harmonic mean of precision and recall: F1 = 2 · (precision · recall) / (precision + recall). The harmonic mean penalizes imbalance, so a high F1 requires both precision and recall to be high.
Describe the key differences between hierarchical and partitional clustering. Provide an example of each.
Hierarchical: Builds a tree of clusters. Ex: A biologist wants to group genes based on their expression levels across different conditions to understand genetic relationships. Partitional: Divides data into fixed k clusters (e.g., K-Means). Ex: A retail company wants to divide its customer base into distinct groups to design targeted marketing campaigns.
Explain the curse of dimensionality and its impact on clustering algorithms. What techniques can be used to address this issue?
High dimensions dilute clustering quality. Use PCA or t-SNE to reduce dimensions.
What indicators would suggest that a model is overfitting?
High training accuracy, low test accuracy.
Describe the symptoms and consequences of high variance in a machine learning model.
High training accuracy, poor test accuracy.
Explain the "curse of dimensionality" and how it affects KNN. What techniques can be used to mitigate this issue?
High-dimensional data dilutes distances ("curse of dimensionality"). Use PCA or feature selection to reduce dimensions.
Describe how a paired t-test is used to compare the performance of two models.
Compute the performance difference between the two models on each fold (or dataset), then test whether the mean difference is significantly different from zero. If the t value exceeds the critical value (equivalently, p < α), conclude the models have significantly different performance; otherwise the difference is not statistically significant.
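A sketch using SciPy's paired t-test, with made-up per-fold accuracies for the two models:

```python
from scipy import stats

# Accuracy of two models on the same 5 cross-validation folds (made-up numbers)
model_a = [0.81, 0.79, 0.84, 0.80, 0.82]
model_b = [0.78, 0.75, 0.80, 0.77, 0.79]

t_stat, p_value = stats.ttest_rel(model_a, model_b)  # paired t-test
print(t_stat, p_value)
if p_value < 0.05:
    print("Significant difference between the models")
else:
    print("No statistically significant difference")
```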
Why is data scaling important in KNN? Describe the difference between normalization and standardization and their roles in KNN.
Important because KNN relies on distances, which can be skewed by varying feature scales. Normalization: Scales data to [0,1]. Standardization: Scales data to mean 0, variance 1.
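A sketch using scikit-learn's scalers; the two-feature toy matrix (age, income) is made up to show the scale mismatch:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales (age in years, income in dollars)
X = np.array([[25, 40_000], [35, 60_000], [45, 120_000]], dtype=float)

X_norm = MinMaxScaler().fit_transform(X)    # normalization: each column to [0, 1]
X_std = StandardScaler().fit_transform(X)   # standardization: mean 0, variance 1

print(X_norm)  # income no longer dwarfs age in distance computations
print(X_std)
```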
Why might accuracy not be a good metric in cases of class imbalance?
In imbalanced datasets, accuracy can be misleading due to dominance of the majority class and ignoring minority class performance.
Discuss the trade-off between precision and recall and how it affects model performance evaluation.
Raising the decision threshold typically increases precision but lowers recall, and vice versa. The right operating point depends on the relative cost of false positives versus false negatives (e.g., recall matters more in disease screening, precision more in spam filtering).
Outline the steps of the K-Means clustering algorithm. Provide an example to illustrate the process.
Initialize centroids. Assign points to nearest centroid. Update centroids. Repeat until convergence. Ex: A retail company wants to segment its customers into distinct groups based on their purchasing behavior to design targeted marketing strategies.
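A minimal NumPy sketch of the loop on two synthetic blobs (standing in for customer features):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two well-separated blobs in 2-D
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

k = 2
centroids = X[rng.choice(len(X), k, replace=False)]  # 1. initialize centroids
for _ in range(10):
    # 2. assign each point to its nearest centroid
    labels = np.argmin(np.linalg.norm(X[:, None] - centroids, axis=2), axis=1)
    # 3. update each centroid to the mean of its assigned points
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):  # 4. repeat until convergence
        break
    centroids = new_centroids

print(centroids)  # ≈ the two blob centers, (0, 0) and (5, 5)
```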
Describe the components of a decision tree, including internal nodes, branches, and leaf nodes.
Internal Nodes: Decision points for attributes. Branches: Outcomes of decisions. Leaf Nodes: Final predictions or classifications.
How does gradient boosting work? Explain the iterative process of improving predictions by minimizing residuals.
Gradient boosting builds models sequentially: each new model is trained on the residuals (the negative gradient of the loss) of the current ensemble, and its scaled predictions are added in, iteratively correcting earlier errors.
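A sketch with scikit-learn regression trees under squared loss, where the negative gradient is simply the residual; the sine-curve data are synthetic:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

lr, trees = 0.1, []
pred = np.full_like(y, y.mean())          # start from a constant prediction
for _ in range(100):
    residuals = y - pred                  # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)
    pred += lr * tree.predict(X)          # nudge predictions toward the targets

print(np.mean((y - pred) ** 2))  # training MSE shrinks as trees are added
```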
How do clustering algorithms handle outliers? Discuss the strengths and weaknesses of K-Means and DBSCAN in this context.
K-Means: sensitive to outliers, since means are pulled toward extreme points and every point must join a cluster. DBSCAN: robust to outliers, since sparse points are labeled as noise rather than forced into clusters.
What is meant by the term "lazy learner," and how does it apply to the KNN algorithm? Discuss its advantages and disadvantages.
A lazy learner performs no training phase; it simply stores the data and defers all computation until prediction time. KNN is the classic example. Advantages: no training cost, adapts immediately to new data. Disadvantages: high memory use and slow predictions.
Explain why KNN is considered a non-parametric algorithm. How does this characteristic impact its performance?
KNN doesn't assume a specific data distribution; it memorizes the training data and makes predictions dynamically. This lets it model complex relationships without assumptions and keeps it easy to implement, but it is computationally expensive at prediction time, suffers from the curse of dimensionality, and is sensitive to noise and outliers.
Compare KNN with a decision tree-based classifier. What are the strengths and weaknesses of each algorithm in terms of interpretability and performance?
KNN: Better for continuous data, high computation cost. Decision Trees: Better for interpretability, prone to overfitting.
Compare L1 and L2 regularization. How do they help prevent overfitting in neural networks?
L1: adds λ∑|w| to the loss, driving many weights exactly to zero (sparsity). L2: adds λ∑w² to the loss, shrinking all weights toward zero without eliminating them. Both constrain weight magnitudes, limiting the network's effective complexity and reducing overfitting.
Compare and contrast Lasso and Ridge regression in the context of regularization.
Lasso: Shrinks coefficients to zero (feature selection). Ridge: Shrinks coefficients without zeroing them.
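A scikit-learn sketch on synthetic data where only two of ten features matter; the alpha values are arbitrary choices:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features actually influence the target
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(lasso.coef_)  # irrelevant coefficients driven exactly to zero
print(ridge.coef_)  # all coefficients shrunk, but nonzero
```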
What is meant by high bias in a machine learning model, and what are its typical consequences?
High bias means the model makes overly strong simplifying assumptions and cannot capture the underlying patterns. It leads to underfitting, with poor performance on both training and test data.
What indicators would suggest that a model is underfitting?
Low accuracy on both training and test data.
How do decision trees handle missing values during the training process?
Methods include ignoring missing values, surrogate splits, assigning missing values to most frequent class, imputing before training.
Define accuracy in the context of binary classification.
The proportion of correctly classified instances out of all instances: accuracy = (TP + TN) / (TP + TN + FP + FN).
How do evaluation metrics help in assessing the generalization performance of a machine learning model?
Metrics assess how well the model performs on unseen data.
Why is it important to standardize data before applying PCA? What happens if you skip this step?
Necessary to ensure features with large magnitudes don't dominate. If you skip it, the covariance matrix is dominated by the features with the largest numeric ranges, and the leading principal components simply track those features rather than meaningful structure.
What is batch normalization, and how does it stabilize training and reduce overfitting in neural networks?
Normalizes each layer's activations within a mini-batch (zero mean, unit variance, followed by a learned scale and shift). This stabilizes and speeds up training, and the batch-to-batch noise it introduces acts as a mild regularizer.
Describe how principal components are derived. What properties do they have?
Derived by eigendecomposing the covariance matrix of the (standardized) data, which transforms the dataset into a new coordinate system whose axes (the principal components) maximize variance. Properties: they capture variance in decreasing order, are mutually orthogonal, are linear combinations of the original features, and enable dimensionality reduction.
Define overfitting in the context of neural networks. What are its symptoms, and why is it undesirable?
Occurs when the model fits the training data too closely, including its noise. Symptoms: very low training error alongside high validation/test error. It is undesirable because the model fails on unseen data, which is the whole point of learning.
How do outliers affect the predictions made by KNN? What strategies can be employed to handle outliers in the dataset?
Outliers distort distances. Handle using robust metrics (e.g., median-based) or remove outliers during preprocessing.
What are overfitting and underfitting in the context of machine learning?
Overfitting: Too complex; memorizes noise. Underfitting: Too simple; misses patterns.
Provide an example each of overfitting and underfitting in real-world machine learning applications.
Overfitting: a stock-price model that fits historical prices almost perfectly but fails to generalize, because price movements involve unpredictable external factors (news events, global economic changes) that were not in the training data. Underfitting: a disease-diagnosis model too simple to capture the patterns essential for accurate diagnosis, missing critical cases (false negatives) and potentially harming patients.
Given a dataset with missing values, noisy features, and a large number of correlated variables, explain how PCA can be used to preprocess the data for machine learning.
Impute missing values and standardize the features first; PCA then combines the many correlated variables into a smaller set of uncorrelated components, reducing redundancy, and dropping the low-variance components discards much of the noise.
What is Principal Component Analysis (PCA), and why is it used in machine learning?
PCA reduces dataset dimensionality while retaining maximum variance by transforming features into uncorrelated components.
Compare PCA with feature selection techniques. What are the advantages and disadvantages of using PCA over traditional feature selection?
PCA transforms features, while feature selection chooses the most relevant ones. PCA loses interpretability.
What are parametric and non-parametric methods in statistical analysis? Highlight the key difference between their assumptions.
Parametric: Assumes a fixed distribution (e.g., normal). Non-Parametric: No distribution assumptions; adapts to data structure.
What are the main assumptions underlying parametric methods? How do these assumptions differ from those of non-parametric methods?
Parametric: Assumes specific distributions and relationships. Non-Parametric: Minimal assumptions.
Compare the advantages and disadvantages of parametric and non-parametric methods. When would you prefer one over the other?
Parametric: Faster, interpretable, but less flexible. Non-Parametric: Flexible, robust, but computationally intensive.
Provide two examples each of parametric and non-parametric methods. Explain how their assumptions influence their application.
Parametric: Linear Regression, Logistic Regression. Non-Parametric: KNN, Decision Trees.
Outline the steps of forward propagation in a neural network. Use an example to demonstrate how inputs are transformed through layers.
Each layer computes a weighted sum of its inputs plus a bias, applies an activation function, and passes the result to the next layer; the final layer produces the output.
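A sketch of one forward pass in NumPy; the weights, biases, and the ReLU/sigmoid choices are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up weights for a tiny network: 2 inputs -> 3 hidden units -> 1 output
W1 = np.array([[0.2, -0.4], [0.7, 0.1], [-0.5, 0.3]])  # shape (3, 2)
b1 = np.array([0.1, 0.0, -0.2])
W2 = np.array([[0.6, -0.3, 0.8]])                       # shape (1, 3)
b2 = np.array([0.05])

x = np.array([1.0, 2.0])                 # input vector
h = np.maximum(0, W1 @ x + b1)           # hidden layer: weighted sum, bias, ReLU
y_hat = sigmoid(W2 @ h + b2)             # output layer: sigmoid for a probability
print(y_hat)
```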
Explain the difference between pre-pruning and post-pruning in decision tree algorithms.
Pre-Pruning: Halts tree growth early based on criteria. Post-Pruning: Removes branches after the tree is fully grown.
Define precision and recall. Why are these metrics important in binary classification tasks?
Precision = TP / (TP + FP): the proportion of predicted positives that are actually positive. Recall = TP / (TP + FN): the proportion of actual positives that are correctly predicted. They are important because accuracy alone hides performance on imbalanced datasets, and together they expose the false-positive/false-negative trade-off.
How does KNN perform regression tasks? Explain the role of averaging in this context and provide an example.
Predicts the average of k neighbors' values for a regression task. Example: predicting house prices.
Describe a scenario where boosting would be preferred over bagging. Include examples of algorithms like AdaBoost or Gradient Boosting and their advantages in that context.
Boosting is preferred when bias must be reduced or classes are imbalanced, e.g., fraud detection: AdaBoost reweights misclassified (often rare, fraudulent) cases so later learners focus on them, and Gradient Boosting fits each new model to the remaining errors, typically yielding higher accuracy than bagging in such settings.
Compare and contrast decision trees with Random Forests.
Random forests use multiple decision trees to improve stability and accuracy through bagging.
Explain the dropout technique in neural networks. How does it improve generalization and prevent overfitting?
Randomly disables neurons during training to prevent over-reliance on specific pathways.
Explain how PCA can be used for visualizing high-dimensional datasets in two or three dimensions.
Reduces dimensions to 2D/3D for easier visualization of patterns.
What is the purpose of bias in a neural network? How does it enhance the learning process?
Bias shifts the activation function's threshold, letting a neuron activate even when all inputs are zero. This extra degree of freedom lets the network fit functions that do not pass through the origin, enhancing learning.
How does the choice of the parameter K affect the performance of a KNN model? What are the risks of choosing a value of K that is too small or too large?
Small k: Sensitive to noise. Large k: Over-smooths decision boundaries.
Describe the process and purpose of k-fold cross-validation in model evaluation.
Splits data into k subsets, training on k−1 and testing on 1. Used to evaluate performance.
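A sketch with scikit-learn (the iris dataset and 5-neighbor KNN are arbitrary choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold CV: train on 4 folds, test on the held-out fold, rotate 5 times
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # average estimate of generalization performance
```

The same procedure, looped over candidate values, is how cross-validation tunes hyperparameters such as K in KNN.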
You are given a dataset with 50 features. Describe how you would use PCA to reduce the number of features while retaining 90%.
Standardize features, handle missing values, compute the covariance matrix, perform eigenvalue decomposition, calculate the explained variance, select the number of principal components needed to retain at least 90% of the variance, transform the dataset, and validate the results.
Outline the steps involved in performing PCA on a dataset.
Standardize the data. Compute the covariance matrix. Calculate its eigenvalues and eigenvectors. Sort components by eigenvalue and select the top ones. Project the data onto the selected components.
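A NumPy sketch of these steps on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # toy data: 100 samples, 5 features

# 1. Standardize the data
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
# 2. Compute the covariance matrix
cov = np.cov(Xs, rowvar=False)
# 3. Eigendecompose (eigh, since covariance matrices are symmetric)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]        # sort by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 4. Project onto the top-2 principal components
X_reduced = Xs @ eigvecs[:, :2]

print(eigvals / eigvals.sum())           # explained variance ratio per component
print(X_reduced.shape)                   # (100, 2)
```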
Explain the steps involved in building a decision tree.
Steps include: selecting the best attribute (using metrics like entropy/information gain), splitting the dataset, and recursively repeating until all nodes are pure or stopping criteria are met.
Explain the concept of early stopping and how it helps prevent overfitting.
Monitor validation error during training and stop when it starts increasing even though training error keeps falling; this halts the model before it begins fitting noise.
How does increasing the complexity of a model affect its likelihood of overfitting?
The model learns not only the underlying patterns in the data but also the noise and irrelevant details.
Why are activation functions important in neural networks? Compare ReLU and sigmoid functions, including their advantages and limitations.
They introduce non-linearity into the model, enabling it to learn complex patterns and relationships in the data; without them a deep network collapses to a linear model. ReLU: computationally efficient and avoids vanishing gradients for positive inputs, but units can "die" and output zero permanently. Sigmoid: smooth and outputs in (0, 1), useful as a probability, but it saturates and suffers from vanishing gradients.
What is a dendrogram, and how is it used in hierarchical clustering? How can you determine the number of clusters from a dendrogram?
A tree diagram recording the order and distance at which clusters merge. Cutting the dendrogram horizontally at a chosen height defines the clustering; the number of branches the cut crosses is the number of clusters.
What is clustering, and how is it used in unsupervised learning? Provide examples of its applications.
Unsupervised learning technique to group data points based on similarity. Ex: retail company wants to segment its customers into distinct groups based on their shopping behavior to design targeted marketing campaigns.
How can cross-validation be used to determine the optimal value of K in KNN? Explain with an example.
Use k-fold cross-validation to test different k values and select the one minimizing validation error.
What methods are used to evaluate the performance of a decision tree?
Use metrics like accuracy, precision, recall, F1 score, or confusion matrix.
What strategies might be used to handle evaluation in imbalanced datasets?
Use metrics robust to imbalance (precision, recall, F1, precision-recall curves), stratified splits, resampling (oversampling the minority class or undersampling the majority), or class-weighted losses.
What strategies can be employed to address underfitting in a machine learning model?
Use more features, reduce regularization, or increase model capacity.
Discuss the techniques used to prevent overfitting in decision trees.
Use pruning, limit tree depth, or ensure sufficient data at nodes.
List and describe three methods that can be used to prevent or reduce overfitting in machine learning models.
Regularization: penalizes large weights to limit model complexity. Cross-validation: evaluates on held-out folds so overfitting is detected and model choices are tuned against validation performance. Pruning (for decision trees): removes branches that fit noise rather than signal.
How does the Elbow Method help determine the optimal number of clusters in K-Means clustering? Explain with a diagram.
Plot the clustering cost (within-cluster sum of squares, or inertia) against k. The cost falls sharply at first, then flattens; the "elbow" where the improvement slows marks a good choice of k.
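A sketch of the computation behind the plot, using scikit-learn's KMeans on three synthetic blobs (in practice you would plot inertia versus k and read off the elbow):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs, so the elbow should appear at k = 3
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (0, 5, 10)])

for k in range(1, 7):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))  # cost drops sharply until k = 3, then flattens
```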
What is bootstrap sampling in bagging? Describe the steps involved in training a bagging model and how it reduces variance.
Bootstrap sampling draws N points with replacement from an N-point training set, producing datasets that overlap but differ. Steps: draw B bootstrap samples, train one model on each, and aggregate their predictions by voting or averaging. Because each model sees a different sample, their errors are partly independent, and averaging them reduces variance.
Explain the concepts of variance and covariance. How are they used in the PCA algorithm?
Variance measures the spread of a single feature; covariance measures how two features vary together. PCA builds the covariance matrix of the (standardized) data and eigendecomposes it: the eigenvectors give the principal components, and the eigenvalues give the variance each component explains.