BSTT 426 Midterm

You trained a binary classifier model which gives very high accuracy on the training data, but much lower accuracy on validation data. Which of the following may be true? 1. The training was not well regularized. 2. This is an instance of overfitting. 3. This is an instance of underfitting. 4. The training and testing examples are sampled from different distributions.

1, 2, and 4

The most popularly used dimensionality reduction algorithm is Principal Component Analysis (PCA). Which of the following is/are true about PCA? 1. PCA is an unsupervised method 2. It searches for the directions that data have the largest variance 3. Maximum number of principal components <= number of features 4. All principal components are orthogonal to each other

All of the above: statements 1, 2, 3, and 4 are all true.

Which of the following cross-validation methods may not be suitable for very large datasets with hundreds of thousands of samples? 1. k-fold cross-validation 2. Leave-one-out cross-validation 3. Holdout method

2

Suppose your model is overfitting. Which of the following is NOT a valid way to try and reduce the overfitting? 1. Increase the amount of training data. 2. Improve the optimization algorithm being used for error minimization. 3. Decrease the model complexity 4. Reduce the noise in the training data.

2

What happens when you get features in lower dimensions using PCA? 1. The features will still have interpretability 2. The features will lose interpretability 3. The features must carry all information present in data 4. The features may not carry all information present in data

2

Write a command to compare A & B where A is "SunSet " and B is "sunset".

A = "SunSet " B = "sunset" if A.strip().lower() == B.lower(): print("They are the same.") else: print("They are different.")

Eigenvector:

A non-zero vector that remains parallel after the application of a linear transformation. In PCA, eigenvectors of the covariance matrix define the directions of maximum variance (principal components).
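
To make this concrete, here is a minimal NumPy sketch (the matrix A is an arbitrary illustrative example, not from any dataset) showing that applying a matrix to one of its eigenvectors only rescales it:

import numpy as np

# An arbitrary symmetric (covariance-like) 2x2 matrix for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)

v = eigenvectors[:, 0]              # first eigenvector (a column of the result)
lam = eigenvalues[0]                # its associated eigenvalue
print(np.allclose(A @ v, lam * v))  # True: A v = lambda v, so v stays parallel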

Scree Plot:

A plot used in PCA which displays the eigenvalues associated with each principal component in descending order. It helps to determine the number of principal components to retain by visualizing the "elbow point" beyond which adding more components yields diminishing returns.
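
For illustration, a small scikit-learn/matplotlib sketch of a scree plot, assuming those libraries are available; the bundled iris data is just a convenient example:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                # example data with 4 features
pca = PCA().fit(X)                  # keep all principal components

# Plot the eigenvalues (explained variance) in descending order.
plt.plot(range(1, len(pca.explained_variance_) + 1),
         pca.explained_variance_, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Eigenvalue (explained variance)')
plt.title('Scree plot')
plt.show()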

Variance:

A statistical measure that describes the spread of numbers in a dataset. It quantifies the degree to which each number in the dataset deviates from the mean and, therefore, from every other number in the set.

Principal Component Analysis (PCA)

A statistical procedure that uses orthogonal transformation to convert correlated variables into a set of uncorrelated variables called principal components, often used for dimensionality reduction and data visualization.

Which of the following techniques would perform better for reducing dimensions of a data set? A. Removing columns which have too many missing values B. Removing columns which have high variance in data C. Removing columns with dissimilar data trends D. None of these

A.

Which of the following can act as possible termination conditions in K-Means? 1. A fixed number of iterations is reached. 2. Assignment of observations to clusters does not change between successive iterations (except in cases with a bad local minimum). 3. Centroids do not change between successive iterations. 4. RSS falls below a threshold.

All of the above.

PCA works better if there is: 1. A linear structure in the data 2. Data lying on a curved surface rather than a flat surface 3. Variables scaled in the same unit. A. 1 and 2 B. 2 and 3 C. 1 and 3 D. 1, 2 and 3

C

Centroid Usefulness

Centroids serve as a representative point for summarizing and understanding the cluster. In algorithms like K-Means, centroids are crucial for defining clusters and assigning data points to these clusters based on distance metrics.

What is the minimum number of variables/features required to perform clustering?

Clustering can technically be performed with just one variable (univariate data), but the analysis becomes more insightful and meaningful with at least two variables (bivariate data).

Outliers:

Data points that significantly differ from most others in the dataset. Outliers can skew statistical measures and can impact the results of data analysis, including PCA, by affecting the calculated principal components.

Silhouette Score Definition

Definition: The silhouette score is a measure of the quality of clustering, ranging from -1 to 1. A high silhouette score indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If a(i) is the average distance from the ith data point to the other data points in the same cluster, and b(i) is the smallest average distance from the ith data point to the data points in a different cluster, the silhouette score S(i) is given as: S(i) = (b(i) - a(i)) / max{a(i), b(i)}
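
A minimal sketch of the score in practice, assuming scikit-learn is available; the two synthetic blobs are purely illustrative:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs.
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # close to 1 for well-separated clusters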

PCA can be used for feature selection.

False, with a nuance. PCA is generally used for feature transformation or dimensionality reduction, not feature selection, per se.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is sensitive to the order of data points in the input dataset.

False. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is not sensitive to the order of data points in the input dataset. The algorithm processes data points in the order they appear in the dataset, but the final clustering result should not depend on the order of the data points (the minor exception is border points that are density-reachable from more than one cluster, which can be assigned to either depending on processing order).

Hierarchical clustering always produces the same number of clusters regardless of the linkage method used.

False. Hierarchical clustering does not produce a fixed number of clusters. The number of clusters that result from hierarchical clustering is determined by the level at which the dendrogram (a tree diagram representing the hierarchical relationships between clusters) is cut. Moreover, the linkage method used in hierarchical clustering can significantly impact the shape of the dendrogram and therefore the resulting clusters when it is cut at a particular level.

Hierarchical clustering always produces a flat structure with a fixed number of clusters.

False. Hierarchical clustering does not produce a flat structure with a fixed number of clusters. Instead, it creates a hierarchy or binary tree of clusters known as a dendrogram, which represents the nested grouping of patterns and similarity levels at each merging (or splitting) of clusters. Hierarchical clustering does not require specifying the number of clusters beforehand and does not constrain the data to a fixed number of clusters unless a specific level of the dendrogram is chosen to cut and form flat clusters.

When performing a K-Means cluster analysis, the algorithm will automatically choose the optimal number of clusters for you.

False. In the K-Means clustering algorithm, the number of clusters (k) is a required input parameter and is not determined automatically by the algorithm. The user must specify the number of clusters they wish to use before running the algorithm.

K-Means clustering guarantees convergence to the global optimum solution.

False. K-Means clustering does not guarantee convergence to the global optimum solution. The algorithm aims to minimize the within-cluster sum of squares (the sum of squared distances from each point to its assigned cluster centroid). However, K-Means is sensitive to the initial placement of the cluster centroids and can converge to a local minimum of the objective function rather than the global minimum.

K-means clustering can handle datasets with a high degree of noise and outliers effectively.

False. K-Means clustering is sensitive to noise and outliers in the dataset. The objective of K-Means is to minimize the within-cluster sum of squares, and thus it is largely influenced by extreme values or outliers because squaring the distances amplifies the impact of large values.

PCA always improves the performance of a machine learning model.

False. PCA does not always improve the performance of a machine learning model. While it can be a useful technique for dimensionality reduction and mitigating the curse of dimensionality, there are situations and models where applying PCA might not enhance and could even degrade model performance.

PCA reduces the dimensionality of the data by removing features.

False. PCA reduces dimensionality not by removing original features, but by transforming them into a new set of uncorrelated variables known as principal components. These principal components are linear combinations of the original features and are ordered by the amount of variance they explain from the data.

PCA is a supervised machine learning technique.

False. Principal Component Analysis (PCA) is an unsupervised machine learning technique.

Clustering algorithms always produce the same results, given the same data and parameters.

False. The consistency of the results produced by clustering algorithms, given the same data and parameters, can depend significantly on the algorithm being used.

The first principal component captures the least amount of variance in the data.

False. The first principal component of PCA captures the most variance in the data. The principal components are ordered by the amount of variance they explain from the data.

Difference between feature reduction and feature selection:

Feature Reduction: It involves transforming the original high-dimensional data into a lower-dimensional form, preserving as much information as possible (e.g., PCA).
Feature Selection: It involves selecting a subset of the most important features that contribute to the predictive power of the model, discarding the redundant or less useful ones.

Feature scaling is an important step before applying the K-Means algorithm. What is the reason behind this?

Feature scaling is crucial before applying the K-Means algorithm, mainly because the algorithm depends on distance measurements to assign data points to clusters and compute centroids (see the sketch below):
Equal importance: Without scaling, variables with larger values will unduly influence the clustering, potentially overshadowing important patterns in variables with smaller scales.
Distance accuracy: K-Means uses distances between points to form clusters. Scaling ensures each variable contributes equally to the distance computation, preventing bias in cluster assignments.
Efficiency: Properly scaled features can help the algorithm converge more quickly and provide more stable results.
Interpretability: Scaled features help in forming clusters that reflect actual data patterns, making them more interpretable and meaningful.
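
A minimal sketch of the first two points, assuming scikit-learn; the age/income features are made up for illustration:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on very different scales: age (years) and income (dollars).
X = np.column_stack([rng.normal(40, 10, 200),
                     rng.normal(50_000, 15_000, 200)])

# Without scaling, income would dominate every Euclidean distance.
X_scaled = StandardScaler().fit_transform(X)   # mean 0, variance 1 per feature
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)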

Imagine you have 1000 input features and 1 target feature in a machine learning problem. You have to select the 100 most important features based on the relationship between the input features and the target feature. Which method would you select: dimension reduction or feature selection?

Feature selection

For two runs of K-Means clustering, is it expected to get the same clustering results? Why?

For two runs of K-Means clustering, it is not necessarily expected to get the same clustering results, especially when different initial centroid placements are used or the data has different possible stable cluster assignments.

Handling missing values & default missing value code in Python:

Handling: Common strategies include imputation (replacing missing values with statistical measures like the mean or median), deleting rows, or using algorithms that can handle missing values.
Default missing value code in Python (pandas): NaN (Not a Number)
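
A small pandas sketch of these strategies on a made-up toy DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 40],
                   'income': [50_000, 60_000, np.nan]})

print(df.isna().sum())             # count missing (NaN) values per column
df_dropped = df.dropna()           # strategy 1: delete rows with missing values
df_imputed = df.fillna(df.mean())  # strategy 2: impute with the column mean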

Eigenvalue:

In the context of PCA, it represents the magnitude of variance along a principal component (eigenvector). Larger eigenvalues indicate that more variance is captured along that component.

Centroid Definition

In the context of clustering, a centroid is the center of a cluster. In K-Means clustering, it is the mean position of all the points in a particular cluster, and in K-Medoids, it is the most centrally located data point in a cluster.

Rotation:

In the context of factor analysis and PCA, rotation is the transformation of factor or component loadings to make them easier to interpret. Various rotation methods (like Varimax) aim to maximize the loading of each variable on one of the extracted components while minimizing their loading on all other components.

Silhouette Score Usefulness

It is useful for determining how appropriately the data have been clustered and can be used to select the most appropriate number of clusters by maximizing the silhouette score.

You are given a cancer detection data set. Suppose that when you build a classification model you achieve an accuracy of 96%. Why shouldn't you be happy with your model performance? What can you do about it? (3 sentences)

It seems that 96% is very good accuracy, but in a dataset like cancer detection, where the number of malignant patients is much smaller than the number of benign patients, this is an imbalanced data problem. In this case, total accuracy is not an appropriate measure. Instead, we should look at sensitivity and specificity, which measure how correctly the positive and negative classes are identified separately.

'People who bought this also bought...' recommendations seen on Amazon are based on which algorithm?

K-NN Classification

Principal Components:

Linear combinations of the original variables formed during PCA. These components are uncorrelated and are arranged in descending order of the variance they explain from the original data.

Within Cluster Sum of Squared Errors (WCSS) Usefulness

Lower WCSS values indicate that the data points are closer to the centroids of their respective clusters, which is generally desirable. However, it tends to always decrease with increasing number of clusters, so it is typically used in conjunction with other metrics or methods (such as the elbow method) to determine the optimal number of clusters.

Three stages to build the hypotheses or model in machine learning:

Model Building: Choose a model and specify the algorithm. Model Training: Train the model using a labeled dataset. Model Testing: Test the model's predictions against a new dataset to evaluate its performance.

Scikit-learn packages for PCA, clustering, and classification:

PCA: from sklearn.decomposition import PCA
Clustering (e.g., K-Means): from sklearn.cluster import KMeans
Classification (e.g., Logistic Regression): from sklearn.linear_model import LogisticRegression
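
A hedged end-to-end sketch tying the three imports together; the iris data and the parameter choices are illustrative:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

X_2d = PCA(n_components=2).fit_transform(X)    # unsupervised: reduce to 2 dimensions
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
clf = LogisticRegression(max_iter=1000).fit(X, y)  # supervised: classification
print(clf.score(X, y))                         # accuracy on the training data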

Pros and Cons of PCA

Pros: Reduces dimensionality, can help mitigate the curse of dimensionality, helps in visualization. Cons: Assumes linear correlations between variables, loss of interpretability.

Basic principle of PCA & when it's useful:

Principle: PCA (Principal Component Analysis) reduces dimensionality by projecting data onto orthogonal vectors (principal components) that capture the most variance. Useful: When dealing with high-dimensional data, reducing noise, improving algorithm performance, or visualizing data.

Pros and Cons Hierarchical Clustering

Pros: Doesn't require specifying the number of clusters upfront, produces a dendrogram which can be useful for understanding hierarchical structure. Cons: More computationally expensive, doesn't work well for large datasets.

Pros and Cons K Means

Pros: Simple to understand, fast for large datasets. Cons: Assumes clusters to be spherical, sensitive to initial centroid placement, might converge to a local minimum.

What are Python dictionaries? Give an example.

Python dictionaries are collections of data in key:value pair form (unordered by guarantee before Python 3.7; insertion-ordered since).
dictionary = {'name': 'John', 'age': 25, 'occupation': 'Engineer'}

Dimensionality:

Refers to the number of variables or features in a dataset. "High dimensionality" refers to datasets with a large number of variables, which can be challenging to analyze and visualize.

Difference between supervised and unsupervised machine learning:

Supervised Learning: The algorithm is trained on a labeled dataset, meaning the algorithm is provided with input-output pairs. The goal is to learn a mapping from inputs to outputs. Unsupervised Learning: The algorithm is trained on an unlabeled dataset, trying to learn the underlying structure of the data, like clustering or dimensionality reduction.

Elbow Curve Usefulness

The "elbow" of the curve represents an inflection point where increasing the number of clusters leads to diminishing returns in reducing WCSS or explaining additional variance. Thus, it helps in determining a balance between minimizing WCSS and not overly segmenting the data into too many clusters.

Elbow Curve Definition

The elbow curve is a graphical tool used in determining the most appropriate number of clusters for a dataset in K-Means clustering. It involves plotting the explained variation (often using WCSS) as a function of the number of clusters, and picking the "elbow" of the curve as the number of clusters to use.
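
A minimal sketch, assuming scikit-learn and matplotlib; the data is three synthetic blobs, so the elbow should appear near k = 3:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 4, 8)])  # 3 blobs

ks = range(1, 9)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in ks]                        # inertia_ is sklearn's WCSS

plt.plot(list(ks), wcss, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('WCSS')
plt.show()                                  # pick k at the elbow of the curve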

What is this type of function called and how is it useful?
def func(x):
    if x == 0:
        return 1
    else:
        return x + func(x - 1)

The function provided is a recursive function. It is called recursive because it calls itself within its definition. Recursive functions are useful for breaking down complex problems into simpler sub-problems that can be solved with repeated application of the same process. In this specific function, it computes 1 plus the sum of the integers from 1 to x, since the base case returns 1 (for example, func(3) = 3 + 2 + 1 + 1 = 7).

Data Compression:

The reduction of the size or quantity of data to save space or transmission time. In the context of PCA, it refers to reducing the number of dimensions (variables) while retaining as much variance or information as possible.

Cumulative Variance:

The total variance explained by a specific number of principal components. Often visualized or tabulated to assist in determining how many components to retain in order to explain a desired amount of total variance.

Loadings:

The weights or coefficients of the original variables used to form principal components. Loadings indicate the contribution of each original variable to each principal component.

How do you ensure you're not overfitting with a model?

Three main methods to avoid overfitting: 1. Keep the model simpler: reduce noise by taking into account fewer variables and parameters. 2. Use cross-validation techniques such as k-fold cross-validation. 3. Use regularization techniques such as LASSO that penalize certain model parameters if they're likely to cause overfitting.

Accuracy Formula

Accuracy = (TP + TN) / N, where N = TP + TN + FP + FN is the total number of predictions.

Sensitivity Formula

Sensitivity = TP / (TP + FN), i.e., true positives divided by all actual positives.

Projection:

Transforming data from its original space to a new space. In PCA, data points are projected onto the principal components for dimensionality reduction, visualization, or to derive new features.

[ True or False ] Dimensionality reduction algorithms are one of the possible ways to reduce the computation time required to build a model.

True

[ True or False ] It is not necessary to have a target variable for applying dimensionality reduction algorithms.

True

When performing cluster analysis, you should always standardize the variables.

True, generally - but with some qualifications. Standardizing variables is often crucial in cluster analysis, especially when the variables are measured on different scales or have different units of measurement, as it prevents variables with larger scales from dominating the distance calculations. This is particularly important for clustering algorithms that utilize distance metrics for forming clusters, such as K-Means.

In hierarchical clustering, the dendrogram can be used to determine the optimal number of clusters.

True, with qualifications. A dendrogram, which is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering, can indeed be utilized to help determine the "optimal" number of clusters by visualizing the structure and helping to inform where to cut the tree to define the clusters. However, the determination of the optimal number of clusters, and thus where to cut the dendrogram, often involves subjective judgement.

Cluster analysis can be performed using nominal categorical variables.

True, with some qualifications. Cluster analysis can be performed using nominal categorical variables, but not all clustering algorithms are well-suited for handling categorical data. Traditional clustering algorithms like K-Means are designed for continuous variables and use Euclidean distance as a measure of similarity, which is not meaningful when applied to categorical data.

[ True or False ] PCA can be used for projecting and visualizing data in lower dimensions.

True.

Clustering is a type of unsupervised learning.

True. Clustering is indeed a type of unsupervised learning. In machine learning, unsupervised learning refers to the type of problems where the algorithm is given input data without explicit instructions on what to do with it. The system tries to learn the patterns and the structure from the data without any labeled responses.

Eigenvalues and eigenvectors are used in PCA to compute the principal components.

True. Eigenvalues and eigenvectors are fundamental in PCA (Principal Component Analysis) for computing the principal components: the eigenvectors of the data's covariance matrix define the directions of the principal components, and the corresponding eigenvalues give the amount of variance captured along each of those directions.

K-Means clustering seeks to minimize the distance from each point to the center of a fixed number of clusters.

True. K-Means clustering aims to partition a dataset into k clusters, where k is a predefined number of clusters specified by the user. The algorithm seeks to minimize the within-cluster sum of squares, which essentially means minimizing the squared Euclidean distance from each data point to the centroid of its assigned cluster.

PCA can be used for data compression while preserving most of the original information.

True. PCA (Principal Component Analysis) can indeed be used for data compression while preserving most of the original information.

PCA is useful for visualizing high-dimensional data.

True. PCA is indeed useful for visualizing high-dimensional data by reducing its dimensionality in a way that preserves as much of the data's variance as possible: projecting the data onto the first two or three principal components makes it possible to plot and visually inspect it.

PCA is sensitive to outliers in the data.

True. PCA is sensitive to outliers in the data.

PCA requires the data to be normalized or standardized before applying it.

True. PCA is sensitive to the variances of the original features because it aims to identify the principal components that maximize variance. If one feature has a very large variance while others do not, PCA might determine that the direction of maximal variance more closely aligns with the feature having larger variance, which might not be informative regarding the principal axes of the actual data structure.

Silhouette analysis is a method for evaluating the quality of clustering results, where higher silhouette scores indicate better-defined clusters.

True. Silhouette analysis is indeed a technique used to calculate the goodness of a clustering algorithm. It measures how close each point in one cluster is to the points in the neighboring clusters. Its values range from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.

Within Cluster Sum of Squared Errors (WCSS) Definition

WCSS is a metric used to evaluate the performance of a clustering algorithm. It is the sum of the squared distances between each point in a cluster and the centroid of that cluster. Mathematically, if c_k is the centroid of cluster k and C_k is the set of data points assigned to that cluster, WCSS = Σ_k Σ_{x ∈ C_k} (x − c_k)².
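
As a sanity check on the formula, a small sketch computing WCSS by hand and comparing it with scikit-learn's inertia_ attribute; the random data is illustrative:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Sum of squared distances from each point to its own cluster centroid.
wcss = sum(np.sum((X[km.labels_ == k] - c) ** 2)
           for k, c in enumerate(km.cluster_centers_))
print(np.isclose(wcss, km.inertia_))        # True: matches sklearn's WCSS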

Two cases when data needs to be standardized:

When features are on different scales and this scale difference can impact the performance of the machine learning algorithm (e.g., in distance-based algorithms like K-means clustering or k-NN). Before applying techniques that assume data has a Gaussian distribution, like PCA.

Is it possible that the assignment of observations to clusters does not change between successive iterations in K-Means?

Yes, it is possible and, in fact, is a key point during the execution of the K-Means clustering algorithm where the assignment of observations to clusters does not change between successive iterations. This situation is typically considered as reaching convergence and serves as an indication that the algorithm can be terminated since further iterations would not lead to any changes in the clustering.

Is the K-Means algorithm sensitive to outliers?

Yes, the K-Means clustering algorithm is sensitive to outliers. This sensitivity is due to the algorithm minimizing the sum of squared deviations (variances) within each cluster, which inherently can be highly influenced by extreme values (outliers).

What is the output of print(list[2:4]) if list = [42, 52, 11, 55, 24, 12]?

[11, 55], because indexing starts at 0: the slice [2:4] takes the elements at indices 2 and 3 and excludes index 4.

Which of the following are true? a) Clustering analysis is negatively affected by multicollinearity of features b) Clustering analysis is negatively affected by heteroscedasticity

a

Feature scaling is an important step before applying the K-Means algorithm. What is the reason behind this? a) In distance calculation it will give the same weight to all features b) You always get the same clusters whether or not you use feature scaling c) In Manhattan distance it is an important step but in Euclidean it is not d) None of these

a

Assuming the distance function is given, what methods can be used to get a better K-Means clustering on a given dataset? a) K-means++ initialization b) Random start c) Hierarchical Clustering d) None

a and b

Which of the following metrics do we have for finding dissimilarity between two clusters in hierarchical clustering? a) Single-link b) Complete-link c) Average-link

a, b, and c

What is true about K-Means clustering? a) K-means is extremely sensitive to cluster center initializations b) Bad initialization can lead to poor convergence speed c) Bad initialization can lead to bad overall clustering

a, b, and c

In which of the following cases will K-Means clustering fail to give good results? a) Data points with outliers b) Data points with different densities c) Data points with round shapes d) Data points with non-convex shapes

a, b, and d

Increasing the k value in k-nearest neighbors will _____ the bias and ______ the variance. a. Decrease, Decrease b. Increase, Decrease c. Decrease, Increase d. Increase, Increase

b

A peak in the Silhouette Coefficient and the elbow point show

the best choice for the number of clusters

The ______ statement terminates the loop and transfers execution to the statement following the loop.

break

Python has many pre-defined functions known as __________.

built-in functions

The process of identifying and removing logical errors and runtime errors is called ________.

debugging

Write a function to check whether a number minus 1 is a prime number.

def is_prime(num):
    if num <= 1:
        return False
    for i in range(2, int(num ** 0.5) + 1):
        if num % i == 0:
            return False
    return True

def check_prime_minus_one(n):
    return is_prime(n - 1)

# Example usage:
num = 4
if check_prime_minus_one(num):
    print(f"{num - 1} is prime.")
else:
    print(f"{num - 1} is not prime.")

What should be the selected number of clusters based on a dendrogram?

Determined by looking at the largest vertical distance that doesn't intersect any of the clusters' horizontal lines (or by cutting the dendrogram at a desired dissimilarity height).
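
A minimal SciPy sketch that draws a dendrogram to inspect; the two synthetic blobs and the Ward linkage are illustrative choices:

import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)),
               rng.normal(5, 0.5, (10, 2))])

Z = linkage(X, method='ward')   # agglomerative clustering merge history
dendrogram(Z)
plt.ylabel('Dissimilarity (merge height)')
plt.show()                      # cut across the largest vertical gap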

________ data structure is most suited to maintaining a book catalog.

dictionary

For a large k value the k-nearest neighbor model becomes _____ and ______ . i. Complex model, Overfit ii. Complex model, Underfit iii. Simple model, Underfit iv. Simple model, Overfit

iii

Select the correct option so that the output is 0 1 2 3 4:
for i in range(5):
    *missing code*
i. print(i) ii. print(i, sep='\t') iii. print(i, end=' ') iv. print(i + " ")

iii. print(i, end=' ')

What is the difference between the read and the readlines command?

read: Reads the entire file contents and returns it as a single string. readlines: Reads the entire file and returns it as a list of strings, where each string is a line from the file.

The function to convert an object to a string in Python is ________.

str()

Specificity Formula

Specificity = TN / (TN + FP), i.e., true negatives divided by all actual negatives.
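
A tiny sketch computing all three formulas from made-up confusion-matrix counts:

# Illustrative counts: true/false positives and negatives (not real data).
tp, tn, fp, fn = 40, 50, 5, 5

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct predictions / all predictions
sensitivity = tp / (tp + fn)                # true-positive rate (recall)
specificity = tn / (tn + fp)                # true-negative rate
print(accuracy, sensitivity, specificity)   # 0.9, ~0.889, ~0.909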

