No Code AI - Data Exploration - Structured Data


If we obtain 4 Principal Components from PCA (from a 4-dimensional original dataset) and the percentage of variance explained by each of them is 55%, 20%, 15% and 10% respectively, how many Principal Components would we really require to explain 75% of the variance in our dataset? 1 2 3 4

2. To explain 75% of the variance in our dataset, we only need the first Principal Component (PC1 explains 55%) and the second Principal Component (PC2 explains 20%), since 55% + 20% = 75%. For this example, we could therefore get by with just 2 of the 4 dimensions, achieving dimensionality reduction that makes the subsequent steps of the data analysis process significantly easier.
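The cumulative-variance reasoning above can be sketched in a few lines of Python (a minimal sketch assuming NumPy is available; the variance ratios are the ones from the question):

```python
import numpy as np

# Explained-variance ratios for PC1..PC4, taken from the question
explained = np.array([0.55, 0.20, 0.15, 0.10])

# Cumulative variance explained as components are added one by one
cumulative = np.cumsum(explained)  # 0.55, 0.75, 0.90, 1.00

# Smallest number of components reaching the 75% threshold
# (small tolerance guards against floating-point rounding)
n_components = int(np.argmax(cumulative >= 0.75 - 1e-9)) + 1
print(n_components)
```

The same pattern works on the `explained_variance_ratio_` attribute that PCA implementations typically expose.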

While performing K-means Clustering, what is the ideal value of K to choose based on the above plot? 1 2 3 4

3

In which of the following scenarios would t-SNE be relatively better to use than PCA for dimensionality reduction, when working on a local machine with limited computational power? A. A dataset with 1 million entries and 300 features B. A dataset with 100,000 entries and 300 features C. A dataset with 1,000 entries and 8 features D. A dataset with 10,000 entries and 200 features

A dataset with 1,000 entries and 8 features

Which of the following are important limitations of K-means Clustering? A. The user has to specify K (the number of clusters) to the algorithm B. K-means Clustering is only good at detecting roughly spherical-shaped clusters and cannot detect other patterns like elongated clusters C. K-means Clustering is quite sensitive to any outliers present in the data D. All of the above

All of the above

Which of the following is a potential benefit of using clustering in marketing? Identifying customer segments with similar needs and preferences Identifying the most effective advertising channels Identifying the most profitable pricing strategy All of the above

All of the above

Which of the following represents the correct interpretation of what Principal Component Analysis (PCA) is trying to achieve? A. It maximizes the projection variance by identifying the directions in which the data varies the most. B. It minimizes the projection residuals by removing the directions that carry the least signal. C. It achieves spectral decomposition by identifying the eigenvectors corresponding to the largest eigenvalues of the covariance matrix. D. All of the above

All of the above
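The three interpretations coincide, and the spectral-decomposition view can be checked numerically. Below is a sketch (assuming NumPy; the toy data is purely illustrative): the top eigenvector of the covariance matrix is the direction of maximum projection variance, and the variance of the data projected onto it equals the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 points whose variance is mostly along the x-axis
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])
Xc = X - X.mean(axis=0)                  # PCA works on centered data

# Spectral decomposition of the covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

pc1 = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue
# Variance of the projection onto PC1 equals that largest eigenvalue
proj_var = np.var(Xc @ pc1, ddof=1)
```

Projecting onto any other unit vector yields a smaller variance, which is the "maximize projection variance" interpretation of the same decomposition.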

Which of the following statements is true about t-SNE? A. It uses Kullback-Leibler divergence to measure "distance" between distributions and minimizes this objective function B. It is a non-deterministic algorithm that tries to preserve local structure in the dataset C. It extracts non-linear embeddings to project a high-dimensional dataset into lower-dimensional space D. All of the above

All of the above

How is the optimal number of clusters determined in K-means clustering? A. By visual inspection of the data B. By trial and error C. By using elbow method or silhouette score D. None of the above

By using elbow method or silhouette score
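Both techniques are easy to run side by side. Here is a minimal sketch, assuming scikit-learn is available; the blob dataset and parameter choices are illustrative, not from the source:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

inertias, silhouettes = [], {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)                  # WCSS, for the elbow method
    silhouettes[k] = silhouette_score(X, km.labels_)

# Elbow method: plot inertias vs. k and look for the "bend";
# silhouette score: simply pick the k with the highest score
best_k = max(silhouettes, key=silhouettes.get)
```

Inertia always decreases as K grows, which is why the elbow (the point of diminishing returns) matters rather than the raw minimum; the silhouette score, by contrast, peaks at a well-chosen K.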

How is the distance between a data point and a centroid typically calculated in the k-means clustering algorithm? A. Euclidean distance B. Manhattan distance C. Chebyshev distance D. None of the above

Euclidean distance

Which of the below statements is/are true about K-medoids Clustering (Partitioning Around Medoids)? 1. In this algorithm, the cluster center must be an actual observation from the data. 2. It is a distance-based algorithm like K-means Clustering. 3. It performs more poorly in the presence of outliers than K-means Clustering does. Only 1 and 2 Only 3 Only 2 Only 1 and 3

Only 1 and 2. K-medoids is a clustering algorithm resembling the K-means technique; it differs from K-means in how it selects cluster centers. K-means uses the arithmetic mean of a cluster's points as its center (which may or may not be an actual data point), while K-medoids always picks an actual data point from the cluster as its center. This makes K-medoids more robust to outliers than K-means, but at the expense of greater computational cost.
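The mean-versus-medoid difference can be shown in a few lines of NumPy (a sketch with hand-made illustrative points, including one outlier):

```python
import numpy as np

# A small cluster plus one outlier
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])

# K-means-style center: the arithmetic mean, pulled toward the outlier
# and not an actual observation
mean_center = pts.mean(axis=0)

# K-medoids-style center: the actual observation minimizing the total
# distance to all other points in the cluster
dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
medoid = pts[dists.sum(axis=1).argmin()]
```

The medoid stays inside the dense part of the cluster while the mean drifts toward the outlier, which is the robustness trade-off described above; computing the full pairwise distance matrix is also where the extra computational cost comes from.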

A level of significance of 5% would mean: There's a 5% chance we'll be wrong if we fail to reject the alternative hypothesis. There's a 5% chance we'll be wrong if we fail to reject the null hypothesis. There's a 5% chance we'll be wrong if we reject the null hypothesis. There's a 5% chance we'll be wrong if we reject the alternative hypothesis.

There's a 5% chance we'll be wrong if we reject the null hypothesis. The statistical terminology around Hypothesis Testing is important to understand and internalize. We never reject, fail to reject, or make any decision about the Alternative Hypothesis directly; any conclusion about the Alternative Hypothesis follows from our decision about the Null Hypothesis. At a significance level of 5%, there is a 5% likelihood that we would be wrong if we reject the Null Hypothesis.
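To make the mechanics concrete, here is a hand-rolled two-sided binomial test using only the standard library (the coin-flip numbers are illustrative, not from the source): the Null Hypothesis is rejected only when the p-value falls below the 5% significance level.

```python
from math import comb

# Is a coin fair if we observe 60 heads in 100 flips?
n, k, alpha = 100, 60, 0.05

# P(X >= 60) under H0 (fair coin, p = 0.5), doubled for a two-sided test
p_one_sided = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
p_value = min(1.0, 2 * p_one_sided)

# Reject H0 only if the p-value is below the significance level
reject_null = p_value < alpha
```

Here the p-value comes out just above 0.05, so we fail to reject the Null Hypothesis, even though the result would clear a looser 10% significance level; this is exactly the kind of borderline case where the choice of significance level matters.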

What is the main advantage of t-SNE over PCA for visualizing data? A. They allow clusters to be preserved and separated in the low dimensional representation B. They are computationally faster C. They can reduce the number of dimensions in a dataset more effectively D. They are suitable for one dimensional datasets

They allow clusters to be preserved and separated in the low-dimensional representation. Non-linear methods like t-SNE can more effectively preserve the structure and clusters present in a high-dimensional dataset in a low-dimensional representation, making it easier to visualize and analyze the data.
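A minimal sketch of using t-SNE for visualization, assuming scikit-learn is available (the synthetic dataset and parameters are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Small dataset: three well-separated clusters in 8 dimensions
X, y = make_blobs(n_samples=200, n_features=8, centers=3, random_state=0)

# t-SNE embeds the data into 2-D while trying to keep each point's
# neighbors close together, so the clusters remain visually separated
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
# emb has shape (200, 2) and can be scatter-plotted, colored by y
```

Note the small dataset size: as the earlier question points out, t-SNE is computationally expensive, so on a limited machine it is best suited to datasets of this scale.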

Which of these represents the primary use case of PCA? A. To extract important features and reduce the dimensionality of the data set B. To perform clustering on the data set and identify groups of samples more similar to each other than the rest C. To perform predictive modeling and group samples into different classes D. None of the above

To extract important features and reduce the dimensionality of the data set

What is the objective of K-means clustering? A. To minimize the within-cluster sum of squares B. To maximize the between-cluster sum of squares C. To maximize the within-cluster sum of squares D. To minimize the between-cluster sum of squares

To minimize the within-cluster sum of squares
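The within-cluster sum of squares (WCSS) is simple to compute by hand. A sketch with NumPy and hand-made illustrative points and assignments:

```python
import numpy as np

# Two hand-made clusters and a fixed cluster assignment
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == c].mean(axis=0) for c in (0, 1)])

# WCSS: sum of squared distances from each point to its cluster centroid;
# this is the objective K-means iteratively minimizes
wcss = sum(((X[labels == c] - centroids[c]) ** 2).sum() for c in (0, 1))
print(wcss)
```

This is the same quantity scikit-learn's KMeans reports as `inertia_`, and it is what the elbow-method plot tracks as K increases.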

When performing K-means Clustering, it is considered vital to scale the variables before clustering. This is done primarily to: 1. make the model less susceptible to outliers 2. convert all the features to a comparable scale so as to give each of them equal importance 3. avoid multicollinearity among the variables 4. treat missing values to make the data more robust for further analysis

convert all the features to a comparable scale so as to give each of them equal importance
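Standardization (z-scoring) is the usual way to put features on a comparable scale before clustering. A minimal NumPy sketch with illustrative data:

```python
import numpy as np

# Features on wildly different scales: income (tens of thousands) vs. age
X = np.array([[30_000.0, 25.0],
              [60_000.0, 45.0],
              [90_000.0, 65.0]])

# Standardize each column to mean 0 and standard deviation 1, so income
# no longer dominates Euclidean distances purely because of its units
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```

After scaling, both columns contribute comparably to the distances K-means computes; scikit-learn's StandardScaler performs the same transformation.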

