bstt 426 midterm

27. Which of the following cross validation versions may not be suitable for very large datasets with hundreds of thousands of samples? 1. k-fold cross-validation 2. Leave-one-out cross-validation 3. Holdout method


PCA works better if there is: 1. A linear structure in the data 2. Data lying on a curved surface rather than a flat surface 3. Variables scaled in the same unit. A. 1 and 2 B. 2 and 3 C. 1 and 3 D. 1, 2 and 3


Centroid Usefulness

Centroids serve as a representative point for summarizing and understanding the cluster. In algorithms like K-Means, centroids are crucial for defining clusters and assigning data points to these clusters based on distance metrics.

What is the minimum number of variables/features required to perform clustering?

Clustering can technically be performed with just one variable (univariate data), but the analysis becomes more insightful and meaningful with at least two variables (bivariate data).


Data points that significantly differ from most others in the dataset. Outliers can skew statistical measures and can impact the results of data analysis, including PCA, by affecting the calculated principal components.

Silhouette Score Definition

Definition: The silhouette score is a measure of the quality of clustering, ranging from -1 to 1. A high silhouette score indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If $a(i)$ is the average distance from the $i$th data point to the other data points in the same cluster, and $b(i)$ is the smallest average distance from the $i$th data point to the data points in a different cluster, the silhouette score $S(i)$ is given as: $S(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$
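
The formula above can be sketched in plain Python. This is a minimal illustration, assuming one-dimensional points and absolute difference as the distance; the function name `silhouette` and the sample clusters are made up for the example and are not from the course material.

```python
def silhouette(c_idx, p_idx, clusters):
    """Silhouette value S(i) for the point clusters[c_idx][p_idx]."""
    point = clusters[c_idx][p_idx]
    # a(i): average distance to the other points in the same cluster.
    own = [p for j, p in enumerate(clusters[c_idx]) if j != p_idx]
    if not own:
        return 0.0  # common convention for a single-point cluster
    a = sum(abs(point - p) for p in own) / len(own)
    # b(i): smallest average distance to the points of any other cluster.
    b = min(
        sum(abs(point - p) for p in other) / len(other)
        for k, other in enumerate(clusters)
        if k != c_idx
    )
    return (b - a) / max(a, b)


# Hypothetical data: two well-separated clusters.
clusters = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
s = silhouette(0, 0, clusters)
print(s)  # a = 1.5, b = 10.0, so S = 8.5 / 10 = 0.85
```

A value close to 1, as here, means the point sits much closer to its own cluster than to the nearest other cluster.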

PCA can be used for feature selection.

False, with a nuance. PCA is generally used for feature transformation or dimensionality reduction, not feature selection, per se.

1. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is sensitive to the order of data points in the input dataset.

Mostly false. For fixed parameters, DBSCAN finds the same core points and the same noise points regardless of the order of the input. The one exception is a border point that is within reach of core points from more than one cluster: it is assigned to whichever cluster reaches it first, so its label can depend on the processing order.

1. Hierarchical clustering always produces the same number of clusters regardless of the linkage method used.

False. Hierarchical clustering does not produce a fixed number of clusters. The number of clusters that result from hierarchical clustering is determined by the level at which the dendrogram (a tree diagram representing the hierarchical relationships between clusters) is cut. Moreover, the linkage method used in hierarchical clustering can significantly impact the shape of the dendrogram and therefore the resulting clusters when it is cut at a particular level.

1. PCA is a supervised machine learning technique.

False. Principal Component Analysis (PCA) is an unsupervised machine learning technique.

1. Clustering algorithms always produce the same results, given the same data and parameters.

False. The consistency of the results produced by clustering algorithms, given the same data and parameters, can depend significantly on the algorithm being used.

3. The first principal component captures the least amount of variance in the data.

False. The first principal component of PCA captures the most variance in the data. The principal components are ordered by the amount of variance they explain from the data.

Difference between feature reduction and feature selection:

Feature Reduction: It involves transforming the original high-dimensional data into a lower-dimensional form, preserving as much information as possible (e.g., PCA). Feature Selection: It involves selecting a subset of the most important features that contribute to the predictive power of the model, discarding the redundant or less useful ones.

Imagine, you have 1000 input features and 1 target feature in a machine learning problem. You have to select 100 most important features based on the relationship between input features and the target features. Which method you select? Dimension reduction or Feature selection

Feature selection

For two runs of K-Means clustering, is it expected to get the same clustering results? Why?

Not necessarily. Two runs of K-Means can give different clustering results, especially when different initial centroid placements are used or when the data admits more than one stable cluster assignment.

Handling missing values & default missing value code in Python:

Handling: Common strategies include imputation (replacing missing values with statistical measures like mean, median), deleting rows, or using algorithms that can handle missing values. Default Missing Value Code in Python (pandas): NaN (Not a Number)
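
The two simplest strategies (deleting and mean imputation) can be sketched without pandas. This is a minimal illustration, assuming missing values are stored as the float NaN, which is also what pandas uses for missing numeric data; the `ages` list is hypothetical.

```python
import math

# Hypothetical column with two missing entries.
ages = [25.0, float("nan"), 31.0, 40.0, float("nan")]

# NaN never compares equal to itself, so test for it with math.isnan.
print(float("nan") == float("nan"))  # False

# Strategy 1: delete the missing entries.
observed = [a for a in ages if not math.isnan(a)]

# Strategy 2: impute with the mean of the observed values.
mean_age = sum(observed) / len(observed)
imputed = [mean_age if math.isnan(a) else a for a in ages]

print(observed)  # [25.0, 31.0, 40.0]
print(imputed)   # [25.0, 32.0, 31.0, 40.0, 32.0]
```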


In the context of PCA, it represents the magnitude of variance along a principal component (eigenvector). Larger eigenvalues indicate that more variance is captured along that component.

Centroid Definition

In the context of clustering, a centroid is the center of a cluster. In K-Means clustering, it is the mean position of all the points in a particular cluster, and in K-Medoids, it is the most centrally located data point in a cluster.


In the context of factor analysis and PCA, rotation is the transformation of factor or component loadings to make them easier to interpret. Various rotation methods (like Varimax) aim to maximize the loading of each variable on one of the extracted components while minimizing their loading on all other components.

Silhouette Score Usefulness

It is useful for determining how appropriately the data have been clustered and can be used to select the most appropriate number of clusters by maximizing the silhouette score.

People who bought this also bought...' recommendations seen on Amazon is based on which algorithm?

K-NN Classification

Principal Components:

Linear combinations of the original variables formed during PCA. These components are uncorrelated and are arranged in descending order of the variance they explain from the original data.

Scikit-learn packages for PCA, clustering, and classification:

- PCA: from sklearn.decomposition import PCA
- Clustering (e.g., K-Means): from sklearn.cluster import KMeans
- Classification (e.g., Logistic Regression): from sklearn.linear_model import LogisticRegression

Pros and Cons of PCA

Pros: Reduces dimensionality, can help mitigate the curse of dimensionality, helps in visualization. Cons: Assumes linear correlations between variables, loss of interpretability.

Basic principle of PCA & when it's useful:

Principle: PCA (Principal Component Analysis) reduces dimensionality by projecting data onto orthogonal vectors (principal components) that capture the most variance. Useful: When dealing with high-dimensional data, reducing noise, improving algorithm performance, or visualizing data.

Elbow Curve Usefulness

The "elbow" of the curve represents an inflection point where increasing the number of clusters leads to diminishing returns in reducing WCSS or explaining additional variance. Thus, it helps in determining a balance between minimizing WCSS and not overly segmenting the data into too many clusters.

Elbow Curve Definition

The elbow curve is a graphical tool used in determining the most appropriate number of clusters for a dataset in K-Means clustering. It involves plotting the explained variation (often using WCSS) as a function of the number of clusters, and picking the "elbow" of the curve as the number of clusters to use.

a. What is this type of function called and how is it useful? def func (x): if (x == 0): return 1 else: return (x + func (x-1))

The function provided is a recursive function: it calls itself within its own definition. Recursive functions are useful for breaking a complex problem into simpler sub-problems that are solved by repeated application of the same process. Because the base case returns 1 rather than 0, this function returns 1 plus the sum of the integers from 1 to x (for example, func(3) = 3 + 2 + 1 + 1 = 7). It assumes x is a non-negative integer, since a negative x never reaches the base case.
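
To check what the function in the question actually returns, it can be run next to a variant whose base case returns 0. The second function, `sum_to`, is a hypothetical comparison and is not part of the original question.

```python
def func(x):
    # Function from the question: the base case returns 1, not 0.
    if x == 0:
        return 1
    return x + func(x - 1)


def sum_to(x):
    # Comparison variant: base case 0 gives the plain sum 1 + 2 + ... + x.
    if x == 0:
        return 0
    return x + sum_to(x - 1)


print(func(4))    # 4 + 3 + 2 + 1 + 1 = 11
print(sum_to(4))  # 4 + 3 + 2 + 1 = 10
```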

Cumulative Variance:

The total variance explained by a specific number of principal components. Often visualized or tabulated to assist in determining how many components to retain in order to explain a desired amount of total variance.
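
As a rough sketch of how cumulative variance guides the choice of components, the snippet below starts from a list of eigenvalues and finds how many components reach a target share of the variance. The eigenvalues and the helper name `components_needed` are hypothetical.

```python
from itertools import accumulate

# Hypothetical eigenvalues of a covariance matrix, sorted in descending order.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]

total = sum(eigenvalues)
explained = [ev / total for ev in eigenvalues]  # share of variance per component
cumulative = list(accumulate(explained))        # running total of those shares


def components_needed(cumulative, threshold):
    """Smallest number of components whose cumulative share reaches threshold."""
    for n, share in enumerate(cumulative, start=1):
        if share >= threshold:
            return n
    return len(cumulative)


print(cumulative)                          # [0.5, 0.75, 0.875, 0.9375, 1.0]
print(components_needed(cumulative, 0.8))  # 3
```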


The weights or coefficients of the original variables used to form principal components. Loadings indicate the contribution of each original variable to each principal component.

How do you ensure you're not overfitting with a model?

Three main methods to avoid overfitting: 1. Keep the model simpler, for example by reducing noise through using fewer variables and parameters. 2. Use cross-validation techniques such as k-fold cross-validation. 3. Use regularization techniques such as LASSO that penalize certain model parameters if they're likely to cause overfitting.

Accuracy formula


Sensitivity Formula

TP / (TP + FN), i.e., true positives divided by all actual positives


Transforming data from its original space to a new space. In PCA, data points are projected onto the principal components for dimensionality reduction, visualization, or to derive new features.

[ True or False ] Dimensionality reduction algorithms are one of the possible ways to reduce the computation time required to build a model.


[ True or False ] It is not necessary to have a target variable for applying dimensionality reduction algorithms.


[ True or False ] PCA can be used for projecting and visualizing data in lower dimensions


3. PCA is sensitive to outliers in the data.

True. PCA is sensitive to outliers in the data.

1. Silhouette analysis is a method for evaluating the quality of clustering results, where higher silhouette scores indicate better-defined clusters.

True. Silhouette analysis is indeed a technique used to calculate the goodness of a clustering algorithm. It measures how close each point in one cluster is to the points in the neighboring clusters. Its values range from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.

Within Cluster Sum of Squared Errors (WCSS) definition

WCSS is a metric used to evaluate the performance of a clustering algorithm. It is the sum of the squared distances between each point in a cluster and the centroid of that cluster. Mathematically, if $\mu_k$ is the centroid of cluster $C_k$ and $x$ represents the data points, WCSS is defined as $\mathrm{WCSS} = \sum_{k} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$.
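
The definition can be checked by hand on a tiny example. The sketch below assumes points are tuples of floats that are already assigned to clusters; the function name `wcss` and the sample points are illustrative.

```python
def wcss(clusters):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # Centroid: coordinate-wise mean of the points in this cluster.
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        total += sum(
            sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in points
        )
    return total


# Hypothetical 2-D data: two clusters of two points each.
clusters = [[(1.0, 1.0), (3.0, 1.0)], [(10.0, 10.0), (10.0, 14.0)]]
print(wcss(clusters))  # (1 + 1) + (4 + 4) = 10.0
```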

a. What is the output of print list [2:4] if list = [42, 52, 11, 55, 24, 12].

[11, 55]. Indexing starts at 0, and the slice [2:4] includes indices 2 and 3 but stops before index 4.
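
A short runnable version of the slicing question, written with Python 3's print() call. The variable is renamed to `values` here, since naming a variable `list` would shadow the built-in type.

```python
values = [42, 52, 11, 55, 24, 12]

print(values[2:4])  # [11, 55]  indices 2 and 3; the stop index 4 is excluded
print(values[:2])   # [42, 52]  from the start up to, but not including, index 2
print(values[-2:])  # [24, 12]  the last two items
```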

21. Which of the following are true? a) Clustering analysis is negatively affected by multicollinearity of features b) Clustering analysis is negatively affected by heteroscedasticity


Feature scaling is an important step before applying the K-Means algorithm. What is the reason behind this? a) In distance calculation it will give the same weights for all features b) You always get the same clusters whether or not you use feature scaling c) In Manhattan distance it is an important step but in Euclidean it is not d) None of these


Assume the distance function is given, what methods can be used to get a better k-Means clustering on a given dataset? a) Kmeans++ initialization b) Random start c) Hierarchical Clustering d) None

a and b

Which of the following metrics do we have for finding dissimilarity between two clusters in hierarchical clustering? a) Single-link b) Complete-link c) Average-link

a b and c

What is true about K-Means clustering? a) K-Means is extremely sensitive to cluster center initializations b) Bad initialization can lead to poor convergence speed c) Bad initialization can lead to bad overall clustering

a b and c

In which of the following cases will K-Means clustering fail to give good results? a) Data points with outliers b) Data points with different densities c) Data points with round shapes d) Data points with non-convex shapes

a b and d

Increasing k value in k-nearest neighbor, the model will _____ the bias and ______ the variance. a. Decrease, Decrease b. Increase, Decrease c. Decrease, Increase d. Increase, Increase


A peak in the Silhouette Coefficient and the elbow point show

best choice for number of clusters

a. The ______ statement terminates the loop and transfers execution to the statement following the loop.


a. Python has many pre-defined functions known as __________.

built-in functions

a. The process of identifying and removing logical errors and runtime errors is called ________.


Write a function to check if a number when subtracted by 1 is a prime number.

def is_prime(num):
    # Numbers below 2 are not prime.
    if num <= 1:
        return False
    # Trial division up to the square root of num.
    for i in range(2, int(num**0.5) + 1):
        if num % i == 0:
            return False
    return True

def check_prime_minus_one(n):
    return is_prime(n - 1)

# Example usage:
num = 4
if check_prime_minus_one(num):
    print(f"{num-1} is prime.")
else:
    print(f"{num-1} is not prime.")

What should be the selected number of clusters based on a dendrogram?

determined by looking at the largest vertical distance that doesn't intersect any of the cluster's horizontal lines (or by cutting the dendrogram at a desired dissimilarity height)

a. ________ data structure is most suited to maintaining a book catalog.


For a large k value the k-nearest neighbor model becomes _____ and ______ . i. Complex model, Overfit ii. Complex model, Underfit iii. Simple model, Underfit iv. Simple model, Overfit


a. Select the correct option. for i in range(5): *missing code* Output: 0 1 2 3 4 i. print(i) ii. print(i, sep='\t') iii. print(i, end=' ') iv. print(i + " ")

print (i, end = ' ')

a. What is the difference between the read and the readlines command?

read: Reads the entire file contents and returns it as a single string. readlines: Reads the entire file and returns it as a list of strings, where each string is a line from the file.
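
The difference is easy to see on a small file. This sketch writes a temporary two-line file (the file contents are made up for the example) and reads it back both ways.

```python
import os
import tempfile

# Create a small temporary text file to read from.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
    path = tmp.name

with open(path) as f:
    whole = f.read()       # one string holding the entire file

with open(path) as f:
    lines = f.readlines()  # list of strings, one per line, newlines kept

os.remove(path)

print(repr(whole))  # 'first line\nsecond line\n'
print(lines)        # ['first line\n', 'second line\n']
```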

a. The function to convert an object to a string in python is ________.



TN / (TN + FP), i.e., true negatives divided by all actual negatives

The most popularly used dimensionality reduction algorithm is Principal Component Analysis (PCA). Which of the following is/are true about PCA? 1. PCA is an unsupervised method 2. It searches for the directions that data have the largest variance 3. Maximum number of principal components <= number of features 4. All principal components are orthogonal to each other 5. All of the above


1. Hierarchical clustering always produces a flat structure with a fixed number of clusters.

False. Hierarchical clustering does not produce a flat structure with a fixed number of clusters. Instead, it creates a hierarchy or binary tree of clusters known as a dendrogram, which represents the nested grouping of patterns and similarity levels at each merging (or splitting) of clusters. Hierarchical clustering does not require specifying the number of clusters beforehand and does not constrain the data to a fixed number of clusters unless a specific level of the dendrogram is chosen to cut and form flat clusters.

1. When performing a K-Means cluster analysis, the algorithm will automatically choose the optimal number of clusters for you.

False. In the K-Means clustering algorithm, the number of clusters (k) is a required input parameter and is not determined automatically by the algorithm. The user must specify the number of clusters they wish to use before running the algorithm.

Within Cluster Sum of Squared Errors (WCSS) Usefulness

Lower WCSS values indicate that the data points are closer to the centroids of their respective clusters, which is generally desirable. However, it tends to always decrease with increasing number of clusters, so it is typically used in conjunction with other metrics or methods (such as the elbow method) to determine the optimal number of clusters.

Three stages to build the hypotheses or model in machine learning:

Model Building: Choose a model and specify the algorithm. Model Training: Train the model using a labeled dataset. Model Testing: Test the model's predictions against a new dataset to evaluate its performance.

Pros and Cons Hierarchical Clustering

Pros: Doesn't require specifying the number of clusters upfront, produces a dendrogram which can be useful for understanding hierarchical structure. Cons: More computationally expensive, doesn't work well for large datasets.

Pros and Cons K Means

Pros: Simple to understand, fast for large datasets. Cons: Assumes clusters to be spherical, sensitive to initial centroid placement, might converge to a local minimum.

a. What are Python dictionaries? Give an example.

Python dictionaries are collections of data stored as key: value pairs. Since Python 3.7 they preserve insertion order; earlier versions made no ordering guarantee. Example: dictionary = {'name': 'John', 'age': 25, 'occupation': 'Engineer'}


Refers to the number of variables or features in a dataset. "High dimensionality" refers to datasets with a large number of variables, which can be challenging to analyze and visualize.

Difference between supervised and unsupervised machine learning:

Supervised Learning: The algorithm is trained on a labeled dataset, meaning the algorithm is provided with input-output pairs. The goal is to learn a mapping from inputs to outputs. Unsupervised Learning: The algorithm is trained on an unlabeled dataset, trying to learn the underlying structure of the data, like clustering or dimensionality reduction.

Data Compression:

The reduction of the size or quantity of data to save space or transmission time. In the context of PCA, it refers to reducing the number of dimensions (variables) while retaining as much variance or information as possible.

When performing cluster analysis, you should always standardize the variables.

True, generally - but with some qualifications. Standardizing variables is often crucial in cluster analysis, especially when the variables are measured on different scales or have different units of measurement, as it prevents variables with larger scales from dominating the distance calculations. This is particularly important for clustering algorithms that utilize distance metrics for forming clusters, such as K-Means.

1. In hierarchical clustering, the dendrogram can be used to determine the optimal number of clusters.

True, with qualifications. A dendrogram, which is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering, can be used to help determine the "optimal" number of clusters by visualizing the structure and informing where to cut the tree to define the clusters. However, deciding on the optimal number of clusters, and thus where to cut the dendrogram, often involves subjective judgement.

Cluster analysis can be performed using nominal categorical variables.

True, with some qualifications. Cluster analysis can be performed using nominal categorical variables, but not all clustering algorithms are well-suited for handling categorical data. Traditional clustering algorithms like K-Means are designed for continuous variables and use Euclidean distance as a measure of similarity, which is not meaningful when applied to categorical data.

1. Clustering is a type of unsupervised learning.

True. Clustering is indeed a type of unsupervised learning. In machine learning, unsupervised learning refers to the type of problems where the algorithm is given input data without explicit instructions on what to do with it. The system tries to learn the patterns and the structure from the data without any labeled responses.

3. Eigenvalues and eigenvectors are used in PCA to compute the principal components.

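True. Eigenvalues and eigenvectors are fundamental in PCA (Principal Component Analysis). The eigenvectors of the covariance matrix give the directions of the principal components, and the corresponding eigenvalues give the amount of variance captured along each direction.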


(e.g., disguised missing data) Jan. 1 as everyone's birthday?

Smoothing by bin means:

- Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29

print(1, 2, 3, 4) print(1, 2, 3, 4, sep='*') print(1, 2, 3, 4, sep='#', end='&') output

1 2 3 4
1*2*3*4
1#2#3#4&

You trained a binary classifier model which gives very high accuracy on the training data, but much lower accuracy on validation data. Which of the following may be true? 1. The training was not well regularized. 2. This is an instance of overfitting. 3. This is an instance of underfitting. 4. The training and testing examples are sampled from different distributions.

1 2 and 4

The most popularly used dimensionality reduction algorithm is Principal Component Analysis (PCA). Which of the following is/are true about PCA? 1. PCA is an unsupervised method 2. It searches for the directions that data have the largest variance 3. Maximum number of principal components <= number of features 4. All principal components are orthogonal to each other

1. PCA is an unsupervised method 2. It searches for the directions that data have the largest variance 3. Maximum number of principal components <= number of features 4. All principal components are orthogonal to each other

Given the number of desired clusters k, the k-means algorithm follows four steps:

1. Randomly assign objects to create k nonempty initial partitions (clusters)
2. Compute the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest centroid (reallocation step)
4. Go back to Step 2; stop when the assignment does not change
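
The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production implementation: points are one-dimensional floats, distance is absolute difference, there are at least k points, and a cluster that becomes empty simply keeps its previous centroid. The function name `kmeans_1d` and the `seed` parameter are hypothetical.

```python
import random


def kmeans_1d(points, k, seed=0):
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    # Step 1: random nonempty initial partitions (round-robin over a shuffle).
    clusters = [shuffled[i::k] for i in range(k)]
    # Step 2: centroid of each cluster is the mean of its points.
    centroids = [sum(c) / len(c) for c in clusters]
    while True:
        # Step 3: reassign every point to the cluster with the nearest centroid.
        new_clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            new_clusters[nearest].append(p)
        # Step 4: stop when the assignment no longer changes.
        if new_clusters == clusters:
            return clusters, centroids
        clusters = new_clusters
        # Back to Step 2; an empty cluster keeps its old centroid (assumption).
        centroids = [
            sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]


# Hypothetical data: two obvious groups.
clusters, centroids = kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], k=2)
print(sorted(clusters))   # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
print(sorted(centroids))  # [2.0, 11.0]
```

On less cleanly separated data the result can depend on the random initial partition, which is why different runs of K-Means may disagree.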

Suppose your model is overfitting. Which of the following is NOT a valid way to try and reduce the overfitting? 1. Increase the amount of training data. 2. Improve the optimization algorithm being used for error minimization. 3. Decrease the model complexity 4. Reduce the noise in the training data.


What happens when you get features in lower dimensions using PCA? 1. The features will still have interpretability 2. The features will lose interpretability 3. The features must carry all information present in data 4. The features may not carry all information present in data


PCA works best on dataset having how many dimensions

3 or more

how many keywords in python 3.7


Model Based Clustering Approach

- A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model

Elbow Method

- Calculate the sum of within-cluster variance, W, for different values of k.
- W is a cumulative measure of how well the points are clustered in the analysis.
- Plotting the k values against their corresponding sum of within-cluster variance helps in finding the number of clusters.


- Determine the largest vertical distance that doesn't intersect any of the other clusters
- Draw a horizontal line at both extremities
- The optimal number of clusters is equal to the number of vertical lines going through the horizontal line

Post Processing

- Eliminate small clusters that may represent outliers
- Split 'loose' clusters, i.e., clusters with relatively high SSE
- Merge clusters that are 'close' and that have relatively low SSE

Variations of K Means usually differ in

- Selection of the initial k means
- Distance or similarity measures used
- Strategies to calculate cluster means

Silhouette Coefficient

- The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation).

Hierarchical-based clustering

- Agglomerative: pairs of items or clusters are successively linked to produce larger clusters
- Divisive: start with the whole set as a cluster and successively divide sets into smaller partitions

Partitioning-based clustering

- Divide a set of N items into K clusters (top-down)

The quality of a clustering method depends on

- the similarity measure used
- its implementation, and
- its ability to discover some or all of the hidden patterns

a, b, c = 5, 3.2, "Hello" print (a) print (b) print (c) what would be the output?

5 3.2 Hello

How to Assign Values to Variables in python


Write a command to compare A & B where A is "SunSet " and B is "sunset".

A = "SunSet "
B = "sunset"
# Strip surrounding whitespace and ignore case before comparing.
if A.strip().lower() == B.lower():
    print("They are the same.")
else:
    print("They are different.")


A code block (body of a function, loop, etc.) starts with indentation and ends with the first unindented line

Floating-point number

A floating-point number is a numerical representation that approximates a real number using a decimal point to denote fractions. It is used to represent numbers that have fractional parts or numbers that are very large or very small and can't be accurately represented using only integers.


A non-zero vector that remains parallel after the application of a linear transformation. In PCA, eigenvectors of the covariance matrix define the directions of maximum variance (principal components).

Scree Plot:

A plot used in PCA which displays the eigenvalues associated with each principal component in descending order. It helps to determine the number of principal components to retain by visualizing the "elbow point" where the addition of more components has a diminished return.


A statistical measure that describes the spread of numbers in a dataset. It quantifies the degree to which each number in the dataset deviates from the mean and, therefore, from every other number in the set.

Principal Component Analysis (PCA)

A statistical procedure that uses orthogonal transformation to convert correlated variables into a set of uncorrelated variables called principal components, often used for dimensionality reduction and data visualization.

Which of the following techniques would perform better for reducing dimensions of a data set? A. Removing columns which have too many missing values B. Removing columns which have high variance in data C. Removing columns with dissimilar data trends D. None of these


Which of the following can act as possible termination conditions in K-Means? For a fixed number of iterations. Assignment of observations to clusters does not change between iterations. Except for cases with a bad local minimum. Centroids do not change between successive iterations. Terminate when RSS falls below a threshold.


Python Identifiers

An identifier is a name given to entities like class, functions, variables, etc. It helps to differentiate one entity from another.

Weakness of the K-means

- Applicable only when the mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers

Smoothing by bin boundaries:

Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34

Partition into equal-frequency (equi-depth) bins

Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34
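
The binning cards in this set (equal-frequency partitioning, smoothing by bin means, smoothing by bin boundaries) can be reproduced with a few lines of Python. This is a sketch that assumes the sorted data divides evenly into the bins and that bin means are rounded to the nearest integer; the function names are illustrative.

```python
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])


def equal_frequency_bins(values, n_bins):
    # Assumes len(values) is a multiple of n_bins.
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    # Replace every value with the (rounded) mean of its bin.
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    # Replace every value with the closer of its bin's minimum and maximum.
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


bins = equal_frequency_bins(data, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```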

data discretization methods

- Binning: top-down split, unsupervised
- Histogram analysis: top-down split, unsupervised
- Clustering analysis: unsupervised, top-down split or bottom-up merge
- Decision-tree analysis: supervised, top-down split
- Correlation (e.g., χ²) analysis: unsupervised, bottom-up merge

Data Integration

Combines data from multiple sources into a coherent store

What is if...else statement in Python?

Decision making is required when we want to execute a code only if a certain condition is satisfied. The if...elif...else statement is used in Python for decision making.


Divide the range of a continuous attribute into intervals

A library is added by a load statement (T/F)


a. Python is not an OOP language. (T/F).


1. K-Means clustering guarantees convergence to the global optimum solution.

False. K-Means clustering does not guarantee convergence to the global optimum solution. The algorithm aims to minimize the within-cluster sum of squares (the sum of squared distances from each point to its assigned cluster centroid). However, K-Means is sensitive to the initial placement of the cluster centroids and can converge to a local minimum of the objective function rather than the global minimum.

1. K-means clustering can handle datasets with a high degree of noise and outliers effectively.

False. K-Means clustering is sensitive to noise and outliers in the dataset. The objective of K-Means is to minimize the within-cluster sum of squares, and thus it is largely influenced by extreme values or outliers because squaring the distances amplifies the impact of large values.

3. PCA always improves the performance of a machine learning model.

False. PCA does not always improve the performance of a machine learning model. While it can be a useful technique for dimensionality reduction and mitigating the curse of dimensionality, there are situations and models where applying PCA might not enhance and could even degrade model performance.

PCA reduces the dimensionality of the data by removing features.

False. PCA reduces dimensionality not by removing original features, but by transforming them into a new set of uncorrelated variables known as principal components. These principal components are linear combinations of the original features and are ordered by the amount of variance they explain from the data.

Feature scaling is an important step before applying the K-Means algorithm. What is the reason behind this?

Feature scaling is crucial before applying the K-Means algorithm because the algorithm depends on distance measurements to assign data points to clusters and to compute centroids. In short:
- Equal importance: Without scaling, variables with larger values will unduly influence the clustering, potentially overshadowing important patterns in variables with smaller scales.
- Distance accuracy: K-Means uses distances between points to form clusters. Scaling ensures each variable contributes equally to the distance computation, preventing bias in cluster assignments.
- Efficiency: Properly scaled features can help the algorithm converge more quickly and provide more stable results.
- Interpretability: Scaled features help in forming clusters that reflect actual data patterns, making them more interpretable and meaningful.
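
A small numeric sketch of the scaling argument, using z-score standardization as one common choice of scaling. The `income` and `age` columns are hypothetical data for three people.

```python
import math
import statistics


def standardize(column):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)
    return [(v - mean) / sd for v in column]


def distance(i, j, columns):
    """Euclidean distance between rows i and j across the given columns."""
    return math.sqrt(sum((c[i] - c[j]) ** 2 for c in columns))


# Hypothetical features on very different scales.
income = [30000.0, 32000.0, 90000.0]
age = [25.0, 60.0, 27.0]

# Unscaled: the 35-year age gap between persons 0 and 1 barely registers
# next to the 2000-unit income gap.
raw = distance(0, 1, [income, age])

# Scaled: both features are in standard-deviation units, so the large age
# gap now dominates the small income gap.
z_income, z_age = standardize(income), standardize(age)
scaled = distance(0, 1, [z_income, z_age])

print(round(raw, 1))
print(abs(z_income[0] - z_income[1]) < abs(z_age[0] - z_age[1]))  # True
```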

Regression noisy data

Fit data to a function. Linear regression finds the best line to fit two variables. Multiple regression can handle multiple variables. The values given by the function are used instead of the original values.

if test expression: statement(s)

Here, the program evaluates the test expression and will execute statement(s) only if the test expression is True. If the test expression is False, the statement(s) is not executed. In Python, the body of the if statement is indicated by the indentation. The body starts with an indentation and the first unindented line marks the end. Python interprets non-zero values as True. None and 0 are interpreted as False.

Multi-line statement

In Python, the end of a statement is marked by a newline character. But we can make a statement extend over multiple lines with the line continuation character (\).
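
A brief example of line continuation. The backslash form is the one described above; the parenthesized form is shown for comparison, since Python also continues a statement implicitly inside parentheses, brackets, and braces.

```python
# Explicit continuation with a backslash at the end of the line.
total = 1 + 2 + 3 + \
        4 + 5 + 6

# Implicit continuation inside parentheses.
total_paren = (1 + 2 + 3 +
               4 + 5 + 6)

print(total, total_paren)  # 21 21
```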

You are given a cancer detection data set. Let's suppose when you build a classification model you achieved an accuracy of 96%. Why shouldn't you be happy with your model performance? What can you do about it? (3 sentences)

At first glance, 96% accuracy seems very good. However, in a cancer dataset the number of malignant patients is typically much smaller than the number of benign patients, so this is an imbalanced-data problem and overall accuracy is not an appropriate measure. Instead, look at sensitivity and specificity, which measure how well the positive and negative classes are each identified.
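
To see why accuracy misleads here, consider a hypothetical confusion matrix: 1000 patients, 40 of them malignant, and a model that labels everyone benign. The counts and the function name `confusion_metrics` are made up for illustration.

```python
def confusion_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)  # true positives over all actual positives
    specificity = tn / (tn + fp)  # true negatives over all actual negatives
    return accuracy, sensitivity, specificity


# Hypothetical counts: every malignant case is missed (fn=40, tp=0).
accuracy, sensitivity, specificity = confusion_metrics(tp=0, fp=0, tn=960, fn=40)

print(accuracy)     # 0.96
print(sensitivity)  # 0.0
print(specificity)  # 1.0
```

The model reaches 96% accuracy while detecting no cancer cases at all, which is exactly what sensitivity exposes.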

Python List

A list is an ordered sequence of items. It is one of the most used data types in Python and is very flexible. The items in a list do not all need to be of the same type. Declaring a list is straightforward: items separated by commas are enclosed within brackets [ ].

How to Decide how many components?

Look at Eigen Values and a bar plot or a scree plot

Pre Processing

- Normalize the data
- Eliminate outliers

What are operators in python?

Operators are special symbols in Python that carry out arithmetic or logical computation. The value that the operator operates on is called the operand.

python keywords

Reserved words with a special meaning in Python (for example, def tells Python you are about to define a function).

Strength of K Means

Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n. Note: it often terminates at a local optimum.

Clustering noisy data

Similar values are organized into groups (clusters). Values falling outside of clusters may be considered "outliers" and may be candidates for elimination

Python Strings

A string is a sequence of Unicode characters. We can use single quotes or double quotes to represent strings. Multi-line strings can be denoted using triple quotes, ''' or """.

Python Break and Continue

The break statement terminates the loop containing it. Control of the program flows to the statement immediately after the body of the loop. If the break statement is inside a nested loop (loop inside another loop), the break statement will terminate the innermost loop.

Python Continue statement

The continue statement is used to skip the rest of the code inside a loop for the current iteration only. The loop does not terminate but continues on with the next iteration.
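
A small sketch contrasting break and continue on the same loop. The variable names visited and kept are made up for this example.

```python
# break: stop the loop completely when "i" is reached.
visited = []
for letter in "string":
    if letter == "i":
        break
    visited.append(letter)

# continue: skip only the iteration where letter is "i".
kept = []
for letter in "string":
    if letter == "i":
        continue
    kept.append(letter)

print("".join(visited))  # str
print("".join(kept))     # strng
```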

Python for Loop

The for loop in Python is used to iterate over a sequence (list, tuple, string) or other iterable objects. Iterating over a sequence is called traversal.

Within Cluster Sum of Squares (WCSS)

The sum of the squared deviations between each observation and its cluster centroid.
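
As a rough sketch, WCSS can be computed by hand for a tiny example. The two clusters below are made-up 2-D points, and the helper names centroid and wcss are assumptions for illustration.

```python
def centroid(points):
    """Mean of each coordinate across the points of one cluster."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def wcss(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        center = centroid(points)
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, center))
    return total

# Hypothetical clusters: centroids are (1, 2) and (9, 9).
clusters = [
    [(1.0, 1.0), (1.0, 3.0)],
    [(8.0, 8.0), (10.0, 8.0), (9.0, 11.0)],
]
print(wcss(clusters))  # 10.0
```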

While Loop

The while loop in Python is used to iterate over a block of code as long as the test expression (condition) is true.
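
A minimal sketch of two while loops. The second one illustrates a case where the number of iterations is not known in advance; the starting value 100.0 is arbitrary.

```python
# Sum the numbers 1..n with a while loop.
n = 10
total = 0
i = 1
while i <= n:
    total += i
    i += 1  # without this update the loop would never end

# Keep halving until the value drops below 1; the number of
# iterations depends on the starting value.
value = 100.0
steps = 0
while value >= 1:
    value /= 2
    steps += 1

print(total)  # 55
print(steps)  # 7
```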

They are used to define the syntax and structure of the Python Language


We cannot use a keyword as a variable name, function name or any other identifier.


a. In Python, a variable must be defined before it is assigned a value (T/F).


a. Python allows for dynamic data type. (T/F)


a. You use a while loop when the number of iterations is uncertain (T/F).


K-Means clustering seeks to minimize the distance from each point to the center of a fixed number of clusters.

True. K-Means clustering aims to partition a dataset into k clusters, where k is a predefined number of clusters specified by the user. The algorithm seeks to minimize the within-cluster sum of squares, which essentially is minimizing the squared Euclidean distance from each data point to the centroid of its assigned cluster.

3. PCA can be used for data compression while preserving most of the original information.

True. PCA (Principal Component Analysis) can indeed be used for data compression while preserving most of the original information.

3. PCA is useful for visualizing high-dimensional data.

True. PCA is indeed useful for visualizing high-dimensional data by reducing its dimensionality in a way that preserves as much of the data's variance as possible.

3. PCA requires the data to be normalized or standardized before applying it.

True. PCA is sensitive to the variances of the original features because it aims to identify the principal components that maximize variance. If one feature has a very large variance while others do not, PCA might determine that the direction of maximal variance more closely aligns with the feature having larger variance, which might not be informative regarding the principal axes of the actual data structure.

Range() Function

We can generate a sequence of numbers using the range() function. range(10) will generate numbers from 0 to 9 (10 numbers). We can also define the start, stop and step size as range(start, stop, step_size). step_size defaults to 1 if not provided.

Python Nested if statements

We can have an if...elif...else statement inside another if...elif...else statement. This is called nesting in computer programming. Any number of these statements can be nested inside one another. Indentation is the only way to figure out the level of nesting. Nested statements can get confusing, so they should be avoided unless necessary.
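
A short sketch of a nested if statement. The function name classify is made up for this example.

```python
def classify(num):
    """Classify a number using one if statement nested in another."""
    if num >= 0:
        # Inner if: only reached when num is not negative.
        if num == 0:
            return "zero"
        else:
            return "positive"
    else:
        return "negative"

print(classify(5))   # positive
print(classify(0))   # zero
print(classify(-3))  # negative
```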

Two cases when data needs to be standardized:

• When features are on different scales and this scale difference can impact the performance of the machine learning algorithm (e.g., in distance-based algorithms like K-means clustering or k-NN).
• Before applying techniques that are sensitive to feature variance or that assume roughly Gaussian data, such as PCA.

Is python a case sensitive language?


Is it possible that Assignment of observations to clusters does not change between successive iterations in K-Means

Yes. When the assignment of observations to clusters does not change between successive iterations, the K-Means algorithm is considered to have converged. It can be terminated at that point, since further iterations would not change the clustering.

Is K-Means algorithm sensitive to outliers?

Yes, the K-Means clustering algorithm is sensitive to outliers. This sensitivity is due to the algorithm minimizing the sum of squared deviations (variances) within each cluster, which inherently can be highly influenced by extreme values (outliers).

Python Output Using print() function

a = 5
print('The value of a is', a)
Output: The value of a is 5

a = [5,10,15,20,25,30,35,40]
# a[2] = 15
print("a[2] =", a[2])
# a[0:3] = [5, 10, 15]
print("a[0:3] =", a[0:3])
# a[5:] = [30, 35, 40]
print("a[5:] =", a[5:])
What is the output?

a[2] = 15
a[0:3] = [5, 10, 15]
a[5:] = [30, 35, 40]

Second Principal Component

Also a linear combination of the original predictors, which captures the remaining variance in the data set and is uncorrelated with the first component. In other words, the correlation between the first and second components should be zero.

how is the range object lazy

because it doesn't generate every number that it "contains" when we create it. However, it is not an iterator since it supports in, len and __getitem__ operations. This function does not store all the values in memory; it would be inefficient. So it remembers the start, stop, step size and generates the next number on the go. To force this function to output all the items, we can use the function list().
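
A quick sketch of this behaviour. The particular start, stop and step values are arbitrary.

```python
# A range describing half a million even numbers; none are stored yet.
r = range(0, 1_000_000, 2)

# It supports len, indexing and membership tests like a sequence.
length = len(r)
eleventh = r[10]
has_even = 999_998 in r
has_odd = 7 in r

# It is iterable, but it is not itself an iterator (no __next__ method).
is_iterator = hasattr(r, "__next__")
first = next(iter(r))

# list() forces all the items to be generated.
small = list(range(5))

print(length, eleventh, has_even, has_odd, is_iterator, first, small)
```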

How to Handle Noisy Data

• Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, bin boundaries, etc.
• Regression: smooth by fitting the data to regression functions.
• Clustering: detect and remove outliers.
• Combined computer and human inspection: detect suspicious values and have a human check them (e.g., to deal with possible outliers).
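
The binning approach can be sketched as follows. The price values and the bin size of 4 are made up for illustration, and the helper names are assumptions. When a value is equally close to both boundaries, this sketch picks the lower one.

```python
from statistics import mean

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Hypothetical noisy prices.
prices = [21, 4, 28, 15, 8, 34, 9, 25, 24, 26, 21, 29]
bins = equal_frequency_bins(prices, 4)

print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```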

two major approaches to find optimal number of clusters:

domain knowledge or data driven method (such as elbow method)

the purpose of smoothing data is to

eliminate noise

Key words are not case sensitive


All Succeeding principal components

Follow a similar concept, i.e., they capture the remaining variation without being correlated with the previous components.

Syntax of For Loop

for val in sequence:
    Body of for
Here, val is the variable that takes the value of the item inside the sequence on each iteration. The loop continues until we reach the last item in the sequence. The body of the for loop is separated from the rest of the code using indentation.
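
A minimal sketch of this syntax. The list of numbers is arbitrary.

```python
numbers = [6, 5, 3, 8, 4, 2, 5, 4, 11]

# val takes the value of each item in turn.
total = 0
for val in numbers:
    total += val

# The same loop style works with range() to visit indexes.
doubled = []
for i in range(len(numbers)):
    doubled.append(numbers[i] * 2)

print("The sum is", total)  # The sum is 48
```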

What point of VIF do we drop variables

greater than 5

If many points have a low or negative silhouette value, then the clustering configuration may

have too many or too few clusters

Issues with K-Means

The basic K-means algorithm can yield empty clusters.

Cluster Centroid

Each centroid can be seen as representing the "average observation" within a cluster across all the variables in the analysis.

Basic Steps to Develop a Clustering Task

• Feature selection / preprocessing: select information concerning the task of interest; keep information redundancy minimal; normalization/standardization may be needed.
• Distance/similarity measure: similarity of two feature vectors.
• Clustering criterion: expressed via a cost function or some rules.
• Clustering algorithms: choice of algorithms.
• Validation of the results.
• Interpretation of the results with applications.

Possible Termination Conditions in K Means

• Run for a fixed number of iterations.
• Assignment of observations to clusters does not change between iterations (except for cases with a bad local minimum).
• Terminate when the RSS (residual sum of squares) falls below a threshold.
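
Here is a rough sketch of K-Means that uses two of these stopping rules: a maximum number of iterations and "assignments did not change". It is not a library implementation. As a simplifying assumption, the first k points serve as the initial centroids so the run is deterministic, and the six data points are made up.

```python
import math

def kmeans(points, k, max_iter=100):
    """Return (centroids, assignments, iterations_used)."""
    # Assumption: first k points as initial centroids (deterministic sketch).
    centroids = [points[i] for i in range(k)]
    assignments = None
    for iteration in range(1, max_iter + 1):  # rule 1: fixed iteration cap
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # rule 2: assignments unchanged, so we have converged
        assignments = new_assignments
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, assignments, iteration

# Hypothetical data with two obvious groups.
data = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignments, iterations = kmeans(data, k=2)
print(assignments, iterations)  # [0, 1, 0, 0, 1, 1] 2
```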

Maximum Distance from Centroid

The maximum distance from observations to the cluster centroid is a measure of the variability of the observations within each cluster. A higher maximum value, especially in relation to the average distance, indicates an observation in the cluster that lies farther from the cluster centroid.

K Means Clustering Method

Clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. The algorithm requires the number of clusters to be specified. It scales well to a large number of samples and has been used across a wide range of application areas in many different fields. K-means aims to choose centroids that minimize the inertia.

Distances Between Cluster Centroids

How far apart the cluster centroids are from each other. A larger distance generally indicates a greater difference between the clusters.

Why Pre process data?

increase data quality

Python Input

input([prompt]) displays the optional prompt, waits for the user to type a line, and returns what was entered as a string.

incomplete data

lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., Occupation=" " (missing data)

As the number of observations increases, the sum of squares becomes

larger. Therefore, the within-cluster sum of squares is often not directly comparable across clusters with different numbers of observations.

Types of normalization

Min-max, z-score, and decimal scaling.
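
The three methods can be sketched in plain Python. The function names and the sample values are assumptions for illustration, and the z-score here uses the population standard deviation.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score(values):
    """Subtract the mean and divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that makes every |value| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

# Hypothetical values.
data = [200, 300, 400, 600, 1000]
print(min_max(data))          # [0.0, 0.125, 0.25, 0.5, 1.0]
print(z_score(data))
print(decimal_scaling(data))  # [0.02, 0.03, 0.04, 0.06, 0.1]
```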

Inconsistent Data

Containing discrepancies in codes or names, e.g.:
• Age="42", Birthday="03/07/2010"
• Was rating "1, 2, 3", now rating "A, B, C"
• Discrepancy between duplicate records

Noisy Data

Containing noise, errors, or outliers, e.g., Salary="−10" (an error)

Can an identifier start with a digit?


Can keywords be used as identifiers


can you use special symbols as identifiers


k represents what in k means

The number of clusters, which is specified by the user before running the algorithm.

print(range(10))
print(list(range(10)))
print(list(range(2, 8)))
print(list(range(2, 20, 3)))
What is the output?

range(0, 10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[2, 3, 4, 5, 6, 7]
[2, 5, 8, 11, 14, 17]

Line continuation is implied inside parentheses ( ), brackets [ ], and braces { }


How to figure out data type in python


Can Identifiers be a combination of letters in lowercase (a to z) or uppercase (A to Z) or digits (0 to 9) or an underscore


can an identifier be any length


can triple quotes be used for multi line comment


Can we put multiple statements on a line using a semicolon?


Should we drop variables with low variance

Yes. A variable with very low variance carries little information, so dropping it is a form of dimension reduction.

Should we drop a variable if it has very high correlation with another variable

Yes. It is not good to have multiple variables carrying similar information or variation; this is known as "multicollinearity".

First Principal Component

• A linear combination of the original predictor variables which captures the maximum variance in the data set.
• It determines the direction of highest variability in the data.
• The larger the variability captured in the first component, the larger the information captured by the component. No other component can have variability higher than the first principal component.
• The first principal component results in a line which is closest to the data, i.e., it minimizes the sum of squared distances between the data points and the line.

Why Dimension Reduction is important in machine learning & predictive modeling?

• Due to the surge in data, it has started gaining more importance.
• Because of the way data gets captured, there can be a lot of redundancy.
• The more variables, the more trouble!

Steps for PCA

• First look at the data:
  • Heatmap (for correlation)
  • Boxplot (for different levels and arbitrary units)
• Is it suitable for PCA? (Check multicollinearity.)
• If it is, center the data.
• Apply PCA.
• Draw a scree plot to select the number of components.
• Select the components and check which variables are important for each component.
• Get the final reduced matrix.
• Plot the components (2 dimensions) for visualization.
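
The core steps (center the data, apply PCA, inspect the explained variance) can be sketched for the special case of exactly two features, where the eigenvalues of the 2x2 covariance matrix have a closed form. This is only an illustration under that assumption; the data values and the function name pca_2d are made up, and real work would normally use a library.

```python
import math
from statistics import mean

def pca_2d(rows):
    """PCA for exactly two features via the 2x2 covariance matrix."""
    xs, ys = zip(*rows)
    mx, my = mean(xs), mean(ys)
    centered = [(x - mx, y - my) for x, y in rows]  # center the data
    n = len(rows)
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Eigenvalues of [[sxx, sxy], [sxy, syy]], largest first.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eigenvalues = [half_trace + root, half_trace - root]
    # Unit eigenvectors: pc1 is the direction of maximum variance.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(angle), math.sin(angle))
    pc2 = (-math.sin(angle), math.cos(angle))
    # Project the centered data onto the components.
    scores = [(x * pc1[0] + y * pc1[1], x * pc2[0] + y * pc2[1])
              for x, y in centered]
    explained = [ev / sum(eigenvalues) for ev in eigenvalues]
    return eigenvalues, (pc1, pc2), scores, explained

# Hypothetical data with two correlated features.
rows = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
eigenvalues, components, scores, explained = pca_2d(rows)
print([round(e, 3) for e in explained])  # values one would show in a scree plot
```

The explained list is what a scree plot would display, and the first column of scores is the reduced one-dimensional representation.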

What are the benefits of Dimension Reduction

• Helps compress the data and reduce the storage space required.
• Speeds up the time required to perform the same computations.
• Takes care of multicollinearity, which improves model performance; it removes redundant features.
• Helps remove noise, which can also improve the performance of models.
• Reducing the dimensions of data to 2D or 3D may allow us to plot and visualize it precisely.

Forward Feature Construction

This is the inverse process to Backward Feature Elimination. We start with 1 feature only and progressively add 1 feature at a time, i.e., the feature that produces the highest increase in performance. Both algorithms, Backward Feature Elimination and Forward Feature Construction, are quite expensive in time and computation.

Backward Feature Elimination

The selected classification algorithm is trained on n input features. Then we remove one input feature at a time and train the same model on n-1 input features, n times. The input feature whose removal has produced the smallest increase in the error rate is removed, leaving us with n-1 input features. The classification is then repeated using n-2 features, and so on. By selecting the maximum tolerable error rate, we define the smallest number of features necessary to reach that classification performance with the selected machine learning algorithm.

The silhouette ranges from

−1 to +1, where a high value indicates that the object is well matched to its own cluster and well separated from other clusters. If most objects have a high value, then the clustering configuration is appropriate.
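
A rough sketch of the silhouette computation, using s(i) = (b(i) - a(i)) / max{a(i), b(i)} with a(i) the mean distance to the other points in the same cluster and b(i) the smallest mean distance to the points of another cluster. The 1-D data, the two labelings, and the convention of scoring a single-point cluster as 0 are assumptions for illustration.

```python
import math
from statistics import mean

def silhouette_values(points, labels):
    """Return the silhouette value s(i) for every point."""
    values = []
    for i, (p, own) in enumerate(zip(points, labels)):
        same = [math.dist(p, q)
                for j, (q, lab) in enumerate(zip(points, labels))
                if lab == own and j != i]
        if not same:
            values.append(0.0)  # assumed convention for single-point clusters
            continue
        a = sum(same) / len(same)
        b = min(
            mean(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            for other in set(labels) if other != own
        )
        values.append((b - a) / max(a, b))
    return values

# Hypothetical 1-D data with two obvious groups.
points = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
good = silhouette_values(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_values(points, [0, 1, 0, 1, 0, 1])

print(round(mean(good), 3))  # high: clusters match the structure
print(round(mean(bad), 3))   # low or negative: poor configuration
```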

