Data Scientist Interview


What Are Hyperparameters?

Hyperparameters are the "knobs and dials" of a model. We tune them so the model fits and generalizes as well as possible. Some examples of hyperparameters include:
- Number of trees
- Max depth
- Number of nodes in a hidden layer
- Number of epochs to train for
- C for logistic regression models
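As an illustration (my addition, not part of the original card), here is roughly how such hyperparameters appear as constructor arguments in scikit-learn; the specific values are arbitrary:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Tree-based hyperparameters: number of trees and maximum depth
rf = RandomForestClassifier(n_estimators=200, max_depth=5)

# C controls the (inverse) regularization strength of logistic regression
logreg = LogisticRegression(C=0.1)

# Neural-network hyperparameters: nodes in the hidden layer and training iterations (epochs)
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=50)
```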

What are Recommender Systems?

Recommender systems are a subclass of information filtering systems that are meant to predict the preferences or ratings a user would give to a product. Recommender systems are widely used for movies, news, research articles, products, social tags, music, etc.

What is the difference between Regression and Classification ML techniques?

Regression: Predicting a continuous variable. Classification: Predicting a categorical variable, typically by outputting class probabilities.

What is regularization? Why is it useful?

Regularization is a mechanism for punishing a model more and more as it becomes more complex. It is very useful when trying to combat overfitting as it keeps the model from becoming more complex than it needs to be. Two regularization techniques are L1 and L2 regularization, also known as Lasso and Ridge regularization.
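A minimal sketch of my own, assuming scikit-learn and synthetic data, showing how L2 (Ridge) shrinks coefficients while L1 (Lasso) can zero some out entirely:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=20, noise=10, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: can set coefficients exactly to zero

# Total coefficient magnitude shrinks as the penalty kicks in
print(np.abs(ols.coef_).sum(), np.abs(ridge.coef_).sum(), np.abs(lasso.coef_).sum())
```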

List the differences between supervised and unsupervised learning.

Supervised:
- Has a target variable
- Train/Test/Validation split
- Used for prediction
- Regression and Classification

Unsupervised:
- Does not have a target variable
- No Train/Test/Validation split
- Used for analysis / understanding the data
- Clustering, Dimensionality Reduction, Density Estimation

What are the various steps involved in an analytics project?

1. Identification of the problem: Clearly define the problem. Talk to business stakeholders, identify KPIs, identify possible solutions, and identify the resources needed (and already in hand) to tackle the problem.
2. Begin analysis: This mainly consists of two parts: cleaning the data and performing initial exploration/analysis on it. This helps you and the team identify possible avenues of completion and gain a deeper understanding of the project. This step also continues throughout the project; you and the team will iterate on your exploration of the data as you gain new insights.
3. Prototyping, depending on the type of project: For a machine learning project, prototypes of a few models begin development and are continuously tested and iterated on, as these are systems that take weeks to understand. For a dashboarding project, where a dashboard is created for leaders to better understand their business, dashboard prototypes are created and iterated on based on management's preferences.
4. Validation: Once the model/dashboard has been created, we move on to validation. This mainly means running the system on real-time data and checking how it performs. If it does well, it is ready to move into production. If not, the team iterates further on the design or attempts the project from a different approach.
5. Deployment: Once the prototypes are complete, we move to deployment of the model/dashboard, which requires knowledge of machine learning deployment or dashboard automation depending on the type of project. Machine learning: the model is productionized, which means creating a model endpoint so it can be accessed from the required applications, setting up automatic retraining so the model stays up to date and makes relevant, accurate predictions, and building an automated data ingestion pipeline that takes in, cleans, and stores data for retraining. Dashboarding: the dashboard is published for stakeholders and its data refreshes are automated so it stays current.

A certain couple tells you that they have two children, at least one of which is a girl. What is the probability that they have two girls?

1/3. The four equally likely combinations are GG, GB, BG, and BB. Knowing at least one child is a girl rules out BB, leaving three equally likely outcomes, only one of which (GG) is two girls. (The intuitive answer of 50% applies only if you are told that a specific child, e.g. the older one, is a girl.)

What Are Confounding Variables?

A confounding variable is a variable that affects both the independent and dependent variables. For example, say we are trying to predict the volume of a system using temperature. A confounding variable here would be pressure: it affects the volume (and is related to temperature), but we are not modeling it.

What is a confusion matrix?

A confusion matrix is a two-by-two table for a binary classification model that shows true positives, true negatives, false positives, and false negatives. It gives you a sense of how your model is doing on each class. It can also be larger (n x n) for a multiclass problem.
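A small illustration (my addition) using scikit-learn's confusion_matrix on hand-made labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# scikit-learn's convention: rows are actual classes, columns are predictions:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```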

Describe MLE in detail.

A method for fitting a distribution to your data: we pick the parameter values that maximize the likelihood of observing the data. Conceptually, we take a distribution and calculate the likelihood of the data while shifting the distribution along the axis. For example, to fit a normal distribution: FIRST, we calculate the likelihood for different means and keep the mean with the maximum likelihood. THEN, we calculate the likelihood for different standard deviations and pick the SD with the highest likelihood. More formally, maximum likelihood estimation is a statistical method for estimating the parameters of a statistical model; it is based on finding the parameter values that maximize the likelihood function, which represents the probability of observing the given data under a particular set of parameters. In practice, you find where the derivative of the (log-)likelihood is 0; this is where the likelihood is maximized.
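A hedged sketch of my own: for a normal distribution the MLEs have a closed form (the sample mean and the biased standard deviation), and a numerical maximum-likelihood fit lands in the same place. The data here are synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1000)

# Closed-form MLEs for a normal distribution: the sample mean and the
# biased (ddof=0) standard deviation, i.e. where the derivative of the
# log-likelihood equals zero.
mu_hat = data.mean()
sigma_hat = data.std(ddof=0)

# scipy fits the same parameters by maximum likelihood numerically.
mu_fit, sigma_fit = stats.norm.fit(data)
print(mu_hat, sigma_hat)
print(mu_fit, sigma_fit)
```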

What is a p-value?

A p-value is a measure of the strength of your results when performing a hypothesis test. In other words, it is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true. For example, when running a linear regression you can look at the p-values of the beta hats to determine whether the beta hats are statistically significant, i.e., whether they are DIFFERENT from 0.

What is a prior and a posterior distribution?

A prior distribution is your best estimate of a distribution before updating it with data. For example, for a possibly unfair coin, the prior over the probability of heads might be a uniform distribution. The posterior is the updated distribution, that is, the distribution that takes into account the new data you have observed.

What is a Random Forest? How does it work?

A random forest is a type of tree-based machine learning model that utilizes bagging. It can be used for classification or regression. A random forest works in the following way: first it creates many bootstrapped samples of size n from the dataset, because bagging is an ensemble method that improves generalization. It then randomly selects a subset of the available features to consider in each tree (at each split), again to increase generalizability by decorrelating the trees. After selecting the features, it fits each weak learner (individual tree). Finally, it averages the predictions (regression) or takes a majority vote (classification) to give the user the final result.
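A quick sketch (my addition, scikit-learn with synthetic data) of the two knobs described above, the number of bootstrapped trees and the size of the random feature subset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of bootstrapped trees; max_features = size of the
# random feature subset considered at each split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))
```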

What are Artificial Neural Networks?

An artificial neural network is a model consisting of an input layer (one or more input variables), one or more hidden layers of interconnected nodes with learned weights, and an output layer (one or more outputs).

What is an example of a data set with a Non-Gaussian Distribution?

An example of a dataset with a non-Gaussian distribution is one that follows a uniform distribution. When rolling a fair die, there is an equal probability of landing on any one of the 6 faces.

Can you cite some examples where a false positive is important than a false negative? And vice versa?

An example of when false positives are more important than false negatives: at United Airlines, during my project on VFSGs, we were tuning my model to prevent false positives. This was because there was a high cost associated with a false positive: we would have to send people out to inspect the plane, wasting much of their time that could have been spent on another problem. An example of when a false negative is more important than a false positive is predicting whether someone has cancer. If we predict that someone does not have cancer, they will not be checked, so a false negative means a person with cancer goes unchecked and is more likely to develop later-stage cancer.

Can you cite some examples where both false positive and false negatives are equally important?

Any situation where each type of error has equal weight. An example would be an automatic recycling robot. If the robot classifies a piece of garbage as paper when it is really plastic, or vice versa, the piece of trash goes into the wrong area. Misclassifying either is just as bad and requires the same amount of effort to correct: sending the piece to another facility.

What Is the Law of Large Numbers?

As the sample size n drawn from a population grows, the sample statistics converge toward their actual population values. According to the law, if we have a large enough, properly taken sample from the population, our sample statistics will provide a good estimate of the true population parameters.

A jar has 1000 coins, of which 999 are fair and 1 is double headed. Pick a coin at random and toss it 10 times. Given that you see 10 heads, what is the probability that the next toss of that coin is also a head?

Use Bayes' theorem: P(B|A) = P(A|B) * P(B) / P(A). First find the posterior probability of each coin given the 10 heads: P(double-headed | 10H) = (1/1000 * 1) / (1/1000 * 1 + 999/1000 * (1/2)^10) ≈ 0.506, so P(fair | 10H) ≈ 0.494. Then weight each coin's chance of heads by its posterior: P(next toss is heads) = 0.506 * 1 + 0.494 * 0.5 ≈ 0.753.
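The same arithmetic, written out as a short Python check (my addition):

```python
p_unfair, p_fair = 1 / 1000, 999 / 1000

# Likelihood of seeing 10 heads with each coin
lik_unfair, lik_fair = 1.0, 0.5 ** 10

# Posterior probability of each coin given 10 heads (Bayes' theorem)
evidence = p_unfair * lik_unfair + p_fair * lik_fair
post_unfair = p_unfair * lik_unfair / evidence
post_fair = p_fair * lik_fair / evidence

# Probability that the next toss is heads
p_next_head = post_unfair * 1.0 + post_fair * 0.5
print(post_unfair, post_fair, p_next_head)  # ~0.506, ~0.494, ~0.753
```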

Why is image augmentation important in deep learning?

Better generalization: the model is exposed to many variants of the same photo. This helps with generalization because the model has seen multiple different representations of one photo, so if it sees any of those representations it will still recognize it, unlike if we only used the one original photo. Better feature learning: the model is pushed to learn features that are invariant to shifts, flips, rotations, and lighting changes rather than memorizing exact pixel layouts. Prevents overfitting: augmentation effectively enlarges the training set, which makes it harder for the model to memorize individual images.

What are the most known ensemble algorithms?

Boosting and bagging are some well-known ensemble algorithms. Some machine learning models that use these include random forest (bagging) and XGBoost (boosting).

Describe in brief any type of Ensemble Learning.

Boosting: In boosting, we first fit a machine learning model to the dataset and get the errors from that first model. We then try to predict the errors of the first model, reducing our total error, and keep doing this until we achieve a desired metric (accuracy, AUC, etc.). This is ensemble learning because we are stacking multiple models on top of one another to achieve a better overall prediction.
Boosting pros:
- Weights the higher-accuracy and lower-accuracy samples appropriately and then gives the combined results.
- Net error is evaluated at each learning step; works well with interactions.
- Boosting techniques help when dealing with bias or underfitting.
Boosting cons:
- Can lead to overfitting (high variance).
- Increases the complexity of the classifier.
- Heavy time and computation costs.

Bagging: In bagging, we first create a lot of bootstrapped samples of the data; that is, we create many samples of size n by randomly selecting **WITH REPLACEMENT** from the training set. We then fit a machine learning model to each of these bootstrapped samples and average their results. The end result is a strong predictor.
Bagging pros:
- Helps prevent overfitting.
- During the sampling of the training data there are many overlapping observations; the combination of many learners helps overcome high variance.
- A high amount of parallelization due to the distributed (divide and conquer) nature of the algorithm.
Bagging cons:
- Can introduce bias into the model.
- If there is a lot of variability in the data (a complex relationship), it might not be able to capture it: because bagging averages many learners it tends toward a simpler, lower-variance model, which can lose nuance on a complex dataset.
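A rough comparison sketch (my addition), using scikit-learn's bagging and gradient-boosting ensembles on synthetic data; the base learners are left at their defaults:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bagging: many trees fit on bootstrapped samples, predictions averaged (variance reduction)
bag = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: trees fit sequentially, each one trying to correct the previous errors (bias reduction)
boost = GradientBoostingClassifier(n_estimators=100, random_state=0)

print(cross_val_score(bag, X, y, cv=5).mean())
print(cross_val_score(boost, X, y, cv=5).mean())
```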

What is Collaborative filtering? And a content based?

Collaborative filtering: Collaborative filtering is when a system recommends content that other similar users consume. Content Based filtering: Content based filtering is when a system recommends content that is similar to content a user consumes on the platform

What is Cluster Sampling?

Cluster sampling is when the researcher only samples from a cluster of interest from the population, say, college students. This is generally done because the researcher's area of interest would be too hard to attain data on using just random sampling. Say they are looking to analyze college students but have to sample the whole population. This will be much more cost effective if they are able to only ask college students.

What Are the Different Layers on CNN?

Convolutional layers tolerate small shifts in the image, take advantage of correlations between nearby pixels, and reduce the number of input nodes.

First, a filter is applied to the input image, for example a 3 pixel by 3 pixel square. The filter has a weight for each position in the 3x3 window, and these weights are learned by backpropagation. When an image is fed in, we take the dot product of the filter weights with the pixel values under the window, i.e., the sum of the 9 weights multiplied by their corresponding pixel values. We say the filter is convolved with the input, which is where the name CNN comes from. We then add a bias term to the dot product and place the result (convolution plus bias) into a feature map. The feature map takes advantage of correlations that might exist in the image. The feature map is then passed through a ReLU activation function.

Next we apply another window to the activated feature map, this time taking the max of a 2x2 window sliding over the map. THE WINDOW DOES NOT OVERLAP ITSELF HERE. This produces the max-pooled feature map. The areas of the image with the highest "match" to the filter get the highest values in the feature map, corresponding to where the filter did the "best job" at matching the input. Passing the image through the convolution, activation function, and max-pooling layer results in a smaller feature map that highlights where the convolutional filter most matched the input, i.e., what the model thinks are the "most important" sections of the image. This is how a CNN selects "features" automatically. If we don't want to use max pooling, we can use something else such as average pooling: instead of taking the max, we take the mean of each section of the feature map, which represents the strength of the match in each section.

Finally, the max-pooled (or average-pooled) feature map is flattened into a single vector and passed into a regular dense (fully connected) layer, then through a softmax function to produce probabilities for each class.
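A minimal sketch (my addition) of that layer ordering in Keras, assuming 28x28 grayscale inputs and 10 classes; the filter counts and sizes are arbitrary:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Convolution -> ReLU -> max pooling -> flatten -> dense -> softmax, as described above
model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),                 # e.g. 28x28 grayscale images
    layers.Conv2D(32, (3, 3), activation="relu"),   # 3x3 filters learned by backpropagation
    layers.MaxPooling2D((2, 2)),                    # non-overlapping 2x2 max pooling
    layers.Flatten(),                               # feature maps -> single vector
    layers.Dense(10, activation="softmax"),         # probabilities for each class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```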

What is correlation and covariance in statistics?

Correlation measures the strength of the linear relationship between two variables. It is standardized between -1 and 1: 1 is a perfect positive linear relationship, 0 is no linear relationship, and -1 is a perfect negative linear relationship. Covariance also measures how two variables move together, but its magnitude is not standardized to [-1, 1] and depends on the units of the variables, so it cannot be interpreted as easily.
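A tiny illustration (my addition) with NumPy on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

print(np.cov(x, y)[0, 1])       # covariance: unbounded, depends on the units of x and y
print(np.corrcoef(x, y)[0, 1])  # correlation: covariance standardized to [-1, 1]
```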

Explain cross-validation.

Cross-validation is a type of re-sampling. It is used when we want to evaluate a machine learning model using limited data. We split the training data into k folds, fit the model k times, holding out one fold each time, and test the model on the held-out fold. After doing this, we can look at the model's performance averaged over many samples.
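A short example of my own using scikit-learn's cross_val_score with k=5:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# 5-fold CV: fit 5 times, each time holding out a different fold for evaluation
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())
```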

What is Data Science?

Data science is an interdisciplinary field that integrates three core domains:
- Computer Science: This serves as the technological backbone of data science. It involves the design and implementation of algorithms and data structures to efficiently gather, store, and process large and complex datasets. Programming languages like Python and R are commonly used tools that enable data scientists to manipulate data, implement machine learning models, and conduct analyses.
- Statistics: This is the theoretical foundation of data science. Statistics involves various methodologies for summarizing, interpreting, and making inferences from data. Techniques such as hypothesis testing, regression analysis, and machine learning fall under this domain. It allows data scientists to derive meaningful insights from raw information and to create predictive models.
- Business: Often considered the application layer of data science, this domain is essential for translating technical findings into actionable insights for organizations. Data scientists must understand the business context of their analyses to recommend data-driven solutions effectively. This can involve anything from optimizing operational processes to identifying new market opportunities.

What do you mean by Deep Learning?

Deep learning refers to a subset of machine learning. The main way deep learning differs is that we use a neural network with multiple hidden layers. These hidden layers can be, for example, convolutional layers, long short-term memory layers, fully connected layers, etc.

How will you define the number of clusters in a clustering algorithm?

Depending on the clustering algorithm you're using, there are different ways to select the number of clusters; generally each algorithm has its own method. (These won't apply if, for example, the business gives you a predefined number of clusters based on its needs.)

K-means clustering: When selecting the number of clusters for K-means, we use something called the elbow plot: a graph with the number of clusters on the x-axis and the within-cluster sum of squares (WCSS) on the y-axis. There is usually a point on this plot known as the "elbow," where increasing the number of clusters no longer produces a dramatic reduction in WCSS. That is usually where we select the number of clusters.

Hierarchical (agglomerative) clustering: This approach starts by merging the two closest points into a cluster and keeps merging until one big cluster is formed, then outputs a dendrogram. On the dendrogram's y-axis we can see the distance between merged clusters. We draw a horizontal line across the longest vertical lines, as those represent the clusters that are furthest from one another, i.e., the "best" separation. (Note: DBSCAN is a different, density-based algorithm; it determines the number of clusters implicitly from its eps and min_samples parameters rather than from a dendrogram.)
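A sketch (my addition) of the elbow-plot idea using scikit-learn's KMeans, whose inertia_ attribute is the WCSS; the dataset is synthetic blobs:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Within-cluster sum of squares (inertia_) for each candidate k; the "elbow"
# is where adding more clusters stops reducing WCSS dramatically.
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```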

How Regularly Must an Algorithm be Updated?

Depending on the data and model. We must monitor KPIs related to the model. These include model performance metrics such as accuracy, ROC, f1 score, etc... As we are monitoring these metrics, we can see over time how they change. If they are starting to decay over time, then it might be time to retrain the model. An approach broadly taken for this is retraining on a set time interval, say every week or two, to keep the model up to date. We would also want to update the model if the underlying data is changing. This is kind of similar to staying with the times.

What are Eigenvectors and Eigenvalues?

Eigenvectors: vectors that, when a transformation is applied to them, only change by a scalar factor and do not change direction. Eigenvalues: the factor by which the eigenvector is stretched when the transformation is applied.

What is Ensemble Learning?

Ensemble learning is the process of taking multiple "weak" machine learning models, pooling their predictions, and combining their results in some way. Some examples of ensemble learning approaches are bagging (e.g., random forests), boosting (e.g., XGBoost), and stacking. Book definition: ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Ensembles are a **divide-and-conquer approach** used to improve performance. The main principle behind ensemble methods is that a group of "weak learners" can come together to form a "strong learner". Each classifier, individually, is a "weak learner," while all the classifiers taken together are a "strong learner".

What Is The Difference Between Epoch, Batch, and Iteration in Deep Learning?

Epoch: A single epoch is one training run through all of the data. Batch: A batch is the number of training samples to consider when calculating one gradient update for all of the weights. Iteration: An iteration is the number of batches you need to run through the entire training dataset once. For example, if we have 10,000 samples and a batch size of 50, we will have 10,000/50 or 200 iterations.

Python or R - Which one would you prefer for text analytics?

For text analytics I prefer python. The reason for this is, I am comfortable using python for everything. I also feel it is more widely supported by the community which is important when it comes to packages with NLP functionality or any other functionality. A bigger community will lead to better packages you can use to quicken your analysis of text.

How does data cleaning play a vital role in the analysis?

Garbage in equals garbage out for anything, including analyses. An analyst needs to clean his or her data because it needs to be in a specific format to perform analysis on. For example, let's say we have a column that has important numerical information stored as text. Are we just going to drop it so we can make our linear regression work? No, that would not be very smart. We need to clean that column so we can analyze the information in it. All in all, without cleaning we wouldn't have as good data to analyze, which leads to sub-optimal results.

What is bias-variance trade-off?

High variance: Variance is how much the predictions for the same x vary when the model is trained on different datasets; it is the model's sensitivity to changes in the training data. In a model with high variance, the model is too complex and fits the training data very closely, including its noise. This leads to poor performance on new, unseen data.

High bias: Bias is how much the average prediction of the model is off from the ground-truth value. In a model with high bias, the model is too simple to capture the complexities of the data, leading to underfitting. This results in poor performance on both the training data and new, unseen data.

The bias-variance tradeoff: In machine learning, the goal is to find a balance between bias and variance to achieve good generalization. You introduce enough bias to prevent overfitting, making the model simpler and more robust to new data, and you allow enough variance for the model to adapt to the complexities in the data. The ideal model minimizes the total error, which is a combination of bias, variance, and irreducible error.

Examples:
- KNN: k=1 has the highest variance, since it just copies the nearest neighbor; k=n has high bias, since it just predicts the overall mean/majority every time. You need to find a k that balances bias and variance for the most generalizable model.
- Decision tree: As you increase max depth, you decrease bias and increase variance because the tree can fit the training data more and more closely. Find the max depth at which the model fits the training data and still generalizes well.

What cross-validation technique would you use on a time series data set?

It depends on your problem. One continuous dataset: you need to "chain" the time series data. That means you fit on an initial window, predict the next period, then extend the window and repeat until you reach the latest time in the dataset. It looks like this: train: [1], test: [2] -- train: [1,2], test: [3] -- train: [1,2,3], test: [4] -- etc. Another method, when you have multiple independent series (say 10 people in a study), is to cross-validate over the subjects: train on 8 people's series, predict on the other 2, and so on, just as in regular cross-validation.
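A small illustration (my addition) of the chaining idea using scikit-learn's TimeSeriesSplit; the printed indices show that each fold trains only on earlier observations:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)

# Each split trains on earlier observations and tests on the ones that follow,
# so the model never "sees the future" during training.
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    print("train:", train_idx, "test:", test_idx)
```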

What is Linear Regression?

Linear regression is finding the best beta hats so that the resulting line fits the data while minimizing the sum of squared distances between each observation and the line. In other words, it finds the line of best fit for a dependent variable given a set of features of the data.

What is linear regression? What do the terms p-value, coefficient, and R-squared value mean? What is the significance of each of these components?

Linear regression is one of the simplest forms of machine learning. It works by taking multidimensional data (X_1, ..., X_n) and finding the line of best fit, which means minimizing the SSE (sum of squared errors). The meanings of the different components are as follows:
- P-values: A p-value measures the significance of the effect of the X variable in question on the dependent variable y. It comes from a t-test of whether the X term's coefficient is different from 0. If the p-value is lower than 0.05 (5%), as a rule of thumb, we say the coefficient is different from 0. In layman's terms, this value measures how strongly we can claim the variable is significant.
- Coefficient(s): The coefficients of a linear regression measure the effect of a one-unit change in X on y, holding everything else constant.
- R-squared: The R-squared value measures the percent of variance explained by the model. For example, say we are trying to explain demand and we have a model with an R-squared of 96%: we would say this model explains 96% of the variance in demand.
The significance of these components:
- P-values: P-values are significant because they allow us to determine the statistical significance of each of a model's features. More simply, p-values let us see which variables matter for what we're trying to predict.
- Coefficients: Coefficients are significant because they tell us the effect of each feature on the dependent variable. For a multidimensional problem, these effects come together to form the model's ability to determine y for any given x(s).
- R-squared: R-squared is significant for a few reasons. One, it allows us to determine how the model is performing relative to other iterations of the model. Two, if we add or remove a feature, we can see the effect on the R-squared value, i.e., the percent of variance each feature explains, which is very powerful when determining next steps for the business.
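A hedged sketch of my own, using statsmodels on synthetic data, showing where the coefficients, p-values, and R-squared live:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=200)

results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.params)    # coefficients (intercept first)
print(results.pvalues)   # p-values from t-tests of each coefficient against 0
print(results.rsquared)  # share of the variance in y explained by the model
```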

What is logistic regression? State an example when you have used logistic regression recently.

Logistic regression is linear regression with a sigmoid function attached on the end; it is a method for obtaining probabilities from a linear combination of feature (predictor) variables. Similar to linear regression, we can look at each feature and determine, for a one-unit increase in x_i holding everything else constant, the increase or decrease in the log-odds of y, the positive class. To interpret the coefficients, we say a one-unit increase in x_i multiplies the odds by e^(Beta_i), where odds = p/(1-p). Example of when I used logistic regression: the Titanic survival challenge on Kaggle. I used a logistic regression model to predict whether someone would survive. One interesting feature I created from the names was the passenger's title; I recall that women with a Mrs. title had on the order of 50,000,000 times higher odds, which I found funny since all the other coefficients were between 0 and about 3. I also learned that people with more than 2 siblings aboard had almost 2 lower log-odds of surviving. Another interesting feature was fare: for every dollar increase, the log-odds of survival increased by about 0.05.

What is the difference between long and wide format data?

Long format: each row is a single observation or measurement, so repeated measures for the same subject are stacked as additional rows (many rows, fewer columns). Wide format: each subject is a single row and repeated measures appear as additional columns or dimensions. Long format is usually preferable for machine learning to avoid the curse of dimensionality.

Describe MAP in detail.

MAP (maximum a posteriori) estimation multiplies the likelihood function by the prior distribution (which is proportional to the posterior), takes the log, differentiates, sets the derivative equal to 0, and solves for the theta of interest. What is this doing in English? We take what we currently believe about the parameter (the prior) and update it based on the new observations we have; MAP then picks the theta with the maximum posterior probability given our prior and the new observations.

How does MLE differ from MAP?

MLE finds the theta that maximizes the likelihood of your observed data. MAP also finds the most probable theta given your data, but it additionally incorporates a prior belief that you update using the new data.

Name loss functions and how they work.

MSE: mean of the squared residuals.
RMSE: square root of the MSE; it is in the same units as the target, which makes it easier to interpret.
MAE: mean of the absolute values of the residuals; more robust to outliers than MSE because the errors are not squared.
MAPE: mean absolute percentage error; the mean of |(y - y_hat) / y|, multiplied by 100.
Binary cross-entropy: -[y * log(p) + (1 - y) * log(1 - p)], averaged over samples, where p is the predicted probability of the positive class. The loss is near 0 when the predicted probability for the true class is near 1 and becomes very large as it approaches 0, because you're taking the log of a tiny value.
Categorical cross-entropy: -SUM over classes of y_c * log(y_hat_c). Negating the log-likelihood turns this into a minimization problem: we multiply the true (one-hot) value by the log of the probability the model assigned to that class.
Hinge loss: max(0, 1 - y * f(x)) with y in {-1, +1}; a loss for maximizing the margin, penalizing points that are misclassified or inside the margin. Usually used when training SVMs.
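Minimal NumPy implementations of these losses (my addition), evaluated on made-up numbers:

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def rmse(y, y_hat):
    return np.sqrt(mse(y, y_hat))

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mape(y, y_hat):
    return np.mean(np.abs((y - y_hat) / y)) * 100

def binary_crossentropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 8.0])
print(mse(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred), mape(y_true, y_pred))
print(binary_crossentropy(np.array([1, 0, 1]), np.array([0.9, 0.1, 0.2])))
```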

What is Machine Learning?

Machine learning is the practice of using mathematics to detect "patterns" in data that can tell us what might happen given a set of information. For example, say we have information about people, such as their height, weight, and sex, and we want to predict whether they are at risk of a heart attack. We can apply machine learning to many patients' data to learn what the likely causes of a heart attack are. Using this newfound information, we could help people who are more likely to have heart attacks take preventative measures and live longer lives.

What do you understand by the term Normal Distribution?

The normal distribution is the distribution with the familiar bell-shaped curve. What comes to mind is the 68/95/99.7 rule: the percentages of samples that fall within 1, 2, and 3 standard deviations of the mean. While not all naturally occurring distributions are normal, many are, and it can be very useful when they are! Characteristics of the normal distribution:
- Unimodal (one peak)
- Symmetrical around the mean
- Mean is the same as the median

What are the differences between over-fitting and under-fitting?

Over-fitting: Model is too complex, fits the training data very well but performs poorly on all other data. Under-fitting: Model is not complex enough. Performs poorly on both training and testing data.

How to combat Overfitting and Underfitting?

Over-fitting: There are a few ways to combat overfitting.
- Reduce model complexity.
- Stop model training earlier.
- Perform cross validation.
- Regularization
- Dropout
Under-fitting:
- Increase complexity of the model.
- Train the model longer.

What is PCA? When do you use it?

PCA or Principal component analysis is a dimensionality reduction technique. When you have a very wide data frame, it can be hard to manage all the columns, especially if you want to fit a machine learning model, visualize the data, or perform a variety of different analyses on it. One method to reduce the dimension of the data is to use PCA. PCA reduces the dimension of the data by first scaling the features. This is done because of how the principal components are calculated. Since they use distance from the mean, the distance does matter. One greater magnitude vector would "weigh" more and mistakenly "explain" more variance than a smaller magnitude vector. After scaling, the algorithm will find the best fitting line that "explains" more variance than any other. This will be done again and again until you have as many PCs as original features. Now, you have principal components that explain less and less as you go to lower ranked components. Note; the principal components are orthogonal to one another, or perfectly uncorrelated. These create an orthogonal basis. PCA is often used as a tool in exploratory data analysis and in machine learning. More information: https://en.wikipedia.org/wiki/Principal_component_analysis
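A short sketch (my addition) of scaling followed by PCA in scikit-learn, using the iris data as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Scale first so that no single feature dominates the variance calculation
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # share of variance captured by each component
```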

What is the difference between Point Estimates and Confidence Interval?

Point estimate: a single-value estimate of a population parameter; e.g., a sample mean is a point estimate of the population mean. Confidence interval: an interval that we can say, with x% confidence, contains the true value. E.g., for the mean height of males we can construct an interval at the 95% confidence level that we are 95% confident contains the true population mean.

In any 15-minute interval, there is a 20% probability that you will see at least one shooting star. What is the probability that you see at least one shooting star in the period of an hour?

Treat the four 15-minute intervals in an hour as independent: P(at least one star in an hour) = 1 - P(no star in any of the four intervals) = 1 - 0.8^4 = 1 - 0.4096 = 0.5904.

What are Entropy and Information gain in Decision tree algorithm?

Prefix: Splitting continues until all leaf nodes are pure (entropy = 0) or another stopping criterion is met.

Entropy: Entropy is what we use to find the split that gives the most information gain. We want to reach pure leaf nodes quickly, so we need to select the right splits and maximize the purity gained at each and every split. Formula: H(S) = -P(+) * log2(P(+)) - P(-) * log2(P(-)).
- Entropy 0 means a completely pure subset, e.g. 5 positives and 0 negatives: -(5/5) * log2(1) - 0 = 0 (taking 0 * log2(0) as 0).
- Entropy 1 means a maximally impure binary subset, e.g. 3 positives and 3 negatives: -(3/6) * log2(0.5) - (3/6) * log2(0.5) = 1, since log2(0.5) = -1.

Information gain: We use the formula below. For a decision tree, we go through the candidate splits, each time calculating the information gain, and take the split with the highest gain to split the data; this is where the training actually happens. Formula: Gain(S, A) = H(S) - SUM over subsets v of (|S_v| / |S|) * H(S_v), where S is the full set of samples at the node and S_v is a subset after splitting.
Example: a node with 9 yes / 5 no (H(S) ≈ 0.94) is split into (6 yes, 2 no) with H ≈ 0.81 and (3 yes, 3 no) with H = 1:
Gain ≈ 0.94 - (8/14) * 0.81 - (6/14) * 1 ≈ 0.05
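The same entropy and gain arithmetic as a small Python check (my addition):

```python
import numpy as np

def entropy(pos, neg):
    probs = np.array([pos, neg]) / (pos + neg)
    probs = probs[probs > 0]  # treat 0 * log2(0) as 0
    return -np.sum(probs * np.log2(probs))

# Parent node: 9 yes / 5 no, split into (6 yes, 2 no) and (3 yes, 3 no)
h_parent = entropy(9, 5)
gain = h_parent - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
print(round(h_parent, 3), round(gain, 3))  # ~0.940 and ~0.048
```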

What is pruning in Decision Tree?

Pruning a decision tree refers to the practice of removing sections of the tree that provide lower predictive power. This process is the opposite of splitting. We remove the splits that give us the lowest information gain.

What is reinforcement learning?

Reinforcement learning is an approach in which we train an agent (often a deep neural network) to maximize the reward it receives from a given environment. Going more in depth:
Step 1: Create the environment. We need to determine what we will pass to the agent as the state, action, and reward. A simple example is Pong: the pixels are the state, the direction we move the paddle is the action, and whether we score, get scored on, or nothing happens is the reward.
Step 2: Select a loss function that can update the gradients based on the states, actions, and rewards. For example, we can use a policy-gradient policy loss to update the weights of the network based on the three quantities above.
Step 3: Let the environment run through multiple scenarios. Every episode, or every few episodes, we stop the environment, calculate the gradients from the history of the game, and update the weights of the agent based on the rewards. The theory is that, over time, the agent improves at the game until it either converges by getting the maximum score every time or is unable to improve further.

Explain SVM algorithm in detail.

SVMs draw a hyperplane between two categories that maximizes the distance between the hyperplane and the closest points of each class. The space between the hyperplane and each category of data is called the margin, and the data points that lie on the edges of the margin are the support vectors. This hyperplane is also called the maximum-margin hyperplane.

What is sampling? How many sampling methods do you know?

Sampling is the process of randomly (or not randomly) picking observations out of the general population. The purpose of sampling is to observe or treat a small proportion of the population; theoretically, given a large enough sample size, you can then apply your findings to the whole population and observe similar results. BOOK DEFINITION: Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined. Sampling methods:
- Random sampling (self-explanatory)
- Stratified sampling: the population is split into groups with a common factor (e.g. race) and samples are randomly drawn from each of the groups.
- Cluster sampling: clusters are made (e.g. based on race) and the analyst samples only from the clusters of interest.
- Multistage sampling: similar to cluster sampling, but once the initial clusters are formed (e.g. on race), we cluster once more on a second factor (e.g. gender) and analyze the clusters of interest.
- Systematic sampling: a sample is created by extracting data at a set interval (e.g. every 10th row); for 200 rows you would then have 20 observations.
- Convenience sampling: data is collected from an easily accessible group, at the convenience of the participants.
- Consecutive sampling: data is collected from each subject that meets the required condition until the required number of samples is met.
- Quota sampling: researchers ensure equal representation of all subgroups within the sample.

What is Selection Bias? What is under coverage bias?

Selection bias is when a sample is unrepresentative of the full population. This can occur by non-random sampling from a population or a skewed sampling method. An example of this is when you ask only math students what they like to study. The sample will theoretically be heavily biased towards math! Under coverage bias occurs when a subset of a population is underrepresented in a sample and this results in biases during the final analysis.

What is Selection Bias?

Selection bias refers to bias introduced into a sample by the sampling methodology. A clear example: you are trying to determine the proportion of interest in a subject but only ask math students for their opinions. Examples:
- Sampling bias: choosing a sample that supports your claim
- Time bias: choosing a time window that supports your claim
- Data bias: choosing the data that supports your claim
- Attrition: loss of participants resulting in biased data

What is Supervised Learning?

Supervised learning is a certain type of machine learning. In supervised learning, we have data (rows), and labels associated with those rows. The rows having "labels" or a dependent column is what makes it supervised. The model can learn by "watching" how it does; adjusting its weights by predicting incorrectly or correctly.

What is Statistical Interaction?

Statistical interaction is when the effect of one factor (input variable) on the dependent variable depends on another factor. Temperature and pressure interact when trying to predict volume, they are not independent.

How Do You Work Towards a Random Forest?

Step 1: Sample N cases at random with replacement; each bootstrapped sample contains roughly two thirds of the unique observations.
Step 2: At each node, m predictor variables are selected at random from all predictor variables, and the one that provides the highest information gain is used to split. This is repeated at every node until the tree is grown.
Step 3: All trees "vote" on an answer, and that answer is presented to the user.
Some things to note:
- The greater the inter-tree correlation, the greater the random forest error rate, so one pressure on the model is to keep the trees as uncorrelated as possible.
- As m goes down, both inter-tree correlation and the strength of individual trees go down, so an optimal value of m needs to be found using cross validation.
Random forest pros:
- Fast run times
- Able to deal with missing data
Cons:
- Not able to predict beyond the range of the training data for regression tasks

Explain R-CNN in detail.

Steps in R-CNN:
1. Generating region proposals: parts of the input image where there is a higher-than-normal probability of finding an object. We use selective search for this: take an over-segmented image and repeatedly merge segments based on a similarity score (level 0 to level 1 to level 2, and so on). The output of selective search is a data frame with each region's top-left corner, height and width, and the level at which its segments were merged (some regions will be level 0, some level 1, etc.).
2. Getting CNN features: First, we take a large pretrained image classifier such as AlexNet, remove its last layer, add a new layer with the number of classes in our dataset, and fine-tune that new layer. The ground-truth boxes have top-left (x1, y1) and bottom-right (x2, y2) coordinates, so we know where they are. We then calculate the intersection over union (IoU) for each region proposal: a value between 0 and 1 that is essentially the fraction of overlap between the region proposal and the ground truth. For fine-tuning, a proposal with IoU > 0.5 is treated as a positive for that class, otherwise as a negative. Selective search usually gives a large number of region proposals; we save the IoU values for training later models. Since the CNN takes fixed-size inputs, we need to warp/scale the (usually rectangular) region proposals.
3. Classifying region proposals using SVMs: Once we have the fine-tuned network, we remove its last layer and pass in all region proposals, giving us a feature vector of size 4096 (the size of the second-to-last AlexNet layer) for each proposal. We then train one SVM per class; each SVM classifies whether the object it was trained on is present in a region proposal. For example, for a dog proposal we pass it through the CNN, get the 4096-dimensional feature vector from the last hidden layer, and pass it to the dog SVM to get a score for a dog being present. To train the SVMs, ground-truth boxes are used as positives, while regions with an IoU above roughly 0.3 are labeled NON-NEGATIVE (i.e., ignored unless they are ground truth) and the rest are negatives.

What are the various algorithms?

Supervised (Classification and Regression): - Linear Regression - Logistic Regression (Linear with a Sigmoid Activation Function) - Random Forest - XGBoost - SVM - Deep Learning Models

What is Survivorship Bias?

Survivorship bias occurs when incorrect conclusions are drawn from a sample in which the failures are not considered because they have been excluded or dropped out of the sample. A clear example is celebrities: everyone can look at them and draw incorrect conclusions, thinking success is easy when it really is not. Extra definition: survivorship bias is the logical error of focusing on the aspects that survived some process and casually overlooking those that did not, because of their lack of prominence.

What is Systematic Sampling?

Systematic Sampling is when researchers select samples from the population at a regular interval. I.e. every 15th person that walks into a coffee shop. This could be useful for the following reason, say you have a group of friends that are walking together. Collecting data from all of them could introduce bias into your sample because they might share some values/features that would sway your analysis one way or another.

What is TF/IDF vectorization?

TF-IDF is short for term frequency-inverse document frequency. It is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus, and it is often used as a weighting factor in information retrieval and text mining. The formulas: TF = (# of times "word" appears in the doc) / (total # of words in the doc), i.e., the proportion of the document made up of that word. IDF = log((total # of docs in the collection) / (# of docs containing "word")), i.e., the log of the inverse of the proportion of documents the word appears in, so rare words get higher weight. The TF-IDF score is TF * IDF.
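A tiny example of my own using scikit-learn's TfidfVectorizer (recent versions) on a toy corpus; note that scikit-learn's exact TF-IDF formula differs slightly from the one above (it uses smoothing and normalization):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)  # rows = documents, columns = terms
print(vec.get_feature_names_out())
print(tfidf.toarray().round(2))
```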

Explain a transformer in detail.

A transformer takes in numerical values; in the case of ChatGPT, it takes in encoded vectors.

Word embeddings: A word embedding model is used for each word (token) in the sentence, which gives us the flexibility to handle input sentences of different lengths. The weights of the embedding network are learned using backpropagation.

Positional encoding is used to keep track of word order. One method: convert the positions in the sentence into positional encodings built from a combination of sine and cosine functions. We get as many values from the sine and cosine functions as there are numbers in each word's embedding vector, so if word 1 has 4 numbers in its embedding vector, the combination of sine and cosine functions gives it a positional vector of length 4. This is done for every word in the sentence, and the sine and cosine functions get wider for each additional embedding position. We then add the positional values to the embedding values, giving word embedding plus positional encoding for the whole sentence. NOTE: if we switch the positions of words in the sentence, the positional encodings change, showing that the words are in different spots.

Next is self-attention. Self-attention calculates the similarity between each word and all other words in the sentence, including itself: SELF-ATTENTION IS THE RELATIONSHIP AMONG THE WORDS. To do this, each word's (embedding + positional encoding) vector is multiplied by sets of learned weights to produce query (and key/value) values: every individual position in the vector has its own weights, and every word shares the same weights for the same position (index 1 of "let's" uses the same weights as index 1 of "go"). After doing this for "let's", for example, we have a new set of values representing it, i.e., (embedding + positional encoding) * weights, and the attention scores are computed from these.
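A small NumPy sketch (my addition) of sinusoidal positional encodings in the style of the original transformer paper; the sentence length and embedding size are arbitrary:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings in the style of 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]   # word positions 0..seq_len-1
    i = np.arange(d_model)[None, :]     # embedding dimensions 0..d_model-1
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angle[:, 0::2])  # even dimensions use sine
    enc[:, 1::2] = np.cos(angle[:, 1::2])  # odd dimensions use cosine
    return enc

# 4-dimensional encodings for a 3-word sentence; these get added to the word embeddings
print(positional_encoding(3, 4).round(3))
```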

What is 'Naïve' in a Naïve Bayes?

The "Naïve" in Naïve Bayes means that the model assumes all feature variables are independent of one another given the class. This is rarely true but useful in practice. In other words, it is the assumption of conditional independence between every pair of features given the value of the class variable. The Naïve Bayes classifier then uses maximum a posteriori (MAP) estimation to estimate P(y) and P(x_i | y); the former is the relative frequency of class y in the training set. MAP is a method that involves calculating the conditional probability of observing the data given a model, weighted by a prior belief about the model: MAP is argmax p(theta | data), in contrast with MLE, which is argmax p(data | theta).

What is a Box-Cox Transformation?

The Box-Cox transformation is a method for transforming a non-normal dependent variable into an approximately normal shape so that the residuals are more normally distributed. It can improve the predictive results of linear models such as linear regression, improve the normality of the data, and stabilize the variance, which lets you run a larger number of tests that require (approximate) normality.
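A brief illustration (my addition) using scipy.stats.boxcox on synthetic right-skewed data; lambda is chosen by maximum likelihood:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=1000)      # right-skewed, clearly non-normal

transformed, lam = stats.boxcox(skewed)             # lambda chosen by maximum likelihood
print(lam)
print(stats.skew(skewed), stats.skew(transformed))  # skewness moves toward 0
```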

Explain how a AUC ROC curve works?

The ROC curve plots the true positive rate, or sensitivity (TP/(TP+FN)), against the false positive rate, or 1 - specificity (FP/(FP+TN)), as the classification threshold is varied. The AUC is the area under this curve, i.e., the integral between a false positive rate of 0 and 1. The curve tells us the best trade-off we can achieve between true positive rate and false positive rate with our model. For a perfect classifier the AUC is 1: it can reach a true positive rate of 1 while the false positive rate is still 0, so the curve sits at y = 1 across the whole range.
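A short sketch of my own using scikit-learn: roc_curve gives the (FPR, TPR) points at each threshold and roc_auc_score gives the area under them; the data and model are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, probs)  # one (FPR, TPR) point per threshold
print(roc_auc_score(y_test, probs))              # area under that curve
```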

What do you understand by statistical power of sensitivity and how do you calculate it?

Sensitivity (recall) is TP/(TP+FN): the proportion of actual positives that the test correctly identifies. Statistical power is the analogous idea for a hypothesis test: the probability of correctly rejecting the null hypothesis when it is actually false (1 - beta, i.e., one minus the false negative rate).

What is the Central Limit Theorem and why is it important?

The central limit theorem states that the distribution of the sum (or average) of a large number of independent, identically distributed random variables approaches a normal distribution, irrespective of the shape of the original distribution.

Example: Imagine you are a quality control engineer at a factory that produces light bulbs. Each light bulb is supposed to have a lifespan of approximately 1,000 hours, but the lifespan varies due to minor manufacturing imperfections. The lifespan distribution of a single bulb is not normally distributed; it's skewed to the right, meaning there are more bulbs that last slightly less than 1,000 hours than there are bulbs that last significantly longer. Your goal is to ensure that the average lifespan of light bulbs produced in your factory stays as close to 1,000 hours as possible. To do this, you periodically take samples of n=30 bulbs from the production line and measure their lifespans. The Central Limit Theorem tells you that even if the individual lifespans are not normally distributed, the distribution of the sample means (averages) will approach a normal distribution as long as the sample size is sufficiently large (usually n≥30 is considered large enough for most practical purposes).
1. Sampling: Every day, you take a sample of 30 bulbs and calculate the average lifespan of that sample.
2. Normal Distribution: Thanks to the CLT, you know that these sample averages will be approximately normally distributed around the actual average lifespan of all bulbs produced.
3. Control Limits: Knowing the distribution is normal, you can calculate control limits, values within which you expect the sample mean to fall a certain percentage of the time, say 95% or 99%.
4. Monitoring: If you find a sample mean that falls outside these control limits, it's a signal that something might be wrong in the manufacturing process, and you can take corrective action.

What Is the Cost Function?

The cost function tells us and the model how good or bad it is performing. The cost function is what tells the neural network how to update the weights and biases based on the results.

What is the Curse of Dimensionality?

The curse of dimensionality is when we have too many dimensions in the data. This results in very sparse observations which makes it hard for a machine learning model to learn on. It is hard to learn because it doesn't have enough examples of each scenario. In other words, as you increase the dimensions, you need more data.

What is the goal of A/B testing?

The goal of A/B testing is to determine whether A is different from B and which one is better. We do this by collecting data, running a hypothesis test on the metrics we are interested in, and seeing which process, method, or design is better, or whether they make a difference at all. An example would be testing whether certain website designs lead to more sales: we would randomly show site visitors one version and run a hypothesis test to see if the difference in average sales is statistically significant.

What is the difference between machine learning and deep learning?

The main difference between machine learning and deep learning lies in how data is presented to the system. Classical machine learning algorithms almost always require structured data, while deep learning relies on layers of artificial neural networks. Machine learning algorithms are designed to "learn" from labeled data and then use what they learned to produce results on new datasets; when the result is incorrect, there is a need to "teach" them. Because classical machine learning algorithms require structured, labeled data, they are less suited to complex problems involving huge amounts of raw data. Deep learning networks require less human intervention, as the multilevel layers of a neural network arrange data in a hierarchy of different concepts and ultimately learn from their own mistakes; however, even they can be wrong if the data quality is not good enough. Data decides everything: the quality of the data ultimately determines the quality of the results (garbage in equals garbage out). Both of these subsets of AI rely on data to represent a certain form of "intelligence." However, you should be aware that deep learning requires much more data than a traditional machine learning algorithm, since deep networks typically need more than a million data points to identify the distinct elements in their layers, whereas machine learning algorithms can learn from pre-engineered criteria and features.

Explain Star Schema.

The star schema is a traditional database schema with a central table. Satellite tables map IDs to physical names or descriptions and can be connected to the central table on an ID column or any other key pair.

Can you explain the difference between a Validation Set and a Test Set?

The validation set is used to choose a model, tune hyperparameters, and help prevent overfitting. It is used during training, often as part of cross validation. The test set is data the model does not see until the very end; it is used to measure the model's generalizability and is meant to represent truly "unseen" data.

What Are the Types of Biases That Can Occur During Sampling?

There are a few different examples of sampling bias:
- Selection bias: Selection bias occurs when your sample incorrectly represents the population, which can arise from non-random forms of sampling. For example, asking only math students what their favorite subject is will not be representative of the population.
- Undercoverage bias: Undercoverage bias occurs when you exclude a subpopulation from the sample. For example, if you gather data only from a neighborhood that is predominantly white, other races are excluded and therefore underrepresented in the sample.
- Survivorship bias: Survivorship bias occurs when conclusions are drawn from a sample in which the failures have dropped out or are no longer being considered. An example is a successful software engineer: people might draw incorrect conclusions based on him without looking at everyone who failed along the C.S. path, and think it's easier than it actually is.

How can outlier values be treated?

There are a few different ways you can treat outliers.
1. You can simply remove them from your dataset. This can be reasonable, but in the process you will lose data.
2. Understand why they are outliers. By understanding why, you can more accurately assess what to do with them. If you decide to keep them in your dataset, you could choose a model that is more robust to outliers, such as a tree-based model.
3. You can transform the data so that outliers have less of an effect on the overall result. For example, standardization (or a log transform) can reduce the influence of outliers.

What, in your opinion, is the reason for the popularity of Deep Learning in recent times?

There are a few possible reasons for the recent rise in popularity of deep learning. 1. ChatGPT: ChatGPT has brought a ton of attention to deep learning in the past year. It can closely mimic human language and serves many people as a personal assistant, helping with creative and logical tasks alike. 2. Data volume: Deep learning models need a lot of data, and we have more data than ever. Devices and the internet constantly generate massive amounts of it. Deep learning excels at using this data to improve, so as data availability grows, the usefulness and effectiveness of deep learning grows too. 3. Increase in hardware performance: With the rapid growth in computing power, it has become feasible to meet the steep hardware requirements for training these models.

What are the different kernels in SVM?

There are many different kernels that SVM supports so that it can accurately capture different types of separation boundaries. These include: - Linear kernel: straight hyperplane - Polynomial kernel: non-linear relationships - Radial Basis kernel: more complex relationships such as a circle within a circle - Sigmoid kernel: Specifically for two class classification.
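The kernel is just a constructor argument in scikit-learn. Here is a minimal sketch on a synthetic "circle within a circle" dataset, where the RBF kernel is the natural fit; the dataset parameters are arbitrary.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel).fit(X, y)
    # The linear kernel struggles here; the RBF kernel separates the rings cleanly
    print(kernel, clf.score(X, y))
```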

What Are the Drawbacks of the Linear Model?

There are many drawbacks of a linear model. 1. They require many assumptions to be true to use them to their fullest extent. 2. They are "Linear" and can only capture linear relationships between variables. This greatly limits their flexibility as most relationships are non-linear.

What are the assumptions of Linear Regression?

There are multiple assumptions you need to verify to be able to use linear regression. - Independent features (no multicollinearity): Without this, we cannot reliably separate the effect of correlated features (i.e. if x and z are highly correlated and both used to predict y, how do we know how much of the effect on y belongs to each?). - Linearity of the data: Because of the structure of the model, each feature needs to have a linear relationship with the dependent variable; if not, the model will be less reliable. - Homoscedasticity (constant variance of the errors): Without this, the model is "pulled" more by the section of the data with higher variance, and is therefore not equally reliable across the whole dataset. - Normality of errors (mean 0): We want the errors (residuals) to be normally distributed with mean 0. If this assumption fails, the p-values and t-statistics are unreliable because they are computed from the residuals, and the confidence intervals for predictions are less trustworthy. A non-zero mean can also indicate bias in the model (systematically under- or overestimating y) or a missing important feature. - Independent and identically distributed errors: Linear regression also requires the errors to be i.i.d. This helps establish that the model is the Best Linear Unbiased Estimator (BLUE). If the errors are not i.i.d., we may be missing important variables and cannot fully trust the p-values. Independence of the errors also lets us treat the dependent variable as depending only on the features we gave the model, which simplifies interpretation and generalization. - Omitted variable bias: Leaving out a relevant variable that is correlated with the included features biases the estimated coefficients.
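
A hedged sketch of how a couple of these checks are commonly done with statsmodels, on synthetic data; the VIF threshold of roughly 5-10 is a rule of thumb, not a hard rule.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
y = 2 * X["x1"] - X["x2"] + rng.normal(scale=0.5, size=200)

# Multicollinearity check: VIF per feature (values well above ~5-10 are a warning sign)
Xc = sm.add_constant(X)
vifs = {col: variance_inflation_factor(Xc.values, i) for i, col in enumerate(Xc.columns)}
print(vifs)

# Fit the model and inspect the residuals for mean ~0, constant variance, and normality
model = sm.OLS(y, Xc).fit()
print("residual mean:", model.resid.mean())
```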

How would you build a recommendation system?

There are multiple moving parts within a recommendation system; I will go through them part by part. Recommendation methods: At the core of any recommendation system is the recommendation model we use (collaborative filtering, content-based, popularity-based, classification-based, demographic-based, etc.). System: A request first goes from the client, to a load balancer, to a distributed recommendation-system node. Once at this node, a few things need to happen. Since the user is just initializing their session, we need to quickly pull a few videos to feed the user. This can be done by retrieving the user's history from a user database, finding videos with similar embeddings, and serving those immediately. This method is very fast and works well for session initialization. After the session has begun and the user has started scrolling through videos, we can employ a second, more advanced recommendation model, for example a hybrid of collaborative filtering and a content-based model.
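
A minimal sketch of the "find videos with similar embeddings" step using cosine similarity; the embeddings here are random placeholders standing in for vectors produced by some upstream model.

```python
import numpy as np

# Toy embeddings: rows are videos; the query vector is built from the user's watch history
video_embeddings = np.random.rand(1000, 64)
user_vector = video_embeddings[[3, 17, 42]].mean(axis=0)  # average of recently watched videos

# Cosine similarity between the user vector and every video
norms = np.linalg.norm(video_embeddings, axis=1) * np.linalg.norm(user_vector)
scores = video_embeddings @ user_vector / norms

top_k = np.argsort(scores)[::-1][:10]  # indices of the 10 most similar videos to serve first
```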

During analysis, how do you treat missing values?

There are multiple ways in which we can treat missing values; depending on the case, we will treat them differently. Option 1, Analysis: Before deciding what to do with the missing values, analyze them. If there is a pattern in which values are missing, that can itself become a valuable business insight. Try to figure out why they are missing and whether there is a pattern, and decide what to do with them based on that analysis. If most of a column is missing, it might make more sense to drop the entire column. Option 2, Remove them: We can simply remove the rows with missing values. This is a good option if 1. the percentage of missing data is low relative to the total data, or 2. we can afford to collect more data. Option 3, Imputation: The other common method of filling missing values is imputation, which can be done in many different ways. For example, mean imputation uses the mean of a column to fill all missing values in that column; this can be inaccurate because all rows are different and a single mean hides that variation. Alternatively, a machine learning model can be trained to predict the missing values; this can be more accurate but requires more time and computing resources.
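
A short pandas sketch of options 2 and 3; the column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40], "income": [50_000, 62_000, np.nan, 58_000]})

# Option 2: drop rows that contain any missing value
dropped = df.dropna()

# Option 3: mean imputation, filling each column's gaps with that column's mean
imputed = df.fillna(df.mean(numeric_only=True))
```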

How Are Weights Initialized in a Network?

There are two common options: initializing the weights randomly (with small values close to 0) or initializing them all to 0; random initialization is by far the most common. The problem with initializing all weights to 0 is symmetry. Since we usually train with stochastic gradient descent or some variant of it, the weights need to be drawn randomly from, say, a uniform or normal distribution. If they all start at the same value, every weight receives the same update and the units remain redundant, resulting in a model that is no more useful than a linear model.
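
A small numpy sketch of the contrast: zero initialization makes every hidden unit identical, while small random initialization breaks the symmetry. The layer sizes and the 0.01 scale are arbitrary illustrative choices.

```python
import numpy as np

n_in, n_hidden = 4, 3

# Zero initialization: every hidden unit computes exactly the same thing and
# receives exactly the same gradient, so the units stay redundant forever.
W_zero = np.zeros((n_in, n_hidden))

# Small random initialization (here from a scaled normal distribution) breaks the
# symmetry, so each unit can learn a different feature.
rng = np.random.default_rng(0)
W_random = rng.normal(loc=0.0, scale=0.01, size=(n_in, n_hidden))
```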

Describe the structure of Artificial Neural Networks?

There can be various structures; one simple structure is a set of input variables, one fully connected hidden layer, and then the outputs. More specifically, the input is connected to a hidden layer through a set of weights and biases and an activation function, and that hidden layer is in turn connected to the output through another set of weights and biases.
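
A minimal numpy sketch of that structure, a single forward pass through one hidden layer; the sizes and the ReLU activation are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 5))                       # one sample with 5 input features

W1, b1 = rng.normal(size=(5, 8)), np.zeros(8)     # input -> hidden weights and biases
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)     # hidden -> output weights and biases

hidden = np.maximum(0, x @ W1 + b1)               # hidden layer with a ReLU activation
output = hidden @ W2 + b2                         # output layer (e.g. a single regression output)
```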

Differentiate between univariate, bivariate and multivariate analysis?

These are just what they sound like. Univariate: analyzing a single column/feature. For example, if we are wondering why students are getting into college and look only at their GPA, that is univariate analysis; a histogram of GPA, used to check the distribution of a single feature, is a typical example. Bivariate: analyzing two variables together. Continuing the example, this would be looking at GPA and ACT/SAT score when analyzing why students are or are not getting into school; a scatter plot with GPA on one axis and ACT/SAT on the other is a typical example. Multivariate: using many variables in the analysis. An example would be building a logistic regression model to predict whether a student will get into a specific school; given a sound model, we can see the effect of each feature on the admission outcome no matter how many features we include.
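
A quick sketch of the univariate and bivariate examples with matplotlib, using synthetic GPA/SAT data invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
gpa = np.clip(rng.normal(3.2, 0.4, 500), 0, 4)
sat = np.clip(1000 + 150 * (gpa - 3.2) + rng.normal(0, 80, 500), 400, 1600)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(gpa, bins=20)             # univariate: distribution of a single feature
ax1.set_xlabel("GPA")
ax2.scatter(gpa, sat, alpha=0.4)   # bivariate: relationship between two features
ax2.set_xlabel("GPA")
ax2.set_ylabel("SAT")
plt.show()
```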

What are the support vectors in SVM?

Support vectors are the training data points that lie closest to the decision boundary, on the edge of the margin (or inside it for soft-margin SVMs). These points are the ones that determine where the separating hyperplane sits; moving any other point would not change the boundary.
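
In scikit-learn the support vectors of a fitted model can be inspected directly; a tiny sketch on synthetic blobs.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

print(clf.support_vectors_)   # the training points lying on or inside the margin
print(clf.n_support_)         # how many support vectors each class contributed
```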

If you are having 4GB RAM in your machine and you want to train your model on 10GB data set. How would you go about this problem? Have you ever faced this kind of problem in your machine learning/data science experience so far?

This is an important part of machine learning, because in some cases you won't have access to the hardware required to simply train your model on the full dataset at once; you will have to use alternative techniques to train on all of the data. Method 1: Lazy loading. This method is used when the dataset is too large to fit in memory. It loads data only as it is needed (for example, in chunks or batches), avoiding ever holding the entire dataset in memory.
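
One hedged way to combine lazy loading with incremental training in scikit-learn: read the file in chunks and feed each chunk to a model that supports partial_fit. The file path and column names are placeholders, and the loss name may differ across scikit-learn versions.

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Logistic regression trained with SGD (older scikit-learn versions name this loss "log")
model = SGDClassifier(loss="log_loss")
classes = [0, 1]  # partial_fit requires the full set of classes up front

# Read the large file in manageable chunks instead of loading it all into RAM
for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
    X = chunk.drop(columns=["target"])
    y = chunk["target"]
    model.partial_fit(X, y, classes=classes)
```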

What Will Happen If the Learning Rate Is Set inaccurately (Too Low or Too High)?

Too low: If the learning rate is set too low, the model will take an unnecessary amount of time to train. However, the model will still learn. Another situation the model might find itself in is being stuck at a local minimum. Having low learning rates will make it harder for a model to climb out of a local minimum because it doesn't update the weights enough to get the model out of that crevice in space (if the loss function is non-convex). Too high: If the learning rate is too high, the gradient updates will have too high of an impact on the model weights. This will result in catastrophic forgetting. This is where the model's established weights and paths will be adjusted too much and cause the model to "forget" what it has learned. It will also make it harder for the model to converge. If we are updating the gradients with larger numbers, there will be a lot of oscillations in the model. The model will constantly over and undershoot the optimal values.
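
A tiny gradient descent sketch on f(x) = x², whose gradient is 2x, that makes both failure modes visible; the step count and rates are arbitrary.

```python
def gradient_descent(lr, steps=50, x=5.0):
    # Minimize f(x) = x^2; the optimum is x = 0
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

print(gradient_descent(lr=0.001))  # too low: barely moves toward 0 in 50 steps
print(gradient_descent(lr=0.5))    # well-chosen: reaches 0 immediately
print(gradient_descent(lr=1.1))    # too high: each update overshoots and the iterates diverge
```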

What is the difference between type I vs type II error?

Type I error (False Positive): rejecting a null hypothesis that is actually true; we conclude an effect or positive outcome exists when in reality it does not. Type II error (False Negative): failing to reject a null hypothesis that is actually false; we conclude there is no effect when in reality there is one.

What's the difference between unidirectional LSTM and bidirectional LSTM

Unidirectional: the LSTM reads the sequence in one direction only (past to future), so each output can use only past context. Bidirectional: two LSTMs process the sequence, one forward and one backward, and their outputs are combined, so the model can use both past and future context. A bidirectional model needs the full sequence (or a buffer of future timesteps) before it can be used for prediction.
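
A hedged Keras sketch of the two variants; the layer sizes and input shape are arbitrary.

```python
import tensorflow as tf

# Unidirectional: reads the sequence left to right, so each step only sees past context
uni = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 16)),   # (timesteps, features)
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),
])

# Bidirectional: one LSTM reads forward and one backward, so each step sees past and future context
bi = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 16)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1),
])
```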

What is Unsupervised learning?

Unsupervised learning is modeling data without using a label. Some examples of unsupervised learning algorithms are clustering, dimensionality reduction, and density estimation. Why are these useful? It may not be immediately obvious. Clustering, for example, groups samples into different clusters, and those clusters will have different sample statistics; by comparing the groups you can learn what makes them similar or different. For example, say you are a business and you have just clustered your customers, and one of the clusters has a much higher average spend. You now know the other statistics of that group and can research it further.
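
A small sketch of the customer-clustering example with scikit-learn's KMeans; the spending data is synthetic and k=3 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
customers = pd.DataFrame({
    "annual_spend": rng.gamma(shape=2.0, scale=500, size=300),
    "visits_per_month": rng.poisson(lam=4, size=300),
})

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
customers["cluster"] = kmeans.fit_predict(customers)

# Compare average spend across clusters to find the high-value group
print(customers.groupby("cluster")["annual_spend"].mean())
```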

Why we generally use Soft-max (or sigmoid) non-linearity function as last operation in-network? Why RELU in an inner layer?

We generally use the soft-max function as the last operation in a classification network because we want a probability distribution over the classes: soft-max takes the vector of values from the last hidden layer and turns it into a probability distribution. We use ReLU on the inner layers to avoid the vanishing gradient problem. Because the derivative of ReLU is either 0 or 1, nodes with a positive input do not have their gradient shrink at all: when back-propagating you multiply by 1 at each such layer rather than by a small decimal that would make the gradient quickly approach 0. Another benefit of ReLU is that it leads to sparsity: many nodes output exactly 0, so they can effectively be ignored, which makes computing the output faster and the model effectively simpler.
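
Both functions are one-liners in numpy; a small sketch of a numerically stable softmax alongside ReLU and its derivative.

```python
import numpy as np

def softmax(z):
    # Subtracting the max is a standard numerical-stability trick; it doesn't change the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

def relu(z):
    return np.maximum(0, z)

z = np.array([2.0, -1.0, 0.5])
print(softmax(z), softmax(z).sum())    # a valid probability distribution (sums to 1)
print(relu(z), (z > 0).astype(float))  # ReLU output and its gradient: 0 or 1 per element
```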

Explain Decision Tree algorithm in detail.

When you fit a decision tree, it recursively chooses the feature and split point that produce the largest improvement in the split criterion. For regression trees this is usually the largest reduction in SSE (variance); for classification trees it is usually information gain, i.e. the reduction in entropy (or, alternatively, in Gini impurity). The tree keeps splitting until it hits a stopping rule, most commonly the maximum depth we have set; without a max depth or a similar constraint, the tree will massively overfit the training data.
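
A small numpy sketch of entropy and information gain for one candidate split, using toy labels and base-2 logarithms.

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of a label array, in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 1, 1, 1, 1, 1])
left, right = parent[:3], parent[3:]   # a candidate split; here it separates the classes perfectly
print(information_gain(parent, left, right))
```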

Why is Re-sampling done?

Why we resample depends on the use case. Bootstrapping: resampling from the observed data n times with replacement to estimate the sampling distribution of any desired statistic. Cross validation: splitting a dataset into k folds and iterating over them, holding each one out as the test set in turn; this helps our models become more generalizable and gives a better estimate of out-of-sample performance. Up-sampling (balancing classes): balancing the data by randomly re-sampling observations from the minority class.
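
A quick numpy sketch of the bootstrap: resample the observed data with replacement to approximate the sampling distribution of the mean. The sample itself and the number of replicates are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=100)   # the observed sample

B = 5000
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(B)
])

# Approximate 95% confidence interval for the mean, read off the bootstrap distribution
print(np.percentile(boot_means, [2.5, 97.5]))
```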

How can you generate a random number between 1 - 7 with only a die?

Roll the die twice. Two rolls give 36 equally likely outcomes. Discard one of them (say the roll 6, 6) by re-rolling whenever it comes up, leaving 35 equally likely outcomes. These divide evenly into 7 groups of 5, so each number 1 through 7 has a 5/35 = 1/7 chance of being produced.
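
A tiny simulation of the scheme, re-rolling on the one discarded outcome, to check that each value 1 through 7 comes up about 1/7 of the time.

```python
import random
from collections import Counter

def rand7():
    while True:
        a, b = random.randint(1, 6), random.randint(1, 6)
        outcome = (a - 1) * 6 + (b - 1)   # 0..35, each equally likely
        if outcome < 35:                  # discard the 36th outcome (6, 6) and re-roll
            return outcome // 5 + 1       # 35 outcomes split into 7 groups of 5

counts = Counter(rand7() for _ in range(70_000))
print(sorted(counts.items()))             # each of 1..7 should appear roughly 10,000 times
```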

