Data Science Review

Deep Learning

Deep Learning is a kind of Machine Learning in which neural networks are used to imitate the structure of the human brain; just as a brain learns from information, machines are made to learn from the information provided to them. In Deep Learning, the neural networks comprise many hidden layers (which is why it is called 'deep' learning) that are connected to each other, and the output of the previous layer is the input of the current layer.

random forest

It combines multiple models together to get the final output or, to be more precise, it combines multiple decision trees together to get the final output. So, decision trees are the building blocks of the random forest model.

law of large numbers

It is a theorem that describes the result of performing the same experiment a large number of times. This theorem forms the basis of frequency-style thinking. It states that, as the number of trials grows, the sample mean, sample variance, and sample standard deviation converge to the quantities they are trying to estimate.

undercoverage bias

when part of the population has a reduced chance of being included in a sample

You have run the association rules algorithm on your dataset, and the two rules {banana, apple} => {grape} and {apple, orange} => {grape} have been found to be relevant. What else must be true?

{grape, apple} must be a frequent itemset

boosting

Boosting is one of the ensemble learning methods. Unlike bagging, it does not train models in parallel. In boosting, we create multiple models and train them sequentially, combining weak models iteratively so that training a new model depends on the models trained before it. In each iteration, we give more weight to the observations in the dataset that were incorrectly handled or predicted by the previous models, so the new model focuses on the patterns the earlier models missed. Boosting is useful for reducing bias in models as well.
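
A minimal sketch of boosting, assuming scikit-learn's AdaBoostClassifier and a synthetic dataset (both are illustrative assumptions, not part of the original card):

```python
# Boosting sketch: sequentially trained weak learners with AdaBoost.
# The synthetic dataset and hyperparameters are assumptions for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each new weak learner (a shallow decision tree by default) is trained
# sequentially, with more weight given to samples misclassified so far.
model = AdaBoostClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```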

Data Analytics vs Data Science

Data Analytics is a subset of Data Science. The goal of data analytics is to illustrate the precise details of retrieved insights. It focuses on just finding the solutions. A data analyst's job is to analyze data in order to make decisions. Data Science is a broad technology that includes various subsets such as Data Analytics, Data Mining, Data Visualization, etc. The goal of data science is to discover meaningful insights from massive datasets and derive the best possible solutions to resolve issues. Data Science not only focuses on finding the solutions but also predicts the future with past patterns or insights. A data scientist's job is to provide insightful data visualizations from raw data that are easily understandable.

stacking

Just like bagging and boosting, stacking is also an ensemble learning method. In bagging and boosting, we could only combine weak models that used the same learning algorithms, e.g., logistic regression. These models are called homogeneous learners. However, in stacking, we can combine weak models that use different learning algorithms as well. These learners are called heterogeneous learners. Stacking works by training multiple (and different) weak models or learners and then using them together by training another model, called a meta-model, to make predictions based on the multiple outputs of predictions returned by these multiple weak models.
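
A minimal sketch of stacking heterogeneous learners, assuming scikit-learn's StackingClassifier on synthetic data (the base learners and meta-model are illustrative choices):

```python
# Stacking sketch: different base learners combined by a meta-model.
# Synthetic data and model choices are assumptions for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Heterogeneous base (weak) learners...
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svc", SVC(probability=True, random_state=0)),
]
# ...combined by a meta-model trained on their predictions.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
print("Test accuracy:", stack.score(X_test, y_test))
```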

precision

Precision: When we implement algorithms for the classification of data or the retrieval of information, precision is the fraction of predicted positive values that are actually positive. In other words, it measures how accurate the positive predictions are. Below is the formula to calculate precision: true positives / (true positives + false positives)

long format data vs wide format data

Long Format Data: Long format data has a column for the possible variable types and a column for the values of those variables. Each row in the long format represents one time point per subject, so each subject will have many rows of data. This format is most typically used in R analysis and for writing to log files at the end of each experiment. Wide Format Data: Wide data has a column for each variable. The repeated responses of a subject appear in a single row, with each response in its own column. This format is most widely used in data manipulation and in stats programs for repeated-measures ANOVAs, and is seldom used in R analysis.
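
A small sketch of converting between the two formats with pandas (the toy DataFrame is an assumption for illustration):

```python
# Wide <-> long conversion sketch; the toy data is made up for illustration.
import pandas as pd

# Wide format: one row per subject, one column per repeated measurement.
wide = pd.DataFrame({
    "subject": ["s1", "s2"],
    "t1": [5.0, 6.1],
    "t2": [5.4, 6.3],
})

# Long format: one row per subject per time point (variable/value columns).
long = wide.melt(id_vars="subject", var_name="time", value_name="score")
print(long)

# And back to wide format.
wide_again = long.pivot(index="subject", columns="time", values="score").reset_index()
print(wide_again)
```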

p-value

The p-value is a measure of the statistical significance of an observation. It is the probability of seeing a result at least as extreme as the one observed, assuming the null hypothesis is true. We compute the p-value from the test statistic of a model, and it typically helps us choose whether we can accept or reject the null hypothesis.

outlier

You can drop an outlier only if it is a garbage value. Example: height of an adult = abc ft. This cannot be true, as the height cannot be a string value; in this case, the outlier can be removed. If the outliers have extreme values, they can also be removed. For example, if all the data points are clustered between zero and 10 but one point lies at 100, then we can remove this point. If you cannot drop outliers, you can try the following: Try a different model; data detected as outliers by linear models can be fit by nonlinear models, so be sure you are choosing the correct model. Try normalizing the data; this way, the extreme data points are pulled into a similar range. Use algorithms that are less affected by outliers; an example would be random forests.

select k for k-means

We use the elbow method to select k for k-means clustering. The idea of the elbow method is to run k-means clustering on the dataset for a range of values of 'k' (the number of clusters) and, for each k, compute the within-cluster sum of squares (WSS), defined as the sum of the squared distances between each member of a cluster and its centroid. The value of k at which the WSS stops decreasing sharply (the 'elbow') is the one we choose.

p-value

We use the p-value to understand whether the given data really describes the observed effect or not. We use the below formula to calculate the p-value for the observed effect 'E', given that the null hypothesis 'H0' is true: p-value = P(E | H0)

collaborative filtering

Collaborative filtering is a technique used to build recommender systems. In this technique, recommendations for a user are generated from data about the likes and dislikes of users who are similar to them. This similarity is estimated based on several varying factors, such as age, gender, locality, etc. If User A, who is similar to User B, watched and liked a movie, then that movie will be recommended to User B, and similarly, if User B watched and liked a movie, it would be recommended to User A. In other words, the content of the movie does not matter much; what matters when recommending a movie to a user is whether other users similar to that particular user liked it or not.
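
A toy sketch of user-based collaborative filtering with cosine similarity; the rating matrix, user names, and movie indices are made up for illustration:

```python
# User-based collaborative filtering sketch (toy data, illustrative only).
import numpy as np

# Rows = users, columns = movies, values = ratings (0 means "not rated").
ratings = np.array([
    [5, 4, 0, 1],   # User A
    [4, 5, 3, 2],   # User B (tastes similar to A)
    [1, 0, 5, 4],   # User C
], dtype=float)

def cosine_sim(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Similarity of User A to every other user.
sims = [cosine_sim(ratings[0], ratings[i]) for i in range(1, len(ratings))]
print("Similarity A-B:", round(sims[0], 3), "A-C:", round(sims[1], 3))

# Recommend to User A the unrated movie that the most similar user rated highest.
most_similar = 1 + int(np.argmax(sims))
unrated = np.where(ratings[0] == 0)[0]
best = unrated[np.argmax(ratings[most_similar, unrated])]
print("Recommend movie index", best, "to User A")
```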

eigenvalue vs eigenvector?

Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching; they help us understand linear transformations. The corresponding eigenvalues tell us by how much the transformation stretches or compresses along each of those directions. In data analysis, we usually calculate the eigenvectors (and eigenvalues) of a correlation or covariance matrix.
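
A minimal sketch of computing eigenvalues and eigenvectors of a covariance matrix with NumPy; the 2-D toy data is an assumption for illustration:

```python
# Eigen-decomposition of a covariance matrix (toy 2-D data, illustrative only).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=500)])

cov = np.cov(data, rowvar=False)           # 2x2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(cov)

# Each column of `eigenvectors` is a direction; the matching eigenvalue says
# how much variance (stretch) the data has along that direction.
print("Eigenvalues:", eigenvalues)
print("Eigenvectors (columns):\n", eigenvectors)
```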

Convenience Sampling

It is a type of sampling that doesn't depend on chance and is often used in research studies. This sampling technique involves choosing people who are easy for the researcher to reach and get in touch with. Instead of picking people at random from a certain population, convenience sampling involves picking the people who are easiest for the researcher to get information from. Convenience sampling is often used when other types of sampling methods are hard or impossible to use because of time, cost, or other issues. Even though it can be a quick and easy way to get data, it can also have biases and limitations that can affect how well the results can be used in the real world and how reliable they are. - Non-Probability Sampling

ROC curve

It stands for Receiver Operating Characteristic. It is basically a plot of the true positive rate against the false positive rate, and it helps us find the right tradeoff between the two for different probability thresholds of the predicted values. The closer the curve is to the upper left corner, the better the model; in other words, whichever curve has the greater area under it corresponds to the better model.
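
A minimal sketch of computing an ROC curve and its AUC with scikit-learn; the synthetic dataset and the logistic-regression model are assumptions for illustration:

```python
# ROC curve / AUC sketch (synthetic data and model are illustrative assumptions).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]        # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_test, scores)  # FPR/TPR at each threshold
print("AUC:", roc_auc_score(y_test, scores))      # area under the ROC curve
```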

We want to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level. What is the most appropriate algorithm for this case?

Logistic Regression

Purposive Sampling

Purposive sampling refers to a group of non-probability sampling techniques in which units are selected because they have characteristics that you need in your sample. In other words, units are selected "on purpose" in purposive sampling. Also called judgmental sampling, this sampling method relies on the researcher's judgment when identifying and selecting the individuals, cases, or events that can provide the best information to achieve the study's objectives. Purposive sampling is common in qualitative research and mixed methods research. It is particularly useful if you need to find information-rich cases or make the most out of limited resources, but it is at high risk for research biases like observer bias. Purposive sampling is best used when you want to focus in depth on relatively small samples. Perhaps you would like to access a particular subset of the population that shares certain characteristics, or you are researching issues likely to have unique cases. The main goal of purposive sampling is to identify the cases, individuals, or communities best suited to helping you answer your research question. For this reason, purposive sampling works best when you have a lot of background information about your research topic: the more information you have, the higher the quality of your sample.

Quota Sampling

Quota sampling is a non-probability sampling method that relies on the non-random selection of a predetermined number or proportion of units. This is called a quota. You first divide the population into mutually exclusive subgroups (called strata) and then recruit sample units until you reach your quota. These units share specific characteristics, determined by you prior to forming your strata. The aim of quota sampling is to control what or who makes up your sample. Your design may: Replicate the true composition of the population of interest, Include equal numbers of different types of respondents, Over-sample a particular type of respondent, even if population proportions differ

Snowball Sampling (Chain-Referral Sampling)

Snowball sampling or chain-referral sampling is defined as a non-probability sampling technique in which the samples have rare traits. This is a sampling technique, in which existing subjects provide referrals to recruit samples required for a research study. For example, if you are studying the level of customer satisfaction among the members of an elite country club, you will find it extremely difficult to collect primary data sources unless a member of the club agrees to have a direct conversation with you and provides the contact details of the other members of the club. This sampling method involves a primary data source nominating other potential data sources that will be able to participate in the research studies. Snowball sampling method is purely based on referrals and that is how a researcher is able to generate a sample. Therefore this method is also called the chain-referral sampling method. Snowball sampling is a popular business study method. The snowball sampling method is extensively used where a population is unknown and rare and it is tough to choose subjects to assemble them as samples for research. This sampling technique can go on and on, just like a snowball increasing in size (in this case the sample size) till the time a researcher has enough data to analyze, to draw conclusive results that can help an organization make informed decisions.

Specificity

Specificity is the metric that evaluates a model's ability to correctly identify true negatives, i.e., the fraction of actual negatives that are predicted as negative. It applies to any categorical (classification) model. TN / (TN + FP)

drawbacks of the linear model

The assumption of linearity of the errors; it can't be used for count outcomes or binary outcomes; and there are overfitting problems that it can't solve.

information gain

When building a decision tree, at each step, we have to create a node that decides which feature we should use to split data, i.e., which feature would best separate our data so that we can make predictions. This decision is made using information gain, which is a measure of how much entropy is reduced when a particular feature is used to split the data. The feature that gives the highest information gain is the one that is chosen to split the data.

confusion matrix

The confusion matrix is a table used to estimate the performance of a classification model. For binary classification, it tabulates the actual values against the predicted values in a 2×2 matrix.

How to deal with missing values

The following are ways to handle missing data values: If the dataset is large, we can simply remove the rows with missing data values; this is the quickest way. For smaller datasets, we can use the rest of the data to predict the missing values, or substitute them with the mean or average of the remaining data using a pandas DataFrame in Python; there are different ways to do so, such as df.fillna(df.mean()).
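
A small sketch of the two approaches described above using pandas; the toy DataFrame and column names are assumptions:

```python
# Handling missing values: drop rows or fill with the column mean (toy data).
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40],
                   "salary": [50_000, 62_000, np.nan, 58_000]})

# Option 1: drop rows that contain missing values (fine for large datasets).
dropped = df.dropna()
print(dropped)

# Option 2: fill missing values with the column mean (useful for small datasets).
filled = df.fillna(df.mean(numeric_only=True))
print(filled)
```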

A/B testing

A/B testing is a kind of statistical hypothesis testing for randomized experiments with two variants, represented as A and B. A/B testing is used when we wish to test a new feature in a product. In the A/B test, we give users two variants of the product, labeled A and B; the A variant can be the product with the new feature added, and the B variant can be the product without it. After users use these two variants, we capture their ratings for the product. If the rating of variant A is statistically significantly higher, then the new feature is considered an improvement and useful and is accepted. Otherwise, the new feature is removed from the product.
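
A minimal sketch of analyzing such a test with a two-sample t-test on the captured ratings; the rating arrays, group means, and the 0.05 significance level are all assumptions for illustration:

```python
# A/B test analysis sketch: compare mean ratings of two variants (toy data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
ratings_a = rng.normal(loc=4.2, scale=0.8, size=200)  # variant A: with new feature
ratings_b = rng.normal(loc=4.0, scale=0.8, size=200)  # variant B: without it

t_stat, p_value = stats.ttest_ind(ratings_a, ratings_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05 and ratings_a.mean() > ratings_b.mean():
    print("Variant A is significantly better: keep the new feature.")
else:
    print("No significant improvement: reconsider the new feature.")
```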

collaborative filtering vs content-based filtering

Content-based filtering is often considered better than collaborative filtering for generating recommendations, which does not mean that collaborative filtering generates bad recommendations. However, as collaborative filtering is based on the likes and dislikes of other users, we cannot rely on it as much; users' likes and dislikes may change in the future (for example, a user may like a movie now that they did not like 10 years ago), and users who are similar in some features may not have the same taste in the kind of content that the platform provides. In content-based filtering, we make use of a user's own likes and dislikes, which are much more reliable and yield more positive results. This is why platforms such as Netflix, Amazon Prime, and Spotify make heavy use of content-based filtering (typically combined with collaborative filtering in hybrid systems) when generating recommendations for their users.

normal distribution

A data distribution describes how data is spread out, and visualizing it helps us analyze that spread. Data can be distributed in various ways; for instance, it could be skewed to the left or the right, or it could be jumbled up with no apparent pattern. Data may also be distributed around a central value (mean, median, etc.). A distribution that has no skew to the left or the right, takes the form of a bell-shaped curve, and has its mean equal to its median is called a normal distribution.

dimensionality reduction

Dimensionality reduction is the process of converting a dataset with a high number of dimensions (fields) to a dataset with a lower number of dimensions. This is done by dropping some fields or columns from the dataset. However, this is not done haphazardly. In this process, the dimensions or fields are dropped only after making sure that the remaining information will still be enough to succinctly describe similar information.

dimensionality reduction

Dimensionality reduction reduces the dimensions and size of the entire dataset. It drops unnecessary features while retaining the overall information in the data intact. Reduction in dimensions leads to faster processing of the data. The reason why data with high dimensions is considered so difficult to deal with is that it leads to high time consumption while processing the data and training a model on it. Reducing dimensions speeds up this process, removes noise, and also leads to better model accuracy.

F1 score

The F1 score is the harmonic mean of precision and recall, and it summarizes the test's accuracy. It ranges from 0 to 1: F1 = 1 means both precision and recall are perfect, while values closer to 0 mean that precision, recall, or both are poor. See below for the formula to calculate the F1 score: 2 x [(Precision x Recall) / (Precision + Recall)]
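
A small sketch computing precision, recall, and F1 from predictions; the label arrays are made up, and scikit-learn is used for convenience:

```python
# Precision / recall / F1 sketch on toy labels (illustrative only).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)         # 2PR / (P + R), the harmonic mean
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```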

entropy

In a decision tree algorithm, entropy is the measure of impurity or randomness. The entropy of a given dataset tells us how pure or impure the values of the dataset are; in simple terms, it tells us about the variability in the dataset. For example, suppose we are given a box with 10 blue marbles. The entropy of the box is 0, as it contains marbles of only one color, i.e., there is no impurity. If we need to draw a marble from the box, the probability of it being blue is 1.0. However, if we replace 4 of the blue marbles with 4 red marbles, the probability of drawing a blue marble drops to 0.6 and the entropy rises to about 0.97 bits.
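
A tiny sketch computing the entropy of the marble box described above (entropy in bits, i.e., log base 2):

```python
# Entropy of the marble-box example (pure box vs. 6 blue / 4 red mix).
import math

def entropy(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([1.0]))        # 10 blue marbles only -> 0.0 (pure)
print(entropy([0.6, 0.4]))   # 6 blue, 4 red -> about 0.971 (impure)
```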

Cluster Sampling

In cluster sampling, researchers divide a population into smaller groups known as clusters. They then randomly select among these clusters to form a sample. Cluster sampling is a method of probability sampling that is often used to study large populations, particularly those that are widely geographically dispersed. Researchers usually use pre-existing units such as schools or cities as their clusters. - Probability Sampling

kernel function

In the SVM algorithm, a kernel function is a special mathematical function. In simple terms, a kernel function takes data as input and converts it into a required form. This transformation of the data is based on something called a kernel trick, which is what gives the kernel function its name. Using the kernel function, we can transform the data that is not linearly separable (cannot be separated using a straight line) into one that is linearly separable.

Stationary time-series

A time series is stationary when its statistical properties, such as the mean and variance, do not change over time. For example, if a variable Y oscillates around the same values with a constant spread as time X progresses, the series is stationary. If the oscillations grow larger over time (the waves get bigger), the variance is changing with time and the series is non-stationary.

Naive Bayes

Naive Bayes is a Data Science algorithm. It has the word 'Bayes' in it because it is based on the Bayes theorem, which deals with the probability of an event occurring given that another event has already occurred. It has 'naive' in it because it makes the assumption that each variable in the dataset is independent of the other. This kind of assumption is unrealistic for real-world data. However, even with this assumption, it is very useful for solving a range of complicated problems, e.g., spam email classification, etc

RMSE

RMSE stands for the root mean square error. It is a measure of accuracy in regression: RMSE allows us to calculate the magnitude of the error produced by a regression model. RMSE is calculated as follows: first, we calculate the errors in the predictions made by the regression model by taking the differences between the actual and the predicted values. Then, we square the errors, calculate the mean of the squared errors, and finally take the square root of that mean. This number is the RMSE; a model with a lower RMSE produces smaller errors, i.e., the model is more accurate.
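
A minimal sketch of the RMSE calculation described above using NumPy; the actual and predicted values are made up:

```python
# RMSE sketch: error -> square -> mean -> square root (toy values).
import numpy as np

actual = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

errors = actual - predicted            # step 1: prediction errors
rmse = np.sqrt(np.mean(errors ** 2))   # steps 2-4: square, average, square root
print("RMSE:", rmse)
```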

reinforcement learning

Reinforcement learning is a kind of Machine Learning concerned with building software agents that perform actions to maximize cumulative reward. A reward is used to let the model know (during training) whether a particular action leads to, or brings it closer to, the goal. For example, if we are creating an ML model that plays a video game, the reward is going to be either the points collected during play or the level reached. Reinforcement learning is used to build agents that can make real-world decisions that move them toward a clearly defined goal.

resampling

Resampling is done in any of these cases: estimating the accuracy of sample statistics by using subsets of accessible data or drawing randomly with replacement from a set of data points; substituting labels on data points when performing significance tests; and validating models by using random subsets (bootstrapping, cross-validation).

root cause analysis

Root cause analysis is the process of figuring out the root causes that lead to certain faults or failures. A factor is considered to be a root cause if, after eliminating it, a sequence of operations, leading to a fault, error, or undesirable result, ends up working correctly. Root cause analysis is a technique that was initially developed and used in the analysis of industrial accidents, but now, it is used in a wide variety of areas.

root cause analysis

Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique used for isolating the root causes of faults or problems. A factor is called a root cause if removing it from the problem-fault sequence prevents the final undesirable event from recurring.

sampling

Sampling is defined as the process of selecting a sample from a group of people or from any particular kind of population for research purposes. It is one of the most important factors that decides the accuracy of a research/survey result. Mainly, there are two types of sampling techniques:
Probability sampling: It involves random selection, which gives every element a chance of being selected. Probability sampling has various subtypes, as mentioned below.
Simple random sampling: A simple random sample is a subset of a statistical population in which each member of the subset has an equal probability of being chosen. A simple random sample is meant to be an unbiased representation of a group.
Stratified sampling: Stratified random sampling is a method of sampling that involves the division of a population into smaller subgroups known as strata. In stratified random sampling, or stratification, the strata are formed based on members' shared attributes or characteristics, such as income or educational attainment. Stratified random sampling has numerous applications and benefits, such as studying population demographics and life expectancy. It is also called proportional random sampling or quota random sampling.
Systematic sampling: Systematic sampling is a type of probability sampling method in which sample members from a larger population are selected according to a random starting point but with a fixed, periodic interval. This interval, called the sampling interval, is calculated by dividing the population size by the desired sample size. Despite the sample population being selected in advance, systematic sampling is still thought of as being random if the periodic interval is determined beforehand and the starting point is random. When carried out correctly on a large population of a defined size, systematic sampling can help researchers, including marketing and sales professionals, obtain representative findings on a huge group of people without having to reach out to each and every one of them.
Cluster sampling: In cluster sampling, researchers divide a population into smaller groups known as clusters. They then randomly select among these clusters to form a sample. Cluster sampling is often used to study large populations, particularly those that are widely geographically dispersed. Researchers usually use pre-existing units such as schools or cities as their clusters.
Multi-stage sampling: In multistage sampling, or multistage cluster sampling, you draw a sample from a population using smaller and smaller groups (units) at each stage. It's often used to collect data from a large, geographically spread group of people in national surveys. You divide the population into smaller and smaller groupings to create a sample using several steps, and you can take advantage of hierarchical groupings (e.g., from state to city to neighborhood) to create a sample that's less expensive and time-consuming to collect data from.
Non-probability sampling: Non-probability sampling follows non-random selection, which means the selection is done based on your ease of access or any other required criteria. This helps to collect the data easily. The following are its various types.
Convenience sampling: It is a type of sampling that doesn't depend on chance and is often used in research studies. This sampling technique involves choosing people who are easy for the researcher to reach and get in touch with. Instead of picking people at random from a certain population, convenience sampling involves picking the people who are easiest for the researcher to get information from. Convenience sampling is often used when other types of sampling methods are hard or impossible to use because of time, cost, or other issues. Even though it can be a quick and easy way to get data, it can also have biases and limitations that can affect how well the results can be used in the real world and how reliable they are.
Purposive sampling: Purposive sampling refers to a group of non-probability sampling techniques in which units are selected because they have characteristics that you need in your sample. In other words, units are selected "on purpose" in purposive sampling. Also called judgmental sampling, this sampling method relies on the researcher's judgment when identifying and selecting the individuals, cases, or events that can provide the best information to achieve the study's objectives. Purposive sampling is common in qualitative research and mixed methods research. It is particularly useful if you need to find information-rich cases or make the most out of limited resources, but it is at high risk for research biases like observer bias. Purposive sampling is best used when you want to focus in depth on relatively small samples, for example when you would like to access a particular subset of the population that shares certain characteristics, or when you are researching issues likely to have unique cases. The main goal of purposive sampling is to identify the cases, individuals, or communities best suited to helping you answer your research question. For this reason, purposive sampling works best when you have a lot of background information about your research topic: the more information you have, the higher the quality of your sample.
Quota sampling: Quota sampling is a non-probability sampling method that relies on the non-random selection of a predetermined number or proportion of units. This is called a quota. You first divide the population into mutually exclusive subgroups (called strata) and then recruit sample units until you reach your quota. These units share specific characteristics, determined by you prior to forming your strata. The aim of quota sampling is to control what or who makes up your sample. Your design may replicate the true composition of the population of interest, include equal numbers of different types of respondents, or over-sample a particular type of respondent, even if population proportions differ.
Referral/snowball sampling: Snowball sampling, or chain-referral sampling, is defined as a non-probability sampling technique in which the samples have rare traits. This is a sampling technique in which existing subjects provide referrals to recruit the samples required for a research study. For example, if you are studying the level of customer satisfaction among the members of an elite country club, you will find it extremely difficult to collect primary data sources unless a member of the club agrees to have a direct conversation with you and provides the contact details of the other members of the club. This sampling method involves a primary data source nominating other potential data sources that will be able to participate in the research study. The snowball sampling method is purely based on referrals, and that is how a researcher is able to generate a sample; therefore, this method is also called the chain-referral sampling method. Snowball sampling is a popular method in business studies and is extensively used where a population is unknown and rare and it is tough to choose subjects to assemble as samples for research. This sampling technique can go on and on, just like a snowball increasing in size (in this case, the sample size), until the researcher has enough data to analyze to draw conclusive results that can help an organization make informed decisions.

How regularly must an algorithm be updated?

You will want to update an algorithm when: you want the model to evolve as data streams through the infrastructure; the underlying data source is changing; or there is a case of non-stationarity.

feature vectors

A feature vector is an n-dimensional vector of numerical features that represent an object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics (called features) of an object in a mathematical way that's easy to analyze.

You are given a dataset on cancer detection. You have built a classification model and achieved an accuracy of 96 percent. Why shouldn't you be happy with your model performance? What can you do about it?

Cancer detection results in imbalanced data. In an imbalanced dataset, accuracy should not be used as a measure of performance. It is important to focus on the remaining four percent, which represents the patients who were wrongly diagnosed. Early diagnosis is crucial when it comes to cancer detection and can greatly improve a patient's prognosis. Hence, to evaluate model performance, we should use sensitivity (true positive rate), specificity (true negative rate), and the F measure to determine the class-wise performance of the classifier.

Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables?

K-means clustering, linear regression, k-NN (k-nearest neighbors), decision trees. The k-nearest neighbors algorithm can be used because it can compute the nearest neighbors and, if a value is missing, it imputes it based on those neighbors computed from the other features. With k-means clustering or linear regression, you need to handle missing values in your pre-processing, otherwise they'll crash. Decision trees also have the same problem, although there is some variance.
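
A minimal sketch of KNN-based imputation, assuming scikit-learn's KNNImputer; the toy matrix and n_neighbors=2 are illustrative assumptions:

```python
# KNN imputation sketch (toy matrix, illustrative only).
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Missing entries are filled with the mean of the feature values from the
# 2 nearest neighbors (distance computed on the non-missing features).
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```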

Markov chains

A Markov chain is a stochastic process in which a state's future probability depends only on its current state. A good example of a Markov chain is a word-recommendation system: the model recognizes and recommends the next word based only on the immediately previous word and nothing before that. The model is trained on previous text similar to the training data and generates recommendations for the current text accordingly, based on the previous word.

Sensitivity vs. Specificity

One key difference is that sensitivity is estimated from the positive class and specificity from the negative class, so on imbalanced datasets the estimate based on the rarer class is noisier and more easily distorted. Another difference is that, for a given model, sensitivity and specificity trade off against each other: as the decision threshold is moved to increase sensitivity, specificity decreases, and vice versa, so it is not possible to optimize both simultaneously. When choosing a machine learning model, it is important to consider both sensitivity and specificity in order to select the best model for the task at hand. In general, sensitivity matters more when the objective is to maximize the number of positive examples that are correctly classified, whereas specificity matters more when the objective is to minimize the number of negative examples that are incorrectly classified.

error vs residual error?

An error is the difference between the observed values and the true values of the underlying population, whereas the residual error is the difference between the observed values and the values predicted by the model. The reason we use the residual error to evaluate the performance of an algorithm is that the true values are never known; hence, we use the observed values to measure the error using residuals. This helps us get an accurate estimate of the error.

Your organization has a website where visitors randomly receive one of two coupons. It is also possible that visitors to the website will not receive a coupon. You have been asked to determine if offering a coupon to website visitors has any impact on their purchase decisions. Which analysis method should you use?

One-way ANOVA

decision tree

A decision tree is a supervised learning algorithm that is used for both classification and regression; hence, the dependent variable can be either a numerical value or a categorical value. Each internal node denotes a test on an attribute, each edge denotes an outcome of that test, and each leaf node holds a class label. So we have a series of test conditions that give the final decision according to those conditions.

error vs residual error

Error: The difference between the actual (true) value and the predicted value is called an error. Some popular ways of summarizing errors in data science are root mean squared error (RMSE), mean absolute error (MAE), and mean squared error (MSE). An error is generally unobservable; it describes how the observed data differ from the true population values. Residual error: The difference between an observed value and the value fitted by the model (for example, the arithmetic mean of the group) is called a residual error. Residuals show how the sample data differ from the fitted model and, unlike errors, can be computed and represented on a graph.

Data science vs programming

In traditional programming paradigms, we used to analyze the input, figure out the expected output, and write code, which contains rules and statements needed to transform the provided input into the expected output. As we can imagine, these rules were not easy to write, especially for data that even computers had a hard time understanding, e.g., images, videos, etc. Data Science shifts this process a little bit. In it, we need access to large volumes of data that contain the necessary inputs and their mappings to the expected outputs. Then, we use Data Science algorithms, which use mathematical analysis to generate rules to map the given inputs to outputs. This process of rule generation is called training. After training, we use some data that was set aside before the training phase to test and check the system's accuracy. The generated rules are a kind of black box, and we cannot understand how the inputs are being transformed into outputs. However, if the accuracy is good enough, then we can use the system (also called a model). As described above, in traditional programming, we had to write the rules to map the input to the output, but in Data Science, the rules are automatically generated or learned from the given data. This helped solve some really difficult challenges that were being faced by several companies.

logistic regression

Logistic regression measures the relationship between the dependent variable (our label of what we want to predict) and one or more independent variables (our features) by estimating probability using its underlying logistic function (sigmoid).
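
A minimal sketch of the sigmoid (logistic) function and a fitted model that outputs probabilities; the synthetic dataset is an assumption for illustration:

```python
# Logistic regression sketch: sigmoid function + probability estimates.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # Maps any real-valued score to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # ~[0.12, 0.5, 0.88]

X, y = make_classification(n_samples=500, n_features=4, random_state=3)
model = LogisticRegression().fit(X, y)
print("P(y=1) for first sample:", model.predict_proba(X[:1])[0, 1])
```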

NLP

NLP is short for Natural Language Processing. It deals with the study of how computers can be programmed to understand and process massive amounts of textual data. A few popular examples of NLP tasks are stemming, sentiment analysis, tokenization, removal of stop words, etc.

Do gradient descent methods always converge

They do not, because in some cases they reach a local minimum or a local optimum point and never reach the global optimum. This is governed by the data (the shape of the loss surface) and the starting conditions.
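
A tiny sketch showing why gradient descent may stop at a local minimum: the same update rule, started from two different points on a non-convex function, converges to different minima. The function, learning rate, and starting points are assumptions for illustration:

```python
# Gradient descent on a non-convex function: different starts, different minima.
def f(x):            # non-convex: has a local and a global minimum
    return x**4 - 3 * x**2 + x

def grad(x):         # derivative of f
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, lr=0.01, steps=1000):
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

print(gradient_descent(x=2.0))    # ends near x ~ 1.13, a local (not global) minimum
print(gradient_descent(x=-2.0))   # ends near x ~ -1.30, the global minimum
```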

bias-variance trade-off

When building a model using Data Science or Machine Learning, our goal is to build one that has low bias and low variance. We know that bias and variance are both errors that occur due to either an overly simplistic model or an overly complicated model. Therefore, when we are building a model, the goal of getting high accuracy is only going to be accomplished if we are aware of the tradeoff between bias and variance. Bias is an error that occurs when a model is too simple to capture the patterns in a dataset. To reduce bias, we need to make our model more complex. Although making the model more complex can reduce bias, if we make the model too complex, it may end up becoming too sensitive to the training data, leading to high variance. So, the tradeoff between bias and variance is that if we increase the complexity, the bias reduces and the variance increases, and if we reduce complexity, the bias increases and the variance reduces. Our goal is to find a point at which our model is complex enough to give low bias but not so complex as to end up having high variance.

steps in making a decision tree

1. Take the entire dataset as input. 2. Calculate the entropy of the target variable, as well as of the predictor attributes. 3. Calculate the information gain of all attributes (we gain information on sorting different objects from each other). 4. Choose the attribute with the highest information gain as the root node. 5. Repeat the same procedure on every branch until the decision node of each branch is finalized. For example, a decision tree built on job-offer data might conclude that an offer is accepted if: the salary is greater than $50,000, the commute is less than an hour, and incentives are offered. A minimal sketch of fitting a decision tree follows below.
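
This sketch fits an entropy-based tree and prints its learned splits; the Iris dataset and the depth limit are illustrative assumptions:

```python
# Decision tree sketch: entropy criterion, splits chosen by information gain.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Each split was chosen to maximize information gain (criterion="entropy").
print(export_text(tree, feature_names=list(iris.feature_names)))
```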

Machine Learning vs Deep Learning

A field of computer science, Machine Learning is a subfield of Data Science that deals with using existing data to help systems automatically learn new skills to perform different tasks without having rules to be explicitly programmed. Deep Learning, on the other hand, is a field in Machine Learning that deals with building Machine Learning models using algorithms that try to imitate the process of how the human brain learns from the information in a system for it to attain new capabilities. In Deep Learning, we make heavy use of deeply connected neural networks with many layers.

random forest

A random forest is built up of a number of decision trees. If you split the data into different packages and make a decision tree on each of the different groups of data, the random forest brings all those trees together. Steps to build a random forest model: 1. Randomly select 'k' features from a total of 'm' features, where k << m. 2. Among the 'k' features, calculate the node D using the best split point. 3. Split the node into daughter nodes using the best split. 4. Repeat steps two and three until leaf nodes are finalized. 5. Build the forest by repeating steps one to four 'n' times to create 'n' trees.

recommender system

A recommender system is a system that many consumer-facing, content-driven, online platforms employ to generate recommendations for users from a library of available content. These systems generate recommendations based on what they know about the users' tastes from their activities on the platform. For example, imagine that we have a movie streaming platform, similar to Netflix or Amazon Prime. If a user has previously watched and liked movies from action and horror genres, then it means that the user likes watching the movies of these genres. In that case, it would be better to recommend such movies to this particular user. These recommendations can also be generated based on what users with a similar taste like watching.

RNN (recurrent neural network)

A recurrent neural network, or RNN for short, is a kind of Machine Learning algorithm that makes use of artificial neural networks. RNNs are used to find patterns in sequences of data, such as time series, stock market prices, temperature, etc. Unlike plain feedforward networks, in which information only passes from one layer to the next, RNNs have feedback connections: each node performs mathematical operations on the data, and these operations are temporal, i.e., RNNs store contextual information about previous computations in the network. It is called recurrent because it performs the same operations on the data at every time step; however, the output may differ based on past computations and their results.

Simple Random Sampling

A simple random sample is a subset of a statistical population in which each member of the subset has an equal probability of being chosen. A simple random sample is meant to be an unbiased representation of a group. - Probability Sampling

After studying the behavior of a population, you have identified four specific individual types that are valuable to your study. You would like to find all users who are most similar to each individual type. Which algorithm is most appropriate for this study?

As we are looking to group people together by four specific types of similarity, this indicates the value of k. Therefore, k-means clustering is the most appropriate algorithm for this study.

bagging

Bagging is an ensemble learning method; it stands for bootstrap aggregating. In this technique, we generate data using the bootstrap method: from an existing dataset, we draw multiple samples of size N with replacement. This bootstrapped data is then used to train multiple models in parallel, which makes the bagging model more robust than a simple model. Once all the models are trained, when it's time to make a prediction, we make predictions using all the trained models and then average the results in the case of regression; for classification, we choose the class predicted most frequently by the models (majority vote).

bias

Bias is a type of error that occurs in a Data Science model because of using an algorithm that is not strong enough to capture the underlying patterns or trends that exist in the data. In other words, this error occurs when the data is too complicated for the algorithm to understand, so it ends up building a model that makes simple assumptions. This leads to lower accuracy because of underfitting. Algorithms that can lead to high bias are linear regression, logistic regression, etc

bias-variance trade-off

Bias: Due to an oversimplification of a Machine Learning algorithm, an error occurs in our model, which is known as bias. This can lead to underfitting and to oversimplified assumptions at model training time in order to make the target function easier and simpler to understand. Some popular machine learning algorithms that are low on the bias scale are Support Vector Machines (SVM), k-Nearest Neighbors (KNN), and decision trees; algorithms that are high on the bias scale are logistic regression and linear regression. Variance: Because of a complex machine learning algorithm, a model can perform really badly on a test dataset, as the model learns even the noise from the training dataset. This error in the Machine Learning model is called variance, and it can generate overfitting and hyper-sensitivity in Machine Learning models. While trying to get over bias in our model, we try to increase the complexity of the machine learning algorithm. Though this helps in reducing the bias, after a certain point it generates an overfitting effect on the model, resulting in hyper-sensitivity and high variance. Bias-variance trade-off: To achieve the best performance, the main target of a supervised machine learning algorithm is to have low variance and low bias. The following is observed for some popular machine learning algorithms: The Support Vector Machine (SVM) algorithm has high variance and low bias; to change the trade-off, we can adjust the parameter C, which controls how many margin violations are tolerated in the training dataset and thus shifts the balance between bias and variance. Like the SVM, the k-Nearest Neighbors (KNN) algorithm has high variance and low bias; to change the trade-off, we can increase the value of K, which increases the number of neighbors that influence the prediction and thus increases the model's bias.

Point Estimates vs Confidence Interval

Confidence Interval: A range of values likely to contain the population parameter is given by the confidence interval, which also tells us how likely that particular interval is to contain the population parameter. The confidence coefficient (or confidence level) is denoted by 1 - alpha, which gives the probability or likelihood, and the level of significance is given by alpha. Point Estimates: An estimate of the population parameter given by a single value is called a point estimate. Some popular methods used to derive point estimators of population parameters are the maximum likelihood estimator and the method of moments.

content-based filtering

Content-based filtering is one of the techniques used to build recommender systems. In this technique, recommendations are generated by making use of the properties of the content that a user is interested in. For example, if a user is watching movies belonging to the action and mystery genre and giving them good ratings, it is a clear indication that the user likes movies of this kind. If shown movies of a similar genre as recommendations, there is a higher probability that the user would like those recommendations as well. In other words, here, the content of the movie is taken into consideration when generating recommendations for users.

cross-validation

Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent dataset. It is mainly used in settings where the objective is forecasting and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to set aside a portion of the data to test the model during the training phase (i.e., a validation dataset) in order to limit problems like overfitting and to gain insight into how the model will generalize to an independent dataset.

Data modeling vs Database design

Data Modeling: It can be considered as the first step towards the design of a database. Data modeling creates a conceptual model based on the relationship between various data models. The process involves moving from the conceptual stage to the logical model to the physical schema. It involves the systematic method of applying data modeling techniques. Database Design: This is the process of designing the database. The database design creates an output which is a detailed data model of the database. Strictly speaking, database design includes the detailed logical model of a database but it can also include physical design choices and storage parameters.

Data Science vs Machine Learning

Data Science and Machine Learning are two terms that are closely related but are often misunderstood. Both of them deal with data. However, there are some fundamental distinctions that show us how they are different from each other. Data Science is a broad field that deals with large volumes of data and allows us to draw insights out of this voluminous data. The entire process of Data Science takes care of multiple steps that are involved in drawing insights out of the available data. This process includes crucial steps such as data gathering, data analysis, data manipulation, data visualization, etc. Machine Learning, on the other hand, can be thought of as a sub-field of Data Science. It also deals with data, but here, we are solely focused on learning how to convert the processed data into a functional model, which can be used to map inputs to outputs, e.g., a model that can expect an image as an input and tell us if that image contains a flower as an output. In short, Data Science deals with gathering data, processing it, and finally, drawing insights from it. The field of Data Science that deals with building models using algorithms is called Machine Learning. Therefore, Machine Learning is an integral part of Data Science.

Data Science

Data Science combines statistics, mathematics, specialized programming, artificial intelligence, machine learning, etc. Data Science is simply the application of specific principles and analytic techniques to extract information from data for use in strategic planning, decision making, etc. Simply put, data science means analysing data for actionable insights.

Data Science

Data Science is a field of computer science that explicitly deals with turning data into information and extracting meaningful insights out of it.

k-fold cross-validation

In k-fold cross-validation, we divide the dataset into k equal parts. After this, we loop over the entire dataset k times. In each iteration of the loop, one of the k parts is used for testing, and the other k − 1 parts are used for training. Using k-fold cross-validation, each one of the k parts of the dataset ends up being used for training and testing purposes.
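
A minimal sketch of 5-fold cross-validation with scikit-learn; the dataset and model choice are assumptions for illustration:

```python
# k-fold cross-validation sketch: each fold is the test set exactly once.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each of the 5 folds is used once for testing and 4 times for training.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print("Fold accuracies:", scores, "mean:", scores.mean())
```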

Multi-Stage Sampling

In multistage sampling, or multistage cluster sampling, you draw a sample from a population using smaller and smaller groups (units) at each stage. It's often used to collect data from a large, geographically spread group of people in national surveys. In multistage sampling, you divide the population into smaller and smaller groupings to create a sample using several steps. You can take advantage of hierarchical groupings (e.g., from state to city to neighborhood) to create a sample that's less expensive and time-consuming to collect data from. - Probability Sampling

How to deal with outliers?

Outliers can be dealt with in several ways. One way is to drop them. We can only drop the outliers if they have values that are incorrect or extreme. For example, if a dataset with the weights of babies has a value 98.6-degree Fahrenheit, then it is incorrect. Now, if the value is 187 kg, then it is an extreme value, which is not useful for our model. In case the outliers are not that extreme, then we can try: A different kind of model. For example, if we were using a linear model, then we can choose a non-linear model Normalizing the data, which will shift the extreme values closer to other data points Using algorithms that are not so affected by outliers, such as random forest, etc.

overfitting

Overfitting refers to a model that fits a very small amount of data too closely and ignores the bigger picture. There are three main methods to avoid overfitting: keep the model simple (take fewer variables into account, thereby removing some of the noise in the training data); use cross-validation techniques, such as k-fold cross-validation; and use regularization techniques, such as LASSO, that penalize certain model parameters if they're likely to cause overfitting.

pruning

Pruning a decision tree is the process of removing the sections of the tree that are not necessary or are redundant. Pruning leads to a smaller decision tree, which performs better and gives higher accuracy and speed.

recall

Recall: It is the fraction of all actual positive instances that are correctly predicted as positive. Recall helps us quantify the positive instances that were missed (misclassified as negative). We use the below formula to calculate recall: true positives / (true positives + false negatives)

How to select k in k-means

Selecting the correct value of k is an important aspect of k-means clustering. We can make use of the elbow method to pick the appropriate k value. To do this, we run the k-means algorithm for a range of values, e.g., 1 to 15. For each value of k, we compute the within-cluster sum of squares, also called the inertia: the sum of the squared distances between each point and the centroid of its cluster. As k increases from a low value, we first see a sharp decrease in the inertia value; after a certain value of k, the drop in inertia becomes quite small. That is the value of k we should choose for the k-means clustering algorithm.
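
A minimal sketch of the elbow method: print (or plot) the inertia for a range of k values and look for the "elbow". The synthetic blobs with 4 true clusters are an assumption for illustration:

```python
# Elbow method sketch: inertia (within-cluster sum of squares) vs. k.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # Inertia drops sharply until k = 4 (the true number of clusters), then levels off.
    print(k, round(km.inertia_, 1))
```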

sampling biases

Selection bias, undercoverage bias, and survivorship bias.

selection bias

Selection bias is the bias that occurs during the sampling of data. This kind of bias occurs when a sample is not representative of the population, which is going to be analyzed in a statistical study.

Normalisation vs Standardization

Standardization: The technique of rescaling data so that it has a mean of 0 and a standard deviation of 1; if the original data is normally distributed, the standardized data follows the standard normal distribution. Standardization formula: X' = (X - 𝞵) / 𝞼. Normalization: The technique of rescaling all data values to lie between 0 and 1, also known as min-max scaling. Normalization formula: X' = (X - Xmin) / (Xmax - Xmin), where Xmin is the feature's minimum value and Xmax is the feature's maximum value.
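
A small sketch of both formulas, computed by hand and checked against the equivalent scikit-learn transformers; the toy array is an assumption:

```python
# Standardization vs. normalization sketch (toy single-feature data).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

standardized = (X - X.mean()) / X.std()               # X' = (X - mu) / sigma
normalized = (X - X.min()) / (X.max() - X.min())      # X' = (X - Xmin) / (Xmax - Xmin)

# Equivalent scikit-learn transformers give the same results:
print(np.allclose(StandardScaler().fit_transform(X), standardized))  # True
print(np.allclose(MinMaxScaler().fit_transform(X), normalized))      # True
```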

Stratified Sampling

Stratified random sampling is a method of sampling that involves the division of a population into smaller subgroups known as strata. In stratified random sampling, or stratification, the strata are formed based on members' shared attributes or characteristics, such as income or educational attainment. Stratified random sampling has numerous applications and benefits, such as studying population demographics and life expectancy. Stratified random sampling is also called proportional random sampling or quota random sampling. - Probability Sampling

supervised vs unsupervised learning

Supervised learning: works on data that contains both the inputs and the expected outputs, i.e., labeled data; used to create models that can predict or classify things; commonly used algorithms include linear regression, decision trees, etc. Unsupervised learning: works on data that contains no mapping from input to output, i.e., unlabeled data; used to extract meaningful information out of large volumes of data; commonly used algorithms include k-means clustering, the Apriori algorithm, etc.

survivorship Bias

Survivorship bias is the logical error of focusing on aspects that support surviving a process and casually overlooking those that did not because of their lack of prominence. This can lead to wrong conclusions in numerous ways.

Systematic Sampling

Systematic sampling is a type of probability sampling method in which sample members from a larger population are selected according to a random starting point but with a fixed, periodic interval. This interval, called the sampling interval, is calculated by dividing the population size by the desired sample size. Despite the sample population being selected in advance, systematic sampling is still thought of as being random if the periodic interval is determined beforehand and the starting point is random. When carried out correctly on a large population of a defined size, systematic sampling can help researchers, including marketing and sales professionals, obtain representative findings on a huge group of people without having to reach out to each and every one of them.

dimensionality reduction

Dimensionality reduction refers to the process of converting a data set with vast dimensions into data with fewer dimensions (fields) that convey similar information concisely. This reduction helps in compressing data and reducing storage space. It also reduces computation time, as fewer dimensions lead to less computing. It removes redundant features; for example, there's no point in storing the same value in two different units (meters and inches).
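One common technique is principal component analysis (PCA). The sketch below uses scikit-learn and a random placeholder matrix X purely for illustration; the original text does not prescribe a specific method:

```python
# Hedged sketch: reduce 10 placeholder dimensions down to 3 with PCA.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)               # placeholder 10-dimensional data

pca = PCA(n_components=3)                 # keep 3 principal components
X_reduced = pca.fit_transform(X)

# Fraction of the original variance retained by the kept components.
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```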

Two candidates, Aman and Mohan appear for a Data Science Job interview. The probability of Aman cracking the interview is 1/8 and that of Mohan is 5/12. What is the probability that at least one of them will crack the interview?

The probability of Aman getting selected for the interview is P(A) = 1/8, and the probability of Mohan getting selected is P(B) = 5/12. The probability that at least one of them gets selected is the union of A and B:

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)    ...(1)

where P(A ∩ B) is the probability of both Aman and Mohan getting selected. Assuming the two selections are independent,

P(A ∩ B) = P(A) × P(B) = 1/8 × 5/12 = 5/96

Substituting this value into equation (1):

P(A ∪ B) = 1/8 + 5/12 - 5/96 = 12/96 + 40/96 - 5/96 = 47/96

So, the answer is 47/96.
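A quick sanity check of the arithmetic with exact fractions (the independence of the two events is an assumption carried over from the worked solution above):

```python
from fractions import Fraction

p_a = Fraction(1, 8)
p_b = Fraction(5, 12)
p_both = p_a * p_b                       # P(A ∩ B) = 5/96, assuming independence
p_at_least_one = p_a + p_b - p_both      # P(A ∪ B)

print(p_both, p_at_least_one)            # 5/96 47/96
```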

maintain a deployed model

The steps to maintain a deployed model are: Monitor - constant monitoring of all models is needed to determine their performance accuracy; when you change something, you want to figure out how your changes are going to affect things, so the model needs to be monitored to ensure it's doing what it's supposed to do. Evaluate - evaluation metrics of the current model are calculated to determine if a new algorithm is needed. Compare - the new models are compared to each other to determine which model performs the best. Rebuild - the best-performing model is re-built on the current state of the data.

random forest

The underlying principle of this technique is that several weak learners combine to provide a strong learner. The steps involved are: build several decision trees on bootstrapped training samples of the data; on each tree, each time a split is considered, choose a random sample of m predictors as split candidates out of all p predictors (rule of thumb: at each split, m ≈ √p); make predictions by majority vote across the trees (or by averaging, for regression).
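A minimal sketch of a random forest classifier with scikit-learn; X and y are placeholder data, and max_features="sqrt" mirrors the m ≈ √p rule of thumb mentioned above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(200, 9)                # placeholder data, p = 9 predictors
y = np.random.randint(0, 2, size=200)     # placeholder binary labels

rf = RandomForestClassifier(
    n_estimators=100,      # number of trees, each grown on a bootstrapped sample
    max_features="sqrt",   # m = sqrt(p) candidate predictors considered per split
    random_state=0,
).fit(X, y)

print(rf.predict(X[:5]))                  # majority vote over the trees
```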

assumptions required for linear regression

There are several assumptions required for linear regression. They are as follows: The data, which is a sample drawn from a population, used to train the model should be representative of the population. The relationship between the independent variables and the mean of the dependent variable is linear. The variance of the residuals is the same for any value of the independent variable (often denoted X); this is the homoscedasticity assumption. Each observation is independent of all other observations. For any fixed value of the independent variable, the dependent variable (equivalently, the residuals) is normally distributed.
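One quick way to eyeball the linearity and constant-variance assumptions is to look at the residuals after fitting a model. The sketch below uses made-up data and scikit-learn purely as an illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))                # placeholder predictor
y = 3.0 * X.ravel() + rng.normal(0, 1, size=200)     # roughly linear, constant noise

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Residuals should be centred near zero with similar spread across X;
# a funnel shape or a systematic curve would hint at violated assumptions.
print(residuals.mean().round(3), residuals.std().round(3))
```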

feature selection methods

There are two main methods for feature selection, i.e., filter and wrapper methods. Filter Methods This involves: Linear discriminant analysis ANOVA Chi-Square The best analogy for selecting features is "bad data in, bad answer out." When we're limiting or selecting the features, it's all about cleaning up the data coming in. Wrapper Methods This involves: Forward Selection: We test one feature at a time and keep adding them until we get a good fit Backward Selection: We test all the features and start removing them to see what works better Recursive Feature Elimination: Recursively looks through all the different features and how they pair together Wrapper methods are very labor-intensive, and high-end computers are needed if a lot of data analysis is performed with the wrapper method.
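A small sketch showing one filter method (chi-square scores) and one wrapper method (recursive feature elimination), using scikit-learn and the built-in iris dataset purely as an illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, chi2
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter method: rank features by their chi-square statistic against the target.
chi2_scores, _ = chi2(X, y)

# Wrapper method: recursively drop the weakest feature using a model's coefficients.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)

print(chi2_scores.round(2), rfe.support_)   # scores and the mask of kept features
```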

Confounding variables

These are extraneous variables in a statistical model that correlate, directly or inversely, with both the dependent and the independent variable. An estimate of the relationship between the two will be biased if it fails to account for the confounding factor.

assumptions required for linear regression are violated

These assumptions may be violated lightly (i.e., with some minor violations) or strongly (i.e., the majority of the data violates them). The two cases affect a linear regression model differently: strong violations make the results unreliable, while light violations introduce greater bias or variance into the estimates.

time series data is stationary

Time series data is considered stationary when its mean and variance are constant over time. If the mean and variance do not change over a period of time in the dataset, then we can conclude that, for that period, the data is stationary.
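A common way to test this in practice is the augmented Dickey-Fuller test. The sketch below uses statsmodels and a made-up white-noise series purely as an illustration; the original text does not prescribe a specific test:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
series = rng.normal(0, 1, size=500)      # white noise: constant mean and variance

adf_stat, p_value, *_ = adfuller(series)

# A small p-value (e.g. < 0.05) suggests the series is stationary.
print(round(adf_stat, 3), round(p_value, 4))
```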

How to handle missing data

To be able to handle missing data, we first need to know the percentage of data missing in a particular column so that we can choose an appropriate strategy to handle the situation. For example, if in a column the majority of the data is missing, then dropping the column is the best option, unless we have some means to make educated guesses about the missing values. However, if the amount of missing data is low, then we have several strategies to fill them up. One way would be to fill them all up with a default value or the value that has the highest frequency in that column, such as 0 or 1. This may be useful if the majority of the data in that column contains these values. Another way is to fill up the missing values in the column with the mean of all the values in that column. This technique is usually preferred, as the missing values have a higher chance of being closer to the mean than to the mode. Finally, if we have a huge dataset and only a few rows have values missing in some columns, then the easiest and fastest way is to drop those rows. Since the dataset is large, dropping a few rows should not be a problem.
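A minimal pandas sketch of the strategies described above, using a toy DataFrame invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "city": ["A", "B", np.nan, "B", "B"],
})

print(df.isna().mean())                                 # fraction missing per column

df["age"] = df["age"].fillna(df["age"].mean())          # numeric column: fill with the mean
df["city"] = df["city"].fillna(df["city"].mode()[0])    # categorical: most frequent value

# Alternatively, drop the few offending rows when the dataset is large:
# df = df.dropna()
```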

Sensitivity

True positive rate: In Machine Learning, the true-positive rate, also referred to as sensitivity or recall, measures the percentage of actual positives that are correctly identified. Sensitivity (true positive rate) is the probability of a positive test result, conditioned on the individual truly being positive. Formula: Sensitivity = TP / (TP + FN), i.e., true positives divided by all actual positives.
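A minimal sketch computing sensitivity from a confusion matrix; the true and predicted labels below are made up for illustration, and scikit-learn's recall_score is used only to cross-check the manual calculation:

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)            # TP / (TP + FN)

print(sensitivity, recall_score(y_true, y_pred))   # the two values agree
```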

variance

Variance is a type of error that occurs in a Data Science model when the model ends up being too complex and learns features from data, along with the noise that exists in it. This kind of error can occur if the algorithm used to train the model has high complexity, even though the data and the underlying patterns and trends are quite easy to discover. This makes the model a very sensitive one that performs well on the training dataset but poorly on the testing dataset, and on any kind of data that the model has not yet seen. Variance generally leads to poor accuracy in testing and results in overfitting.

univariate, bivariate, and multivariate analyses

When we are dealing with data analysis, we often come across terms such as univariate, bivariate, and multivariate. Let's try and understand what these mean. Univariate analysis: Univariate analysis involves analyzing data with only one variable or, in other words, a single column or a vector of the data. This analysis allows us to understand the data and extract patterns and trends out of it. Example: Analyzing the weight of a group of people. Bivariate analysis: Bivariate analysis involves analyzing the data with exactly two variables or, in other words, the data can be put into a two-column table. This kind of analysis allows us to figure out the relationship between the variables. Example: Analyzing the data that contains temperature and altitude. Multivariate analysis: Multivariate analysis involves analyzing the data with more than two variables. The number of columns of the data can be anything more than two. This kind of analysis allows us to figure out the effects of all other variables (input variables) on a single variable (the output variable). Example: Analyzing data about house prices, which contains information about the houses, such as locality, crime rate, area, the number of floors, etc.

logistic regression

Logistic regression is a classification algorithm that is used when the dependent variable is binary.
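A hedged sketch of fitting a logistic regression classifier on a binary target, using scikit-learn and its built-in breast cancer dataset purely as an illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)        # binary dependent variable
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale the features, then fit the logistic regression classifier.
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
print(clf.score(X_test, y_test))                   # accuracy on held-out data
```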

