Data Science Questions


What are the assumptions for linear regression?

1. There is a linear relationship between the dependent variable and the regressors, meaning the model you are creating actually fits the data.
2. The errors or residuals of the data are normally distributed and independent from each other.
3. There is minimal multicollinearity between explanatory variables.
4. Homoscedasticity: the variance around the regression line is the same for all values of the predictor variable.
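One way to sanity-check these assumptions in practice is to fit an OLS model and inspect the residuals and variance inflation factors. A minimal sketch, assuming statsmodels and scipy are installed; the data here is synthetic:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                      # two regressors
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=200)

Xc = sm.add_constant(X)
fit = sm.OLS(y, Xc).fit()

print(stats.shapiro(fit.resid))                    # normality of residuals (assumption 2)
print(durbin_watson(fit.resid))                    # ~2 suggests independent errors (assumption 2)
print([variance_inflation_factor(Xc, i)            # low VIFs suggest minimal
       for i in range(1, Xc.shape[1])])            # multicollinearity (assumption 3)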

What is a confusion matrix?

2x2 table that contains the 4 outputs produced by a binary classifier: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). Various measures, such as error rate, accuracy, specificity, sensitivity, precision, and recall, are derived from it:

Error rate = (FP + FN) / (P + N)
Accuracy = (TP + TN) / (P + N)
Sensitivity (recall, or true positive rate) = TP / P
Specificity (true negative rate) = TN / N
Precision (positive predicted value) = TP / (TP + FP)
F-score (weighted harmonic mean of precision and recall) = (1 + b²)(PREC · REC) / (b² · PREC + REC), where b is commonly 0.5, 1, or 2.
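These measures don't have to be computed by hand; a minimal sketch, assuming scikit-learn, with made-up labels for illustration:

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                    # the 4 outputs of the 2x2 table
print(accuracy_score(y_true, y_pred))    # (TP+TN)/(P+N)
print(recall_score(y_true, y_pred))      # sensitivity, TP/P
print(precision_score(y_true, y_pred))   # TP/(TP+FP)
print(f1_score(y_true, y_pred))          # F-score with b = 1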

What is entropy and information gain in decision tree algorithms?

A decision tree is built top-down from a root node and involves partitioning the data into homogeneous subsets. ID3 uses entropy to check the homogeneity of a sample. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided, it has an entropy of one. Information gain is the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attributes that return the highest information gain.
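A small illustration of both quantities (a sketch with numpy, not ID3 itself):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(parent, children):
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]                              # equally divided -> entropy 1
print(entropy(parent))                                           # 1.0
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))  # 1.0: a perfect split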

If a table contains duplicate rows, does a query result display the duplicate values by default? How can you eliminate duplicate rows from a query result?

A query result displays all rows including the duplicate rows. To eliminate duplicate rows in the result, the DISTINCT keyword is used in the SELECT clause.

Explain the difference between L1 and L2 regularization methods.

A regression model that uses the L1 regularization technique is called lasso regression, and a model which uses L2 is called ridge regression. The key difference between the two is the penalty term: L1 penalizes the absolute values of the coefficients, while L2 penalizes their squared values.
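A minimal sketch, assuming scikit-learn, that makes the practical difference visible (L1 can drive coefficients exactly to zero; L2 only shrinks them):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, 0.0, 0.0, 2.0, 0.0]) + rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: sum of absolute coefficient values
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty: sum of squared coefficient values
print(lasso.coef_)                   # expect some exact zeros
print(ridge.coef_)                   # small but nonzero values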

What is the difference between VARCHAR2 AND CHAR datatypes?

VARCHAR2 represents variable-length character data, whereas CHAR represents fixed-length character data.

What are the different types of keys in a relational database?

Alternate keys: candidate keys that exclude all primary keys.
Artificial keys: created by assigning a unique number to each occurrence or record when there aren't any compound or standalone keys.
Compound keys: made by combining multiple elements to develop a unique identifier for a construct when there isn't a single data element that uniquely identifies occurrences within it. Also known as a composite or concatenated key, a compound key consists of two or more attributes.
Foreign keys: groups of fields in a database record that point to a key field or group of fields forming the key of another database record, usually in a different table. Often, foreign keys in one table refer to primary keys in another. Because the referenced data can be linked together quickly, foreign keys are critical to database normalization.
Natural keys: data elements that are stored within constructs and used as primary keys.
Primary keys: values that can be used to identify unique rows in a table and the attributes associated with them; for example, a Social Security number that's related to a specific person. In the relational model of data, the primary key is the candidate key chosen as the principal means of identifying a tuple in a relation.
Super keys: defined in the relational model as a set of attributes of a relation variable such that no two distinct tuples in any relation assigned to that variable have the same values for the attributes in the set; equivalently, a set of attributes on which all attributes of the relation are functionally dependent.

What is the bias and variance tradeoff?

Bias is error introduced in your model due to oversimplification of the machine learning algorithm; it can lead to underfitting. When you train your model, it makes simplified assumptions to make the target function easier to learn. Low-bias machine learning algorithms: decision trees, k-NN, and SVM. High-bias machine learning algorithms: linear regression and logistic regression.

Variance is error introduced in your model due to a complex machine learning algorithm: the model learns noise from the training data set as well and performs badly on the test data set. It can lead to high sensitivity and overfitting.

Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias. However, this only happens up to a particular point; as you continue to make your model more complex, you end up overfitting it, and your model will start suffering from high variance.

The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance. The k-nearest neighbours algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k, which increases the number of neighbours that contribute to the prediction and in turn increases the bias of the model. The support vector machine algorithm has low bias and high variance, but the trade-off can be changed by increasing the C parameter that influences the number of margin violations allowed in the training data, which increases the bias but decreases the variance.
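The trade-off is often summarized by the standard decomposition of expected squared error (stated here for reference; \sigma^2 is the irreducible noise):

\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2

Increasing model complexity moves error out of the bias term and into the variance term, which is exactly the trade-off described above.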

What is bias and what types of bias can occur during sampling?

Bias is the difference between the average prediction of a model and the correct value that it is trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the model, which always leads to high error on the training data. The three types of bias that can occur during sampling are selection bias, undercoverage bias, and survivorship bias.

Describe the difference between covariance and correlation.

Covariance gives the direction of a linear relationship between two variables, while correlation gives both its strength and direction.
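Formally, correlation is covariance rescaled by the standard deviations, which is what bounds it to [-1, 1] and gives it a strength interpretation:

\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y}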

What are cross-correlations with time lags in time series models?

Cross-correlation is the degree of similarity between two time series at different points in time or space; the lag is the time offset under investigation. Auto-correlation is the cross-correlation of a time series with itself, used to investigate the persistence between lagged versions of the same series or signal.

What are various DCL commands in SQL? Give brief description of their purposes.

Data Control Language commands in SQL:
GRANT: gives a privilege to a user.
REVOKE: takes back privileges granted to a user.

What are various DDL commands in SQL? Give brief description of their purposes.

Data Definition Language commands in SQL:
CREATE: creates a new table, a view of a table, or another object in the database.
ALTER: modifies an existing database object, such as a table.
DROP: deletes an entire table, a view of a table, or another object in the database.

What are various DML commands in SQL? Give brief description of their purposes.

Data Manipulation Language commands in SQL:
SELECT: retrieves certain records from one or more tables.
INSERT: creates a record.
UPDATE: modifies records.
DELETE: deletes records.

What is the normal distribution?

Data is usually distributed in different ways, with a bias to the left or to the right, or it can all be jumbled up. However, data can also be distributed around a central value without any bias to the left or right, reaching a normal distribution in the form of a bell-shaped curve: the random variable is distributed as a symmetrical bell-shaped curve.
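For reference, the probability density function of a normal distribution with mean \mu and standard deviation \sigma is:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)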

What are some core steps to take for data preprocessing?

Data preprocessing involves giving structure to the data for better understanding and decision making related to the data. Some key steps:
Data discovery and acquisition: gathering data from available sources and trying to understand and make sense of it.
Data structuring and transformation: taking data sets of different formats and sizes and giving them a consistent size and shape when merged together.
Data cleaning: imputing null values and treating outliers/anomalies in the data to make it usable for further analysis.
Exploratory data analysis: finding patterns in the dataset and extracting new features from the given data in order to optimize the performance of a model.
Validating: verifying data consistency and quality.
Publishing/modeling: processing the data further with an algorithm or machine learning model.

What is a Box Cox transformation?

The dependent variable in a regression analysis might not satisfy one or more assumptions of ordinary least squares regression: the residuals could either curve as the prediction increases or follow a skewed distribution. In such scenarios it is necessary to transform the response variable so the data meets the required assumptions. A Box-Cox transformation is a statistical technique for transforming a non-normal dependent variable into a normal shape. Since normality is an assumption of many statistical techniques, applying a Box-Cox transformation means you can run a broader range of tests.
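The one-parameter form of the transform, defined for positive y, with \lambda usually estimated by maximum likelihood (scipy.stats.boxcox does this, for example):

y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln y, & \lambda = 0 \end{cases}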

How do you define the number of clusters in a clustering algorithm?

Elbow plot: plot the within-group sum of squares against the number of clusters and look for a bend; that point suggests the k for k-means, as sketched below.
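A minimal elbow-plot sketch, assuming scikit-learn and matplotlib; inertia_ is k-means' within-cluster sum of squares:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
ks = range(1, 10)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]

plt.plot(ks, wss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-group sum of squares")
plt.show()                               # look for the bend: that k is the elbow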

List disadvantages of linear models

The linearity assumption may not hold for the data. They assume errors are independent, so they can't handle autocorrelation. They can't solve overfitting problems on their own. You can't use them to model count or binary outcomes without extensions such as logistic regression.

Why Is Re-sampling done?

Re-sampling is done to:
Estimate the accuracy of sample statistics by using subsets of accessible data or drawing randomly with replacement from a set of data points.
Substitute labels on data points when performing significance tests.
Validate models by using random subsets (bootstrapping, cross-validation).

If you have 4GB RAM but want to train your model on 10GB data how can you do this?

For neural networks: load the data as a memory-mapped NumPy array (np.memmap), which maps the data set on disk rather than loading it all into memory; index the array to pull out one small batch at a time and feed each batch to the network.

For SVMs: partial fitting will work. Divide the one big data set into smaller data sets, use a partial-fit method, which requires only a subset of the complete data set, and repeat for the other subsets. A sketch of this idea follows.
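A hedged sketch of the same out-of-core idea using pandas chunking with scikit-learn's SGDClassifier, whose hinge loss approximates a linear SVM and which supports partial_fit; the file name and the "label" column are hypothetical:

import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="hinge")                 # linear-SVM-style objective
classes = [0, 1]                                  # must be declared on the first call

for chunk in pd.read_csv("big_data.csv", chunksize=100_000):
    X = chunk.drop(columns=["label"]).to_numpy()  # "label" is an assumed column name
    y = chunk["label"].to_numpy()
    clf.partial_fit(X, y, classes=classes)        # trains on one chunk at a time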

You are given a dataset consisting of variables having more than 30% missing values. How will you deal with this?

If the data set is huge, you can simply remove the rows that have missing values; this is the quickest way to deal with it. If the data set is small, you can substitute the missing values with the mean of the rest of the column using a pandas DataFrame in Python, i.e. df.fillna(df.mean()).

What are hash table collisions?

If the range of key values is larger than the size of our hash table, which is almost always the case, then we must account for the possibility that two different records with two different keys can hash to the same table index. There are a few different ways to resolve this issue; in hash table vernacular, the solution implemented is referred to as collision resolution.

What is the difference between inductive, deductive, and abductive learning?

Inductive learning describes smart algorithms that learn from a set of instances to draw conclusions. In statistical ML, k-nearest neighbor and support vector machine are good examples of inductive learning. There are three literals in (top-down) inductive learning: Arithmetic literals Equality and inequality Predicates In deductive learning, the smart algorithms draw conclusions by following a truth-generating structure (major premise, minor premise, and conclusion) and then improve them based on previous decisions. In this scenario, the ML algorithm engages in deductive reasoning using a decision tree. Abductive learning is a DL technique where conclusions are made based on various instances. With this approach, inductive reasoning is applied to causal relationships in deep neural networks.

What cross-validation technique would you use on a time series data set?

Instead of using k-fold cross-validation, you should be aware of the fact that a time series is not randomly distributed data; it is inherently ordered chronologically. For time series data you should use techniques like forward chaining, where you model on past data and then test on forward-facing data:

fold 1: training [1], test [2]
fold 2: training [1 2], test [3]
fold 3: training [1 2 3], test [4]
fold 4: training [1 2 3 4], test [5]
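scikit-learn implements exactly this forward-chaining scheme as TimeSeriesSplit; a minimal sketch on a toy series:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    print("train:", train_idx, "test:", test_idx)  # each test fold follows its training folds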

What are the algorithm techniques in ML?

Learning to learn
Reinforcement learning (deep adversarial networks, Q-learning, and temporal difference)
Semi-supervised learning
Supervised learning (decision trees, linear regression, naive Bayes, nearest neighbor, neural networks, and support vector machines)
Transduction
Unsupervised learning (association rules and k-means clustering)

What are some situations where a general linear model fails?

Linear regressions are sensitive to outliers. E.g., if most of your data lives in the range (20, 50) on the x-axis but you have one or two points out at x = 200, this could significantly swing your regression results. Similarly, if you build your regression on the range x in (20, 50) and then try to use that model to predict a y-value for x = 200, this is pretty significant extrapolation and is not necessarily going to be accurate.

Overfitting: it is easy to overfit your model such that the regression begins to model the random error (noise) in the data rather than just the relationship between the variables. This most commonly arises when you have too many parameters compared to the number of samples.

Linear regressions are meant to describe linear relationships between variables, so if there is a nonlinear relationship, you will have a bad model. However, you can sometimes compensate for this by transforming some of the parameters with a log, square root, etc.

What is p-value?

The p-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true; loosely, the probability that the result obtained was due to chance. Generally a p-value < 0.05 (and sometimes < 0.01 or other values, depending on the trial design) indicates statistical significance: p < 0.05 means there is less than a 5% probability that the result occurred by chance alone.

Explain the 80/20 rule

People usually start with an 80-20% split (80% training set, 20% test set) and then split the training set once more into an 80-20% ratio to create the validation set.

What is RMSE and MSE in linear regression models?

RMSE stands for root mean square error and MSE stands for mean square error. They are the most common measures of accuracy for a linear regression model. The formulas are below.
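The standard definitions, where y_i is the observed value and \hat{y}_i the prediction over n test instances:

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}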

Explain what precision and recall are. How do they relate to the ROC curve?

Recall describes what percentage of true positives are identified as positive by the model. Precision describes what percentage of positive predictions were correct. The ROC curve shows the relationship between recall (the true positive rate) and the false positive rate, i.e. 1 minus specificity, where specificity measures the percentage of true negatives identified as negative by the model. Recall, precision, and the ROC curve are all measures used to identify how useful a given classification model is.

How do you select features for a model? What do you look for?

Feature selection:
Reduces overfitting: less redundant data means less opportunity to make decisions based on noise.
Improves accuracy: less misleading data means modeling accuracy improves.
Reduces training time: fewer data points reduce algorithm complexity, and algorithms train faster.

Methods to avoid overfitting

Regularization (e.g. LASSO) to penalize some parameters; cross-validation techniques (e.g. k-fold cross-validation); keeping the model simple by using fewer variables.

Explain SVM

SVM stands for support vector machine. It is a supervised machine learning algorithm that can be used for both regression and classification. If you have n features in your training data set, SVM tries to plot the data in n-dimensional space, with the value of each feature being the value of a particular coordinate. SVM uses hyperplanes to separate out different classes based on the provided kernel function.

What is selection bias?

Selection (or 'sampling') bias occurs, in an 'active' sense, when the sample data that is gathered and prepared for modeling has characteristics that are not representative of the true, future population of cases the model will see. That is, active selection bias occurs when a subset of the data is systematically (i.e., non-randomly) excluded from analysis.

How do you deal with sparsity?

Sparse data means incomplete input data or data with many missing values, on which we train machine learning models to predict. Common ways of dealing with it include imputing the missing values, dropping extremely sparse features, and using models or data structures designed to handle sparsity.

What is sampling?

Statistical analysis technique used to select, manipulate and analyze a representative subset of data points to identify patterns and trends in the larger data set being examined.

What is the difference between SQL and MySQL or SQL Server?

SQL is Structured Query Language, a standard language for accessing and manipulating databases. MySQL is a database management system that uses SQL, as do SQL Server, Oracle, Informix, Postgres, etc.

What is the difference between structured and unstructured data?

Structured data is highly-organized and formatted in a way so it's easily searchable in relational databases. Unstructured data has no pre-defined format or organization, making it much more difficult to collect, process, and analyze.

What is the difference between supervised and unsupervised ML?

Supervised machine learning requires training on labelled data; unsupervised machine learning doesn't require labelled data.

What is the purpose of the condition operators BETWEEN and IN?

The BETWEEN operator displays rows based on a range of values. The IN condition operator checks for values contained in a specific set of values.

Explain how a ROC curve works

The ROC curve is a graphical representation of the contrast between true positive rates and false positive rates at various thresholds. It is often used as a proxy for the trade-off between sensitivity (the true positive rate) and the false positive rate.

What is the difference between cross joins and natural joins?

The cross join produces the cross product or Cartesian product of two tables. The natural join is based on all the columns having the same name and data types in both tables.

What is the difference between univariate, bivariate and multi-variate analysis?

The difference is in the number of variables used. Univariate analysis uses one variable; its purpose is to describe the data and find patterns in it. Bivariate analysis uses two variables; its purpose is to find a relationship between the two. Multi-variate analysis uses more than two variables; its purpose is to understand the relationships among three or more variables simultaneously.

A couple has two children, at least one of which is a girl. What is the probability that they have two girls?

There are 4 equally likely possibilities : BB, BG, GB and GG; where B = Boy and G = Girl and the first letter denotes the first child. You can exclude the first case of BB. Thus from the remaining 3 possibilities of BG, GB & GG, you find the probability of the case with two girls. The probability of having 2 girls, given one girl is 1/3.

How would you maintain a deployed model?

There are four essential steps:
Monitor to determine the performance accuracy of the model.
Calculate evaluation metrics of the current model to determine if a new algorithm is needed.
Compare the two models to determine which performs best.
Rebuild the best-performing model using the current state of data.

What is a hash table?

There are two parts to a hash table. The first is an array, the actual table where the data is stored; the other is a mapping function known as the hash function. It's a data structure that implements an associative array abstract data type that maps keys to values: the hash function computes an index into an array of slots or buckets from which the desired value can be found.

What are the feature selection methods to select the right variables?

There are two types of methods: Filter methods include linear discriminant analysis, ANOVA and Chi-square (most commonly used). These methods are meant to pull the bad data out. Wrapper Methods include forward selection, backward selection and recursive feature elimination.

Sensitivity

True positive rate (TP/P)

What is regularization?

When you have underfitting or overfitting issues in a statistical model, you can use the regularization technique to resolve it. Regularization techniques like LASSO help penalize some model parameters if they are likely to lead to overfitting.

If a table contains duplicate rows, does a query result display the duplicate values by default? How can you eliminate duplicate rows from a query result?

Yes. You can eliminate duplicate rows with the DISTINCT keyword.

What steps would you take to evaluate the effectiveness of your ML model?

You first have to split the data set into training and test sets. You also have the option of using a cross-validation technique to further segment the data into composite training and test sets. Then you implement a selection of performance metrics: the confusion matrix, accuracy, precision, recall or sensitivity, specificity, and the F1 score. For the most part you can use measures such as accuracy, the confusion matrix, or the F1 score; however, it is critical to demonstrate that you understand the nuances of how each model can be measured by choosing the right performance measure to match the problem.

What is a random forest?

a data construct that's applied to ML projects to develop a large number of random decision trees while analyzing variables. These algorithms can be leveraged to improve the way technologies analyze complex data sets. The basic premise here is that multiple weak learners can be combined to build one strong learner.

r-squared value

a statistical measure that represents the proportion of the variance of a dependent variable that's explained by an independent variable or variables in a regression model. Whereas correlation explains the strength of the relationship between an independent and dependent variable, R-squared explains to what extent the variance of one variable explains the variance of the second variable. R² = 1 - SS_res/SS_total, where SS_res is the residual sum of squares and SS_total is the total sum of squares.

What is an exact test?

a test where all assumptions, upon which the derivation of the distribution of the test statistic is based, are met as opposed to an approximate test (in which the approximation may be made as close as desired by making the sample size big enough). This will result in a significance test that will have a false rejection rate always equal to the significance level of the test. For example an exact test at significance level 5% will in the long run reject true null hypotheses exactly 5% of the time

What are Z-scores and how are they useful?

a z-score (also called a standard score) gives you an idea of how far from the mean a data point is. z = (x - mean)/SD

What is the default ordering of data using the ORDER BY clause? How could it be changed?

ascending. It can be changed using the DESC keyword, after the column name in the ORDER BY clause.

What is Naive in a Naive Bayes?

Naive Bayes is based on Bayes' theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event. The algorithm is 'naive' because it assumes the features are independent of one another, an assumption that may or may not turn out to be correct.

What is ensemble learning?

combine models to improve the stability and predictive power of the model:
- bagging (fit models on small sample populations and take the mean of the predictions; reduces variance)
- boosting (an iterative technique that adjusts the weight of an observation based on the last classification; decreases bias error but may overfit the training data)

What is overfitting and how can you avoid overfitting of your model?

condition where a model begins to describe the random error in the data rather than the relationships between variables, which reduces the model's usefulness outside the original dataset. This problem occurs when the model is too complex. There are three main ways to avoid overfitting a model:
1. Keep the model simple by taking into account fewer variables, which reduces some of the noise in the training data.
2. Use cross-validation techniques such as k-fold cross-validation.
3. Use regularization techniques such as LASSO that penalize certain model parameters if they are likely to cause overfitting.

What is the purpose of the NVL function?

converts a NULL value to an actual value

What criteria would you use to select a representative sample?

diversity, consistency, and transparency. The sample must be as diverse as the data set. Any changes observed in the sample data should also be reflected in the true population. A discussion should be had within the analytics team to decide the appropriate sample size and structure that is a true representative of the full data set.

overfit

does well on training data but not on test data

What is an eigenvalue? Eigenvector?

An eigenvector is a direction that a particular linear transformation merely stretches, compresses, or flips without rotating; the eigenvalue is the factor by which the transformation stretches or compresses along that direction.

What is caching and why do you use it in data science?

enables content to be retrieved faster because an entire network round trip is not necessary. Caching can be necessary to save various data files when the process of loading and/or manipulating data takes a considerable amount of time. There will be caching on the server where already computed elements may not need to be recomputed. When you want to access some data that takes a lot of time and resources to look up, you cache it so that the next time you want to look up that same data, the process of doing so is more efficient.

What is random forest?

ensemble learning method where we grow multiple trees. To classify, each tree gives a classification and the forest chooses the classification with the most votes.

For given points, how will you calculate the Euclidean Distance in Python? Given points: plot1 = [1,3] plot2 = [2,5]

from math import sqrt  # sqrt must be imported for this to run
euclidean_distance = sqrt( (plot1[0]-plot2[0])**2 + (plot1[1]-plot2[1])**2 )
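An equivalent NumPy one-liner (an alternative, not part of the original answer):

import numpy as np

plot1, plot2 = [1, 3], [2, 5]
print(np.linalg.norm(np.array(plot1) - np.array(plot2)))  # sqrt(1 + 4) ≈ 2.236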

What is the purpose of the group functions in SQL?

get summary statistics of a data set. COUNT, MAX, MIN, AVG, SUM, and DISTINCT

Explain inner join, left join, right join, and union

An inner join returns rows where both tables have a match; a left join returns every row from the left table, with NULL for right-table columns where there is no match; a right join is the opposite of a left join; and a union stacks the results of two queries into a single result set (a full join returns all of the data from both tables combined).

What are the types of sorting algorithms in R?

insertion, bubble, selection

How is k-NN different from k-means clustering?

k-NN: classification algorithm, where the k is an integer describing the number of neighboring data points that influence the classification of a given observation. K-means: clustering algorithm, where the k is an integer describing the number of clusters to be created from the given data.

What is selection bias and what are the different types?

kind of error that occurs when a model builder decides what data is going to be used in a way that doesn't allow for randomization. It is the distortion of statistical analysis accuracy resulting from the non-randomized method of collecting samples.

What is linear regression?

linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables)

gini impurity

measure of how impure the nodes of a decision tree are; for a two-class node, gini = 1 - (prob of no)^2 - (prob of yes)^2
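A small sketch generalizing the two-class formula to any number of classes (gini = 1 - Σ pᵢ²):

import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - (p ** 2).sum()

print(gini(["yes", "no", "no", "no"]))   # 1 - 0.25^2 - 0.75^2 = 0.375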

In python how is memory managed?

memory is managed in a private heap space. This means that all the objects and data structures will be located in a private heap. However, the programmer won't be allowed to access this heap. Instead, the Python interpreter will handle it. At the same time, the core API will enable access to some Python tools for the programmer to start coding. The memory manager will allocate the heap space for the Python objects while the inbuilt garbage collector will recycle all the memory that's not being used to boost available heap space

mean absolute error

model evaluation metric used with regression models. The mean absolute error of a model with respect to a test set is the mean of the absolute values of the individual prediction errors over all instances in the test set. Each prediction error is the difference between the true value and the predicted value for the instance.

What is the binomial probability formula?

P(X = k) = [n! / (k!(n-k)!)] · p^k · (1-p)^(n-k), where n is the number of trials, k the number of successes, and p the probability of success on a single trial. The factor n!/(k!(n-k)!) alone is just the binomial coefficient ("n choose k").

What are the data objects in R?

numeric (both integer and double), character and logical

What is Logistic Regression?

process that models the relationship between a dependent variable (what you want to predict) and one or more independent variables (features) by estimating probabilities using its underlying logistic function (i.e. the sigmoid). This technique is used to predict a binary outcome: zero or one, yes or no. The two types of logistic regression are binary and multinomial; binary deals with two categories whereas multinomial deals with three or more.
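The underlying logistic (sigmoid) function that maps the linear combination of features to a probability:

p(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}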

What is tree pruning?

removing sub-nodes of a decision tree to reduce its complexity and avoid overfitting

What does UNION do? What is the difference between UNION and UNION ALL?

removes duplicate records (where all columns in the results are the same), UNION ALL does not

What is A/B testing?

statistical hypothesis testing process whereby a hypothesis is made about the relationship between two data sets and those data sets are then compared against each other to determine if there is a statistically significant relationship or not. A prediction is made that dataset B will perform better than dataset A. Then both data sets are observed and compared to determine if B is a statistically significant improvement over A.
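A hedged sketch of the comparison step, assuming conversion counts and a two-proportion z-test from statsmodels; the numbers are made up:

from statsmodels.stats.proportion import proportions_ztest

conversions = [200, 240]          # variant A, variant B
visitors = [1000, 1000]
stat, p_value = proportions_ztest(conversions, visitors)
print(p_value)                    # a small p-value suggests B's lift is statistically significant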

Explain random sampling, stratified sampling, and cluster sampling.

In simple random sampling, every member of the population has an equal chance of being selected. In a stratified random sample, the population is divided into strata, or sub-populations, before sampling. Cluster sampling can seem very similar at first glance; however, in cluster sampling the actual cluster is the sampling unit, whereas in stratified sampling analysis is done on elements within each stratum.

Explain decision trees

supervised machine learning algorithm mainly used for regression and classification. It breaks down a data set into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. Decision trees can handle both categorical and numerical data.

What are recommender systems?

system that predicts the rating or preference that a user would give to a product (or other choice). There are two different types of recommender systems: collaborative filtering and content-based filtering. Collaborative makes recommendations based on other users with similar interests. Content-based filtering uses the properties of the product to recommend products with similar properties.

What are the data types in python?

text (str), numeric (int, float, complex), sequence types (list, tuple, range), mapping (dict), set types (set, frozenset), boolean (bool), binary (bytes, bytearray, memoryview)

What is regularization?

the process of adding a tuning parameter to a model to induce smoothness in order to prevent overfitting. This is most often done by adding a constant multiple of an existing weight vector's norm, typically the L1 (lasso) or L2 (ridge) norm. The model predictions should then minimize the loss function calculated on the regularized training set.

Root

top of decision tree

What is the difference between a tuple and a list in python?

tuples are immutable; lists are mutable

What is the difference between type I and type II error?

type I error: when the null hypothesis is true, but is rejected. type II error: when the null hypothesis is false, but erroneously fails to be rejected

What are common probability distributions?

uniform, bernoulli, binomial, poisson, normal, log normal, students t, chi squared, gamma, beta, exponential

impurity

when a decision tree has nodes that are neither 100% one outcome nor the other

What is a statistical interaction?

when the effect of one factor (input variable) on the dependent variable (output variable) differs among levels of another factor

How would you create a logistic regression model?

Broadly: collect and clean the data, encode categorical features, split into training and test sets, fit a logistic regression on the training set (checking for class imbalance and multicollinearity), and evaluate on the test set with a confusion matrix, ROC-AUC, or log loss. A sketch follows.
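A minimal end-to-end sketch, assuming scikit-learn and a synthetic binary data set:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))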

I have two models of comparable accuracy and computational performance. Which one should I choose for production and why?

Prefer the simpler, more interpretable model. When accuracy and cost are comparable, the simpler model is easier to explain to stakeholders, debug, maintain, and retrain, and it is less likely to break as the data drifts.

How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?

Check the regression assumptions with residual plots, examine goodness-of-fit metrics such as adjusted R² and RMSE, and, most importantly, validate out of sample: hold out a test set or use k-fold cross-validation and compare the error on unseen data with the training error.

In your opinion, which is more important when designing a machine learning model: model performance or model accuracy?

It depends on the problem, but overall model performance on the metric that matches the business goal matters more than raw accuracy. With imbalanced classes, for example, a model can reach high accuracy by always predicting the majority class while being useless, so metrics such as precision, recall, or AUC are more informative.

What is one way that you would handle an imbalanced data set that's being used for prediction (i.e., vastly more negative classes than positive classes)?

Resample the data: undersample the majority class or oversample the minority class (e.g., with SMOTE). Alternatives include weighting the classes in the loss function and evaluating with precision/recall rather than accuracy.

