DATA SCIENCE


Gradient Descent

A technique to minimize loss by computing the gradients of loss with respect to the model's parameters, conditioned on training data. Informally, gradient descent iteratively adjusts parameters, gradually finding the best combination of weights and bias to minimize loss.
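
As a rough illustration (not from the source), here is a minimal gradient-descent loop in Python for a one-feature linear model with squared-error loss; the data, learning rate, and step count are assumed for the sketch.

```python
# Minimal sketch: gradient descent for y ≈ w*x + b with mean-squared-error loss.
import numpy as np

def gradient_descent(x, y, lr=0.01, steps=5000):
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        y_pred = w * x + b
        # Gradients of MSE = (1/n) * sum((y_pred - y)^2) with respect to w and b
        grad_w = (2.0 / n) * np.sum((y_pred - y) * x)
        grad_b = (2.0 / n) * np.sum(y_pred - y)
        # Step each parameter in the direction that reduces the loss
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                  # true weights: w = 2, b = 1
print(gradient_descent(x, y))      # approaches (2.0, 1.0)
```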

Supervised vs. Unsupervised learning

SUPERVISED = a response (target) variable is available. UNSUPERVISED = no response variable is available.

Overfitting

In supervised learning, overfitting happens when the model captures the noise along with the underlying pattern in the data. It typically happens when we train the model too long on a noisy dataset. Overfit models have low bias and high variance. Very flexible models, such as deep decision trees, are especially prone to overfitting.

Probability Disjoint

Two events are mutually exclusive if they cannot occur at the same time. Another word that means mutually exclusive is disjoint. If two events are disjoint, then the probability of them both occurring at the same time is 0. Disjoint: P(A and B) = 0

What do we mean by the variance and bias of a statistical learning method?

Variance refers to the amount by which f̂ would change if we estimated it using a different training data set. Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.

Hyperparameter

Choices about the algorithm that we set rather than learn. What is the best value of k to use? What is the best distance metric to use? We must try different settings to see which work best for a specific problem.

Sample Standard Deviation Variance Formula

s = √( Σ(x - x̄)² / (n - 1) ), where s = sample standard deviation, Σ = sum of..., x̄ = sample mean, n = number of scores in the sample. The sample variance is s².

Mean Squared Error (MSE)

the average of the squared differences between the forecasted and observed values
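
A small worked sketch with assumed values:

```python
# MSE as the average of squared differences between observed and forecasted values.
import numpy as np

observed = np.array([3.0, 5.0, 2.5, 7.0])
forecast = np.array([2.5, 5.0, 4.0, 8.0])
mse = np.mean((observed - forecast) ** 2)
print(mse)  # (0.25 + 0 + 2.25 + 1) / 4 = 0.875
```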

Expected test MSE, for a given value x0, can always be decomposed into the sum of three fundamental quantities:

the variance of f̂(x0), the squared bias of f̂(x0), and the variance of the error term ε: E[(y0 - f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε).

chi-square test for independence

uses the frequency data from a sample to evaluate the relationship between two variables in the population

Population Standard Deviation Variance Formula

σ = √( Σ(x - μ)² / N ), where σ = population standard deviation, Σ = sum of..., μ = population mean, N = number of scores in the population. The population variance is σ².

dependent and independent variables

The dependent variable is generally the y-variable, because its value depends on the x-variable chosen. The x-variable is generally the independent variable, because its values can be chosen or set directly.

Dimension Reduction

Process of reducing the number of variables to consider in a data-mining approach.

Types of variables

Categorical (qualitative) - a limited number of distinct categories.
a) Ordinal - categorical values with an inherent order, e.g. results of a survey (satisfied, neutral, unsatisfied).
b) Nominal - if the levels of a categorical variable do not have an inherent order, the variable is simply categorical (e.g. male or female).
Numerical (quantitative) - numerical values.
a) Continuous - an infinite number of possible values within a given range, often measured, e.g. height.
b) Discrete - a specific set of numeric values that can be counted or enumerated, often counted, e.g. number of pets in the household.

Parametric vs Non-parametric methods

Continuous data arise in most areas of medicine. Familiar clinical examples include blood pressure, ejection fraction, forced expiratory volume in 1 second (FEV1), serum cholesterol, and anthropometric measurements. Methods for analysing continuous data fall into two classes, distinguished by whether or not they make assumptions about the distribution of the data. Theoretical distributions are described by quantities called parameters, notably the mean and standard deviation [1].

Methods that use distributional assumptions are called parametric methods, because we estimate the parameters of the distribution assumed for the data. Frequently used parametric methods include t tests and analysis of variance for comparing groups, and least squares regression and correlation for studying the relation between variables. All of the common parametric methods ("t methods") assume that in some way the data follow a normal distribution and also that the spread of the data (variance) is uniform either between groups or across the range being studied. For example, the two sample t test assumes that the two samples of observations come from populations that have normal distributions with the same standard deviation. The importance of the assumptions for t methods diminishes as sample size increases.

Alternative methods, such as the sign test, Mann-Whitney test, and rank correlation, do not require the data to follow a particular distribution. They work by using the rank order of observations rather than the measurements themselves. Methods which do not require us to make distributional assumptions about the data, such as the rank methods, are called non-parametric methods. The term non-parametric applies to the statistical method used to analyse data, and is not a property of the data [1].

As tests of significance, rank methods have almost as much power as t methods to detect a real difference when samples are large, even for data which meet the distributional requirements. Non-parametric methods are most often used to analyse data which do not meet the distributional requirements of parametric methods. In particular, skewed data are frequently analysed by non-parametric methods, although data transformation can often make the data suitable for parametric analyses [2].
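
To make the contrast concrete, a small sketch (with assumed, artificially generated skewed samples) comparing a parametric t-test and the rank-based Mann-Whitney test on the same two groups:

```python
# The t-test assumes approximately normal data; the Mann-Whitney test uses only ranks.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.lognormal(mean=0.0, sigma=0.5, size=40)   # skewed data
group_b = rng.lognormal(mean=0.3, sigma=0.5, size=40)

print(stats.ttest_ind(group_a, group_b))      # parametric: compares means
print(stats.mannwhitneyu(group_a, group_b))   # non-parametric: compares rank distributions
```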

Conditional Probability

The probability of event A given that event B has occurred: P(A|B) = P(A and B) / P(B). For example, if P(A) = 0.6, P(B) = 0.5, and P(A|B) = 0.7, then P(A and B) = P(A|B) × P(B) = 0.35.

5 Assumptions of Linear Regression

Linear regression is an analysis that assesses whether one or more predictor variables explain the dependent (criterion) variable. The regression has five key assumptions:
- Linear relationship
- Multivariate normality
- No or little multicollinearity
- No auto-correlation
- Homoscedasticity
A common mnemonic is LINE: Linear relationship, Independent observations, Normally distributed around the line, Equal variance across X's.
https://www.coursera.org/lecture/regression-modeling-practice/lesson-4-linear-regression-assumptions-ZUFlh
https://www.jmp.com/en_us/statistics-knowledge-portal/what-is-regression/simple-linear-regression-assumptions.html

Logistic Regression vs Linear Regression

Logistic regression does not make many of the key assumptions of linear regression and general linear models that are based on ordinary least squares algorithms, particularly regarding linearity, normality, homoscedasticity, and measurement level. Firstly, it does not need a linear relationship between the dependent and independent variables; it can handle all sorts of relationships because it applies a non-linear log transformation to the predicted odds ratio. Secondly, the independent variables do not need to be multivariate normal (although multivariate normality yields a more stable solution), and the error terms (the residuals) do not need to be multivariate normally distributed. Thirdly, homoscedasticity is not needed: logistic regression does not require the variances to be equal for each level of the independent variables. Lastly, it can handle ordinal and nominal data as independent variables; the independent variables do not need to be metric (interval or ratio scaled).

Simple Linear Regression Model

A method of finding the best model for a linear relationship between the explanatory and response variable. Simple linear regression is useful for finding the relationship between two continuous variables: one is the predictor or independent variable and the other is the response or dependent variable. It looks for a statistical relationship, not a deterministic relationship. A relationship between two variables is deterministic if one variable can be exactly expressed by the other; for example, temperature in degrees Celsius exactly determines Fahrenheit. A statistical relationship is not exact, for example the relationship between height and weight. In the model y = b0 + b1·x, the values b0 and b1 must be chosen so that they minimize the error. If the sum of squared errors is taken as the metric to evaluate the model, the goal is to obtain the line that best reduces that error.
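
A minimal sketch of the closed-form least-squares fit in Python, with assumed data:

```python
# Ordinary least squares for one predictor: b1 = cov(x, y) / var(x), b0 = ȳ - b1·x̄.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)            # intercept and slope of the fitted line
print(b0 + b1 * 6.0)     # predicted response for x = 6
```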

Confidence Interval (CI)

A range of values, calculated from the sample observations, that is believed, with a particular probability, to contain the true value of a population parameter. A 95% confidence interval, for example, implies that were the estimation process repeated again and again, then 95% of the calculated intervals would be expected to contain the true parameter value. Note that the stated probability level refers to properties of the interval and not to the parameter itself, which is not considered a random variable.

It is not quite correct to ask about the probability that the interval contains the population mean. It either does or it doesn't. There is no chance about it. What you can say is that if you perform this kind of experiment many times, the confidence intervals would not all be the same; you would expect 95% of them to contain the population mean, you would expect 5% of the confidence intervals to not include the population mean, and you would never know whether the interval from a particular experiment contained the population mean or not.
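
A sketch of computing a 95% confidence interval for a mean with scipy, using an assumed small sample and the t distribution (population standard deviation unknown):

```python
import numpy as np
from scipy import stats

sample = np.array([4.2, 5.1, 4.8, 5.6, 4.9, 5.3, 4.4, 5.0])
mean = sample.mean()
sem = stats.sem(sample)   # standard error of the mean
# 95% interval from the t distribution with n - 1 degrees of freedom
ci = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(ci)  # in repeated sampling, ~95% of such intervals cover the true mean
```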

Z test

A z-test is used for testing the mean of a population against a standard, or comparing the means of two populations, with a large sample size (n ≥ 30). It is based on the z-statistic, which follows the standard normal distribution under the null hypothesis. Situations where you may perform z-tests include the one-sample location test, the two-sample location test, the paired difference test, and maximum likelihood estimation. You can also use z-tests to determine whether predictor variables in probit analysis and logistic regression have a significant effect on the response; the null hypothesis states that the predictor is not significant. For n < 30, perform a t-test instead.

K-nearest neighbor (K-NN)

An improved adaptation of the nearest neighbor classifier that, instead of copying the label from the single nearest neighbor, takes the majority vote of the K closest points (K is a number we choose that specifies how many points vote). By specifying the distance metric, the algorithm can be applied to different kinds of data: for example, pixel-wise differences for images, or a text-specific function that compares one paragraph to another.
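
A minimal sketch of the idea in Python with assumed toy data (the distance metric and the value of K are choices):

```python
# k-nearest-neighbour classification: majority vote among the K closest training points.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean (L2) distance from the new point to every training point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.5, 1.0])))  # "A"
```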

Explain bias/variance trade off

As we use more flexible methods, the variance will increase and the bias will decrease. The relative rate of change of these two quantities determines whether the test MSE increases or decreases. As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases; consequently, the expected test MSE declines. However, at some point increasing flexibility has little impact on the bias but starts to significantly increase the variance. When this happens the test MSE increases. (ISLR Figure 2.12 illustrates this: squared bias, variance, Var(ε), and test MSE plotted against flexibility for the three data sets in Figures 2.9-2.11, with a vertical dotted line marking the flexibility level with the smallest test MSE.)

Student t Distribution vs. Normal Distribution

Basically, the Student's t-distribution approximates the normal distribution for small sample sizes. Visually, the Student's t distribution looks much like a normal distribution but generally has fatter tails. Fatter tails allow for a higher dispersion of variables, as there is more uncertainty. In the same way that the z-statistic is related to the standard normal distribution, the t-statistic is related to the Student's t distribution. With n - 1 degrees of freedom and a significance level of alpha, t = (x̄ - μ) / (s / √n): the sample mean minus the population mean, divided by the standard error of the sample. As you can see, it is very similar to the z-statistic; after all, this is an approximation of the normal distribution. The last characteristic of the Student's t-statistic is its degrees of freedom: for a sample of n, we have n - 1 degrees of freedom, so for a sample of 20 observations the degrees of freedom are 19.
https://www.youtube.com/watch?v=32CuxWdOlow

Backpropagation

Before we get to the actual adjustments, think of what would be needed at each neuron in order to make a meaningful change to a given weight. Since we are talking about the difference between actual and predicted values, the error would be a useful measure here, and so each neuron will require that their respective error be sent backward through the network to them in order to facilitate the update process; hence, backpropagation of error. Updates to the neuron weights will be reflective of the magnitude of error propagated backward after a forward pass has been completed.
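
A compact sketch of the mechanism on an assumed toy setup (a 2-4-1 sigmoid network learning XOR, not from the source): the forward pass computes predictions, the output error is propagated backward through the hidden layer, and each weight is adjusted in proportion to the error gradient that reaches it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])            # XOR targets
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 1.0

for _ in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: the output error is sent back layer by layer (chain rule)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # weight updates proportional to the propagated error
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(np.round(out.ravel(), 2))   # typically approaches [0, 1, 1, 0]
```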

Setting Hyperparameters/How data should be split

Data should be split into three parts: TRAIN, VALIDATION, TEST. If you split the data into only two parts, train and test, you run the risk of tuning hyperparameters in a way that works only for the test data. That is why the data should be split into three parts. By tuning the algorithm on the validation set and obtaining the final score on the test set, we get a true measure of how the algorithm performs on never-seen data. Alternatively, the dataset can be divided into two parts, with cross-validation on the training portion used for tuning hyperparameters and the final score obtained on test data that was never used.
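
A sketch of the three-way split with scikit-learn, assuming 60/20/20 proportions and the iris data purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# First carve off the test set, then split the remainder into train/validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # roughly 60% / 20% / 20%
```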

Nearest Neighbor Classifier

During training, the algorithm simply memorizes all of the data (copies a pointer). During prediction, the algorithm returns the label of the training example that most closely matches the input (it finds the most similar image in the training data). By specifying the distance metric, the algorithm can be applied to different kinds of data: for example, pixel-wise differences for images, or a text-specific function that compares one paragraph to another.

Assumptions of Logistic Regression

Logistic regression drops many of the assumptions of linear regression, but some other assumptions still apply.

First, binary logistic regression requires the dependent variable to be binary, and ordinal logistic regression requires the dependent variable to be ordinal. Reducing an ordinal or even metric variable to a dichotomous level loses a lot of information, which makes this test inferior to ordinal logistic regression in those cases.

Second, since logistic regression assumes that P(Y=1) is the probability of the event occurring, the dependent variable must be coded accordingly: for a binary regression, factor level 1 of the dependent variable should represent the desired outcome.

Third, the model should be fitted correctly: neither overfitting nor underfitting should occur. Not only should only the meaningful variables be included, but all meaningful variables should be included. A good approach to ensure this is to use a stepwise method to estimate the logistic regression.

Fourth, the error terms need to be independent. Logistic regression requires each observation to be independent; the data points should not come from any dependent-samples design, e.g. before-after measurements or matched pairings. The model should also have little or no multicollinearity, i.e. the independent variables should be independent of each other. (There is still the option to include interaction effects of categorical variables in the analysis and the model.) If multicollinearity is present, centering the variables (deducting the mean of each variable) might resolve the issue; if this does not lower the multicollinearity, a factor analysis with orthogonally rotated factors should be done before the logistic regression is estimated.

Fifth, logistic regression assumes linearity of the independent variables and the log odds. While it does not require the dependent and independent variables to be related linearly, it requires the independent variables to be linearly related to the log odds. Otherwise the test underestimates the strength of the relationship and rejects the relationship too easily, finding it not significant (not rejecting the null hypothesis) where it should be significant. One solution is categorization of the independent variables, i.e. transforming metric variables to ordinal level and then including them in the model. Another approach is discriminant analysis, if the assumptions of homoscedasticity, multivariate normality, and absence of multicollinearity are met.

Lastly, logistic regression requires quite large sample sizes, because maximum likelihood estimates are less powerful than ordinary least squares (e.g. simple linear regression, multiple linear regression): while OLS needs about 5 cases per independent variable in the analysis, ML needs at least 10 cases per independent variable, and some statisticians recommend at least 30 cases for each parameter to be estimated.

Underfitting

In supervised learning, underfitting happens when a model is unable to capture the underlying pattern of the data. These models usually have high bias and low variance. It happens when we have too little data to build an accurate model, or when we try to fit a linear model to nonlinear data. Such models are too simple to capture the complex patterns in the data, for example linear and logistic regression.

Distance Metric to compare images

L1 (Manhattan) distance - the sum of the pixel-wise absolute differences: subtract each training-image pixel from the corresponding test-image pixel, take the absolute value, then add them all together.
L2 (Euclidean) distance - the square root of the sum of the squared differences between the two inputs.
Different distance metrics make different assumptions about the underlying topology of the space: the set of points at a fixed L1 distance from the origin forms a square (diamond), while L2 forms a circle. This matters because L1 depends on the coordinate frame, whereas L2 does not.
https://www.youtube.com/watch?v=hl3bQySs8sM
https://www.youtube.com/watch?v=7a1lj4RBfvU
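
A small numeric sketch with assumed pixel vectors:

```python
# L1 sums absolute pixel-wise differences; L2 is the root of the summed squared differences.
import numpy as np

test_img  = np.array([10.0, 20.0, 30.0, 40.0])     # flattened pixel values
train_img = np.array([12.0, 18.0, 33.0, 36.0])

l1 = np.sum(np.abs(test_img - train_img))            # Manhattan distance = 11
l2 = np.sqrt(np.sum((test_img - train_img) ** 2))    # Euclidean distance ≈ 5.74
print(l1, l2)
```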

Rahul's two favorite foods are bagels and pizza. Let A represent the event that he eats a bagel for breakfast and B represent the event that he eats pizza for lunch. On a randomly selected day, the probability that Rahul will eat a bagel for breakfast, P(A), is 0.6; the probability that he will eat pizza for lunch, P(B), is 0.5; and the conditional probability that he eats a bagel for breakfast given that he eats pizza for lunch, P(A|B), is 0.7. Based on this information, what is P(B|A), the conditional probability that Rahul eats pizza for lunch given that he eats a bagel for breakfast, rounded to the nearest hundredth?

Given: P(A) = 0.6, P(B) = 0.5, P(A|B) = 0.7; find P(B|A).
INDEPENDENT EVENTS: the outcome of one event does not affect the outcome of the other. If A and B are independent, the probability of both occurring is P(A and B) = P(A) × P(B).
DEPENDENT EVENTS: the outcome of one event affects the outcome of the other. If A and B are dependent, the probability of both occurring is P(A and B) = P(A) × P(B|A).
Back to the problem. Since the events are dependent, P(A and B) = P(A|B) × P(B) = P(B|A) × P(A). So P(A and B) = 0.7 × 0.5 = 0.35, and P(B|A) = 0.35 / 0.6 ≈ 0.58.

BAYES RULE. Two events: headache, flu. P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2. You wake up with a headache - what is the chance that you have the flu?

P(H and F) = P(F) × P(H|F) = (1/40) × (1/2) = 1/80. P(F|H) = P(H and F) / P(H) = (1/80) / (1/10) = 1/8.

Prediction vs Inference

Prediction: the goal is to predict the dependent variable (y). Given a new measurement, you want to use an existing data set to build a model that reliably chooses the correct identifier from a set of outcomes. Example: given some information on a Titanic passenger, you want to choose from the set {lives, dies} and be correct as often as possible. (See the bias-variance tradeoff if you wonder how to be correct as often as possible.)
Inference: we are interested in understanding the way that Y is affected as X1, ..., Xp change. Given a set of data, you want to infer how the output is generated as a function of the data. Example: you want to find out what effect Age, Passenger Class, and Gender have on surviving the Titanic disaster. You can fit a logistic regression and infer the effect each passenger characteristic has on survival rates.

Segmentation

Segmentation refers to the act of segmenting data according to your company's needs in order to refine your analyses based on a defined context, using a tool for cross-calculating analyses. The purpose of segmentation is to better understand your visitors, and to obtain actionable data in order to improve your website or mobile app. In concrete terms, a segment enables you to filter your analyses based on certain elements (single or combined). Segmentation can be done on elements related to a visit, as well as on elements related to multiple visits during a studied period. In the latter case, we refer to this segmentation as "visitor segmentation".

Statistical Power

Statistical power is the likelihood that a study will detect an effect when there is an effect there to be detected. If statistical power is high, the probability of making a Type II error, or concluding there is no effect when, in fact, there is one, goes down.

The Pearson Correlation Coefficient (r)

The Pearson product-moment correlation coefficient (or Pearson correlation coefficient, for short) is a measure of the strength of a linear association between two variables and is denoted by r. Basically, a Pearson product-moment correlation attempts to draw a line of best fit through the data of two variables, and the Pearson correlation coefficient, r, indicates how far away all these data points are to this line of best fit (i.e., how well the data points fit this new model/line of best fit). The Pearson correlation coefficient, r, can take a range of values from +1 to -1. A value of 0 indicates that there is no association between the two variables. A value greater than 0 indicates a positive association; that is, as the value of one variable increases, so does the value of the other variable. A value less than 0 indicates a negative association; that is, as the value of one variable increases, the value of the other variable decreases.
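
A sketch of computing r (and its p-value) with scipy on assumed sample data:

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])
r, p_value = pearsonr(x, y)   # p-value tests the null of no linear association
print(r, p_value)             # r close to +1: strong positive linear association
```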

Variance

The variance is defined as the average of the squared differences from the mean. To calculate the variance, follow these steps:
-Work out the mean (the simple average of the numbers).
-Then, for each number, subtract the mean and square the result (the squared difference).
-Then work out the average of those squared differences.
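
A worked sketch of those steps on an assumed small data set:

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mean = data.mean()                        # 5.0
squared_diffs = (data - mean) ** 2        # squared difference of each value from the mean
variance = squared_diffs.mean()           # population variance = 4.0
print(mean, variance, np.sqrt(variance))  # standard deviation = 2.0
```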

Scorecard Building

Behaviour scorecards are used by almost all banks to predict the probability of default of a customer, and key decisions are made based on the behaviour scorecard. Most risk analytics projects revolve around the development and validation of behaviour scorecards. A more advanced version of the same concept can be applied to develop a rating scale, similar to the one used by S&P or Moody's, where an AAA rating indicates a lower risk of default (hence a better rating) than a BBB rating.

P-value

The p-value is the probability of getting a sample like ours, or more extreme than ours, IF the null hypothesis is true. So we assume the null hypothesis is true and then determine how "strange" our sample really is. If it is not that strange (a large p-value), we don't change our mind about the null hypothesis. As the p-value gets smaller, we start wondering whether the null really is true and whether we should change our minds (and reject the null hypothesis).

A little more detail: a small p-value indicates that by pure luck alone, it would be unlikely to get a sample like the one we have if the null hypothesis is true. If it is small enough, we start thinking that maybe we aren't super lucky and instead our assumption that the null is true is wrong. That's why we reject with a small p-value. A large p-value indicates that it would be pretty normal to get a sample like ours if the null hypothesis is true. So, as you can see, there is no reason here to change our minds as we did with a small p-value.

Empirical Rule

The rule gives the approximate percentage of observations within 1 standard deviation (68%), 2 standard deviations (95%), and 3 standard deviations (99.7%) of the mean when the histogram is well approximated by a normal curve.

Standard Deviation

The standard deviation is a statistic that measures the dispersion of a dataset relative to its mean. It is calculated as the square root of the variance, i.e. the square root of the average squared deviation of each data point from the mean. If the data points are further from the mean, there is higher deviation within the data set; the more spread out the data, the higher the standard deviation.

Chi-Square Test of Homogeneity

The test is applied to a single categorical variable from two or more different populations. It is used to determine whether frequency counts are distributed identically across different populations. For example, in a survey of TV viewing preferences, we might ask respondents to identify their favorite program. We might ask the same question of two different populations, such as males and females. We could use a chi-square test for homogeneity to determine whether male viewing preferences differed significantly from female viewing preferences.

Reducible vs Irreducible Error

To approximate reality, learning algorithms use mathematical or statistical models whose "error" can be split into two main components: reducible and irreducible error. Irreducible error, or inherent uncertainty, is associated with natural variability in the system. Reducible error, as the name suggests, can and should be minimized further to maximize accuracy. Reducible error can be further decomposed into error due to squared bias and error due to variance. The data scientist's goal is to reduce bias and variance as much as possible in order to obtain as accurate a model as is feasible. However, there is a tradeoff to be made when selecting models of different flexibility or complexity, and in selecting appropriate training sets, to minimize these sources of error.

Regression Versus Classification Problems

We tend to refer to problems with a quantitative response as regression problems, while those involving a qualitative response are often referred to as classification problems.

Bias and variance using bulls-eye diagram

What is bias? Bias is the difference between the average prediction of our model and the correct value we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the model; it leads to high error on both training and test data.

What is variance? Variance is the variability of model predictions for a given data point, which tells us the spread of our predictions. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data.

If our model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if our model has a large number of parameters, it is likely to have high variance and low bias. So we need to find the right balance, without overfitting or underfitting the data. This tradeoff in complexity is why there is a tradeoff between bias and variance: an algorithm can't be more complex and less complex at the same time.

Total error: to build a good model, we need to find a good balance between bias and variance such that the total error is minimized.

Covariance vs. Correlation

When comparing data samples from different populations, two of the most popular measures of association are covariance and correlation. Covariance and correlation show whether variables have a positive relationship, a negative relationship, or no relationship at all. Sample covariance measures the strength and direction of the relationship between the elements of two samples, and the sample correlation is derived from the covariance. Basically, correlation is covariance normalized to lie within -1 and 1: corr(X, Y) = cov(X, Y) / (σX · σY).
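
A sketch on assumed samples showing that correlation is just covariance rescaled by the two standard deviations:

```python
import numpy as np

x = np.array([2.1, 2.5, 4.0, 3.6, 5.0])
y = np.array([8.0, 10.0, 12.0, 11.5, 14.0])

cov = np.cov(x, y, ddof=1)[0, 1]                       # sample covariance
corr = cov / (np.std(x, ddof=1) * np.std(y, ddof=1))   # normalize to [-1, 1]
print(cov, corr, np.corrcoef(x, y)[0, 1])              # the last two values agree
```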

t test vs z test

Z-tests are statistical calculations that can be used to compare a population mean to a sample's. The z-score tells you how far, in standard deviations, a data point is from the mean of a data set. A z-test compares a sample to a defined population and is typically used for problems with large samples (n > 30). Z-tests can also be helpful when we want to test a hypothesis. Generally, they are most useful when the population standard deviation is known.

Like z-tests, t-tests are calculations used to test a hypothesis, but they are most useful when we need to determine whether there is a statistically significant difference between two independent sample groups. In other words, a t-test asks whether a difference between the means of two groups is unlikely to have occurred by random chance. Usually, t-tests are most appropriate for problems with a limited sample size (n < 30).

Z-tests are utilized when both groups being compared have a sample size of at least 30, while t-tests are used when one or both of the groups have fewer than 30 members. For example, we may use a two-sample z-test to determine whether systolic blood pressure differs between men and women. Or, in a clinical trials setting, we may use a two-sample t-test to determine whether viral load differs between people on the active treatment and those on the placebo or control treatment.

In practice, you can simply always use the t-test if you don't know the population standard deviation a priori. You don't have to worry about when to switch to the z-test, because the t-distribution "switches" for you: it converges to the normal distribution, so it is the correct distribution to use at every N.
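
A sketch of a two-sample t-test with scipy on assumed blood-pressure-style values:

```python
import numpy as np
from scipy import stats

group_a = np.array([120., 125., 130., 118., 127., 122., 135., 128.])
group_b = np.array([131., 129., 138., 140., 125., 133., 142., 136.])
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)   # a small p-value suggests the group means really differ
```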

chi-square goodness of fit test

A statistical test to determine whether an observed pattern of frequencies corresponds to an expected pattern. The test is applied when you have one categorical variable from a single population and is used to determine whether sample data are consistent with a hypothesized distribution. For example, suppose a company printed baseball cards and claimed that 30% of its cards were rookies, 60% were veterans but not All-Stars, and 10% were veteran All-Stars. We could gather a random sample of baseball cards and use a chi-square goodness of fit test to see whether our sample distribution differs significantly from the distribution claimed by the company.
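
A sketch of the baseball-card example with scipy, assuming an observed sample of 100 cards:

```python
from scipy.stats import chisquare

observed = [38, 52, 10]   # rookies, veterans (not All-Stars), veteran All-Stars
expected = [30, 60, 10]   # 30%, 60%, 10% of 100 cards, as claimed
chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
print(chi2, p_value)      # p ≈ 0.20 here, so no strong evidence against the claim
```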

The Bayes Classifier

assigns each observation to the most likely class given its predictor values.

Logistic Regression

ln[P/(1-P)] = α + βX + ε. Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables. Rather than choosing parameters that minimize the sum of squared errors (as in ordinary regression), estimation in logistic regression chooses parameters that maximize the likelihood of observing the sample values. The data are fit to a linear model, which is then passed through a logistic function to predict the target categorical dependent variable.
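
A minimal sketch of fitting a binary logistic regression with scikit-learn; the dataset and the scaling step are assumptions made purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # binary target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale the predictors, then estimate the coefficients by maximum likelihood.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))       # classification accuracy on held-out data
print(model.predict_proba(X_test[:3]))   # predicted class probabilities, P(Y=1) style
```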

When to use a t-test over z-test

When n < 30, or when the population standard deviation is unknown.

