CS 396 Midterm


KNN Distance Metrics

- Euclidean distance: straight-line distance.
- Cosine distance: good for documents and images; measures the angle between vectors, so it is less sensitive to the magnitude of each dimension.
- Jaccard distance: used for set data.
- Hamming distance and edit distance: used for string data.
- Manhattan distance: coordinate-wise (city-block) distance.
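
A minimal sketch of how these metrics might be computed in Python (using scipy.spatial.distance; the vectors and strings are made-up examples):

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

print(distance.euclidean(a, b))   # straight-line distance
print(distance.cityblock(a, b))   # Manhattan: sum of coordinate-wise differences
print(distance.cosine(a, b))      # ~0 because a and b point in the same direction

# Jaccard distance on sets: 1 - |intersection| / |union|
s1, s2 = {"cat", "dog", "fish"}, {"dog", "fish", "bird"}
print(1 - len(s1 & s2) / len(s1 | s2))

# Hamming distance on equal-length strings: number of differing positions
x, y = "karolin", "kathrin"
print(sum(c1 != c2 for c1, c2 in zip(x, y)))
```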

Dirty Data Problems

- Naming conventions
- Missing fields
- Redundant records
- Formatting issues
- Different representations of the same idea or concept
- Primary key violations

Accuracy, Recall and Precision

Accuracy is the fraction of all labels that were predicted correctly: (tp + tn) / (tp + tn + fp + fn). Precision is, of the instances marked positive, the proportion that were actually positive: tp / (tp + fp). It is more important when only a subset of the positive data is needed anyway. Recall is, of all the actually positive instances, the proportion the model found: tp / (tp + fn). It is more important when the target event happens rarely.
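
A small sketch of these formulas in Python (the confusion-matrix counts are made up for illustration):

```python
# Hypothetical confusion-matrix counts
tp, tn, fp, fn = 40, 50, 5, 5

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # fraction of all predictions that are correct
precision = tp / (tp + fp)                    # of everything marked positive, how much was right
recall    = tp / (tp + fn)                    # of everything actually positive, how much was found
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean (see "F measure")

print(accuracy, precision, recall, f1)
```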

Methods to Collect Digital Data

APIs (secondary data), bulk downloads (secondary data), scraping or web crawling (primary data), and custom apps (primary data) with which you can record information.

SVM

A classifier that maximizes the margin between the training data and the classification boundary. Maximizing the margin maximizes the chance that classification will be correct on new data. It is guaranteed to converge to the optimal boundary in most cases. You can add kernels and transformations to the input data, so SVMs work well even with unstructured data. Always try an SVM first, because kernel functions let you solve complex problems. However, SVMs take a long time to train, it is not easy to find a good kernel, and they are not robust to noise.
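
A hedged sketch using scikit-learn's SVC on toy data; the RBF kernel and parameter values are illustrative assumptions, not recommendations from the notes:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy non-linearly-separable data
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An RBF kernel lets the SVM draw a non-linear boundary while still maximizing the margin
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```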

JSON

A data-interchange format that is consistent and easy to read for both computers and humans. JSON has arrays and objects; an object is essentially an unordered set of name/value pairs.
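
A small illustration with Python's built-in json module (the record itself is made up):

```python
import json

text = '{"name": "Ada", "courses": ["CS 396", "Stats"], "enrolled": true}'

obj = json.loads(text)            # JSON object -> Python dict, JSON array -> list
print(obj["courses"][0])          # "CS 396"

print(json.dumps(obj, indent=2))  # back to a JSON string, pretty-printed
```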

Contour Plot

A graphical technique for representing a 3-dimensional surface by plotting constant-z slices, called contours, in a 2-dimensional format.

Schema

A schema is a description of a particular collection of data, using a given data model.

Bayes' Theorem

A theorem that enables the use of sample information to revise prior probabilities. It says that the probability of A given B equals the probability of A times the probability of B given A, divided by the probability of B: P(A|B) = P(A) * P(B|A) / P(B).
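
A quick numeric sketch of the formula (the disease/test numbers are invented for illustration):

```python
# Hypothetical numbers: 1% of people have a disease, the test catches 95% of cases,
# and it false-alarms on 5% of healthy people.
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

# Total probability of a positive test
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(disease | positive) = P(disease) * P(positive | disease) / P(positive)
p_disease_given_pos = p_disease * p_pos_given_disease / p_pos
print(round(p_disease_given_pos, 3))  # about 0.161
```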

Gradient Boosting Tree

Along with logistic regression, it can solve 90% of the problems encountered in reality, and it is the standard recipe for winning ML competitions. A GBT model consists of many trees, each trained on the residual of the previous trees: each new tree explains the error left over from the trees before it. The final prediction is a weighted sum of the outputs of all the trees.
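
A minimal sketch of the residual-fitting idea using scikit-learn decision trees; the learning rate, tree depth, and data are assumptions for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.zeros_like(y)   # start from a constant (zero) model
trees = []

for _ in range(50):
    residual = y - prediction                      # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    prediction += learning_rate * tree.predict(X)  # weighted sum of tree outputs

print("training MSE:", np.mean((y - prediction) ** 2))
```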

K-fold Cross-Validation

Break the data into k equal-sized subsets (folds). For each i in 1, ..., k:
- Train a model on all the other folds
- Test the model on fold i
Then compute the average performance over the k runs.
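
A short sketch of the procedure with scikit-learn's KFold (the model and dataset are placeholders):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
scores = []

for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                  # train on the other folds
    scores.append(model.score(X[test_idx], y[test_idx]))   # test on the held-out fold

print("average accuracy over the k runs:", np.mean(scores))
```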

P.D.F to C.D.F

The CDF is essentially the integral of the PDF: F(x) = P(X <= x) is the integral of p(t) from minus infinity up to x. The PDF is just a probability density function, such as the density of a normal distribution.
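
A small sketch with scipy's normal distribution, showing that numerically integrating the PDF recovers the CDF:

```python
import numpy as np
from scipy import stats

xs = np.linspace(-4, 4, 2001)
pdf = stats.norm.pdf(xs)                 # density of a standard normal

# Numerically integrate the PDF from -inf (approximated by -4) up to each x
cdf_numeric = np.cumsum(pdf) * (xs[1] - xs[0])

print(cdf_numeric[-1])            # ~1.0: total probability
print(stats.norm.cdf(0.0))        # 0.5, matches cdf_numeric near x = 0
```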

Cautions with Data Collection

Caution 1 is to comply with all policies. This includes the General Data Protection Regulation (GDPR), which essentially says that you have to tell people when you are collecting data, how and for what purpose, and how long you will retain it. We also have the Health Insurance Portability and Accountability Act (HIPAA). Caution 2 is to be unbiased in data collection. In almost all cases, data collection is a form of sampling. We should be performing uniform sampling, but we need to be aware of the dataset we are sampling from.

Clustering

Clustering has several goals:
- Segmentation: segment a large set of cases into small subsets that can be treated similarly (e.g. image segmentation, separating the things in an image).
- Compression: generate a more compact description of a dataset (e.g. handwritten digit recognition).
- Underlying process: model an underlying process that generates the data as a mixture of different, localized processes (e.g. the accents of different groups of people).

What is/are Data

A collection of information that can be stored: digitally, on paper, or really in any form. Some things cannot be stored digitally in full, like human memory or all the digits of pi. There are also different types of data: numerical, text, multimedia, or a combination of these.

Stacking

Combine model outputs using a second-stage learner, such as linear regression.

Boxplots

Convey location and variation information in data sets, and are particularly useful for detecting and illustrating location and variation changes between different groups of data. They are good for outlier detection. Boxplots can help answer questions such as: how do the groups vary, how does variation differ between groups, and are there outliers?

Pearson correlation coefficient

Correlation is a statistical technique used to determine the degree to which two variables are related. The best first look is a scatter plot. We can use the Pearson correlation coefficient, r, to quantify correlation. Unfortunately it is not suitable for non-linear correlations and is sensitive to outliers.

Spearman rank correlation coefficient

Based on the ranks of the values rather than on how well the data fits a line, so it is better for non-linear (but monotonic) relationships and for data with outliers. Let X = [x1, ..., xn] and Y = [y1, ..., yn] be two lists, and let rank(x) = k if x is the k-th smallest value in its list. With di = rank(xi) - rank(yi), the coefficient is rho = 1 - 6 * sum(di^2) / (n * (n^2 - 1)). Both the Pearson and Spearman correlations can only check monotonic correlation.
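
A short comparison using scipy; the monotonic but non-linear data is invented to show where the two coefficients differ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 100)
y = np.exp(x) + rng.normal(scale=1.0, size=100)   # monotonic but strongly non-linear

r, _ = stats.pearsonr(x, y)       # measures how well the data fits a line
rho, _ = stats.spearmanr(x, y)    # rank-based, so the non-linearity doesn't hurt it

print("Pearson r:", round(r, 3))
print("Spearman rho:", round(rho, 3))
```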

Multivariate Non-graphical EDA

Cross tabulation, covariance, and the correlation coefficient. Cross tabulation is essentially a contingency table.
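
A tiny pandas cross-tabulation sketch (the survey data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "vote":   ["yes", "no", "yes", "yes", "no", "no"],
})

# Cross tabulation: a contingency table of counts for each (gender, vote) pair
print(pd.crosstab(df["gender"], df["vote"]))
```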

Data Collection

Data collection is vital because improperly collected data can lead to bad conclusions, poor models, or the inability to answer the problem you set out to solve. There are several different types of data: traditional data based on the senses, modern social types of data (what people want to hear, see, sense, etc.), opinion data, and data about data (metadata).

Data Compatibility

Data needs to be standardized so that we can make apples-to-apples comparisons. This is particularly important when collecting data from multiple sources: unit conversions, number representations, time/name unification, etc. Vigilance in data integration is essential. Name unification is how we standardize naming conventions: an overly simplified transformation introduces collisions (for example, different people mapped to the same name), while an overly complicated transformation introduces inconsistency in the data. Time unification standardizes the formatting of date-related data, e.g. by using UTC or datetime objects. Financial unification deals with things like currency conversion and correcting for inflation.
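
A small sketch of time unification with Python's standard datetime and zoneinfo modules (the timestamps and source formats are invented):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Two sources report the same moment in different local formats
a = datetime(2023, 3, 1, 14, 30, tzinfo=ZoneInfo("America/Chicago"))
b = datetime.strptime("2023-03-01 21:30", "%Y-%m-%d %H:%M").replace(tzinfo=ZoneInfo("Europe/Paris"))

# Unify both to UTC before comparing or storing
a_utc = a.astimezone(timezone.utc)
b_utc = b.astimezone(timezone.utc)
print(a_utc == b_utc)   # True: same instant once unified
```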

What is Data Science

Data science is the combination of computer science, statistics, and domain science or knowledge.

Common Paradigm in Data Science

First we have to find the best data sources for our problem. Once we have the raw data, we clean it to remove invalid or potentially uninformative data. Once the data is clean, we have a dataset to which we can attempt to apply a model. The last step is visualization, although visualization is used throughout the entire data science cycle.

Unsupervised Machine Learning

Given a set of data X, we want to learn the underlying structure of the data. The most common use of unsupervised machine learning is to cluster data into groups of similar examples. For example, an unsupervised machine learning algorithm can cluster songs together based on various properties of the music. The resulting clusters can become an input to other machine learning algorithms (for example, to a music recommendation service). Clustering can be helpful in domains where true labels are hard to obtain. For example, in domains such as anti-abuse and fraud, clusters can help humans better understand the data. Another example of unsupervised machine learning is principal component analysis (PCA). For example, applying PCA on a data set containing the contents of millions of shopping carts might reveal that shopping carts containing lemons frequently also contain antacids.

Histograms

Histograms typically show:
- Center of the data
- Spread of the data
- Skew of the data
- Outliers
- Presence of multiple modes in the data
Histograms can help answer questions that relate to the things above.

K-means clustering

Informally, the goal is to find groups of points that are close to each other but far from points in other groups. The standard k-means algorithm is based on Euclidean distance, and a simple greedy algorithm locally optimizes this measure. Steps in k-means clustering:
1. Assign each item to a cluster based on its distance to the cluster centers.
2. Recompute each center as the mean of all items assigned to it.
3. Repeat 1 and 2 until there is no improvement.
It's a greedy algorithm with random setup, so the solution isn't optimal and can vary significantly with different initial points. Performance is O(nk) per iteration (n = number of data points, k = number of clusters), which is not bad and can be heuristically improved. Clustering is used more than it should be, because people assume an underlying domain has discrete classes in it; in reality the underlying data is usually continuous.
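
A compact sketch of the two alternating steps in NumPy (toy 2-D data, k = 3; a real implementation would also handle empty clusters and use smarter initialization):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
k = 3
centers = X[rng.choice(len(X), k, replace=False)]   # random setup

for _ in range(20):
    # Step 1: assign each point to its nearest center (Euclidean distance)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 2: recompute each center as the mean of its assigned points
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centers, centers):   # no improvement -> stop
        break
    centers = new_centers

print(centers.round(2))
```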

Complementary cumulative distribution function

It is in many ways superior to plotting the PDF directly, but it is no longer a simple representation of the distribution of the data. We plot P(x) instead of p(x), where P(x) is the integral of p(x') from x to infinity. If p(x) follows a power law with exponent a, the CCDF also follows a power law, with exponent a - 1, so if we plot P(x) on log scales we get a straight line. Cumulative distributions with a power-law form are sometimes said to follow Zipf's law or a Pareto distribution.

QQ plot

Quantile-quantile plot: a graphical technique for determining whether two data sets come from the same distribution. It can help answer questions such as: do these two datasets come from a common distribution, do they have similar tail behavior, and does one dataset follow a given distribution?

Regularization

Regularization is a technique used to combat the overfitting problem in statistical models. L1 regularization: the sum of the absolute values of the model coefficients should be small. L2 regularization: the sum of squares of the model coefficients should be small.
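
A brief sketch contrasting L1 (Lasso) and L2 (Ridge) penalties in scikit-learn; the data and penalty strengths are arbitrary:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)  # only 2 features matter

l1 = Lasso(alpha=0.1).fit(X, y)   # L1: pushes many coefficients exactly to zero
l2 = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all coefficients toward zero

print("Lasso coefficients:", np.round(l1.coef_, 2))
print("Ridge coefficients:", np.round(l2.coef_, 2))
```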

Variation

Not all things we measure are peaked around a typical value. Some vary over many orders of magnitude (size of cities vs size of humans).

Null and Alternative Hypothesis

The null hypothesis H0 is a claim of no difference in the population. The alternative hypothesis Ha claims that H0 is false. We collect data and seek evidence against H0 as a way of supporting Ha. The alternative hypothesis can be one-sided or two-sided.

Treating Missing Values

To deal with missing values you can remove the rows or columns, substitute specific values, forward-fill or back-fill, or impute. Imputation is the most common way to fill in missing values: mean-value, random-value, or interpolation imputation. Random-value imputation permits statistical evaluation of the impact of imputation. Imputation by interpolation uses linear regression (or interpolation between neighboring values) to predict missing values.
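
A short pandas sketch of several of these options (the series values are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

print(s.dropna())             # remove missing entries
print(s.fillna(s.mean()))     # impute with the mean value
print(s.ffill())              # forward fill from the previous observation
print(s.interpolate())        # linear interpolation between neighbors
```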

Scientific Plotting

Rules of thumb for plotting: explain the axes, include units, distribute the points evenly, and use color to distinguish different groups.

KNN

Find the k most similar data points, take their labels, and return the most frequent label. KNN requires no training because the data is the model, and accuracy improves as the number of samples increases. Prediction is simple: find the k nearest neighbors and return the label that occurs most frequently among them. The only real configuration choices are the number of neighbors, the distance metric, and how to weigh the labels of the neighbors.
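
A small sketch with scikit-learn's KNeighborsClassifier; the values of k, the metric, and the weighting are just examples of the three configuration knobs above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k neighbors, Euclidean distance, neighbors weighted by inverse distance
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean", weights="distance")
knn.fit(X_train, y_train)     # "training" just stores the data
print("test accuracy:", knn.score(X_test, y_test))
```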

Primary Data

Primary data is information collected for the specific purpose at hand. It is collected from first-hand experience through things such as observation, interviewing, measurements, and case studies. Pros: great control, accurate data, reliable data, authentic. Cons: resource-consuming, doesn't scale well.

Secondary Data

Secondary data is existing information that was previously gathered for a purpose other than the study at hand. It can be collected from research, online records, APIs, etc. Pros: cheap, quick, a wider geographical area can be covered. Cons: not collected for the specific research needs, possibly not up to date.

Identifying Power Law Behavior

The standard strategy is to make a simple histogram of the data, plot it on a log-log scale, and see whether it looks like a straight line. This can produce a noisy scatter plot in the tail, because each bin there contains only a few samples, so there are large fluctuations and sampling errors.

Star Plot

Star plots are used to examine the relative values for a single data point and to locate similar or dissimilar points. They can help answer questions such as: which variables are dominant for a given observation, are there outliers, and which observations are most similar?

First Part of Data Science Pipeline

The first portion of the data science pipeline is finding the right data sources (data collection) and then taking the raw data and converting it into a dataset (data management). Data collection is finding data sources, whether primary or secondary. Data management is the process of taking the raw data and turning it into a useful dataset using tools such as pandas, JSON, and SQL.

P value

The p-value is the likelihood of the observed test statistic, given that the null hypothesis is true. Rough interpretation: greater than 0.1 is not significant, 0.05-0.1 is marginally significant, 0.01-0.05 is significant, and less than 0.01 is highly significant. A non-significant p-value is not conclusive: finding a non-significant p-value is not a validation of the null hypothesis.

Pandas

There are two essential concepts: Series and DataFrames. A Series is like a named Python list (think of a one-entry dict whose value is a list). A DataFrame is a collection of Series.
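
A tiny illustration (the column names and values are made up):

```python
import pandas as pd

# A Series: a named, labeled list of values
ages = pd.Series([21, 22, 20], name="age")

# A DataFrame: a collection of Series sharing the same index
df = pd.DataFrame({"name": ["Ann", "Bo", "Cy"], "age": ages})
print(df)
print(df["age"].mean())   # each column is itself a Series
```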

Data Management

Data fits on a structure spectrum:
- Structured data (schema first): relational databases. This is data already stored in databases in some ordered manner; it accounts for about 20% of data. Use tools like SQL to grab it.
- Semi-structured data (schema later): tagged text, JSON, or XML files. It does not have the formal data models associated with relational databases or other forms of data tables, but it contains tags or markers to separate semantic elements and enforce hierarchies of records and fields within the data. Known as a self-describing structure.
- Unstructured data (schema never): plain text and the like. Any data with an unknown form or structure; it poses multiple challenges in processing and in deriving value from it.

Data Sources, Management and Cleaning Data

There is primary and secondary data collection. Primary data is collected through things such as surveys, polls, and interviews. Secondary data is collected through the internet, the government, financial reports, etc. Data comes in all forms, so we need to clean it: data might have typos, invalid or missing data, inconsistent data, OCR errors, etc. Small data typically lives in flat files; medium-sized datasets are typically built using SQL; large-scale datasets typically live in distributed file systems (think Capital One). Different types of data require different processes.

Outliers and Errors

These can bias model training. Discover them by using visualizations and summary statistics: scatter plots, box plots, etc. Outliers are not usually well defined, so you have to do a case-by-case analysis; if you are studying some rare disease, the outliers are probably exactly the data you want. To clean them you can drop, interpolate, or substitute.

Graphical EDA

This includes Histograms, box plots, QQ plots, scatter plots, contour plots, spectral plots, and star plots.

Data Cleaning and Transformation

This is typically the most time-consuming part of the data science pipeline. The statistics view of data is that there is some process that produces data, and this data is a sample of the output of that process; results are probabilistic and there can be bias in the data. The CS view of data is: I got my hands on some data which has some invalid entries, and we should improve the data's quality. A domain expert's view of data is something like: this answer doesn't look right, this data doesn't look right, what exactly happened? Data scientists combine all of these perspectives.

Logarithmic Binning

This means varying the width of the bins in a histogram so that bin sizes increase exponentially. To reduce noise in a log-log plot, you normalize the sample counts by the bin widths. Bins in the tail then collect more samples, which reduces statistical errors and makes the plot much clearer. Even so, for a fixed ratio a between the widths of successive bins, the number of samples in the k-th bin still decreases as k increases, and most power-law distributions in nature have exponents between 2 and 3, so noisy tails can still happen.
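
A sketch of logarithmic binning with NumPy; the synthetic power-law samples, the exponent, and the bin-width ratio are all arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Draw samples from a power law p(x) ~ x^-2.5 for x >= 1 (inverse-transform sampling)
a = 2.5
x = (1 - rng.uniform(size=100_000)) ** (-1 / (a - 1))

# Bin edges that grow by a constant factor (here 1.5) instead of a constant width
edges = 1.5 ** np.arange(0, 30)
counts, _ = np.histogram(x, bins=edges)

# Normalize by bin width so the wide tail bins aren't over-counted
widths = np.diff(edges)
density = counts / (widths * len(x))
print(density[:10])   # on a log-log plot these fall roughly on a straight line
```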

Boosting

Train models iteratively, each new model on the updated output (the errors) of the previous ones. Boosting is sequential.

Bagging

Train weak models in parallel on different samples of the data, then combine by voting or averaging.

Univariate non graphical EDA

Univariate non-graphical EDA is essentially making preliminary assessments about the population distribution using the sample. Here you might use the mean, mode, median, IQR, range, or variance. Additionally you might have some measure of skewness (compare the median and mean), Pearson's coefficient of skewness, etc. Last but not least, you can incorporate hypothesis testing. For example, in a hypothesis test of the median you can use a chi-square test: assume the median is some value m, then count the values greater than and less than m. The expected count is s / 2 on each side, where s is the sample size, giving a 1x2 contingency table with df = 1. You can apply the same reasoning to the IQR as well.

Ensemble Methods

Use multiple algorithms to obtain better predictive performance than could be obtained from any of the algorithms by itself.

Two-sample Test

We have two population samples and we want to study the relationship between the two populations. We assume that the two samples have the same population variance and want to test whether they have the same mean. The test here is two-sided. We use the t distribution, with the appropriate degrees of freedom, to calculate the statistic.

Exploratory Data Analysis

Visualizing data while seeking patterns, in contrast with statistical analysis, in which mathematics is used to determine the likelihood that a pattern exists by chance. EDA's goal is to maximize insight into a data set and its underlying structure, obtaining things such as a list of outliers, the important features or factors, and the uncertainties in the conclusions. EDA is highly iterative with the data cleaning/management process and the data modeling process.

Supervised Machine Learning

We are given some features and labels and we want to learn the function that converts features to the labels. This is either a classification or regression problem.

Hypothesis Testing of Mean with Known Variance

We are given the population standard deviation or variance. We use the Z statistic to calculate the test statistic.

Hypothesis testing of independence using Chi Square

We are testing the independence of variables (e.g. are gender and voting preference independent?). The null hypothesis assumes that they are independent and the alternative assumes they are dependent. The chi-square statistic compares observed and expected counts.

Decision Tree

We can find a small tree in a greedy manner by recursively choosing the best split feature at each node. At each node, we choose the split that results in the "purest cut"; purity is quantified by entropy reduction (information gain). If we hit the depth limit, we output the most popular class at that node.
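
A small sketch of the purity measure: entropy and information gain for one hypothetical split (the labels are made up):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])                   # 4 positive, 4 negative
left, right = np.array([1, 1, 1, 0]), np.array([1, 0, 0, 0])  # a candidate split

# Information gain = parent entropy - weighted average of child entropies
gain = entropy(parent) - (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
print(round(gain, 3))   # higher gain -> purer split
```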

Statistical Tests

We have a few tests:
- KS test
- Chi-square test for independence
- Chi-square test for goodness of fit
- Hypothesis testing with known variance
- Hypothesis testing with unknown variance (one-sample and two-sample)
If we know the population variance, we use the z distribution; otherwise we use the t distribution. For a one-sample test we have some population mean and, lacking the population variance, we use the sample variance. For a two-sample test we have two samples, assume equal or unequal population variances, and test whether the means are equal, incorporating the degrees of freedom. The hypothesis test of independence is straightforward: use chi-square to compare expected and actual counts, with degrees of freedom (# rows - 1) x (# cols - 1). For goodness of fit there are two types of data: for continuous data it is best to use the KS test to see if the samples follow the same distribution, and for categorical data it is best to use the chi-square test. When testing, make sure that all expected counts are ≥ 1 and that at least 80% of expected counts are ≥ 5. The degrees of freedom here equal the number of proportions - 1. Note: the categories must be mutually exclusive.

Goodness of Fit Testing using Chi Square test

We have a set of samples and want to know if they follow a certain distribution. For continuous distributions, we use the KS test; for categorical distributions, we use the chi-square test. The null hypothesis is that the sample follows the distribution. For the expected counts, each outcome i has some probability pi under the null hypothesis, so with n samples the expected count of i is n * pi. The chi-square statistic for goodness of fit with k proportions measures how much the observed counts differ from the expected counts: chi-square = sum over i of (observed_i - expected_i)^2 / expected_i. It follows the chi-square distribution with k - 1 degrees of freedom. The chi-square test for goodness of fit is used when we have a sample from a population and the variable is categorical with k mutually exclusive levels. We can safely use the chi-square test when all expected counts have values ≥ 1.0 and more than 80% of the k expected counts have values ≥ 5.0.
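
A quick sketch with scipy: a hypothetical die-fairness check (the observed counts are invented):

```python
import numpy as np
from scipy import stats

observed = np.array([18, 22, 16, 14, 19, 31])      # rolls of a die, n = 120
expected = np.full(6, observed.sum() / 6)           # fair die: n * pi = 20 per face

chi2, p = stats.chisquare(observed, expected)       # df = k - 1 = 5
print("chi-square:", round(chi2, 2), "p-value:", round(p, 3))
```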

Hypothesis Testing of Mean with Unknown Variance

We use the t distribution for this. We still have the same structure as z tests. There are now one sample and two sample tests.

Calculating the exponent of power law distribution

We want to estimate the exponent a from observed data. The most commonly used method is to fit the slope of the line in one of the log-log plots; however, this is known to introduce systematic biases into the value of the exponent. Instead we use a method based on maximum likelihood estimation, where we assume some xmin, the minimum value of x above which the power-law behavior holds, and use it in the estimate. Extracting a value of a from real-world distributions can be tricky: it requires a judgement call about the value xmin above which the distribution follows the power law.
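
A sketch assuming the standard continuous MLE estimator, a_hat = 1 + n / sum(ln(x_i / xmin)), applied only to samples above an assumed xmin; the data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
true_a = 2.5
samples = (1 - rng.uniform(size=50_000)) ** (-1 / (true_a - 1))   # power law with xmin = 1

xmin = 1.0                                  # judgement call: where power-law behavior starts
tail = samples[samples >= xmin]
a_hat = 1 + len(tail) / np.sum(np.log(tail / xmin))
print(round(a_hat, 3))                      # should be close to 2.5
```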

Linear Regression

We want to find the best linear function to explain the data: the line that minimizes the squared distances between the points and the line.
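
A tiny least-squares sketch with NumPy (synthetic data around a known line):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=100)   # true slope 2, intercept 1, plus noise

slope, intercept = np.polyfit(x, y, deg=1)   # minimizes the sum of squared residuals
print(round(slope, 2), round(intercept, 2))
```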

Kolmogorov-Smirnov Test

We want to know whether two sets of samples are drawn from the same distribution. Given two sets of samples, the KS test computes a significance level indicating how likely it is that the two sets are drawn from the same distribution; simple visualizations or comparing means and variances are subjective and not a good measurement. Steps in the KS test:
1. Compute the CDF of each dataset.
2. Compute the maximum distance between the two CDFs.
3. The test rejects at significance level alpha if the max distance is greater than C(alpha) * sqrt((n + m) / (n * m)), where C(alpha) = sqrt(-0.5 * ln(alpha)).
The null hypothesis is always that the samples come from the same distribution. You take the raw data, convert each set to a PDF and then a CDF, find the maximum difference between the CDFs, and compute the statistic above.
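
A quick sketch with scipy's two-sample KS test (synthetic samples drawn from two slightly different distributions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=500)
b = rng.normal(loc=0.5, scale=1.0, size=400)   # shifted mean

stat, p = stats.ks_2samp(a, b)   # stat = max distance between the two empirical CDFs
print("max CDF distance:", round(stat, 3), "p-value:", p)
# A small p-value -> reject the null hypothesis that a and b come from the same distribution
```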

The Central Dogma of Statistics

You have some population, which you sample using probability. On this sample we compute descriptive statistics, which we then use to apply inferential statistics back to the population.

Hypothesis Testing

A decision-making process for evaluating claims about a population, also called a significance test. There are a few different types: mean with known/unknown variance, test of independence, and goodness-of-fit testing. You test a claim about a parameter using evidence. The procedure consists of four steps:
1. State the null and alternative hypotheses
2. Compute the test statistic
3. Find the p-value and interpret it
4. Compare against the significance level

One Sample Test

a hypothesis test comparing a sample mean to a given population mean.

Logistic Regression

A nonlinear regression model that relates a set of explanatory variables to a dichotomous (binary) dependent variable. Essentially, you map a regression onto a logistic function; an example is predicting whether studying for a test leads to passing. Logistic regression is designed as a binary classifier (output, say, {0, 1}) but actually outputs the probability that the input instance is in the "1" class. It is the most widely used general-purpose classifier: very scalable and can be very fast to train. It is used for things like spam filtering, news message classification, web site classification, and most classification problems with large, sparse feature sets. The biggest issue is that it can overfit on very sparse data, so it is often used with regularization.
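
A small sketch with scikit-learn on a toy dataset; the scaling step and the regularization strength C are arbitrary choices illustrating the "used with regularization" point:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# L2 regularization is on by default; C is the inverse regularization strength
clf = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print("P(class 1) for one example:", clf.predict_proba(X_test[:1])[0, 1])
```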

F measure

Combines precision and recall (their harmonic mean). Use the F score when labels are highly imbalanced.

Power Law Distribution

The frequency of an occurrence varies as a power of some attribute (e.g. size) of that event. We plot the data on a log-log scale; if there is a visually striking linear relationship, we can take the exponential and recover the probability density function.

Log-log scale

ln p(x) = -a * ln(x) + c, where a and c are constants. If we take the exponential of both sides we get p(x) = C * x^(-a) for some constant C and exponent a > 0. A distribution of this form is a power-law distribution.

Spectral Plot

A graphical technique for examining cyclic structure in the frequency domain, based on the Fourier transform.

