Data100 Final Review


A random sample with replacement is one drawn _________________ at random, with replacement. An SRS is a sample drawn uniformly at random _________________ replacement where every individual has the _________________ chance of being selected, every pair the same as every other pair, etc. With large populations, sampling with or without replacement is pretty much the _________________, but probabilities with replacement are _________________ to compute.

uniformly, without, same, same, easier

The entropy S of a node is S = −∑ p_C log₂ p_C, and it is a measure of how _________________ a node is, with low entropy being _________________ predictable. We can use weighted entropy as a loss function to decide which _________________ to take, weighting each resulting node by the number of samples in it. A fully grown decision tree runs the risk of _________________, and we can't use the _________________ term idea since there is no global function being minimized; the tree is built node by node in a _________________ fashion. We can add more _________________ to prevent growth, or perform _________________ by cutting off less useful branches: if we set aside a _________________ set and replacing a node by its most common prediction has no impact on validation error, don't split.

unpredictable, more, split, overfitting, regularization, greedy, rules, pruning, validation
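
To make the weighted-entropy split criterion concrete, here is a minimal sketch (class labels as plain Python lists; the function names are illustrative, not from any particular library):

    import numpy as np

    def entropy(labels):
        # S = -sum(p_c * log2(p_c)) over the classes present in the node
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def weighted_entropy(left, right):
        # weight each child node's entropy by its share of the samples
        n = len(left) + len(right)
        return len(left) / n * entropy(left) + len(right) / n * entropy(right)

    # a split that isolates one class scores lower (better) than a mixed split
    print(weighted_entropy([0, 0, 0], [1, 1, 1]))  # 0.0
    print(weighted_entropy([0, 1, 0], [1, 0, 1]))  # ~0.92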

Slope is measured in units of y per unit of _________________. The models we create show _________________, not causation, and a snapshot of several people at one instant in time (_________________), not snapshots of the same people over time (_________________). If the new data we test our model on looks nothing like the data we _________________ our model on, there's no guarantee it will be any good. Thus, we need to _________________ our data before we quantify it!

x, association, cross-sectional, longitudinal, fit, visualize

Contour plots are the 2D versions of _________________; by default they show a _________________ distribution on the two axes, and these are the histograms/density curves of each variable independently. Visualization requires a lot of thought; single quantitative variables can use _________________ plots, _________________s, and _________________ plots; two quantitative: _________________ plots, _________________ plots and _________________ plots; and combinations use _________________ plots, overlaid _________________s or _________________ plots and SBS _________________/_________________ plots.

density curves, marginal, rug, histogram, density, scatter, hex, contour, bar, histogram, density, box, violin

We can _________________ records w/ missing values, or _________________ our missing values via _________________ (for example, mean imputation, or hot deck imputation, which uses a _________________ value from the subgroup). In either case we need to check for _________________ bias using domain knowledge, or model the missing values in future analysis.

drop, infer, imputation, random, induced

Data wrangling includes _________________ rows, _________________ columns, performing _________________, creating _________________ tables, applying _________________ methods such as _________________, and _________________ tables together. A model is a useful _________________ of reality which ignores certain things/makes assumptions. Models can be useful to _________________ the world we live in and predict the value of _________________ data. There are two classes: _________________ models, which are based on well-established theories of how the world works, and _________________ models, based upon observation and data.

filtering, selecting, aggregation, pivot, string, regex, joining, simplification, understand, unseen, physical, statistical

A distribution describes the _________________ at which values of a variable occur. All values must be accounted for only _________________, and they must add up to _________________ (or to the number of values we are observing).

frequency, once, 1

Random Variables: we need to know how data is _________________ to understand the world, formalized by RVs and their _________________. Many distributions recur and have names, and one of the most prominent features is the _________________. Statistic: single piece of data, a numerical _________________ (function) of a _________________ (realization of RVs). An estimator is a statistic designed to estimate a _________________.

generated, distributions, expectation, summary, dataset, parameter

Lots of data has a _________________ structure and there are many open data sources. Big data usage is _________________

graph, ubiquitous

We can overlay _________________ or _________________ curves on top of each other, or put _________________ or _________________ plots side by side for comparison; these are concise and are well suited to compare _________________ distributions. Scatter plots are used to reveal _________________ between pairs of _________________ values. We often use them to help inform _________________ choices. We can also use _________________ to help encode categorical variables, but they may suffer from _________________, and the solution is to add some random noise in both directions. Hex plots can be thought of as a 2D _________________, showing the _________________ distribution. The xy-plane is binned into hexagons, and more heavily shaded ones typically indicate greater _________________ or frequency. They make it easier to see _________________ relationships and cover the region better, and the visual _________________ of squares is not a problem.

histograms, density, box, violin, multiple, relationships, numerical, modeling, color, overplotting, histograms, joint, density, linear, bias

Pivot tables take the arguments _________________, _________________, _________________, and _________________. merge is the basic syntax for joining two _________________, and outputs another _________________.

index, columns, values, aggfunc, DF, DF
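
A minimal pandas sketch of both ideas, using a made-up toy DF:

    import pandas as pd

    sales = pd.DataFrame({"region": ["N", "N", "S", "S"],
                          "year": [2020, 2021, 2020, 2021],
                          "amount": [10, 20, 30, 40]})

    # the four pivot_table arguments named above
    pt = sales.pivot_table(index="region", columns="year",
                           values="amount", aggfunc="sum")

    managers = pd.DataFrame({"region": ["N", "S"], "manager": ["Ann", "Bo"]})
    joined = sales.merge(managers, on="region")  # joins two DFs, outputs a DF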

Series: named column of data with an _________________; indexes: mappings of keys to _________________; DF: collection of _________________ with a common index. DF access methods include _________________ on predicates and _________________; df.loc is location by _________________, df.iloc is location by integer _________________, and groupby/pivot are methods of _________________ data.

index, rows, series, filtering, slicing, index, address, aggregating

K-means clustering is to pick an arbitrary _________________, and randomly place k _________________, each a different color. Until convergence, color points according to the _________________ center and move the center for each color to the center of the points of that _________________. This is different than KNN for classification, where the prediction is the most _________________ class among the k-nearest data points in the training set. Every time you run K-means, you get a _________________ output, so we can define a loss to decide which is best. Two metrics include the sum of squared _________________ from each point to its center (which is formally defined as _________________) and the weighted sum of squared distances from each DP to its center (called _________________). We are trying to _________________ inertia but we may fail to find the global optimum. The first optimizer holds the center positions constant and optimizes the _________________, while the second holds the colors constant and optimizes the center _________________. For all possible k^n colorings, compute the k centers and compute the _________________ for those k centers; if the current inertia is better, write it down.

k, centers, closest, color, common, different, distances, inertia, distortion, minimize, colors, positions, inertia
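
A rough numpy sketch of the two alternating steps and the inertia objective (assumes data in an array X and that no cluster empties out during a step):

    import numpy as np

    def inertia(X, centers, colors):
        # sum of squared distances from each point to its assigned center
        return np.sum((X - centers[colors]) ** 2)

    def kmeans_step(X, centers):
        # step 1: hold the centers constant, optimize the coloring
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        colors = dists.argmin(axis=1)
        # step 2: hold the colors constant, optimize the center positions
        centers = np.array([X[colors == k].mean(axis=0)
                            for k in range(len(centers))])
        return centers, colors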

_________________ are easy to distinguish while _________________ are hard, and so are _________________ or _________________ clouds. Also avoid jiggling the _________________, such as in stacked bar charts, histograms, area charts, etc. Overplotting solutions include adding small random noise (_________________) or making the points _________________. We can add context directly to the plot through an informative _________________ (takeaway), axis labels, reference _________________ markers and labels, _________________ if appropriate, and _________________ that describe the data; these should be _________________ and self-contained, describe what has been graphed, draw attention to important _________________ and describe any _________________ drawn.

lengths, angles, areas, word, baseline, jittering, smaller, title, lines, legends, captions, comprehensive, features, conclusions

Estimator is some function of a _________________ whose goal is to estimate a population parameter; denoted theta hat, it is a _________________. A sampling distribution is a distribution of _________________ values over all possible samples, usually unknown. Bias of an estimator is the difference between the estimator's _________________ value and the _________________ value of the parameter being estimated. 0 bias: on average our estimate is _________________; non-zero: consistently too large or small. Variance is the expected squared deviation of an estimator from its _________________. The larger it is, the more it _________________ from the average. We use _________________ to denote estimates and how they relate to optimal thetas. The variance of the sample mean estimator can itself be estimated by the empirical mean of the m squared differences between the m sample means and their average, drawing m random _________________ of size n from the population and applying the _________________ to each.

sample, RV, estimator, expected, true, correct, mean, varies, hats, samples, estimator

Residual plots describe the _________________ of our model, and display the residuals vs. the _________________ values. A good one has no _________________ and the model represents the data relationship well; it also has a similar _________________ spread throughout the entire plot for reliable accuracy. Residuals are orthogonal to the _________________ of X; if there is an _________________ term, the sum of the residuals is _________________, as is their _________________, and the positive/negative residuals cancel out. Also, the predicted y value is the same as the _________________ true y value. There is at least one _________________ parameter that minimizes average loss since the min value of both MSE and MAE is _________________. For a constant model with squared loss, any set of values has a unique _________________, so a unique sol exists. For a simple linear model w/ squared loss, any set of _________________ points has a unique mean, SD, and r. A constant model with absolute loss has a unique sol when there is an _________________ number of y values. For OLS a unique sol typically exists, since _________________ is typically larger than p. However, when XTX is not _________________, a unique solution DNE since there are inf many optimal choices for coeff, which is equivalent to the cols of X not being _________________.

quality, fitted, pattern, vertical, span, intercept, 0, mean, average, model, 0, mean, non-constant, odd, n, invertible, LI

We prefer _________________ data for analysis since it is easy to _________________, coming in the form of _________________ (DFs, which have named columns of different types, manipulated using _________________) and _________________, which are numeric data of the same type, manipulated using _________________.

rectangular, manipulate, tables, languages, matrices, linear algebra

Variance is the expected _________________ from the expectation of X; its units are the _________________ of the units of X, and it must be _________________. It is used to quantify _________________ error. Chebyshev's inequality tells us that the vast _________________ of the distribution lies in the interval expectation ± a few SDs; the variance is 0 if X is a constant. A shift by b units does not affect the _________________, but a _________________ by a does. X is _________________ if we subtract its expectation, and is in _________________ if we also divide by the SD (# of SDs from expectation), where the expectation becomes _________________ and the SD _________________. X1, X2 may have the same distribution but their _________________ may be different; the variance of a sum is affected by the _________________ of the two variables. The covariance is their expected _________________ of deviations, a generalization of variance. If X, Y are independent, knowing X tells us nothing about Y; independence is a _________________ statement, and under it variance is _________________. Correlation is the _________________ scaled by the two SDs. Covariance = 0 is the same as being _________________, as independent RVs are. If so, the variance of the sum is the sum of the _________________. Independent and identically distributed, or _________________, RVs have specific formulas for sums.

squared deviation, square, non-negative, chance, majority, spread, scaling, centered, SU, 0, 1, sums, dependence, product, strong, additive, covariance, uncorrelated, variances, IID
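
The standard identities behind this card, written out (for RVs X and Y; a and b constants):

    \[ \mathrm{Var}(X) = E[(X - E[X])^2], \qquad \mathrm{Var}(aX + b) = a^2 \, \mathrm{Var}(X) \]
    \[ \mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])], \qquad r(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\mathrm{SD}(X)\,\mathrm{SD}(Y)} \]
    \[ \mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y) \]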

We can extend the idea of the KDE to two dimensions, such as in a _________________ plot, which is a 2D KDE; we can create smoothed versions of _________________ plots. Transforming data can reveal _________________; when a distribution has a large dynamic range, it is useful to take the _________________. We can _________________ a scatterplot so we can interpret it better by looking at slopes and intercepts. If we take the log of the y-values, this implies there was an _________________ relationship in the original plot. If we take the log of both, this implies a _________________ relationship or a one-term _________________ in the original plot. Basic functional relationships such as _________________, _________________, and _________________ curves are important. We can also use the Bulge diagram to choose from multiple transformations to _________________ or _________________ the scale of an axis.

contour, scatter, patterns, log, linearize, exponential, power, polynomial, exponential, logarithmic, polynomial, increase, decrease

The MapReduce abstraction uses a _________________ map that allows for re-execution on failures, in which we can use a random _________________ to mitigate sampling issues. The reduce is also commutative and allows for _________________ of operations. An associative reduce allows for _________________ of operations as well. The map function is applied to a _________________ part of a large file in parallel; the output is _________________ for fast recovery on node failure.

deterministic, seed, reordering, regrouping, local, cached

Statistical bias: difference between your _________________ and the _________________. Empirical distribution: of our _________________; values and proportions. Probability distribution: model for how the sample is _________________; values and probabilities (often not known). A random variable takes _________________ values with particular _________________. RVs use _________________ letters, and the values they take are indicated by _________________ letters. The probability distribution of a discrete RV can be expressed as a table or a graph of P(X = x). A function of an RV is also an _________________.

estimate, truth, sample, generated, numerical, probabilities, capital, lowercase, RV

Permutations: how many ways can I arrange a sequence; if I have n objects and I want to select k of them in a way that order _________________, then I can do this in n! / (n-k)! ways. Combinations are a similar situation in which order does _________________ matter, in which we account for overcounting and use the formula known as _________________.

matters, not, n choose k
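
Python's standard library has both counts built in (math.perm and math.comb, available since Python 3.8):

    import math

    # order matters: n! / (n-k)! arrangements
    print(math.perm(5, 2))  # 20
    # order doesn't matter: n! / (k!(n-k)!), i.e. "n choose k"
    print(math.comb(5, 2))  # 10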

The variance of a Bernoulli distribution is _________________ and its expectation is _________________. Those of a binomial dist are _________________ and _________________ respectively. The sample mean is an _________________ estimator of the population mean. The spread of its distribution is dictated by the _________________ law, which says that if you increase the sample size by some factor, the SD decreases by the square root of that factor. The CLT states that no matter what the population looks like, the prob dist of the sum (or mean) of an IID sample is roughly _________________.

p(1-p), p, np(1-p), np, unbiased, square root, normal
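
A quick simulation of both claims (the square root law and the CLT), using an arbitrary non-normal population as an example:

    import numpy as np

    rng = np.random.default_rng(42)
    population = rng.exponential(scale=2.0, size=100_000)  # clearly non-normal

    def sample_means(n, reps=10_000):
        # means of `reps` random samples of size n
        return rng.choice(population, size=(reps, n)).mean(axis=1)

    # square root law: quadrupling n roughly halves the SD of the sample mean
    print(sample_means(100).std(), sample_means(400).std())
    # CLT: a histogram of sample_means(100) looks roughly normal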

A DF is _________________ tabular data, a series is _________________ data (columnar) and an index is a sequence of row _________________. A DF is a collection of _________________ that share the same index. Indices can also be _________________ or have a name, and they do not have to be _________________. However, _________________ names are almost always unique. We can extract a (collection of) Series by performing _________________, sometimes indexing by column using [], where a list argument yields a _________________ and a name argument yields a _________________. We can also index by row numbers using this method if we provide a _________________, which results in a _________________. [] also supports a _________________, which can be generated by logical operators on Series and combined using the _________________ operator, allowing filtering of results by multiple criteria. The isin function makes it convenient to find _________________ that match one of many possible values, and the _________________ command is an alternate way to combine 1+ conditions (using the keyword _________________).

2D, 1D, labels, series, non-numeric, unique, column, column selection, DF, series, numerical slice, DF, boolean array, &, rows, query, and

1) Does my data contain unrealistic/incorrect values? 2) Does my data violate obvious dependencies? 3) Was the data entered by hand? 4) Are there obvious signs of data falsification? Issues may include _________________ values or _________________ values standing in for them, _________________ data, time zone _________________, _________________ records or fields, _________________ errors, and units that are not _________________ or consistent.

Faithfulness, missing, default, truncated, inconsistencies, duplicated, spelling, specified

Apache Hadoop was the first open-source MapReduce software, based on _________________'s approaches; it is the basis of several key technologies but is very _________________ to use. It uses _________________ execution, as each stage passes through the hard drives, but iterative jobs involve a lot of disk _________________ on each repetition, which is also very slow (complex jobs, interactive queries, online processing) combined with interactive _________________ and stream _________________. We now have hard drives, CPUs and _________________ on a new engine called Apache Spark (a parallel execution engine for big data processing): general, efficient support for multiple _________________, easy to use and fast, exploiting in-memory storage when available, with low-overhead scheduling and an optimized engine. The Spark programming abstraction writes programs in terms of _________________ on distributed datasets, particularly resilient distributed datasets, which are distributed collections of objects stored on _________________ or in _________________, built via parallel transformations (map, filter) and automatically rebuilt on _________________ (resilient). Transformations are often _________________ but actions trigger _________________, such as count, collect, saveAsTextFile (not map, filter, groupBy).

Google, tedious, distributed, IO, mining, processing, memory, workloads, transformations, disk, memory, failure, lazy, computation

1) What does each record represent? 2) Do all records capture granularity @ same level? 3) If the data are coarse how was it aggregated?

Granularity

1) Does the data cover my area of interest? 2) Is my data expansive? 3) Does my data cover the right time frame?

Scope

1) Are the data in a standard format/encoding? 2) Are the data organized in records? 3) Are the data nested? 4) Do the data reference other data? 5) What are the fields in each record?

Structure

1) When was the data collected? 2) What is the meaning of the time/date fields? 3) Are there strange null values? 4) Is there periodicity?

Temporality

What are the metrics for success? The model predicts market values _________________ and _________________, following international standards. The system is _________________, accountable and transparent, and eliminates regressivity to engender _________________ in the system among all stakeholders. Fairness is the ability of our pipeline to accurately assess all residential property values, accounting for all _________________. Transparency is the ability of the DS department to share/explain pipeline results and _________________ to both internal/external stakeholders. Accuracy is a _________________ but not sufficient condition for a fair system. Fairness and transparency are _________________ dependent. Learn to work with them and consider how your data analysis will _________________ them. Keep in mind the power and _________________ of data analysis.

accurately, uniformly, fair, trust, disparities, decisions, necessary, context, reshape, limits

Model risk is the MSE of prediction, where the expectation is the _________________ over all samples (for fitting our model, and new observations @ a fixed x). We look at chance error, due to _________________ alone in the new observation and in our _________________, and at bias, which is non-random, due to our _________________ being different than the true underlying function g. Some reasons are _________________ error or _________________ info acting as noise; fix: can we get more precise measurements? Error could also have come from our _________________, which results in _________________, small differences in data leading to large differences in the model, fixed by reducing model _________________ or removing noise. Bias is the difference between our predicted value and the true g(x) averaged over all possible _________________, but it is _________________. If positive we tend to _________________ g(x), and if negative the opposite. Reasons include _________________ or a general lack of domain knowledge, fixed by _________________ model complexity or consulting experts. Decomposition: Model risk = _________________^2 + _________________^2 + _________________. We try to minimize all but _________________ variance, which is out of our control; reducing complexity/variance can increase _________________ and the opposite can increase _________________. _________________ knowledge matters: pick the right model structure!

average, randomness, sample, model, measurement, missing, sample, overfitting, complexity, samples, non-random, overestimate, underfitting, increasing, sigma, model bias, model variance, observation, bias, variance, domain

Scale needs to be kept consistent on an _________________, and there need to be limits. We can create _________________ plots to show different regions of interest if possible. We can also use _________________ to aid comparison, where lines make it easy to see large effects and having two separate lines makes certain _________________ clear and highlights patterns. To examine distributions/relationships in subgroups, we can employ juxtaposition, placing multiple plots SBS with the same _________________, or superposition, placing multiple density curves or scatter plots on _________________ of each other. We can also use _________________ or _________________ to represent additional variables. Perceptually uniform colormaps have the property that the _________________ change is the same between intervals. The older colormap _________________ was far from this, but the newer _________________ is. Avoid combinations of _________________ and _________________ for colorblind people. We can use color to highlight data type: for qualitative data, choose a scheme that makes it easy to _________________ between categories, and for quantitative, choose one that implies _________________. If the data progresses from low to high, use a _________________ scheme with lighter colors for extreme values. If the low/high values deserve equal emphasis, use a _________________ scheme where lighter colors = middle values.

axis, separate, conditioning, differences, scale, top, color, shape, perception, Jet, Viridis, red, green, distinguish, magnitude, sequential, diverging

Loss functions quantify how _________________ a prediction is for a single observation. If our prediction is close to the actual value, we want a _________________ loss, and we want a _________________ loss for predictions far from the actual value. A natural choice is the _________________, actual − predicted, but we want to treat negative/positive errors the same. Squared, AKA _________________, loss is (y_i − ŷ_i)^2; if our prediction is equal to our actual observation, loss is _________________, meaning a good fit. We also have absolute or _________________ loss, which is |y − ŷ|. We can replace ŷ with _________________ for constant models.

bad, low, high, error, L2, 0, L1, theta
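
A minimal sketch of the two losses for a constant model, with made-up numbers:

    import numpy as np

    def squared_loss(y, y_hat):   # L2 loss for one observation
        return (y - y_hat) ** 2

    def absolute_loss(y, y_hat):  # L1 loss for one observation
        return np.abs(y - y_hat)

    y = np.array([1.0, 3.0, 5.0])
    theta = 3.0                        # constant model: y_hat = theta
    print(squared_loss(y, theta).mean())   # MSE
    print(absolute_loss(y, theta).mean())  # MAE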

Probability Sampling: random samples can produce _________________ estimates of population characteristics (like estimating the _________________). Also known as _________________ samples, a probability sample is a type of sampling _________________. But with random samples we are able to estimate the _________________ and _________________ error, and so quantify the _________________. We must be able to provide the _________________ that any specified set of individuals will be in the sample. Individuals in the population do not need the _________________ chance of being selected, as long as you can measure the errors (thus not all probability samples are good).

biased, maximum, random, technique, bias, chance, uncertainty, chance, same

Estimate an interval where we think the population parameter is based on the _________________ and _________________ of the estimator. The population param will be in our interval _________________% of the time in the long term. The estimator CI for f is a function that takes a sample and returns an interval containing _________________ for p% of samples, built by approximating the sampling distribution of f using sample s and choosing the middle p%. The population parameter is fixed and so is our _________________ once computed (no randomness); this is sometimes called the _________________ bootstrap CI. The regression model has some underlying true _________________ between feature and response. Observed response = _________________ matrix * _________________ parameters + _________________, the latter two of which are unobservable. We can use LS estimation to find the _________________ model params. So we can bootstrap our training data, fit a _________________ model to each resample and look at the _________________ of the parameter estimates.

center, variance, p, theta *, interval, percentile, relationship, design, true, errors, optimal, linear, distribution
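
A sketch of the percentile bootstrap CI for a sample mean (synthetic data standing in for the observed sample; 95% chosen as the example level):

    import numpy as np

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=10, scale=3, size=500)  # stand-in observed sample

    boot_means = np.array([
        rng.choice(sample, size=len(sample), replace=True).mean()
        for _ in range(5_000)
    ])
    # middle 95% of the bootstrapped sampling distribution
    lo, hi = np.percentile(boot_means, [2.5, 97.5])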

Fault-tolerant distributed file systems split the file into smaller parts at record _________________, ideally across machines, and read parts of the file in _________________ (fast reads for large files), using cheap commodity hardware that embraces _________________ events head-on. To interact with data (smaller datasets) we request a data _________________, and we have lots of tools to compute locally! For larger datasets/computationally intensive tasks we issue a _________________ and get a response from cluster/cloud compute.

boundaries, parallel, failure, sample, query

A decision tree is a very simple way to _________________ data. It is a tree of questions that must be answered in sequence to yield a predicted classification. There is one _________________ decision point where there is more than one possible right answer, which is why we use the _________________ method of the DF class. These will always have _________________ accuracy on the training data, except when there are samples from different _________________ with the same features. The tendency is _________________ with such accuracy. We can include even more _________________. Decision boundaries can be _________________, although many overlapping pts leads to only 93% accuracy on the training set. Traditional decision tree generation algorithm: all the data starts in the _________________ node; repeat until every node is either _________________ or _________________: pick the best feature x and best split value, and split the data into two nodes. A node with only _________________ class is pure, and one that has _________________ data/cannot be split is unsplittable.

classify, terminal, query, perfect, categories, overfitting, features, erratic, root, pure, unsplittable, one, duplicate

The modeling process has three steps: choosing a _________________ (constant, linear, non-linear), choosing an objective _________________ (prediction/loss function, description/likelihood function) and fitting the model by _________________ our objective function (analytical or numerical approach). The simplest model is a _________________ one that predicts the same number regardless of all other information. It is useful for description and _________________. Notation includes y_i, which is an individual _________________; ŷ_i, which are _________________ observations; theta, which are _________________ parameters; and theta hat, which are _________________ parameters. Parameters define the model, although some models like the KDE may be _________________. A constant model ignores any input, and models can also have many params, where the goal is to find the _________________ value of the parameters. Estimation is using the data to determine the _________________ parameters, while prediction is using these fitted model params to predict _________________ for unseen data.

model, function, optimizing, constant, prediction, observation, predicted, model, fitted, nonparametric, best, model, outputs

In Online Analytics Processing, users interact with _________________ data through complex SQL queries or graphical tools. Reporting and Business Intelligence use high-level tools to _________________ with their data through often large queries. To deal w/ unstructured data, we use data lakes, which store a _________________ of all the data in one place in its _________________ form and enable data consumers to choose how to _________________ and use the data (schema on _________________). A cultural shift from curating to saving: the _________________ begins to dominate the signal, there is limited data governance/planning, and no cleaning/verification results in _________________ data. We need better tools, as the old ones don't work, enabled by DS such as data _________________ and cleaning, ML, computer vision, statistical sampling. Organizations are evolving and technologies are improving, such as lots of hard drives and _________________. To store/compute large unstructured datasets we handle very large files spanning multiple _________________, use cheap commodity devices that _________________ frequently, and do quick/easy _________________ data processing. We can use distributed file systems that capitalize on _________________ and distributed computing to load/process on multiple machines _________________, approaching parallelism.

multidimensional, interact, copy, original, transform, read, noise, dirty, wrangling, CPUs, computers, fail, distributed, redundancy, concurrently

On linearly separable data, GD will keep pushing the CE loss towards _________________ without ever reaching it, which is bad because the model becomes _________________. Points are linearly separable if we can correctly separate the classes with a _________________, where the class label does not count as a dimension. For one feature, we are looking for a degree 0 _________________ to separate them, which is a single point. A set of d-dimensional points is linearly separable if we can draw a degree _________________ hyperplane separating the points perfectly. If so, some of our weights will diverge to pos or neg infinity because GD will keep rolling down the loss _________________. To avoid large weights, we use _________________ (we should standardize our features before applying it, as with linear regression).

its minimum of 0, overconfident, hyperplane, hyperplane, d-1, surface, regularization

CSV and TSV are tabular data formats where records are delimited by a _________________ and fields are delimited by _________________ and _________________ respectively; we have an issue w/ commas/quoting. JSON has strict formatting and quoting, addressing the previous issues, and is widely used for _________________ data; but it is not _________________, each record can have many _________________, and records can contain tables. XML is also a type of _________________ data.

newline, commas, tabs, nested, rectangular, fields, nested

Quantitative variables are ones where _________________ and _________________ have meaning. They can be separated into continuous variables, measured on an arbitrary _________________, and discrete ones with a _________________ set of possible values. Qualitative variables are _________________, and can be separated into _________________ categories with ordered levels whose differences have no consistent meaning, and _________________ categories w/ no specific ordering.

ratios, intervals, interval, finite, categorical, ordinal, nominal

To evaluate how well-clustered a specific DP X is, we can use the _________________ score or width; scores are high when X is _________________ the other points in its cluster and low when it is _________________. S can be _________________ when the avg distance to X's cluster mates is larger than the distance to the _________________ cluster. The highest possible S is _________________, when every point in X's cluster is right on top of X. For a DP, the score S uses A = avg distance to other points in the _________________ and B = avg distance to points in the _________________ cluster, and S = (B − A) / max(A, B). We can plot S for all of our DPs, and ones with large scores are deeply _________________ in their cluster. We can alternatively rely on real-world _________________ to choose k.

silhouette, near, far, negative, closest, 1, cluster, closest, embedded, metrics

A norm of a vector is some measure of its _________________, and we are concerned with the _________________ and _________________ norms. The L2 vector norm can be considered the _________________ of the vector, and it is a generalization of the Pythagorean Theorem to _________________ dimensions; the distance between 2 vectors is the norm of their difference. The square of the L2 norm is the sum of the _________________ of the vector's elements. Residuals are the difference between the _________________ and _________________ value, in the regression context. We use the letter _________________ to denote residuals, and the MSE is equal to the mean of the squares of the residuals. We can stack all n residuals into a vector called the _________________ vector, the difference between the two vectors actual − predicted. Our goal is to find the value of _________________ that minimizes the squared L2 norm of the residual vector, minimizing the distance between Y and Ŷ. We can use a _________________ argument here.

size, L1, L2, length, n, squares, actual, predicted, e, residual, theta, geometric

We can modify strings using the _________________ accessor. We can use the _________________ method to drop a row (axis = _________________) or a column (axis = _________________). We can also sort by a custom metric by first applying the Series._________________ method to build a sort key. The result of a groupby operation applied to a DF is a _________________ object on which we can use various functions to generate a DF or series, including _________________, which creates a new DF w/ one aggregated row per subframe; _________________, which creates a new series with the size of each subframe; and _________________, which creates a copy of the original DF but keeps only rows from subframes obeying the provided condition, and takes a _________________ f as an input that returns either T or F: for each group g, f is applied to the _________________ comprised of rows from the original DF corresponding to that group. We can also group by _________________ columns.

str, drop, 0, 1, map, DFGB, agg, size, filter, function, subframe, multiple
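
The three DFGB methods on a toy DF (column names are made up):

    import pandas as pd

    df = pd.DataFrame({"team": ["a", "a", "b", "b", "b"],
                       "score": [3, 5, 2, 4, 6]})
    gb = df.groupby("team")   # a DataFrameGroupBy object

    gb.agg("mean")            # one aggregated row per subframe
    gb.size()                 # Series with the size of each subframe
    # keep only rows from subframes whose mean score exceeds 3
    gb.filter(lambda subframe: subframe["score"].mean() > 3)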

The best k is _________________, and one method of selection is the _________________ method, where we plot the _________________ versus many k values and pick the one where we get diminishing returns afterwards.

subjective, elbow, inertia

Big data is data beyond the ability of traditional SW _________________ to quickly capture, curate, manage and process. It is largely _________________ but can be semi-structured or structured data; data requiring _________________ computing tools to process. It is used through actionable analytics, including _________________ (what is happening now), _________________ (what will happen), and _________________ (what should I do about it).

tools, unstructured, parallel, descriptive, predictive, prescriptive

Instead of a discrete histogram, we can visualize what a _________________ distribution corresponding to the same histogram would look like; the smooth curve drawn on top of the histogram is called a _________________ curve. We can call sns._________________ or sns._________________ with the appropriate params. One of the benefits of either is that they show us the bigger _________________ of our distribution, such as the _________________, its _________________ (L or R), the _________________ (L or R) and the _________________, which we can define arbitrarily.

continuous, density, displot, kdeplot, picture, modes, skewness, tails, outliers

Selection bias: systematically _________________ or _________________ particular groups; avoid by examining the sampling frame and the _________________ of sampling. Response bias: _________________ responses; avoid by examining the nature of the _________________ and the _________________ of surveying. Non-response bias: avoid by keeping surveys _________________ and being _________________.

favoring, excluding, method, untruthful, questions, method, short, persistent

In a regression context, x is equivalently called a _________________, _________________, _________________/_________________ variable, _________________, _________________, or _________________, while y is equivalently called an _________________, _________________, _________________, or _________________ variable. Each x variable has an associated _________________ given by its parameter; we call these _________________.

feature, covariate, independent, explanatory, predictor, input, regressor, output, outcome, response, dependent, weight, coefficients

Data warehouses collect and organize _________________ data from multiple _________________. Data is periodically ETL'd into them, where E is _________________ from remote sources, T is _________________ into standard schemas, and L is _________________ into a (typically relational) DB. Loading provides a _________________ of operational data in a single system and isolates analytics _________________ from business-critical services. T is cleaning/preparing data for analytics in a _________________ representation, which is hard and needs specialized tools (schemas, encodings, granularities).

historical, sources, extracted, transformed, loaded, snapshot, queries, unified

Feature engineering is the process of transforming the raw features to be more _________________ for use in modeling tasks. It enables us to capture domain _________________ (periodicity/relationships), express _________________ relationships using SL models, and encode _________________ features as inputs. Feature _________________ transform features into new features; e.g. by adding the all-_________________ column we are introducing a constant feature: the offset, intercept or _________________. Basic transformations include removing _________________ features that could influence the model, applying _________________ transformations, normalizing or _________________, and converting categorical features to numbers through _________________ (binary features). Feature fns capture domain knowledge by introducing additional _________________ from other sources or by _________________ features.

informative, knowledge, non-linear, non-numeric, functions, 1, bias, uninformative, non-linear, standardizing, one hot encoding, info, combining

We want to choose appropriate _________________ and to _________________ in order to make comparisons more natural. We want to choose _________________ and _________________ that are easy to interpret correctly, and add _________________ and _________________ to help tell the story. Smoothed estimates of distributions help with big-picture _________________, such as KDE as a method of _________________ data. Transforming our data can help us _________________ relationships so we can reveal the data to form a narrative.

scales, condition, colors, markings, context, captions, interpretation, smoothing, linearize

A set of vectors is an orthonormal set if all vectors are _________________ vectors and they are all _________________. Given VT, to determine if the columns of V form an orthonormal set, we verify that the dot product of any row of VT with itself is _________________ and with any other row is _________________. If a matrix is square and its columns form an orthonormal set, the transpose of the matrix is also its _________________. When performing SVD, we've left the columns of the U SIGMA matrix unnamed, but they are called the _________________. To deal with the issue of non-matching means, we _________________ the data by subtracting the mean of each column from all values in that column so we get a better _________________. After approximation, you can add back the _________________ to get back the original scale. The ith singular value tells us how valuable the ith _________________ will be in reconstructing our original data, or how much _________________ is captured by that component. The total variance of our data is the _________________ of the individual variances of the attributes. The variance captured by the ith PC is equal to _________________^2 / N. We can get the relative importance of each PC by computing the _________________ of variance via np.round(s**2 / sum(s**2), 2). PCA is the process of _________________ transforming data into a new coordinate system such that the greatest variance occurs in the _________________ dimension, the second most in the second, etc.

unit, orthogonal, 1, 0, inverse, principal components, recenter, projection, means, PC, variance, sum, ith SV, fraction, linearly, first
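
Putting the recipe together in numpy (random data as a stand-in; note the centering step before the SVD):

    import numpy as np

    X = np.random.default_rng(1).normal(size=(100, 4))
    Xc = X - X.mean(axis=0)                # center each column first
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    pcs = U * s                            # equivalently Xc @ Vt.T: the PCs
    var_captured = s**2 / len(Xc)          # variance captured by each PC
    print(np.round(s**2 / sum(s**2), 2))   # fraction of variance per PC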

The expected value of a RV X is the _________________ of the values of X, where the weights are the _________________ of the values. It has the same _________________ as the RV and doesn't need to be a possible value of the RV; it is the center of _________________ of the probability histogram. Its properties include linear _________________, _________________ and the _________________ of expectation. The expected value of a Bernoulli distribution is _________________, and it is _________________ for a binomial distribution.

weighted average, probabilities, units, gravity, transformations, additivity, linearity, p, np

Binomial/multinomial probabilities arise when we are sampling at random _________________ replacement a fixed number (_________________) of times, sampling from a _________________ distribution, or when we want to count the # of each category that ends up in our _________________. If we define the proportion p of the individuals as _________________ and the remainder as failures, we can find the probability of _________________ successes using the binomial probability formula. We can also break the population down into three separate categories with p1 + p2 + p3 = 1, category 1 with proportion _________________ of the individuals and so on, and find the probability of drawing k1 individuals from category 1 and so on.

with, n, categorical, sample, successes, k, p1

Box plots summarize several characteristics of a numerical distribution, including _________________, the _________________, _________________, and the _________________ placed at Q1 − 1.5 IQR and Q3 + 1.5 IQR, as well as _________________, which we define arbitrarily as the points past these bounds. Violin plots also show smoothed _________________ curves, as the _________________ of our box now has meaning, and are useful for comparing _________________ distributions.

Q1, median, Q3, whiskers, outliers, density, width, multiple

Ridge regression is the term for when the model is _________________, we use _________________ loss and _________________ regularization. The objective function we minimize is the _________________ loss plus an added _________________. Unlike OLS, there always exists a unique optimal _________________ vector. LASSO regression is the term for the same model and loss but with _________________ regularization instead, and the same objective function with a different penalty. We use a regularized objective function to determine our model _________________, but we can look at _________________ to evaluate performance.

Y = X theta, squared, L2, average squared, penalty, parameter, parameter, L1, parameters, RMSE

Agglomerative clustering is where every DP starts out as its own _________________ and we join clusters with their neighbors until we have exactly _________________ left. If there is no right answer for which clusters are closest, we can arbitrarily choose the _________________, _________________, or _________________ distance. It is a form of _________________ clustering where we keep track of which two clusters got merged, and each cluster is a _________________. We can visualize this hierarchy, resulting in a _________________.

cluster, k, max, min, avg, hierarchical, tree, dendrogram

sort_values creates a copy of the DF sorted by a specific _________________. It can also be used on a _________________. value_counts creates a new _________________ showing the counts of every value. unique returns all unique values as an _________________.

column, series, series, array

Our prediction is a linear combo of the _________________ of X, where the set of all possible linear combos is the _________________ of the columns of X. This is all the vectors you can _________________ using the columns of X, and it is a subspace of R^n where we want to find the vector that is closest to _________________, which is the _________________ of Y onto span(X). Two vectors are orthogonal iff their _________________ product is 0. The matrix product encapsulates all d equations in a single equation equal to a vector full of _________________; under the assumption that XTX is _________________ rank, we can use the normal equation.

columns, span, reach, Y, orthogonal projection, dot, zeros, full

A slope a1 measures the change in y per unit change in x1, assuming all other variables are held _________________. Multicollinearity is when a feature can be predicted fairly _________________ by a linear combination of other features; we can't interpret the _________________ (small data changes -> changed slopes). It only impacts _________________, not predictive capability. Perfect multicollinearity: one feature can be written _________________ as a linear combination of other features, so the design matrix is not _________________ rank (one-hot-encoding + intercept).

constant, accurately, slope, interpretability, exactly, full

The output of logistic regression is a _________________ value in the range [0,1]; to classify, we use a decision rule or _________________. With different thresholds we get different _________________, and accuracy = # of points classified _________________ / # of points _________________. A confusion matrix gives the four quantities for a particular _________________ and set of data. The precision is _________________ / (_________________ + _________________): what proportion were actually 1 of all those predicted to be 1. Recall = _________________ / (_________________ + _________________): what proportion did we predict 1 among those actually 1. Accuracy = (_________________ + _________________) / _________________. This suggests there is an inverse tradeoff between _________________ and _________________, so we can adjust our classification threshold to suit our needs: a higher threshold means fewer _________________ (precision tends to _________________) and a lower threshold means fewer _________________ (recall _________________).

continuous, threshold, predictions, correctly, total, threshold, TP, TP, FP, TP, TP, FN, TP, TN, n, precision, recall, FP, increase, FN, increases
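
A sketch computing these quantities from scratch (assumes binary labels and predicted probabilities as numpy arrays; no guard against zero denominators):

    import numpy as np

    def classification_metrics(y_true, p_hat, threshold=0.5):
        y_pred = (p_hat >= threshold).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        precision = tp / (tp + fp)   # of those predicted 1, how many were 1
        recall = tp / (tp + fn)      # of those actually 1, how many we caught
        accuracy = (tp + tn) / len(y_true)
        return precision, recall, accuracy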

A census is an official _________________ or survey of a _________________, typically recording various details of individuals. Pros are that it is a lot of _________________, and thus there is no _________________ bias and _________________ is easy. However, it is very _________________ and is often impossible. A survey is a set of _________________ (what is asked, and how it is asked, can affect _________________ the respondent answers and _________________ the respondent answers). A sample is a _________________ of the population and is used to make _________________ about the population. How you draw the sample will affect your _________________. Two common sources of error are chance error, where random samples can vary from what is expected in _________________ direction, and bias, a _________________ error in one direction.

count, population, data, selection, inference, expensive, questions, how, whether, subset, inferences, accuracy, any, systematic

Data Cleaning: the process of transforming raw _________________ to facilitate subsequent analysis, often addressing issues such as _________________ or formatting, _________________ or _________________ values, unit _________________, and encoding text as numbers. EDA is the process of _________________, _________________, and _________________ data to build/confirm understanding of the data, identify and address potential _________________, inform the subsequent _________________, and discover potential _________________. It is an open-ended analysis. We are looking at: 1) _________________, the shape of the data file; 2) _________________, how coarse/fine each stratum is; 3) _________________, how (in)complete the data is; 4) _________________, the situation in time; 5) _________________, how well the data captures reality.

data, structure, missing, corrupted, conversion, visualizing, transforming, summarizing, issues, analysis, hypotheses, structure, granularity, scope, temporality, faithfulness

In a precision vs. recall curve, the threshold _________________ from the top left to the bottom right. We can compare our models using the area under the curve (_________________), where the ideal value is _________________ and a perfect predictor hugs the top right corner. A perfect classifier is one with a _________________ and _________________ of 1. The FPR is _________________ / (_________________ + _________________), the proportion of innocent people I convicted, and the TPR is _________________ / (_________________ + _________________), the proportion of guilty people I convicted, which is the same thing as _________________. ROC (Receiver Operating Characteristic) curves plot the _________________ vs. the _________________; as the threshold increases, both of these values _________________, which is good if we detect fewer false positives but bad for detecting few positives. A perfect classifier is one with a TPR of _________________ and a FPR of _________________, and a best possible AUC of _________________. Numerical assessments include accuracy, precision, recall, TPR, FPR, and AUC for PR and ROC curves; visualizations are confusion matrices, precision/recall curves and ROC curves. Decision boundaries are _________________.

decreases, AUC, 1, precision, recall, FP, FP, TN, TP, TP, FN, recall, TPR, FPR, decrease, 1, 0, 1, linear

Smoothing focuses on the general _________________ rather than individual observations. We can also spread the proportion _________________, such as in KDE plots, which are used to estimate a _________________ or density curve from a set of data (its area must sum to 1). To create one, place a _________________ at each data point, _________________ them so that the total area = 1, and finally _________________ all kernels together after choosing a _________________. A kernel is a valid density function that is _________________ for all inputs and must integrate to 1. A common kernel is a _________________ one, where x represents any input and xi the ith observed value; kernels are centered on our _________________ values so the mean of the distribution is _________________. Alpha controls the _________________ of our KDE. The boxcar kernel assigns uniform density to points w/in a _________________ of the observation. Bandwidth is analogous to the _________________ of each histogram bin, and as it increases, the KDE becomes _________________ smooth: simpler to understand, but it may remove important distributional info. It is called a _________________.

distribution, uniformly, probability density function, kernel, normalize, sum, bandwidth, non-negative, Gaussian, observed, xi, smoothness, window, width, more, hyperparameter

A natural measure is the average loss across all points, AKA _________________ risk, an objective function which tells us how well a model fits the given data. If our average loss is low, we are good at making predictions, and we want to find params that _________________ average loss to make our model as good at making predictions as possible. If we choose squared loss, the avg squared loss is typically referred to as the _________________, and if we choose absolute loss we have the _________________. Mathematically, we want the argument that minimizes the objective function, denoted by _________________. Algebraically, we can take the _________________, set it equal to 0, and solve for the optimizing value. The derivative of a sum of several pieces is equal to the sum of the derivatives of said _________________. We can take the second derivative to ensure we have truly found a _________________, not just a critical point. A neat trick is that the sum of deviations from the minimizer is _________________, which means that theta_hat = mean(y) = ȳ regardless of the _________________ that we use, but only for this combo of model/loss. It provides some formal _________________ as to why we use means so commonly in summary statistics when working with the _________________ parameter.

empirical, minimize, MSE, MAE, argmin, derivative, pieces, minimum, 0, dataset, reasoning, optimal
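
The derivation this card alludes to, for the constant model with squared loss:

    \[ R(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta)^2, \qquad \frac{dR}{d\theta} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \theta) = 0 \;\Rightarrow\; \hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y} \]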

Log mining is the process of loading _________________ messages into memory and interactively searching for patterns. The next-generation DF is _________________, which achieves a 4x speedup on a 4-core laptop. ETL is used to bring data from _________________ into a data _________________. There are many ways to organize tabular data in a warehouse, e.g. in _________________ or _________________ schemas. OLAP techniques let us _________________ data in a data warehouse. Unstructured data is hard to store in a tabular format, so it is not amenable to standard techniques, which is why we have a new paradigm called the data _________________, enabled by the ideas of distributed file _________________ and _________________. Distributed file storage involves the _________________ of data for better speed/reliability but is more costly. MapReduce eases distributed computation; Apache _________________ is an open-source implementation and _________________ is faster/easier to use. Modin is a way of _________________ data exploration.

error, Modin, operational data stores, warehouse, star, snowflake, analyze, lakes, storage, computation, replication, Hadoop, Spark, accelerating

Training data is used to _________________ the model and test data is used to check the _________________ error. How to split depends on the application, and can be _________________, _________________ or _________________. We want a larger _________________ set for more complex models but a larger _________________ set for a better estimate of the generalization error. We can only use the test dataset _________________, after deciding on the model. Cross validation simulates multiple train-test _________________ on the training data. Regularization is parametrically controlling the model _________________. We want to find the best value of theta that uses fewer than _________________ features, which is a combinatorial search problem. The L0 norm ball is ideal for _________________ selection but combinatorially hard to optimize; the L1 norm ball encourages _________________ solutions and is _________________; the L2 norm ball spreads weight over features, meaning it is _________________, but doesn't encourage sparsity; and the L1 + L2 norm ball needs to _________________ to tune the regularization parameters. We perform standardization to ensure each dimension has the same _________________ centered around zero, and we don't typically regularize the _________________ term; regularization penalizes _________________ equally.

fit, generalization, temporally, geometrically, randomly, training, test, once, splits, complexity, beta, feature, sparse, convex, robust, compromise, scale, intercept, dimensions

PCA is appropriate for EDA when visually identifying clusters of similar observations in _________________ dimensions, we are still _________________ the data (don't know what to predict), and we have reason to believe the data is _________________ rank (a few dimensions determine the linear association). SVD describes the matrix decomposition X = U SIGMA VT; if X has rank r, there are r _________________ values on the diagonal of SIGMA. The values in SIGMA are called _________________, ordered from greatest to least; the cols of U are the _________________ SV and the cols of V are the _________________ SV. PCA is a specific SVD application where the largest n _________________ are kept and X is centered at the _________________ of each column. All other SV are set to zero in the _________________ reduction. The first n rows of VT are the _________________ for the n PCs, and the first n columns of U SIGMA or XV contain the n _________________ of X. PCA primarily utilizes U SIGMA. A PC direction is a linear combo of attributes given as _________________ of VT; plotting the values of PC1 can provide insight into how attributes are combined; high variance attributes are typically _________________ while low ones are not. PCs are _________________ to prior components. PCA is a technique to summarize data by finding directions that minimize _________________ error and maximize the captured _________________. We use SVD to conduct PCA to preserve the _________________ of the original data in 2D. _________________ plots tell us how much info is lost in PCA

high, exploring, low, nonzero, singular values, left, right, singular values, mean, dimensionality, directions, PCs, rows, included, orthogonal, projection, variance, structure, scree
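
A minimal numpy sketch of PCA via SVD on an illustrative matrix X: center each column at its mean, then take the first k columns of U SIGMA as the PCs:

import numpy as np

X = np.random.default_rng(0).normal(size=(50, 4))
Xc = X - X.mean(axis=0)                  # center each column at its mean

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
pcs = U[:, :k] * S[:k]                   # first k PCs; equivalent to Xc @ Vt[:k].T
directions = Vt[:k]                      # rows of VT: the PC directions
var_fraction = S[:k]**2 / np.sum(S**2)   # captured variance per PC (scree plot data)
print(pcs.shape, var_fraction)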

Missing values in quantitative data are resolved by trying to estimate or _________________ them or adding a _________________ field as a signal. For categorical data, we can add a _________________. For text data, we can do _________________ encoding (which is the _________________ of one hot encoding for a string of text, encoding text as a long _________________ of word counts, so we lose word order information in a very high dimensional and sparse table. A bag is another term for a _________________, an unordered collection which may contain 1+ instances of each element, as well as _________________ words that do not carry significant information) or _________________-gram models which try to preserve word _________________; these can also be very sparse and many _________________ can occur, but we can use a _________________ approximation.

impute, binary, column, bag of words, generalization, vector, multiset, stop, n, order, combinations, hashing
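
A sketch of bag-of-words encoding, assuming scikit-learn's CountVectorizer and a toy corpus; word order is lost and the result is a sparse matrix of counts:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the cat"]
vec = CountVectorizer()            # pass ngram_range=(1, 2) for a bigram model
counts = vec.fit_transform(docs)   # sparse document-term matrix
print(vec.get_feature_names_out())
print(counts.toarray())            # rows are documents, columns are word counts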

Convenience sampling is taking whoever you can get a hold of, and is not a good idea for _________________. Sources of bias can introduce themselves in unusual ways, since _________________ does not equal random. Quota samples are done by specifying a desired breakdown of various _________________ and reaching those targets however possible. You require that the sample look like your population in a few _________________ but not all. Try to ensure that the sample is _________________ of the population: _________________ over quantity.

inference, haphazard, subgroups, aspects, representative, quality

One goal of DS is to _________________ human decisions, and plots directly address this goal. Data visualizations have their own _________________. An encoding is a mapping from a _________________ to a visual _________________. In other words, for each _________________ that represents a datum, an encoding maps this datum to its visual _________________. Not all encoding channels are _________________.

inform, tradeoffs, variable, element, mark, position, exchangeable

Loc can access values by _________________ or using _________________. It can take lists of row and column _________________, which returns a DF. It can also be used with _________________ (all label types), which are _________________, not exclusive. If we only provide a single label as a column argument, we get a _________________. If we provide a list of only one label as an argument, we get back a _________________. If we provide a single row label, we get a _________________ (index = names of the columns from the DF).

labels, boolean arrays, labels, slices, inclusive, series, DF, series
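
A short sketch of these loc behaviors on a toy DataFrame (column and index names are illustrative):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

print(df.loc["x":"y", ["a", "b"]])  # label slices are INCLUSIVE -> DataFrame
print(df.loc[:, "a"])               # single column label -> Series
print(df.loc[:, ["a"]])             # list of one label -> DataFrame
print(df.loc["x"])                  # single row label -> Series indexed by columns
print(df.loc[df["a"] > 1])          # boolean array selection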

Population: the group you want to _________________ about, denoted of size _________________; sampling frame: the _________________ from which the sample is drawn (the set of all _________________ who could end up in the sample); sample: who you actually end up sampling, a subset of the _________________, of size _________________. There may be individuals in the SF and sample who are not in the _________________. Administrative data: the SF contains a lot not in the _________________, but we have access to the entire SF.

learn, N, list, people, SF, n, population, population

The GD algorithm uses alpha as the _________________ rate; too large and it _________________ to converge, too small and it takes too _________________. If the loss function has multiple local minima, GD is not _________________ to find the global minimum. For a convex function f, any local minimum is also a global minimum, so GD is _________________ to find the globally optimal minimizer. f is convex iff tf(a) + (1-t)f(b) >= f(ta + (1-t)b) for all t in [0, 1]. Batch GD: nudging _________________ in the _________________ gradient direction until it converges, using an update rule of the learning rate times the _________________ of the loss wrt theta. First we initialize all model weights to _________________ (or small random numbers), update them with the _________________ rule, and repeat until convergence. Typically the loss function is the average _________________ over a large dataset, which is computationally expensive. We can instead draw a SRS of data _________________ called a batch; the choice of batch size is a tradeoff between gradient _________________ and _________________, and we compute the gradient on the batch and use it as an estimate of the true gradient.

learning, fails, long, guaranteed, guaranteed, theta, negative, gradient, 0, update, loss, indices, quality, speed
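
A minimal gradient descent sketch under the assumption of a constant model with squared loss, whose gradient wrt theta is -2 * mean(y - theta); alpha is the learning rate:

import numpy as np

y = np.array([2.0, 4.0, 6.0])
theta, alpha = 0.0, 0.1             # initialize the weight, pick a learning rate

for _ in range(100):
    grad = -2 * np.mean(y - theta)  # gradient of avg squared loss wrt theta
    theta = theta - alpha * grad    # nudge theta in the negative gradient direction
print(theta, y.mean())              # converges to the mean, ~4.0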

Consider a new loss called the negative _________________ loss for a single observation when the true y is equal to 1. Cross entropy loss is often written unconditionally as loss = - y log(y hat) - (1 - y) log(1 - y hat), which we also call _________________ loss. Benefits include that the loss surface is guaranteed to be _________________, it more strongly penalizes _________________ predictions, and it has roots in probability and _________________ theory. Different optimization problems have different _________________, and CE loss is strictly better than _________________ loss for logistic regression. Our goal is to find the p1 and p2 that maximize the likelihood, since we want the values most likely to have generated the data we observed (differentiating, setting to 0, solving). Log(x) is a strictly _________________ function. Minimizing CE loss is equivalent to maximizing data likelihood: we are choosing the model params that are most likely given this data, which is called maximum likelihood estimation or _________________. Assume the log-odds prob of belonging to class 1 is _________________. CE helps us measure the difference between two _________________.

log, log, convex, bad, info, solutions, squared, increasing, MLE, linear, distributions
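
A small sketch of cross-entropy loss for a single observation, matching the formula above:

import numpy as np

def cross_entropy(y, y_hat):
    # - y log(y_hat) - (1 - y) log(1 - y_hat)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

print(cross_entropy(1, 0.9))   # small loss for a confident correct prediction
print(cross_entropy(1, 0.01))  # confident wrong predictions are penalized heavily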

Fully grown decision trees will almost always overfit data and have _________________ model bias and _________________ model variance. Bootstrap aggregating or _________________ is generating many bootstrap resamples of the training data and fitting one _________________ for each resample, where the final prediction is the _________________ of the small models' predictions. It often isn't enough to _________________ model variance, since the trees often make similar predictions. We instead want to use only a sample of m _________________ at each split, where p = # of features and m = sqrt(p); this creates individual trees that overfit in different ways, so the overall forest has _________________ variance. We start with all data in one node and, until all nodes are pure, pick an _________________ node, pick a random _________________ of m features, pick the best feature and split so that the _________________ of the split is minimized, and take the _________________ vote at the end, with two hyperparameters T, m. These ideas are generally _________________. Random forests are _________________ for regression/classification, _________________ to feature scaling and translation, have _________________ feature selection, have _________________ decision boundaries w/out complicated feature engineering, and don't _________________ as much. They are an example of an _________________ method which combines the knowledge of many simple models, and an example of using the _________________ to reduce model variance

low, high, bagging, model, average, reduce, features, low, impure, subset, loss, majority, heuristic, versatile, invariant, automatic, nonlinear, overfit, ensemble, bootstrap
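
A hedged scikit-learn sketch of the two hyperparameters mentioned (T trees, m features per split), on synthetic data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=16, random_state=0)
rf = RandomForestClassifier(
    n_estimators=100,     # T: number of bootstrap resamples / trees
    max_features="sqrt",  # m = sqrt(p) features considered at each split
    random_state=0,
).fit(X, y)
print(rf.score(X, y))     # predictions come from the majority vote across trees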

A mode of a distribution is a local or global _________________. A distribution with a single clear maximum is called _________________. Distributions with two modes are called _________________, and those with 2+ are called _________________. We need to distinguish between modes and random _________________. If a distribution has a long right tail, we call it skewed _________________. For a quantitative variable, the first or lower quartile is at the _________________ mark, the second quartile at the _________________ (called the _________________), and the third/upper at the _________________ percentile. The interval between the first and third quartiles contains the middle _________________% of the data, and the interquartile range measures its _________________.

maximum, unimodal, bimodal, multimodal, noise, right, 25, 50, median, 75, 50, range
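
A quick numpy check of these definitions on toy data:

import numpy as np

data = np.arange(1, 101)            # 1..100
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3)                   # lower quartile, median, upper quartile
print(q3 - q1)                      # IQR: spread of the middle 50% of the data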

EDA/Data Cleaning: examine the data and the _________________ for organization/structure; examine each field, _________________, or dimension; examine pairs of _________________ dimensions; _________________/summarize the data; validate _________________; identify and address _________________; and apply data _________________ and corrections

metadata, attribute, related, visualize, assumptions, issues, transformations

Operational Data Stores capture the _________________, and include many different _________________ across an organization. Data arrives in many different formats or _________________, and live systems often don't maintain their _________________, since they are optimized for current operations. We would like a consolidated, clean, historical _________________ of the data.

now, databases, schemas, history, snapshot

Bernoulli distribution: a RV takes the value 1 with probability _________________ and 0 otherwise (described by the Probability Mass Function or the _________________). Binomial distribution: a RV that counts the # of _________________ in n independent trials where each succeeds w/ probability p. Probability distributions fall into discrete, where the set of possible values of X is either _________________ or countably _________________ and values are separated by some _________________ amount, or continuous, where the set of values is _________________ and X can be any real number in an _________________. Discrete distributions can be _________________ on a finite set, where the probability of each value is 1/_________________. Continuous distributions can be uniform on the unit _________________, where X can be any real # in the range (0,1), or can be modeled by the _________________ distribution.

p, PMF, successes, countable, inf, fixed, uncountable, interval, uniform, size of set, interval, normal
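
A sketch simulating both distributions with numpy (p and n are illustrative):

import numpy as np

rng = np.random.default_rng(0)
p, n = 0.3, 10

bernoulli = rng.random(100_000) < p          # 1 with probability p, else 0
binomial = rng.binomial(n, p, size=100_000)  # # of successes in n independent trials
print(bernoulli.mean())                      # approximately p
print(binomial.mean())                       # approximately n * p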

Multiple linear regression is a model with _________________ features + an _________________ term, where the weight associated with _________________ xj is thetaj. To determine the fit of our model to the data, we can look at the _________________ or the _________________, as well as the _________________ or the _________________ plot; residuals are defined as the difference between actual and predicted values. RMSE is the square root of the _________________ between predictions and their true values. It is the square root of MSE, the _________________ loss we've been minimizing to determine optimal model params; it has the same units as _________________, and a lower value means more _________________ predictions. With multiple features, we can also look at the _________________ between each feature and our true y values individually, or measure the strength of the linear association between our actual and predicted _________________ (want to approach y = x). We define the R^2 value as the square of the correlation, referred to as the coefficient of _________________; it can also be calculated as the _________________ / _________________, the proportion of variance that our fitted values _________________, or explain. As we add more features, fitted values approach actual y values and thus R2 _________________.

p, intercept, feature, MSE, RMSE, correlations, residual, MSE, average, y, accurate, correlation, y, determination, variance of fitted values, variance of y, capture, increases
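
A numpy sketch of these fit metrics for hypothetical predictions y_hat:

import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

rmse = np.sqrt(np.mean((y - y_hat) ** 2))  # same units as y
r2 = np.var(y_hat) / np.var(y)             # variance of fitted values / variance of y
print(rmse, r2)
print(np.corrcoef(y, y_hat)[0, 1] ** 2)    # square of the correlation; for a
                                           # least-squares fit the two R^2 forms agree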

The solutions to tabular data manipulation were _________________ and _________________, to regression and classification were _________________ and _________________, and to dimensionality reduction was _________________, the latter two of which are part of _________________. In supervised learning, the goal is to create a _________________ that maps inputs to outputs, and the model is learned from example pairs where each pair consists of the input _________________ and the output value. In regression the output is _________________ and for classification it is _________________. In unsupervised learning the goal is to identify patterns in _________________ data w/out pairs. The goal of clustering is to assign each _________________ to a cluster, which is an unsupervised task

pandas, SQL, linear models, decision trees, PCA, ML, function, vector, quantitative, categorical, unlabeled, point

A fact we know about a population is called a _________________, a numerical function of the population. We compute the _________________ of a random sample instead. Inference is drawing conclusions about population parameters given a random _________________. Logic consists of steps in reasoning from premises to logical _________________ (divided into deduction, and induction, which goes from premises to a universal conclusion). Statistical inference is the process of using data analysis to deduce _________________ of an underlying distribution of probability: using data to infer the _________________ that generated the data. Prediction is the task of using our model to make predictions on _________________ data, and inference is the task of using the model to draw conclusions about underlying true relationships between _________________ and response.

parameter, statistic, sample, consequences, properties, distribution, unseen, features

The shape of MAE with the constant model is jagged because it is a weighted sum of several absolute value curves, which results in a _________________ linear function, and we use the fact that the derivative of a sum is the sum of the _________________. We need to choose a theta such that the # of observations _________________ than theta is equal to the # _________________ than theta, which is the definition of the _________________. Thus, there may be _________________ optimal values for MAE and MSE. We can plot the loss _________________, a plot of the loss encountered for each possible value of _________________. While MSE is very _________________ and easy to minimize numerically but very _________________ to outliers, MAE is not as smooth and harder to minimize (not _________________), but is _________________ to outliers.

piecewise, derivatives, less, greater, median, different, surface, theta, smooth, sensitive, differentiable, robust
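
A sketch of both loss surfaces over a grid of theta values (toy data with an outlier), showing the smooth MSE curve minimized at the mean and the piecewise-linear MAE curve minimized at the median:

import numpy as np

y = np.array([1.0, 2.0, 10.0])       # outlier at 10
thetas = np.linspace(0, 12, 1201)

mse = [np.mean((y - t) ** 2) for t in thetas]   # smooth, outlier-sensitive
mae = [np.mean(np.abs(y - t)) for t in thetas]  # piecewise linear, robust
print(thetas[np.argmin(mse)], y.mean())         # MSE minimized near the mean (~4.33)
print(thetas[np.argmin(mae)], np.median(y))     # MAE minimized at the median (2.0)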

To determine the properties of the sampling distribution of an estimator, we would need access to the _________________, but we can treat our random sample as one and _________________ from it. We want to sample _________________ replacement. The bootstrapped sampling dist may not match the sampling _________________ of that same estimator, since the center of the bootstrapped dist is the estimator applied to our original sample (we can't recover the true expected value). The variance, however, is often close to the true _________________ of the estimator, and its _________________ depends on the original sample

population, resample, with, distribution, variance, quality
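
A minimal bootstrap sketch for the sampling distribution of the mean (population parameters are illustrative):

import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=5, scale=2, size=100)  # our one random sample

boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()  # resample WITH replacement
    for _ in range(5_000)
])
print(boot_means.mean())  # centered at the sample mean, not the true mean
print(boot_means.var())   # often close to the estimator's true variance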

Iloc returns the items that appear at the numerical _________________ specified: integer-based indexing by position. Loc, by contrast, is harder to make mistakes with, makes it easier to _________________ code, and is not vulnerable to changes in the _________________ of raw data files. If you want a DF consisting of a random selection of rows, we can use the sample method, which is _________________ replacement by default. Useful DataFrame utilities include _________________, which displays the top few rows; _________________, which gives the total # of data points; _________________, which gives the dimensions of the data; and _________________, which provides a summary. Index gives the index (row _________________) and columns returns _________________ labels.

positions, read, ordering, without, head, size, shape, describe, labels, column
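
A quick sketch of iloc and the inspection utilities named above (toy DataFrame):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

print(df.iloc[0:2, 0])    # position-based; the end is EXCLUSIVE, unlike loc
print(df.sample(2))       # random rows, without replacement by default
print(df.head())          # top few rows
print(df.size, df.shape)  # total # of entries; (rows, columns)
print(df.describe())      # summary statistics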

Bar plots are the most common way of displaying the distribution of a _________________ or categorical variable. They are also used to display a numerical variable that has been measured on individuals in different _________________. Lengths encode _________________ and the widths encode _________________, while colors may indicate a _________________. There are three ways to create them: using plt from _________________, the underlying library; using _________________.plot(); and using seaborn or _________________, which allows us to quickly create sophisticated visualizations.

qualitative, categories, value, nothing, subcategory, matplotlib, pandas, sns
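
A sketch of the three approaches on toy category counts (assumes matplotlib, pandas, and seaborn are installed; each call draws on the current axes, so in practice you'd pick one):

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

counts = pd.Series([5, 3, 8], index=["a", "b", "c"])

plt.bar(counts.index, counts.values)          # matplotlib, the underlying library
counts.plot(kind="bar")                       # pandas .plot()
sns.barplot(x=counts.index, y=counts.values)  # seaborn (sns)
plt.show()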

In a linear regression model, we predict a _________________ variable, and in a logistic regression our goal is to predict a binary _________________ variable. Our response is the probability that our observation belongs to class _________________. The logistic function is a type of _________________, a class of functions that share certain properties. The output is bounded between 0 and 1, which fixes an issue with using linear regression to predict _________________. If theta 1 is positive, the curve increases to the _________________; the further theta 1 is from 0, the _________________ the curve is. If we increase x by one unit, the odds are multiplied by e^theta1; if theta 1 > 0, the odds _________________. The odds ratio is interpreted as the number of _________________ for each failure. The loss surface of MSE for a logistic regression model with a single _________________ plus intercept is often non-convex, so an update rule that stops when the gradient is 0 may miss the global minimum. Squared loss is not the best choice of loss function for logistic regression: average squared loss is not _________________, so numerical methods struggle to find a solution; wrong predictions aren't _________________ significantly enough; and squared loss is bounded between 0 and 1.

quantitative, categorical, 1, sigmoid, probabilities, right, steeper, increase, successes, feature, convex, penalized

In linear regression, our goal is to predict a _________________ variable from a set of features, where our response y can be any _________________ number. We determined optimal model parameters by minimizing some average _________________, and added a _________________ penalty. When performing classification, we are predicting some _________________ variable instead. _________________ classification has two classes w/ responses 0/1, and _________________ classification has many. k-NN was covered earlier. Regression/classification are methods of _________________ learning, but _________________ regression is mostly used for classification. We can't use OLS because it is very sensitive to _________________. The graph of averages can also be used. The log-odds of the probability is roughly _________________. odds(p) = _________________ / _________________ and log-odds(p) = log(_________________ / _________________). The logistic function or _________________(t) is e^t / (1 + e^t). We can substitute t = x^T theta, so that P(Y = 1 | x) = 1 / (1 + e^(-x^T theta)) = sigma(x^T theta)

quantitative, real, loss, regularization, categorical, binary, multiclass, supervised, logistic, outliers, linear, p, 1-p, p, 1-p, sigma
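
A sketch of the odds, log-odds, and sigmoid relationships (p = 0.8 is illustrative):

import numpy as np

def sigmoid(t):
    return np.exp(t) / (1 + np.exp(t))  # equivalently 1 / (1 + e^-t)

p = 0.8
odds = p / (1 - p)        # 4 successes per failure
log_odds = np.log(odds)
print(sigmoid(log_odds))  # sigmoid inverts the log-odds: recovers p = 0.8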

Rug plots are used to show the distribution of a single _________________ or numerical variable (each and every single _________________). They may show too much detail or make it hard to see the whole picture, and we risk _________________. Histograms are the _________________ version, where we lose _________________ but gain _________________. The horizontal axis is the number line divided into _________________, and areas represent _________________ (the total area is 1, or 100%). The unit of height is proportion per unit on the x-axis, which can be seen by dividing by the _________________ of the bin. Proportion in bin = _________________ * _________________. By default, matplotlib histograms show _________________ on the y-axis, not proportions per unit; we can use the optional _________________ parameter to fix the y-axis. Beware of strong conclusions, since the number of bins influences a histogram's _________________, and bins do not have to be the same _________________.

quantitative, value, overplotting, smoothed, granularity, interpretability, bins, proportions, width, width of bin, height of bar, counts, density, appearance, width
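
A sketch of a density-scaled histogram, where bar areas sum to 1:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.default_rng(0).normal(size=1_000)
heights, bins, _ = plt.hist(data, bins=20, density=True)  # densities, not counts
print(np.sum(heights * np.diff(bins)))  # total area = height * width summed = 1
plt.show()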

The correlation coefficient _________________ measures the strength of the _________________ association between two variables, and it has no _________________. It ranges between - and + _________________, with 1 indicating a perfect positive linear association and -1 a perfect negative linear association. It says nothing about _________________ or, consequently, _________________ association; if r = 0, we say the two variables are _________________. r is the average of the _________________ of x and y, both measured in _________________; without standard units, the analogous quantity is called the _________________. A simple linear regression model helps us create a graph of _________________ and derive the optimal params by hand. To minimize MSE for the SLR model, we need to find two params optimally, and we want the best _________________ of such params, which we call _________________ linear regression and find by taking _________________ derivatives.

r, linear, units, 1, causation, non-linear, uncorrelated, product, SU, covariance, average, combination, LS, partial
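
A numpy sketch of r as the average product of x and y in standard units (toy data):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 9.0])

su = lambda v: (v - v.mean()) / v.std()  # convert to standard units
r = np.mean(su(x) * su(y))               # average of the products
print(r, np.corrcoef(x, y)[0, 1])        # matches numpy's correlation coefficient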

A dataset's dimensionality is the _________________ of the matrix representing the data. Visualizations of high-dimensional data are difficult, but we can try reducing to _________________ dimensions while preserving info. One idea is to pick 2 _________________, namely those with high _________________, in order to effectively differentiate observations; the general process is called _________________, in which we create a _________________ combination of attributes. We ultimately want to determine whether this 2D plot is really showing the _________________ of the data. SVD will automatically do a similar _________________ for us to create two separate tables. SIGMA is a _________________ matrix which contains the so-called _________________ of X. The columns of U and V form an _________________ set. A diagonal matrix is one with zeros everywhere except possibly on the diagonal; multiplying by it is equivalent to _________________ the columns. In SIGMA, the singular values appear in _________________ order and are always _________________. The singular values beyond rank _________________ will always be 0. Two orthogonal vectors meet at a _________________ angle and have a _________________ product of 0. A unit vector has length _________________. The length of vector v is its _________________ norm, the square root of the sum of its squared entries

rank, 2, attributes, variance, PCA, linear, variability, transformation, diagonal, singular values, orthonormal, scaling, decreasing, non-negative, r, right, dot, 1, L2

These big tables have substantial _________________ and are expensive to store and access, and we may make updating mistakes. The multidimensional data model uses _________________ tables and _________________ tables, which minimizes redundant info and data errors. These dimensions are easy to manage/summarize in a _________________ representation, and we do analysis through _________________

redundancy, dimension, fact, normalized, joins
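
A hedged pandas analogy of a star-schema join between a fact table and a dimension table (both tables and column names are hypothetical):

import pandas as pd

# Fact table: one row per sale; dimension table: one row per product
sales = pd.DataFrame({"product_id": [1, 2, 1], "amount": [9.99, 4.50, 9.99]})
products = pd.DataFrame({"product_id": [1, 2], "name": ["pen", "mug"]})

# product_id is the dimension table's key, referenced by the fact table
joined = sales.merge(products, on="product_id")
print(joined.groupby("name")["amount"].sum())  # analysis through joins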

Often data will _________________ other pieces of data. A primary key is the (set of) column(s) that determines the _________________ of the remaining columns; primary keys are _________________. Foreign keys are columns that reference the _________________ keys of other tables.

reference, values, unique, primary

Visualization is the use of computer-generated, interactive, visual _________________ of data to amplify _________________: finding the artificial _________________ that best supports our natural means of _________________. Visualization complements _________________. The goals of data visualization are to help your own _________________ of the data/results as a key part of the _________________ process (useful throughout modeling, as it is lightweight, iterative and flexible), and to _________________ results to others (highly editorial and selective; be thoughtful, careful, and fine-tuned to achieve a communication goal, which is often _________________ consuming). It is a constant _________________ across the DS lifecycle.

representation, cognition, memory, perception, statistics, understanding, EDA, communicate, time, tool

The dot product between two vectors is a _________________ that is only defined if they have the same _________________; a special case of the _________________ product, it is a way that we can perform _________________. Our MSE involves all observations at once, so we want to model in terms of all observations, displayed in a _________________ matrix. With n observations, rows correspond to _________________, while columns correspond to _________________. Thus, we can express our linear model on our entire _________________ as Y with dimensions ___ x ___, X with dimensions ____ x _____, and theta with dimensions _____ x _____. For a single observation, y_hat = f(x) = x^T theta, where x is a vector of size _________________, y is a _________________, and theta is a vector of size _________________.

scalar, length, inner, multiple regression, design, observations, features, dataset, n, 1, n, p + 1, p + 1, 1, p+1, scalar, p+1
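
A numpy sketch of the matrix form Y_hat = X theta with the dimensions above (toy values; the first column of 1s carries the intercept):

import numpy as np

n, p = 4, 2
X = np.hstack([np.ones((n, 1)), np.arange(n * p).reshape(n, p)])  # n x (p+1) design matrix
theta = np.array([1.0, 0.5, -0.5])                                # (p+1)-vector of weights
y_hat = X @ theta                                                 # length-n vector of predictions
print(X.shape, theta.shape, y_hat.shape)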

