Data Science with Python Full Vocab List


Abend (Abnormal End)

An unexpected or abnormal termination of an application or operating system that results from a problem with the software

Data wrangling (a.k.a. "munging"): organizing and preparing data for analysis

Because data usually begin in a messy and disorganized state, data scientists need to have ninja skills with spreadsheets, Pandas, R, and/or other tools that can be used to whip the data into shape. Although data wrangling is the least glamorous aspect of data science, it may consume half or more of the project time. Data scientists may delegate data wrangling to younger or less experienced people on the team, especially if the senior people lack the patience (or skill) to do data wrangling well. However, top-flight data wranglers can command high wages, because the difference between an efficient approach and an inefficient approach may involve a cost change of several orders of magnitude. For example, if you had 80 million records to clean up, and you spent 2 weeks writing a script that automated the process and had a runtime of 15 minutes, you would certainly be justified (as a consultant) charging a fee of $100,000, if the client's only reasonable alternative was to spend tens of thousands of labor hours doing the process manually.

GIGO

Garbage in, Garbage out

Data visualization and insights (telling the story, making your case)

Here is where data scientists really earn their money. (And yes, they are paid well and are highly sought, especially by tech companies in California.) The insights gathered from exploratory data analysis (EDA) and model-building are of little use unless the data scientist can COMMUNICATE those findings, using words and graphs, to the decision makers who need those valuable insights. This process of "selling" and "packaging" the insights for the high-level execs who run the enterprise is both art and science. It is marketing, in a way, since it uses psychology and communication.

Data modeling

In statistics, practitioners (statisticians) are usually interested in estimating the parameters of a distribution, because then it is possible to state confidence intervals and make other types of inference about the population of interest. A statistical model is really a mathematical model defined by certain parameters (such as slope and intercept, or mean and standard deviation). A model is not reality, but it is a simplified version of reality. Key quote from the British statistician George E.P. Box (1919-2013): "All models are wrong, but some are useful." Statisticians care about parameters, in part, because the field of statistics grew out of mathematics, and parameters are a nerdy, theoretical thing that such people care a lot about. In data science, however, practitioners (data scientists) are less interested in the parameters of the distribution. In fact, they may not care at all about the parameters: "parameters schmameters." Data scientists focus on the usefulness of the predictions/descriptions and tend to ignore the parameters. A data science model is considered useful if it can make worthwhile predictions about previously unseen data. To accomplish this, the data scientist divides the original source data into at least 2 subsets, called "training" and "test." (There may be other subsets for validation, model selection, etc., but in our introductory class, we will stick to training and test.) A model is built by using the training data as input to some machine-learning program (such as scikit-learn) and then tested on the previously unseen test data to see if the model still makes useful predictions. If the model performs well on the training data but poorly on the test data, we say that the model is OVERFITTING the data. If the model performs poorly on both training and test data, we say that the model is UNDERFITTING the data.
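
A minimal Python sketch of this train/test workflow, assuming scikit-learn and NumPy are available; the data and the choice of a decision tree are made up for illustration:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))                      # three explanatory variables
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=1000)   # signal plus noise

# Hold out 25% of the rows as previously unseen test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeRegressor()   # deliberately flexible model
model.fit(X_train, y_train)

# A big gap between training and test performance suggests overfitting;
# poor performance on both suggests underfitting.
print("training R^2:", model.score(X_train, y_train))
print("test R^2:    ", model.score(X_test, y_test))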

P-value†

P(achieving results as extreme as, or more extreme than, the test statistic | null hypothesis is true). If the P-value is sufficiently small, we are saying that the test statistic we computed would be unlikely if the null hypothesis were true, from which we conclude that the null hypothesis should be rejected. In other words, we have found statistically significant evidence to reject the null hypothesis. On the other hand, if the P-value is not sufficiently small, we say that the evidence to reject the null hypothesis is insufficient. We fail to reject the null hypothesis. The threshold for what constitutes "sufficiently small" is called α (alpha), also known as the significance level of the test. The most common value for alpha is 0.05, which was a completely arbitrary level proposed by Ronald Fisher in 1925. That value has largely stuck. Note: We never say that we have proved the null hypothesis. Proving the null hypothesis is a rookie mistake. Note: Since the publication of a 2005 article by John Ioannidis, a Stanford professor, the entire concept of using P-values to test for statistical significance has been under attack, with fierce arguments from both sides. The AP Statistics course outline pretends as if this controversy does not exist, and you should also ignore the controversy in order to score well on the AP exam. However, in the larger world, you will want to know that P-values are no longer as widely accepted as they once were. There are many reasons for this: P-hacking, replicability problems, publication bias, multiple tests within one study, low statistical power, etc.
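
A small Python sketch of a one-sample t test, assuming SciPy and a made-up sample; it shows where the P-value and alpha fit into the decision:

import numpy as np
from scipy import stats

# H0: mu = 100   Ha: mu != 100   (hypothetical data)
sample = np.array([103.2, 98.7, 105.1, 101.4, 99.8, 104.6, 102.3, 100.9])

t_stat, p_value = stats.ttest_1samp(sample, popmean=100)
alpha = 0.05   # Fisher's conventional significance level

print(f"t = {t_stat:.3f}, P-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: statistically significant evidence against the null hypothesis.")
else:
    print("Fail to reject H0: insufficient evidence (we never say we 'proved' H0).")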

positive predictive value (PPV)

P(infected | +), which can be computed with a probability tree if sensitivity, specificity, and base rate are known. A definition that does not rely on a medical context: PPV = P(true target variable is 1 | target variable was predicted to be 1). Note that we are using 0 to indicate the null-hypothesis outcome and 1 to indicate the alternative-hypothesis outcome. If Ho (null hypothesis) is "Customer repays the loan, no problem" and Ha (alternative hypothesis) is "Customer defaults on loan," then PPV answers the question, "What fraction of people judged to be loan defaulters will actually default on their loans?" In data science, PPV is usually called "precision." Note that the definition is the "flipped" version of the definition for sensitivity.
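
A quick Python illustration of PPV/precision, using made-up 0/1 labels (1 = defaults, 0 = repays) and scikit-learn's precision_score for comparison:

from sklearn.metrics import precision_score

y_true = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]   # what actually happened (hypothetical)
y_pred = [1, 0, 1, 1, 0, 0, 0, 1, 1, 0]   # what the model predicted

# PPV ("precision") = P(truly 1 | predicted to be 1) = TP / (TP + FP)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

print("PPV by hand:      ", tp / (tp + fp))
print("sklearn precision:", precision_score(y_true, y_pred))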

interquartile range (IQR)

Q3-Q1

tails (left tail, right tail)

Tails are the lower and upper ends, respectively, of a distribution (often tapering off, but not necessarily).

noise

Unwanted variation, whether random or not. Since noise is a feature of all real-world data, the essential challenge of statistics and data science is to find the signal (meaningful variation) amidst the noise (extraneous variation). See, for example, Nate Silver's excellent book, The Signal and the Noise.

A regression

When you fix one bug in your code, but something else breaks as a result of the fix. Regression testing: running all cells from top to bottom to make sure everything still works.

WAD

Works as Designed

normal probability plot

a calculator tool for displaying observed values plotted against the inverse normal cdf of their percentiles, on the theory that a truly normal data set would therefore produce a straight line of dots. Bending to the left (as one traces a finger from left to right) implies left skewness, whereas bending to the right (as one traces a finger from left to right) implies right skewness. Warning: These rules apply to the plots produced by the TI-83/84 family of calculators, which put values on the x-axis, inverse normal cdf of percentiles on the y-axis. Many commercial stat packages swap the roles of x and y, which means that the rules for interpreting left and right skewness would be backwards. On the TI-83/84 family of calculators, an S-shaped plot ( ∫ ) indicates "heavy tails," whereas a plot that looks more like the graph of y = x^3 (upward portion, mostly horizontal portion, another upward portion) indicates "light tails." Warning: As above, remember that many commercial stat packages swap the roles of x and y, meaning that the interpretations would be reversed. Another way to distinguish heavy and light tails is to sketch a segment connecting the first and last points (excluding outliers). If the points in the graph are mostly below the segment on the left half and above the segment on the right half, you have heavy tails. If the points in the graph are mostly above the segment on the left half and below the segment on the right half, you have light tails. In data science, we usually use a Q-Q plot (quantile-quantile plot) instead of an NPP, but the analysis is basically the same.
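
In Python, a Q-Q plot can be sketched with SciPy's probplot, assuming SciPy and matplotlib are available and using made-up right-skewed data; note that probplot puts the theoretical quantiles on the x-axis, the opposite of the TI-83/84 convention, so the bending rules above are reversed:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0, sigma=0.5, size=200)   # right-skewed sample

# Theoretical (inverse normal cdf) quantiles on x, observed values on y.
stats.probplot(data, dist="norm", plot=plt)
plt.title("Normal Q-Q plot")
plt.show()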

variable

a column in a data table. Note that this definition differs from what you learned in previous math courses. Most of the things you called "variables" in algebra or precalculus class would properly be called statistics or parameters in our class.

uniform distribution

a distribution whose histogram (or pdf graph, in the case of a continuous distribution) looks like a single rectangle; in other words, all permitted values of the r.v. of interest are equally likely

transformation

a function (not necessarily linear, but often linear) that is applied to data in order to give it a different distribution. Univariate example: A highly right-skewed distribution can sometimes be transformed to a nearly normal distribution by taking the log of all values; such a transformation would be called a "log transformation" or "log transform," and the starting distribution would be called a "lognormal distribution." Bivariate example: In AP Statistics, one of the places we use transformations is in the study of curve fitting. If two variables have a nonlinear association, we can often find a "transformation to achieve linearity." This is helpful, since if we can achieve a roughly linear association between two variables, we can then use LSRL procedures to find the line of best fit and algebra to write an equation for the overall model. We generally do _not_ do this in data science, because we have software for that. The AP curriculum will probably eventually drop the "transformation to achieve linearity" topic.
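
A small NumPy/SciPy sketch of a log transformation applied to made-up right-skewed (lognormal) data:

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
skewed = rng.lognormal(mean=3, sigma=1, size=10_000)   # highly right-skewed
transformed = np.log(skewed)                           # roughly normal after the log

print("skewness before:", round(skew(skewed), 2))      # large positive
print("skewness after: ", round(skew(transformed), 2)) # near 0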

two-way table (a.k.a. contingency table)

a grid of joint counts (or, less often, joint probabilities) for 2 categorical variables. Note: Strictly speaking, the row and column totals are not included in the definition, though we almost always want to see them.
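
In Pandas, pd.crosstab builds a two-way table; margins=True adds the row and column totals mentioned in the note. The data below are made up:

import pandas as pd

df = pd.DataFrame({
    "year": ["Jr", "Sr", "Jr", "Sr", "Jr", "Sr", "Jr", "Sr"],
    "owns_laptop": ["yes", "yes", "no", "yes", "yes", "no", "no", "yes"],
})

table = pd.crosstab(df["year"], df["owns_laptop"], margins=True)
print(table)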

model (more on doc)

a mathematical procedure (or algorithm) for predicting a response variable based on one or more explanatory variables

statistic

a number computed from data; examples include xbar, phat, s, sample median, sample Q1, mode, batting averages of baseball players, QB ratings, etc.

parameter

a number that describes a population; common examples include μ (population mean), σ (population s.d.), ρ (population linear correlation coefficient for bivariate data), population median, etc. In some contexts, especially in data science, the word "parameter" is used more generally to refer to a number that controls how a model functions. You can think of a parameter as an "adjustable constant" that defines the shape or behavior of the model. For example, every simple bivariate linear model has two parameters, namely the slope and the y-intercept of the line that illustrates the model's behavior. Those parameters also control the distribution of all possible ordered pairs in the theoretical population of the "true" relationship between x and y, which is why the statistical definition given above is also valid in that context. Hyperparameters are parameters of the model-building process, not parameters of the model itself. Hyperparameter tuning is the process of finding the hyperparameter values that are best for _building_ the model, but those hyperparameter values are not settings of the model itself once the model has been selected and built. An example of a hyperparameter would be the maximum number of levels deep that you wish to go when building a hierarchical classification model. Whether you select 3 or 4 (or 10, or whatever) for that maximum depth hyperparameter, that setting will have a strong effect on the run time and the results of your model-building software. However, after the model has been selected and built, the number 3 (or 4, or 10, or whatever) does not appear anywhere in the algorithm of the built model.

point estimate

a number, such as xbar or phat, that forms the center of a symmetric C.I.

time series

a particular type of bivariate data in which the x variable uses a time scale

discrete variable

a quantitative variable whose possible values include no intervals; the values are often integers (such as counts) but could also be something like measurements to the nearest quarter inch

continuous variable

a quantitative variable whose possible values include one or more intervals; most measurements that are not counts can be deemed to have this property, even if, in practice, data have to be rounded to a certain number of decimal places

data set (a.k.a. dataset, dataframe)

a set of observations, almost always organized in a 2D matrix of rows (records) and columns (fields). The terms column, field, and variable are used interchangeably. You should be equally comfortable with all three. In data science, Pandas is the Python library that we primarily use for dealing with dataframes.

Kludge

a solution to a problem, the performance of a task, or a system fix which is inefficient, inelegant ("hacky"), or even incomprehensible, but which somehow works

sample

a subset of a population, not necessarily random. Notation: n = sample size. For the sample itself, we use set notation with curly braces: for example, {4, 5, 11, 13, 27, 35}. Although a random sample (usually an SRS) is desirable, a true SRS is almost never possible in experiments. In experiments, random assignment of treatments is much more important than finding an SRS of subjects. Therefore, we treat the units in the treatment and control groups as being, for example, an SRS of all possible lab rats that could have been randomly assigned to treatment and control groups, even though the lab rats are not really an SRS. In data science, it is important to select samples randomly for training and test. Typically, 75% or 80% of your dataframe will be used for training the model, and the remaining records will be used for testing. Failure to select training data randomly can lead to severe overfitting or underfitting because of selection bias. For example, if your training data all come from weekdays, your model will probably not generalize well to a test sample that includes weekends.

outlier (univariate)

a value that is more than (1.5*IQR) below Q1, or more than (1.5*IQR) above Q3
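
A NumPy sketch of the 1.5*IQR rule applied to a made-up data set; note that quartile conventions vary slightly from one software package to another:

import numpy as np

data = np.array([4, 5, 11, 13, 27, 35, 41, 120])   # 120 looks suspicious
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)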

categorical variable

a variable (e.g., sex or hair color) that does not take numeric values, though statistics (such as counts) can still be computed; two key types are nominal and ordinal

confounding variable

a variable that affects both group membership and the response variable; synonym for lurking variable. A classic example of confounding would involve the explanatory variable "smoking status" (true or false) and the response variable "probability of contracting lung cancer." Although smokers are much more likely to contract lung cancer, we _cannot_ say that smoking causes an increased likelihood of lung cancer without additional analysis. Why not? The reason is that there are many potential confounding variables, such as alcohol consumption, tooth decay, and poverty, all of which are positively correlated with the number of cigarettes smoked. How do we know that smokers (who are more likely to drink heavily) are not getting lung cancer as a result of their alcohol consumption? This argument was used successfully for decades by the big tobacco companies in an effort to avoid liability for smoking-related diseases such as lung cancer. The name applied to this argument was the "sloppy

quantitative variable

a variable whose possible values are numeric; two key types are continuous and discrete

bivariate

adj.: involving two variables. Visualizations: If both variables are categorical, use a two-way table or a heatmap. If both variables are quantitative, use a scatterplot. If one variable is quantitative and the other is a time scale, the situation is called a time series; use a line graph (preferred) or a bar graph. If one variable is quantitative and the other is categorical, use stacked boxplots, a bar graph, stacked bars, clustered bars, or 3D bars.

resistant (or "robust")

adj.: not affected very much by outliers, addition/deletion of data, or minor changes in model assumptions. Usage examples: We may say that a model is "resistant to outliers" or "robust in the presence of outliers." The 2-sample t test, the workhorse of scientific experimentation, is robust in the presence of mild violations of assumptions, especially when the control group and experimental group are of approximately equal size. Some procedures, such as those based on the F distribution, are not robust and should be avoided by beginners. The F test is not resistant to nonnormality; in other words, if its normality assumption is violated, the results of the test may be wildly unreliable.

univariate

adj.: referring to a single variable; preferred visualizations are histograms, bar graphs (categorical only), stemplots, or modified boxplots; pie graphs are deprecated

multivariate

adj.: refers to data sets in which interactions among 3 or more variables are of interest

unimodal, bimodal, multimodal

adj.: refers to distributions that have one, two, or more than two modes, respectively

gap

an empty region in a histogram. Note: There is no rule of thumb. Judgment is solely by subjective visual evaluation.

data point (datum)†

an observation (record) in a data table, or sometimes a single field value of that observation; in the case of bivariate data, the term usually refers to a single (x, y) ordered pair

outlier (regression)

an ordered pair whose |residual| is "large" (based on eyeballing)

population

any group of interest, from which samples are typically drawn for the purpose of gaining knowledge (signal)

association (general, positive, negative)

any observable relationship between two variables. Note: The term can apply even if one or both of the variables are categorical. However, "positive" and "negative" usually make sense only when both variables are quantitative. "Positive" means that the variables tend to increase or decrease together, while "negative" means that as one increases, the other tends to decrease. If there is no relationship at all between two variables, the term "independence" or "pure independence" may be used.

selection bias

bias that can be caused by having a sample that is not selected properly; key types are overcoverage (e.g., getting too many multi-phone households by using random-digit dialing) and undercoverage (e.g., not being able to locate people experiencing homelessness). In order to constitute bias, the selection process must produce samples that, on average, systematically produce values of the statistic of interest that tend to be higher or lower than in the population as a whole. It is not enough for one sample to differ in the response variable, since that happens all the time. (That is called sampling error, and it is not bias.) Bias occurs if there is a systematic process of selecting samples that have a higher or lower expected value for the statistic of interest than the expected value for the population as a whole.

histogram

blocky set of bars (although we do NOT call it a bar graph) for visualizing a distribution; possible values (or, more commonly, bins of values) are marked on the x axis, and counts or relative frequencies are marked on the y axis

false positive, false negative

common names for Type I and Type II error, respectively

conditional vs. unconditional probability

conditional probability is restricted to a subset of possible outcomes, and its notation always uses the | ("given") symbol; unconditional probability (a.k.a. marginal probability) is the ordinary type of probability that 7th graders can understand

distribution

data distribution = the set of all observed values for a variable, along with their frequencies or relative frequencies; usually depicted by a histogram. Probability distribution = the set of all possible values for a random variable, along with their long-run relative frequencies (if discrete) or probability densities (if continuous); usually depicted by a relative frequency histogram (discrete) or a pdf curve (continuous). Remember that "long-run relative frequency" is the AP definition of probability, conforming to the frequentist philosophy. That is why a probability distribution always includes relative frequency or density information, depending on whether the variable is discrete or continuous. In descriptive statistics (the fall months of AP Statistics), we deal only with data distributions. A data distribution is typically depicted by a histogram (or a stemplot). We would not use a modified boxplot to depict a distribution, since only the five-number summary and outliers are shown. If you want to see what a data distribution looks like, you really need a histogram or a stemplot. Data science students (not AP Statistics students) may also use a KDE (kernel density estimate) to depict a data distribution. The key fact to remember is that a data distribution is not simply a list of values; it is a list that has been sorted and binned so that the data can easily be put into a histogram. Probability distributions are a little trickier, since a histogram may not always work. For continuous variables, a probability distribution will have to use a pdf curve instead of a relative frequency histogram. Is there a distinction between a distribution and the visualization of that distribution? Technically, yes. However, you will never go wrong if you think of distributions as histograms or pdf curves. Splitting hairs over this distinction is not necessary.

observation

data from a single row of a data table; depending on context, this may refer to the entire row, a single column from that row, or (for bivariate data) the x and y values for that row

quartile

data value corresponding to the 25th percentile (Q1) or 75th percentile (Q3) of a distribution; Q2 is an acceptable, though infrequently used, notation for the median (50th percentile)

boxplot ("box and whiskers plot")

data visualization for five-number summary only, with outliers absorbed into the whiskers. Note: We usually prefer _modified_ boxplots to regular boxplots, since the outliers are worth knowing about.

modified boxplot

data visualization for the five-number summary, with outliers (by the 1.5 IQR rule) shown as dots. Note: In this style of visualization, a whisker will extend not to the min or max but only to the most extreme data point that is not an outlier.

heat map (a.k.a. heatmap)

data visualization in which physical regions are color-coded to indicate values of a statistic, or a grid of variables is shaded to show correlation values; extremely useful in many fields, including sports and marketing

bar graph (column graph if vertical)

data visualization technique for categorical data shown by horizontal or vertical bars whose lengths are proportional to the count, sum, relative frequency, or other statistic of interest; could also represent a quantitative variable (e.g., in a time series, though a line graph would be preferred there)

pie graph

deprecated visualization for a categorical variable; stacked bar graphs are preferred, since they take up less space and can be easily extended to show trends across time, for example

slope (slope coefficient)††

For AP Statistics: the LSRL statistic b in the model yhat = a + bx. For data science: the OLS statistics bsub1, bsub2, bsub3, etc., in the model yhat = a + bsub1(x1) + bsub2(x2) + bsub3(x3) + . . . + bsubn(xn). The slope coefficient (singular) is found in AP Statistics by using STAT CALC 8. In data science, our data files are much too large for using a hand calculator, and we usually use multiple linear regression anyway (giving us multiple slope coefficients). Therefore, in data science, we use either the LinearRegression feature of sklearn.linear_model or the OLS feature of statsmodels.api. (If using the latter, it is customary to import statsmodels.api as sm, then use sm.OLS to produce your linear fit.) Note that a is often called bsub0. The notation of bsub0 for intercept and bsub1 for slope was used in AP Statistics exams from 1996 through 2019. From fall 2019 onward, a and b are used for intercept and slope, respectively, which conforms with the TI calculator's STAT CALC 8 notation but conflicts with the standard notation used in the rest of the world. Memorize these words: "Our model predicts that each 1-unit increase in ___ [describe your explanatory variable here] is associated with ___ [say "an increase" or "a decrease"] of about ___ units of ___ [describe your response variable here], on average."
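
A minimal multiple-regression sketch showing both libraries mentioned above; the DataFrame, column names, and coefficients are made up:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 3 + 2 * df["x1"] - 0.5 * df["x2"] + rng.normal(size=200)

# scikit-learn: intercept_ plays the role of bsub0; coef_ holds bsub1, bsub2, ...
lin = LinearRegression().fit(df[["x1", "x2"]], df["y"])
print("sklearn intercept:", lin.intercept_, "slopes:", lin.coef_)

# statsmodels: add_constant supplies the intercept column explicitly
ols = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2"]])).fit()
print(ols.params)   # const, x1, x2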

median

for both discrete and continuous distributions, the 50th-percentile value

nominal scale

for categorical variables, a set of possible data consisting of "name only"; examples include hair color, county of residence, and ZIP code (note that ZIP code looks numeric but is really nominal and categorical)

ordinal scale

for categorical variables, a set of possible data for which an ordering is implied; an example is S/M/L/XL for T-shirts, even if the size differences from one size to the next are not uniform

mode

for discrete distributions, the most common value; for continuous distributions, the value with the highest pdf

interval scale

for quantitative variables, a set of possible data for which addition, subtraction, means, and standard deviations are meaningful; examples include the Fahrenheit and Celsius scales for temperature

ratio scale

for quantitative variables, a set of possible data for which all arithmetic operations are meaningful, including multiplication, division, and ratios. Example 1: almost any unit scale (such as pounds, inches, meters, square centimeters, dollars, or euros) for which 0 literally means that none is present. Example 2: the Kelvin scale for temperature, since it has an absolute zero (a temperature of 0 K means no molecular motion). A temperature of 30 K is literally 3 times warmer than a temperature of 10 K, since there is 3 times as much molecular motion. Note: Fahrenheit and Celsius temperature scales are _not_ ratio scales, since they do not have an absolute zero. It would be nonsense to say, "The outdoor air temperature is 30 degrees today, which is 3 times warmer than yesterday, when it was only 10 degrees." Note: Most quantitative data are on a ratio scale. A ratio scale has an absolute zero; an interval scale does not.

correlation†

for us, Pearson's r value (a statistic) computed from bivariate quantitative data. The r value can range from -1 (indicating perfect negative linear correlation) to 1 (indicating perfect positive linear correlation). If r is close to 0, that is an indication that a linear fit is completely inappropriate; either the data have no pattern, or the pattern is nonlinear. Although an r value near -1 or 1 is often a good indication of linearity, there are many exceptions. For example, exponential growth usually gives an r value of 0.9 or above, and exponential decay usually gives an r value of -0.9 or lower. That means that r^2, an indicator of the strength of linear correlation, is greater than 0.8 in both cases, and that is considered strong. Nevertheless, a curved function would do a much better job of fitting the data. Therefore, merely having a strong r (or r^2) value is not enough to say that a linear fit is appropriate. You need to LOOK AT YOUR DATA (scatterplot) and check the residual plot as well. In data science, Spearman's rho is another type of correlation that is appropriate not only for bivariate quantitative data but also for ordinal categorical data. However, Pearson's correlation is more commonly seen. AP Statistics uses only Pearson's, never Spearman's.
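
A SciPy sketch comparing Pearson's r and Spearman's rho on made-up, curved (exponential-growth) data, which is exactly the situation warned about above:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, size=100)
y = np.exp(x) + rng.normal(scale=5, size=100)   # exponential growth, not linear

r, _ = stats.pearsonr(x, y)
rho, _ = stats.spearmanr(x, y)

print(f"Pearson r = {r:.3f}, r^2 = {r*r:.3f}")   # can look "strong" even though
print(f"Spearman rho = {rho:.3f}")               # the relationship is curved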

data

information + noise. A more traditional definition is facts (plural), whether they are categorical, quantitative, or other. Therefore, anything that can be reduced to binary (0/1) format, and therefore anything capable of being stored on a computer, would qualify. Examples: quantitative: height, weight, pole-vault height, lap time, RBI, OBP, AB, BA, WHIP, IP, FGP, FTP, W/L record; categorical: gender, year in school, ZIP code, state of residence, hair color, preferred T-shirt size; other: audio clips, video clips, essay question responses, teacher evaluation paragraphs, etc.

skewness (left, right)

lack of symmetry in a distribution. If the histogram or pdf curve dribbles out to the left, we have left skewness. If the histogram or pdf curve dribbles out to the right (which is much more common, especially with economic data), we have right skewness.

range

maximum - minimum; note that this definition is different from what you learned in precalculus

mean††, weighted mean

mean = sum of all values divided by the count; weighted mean = sum of all values (where each has been multiplied by a weight), divided at the end by the sum of the weights. Note: Those definitions are valid for discrete distributions only. For continuous distributions, AP Statistics students are expected to estimate, since the true computation would require calculus. The mean of a random variable is always computed as a weighted mean, and the same idea is used for the variance of a random variable as well (since variance equals probability-weighted MSE). "Weighted" simply refers to the fact that the data values (inputs) have coefficients applied to them before being divided by the sum of those coefficients. If probabilities are used as weights, this is easy, because the divisor will typically be 1 since probabilities in a distribution add up to 1. Note: For discrete random variables, the weighting follows common-sense rules. For example, if a fair die is used, the expected number of spots on a single roll is a weighted mean in which each face is weighted by its probability of 1/6.
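
A NumPy sketch of a weighted mean, using the fair-die example; with probabilities as weights, the divisor is 1:

import numpy as np

values = np.array([1, 2, 3, 4, 5, 6])   # faces of a fair die
probs = np.full(6, 1 / 6)               # equal probability weights

print("plain mean:   ", values.mean())                       # 3.5
print("weighted mean:", np.average(values, weights=probs))   # also 3.5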

signal

meaningful information. Note: Data (in the real world) will always include noise in addition to meaningful information. The key task for people who analyze data (statisticians, data scientists, market researchers, etc.) is to separate the signal from the noise. UNDERFITTING is a situation in which a model does not capture enough of the signal to be useful. OVERFITTING is a situation in which a model captures too much noise, thereby making the model unsuitable for making generalized predictions involving previously unseen data.

variance

measure of dispersion defined to be the mean squared deviation from the mean; units are the "square" of whatever units the data are in, which might not be useful or meaningful. Population variance = σ^2 = MSE = (mean squared error) = (mean squared deviation from μ). This is a conceptual/theoretical description, not something we can use for computation, since μ is an unknown parameter in all real-world settings. Sample variance = s^2 = statistic that is similar to population variance, except that we use "mean squared deviation from xbar" instead of "mean squared deviation from μ" when making the computation. Also, we use n - 1 instead of n in the denominator when computing s^2. For more information on the reason for the choice of n - 1 instead of n in the denominator, please see the lexicon entry for mean squared error (MSE). Variance does NOT equal variability.
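
In NumPy, np.var uses the n denominator (population variance) by default; pass ddof=1 to get the n - 1 (sample) version described above. The data are made up:

import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print("population variance (divide by n):  ", np.var(data))          # ddof=0
print("sample variance (divide by n - 1):  ", np.var(data, ddof=1))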

standard deviation (s.d.)

measure of dispersion equal to the square root of variance; has the advantage of being in the same units as the underlying data. Sample s.d. = s = square root of sample variance. Population s.d. = σ = square root of population variance = unknown parameter in all real-world settings.

five-number summary

minimum, Q1, median, Q3, and maximum, in that order

classification

modeling technique used when the target variable is categorical; important in real life, but not used in AP Statistics

least-squares regression line (LSRL)†

most popular type of bivariate curve fitting; unique line that minimizes the sum of squared residuals and has other desirable mathematical properties

percentile

number (or approximate number) below which __% of the population is found. Examples: The 90th percentile of the SAT Math portion is about 630, since roughly 90% of the students who take that test score below 630. The median is a synonym for the 50th percentile. By convention, the median of a sample of size n, where n is even, is the mean of the two middle values. Warning: Since Q1 (25th percentile) and Q3 (75th percentile) computations vary depending on who wrote the software, always use what your calculator or spreadsheet gives you--don't try to compute quartiles or percentiles by hand.

statistics

originally a branch of applied mathematics concerned with the data analysis that Napoleon and other "statists" found useful to support the bureaucratic state; however, statistics is now a field in its own right (with a separate Department of Statistics at many universities) with both theoretical and applied branches, as well as subspecialties like forensic statistics, biostatistics, experimental design, survey design, psychometrics, econometrics, etc.

confidence level

percentage that represents the long-run success rate of the METHOD used for computing confidence intervals; says nothing about the probability that any particular interval is correct

A/B testing

popular marketing technique that uses statistics to compare two approaches (e.g., two slightly different direct-mail or webpage designs) to see if one is better than the other by a statistically significant degree

analysis

process of breaking things down into simpler components

confidence interval (C.I.)†

set of parameter values that are collectively deemed ___% likely (where ___ denotes some percentage level of confidence), based on statistics for point estimate and m.o.e.; for a symmetric interval (the only type we study in AP Statistics), half the interval's width equals the m.o.e.

residual†

signed quantity (also called "error") computed by this formula: observed value - predicted value. By this definition, a positive residual always indicates a y value (true value of the response or target variable) that is greater than its predicted value. In other words, the true data point is above the LSRL or other fitted curve. Similarly, a negative residual always indicates a y value that is less than its predicted value; the true data point is below the LSRL or other fitted curve. Because predicted values are denoted by yhat, we can give this equivalent formula for residual: residual = y - yhat. Fun fact: LSRL and other ordinary least-squares models have the interesting property that the sum of residuals always equals 0.
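
A NumPy sketch of residuals from a least-squares line fit to made-up data, confirming that they sum to (numerically) zero:

import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=50)
y = 1.5 * x + 4 + rng.normal(size=50)

b, a = np.polyfit(x, y, deg=1)   # slope, then intercept, of the least-squares line
y_hat = a + b * x
residuals = y - y_hat            # observed - predicted

print("sum of residuals:", residuals.sum())   # ~0, up to floating-point error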

homoscedasticity*, heteroscedasticity

situation in which variances are (are not) roughly equal. These terms are most often applied to residuals. In that context, homoscedasticity means that the residuals, although they vary, have roughly the same amount of variability across all values of the explanatory variable(s). Heteroscedasticity of residuals (a bad thing) means that the residuals have much larger variability for some values of the explanatory variable(s) than for others. Homoscedasticity of the residuals is a requirement for conducting valid inference involving linear regression models. In AP Statistics, where our regression models have only a single explanatory variable, x, homoscedasticity is quite easy to verify. We simply need to make sure that the residual plot (with residuals as a function of x) shows no dramatic "flaring" or "expansion" as x changes. If there is no dramatic change in the variability of the residuals as x changes, we will declare that there is no significant heteroscedasticity, and the condition is satisfied.
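
A matplotlib sketch of the residual-plot check described above, using made-up data whose noise deliberately grows with x (heteroscedastic):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, size=200)
y = 2 * x + rng.normal(scale=0.5 + 0.4 * x, size=200)   # spread grows with x

b, a = np.polyfit(x, y, deg=1)
residuals = y - (a + b * x)

plt.scatter(x, residuals, s=10)
plt.axhline(0, color="gray")
plt.xlabel("x")
plt.ylabel("residual")
plt.title("Widening spread ('flaring') suggests heteroscedasticity")
plt.show()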

overfitting/underfitting

situations in which a model includes too much noise or too little signal. The essential challenge of all modeling activities, whether regression or classification, is to include enough signal in the model without including too much noise. This is a difficult challenge, since it is hard to tell in advance what observed variations are meaningful ("signal") versus extraneous ("noise"). If the model includes too much noise, overfitting will occur. If the model includes too little signal, underfitting will occur.

margin of error (m.o.e.)

size of the "plus or minus" deviation within which we are __% sure (usually 95% sure) that the sampling error of a statistic lies Note: Only the sampling error is included. All other potential sources of error (wording of the question, response bias, tabulation error, house effect, etc.) are excluded. Therefore, you should think of the m.o.e. as the "m.o.s.e." (margin of sampling error), and a few sources do use the abbreviation m.o.s.e. instead of m.o.e. for clarity. The abbreviation should really be changed worldwide, but the abbreviation "m.o.e." is already widespread. A "rough and ready" estimate of the m.o.e. is to multiply the standard error by 2. The reason that doubling the s.e. usually gives a reasonable estimate of m.o.e. is that in a normal or nearly normal sampling distribution (i.e., the distribution of all possible values for the statistic), the standard deviation of that sampling distribution is by definition the s.e., and in any normal distribution, approximately 95% of the possibilities are found within ±2 standard deviations of the central value. Therefore, as long as our confidence level is 95%, the "double the s.e." estimate for m.o.e. is usually reasonable. In AP Statistics, we learn much more accurate techniques for computing the m.o.e., especially involving inference about the mean with small sample sizes (where we use t procedures instead of z procedures). In AP Statistics, we use the formula (crit. val.)(s.e.) to compute the m.o.e., and the critical value sometimes differs greatly from 2. Nevertheless, the "double the s.e." idea is useful enough that it may appear in scientific papers. Usually there will be a legend or a key telling you what the "plus or minus" in a table signifies, and if it is not m.o.e., then it is usually either s.d., s.e., or twice the s.e

sample standard deviation (s)

square root of sample variance (see the "variance" definition)

scatter plot

standard visualization, using x and y axes, for any bivariate relationship involving two quantitative variables

lurking variable

synonym for confounding variable

regression

synonym for modeling/curve fitting where the target variable is continuous (at least in theory); name comes from the concept of "regression to the mean"

mean squared error (MSE)

synonym for population variance. In data science, we use MSE of regression residuals as a measure of how good the fit is; lower is better. Note: Sample variance (s^2) uses a slightly different formula with n - 1 instead of n in the denominator. It can be proved, after a page or so of messy algebra, that E(s^2) = σ^2. Therefore, the sample variance is an unbiased estimator of the population variance. The proof relies on using n - 1 instead of n in the denominator for s^2. Side note for AP Statistics students only: Even though s^2 is an unbiased estimator of σ^2, s is not an unbiased estimator of σ. It turns out that s, the sample s.d., is always negatively biased. That is why you cannot use z procedures for a C.I. for the mean with small sample sizes, even if the population distribution is normal. The s.e., computed as s/sqrt(n), is simply not large enough to properly account for all the variability in the xbar statistic. Consequently, AP Statistics students learn to use t*, the t-critical value, instead of z*, the z-critical value, when computing the m.o.e. for a mean. (Recall that m.o.e. is the critical value multiplied by the s.e.)
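
A short sketch of MSE of regression residuals as a goodness-of-fit measure, computed by hand and with scikit-learn on made-up data:

import numpy as np
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(13)
x = rng.uniform(0, 10, size=100)
y = 3 * x + 2 + rng.normal(size=100)

b, a = np.polyfit(x, y, deg=1)
y_hat = a + b * x

print("MSE by hand:", np.mean((y - y_hat) ** 2))   # lower is better
print("sklearn MSE:", mean_squared_error(y, y_hat))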

curve fitting

synonym for regression, a modeling technique used when the target (response) variable is quantitative and continuous; in business, the most common method is LSRL (least-squares regression line), but many other possibilities exist

error

synonym for residual (does not mean a "mistake")

normal (a.k.a. Gaussian or z) distribution

the famous bell-shaped distribution commonly found in nature. Gauss hypothesized that independent and identically distributed (i.i.d.) errors in any context, as long as they have finite s.d., will add up to a normal distribution as n→∞. From this, he developed the CLT: The sampling distribution of xbar approaches the normal distribution centered on μ, with standard error σ/sqrt(n), for large n whenever the population has finite s.d. However, even the great Gauss was unable to prove the CLT rigorously. The first rigorous proofs are credited to Laplace (1810), Chebyshev, Lyapunov, and Lindeberg (1920, 1922), with each mathematician building upon the work of those who went before and developing better and better versions. A curious footnote is that Alan Turing, known today as the father of computer science, proved something similar to Lindeberg's version in 1934, not realizing that it had already been proved. The normal distribution is the most commonly occurring distribution in nature; examples include heights of trees in a forest, weights of passengers on a cruise ship, or scores that students earn on a test. However, distributions from finance and economics tend to be strongly skewed, not normal. Nerd note: The pdf of the normal distribution is f(z) = 1/sqrt(2π) · exp(-z^2 / 2), where exp denotes the exponential function (e to the power of . . .). It is remarkable that the normal distribution's formula involves both π and e, two of the most basic constants of nature!

data science (more on doc)

the hottest college major in America, with starting salaries approaching $200K. Data science is shown in a Venn diagram by the intersection of statistics, computer science, and at least one other field (often business or management, but any other "domain knowledge" will do for the third circle). Domain knowledge is crucial, since otherwise you cannot convincingly make your case to an audience of decision makers. Because the field is so new, no generally accepted definition of data science exists yet. However, there appear to be 3 key skill areas where every data scientist needs to be competent:

exploratory data analysis (EDA)

the most essential (and, often, the most useful) step in the process of modeling or understanding data

variability, variation

the most fundamental concept of all, since in the real world, all measurements are subject to excess variability (noise). Finding a signal (i.e., meaningful variation) amidst the noise is the goal of all statistical analysis, whether exploratory or inferential.

coefficient of determination†

the proportion of variation in y that is explained by variation in x (or, equivalently, by the LSRL computed for y upon x); found by squaring the linear correlation coefficient. In AP Statistics, the notation for this is r^2. In data science, since we are nearly always using multiple linear regression, we use the notation R^2 instead. R^2 tells us the portion of the variation in the target variable that can be explained by our regression model. For example, if the coefficient of determination is 0.943, that means that 94.3% of the variation in y can be explained by the linear regression model. The rest of the variation is attributable to random variation ("noise") or other factors not included in the model.
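
A scikit-learn sketch showing R^2 for a multiple linear regression fit to made-up data, via model.score and r2_score:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(17)
X = rng.uniform(0, 10, size=(100, 2))
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

model = LinearRegression().fit(X, y)
print("R^2 via model.score:", model.score(X, y))
print("R^2 via r2_score:   ", r2_score(y, model.predict(X)))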

mathematics

the study of patterns; the science of abstraction

descriptive statistics

the type of statistics that 7th graders can do: mean, mode, median, IQR, pie graphs, boxplots, etc.

explanatory variable

variable x (for us, in the bivariate regression setting); more generally, any variable that is thought to play a meaningful role on the "input side" of a mathematical model. Note: Merely because a variable is considered explanatory implies _nothing_ regarding cause and effect. A variable can be useful for prediction without being in any way connected to causing a change in the response variable. In multiple regression, there are multiple explanatory variables and a single target variable. Although multiple regression is of great interest to data scientists, and even though multiple linear regression is hardly any more difficult than simple bivariate LSRL, multiple regression is not an AP topic.

response variable (data science "target")

variable y (if we are talking about the actual values) or yhat (if we are talking about the output of a model). In data science, where the primary goal is to make predictions, the response variable is usually called the target variable. Note: Merely because a variable is considered "response" (or "target") implies _nothing_ regarding cause and effect. A response variable can be usefully predicted by one or more explanatory variables without being in any way "caused" by changes in those explanatory variables.

stacked (a.k.a. segmented) bar graph

visualization technique that is preferable to pie graphs, since it takes up less space and can be readily generalized to show trends across time, for example

