COGS 9 - EXAM 2
Data Partitioning
*Training data: used to build your predictive model. *Validation data: data from the original dataset that was held out and not used in training the model; helpful in fine-tuning prediction accuracy. *Test data: a new and independent dataset used to assess whether the prediction model is generalizable.
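A minimal sketch of such a three-way split, assuming scikit-learn is available; X and y are invented placeholder arrays:

```python
# Train/validation/test split sketch (60/20/20) using scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # toy feature matrix
y = np.arange(50)                  # toy labels

# Hold out 20% as the test set (never touched during training or tuning).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Split the remainder into training and validation sets (0.25 of 80% = 20%).
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
```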
Linear regression
Models the relationship between variables with a best-fitting line. *Magnitude of relationship: the slope (effect size). *Assumptions: linear relationship, multivariate normality, no multicollinearity, no autocorrelation, homoscedasticity.
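A minimal sketch of fitting a best-fitting line, assuming scipy; the data points are invented:

```python
# Fit a line and read off the slope (effect size) and correlation.
import numpy as np
from scipy.stats import linregress

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

result = linregress(x, y)
print(f"slope={result.slope:.2f}, intercept={result.intercept:.2f}, r={result.rvalue:.3f}")
```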
best practices for sampling from a population
* Always think about what your population is * Collect data from a sample that is representative of your population * If you have no choice but to work with a dataset that is not collected randomly and is biased, be careful not to generalize your results to the entire population
Data Visualization Best Practices
* Choose the right type of visualization. * Be mindful when choosing colors: many color-blind individuals cannot see the difference between red and green. * Label your axes! * Make sure the text size is big enough! * Make sure your numbers add up! * Make sure the numbers and graphics represent the data. * Make comparisons easy on your readers; avoid unnecessary whitespace. * Use y-axes that start at 0 for bar plots. * Keep it simple. * Allow the viewer to make comparisons top to bottom; avoid comparisons across rows. * Order rows logically: usually largest to smallest. * Order columns logically: more specific to less specific. * Limit the number of rows and columns. * Include informative labels. * Be mindful of significant digits. * Include a good caption. * Include the source of the data.
What makes Deep Learning different from earlier artificial neural networks?
* Internet-sized datasets * Computing power * Clever architectures * Clever methods of training the network
What can't Deep Learning do?
* Needs a huge amount of data, and at least some of it must be supervised * May take days to train * Hard to know WHY it does something * Can learn things you don't want it to (e.g., detecting wolves via snow in the background, or predicting disease from bedside x-ray equipment) * Transfer learning is hard * No chance of AGI
Deep Learning use cases
* Tons of data * It isn't clear how to create good features for the task * A pre-existing network can be fine-tuned for your task ... OR ... you have the time and cleverness to create something from scratch * You know what you're doing
Inference
*Answers the question: is there a relationship? Also usually: What direction? How strong is it? *Contrast with prediction, which asks whether we can predict measurement(s) for individuals.
Feature Selection
*Determines which variables are most predictive and includes them in the model. *Variables that can be used for accurate prediction exploit the relationship between the variables, but this does NOT mean that one causes the other.
What is machine learning?
*Machine learning is the science of getting computers to act without being explicitly programmed. *Machine learning is a method of data analysis that automates analytical model building. *Machine learning is just picking an appropriate model and then minimizing a loss function.
Sentiment Analysis
*Programmatically infer the emotional content of text. *Compare the text to a sentiment lexicon: a dataset containing words classified by their sentiment.
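A toy sketch of lexicon-based sentiment scoring; the lexicon below is a made-up stand-in for a real one (e.g., AFINN or bing):

```python
# Score text by summing the sentiment of each token found in the lexicon.
lexicon = {"great": 1, "love": 1, "terrible": -1, "awful": -1}

def sentiment_score(text):
    """Sum the sentiment values of tokens that appear in the lexicon."""
    return sum(lexicon.get(tok, 0) for tok in text.lower().split())

print(sentiment_score("I love this great class"))      # 2  (positive)
print(sentiment_score("The traffic was awful today"))  # -1 (negative)
```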
Ecological Fallacy
*Situation that can occur when a researcher or analyst makes an inference about an individual based on aggregate data for a group. *Issues: inferences drawn about associations between the characteristics of an aggregate population and the characteristics of sub-units within the population are wrong. That is: results from aggregated data (e.g., counties) cannot be applied to individual people. *What should we do? Be aware that aggregating or disaggregating data may conceal variations that are not visible at the larger aggregate level.
What is wrong with plotting raw count histogram-type distributions of many real-world geospatial features, such as population, economic activity, number of website users, etc.?
*Spatial data violate assumptions of conventional statistics. *Most geospatial features correlate strongly with population, so a raw-count map mostly just shows where people live; normalize (e.g., per capita) before plotting.
TF-IDF
*Term Frequency - Inverse Document Frequency: a measure of how important a word is to a document within a collection. *Term Frequency (TF): how frequently a word occurs in a document. *Inverse Document Frequency (IDF): down-weights words that appear in many documents; words that are rare across the collection get higher weight.
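A minimal hand-rolled sketch of the score, tf(t, d) * log(N / df(t)); real libraries such as scikit-learn use smoothed variants, so exact numbers differ, and the documents below are invented:

```python
# TF-IDF by hand on a toy three-document corpus.
import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)        # term frequency in this document
    df = sum(term in d for d in docs)      # number of documents containing the term
    idf = math.log(N / df)                 # rare terms get larger weights
    return tf * idf

print(tf_idf("the", docs[0]))  # 0.0 -- appears in every doc, so idf = 0
print(tf_idf("cat", docs[0]))  # > 0 -- helps distinguish this doc
```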
exploratory analysis
*The goal is to find unknown relationships between the variables you have measured in your data set. Exploratory analysis is open-ended and designed to verify expected relationships, or find unexpected ones, between measurements. *What can the data tell us? *Understand data properties *Discover patterns *Generate & frame hypotheses *Suggest modeling strategies *Check assumptions (sanity checks) *Communicate results (present the data)
When would you use TF-IDF?
*A high score: how is this item different from most items in a collection? Unique customer needs; subreddit/Discord-server topics; which document to retrieve when I search for X. *A low score: how is this item the same as most other items in a collection? Finding stop words; needs that are common across customers.
Supervised Learning
*You tell the computer what features to use to make predictions. *Prediction accuracy dependent on training data. *Categorical variables (Classification) *Continuous variables (Regression)
choropleth map
*A map that uses differences in shading, coloring, or the placing of symbols within predefined areas to indicate the average values of a property or quantity in those areas. *Choropleth maps shine when displaying a single variable. *They excel at displaying the big picture, not subtle differences. *They should display relative differences (rates), not absolute numbers. *Use light colors for low values, dark colors for high values. *Consider using the smallest unit possible.
Tokenization
*Takes a corpus of text and splits it into tokens. *Token: a meaningful unit of text; the unit you use for analysis.
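A minimal tokenization sketch using the standard library; real tokenizers (e.g., nltk, spaCy) handle punctuation and edge cases more carefully:

```python
# Lowercase the corpus and split it into word tokens with a regex.
import re

corpus = "Tokenization takes a corpus of text, and splits it into tokens!"
tokens = re.findall(r"[a-z']+", corpus.lower())
print(tokens)  # ['tokenization', 'takes', 'a', 'corpus', 'of', 'text', ...]
```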
Machine Learning Bias
1. Anticipate and plan for potential biases before model generation. Check for bias after. 2. Use machine learning to improve lives rather than for punitive purposes. 3. Revisit your models. Update your algorithms. 4. You are responsible for the models you put out into the world, unintended consequences and all.
Visualization best practices
1. Choose the right type of visualization 2. Use appropriate colors 3. Label your axes 4. Make sure everything is big enough 5. Make sure numbers add up 6. Make sure plot reflects the data 7. Make comparisons easy on your reader 8. Don't deceive viewers 9. Keep it simple
Why effective data communication matters
1. It's often the only thing your coworkers/bosses see 2. It can set your work apart from others' 3. It helps show off the awesome stuff you've done 4. Cognitive load is a thing
Model
= algorithm + hyperparameters + data features. *Algorithm: pick a kind of ML based on the task; pick a (simple) algorithm that can solve it. *Hyperparameters: some algorithms need you to pick settings; these can be picked automatically through validation (see the sketch below). *Data features: subset or transform variables to make learning easier; can also be picked automatically through validation.
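A minimal sketch of picking a hyperparameter automatically through validation, assuming scikit-learn; the model choice (k-nearest neighbors) and the built-in iris dataset are illustrative:

```python
# Grid-search a hyperparameter (number of neighbors) with cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X, y)
print(search.best_params_)  # the setting that validated best
```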
Histogram
A graph of vertical bars representing the frequency distribution of a set of data. Only one variable.
box plot
A graph that displays the highest and lowest quarters of data as whiskers, the middle two-quarters of the data as a box, and the median.
Scatterplot
A graphical depiction of the relationship between two variables.
Correlation
A measure of the relationship between two variables. Pearson Correlation.
Algorithm
A methodical, logical rule or procedure that guarantees solving a particular problem.
Bell-Shaped Distribution
A probability distribution in which the highest frequency occurs in the middle and frequencies tail off to the left and right of the middle.
confounding
A variable that influences both the dependent variable and the independent variable, causing a false association.
Regression
A way to forecast a given numerical quantity using other relevant features. Asks: does change in one variable mean change in another?
Edge Effects (The Boundary Problem)
Analyzing region A vs. region B in isolation ignores similarities between the two that arise from their shared boundary.
Anscombe's Quartet
Anscombe's quartet comprises four data sets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (x, y) points. The quartet demonstrates both the importance of graphing data before analyzing it and the effect of outliers and other influential observations on statistical properties.
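A quick check of that point, assuming seaborn (which downloads its bundled copy of the quartet on first use): the four datasets report near-identical summaries despite looking nothing alike when plotted.

```python
# Print per-dataset summary statistics for Anscombe's quartet.
import seaborn as sns

df = sns.load_dataset("anscombe")
for name, group in df.groupby("dataset"):
    r = group["x"].corr(group["y"])
    print(f"{name}: mean_y={group['y'].mean():.2f}, var_y={group['y'].var():.2f}, r={r:.3f}")
```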
barplot
Count of values within a categorical variable
Spatial Autocorrelation
Data from locations near one another in space are more likely to be similar than data from locations remote from one another: Housing market; Elevation change; Temperature
Density plot
Demonstrates the distribution of the data (a smoothed version of a histogram) and helps to identify extreme values.
A/B Testing
Designing and running an experiment to compare two versions (typically of a web page or an app) to determine which is "better". * Choose one key metric for testing. * Design the experiment & run it for the length of time planned. * Confidence intervals are more important than p-values. * Don't look at all the possible subgroups. * Look for "bucketing skew" (different sample sizes between groups A and B). * Only include meaningful users in your sample. * Keep the analytical approach simple. * Change one thing at a time.
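A minimal sketch of the "confidence intervals over p-values" point: a 95% CI for the difference in conversion rates between two versions. All counts below are invented.

```python
# Normal-approximation CI for the difference of two proportions.
import math

conv_a, n_a = 120, 1000   # conversions / users shown version A
conv_b, n_b = 150, 1000   # conversions / users shown version B
p_a, p_b = conv_a / n_a, conv_b / n_b

se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
diff = p_b - p_a
print(f"diff={diff:.3f}, 95% CI=({diff - 1.96*se:.3f}, {diff + 1.96*se:.3f})")
```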
Comparison of means
Tests for a difference in means between groups (e.g., with a t-test).
Deep Learning
Effective + mechanistic definition: Multi-layered artificial neural networks that improve at tasks through experience. Works by function optimization.
What was the algorithm proposed in A Mulching Proposal (R2)? What does this have to do with ethics?
A satirical algorithm in which elderly people are rendered down into a fine nutrient slurry, directly addressing both an aging population and food insecurity. The ethical point: a system can satisfy Fair, Accountable, and Transparent criteria and still be grossly unethical, so checklist-style ethics are not enough.
Loss function
A function that measures how badly the model's predictions miss the data; training minimizes it. For linear regression: find the line that minimizes the sum of squared errors (see the sketch below).
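A minimal sketch of sum-of-squared-errors as the loss, with the minimizing line found by ordinary least squares via numpy; the data are invented:

```python
# Compare the SSE of the least-squares line against a deliberately worse line.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

def sse(slope, intercept):
    """Sum of squared errors between predictions and observations."""
    return float(np.sum((y - (slope * x + intercept)) ** 2))

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares fit
print(sse(slope, intercept))                # minimal SSE
print(sse(slope + 0.5, intercept))          # a worse line -> larger loss
```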
data-ink ratio
Good graphics should consist mostly of data-ink; non-data-ink should be removed wherever possible, so that the viewer's attention is not drawn to irrelevant elements of the presentation.
How can the performance of an algorithm be measured?
How efficient is it (how many steps does it take; think sorting)? How precise is it? How often does it fail, and how does it fail (does it handle edge cases well)? Most importantly: how well does it solve the problem it is designed to solve?
Shape
It's critical to know the distribution of the variables in your dataset because certain statistical approaches can only be used with certain distributions.
Central Tendency
Knowing the mean, median, and/or mode can help you get an idea of what a typical value is for your variable of interest. Mean and median are used to summarize the central tendency for quantitative variables; mode is most helpful in describing the central tendency for categorical variables.
Model Assessment
Measurement | Purpose | Variable type
RMSE (Root Mean Squared Error) | Summarizes the distance between prediction and actual value; sensitive to outliers (lower = better) | Continuous
Accuracy | What % were predicted correctly? (higher = better) | Categorical
Sensitivity (Recall) | Of those that were actually positive, what % were predicted positive? (higher = better) | Categorical
Specificity | Of those that were actually negative, what % were predicted negative? (higher = better) | Categorical
Precision (PPV) | Of those predicted to be positive, what % were actually positive? (higher = better) | Categorical
F1 score | Combines precision and recall (0 = poor; 1 = best) | Categorical
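A minimal sketch computing these metrics, assuming scikit-learn; the true/predicted labels are invented. Specificity has no dedicated function, but it equals recall of the negative class.

```python
# Classification metrics plus RMSE for a continuous task.
import numpy as np
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, mean_squared_error)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print("accuracy:", accuracy_score(y_true, y_pred))
print("recall (sensitivity):", recall_score(y_true, y_pred))
print("specificity:", recall_score(y_true, y_pred, pos_label=0))
print("precision (PPV):", precision_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))

y_obs, y_hat = np.array([3.0, 5.0, 2.5]), np.array([2.8, 5.4, 2.1])
print("RMSE:", np.sqrt(mean_squared_error(y_obs, y_hat)))
```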
Correlation = Causation
NO. Association does NOT prove causation. Correlation indicates the possibility of a cause-effect relationship but does not prove such.
Classification
Often we seek to assign a label to an item from a discrete set of possibilities.
data outliers
Outliers can occur due to... * Data entry errors * Poor sampling procedures * Technical or mechanical error * Unexpected changes in weather * People providing inaccurate information
How can plotting certain features, such as disease prevalence rates by geographic region, help scientists form new hypotheses about disease causes (or cures)?
Plotting disease prevalence rates can help scientists determine how a virus/disease spreads and what conditions predispose you to a disease. With these two things, scientists can determine what causes a disease and thus, how to potentially prevent it.
Model Selection
Supervised: *Regression = predicting continuous variables (e.g., age). *Classification = predicting categorical variables (e.g., education level). Unsupervised: *Data given as input => model identifies patterns in the input data => prediction output. *Pattern-finding methods: PCA, k-means clustering, t-SNE, neural nets, self-organizing maps. *Useful for exploring a dataset (EDA), finding useful features in higher-dimensional data, or finding previously unknown categories (see the sketch below).
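A minimal sketch of an unsupervised model, assuming scikit-learn: k-means finds cluster structure in unlabeled toy 2-D points.

```python
# k-means discovers the two blobs without being given any labels.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],   # one blob
              [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])  # another blob
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # e.g., [0 0 0 1 1 1] -- structure found from the data alone
```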
Modifiable Areal Unit Problem (MAUP)
The aggregation units used are arbitrary with respect to the phenomena under investigation, yet the aggregation units used will affect statistics determined on the basis of data reported in this way. If the spatial units in a particular study were specified differently, we might observe very different patterns and relationships. Example: gerrymandering.
Unsupervised Learning
The computer determines how to classify based on properties within the data. *Categorical (Clustering) *Dimensionality reduction (Continuous)
descriptive analysis
The goal of descriptive analysis is to understand the components of a data set, describe what they are, and explain that description to others who might want to understand the data.
Skewed Right Distribution
The peak of the data is to the left side of the graph. There are only a few data points to the right side of the graph.
Skewed Left Distribution
The peak of the data is to the right side of the graph. There are only a few data points to the left side of the graph.
At Stitch Fix, at what points in the process are algorithms used?
They are used throughout the process: managing the inventory of the warehouses, minimizing the cost of shipping, finding the best wardrobe based on preferences, and finding the best combination of the above conditions for the user and the company. They also look at what other users like a specific user like to try and guess what that user likes.
What does it mean for an algorithm to be FAT (Fair, Accountable, Transparent)?
This says that algorithmic systems should, to be ethical, be: (1) Fair: lacking biases which create unfair and discriminatory outcomes; (2) Accountable: answerable to the people subject to them; (3) Transparent: open about how, and why, particular decisions were made.
What is meant by "What the average user thinks they are doing is what is actually being done?"
Ties into informed consent: the average user (who is not deeply familiar with data science) should not have data collected that they do not know about, and the data that is collected should actually be used in the way users think it is being used, for the good of the service. There should be no data used in the background that the user doesn't know about.
Tables
Used to arrange text in columns and rows and are helpful in presenting, organizing, and clarifying information. Effective ways to display data summaries.
Spatial Statistics
Ways spatial data violate the assumptions of conventional statistics: ● Spatial autocorrelation ● Modifiable areal unit problem (MAUP) ● Edge effects (boundary problem) ● Ecological fallacy ● Nonuniformity of space
Why are maps the killer app?
We all want to know where stuff is, and we want to be able to get around; maps answer those universal questions.
Predictive Analysis Ethics
When models are trained on historical data, predictions will perpetuate historical biases.
bimodal distribution
a distribution with two modes (peaks)
Basic steps of machine learning
data partitioning; feature selection; model selection; model assessment.
Isarithmic maps
demonstrate smooth, continuous phenomena (temperature, elevation, rainfall, etc.)
Word clouds
display the words proportional to their frequency within the textual dataset.
Variability
In a set of numbers, how widely dispersed the values are from each other and from the mean. *Range: highest score - lowest score. *Interquartile Range (IQR): 75th percentile - 25th percentile. *Variance: measures how close the values in the distribution are to the middle of the distribution; the average squared difference of the scores from the mean. *Standard deviation (SD): square root of the variance.
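A minimal sketch computing these variability measures with numpy; the scores are invented:

```python
# Range, IQR, variance, and standard deviation of a toy dataset.
import numpy as np

scores = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)
print("range:", scores.max() - scores.min())
print("IQR:", np.percentile(scores, 75) - np.percentile(scores, 25))
print("variance:", scores.var())  # average squared deviation from the mean
print("SD:", scores.std())        # square root of the variance
```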
Pearson's r
Measures the linear correlation between two variables; takes values in [-1, 1].
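A minimal sketch, assuming scipy; the data are invented:

```python
# Pearson's r (and its p-value) for two small samples.
from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
r, p = pearsonr(x, y)
print(f"r={r:.3f}, p={p:.3f}")  # r near 1 => strong positive linear relationship
```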
How do we describe a dataset?
size, missingness, shape, central tendency, variability
t-test
Tests for a difference in means between groups. Assumptions: data are continuous, normally distributed, large enough sample size, equal variance between groups.
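A minimal sketch of a two-sample t-test, assuming scipy; both samples are invented:

```python
# Two-sample t-test; equal variances are assumed by default.
from scipy.stats import ttest_ind

group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [5.8, 6.0, 5.7, 6.1, 5.9]
t, p = ttest_ind(group_a, group_b)
print(f"t={t:.2f}, p={p:.4f}")  # small p -> the group means likely differ
```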
Uniform distribution
the frequency of each value of the variable is evenly spread out across the values of the variable
p-value
the probability of getting the observed results (or results more extreme) by chance alone