Introduction to Data Science
Advanced Imputing | MICE
Multiple Imputation by Chained Equations. sklearn.impute.MICEImputer (v0.20; renamed IterativeImputer in later releases). More in Python (not sklearn): the fancyimpute package includes KNN (k-nearest neighbor) imputation, SoftImpute, MICE, and more.
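A minimal sketch under the modern scikit-learn naming, where MICEImputer became IterativeImputer (the data is made up):

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # required: the imputer is experimental
    from sklearn.impute import IterativeImputer

    X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
    # The NaN is filled with a value modeled from the other column via chained equations
    print(IterativeImputer(random_state=0).fit_transform(X))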
Data Collection | Labeling
Obtaining gold-standard answers for supervised learning
Supervised Learning | Univariate Regression
Simplest linear regression model:
- Models the relation between a single feature (explanatory variable x) and a real-valued response (target variable y)
- Given data (x, y) and a line defined by an intercept and slope, the vertical offset of each point from the line is the error between the true label y and the prediction based on x
- The best line minimizes the Sum of Squared Errors (SSE)
- We usually assume the error is Gaussian distributed with mean zero and fixed variance
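A minimal sketch: fit the line with sklearn and compute the SSE it minimizes (data is made up):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    x = np.array([[1.0], [2.0], [3.0], [4.0]])   # single feature
    y = np.array([2.1, 3.9, 6.2, 8.1])           # real-valued target

    model = LinearRegression().fit(x, y)
    pred = model.predict(x)
    sse = np.sum((y - pred) ** 2)                # Sum of Squared Errors
    print(model.intercept_, model.coef_[0], sse)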
Supervised Learning | Recurrent Neural Network (RNN)
A neural network for sequential data. For time-series data, NLP, or translation applications, where the sequence of the data means something, this is the way to go.
Supervised Learning | Convolutional Neural Networks
Useful for image analysis: the input is an image or a sequence of images, which is weighted. Uses kernels as filters to extract local features. Pooling reduces the size with max or average pooling, a dimension-reduction process.
Semi-Supervised Learning
Some of the training data will have labels and the rest will not
Domain Knowledge | ML Workflow
What needs to happen with the data to solve a business problem. The first step is transforming it into an ML problem; then get data... (see the referenced video for the flow chart)
To prevent encoding too many classes:
When using OneHotEncoder: define a hierarchy structure and group the levels by similarity to reduce the overall number of groups.
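A hedged sketch of the grouping step: rare levels fold into an "Other" bucket before encoding (the column name, data, and threshold are made up):

    import pandas as pd

    df = pd.DataFrame({"city": ["NYC", "NYC", "LA", "LA", "Boise", "Reno"]})
    counts = df["city"].value_counts()
    rare = counts[counts < 2].index                      # levels seen fewer than 2 times
    df["city_grouped"] = df["city"].replace(list(rare), "Other")
    print(pd.get_dummies(df["city_grouped"]))            # only LA, NYC, Other columns remain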
Find the: Robust scaling
sklearn.preprocessing.RobustScaler
Underfitting
- Failure to capture important patterns in the training set
- Typically indicates the model is too simple or there are too few explanatory variables
- Not flexible enough; corresponds to high bias: the results show a systematic lack of fit in certain regions
Supervised Learning | Multivariate Regression
- Multiple linear regression includes N explanatory variables, with N >= 2
- Sensitive to correlation between features, resulting in high variance of coefficients (see the sketch below)
- scikit-learn implementation: sklearn.linear_model.LinearRegression
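A hedged sketch of that correlation sensitivity: two nearly identical synthetic features yield large, offsetting coefficients even though predictions stay fine:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.RandomState(0)
    x1 = rng.randn(100)
    x2 = x1 + 0.01 * rng.randn(100)          # almost perfectly correlated with x1
    y = 3 * x1 + 0.1 * rng.randn(100)

    X = np.column_stack([x1, x2])
    print(LinearRegression().fit(X, y).coef_)  # unstable: coefficients can be large and offsetting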
Supervised Learning | Linear Regression
- Parametric methods where the learned function has a fixed form (ref. fx)
- Optimized by learning weights, applying (stochastic) gradient descent to minimize the loss function (ref. fx)
- A good place to start with a new problem
Methods:
* Linear regression for a numeric target outcome
* Logistic regression for a categorical target outcome
Data Quality
- Consistency of the data
- Accuracy of the data
- Is it what we need to solve the business problem?
- Noisy data
- Missing data
- Outliers in the data
- Bias
- Variance
Neural Networks
- Layers of nodes connected together
- Each node is one multivariate linear function, with a univariate nonlinear transformation
- Trained via (stochastic) gradient descent
- Can represent any non-linear function (very expressive)
Scaling Transformation | sklearn
- Mean/variance standardization | sklearn.preprocessing.StandardScaler
- MinMax scaling
- MaxAbs scaling
- Robust scaling (for the column)
- Normalizer (done for each row)
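A minimal sketch of two of these column scalers on made-up data:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
    print(StandardScaler().fit_transform(X))  # zero mean, unit variance per column
    print(MinMaxScaler().fit_transform(X))    # each column rescaled to [0, 1]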
Considerations when Feature Engineering
- Numerical feature transformations of attributes
- Combinations of attributes, to understand the relationships between them
Amazon Mechanical Turk
- Obtain human intelligence on demand
- Access a global, on-demand, 24/7 workforce
- Pay only for what you use
- Use for labeling / human annotation
Model Tuning | Hyperparameter
1. An external configuration whose value cannot be estimated from the data. 2. An estimator parameter that is not fitted to the data.
Before dropping or imputing (replacing) missing values, ask:
1. What mechanisms caused the missing values? (People may not have wanted to answer certain questions; data can be missing on purpose.) 2. Are these values missing at random? 3. Are there rows or columns missing that I am not aware of?
Seaborn
A Python data visualization library based on matplotlib
Supervised Learning | Decision Trees
A decision-making process similar to the mental models used by human beings: one input informs the next decision until a final answer is reached. Susceptible to overfitting; trees must be pruned, or use an ensemble method. sklearn.tree.DecisionTreeClassifier
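A minimal sketch: limiting depth is one simple way to keep a tree from overfitting (the dataset choice and depth are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)  # depth cap acts like pruning
    print(clf.predict(X[:5]))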
AWS Professional Services
A global team of experts that can help you realize your desired business outcomes when using the AWS Cloud.
Solutions Architects (SAs)
A machine learning specialist who serves as the Subject Matter Expert (SME) for designing machine learning solutions that leverage AWS services to automate solutions and drive down costs for customers.
Neural Networks | Perceptron
A single-layer neural network that uses a list of input features.
Imputation
A technique that replaces a missing value with an estimated value, including the attribute's:
- Mean
- Median
- Most frequent value (for categoricals)
- or any other estimated value
from sklearn.preprocessing import Imputer (deprecated; newer scikit-learn versions use sklearn.impute.SimpleImputer)
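A minimal sketch using the modern replacement, SimpleImputer, on made-up data:

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0], [np.nan], [3.0]])
    imp = SimpleImputer(strategy="median")   # also: "mean", "most_frequent", "constant"
    print(imp.fit_transform(X))              # NaN becomes 2.0, the column median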
Supervised Learning | K-Nearest Neighbors
A way to figure out the response for a new observation based on how close it is to the training dataset.
1. Define the distance: Euclidean distance, Manhattan distance, or any vector norm
2. Choose the number of neighbors, k
3. Find the k nearest neighbors of the new observation we want to classify
Essentially, those k neighbors vote to classify the new observation.
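A minimal sketch of the neighbor vote on made-up data:

    from sklearn.neighbors import KNeighborsClassifier

    X = [[0, 0], [1, 1], [8, 8], [9, 9]]
    y = [0, 0, 1, 1]
    knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X, y)
    print(knn.predict([[7, 7]]))   # its 3 nearest neighbors vote -> class 1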
Model Training
Builds the model. Improves the model by optimizing parameters or data.
Filtering Sound
Choose the relevant features and remove frequencies whose power is below a threshold, as in voice analysis.
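A hedged sketch of that threshold filtering in the frequency domain with NumPy (the signal and the threshold are made up):

    import numpy as np

    t = np.linspace(0, 1, 1000, endpoint=False)
    signal = np.sin(2 * np.pi * 5 * t) + 0.1 * np.random.randn(1000)  # tone + noise

    spectrum = np.fft.rfft(signal)
    power = np.abs(spectrum) ** 2
    spectrum[power < 0.01 * power.max()] = 0   # drop frequencies below the power threshold
    clean = np.fft.irfft(spectrum)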
True or False | When feature engineering you should not try generating many features first, then apply dimensionality reduction if needed
False; when feature engineering you should generate as many features as you can first, then apply dimensionality reduction if needed.
Error Analysis (EA)
Filter to the failed predictions and manually look for patterns. This helps you pivot on target, key attributes, and failure type, and build histograms of error counts.
Supervised Learning | Logistic Regression
Forecasts a binary outcome (yes or no, true or false) based on the data. Finds the best weight vector by fitting the training data.
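A minimal sketch on made-up data:

    from sklearn.linear_model import LogisticRegression

    X = [[0.5], [1.0], [4.0], [5.0]]
    y = [0, 0, 1, 1]
    clf = LogisticRegression().fit(X, y)
    print(clf.predict([[0.8]]), clf.predict_proba([[0.8]]))  # predicted class and its probability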
Amazon SageMaker
A fully managed platform built for data pre-processing, feature engineering, model fitting, model evaluation, and model deployment, all in one.
Confidence Interval
Go with the larger sample size. This is based on the width of the interval, not the percentage of the metric.
Managing Labelers | Gold Standard HITs
HITs with known labels mixed in to identify problematic labelers
HITs
Human Intelligence Tasks, used when labeling; see Amazon Mechanical Turk
Data Sampling | Leakage
Information used in training and validation that is not available in production
Data Labeling | Human
Labeling guidelines need to be VERY clear. Human Intelligence Tasks (HITs)
When preprocessing data, what are the risks of dropping rows?
Losing too much data:
- Overfitting, wider confidence intervals, etc.
- May bias the sample
When preprocessing data, what are the risks of dropping columns?
May lose information in features (underfitting)
Correlation Matrix
Measures the linear dependence between features; can be visualized with heat maps
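A minimal sketch with a made-up DataFrame, using the seaborn library mentioned elsewhere in this deck:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 2]})
    corr = df.corr()               # pairwise linear dependence between features
    sns.heatmap(corr, annot=True)  # visualize as a heat map
    plt.show()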
Unsupervised Learning
Models learn from data that has not been labeled; we only have a collection of features
Feature Scaling
Motivation: the ranges of features are on dramatically different scales. Example: gradient descent and kNN (k-nearest neighbors) are affected by feature scale. Solution: align all features onto the same scale.
Learning Curves
Plot training-set and validation-set error or accuracy against training-set size. Motivation: detect whether the model is underfitting or overfitting, and the impact of training-data size. sklearn.model_selection.learning_curve
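A minimal sketch using the current import path (dataset and estimator choices are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = load_iris(return_X_y=True)
    sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y, cv=5)
    # Mean train vs. validation score at each training-set size
    print(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1))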
Supervised Learning | Non-Linear Support Vector Machines
Popular in research, but not in industry.
- Kernelize for nonlinear problems:
1. Choose a distance function called a "kernel"
2. Map the learning task to a higher-dimensional space
3. Apply a linear SVM classifier in the new space
- Not memory-efficient, because it stores the support vectors, which grow with the size of the training data
- Computation is expensive
sklearn.svm.SVC
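A minimal sketch of the kernel trick with the common RBF kernel, on a tiny XOR-like dataset that no line can separate:

    from sklearn.svm import SVC

    X = [[0, 0], [1, 1], [0, 1], [1, 0]]   # not linearly separable
    y = [0, 0, 1, 1]
    clf = SVC(kernel="rbf", gamma=2.0).fit(X, y)
    # Support vectors are stored in the model, as noted above
    print(clf.predict([[0.9, 0.1]]), len(clf.support_vectors_))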
Text-based Features | Count Vectorizer
Per-word value is the count (also called term frequency). Includes lowercasing and tokenization on white space and punctuation. sklearn.feature_extraction.text.CountVectorizer
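A minimal sketch on two made-up documents:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["The cat sat.", "The cat and the hat."]
    vec = CountVectorizer()           # lowercases and tokenizes by default
    X = vec.fit_transform(docs)       # sparse matrix of per-word counts
    print(sorted(vec.vocabulary_), X.toarray())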
Data Science
Processes and systems to extract knowledge or insights from data, either structured or unstructured.
Supervised Learning | Decision Trees | Entropy
Relative measure of disorder in the data source
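A worked example, assuming the standard Shannon entropy, H = -sum(p_i * log2(p_i)), is what is meant here:

    import numpy as np

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()          # class proportions
        return -np.sum(p * np.log2(p))

    print(entropy([0, 0, 1, 1]))   # 1.0 bit: maximum disorder for two classes
    print(entropy([0, 0, 0, 0]))   # 0.0: a pure, perfectly ordered node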
Supervised Learning | Logistic Regression | Sigmoid Curve
Represents the probability (ref. fx)
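The referenced formula is presumably the standard logistic sigmoid, sigma(z) = 1 / (1 + e^(-z)), which squashes any score into (0, 1); a tiny sketch:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    print(sigmoid(0.0), sigmoid(4.0))   # 0.5 at the decision boundary, near 1 for large scores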
Random Sampling
Sampling so that each source data point has equal probability of being selected
Filtering Images
Select the relevant features, and remove channels from an image if color is not important.
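A hedged sketch: collapsing RGB channels to grayscale when color doesn't matter (the image here is random):

    import numpy as np

    img = np.random.rand(32, 32, 3)        # fake H x W x RGB image
    gray = img.mean(axis=2)                # simple channel average drops color
    print(img.shape, "->", gray.shape)     # (32, 32, 3) -> (32, 32)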
Supervised Learning | Random Forests
A set of decision trees, each learned from a different randomly sampled subset (with replacement). sklearn.ensemble.RandomForestClassifier
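A minimal sketch (dataset choice is illustrative):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)  # 100 randomized trees
    print(rf.predict(X[:3]))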
Data Collection | Sampling
Subsets of instances (examples/data points) for training and testing
The Model Training Process
The execution of an algorithm against a dataset, after you've applied some parameters
Model Tuning | Regularization Strength
The larger the alpha, the larger the penalty.
Data Collection
The process of acquiring training and/or test data
Radial Basis Function
Transforms the original data based on distance from a center. Used in Support Vector Machines as a kernel and in Radial Basis Neural Networks (RBNNs). The Gaussian RBF is the most common RBF used.
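A tiny sketch of the common Gaussian RBF, K(x, x') = exp(-gamma * ||x - x'||^2); the gamma value is made up:

    import numpy as np

    def gaussian_rbf(x, center, gamma=1.0):
        # Value depends only on the squared distance from the center
        return np.exp(-gamma * np.sum((x - center) ** 2))

    print(gaussian_rbf(np.array([1.0, 1.0]), np.array([0.0, 0.0])))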
Supervised Learning
Models learn from training data that has been labeled with the true label. Example: a teacher instructing a student.
- Regression
- Classification
True or False | Integer-encoding nominals is wrong
True, because the implied ordering and size of the integers are meaningless
True or False | Many learning algorithms can't handle missing values
True; when preprocessing data, many algorithms cannot handle missing values
Problem Formulation
Turning a business problem into an ML problem that can be solved with an algorithm. Ask GREAT questions!
- Define the problem that needs to be solved
- What is the business metric?
- Is ML the right approach?
- What data is available?
- What type of ML problem is it? (You may use a sequence of algorithms for the problem.)
- What are my goals?
Supervised Learning | Linear Support Vector Machines
Two features, and two classes that are linearly separable, i.e., can be divided by a line. A popular approach in research, but not in industry.
- Simplest case: maximize the margin, the distance between the decision boundary (hyperplane) and the support vectors (the training examples closest to the boundary)
- The max-margin picture is not applicable in the non-separable case
sklearn.svm.SVC
Model Quality
Is the model underfitting or overfitting? Is the model learning from the correct amount of clean, organized data?
Computation Speed and Scalability
Use SageMaker and EC2 instances for training in order to:
- increase speed
- address prediction-time complexity
- address space complexity
How do you drop rows with NULL values in pandas?
Use df.dropna(). More complicated rules include: df.dropna(how='all'), df.dropna(thresh=4), df.dropna(subset=['Fruits'])* (*substitute the appropriate column with NULL values). A toy example follows.
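A minimal sketch exercising the rules above; the column names and data are made up:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"Fruits": [1.0, np.nan, 3.0], "Veg": [np.nan, np.nan, 6.0]})
    print(df.dropna())                    # drop rows with any NULL
    print(df.dropna(how="all"))           # drop rows that are entirely NULL
    print(df.dropna(thresh=2))            # keep rows with at least 2 non-NULL values
    print(df.dropna(subset=["Fruits"]))   # only consider NULLs in the Fruits column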
Machine Learning
Using a dataset to estimate the underlying function that best generalizes, or predicts, LABELS for new examples
Data Sampling | Train/ test bleed
Overlap of training and test data when sampling to create datasets; see the reference for ways to avoid it
Find the: Maxabs scaling
sklearn.preprocessing.MaxAbsScaler
Advantages: doesn't destroy sparsity, because we are just scaling
Text-based Features | Hashing Vectorizer
sklearn.feature_extraction.text.HashingVectorizer
A stateless mapper from text to term indices. More efficient when there is too much data to cache a vocabulary.
Text-based Features | TfidfVectorizer
sklearn.feature_extraction.text.TfidfVectorizer
Term Frequency times Inverse Document Frequency: the per-word value is downweighted for terms common across documents (e.g., "the")
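A minimal sketch on made-up documents, showing the downweighting of a term common to every document:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat", "the dog ran", "the cat ran"]
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)
    print(sorted(vec.vocabulary_))     # "the" appears in every document
    print(X.toarray().round(2))        # so its tf-idf weight is comparatively low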
Find the : Mean/Variance standardization
sklearn.preprocessing.StandardScaler
Advantages:
- Many algorithms behave better with smaller values
- Keeps outlier information, but reduces its impact
Find the: MinMax scaling
sklearn.preprocessing.MinMaxScaler
Advantages: robust for cases with small standard deviations
Find the: Normalizer
sklearn.preprocessing.Normalizer
Applied to each row, as opposed to columns. Used in text analysis.
Polynomial Transformation
sklearn.preprocessing.PolynomialFeatures
For when you have multiple numeric values. Example: transforming x -> x squared or cubed.
Considerations:
- Beware of overfitting if the degree is too high; risk of extrapolation beyond the data range when using polynomial transformation
- Consider non-polynomial (non-linear) transformations, like log and sigmoid transformations, as well
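A minimal sketch, keeping the degree modest as cautioned above:

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.array([[2.0], [3.0]])
    poly = PolynomialFeatures(degree=3)
    print(poly.fit_transform(X))   # columns: 1, x, x^2, x^3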
Sampling Representativity
An unbiased representation of the expected production population. Especially important for testing and measurement sets. It is also important for training sets, to get good generalization.
AWS Domain Assistance Services
- AWS ML specialist SAs
- AWS Professional Services
- AWS ML Solutions Lab
- AWS Partner Network (3rd parties)
Overfitting
- Model performs well on the training set but poorly on the test set
- Indicates that the model is too flexible for the amount of training data, allowing it to "memorize" the "noise"
- Corresponds to high variance: small changes in the training data lead to big changes in the results
Detailed | K-Nearest Neighbors
- Non-parametric, instance-based, or lazy learning (memorizes the training data)
- Requires keeping the original dataset
- Space complexity and prediction-time complexity grow with the size of the training data
- Suffers from the curse of dimensionality: points become increasingly isolated with more dimensions, for a fixed-size training dataset
sklearn.neighbors.KNeighborsClassifier
Supervised Learning | Neural Networks
sklearn.neural_network.MLPClassifier (see video @ 4:59)
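A minimal sketch of the referenced estimator (the layer size and dataset are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.neural_network import MLPClassifier

    X, y = load_iris(return_X_y=True)
    mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000,
                        random_state=0).fit(X, y)   # one small hidden layer
    print(mlp.predict(X[:3]))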
Data Labeling | Tools
Amazon Mechanical Turk
Managing Labelers | Plurality
Assign each HIT to multiple labelers to identify difficult or ambiguous cases, or problematic labelers
The Map Function in Pandas
Can transform categorical values into numerical values. Or use sklearn.preprocessing.LabelEncoder for labels.
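A minimal sketch of both routes; the size mapping is made up:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    s = pd.Series(["small", "large", "medium"])
    print(s.map({"small": 0, "medium": 1, "large": 2}))   # ordinal encoding via map

    le = LabelEncoder()                                   # for target labels
    print(le.fit_transform(["cat", "dog", "cat"]))        # [0 1 0]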
Testing Data
Used to evaluate how well the final model generalizes.
Model Tuning | Regularization
Adding a penalty term/score for complexity to the cost function. Overfitting error can be reduced using this technique, which helps evenly distribute weights among features. Two standard types used in regression problems:
L1 regularization, Lasso: sklearn.linear_model.Lasso
L2 regularization, Ridge: sklearn.linear_model.Ridge
Linear regression with both, Elastic Net regression: sklearn.linear_model.ElasticNet
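A minimal sketch on made-up data; alpha is the regularization strength from the card above:

    from sklearn.linear_model import Lasso, Ridge

    X = [[0.0], [1.0], [2.0], [3.0]]
    y = [0.1, 1.1, 1.9, 3.2]
    print(Ridge(alpha=1.0).fit(X, y).coef_)   # L2: shrinks weights toward zero
    print(Lasso(alpha=0.5).fit(X, y).coef_)   # L1: can zero weights out entirely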
Data Preprocessing | Encoding Categorical Variables
Also called discrete variables; special treatment is needed. Encoding more than two categories as integers implies an ordering and size relationship between them, so unless one truly exists the solution is wrong (binary is OK). Different types:
- Ordinal values: organized and ordered; use the map function in pandas
- Nominal values: categories are unordered
Model Tuning | Bias
An error from flawed assumptions in the algorithm: a real discrepancy between the true model and the estimates. Too much can cause an algorithm to miss important relationships between features and target outputs, resulting in underfitting. Solutions: 1. Try new features 2. Decrease the degree of regularization
Model Tuning | Variance
An error from sensitivity to small variations in the training data. Too much can cause an algorithm to model random noise in the training set, resulting in overfitting. Problem: the model is too complicated. Solutions: 1. Increase the training data 2. Decrease the number of features
CLF
An estimator instance, known as a classifier. Used to store trained model values, which are then used to predict values based on previously stored weights.
Domain Knowledge
An understanding of the domain which leads to better features, better debugging, better metrics, etc. Always get help from someone with this
Model Tuning
Analyze the model for generalization quality and sources of underperformance, such as overfitting, comparing candidate models.
Stratified Sampling
Applies random sampling to each subpopulation separately
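A minimal sketch using scikit-learn's stratified split (dataset choice is illustrative):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    print(len(y_tr), len(y_te))   # each split keeps roughly the same class proportions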
Reinforcement Learning
Models learn by applying a penalty or reward
Feature Engineering
Creating novel features to use as inputs for ML models, using domain and data knowledge. It's an art more than a science, grounded in the business problem itself. The rule of thumb is to use intuition garnered from human intelligence. scikit-learn: the sklearn.feature_extraction library, with specific data formats
Sampling and Treatment Assignment
Data Collection: Labeling video chart @ 1:41
Amazon ML Solutions Lab
Developing machine learning skills through collaboration and education with ML experts, to help identify and build ML solutions that address business problems:
- Brainstorming
- Custom modeling
- Training
- On site with Amazon experts
Error Analysis (EA) | Residual Analysis
The difference between the actual response and the forecast value. Note: can be used to see what the model is predicting correctly; look for patterns to create a better forecasting model.
Bag-of -words model
Does not keep track of the sequence of words; counts the number of words in the particular observation. Represents a document as a vector of numbers, one for each word (tokenize, count, normalize). Note: a sparse matrix implementation is typically used, and the relative position of words is ignored. Can be extended to a bag of n-grams of words or characters.
Validation Data Set
Evaluates the model performance during debugging and tuning. A sample of data used to give an unbiased evaluation of the model, helping you see where to tweak parameters.
Managing Labelers | Auditors
Experts that check labelers' work
One-hot encoding (nominals)
Explode nominal attributes into many binary attributes, one for each discrete (categorical) value. sklearn.preprocessing.OneHotEncoder, pandas.get_dummies()
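A minimal sketch with pandas on made-up data:

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "blue", "red"]})
    print(pd.get_dummies(df["color"]))   # one binary column per value: blue, red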
EDA
Exploratory Data Analysis: when you get to know the data and determine whether you have enough. Informs the next steps in the ML process.
