Introduction to Data Science

Advanced Imputation | MICE

Multiple Imputation by Chained Equations. sklearn.impute.MICEImputer (v0.20; shipped in later releases as the experimental sklearn.impute.IterativeImputer). More in Python (not sklearn): the fancyimpute package includes: -KNN [k-nearest-neighbor] imputation -SoftImpute -MICE and more
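A minimal sketch, assuming a recent scikit-learn where the MICE-style estimator ships experimentally as IterativeImputer:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # required to enable the experimental API
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0]])
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))  # missing entries filled with chained-equation estimates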

Data Collection | Labeling

Obtaining gold-standard answers for supervised learning

Supervised Learning | Univariate Regression

Simplest linear regression model: -Models the relation between a single feature (explanatory variable x) and a real-valued response (target variable y) -Given data (x, y) and a line defined by the intercept and slope, the vertical offset of each point from the line is the error between the true label y and the prediction based on x -The best line minimizes the Sum of Squared Errors (SSE) -We usually assume the error is Gaussian distributed with mean zero and fixed variance
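A minimal sketch of fitting the SSE-minimizing line with scikit-learn (toy data assumed):

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0]])  # single explanatory variable
y = np.array([2.1, 3.9, 6.2, 8.1])          # real-valued target
model = LinearRegression().fit(x, y)
errors = y - model.predict(x)               # vertical offsets from the fitted line
print(model.intercept_, model.coef_[0], np.sum(errors ** 2))  # intercept, slope, SSE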

Supervised Learning | Recurrent Neural Network (RNN)

A neural network with recurrent connections, applied to sequential inputs. For time-series data, NLP, or translation applications, where the sequence of the data means something, this is the way to go.

Supervised Learning | Convolutional Neural Networks

A neural network architecture useful for image analysis. The input is an image or a sequence of images, which is weighted using kernels as filters to extract local features. Pooling reduces the size with max or average pooling, a dimension-reduction process.

Semi-Supervised Learning

Some of the training data has labels and the rest does not.

Domain Knowledge | ML Workflow

What needs to happen with the data to solve a business problem. The first step is transforming the business problem into an ML problem; then get the data... (ref. video for flow chart)

To prevent encoding too many classes:

When using OneHotEncoder: define a hierarchy structure, or group the levels by similarity, to reduce the overall number of groups.

Find the: Robust scaling

sklearn.preprocessing.RobustScaler

Underfitting

-Failure to capture important patterns in the training set -Typically indicates the model is too simple or there are too few explanatory variables -Not flexible enough; corresponds to high bias: the results show a systematic lack of fit in certain regions

Supervised Learning | Multivariate Regression

-Multiple linear regression includes N explanatory variables, with N >= 2 -Sensitive to correlation between features, resulting in high variance of coefficients -scikit-learn implementation: sklearn.linear_model.LinearRegression
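A minimal sketch with N = 2 explanatory variables (toy data assumed):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 0.5], [2.0, 1.1], [3.0, 1.4], [4.0, 2.2]])  # two features per example
y = np.array([3.0, 5.1, 6.9, 9.2])
model = LinearRegression().fit(X, y)
print(model.coef_)  # one coefficient per feature; unstable when features are highly correlated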

Supervised Learning | Linear Regression

-Parametric methods where the learned function has a fixed form (ref. fx) -Optimized by learning weights, applying (stochastic) gradient descent to minimize the loss function (ref. fx) -A good place to start with a new problem Methods: *Linear regression for a numeric target outcome *Logistic regression for a categorical target outcome

Data Quality

-Consistency of the data -Accuracy of the data -Is it what we need to solve the business problem? -Noisy data -Missing data -Outliers in the data -Bias -Variance

Neural Networks

-Layers of nodes connected together -Each node is one multivariate linear function, with a univariate nonlinear transformation -Trained via (stochastic) gradient descent -Can represent any non-linear function (very expressive)

Scaling Transformation | sklearn

-Mean/variance standardization | sklearn.preprocessing.StandardScaler -MinMax scaling -MaxAbs scaling -Robust scaling (applied per column) -Normalizer (applied per row)
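A minimal sketch contrasting two of the scalers listed above (toy data assumed):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 900.0]])
print(StandardScaler().fit_transform(X))  # per column: subtract mean, divide by std
print(MinMaxScaler().fit_transform(X))    # per column: rescale to [0, 1]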

Considerations when Feature Engineering

-Numerical feature transformations of attributes -Combinations of attributes to understand the relationships between them

Amazon Mechanical Turk

-Obtain human intelligence on demand -Access a global, on-demand, 24/7 workforce -Pay only for what you use -Use for labeling -Human annotation

Model Tuning | Hyperparameter

1. An external configuration whose value cannot be estimated from the data. 2. An estimator parameter that is not fitted to the data.

Before dropping or imputing (replacing) missing values, ask:

1. What were the mechanisms that caused the missing values? People may not have wanted to answer the questions; the data may be missing on purpose. 2. Are these values missing at random? 3. Are there rows or columns missing that I am not aware of?

Seaborn

A Python data visualization library based on matplotlib

Supervised Learning | Decision Trees

A decision-making process similar to the mental models made by human beings. One input informs the next decision until a binary answer is reached. Susceptible to overfitting: trees must be pruned, or use an ensemble method. sklearn.tree.DecisionTreeClassifier
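A minimal sketch; limiting max_depth is one simple stand-in for pruning:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)  # shallow tree to curb overfitting
print(clf.score(X, y))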

AWS Professional Services

A global team of experts that can help you realize your desired business outcomes when using the AWS Cloud.

Solutions Architects (SAs)

A machine learning specialist is the Subject Matter Expert (SME) for designing machine learning solutions that leverage AWS services to automate solutions and drive down costs for customers.

Neural Networks | Perceptron

A single-layer neural network that uses a list of input features.

Imputation

A technique that replaces a missing value with an estimated value, such as the attribute's: -Mean -Median -Most frequent value (for categoricals) -or any other estimated value from sklearn.preprocessing import Imputer
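A minimal sketch using SimpleImputer, the modern sklearn.impute replacement for the Imputer class named above:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan]])
imp = SimpleImputer(strategy="mean")  # also: "median", "most_frequent", "constant"
print(imp.fit_transform(X))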

Supervised Learning | K-Nearest Neighbors

A way to figure out the response for a new observation based on how close it is to the training dataset. 1. Define the distance: -Euclidean distance -Manhattan distance -any vector norm 2. Choose the number of neighbors, k 3. Find the k nearest neighbors of the new observation we want to classify Essentially, the k nearest neighbors vote to classify the new observation.
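A minimal sketch of the vote on toy data:

from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9]]
y = [0, 0, 0, 1, 1]
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X, y)
print(knn.predict([[1.5, 1.5]]))  # the 3 nearest neighbors vote: class 0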

Model Training

Builds the model. Improve the model by optimizing parameters or data.

Filtering Sound

Choose the relevant features and remove frequencies whose power is below a threshold, as in voice analysis.

True or False | When feature engineering, you should not try generating many features first, then apply dimensionality reduction if needed

False; when feature engineering you should try generating as many features as possible first, then apply dimensionality reduction if needed.

Error Analysis (EA)

Filter for failed predictions and manually look for patterns. This helps you pivot on target, key attributes, and failure type, and build histograms of error counts.

Supervised Learning | Logistic Regression

Forecasts a binary outcome (yes or no, true or false) based on the data. Finds the best weight vector by fitting the training data.
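A minimal sketch on toy data:

from sklearn.linear_model import LogisticRegression

X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]
clf = LogisticRegression().fit(X, y)  # learns the weight vector
print(clf.predict_proba([[2.0]]))     # sigmoid of the linear score gives the probability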

Amazon SageMaker

A fully managed platform built for data pre-processing, feature engineering, model fitting, model evaluation, and model deployment, all in one.

Confidence Interval

Go with the larger sample size. This is based on the width of the interval, not the percentage of the metric.

Managing Labelers | Gold Standard HITs

HITs with known labels mixed in to identify problematic labelers

HITs

Human Intelligence Tasks, used when labeling. See Amazon Mechanical Turk.

Data Sampling | Leakage

Information used in training and validation but not available in production

Data Labeling | Human

Labeling guidelines need to be VERY clear. Human Intelligence Tasks (HITs)

When preprocessing data, what are the risks of dropping rows?

Losing too much data: -Overfitting, wider confidence intervals, etc. -May bias the sample

When preprocessing data, what are the risks of dropping columns?

May lose information in features (underfitting)

Correlation Matrix

Measures the linear dependence between features; can be visualized with heat maps.
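A minimal sketch with pandas and seaborn (toy columns assumed):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]})
sns.heatmap(df.corr(), annot=True)  # Pearson (linear) correlation between features
plt.show()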

Unsupervised Learning

Models learn from data that has not been labeled. We only have a collection of features.

Feature Scaling

Motivation: the ranges of features are on dramatically different scales. Example: gradient descent and kNN [k-nearest neighbor]. Solution: align all features onto the same scale.

Learning Curves

Plot training-dataset and validation-dataset error or accuracy against training set size. Motivation: detect whether the model is underfitting or overfitting, and the impact of training data size. sklearn.model_selection.learning_curve
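A minimal sketch using the sklearn.model_selection path:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))
print(sizes)                      # training-set sizes tried
print(train_scores.mean(axis=1))  # training accuracy per size
print(val_scores.mean(axis=1))    # validation accuracy per size (a large gap suggests overfitting)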

Supervised Learning | Non-Linear Support Vector Machines

Popular in research, not in industry. -Kernelize for nonlinear problems: 1. Choose a distance function called a "kernel" 2. Map the learning task to a higher-dimensional space 3. Apply a linear SVM classifier in the new space -Not memory-efficient, because it stores the support vectors, which grow with the size of the training data -Computation is expensive sklearn.svm.SVC
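A minimal sketch of a kernelized SVM on XOR-like (non-linearly-separable) toy data:

from sklearn.svm import SVC

X = [[0, 0], [1, 1], [1, 0], [0, 1]]
y = [0, 0, 1, 1]                              # not separable by a line
clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(clf.support_vectors_)                   # stored support vectors grow with training data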

Text-based Features | Count Vectorizer

Per-word value is the count (also called term frequency). *Includes lowercasing and tokenization on whitespace and punctuation. sklearn.feature_extraction.text.CountVectorizer
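A minimal sketch, assuming a recent scikit-learn (for get_feature_names_out) and toy documents:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat.", "The cat and the hat."]
vec = CountVectorizer()             # lowercases, tokenizes on whitespace/punctuation
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())  # vocabulary
print(X.toarray())                  # per-document term counts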

Data Science

Processes and systems to extract knowledge or insights from data, either structured or unstructured.

Supervised Learning | Decision Trees | Entropy

Relative measure of disorder in the data source
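For reference, the standard Shannon entropy this card alludes to, where p_i is the fraction of examples in class i:

H(S) = -\sum_{i} p_i \log_2 p_i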

Supervised Learning | Logistic Regression | Sigmoid Curve

Represents the probability (ref. fx)
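The "(ref. fx)" is presumably the standard logistic sigmoid, which maps the linear score z = w^T x to a probability:

\sigma(z) = \frac{1}{1 + e^{-z}}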

Random Sampling

Sampling so that each source data point has equal probability of being selected

Filtering Images

Select the relevant features and remove the channels from an image if color is not important.

Supervised Learning | Random Forests

A set of decision trees, each learned from a different randomly sampled subset (with replacement). sklearn.ensemble.RandomForestClassifier
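A minimal sketch:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# each tree is trained on a different bootstrap sample (sampling with replacement)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.score(X, y))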

Data Collection | Sampling

Subsets of instances (examples/data points) for training and testing

The Model Training Process

The execution of an algorithm against a data set, after you've applied some parameters

Model Tuning | Regularization Strength

The larger the alpha, the larger the penalty score.

Data Collection:

The process of acquiring training and/or test data

Radial Basis Function

Transforms the original data based on its distance from a center. Used in Support Vector Machines as a kernel and in Radial Basis Neural Networks (RBNNs). The Gaussian RBF is the most common RBF used.
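For reference, the Gaussian RBF mentioned above, centered at c with width sigma:

\phi(x) = \exp\left(-\frac{\lVert x - c \rVert^2}{2\sigma^2}\right)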

Supervised Learning

Models learn from training data that has been labeled with the true labels. Example: a teacher instructing a student. -Regression -Classification

True or False | Integer-encoding nominals is wrong

True, because the ordering and size of the integers are meaningless

True or False | Many learning algorithms can't handle missing values

True; when preprocessing data, many algorithms cannot handle missing values.

Problem Formulation

Turning a business problem into an ML problem that can be solved with an algorithm. Ask GREAT questions! -What is the problem that needs to be solved? -What is the business metric? -Is ML the right approach? -What data is available? -What type of ML problem is it? (You may use a sequence of algorithms for the problem.) -What are my goals?

Supervised Learning | Linear Support Vector Machines

Simplest case: two features, and two classes that are linearly separable, i.e., can be divided by a line. A popular approach in research, but not in industry. -Maximize the margin: the distance between the decision boundary (hyperplane) and the support vectors (the training examples closest to the boundary) -The max-margin picture is not applicable in the non-separable case sklearn.svm.SVC

Model Quality

Whether the model is underfitting or overfitting, and whether it is learning from the correct amount of clean, organized data.

Computation Speed and Scalability

Use SageMaker and EC2 instances for training in order to: -increase speed -solve prediction-time complexity -solve space complexity

How do you drop rows with NULL values in pandas?

Use the dropna() rules. More complicated rules include: df.dropna(how='all') df.dropna(thresh=4) df.dropna(subset=['Fruits']) (substitute the appropriate axis/columns with NULL values)
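A minimal sketch of these variants on a toy DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({"Fruits": ["apple", None, "pear"], "Qty": [1, np.nan, 3]})
print(df.dropna())                   # drop rows containing any NULL
print(df.dropna(how="all"))          # drop rows where every value is NULL
print(df.dropna(thresh=2))           # keep rows with at least 2 non-NULL values
print(df.dropna(subset=["Fruits"]))  # only consider NULLs in the Fruits column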

Machine Learning

Using a dataset to estimate the underlying function that best generalizes, or predicts, LABELS for new examples

Data Sampling | Train/test bleed

Overlap of training and test data when sampling to create datasets. (See reference for ways to avoid.)

Find the: Maxabs scaling

sklearn.preprocessing.MaxAbsScaler Advantages: doesn't destroy sparsity, because we are just scaling it

Text-based Features | Hashing Vectorizer

sklearn.feature_extraction.text.HashingVectorizer A stateless mapper from text to term indices. More efficient when tons of data would otherwise need to be cached.

Text-based Features | TfidfVectorizer

sklearn.feature_extraction.text.TfidfVectorizer Term Frequency times Inverse Document Frequency. The per-word value is downweighted for terms common across documents (e.g., "the").
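A minimal sketch showing the downweighting, assuming a recent scikit-learn and toy documents:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog ran", "the cat ran"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())
print(X.toarray())  # "the" occurs in every document, so it gets the lowest weight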

Find the: Mean/Variance standardization

sklearn.preprocessing.StandardScaler Advantages: -Many algorithms behave better with smaller values -Keeps outlier information, but reduces its impact

Find the: MinMax scaling

sklearn.preprocessing.MinMaxScaler Advantages: -Robust in cases with small standard deviations

Find the: Normalizer

sklearn.preprocessing.Normalizer Applied to each row, as opposed to columns. Used in text analysis.

Polynomial Transformation

sklearn.preprocessing.PolynomialFeatures For when you have multiple numeric values. Example: transforming x -> x [squared or cubed] Considerations: -Beware of overfitting if the degree is too high -Risk of extrapolation beyond the range of the data when using polynomial transformation -Consider non-polynomial (non-linear) transformations, like log and sigmoid transformations, as well
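A minimal sketch:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0], [3.0]])
poly = PolynomialFeatures(degree=3, include_bias=False)
print(poly.fit_transform(X))  # columns: x, x^2, x^3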

Sampling Representativity

An unbiased representation of the expected production population. Especially important for testing and measurement sets. It is also important for training sets, to get good generalization.

AWS Domain Assistance Services

-AWS ML specialist SAs -AWS Professional Services -AWS ML Solutions Lab -AWS Partner Network (3rd parties)

Overfitting

-Model performs well on the training set but poorly on the test set -Indicates that the model is too flexible for the amount of training data, allowing it to "memorize" the "noise" -Corresponds to high variance: small changes in the training data lead to big changes in the results

Detailed | K-Nearest Neighbors

-Non-parametric, instance-based, or lazy learning (memorizes the training data) -Requires keeping the original data set -Space complexity and prediction-time complexity grow with the size of the training data -Suffers from the curse of dimensionality: points become increasingly isolated with more dimensions, for a fixed-size training dataset sklearn.neighbors.KNeighborsClassifier

Supervised Learning | Neural Networks

(video @ 4:59) sklearn.neural_network.MLPClassifier
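A minimal sketch (the small hidden layer is chosen arbitrarily for illustration):

from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)
print(clf.score(X, y))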

Data Labeling | Tools

Amazon Mechanical Turk

Managing Labelers | Plurality

Assign each HIT to multiple labelers to identify difficult or ambiguous cases, or problematic labelers

The Map Function in Pandas

Can transform categorical values into numerical values. Or use sklearn.preprocessing.LabelEncoder for labels.
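A minimal sketch of both routes (toy values assumed):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(["low", "high", "medium", "low"])
print(s.map({"low": 0, "medium": 1, "high": 2}))  # pandas map for ordinal features

le = LabelEncoder()
print(le.fit_transform(["cat", "dog", "cat"]))    # LabelEncoder for target labels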

Testing Data

Evaluates how well the final model generalizes to unseen data.

Model Tuning | Regularization

Adding a penalty term/score for complexity to the cost function. Overfitting error can be reduced using this: a technique that helps evenly distribute weights among features. Two standard types used in regression problems: L1 regularization, Lasso: sklearn.linear_model.Lasso L2 regularization, Ridge: sklearn.linear_model.Ridge Linear regression with both: Elastic Net regression: sklearn.linear_model.ElasticNet
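A minimal sketch of the three estimators on toy data:

from sklearn.linear_model import Lasso, Ridge, ElasticNet

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.1, 1.1, 1.9, 3.2]
for model in (Lasso(alpha=0.1), Ridge(alpha=0.1), ElasticNet(alpha=0.1)):
    print(type(model).__name__, model.fit(X, y).coef_)  # larger alpha shrinks coefficients more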

Data Preprocessing | Encoding Categorical Variables

Also called discrete. Special treatment is needed: with more than two categories (binary is OK), integer encoding implies a relationship between the categories, so the solution can be wrong. Different types: -Ordinal values: organized and ordered // use the map function in pandas -Nominal values: categories are unordered

Model Tuning | Bias

An error from flawed assumptions in the algorithm: a real discrepancy between the true model and the estimates. Too much can cause an algorithm to miss important relationships between features and target outputs, resulting in underfitting. Solutions: 1. Try new features 2. Decrease the degree of regularization

Model Tuning | Variance

An error from sensitivity to small variations in the training data. Too much can cause an algorithm to model random noise in the training set, resulting in overfitting. Problem: the model is too complicated. Solutions: 1. Increase training data 2. Decrease the number of features

CLF

An estimator instance, known as a classifier. Used to store trained model values, which are further used to predict values based on previously stored weights.

Domain Knowledge

An understanding of the domain, which leads to better features, better debugging, better metrics, etc. Always get help from someone with domain knowledge.

Model Tuning

Analyze the model for generalization quality and sources of underperformance, such as overfitting, by comparing models.

Stratified Sampling

Applies random sampling to each subpopulation separately

Reinforcement Learning

Apply penalty or reward

Feature Engineering

Creating novel features to use as inputs for ML models, using domain and data knowledge. It's an art more than a science, based on the business problem itself. The rule of thumb is to use intuition garnered from human intelligence. scikit-learn: the sklearn.feature_extraction library, with modules for specific data formats

Sampling and Treatment Assignment

Data Collection: Labeling (video chart @ 1:41)

Amazon ML Solutions Lab

Developing machine learning skills through collaboration and education with ML experts, to help identify and build ML solutions that address business problems: -Brainstorming -Custom modeling -Training -On-site with Amazon experts

Error Analysis (EA) | Residual Analysis

The difference between the actual response and the forecast value. Note: can also be used to see what the model is predicting correctly. Look for patterns to create a better forecasting model.

Bag-of-words model

Does not keep track of the sequence of words; counts the number of words in the particular observation. Represents a document as a vector of numbers, one for each word (tokenize, count, normalize). Note: a sparse-matrix implementation is typically used, ignoring the relative positions of words. Can be extended to a bag of n-grams of words or of characters.

Validation Data Set

Evaluates model performance during debugging and tuning. A sample of data used to give an unbiased evaluation of the model, helping you see where to tweak parameters.

Managing Labelers | Auditors

Experts who check labelers' work

One-hot encoding (nominals)

Explodes nominal attributes into many binary attributes, one for each discrete (categorical) value. sklearn.preprocessing.OneHotEncoder pandas.get_dummies()
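A minimal sketch of both routes (toy column assumed):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "green", "red", "blue"]})
print(pd.get_dummies(df["color"]))                 # pandas route
enc = OneHotEncoder()
print(enc.fit_transform(df[["color"]]).toarray())  # sklearn route (sparse by default)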

EDA

Exploratory Data Analysis: when you get to know the data and determine whether you have enough. Informs the next steps in the ML process.

