Data Preprocessing and Feature Engineering


So we have Term Frequency and Document Frequency, how do we use those together to give us a measure of how important the word is to the document?

Term Frequency / Document Frequency, which is the same as Term Frequency * Inverse Document Frequency. Here we are just dividing the number of times the word occurs in the document by how often the word occurs across all documents. That gives us the relative importance of the word to the document.
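A minimal sketch of that calculation in plain Python (the tiny corpus and word choices are made-up illustration, and this uses the simple log(N/df) form of IDF; real libraries add smoothing):

```python
import math

# Toy corpus: each document is just a list of lowercase words (illustrative only)
corpus = [
    ["machine", "learning", "is", "fun", "machine"],
    ["cats", "are", "fun"],
    ["dogs", "are", "fun", "too"],
]

def tf_idf(word, doc, corpus):
    # Term Frequency: how often the word appears in this one document
    tf = doc.count(word)
    # Document Frequency: how many documents in the corpus contain the word
    df = sum(1 for d in corpus if word in d)
    # Inverse Document Frequency, taken as a log (see the later card on using the log)
    idf = math.log(len(corpus) / df)
    return tf * idf

print(tf_idf("machine", corpus[0], corpus))  # high: frequent here, rare in the corpus
print(tf_idf("fun", corpus[0], corpus))      # 0: the word appears in every document
```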

What are the most common methods for dealing with missing data?

1. Fill with the mean of the feature (not great)
2. Drop the samples with the missing data (not great)
3. Use ML to do the imputation (good!): find the K nearest (most similar) rows and average their values, use deep learning, or use regression (MICE - Multivariate Imputation by Chained Equations)
4. Just get more data!
Remember that filling missing data is an art, not a science. It will depend on the feature, the model, and the application in which the data will be used or served to clients.

What are some good ways you can use ML to impute your missing data?

1. KNN! Find the K nearest (most similar) rows in the data set and average their values to fill the missing value
2. Use a deep learning model to impute the missing value
3. Do a multivariate regression on the other features in your data set!
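A rough sketch of options 1 and 3 with scikit-learn (the toy array is made up; IterativeImputer is scikit-learn's MICE-style imputer and still ships behind an experimental import):

```python
import numpy as np
from sklearn.impute import KNNImputer
# IterativeImputer is scikit-learn's MICE-style imputer; it must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy numeric feature matrix with missing values (illustrative only)
X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.0],
    [0.9, np.nan, 2.9],
    [5.0, 6.0, 7.0],
])

# Option 1: average the values of the K most similar rows
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

# Option 3: regress each feature with missing values on the other features (MICE-style)
mice_filled = IterativeImputer(random_state=0).fit_transform(X)

print(knn_filled)
print(mice_filled)
```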

What are some good ways to deal with an unbalanced label? How should you solve the "needle in the haystack" problem?

1. Oversample the data in the minority class 2. Under-sample the data in the majority class

What are the key elements of data preprocessing and feature engineering?

1. Pruning unimportant features 2. Applying transforms and encodings to features 3. Dealing with missing data in a way that makes sense 4. Creating new, derived features to help the model learn

What services work with Random Cut Forest to remove outliers in your data?

1. SageMaker 2. Kinesis Analytics 3. QuickSight 4. more...

What are ways you can use pruning to improve model results?

1. Use domain knowledge to remove unimportant features such as an ID number 2. Use PCA to reduce the dimensionality of the data set 3. Use a clustering model such as K-Means to reduce the size of the data set
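A quick sketch of option 2 with scikit-learn's PCA (the random data and the choice of 2 components are just for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# 100 samples with 20 features (illustrative random data)
X = np.random.rand(100, 20)

# Project onto the 2 directions that explain the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # how much variance each component keeps
```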

Generally, who are the human labelers who work through Ground Truth?

1. You could use Mechanical Turk as a marketplace to find people 2. You could search for people locally and employ them 3. You could use current employees or interns 4. You could outsource to a professional labeling company

What is SageMaker Ground Truth?

An AWS Service which manages the process of human labeling of unlabelled data

What is the curse of dimensionality?

As you greatly increase the number of features, the solution which you want to converge on can become harder and harder to find because the data environment becomes very sparse. When the data environment is sparse, it's tough for a model to pick up on a signal which will allow it to converge

What's the definition of variance?

Average of the squared differences from the mean
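Written out in NumPy against that definition (toy numbers):

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Variance: average of the squared differences from the mean (sigma^2)
variance = np.mean((data - data.mean()) ** 2)

# Standard deviation: the square root of the variance (sigma)
std_dev = np.sqrt(variance)

print(variance, std_dev)            # 4.0 2.0
print(np.var(data), np.std(data))   # same values from NumPy directly
```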

Why would we use a power transform, a log transform, or a Box-Cox transform?

Because the model which we're using needs (expects) the data to be normal. If we feed the model data which follows a steep exponential curve, then it won't produce very good results
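A small sketch of what that looks like in practice, using scikit-learn's PowerTransformer with the Box-Cox method (the skewed sample data is made up, and Box-Cox requires strictly positive values):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Heavily right-skewed, strictly positive data (illustrative)
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=(1000, 1))

# Box-Cox finds a power transform that makes the data look more normal
pt = PowerTransformer(method="box-cox")
x_normalish = pt.fit_transform(x)

# A plain log transform is a simpler alternative with a similar effect
x_logged = np.log(x)

print(x.mean(), x_normalish.mean())  # the transformed data is roughly centered at 0
```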

If you want to deal with imprecise data (maybe the data collected is no longer perfectly accurate), how could you deal with that?

Bin your features or your label! If your features or label are no longer perfectly accurate, you could bin them so that the bins they fall into are still correct

Does SMOTE work by generating new samples of the data or by under-sampling the majority class? Or both?

Both! SMOTE will balance the data set by generating new samples in the minority class and then under-sample the majority class. This way the model will have a much more even exposure to the data

What is a model which does NOT need data to be scaled and normalized?

Decision trees/ tree based models

What is the document frequency measuring in TF-IDF?

Document Frequency is measuring how often that term (word) occurs in all of the documents in the corpus. If you look at how often "Machine Learning" comes up in ALL documents on Wikipedia, it probably won't come up very often.

What's always a good decision when you're dealing with missing data?

Get more data!! Go back to the DBAs and collect more data

What are some examples of cases where we could keep in outliers or remove them?

If you want to predict human internet behavior, you examine internet traffic, and you see one server's traffic is 10,000 times that of a normal human's desktop, you can probably remove that server because it's running an automated script. However, if you want to find the average income in the state of NY, you can't remove the billionaires just because they are outliers. Their income will move the mean by a lot, but it's important information to keep.

When dealing with outliers, what multiple of sigma should you choose? 2sigma? 3sigma?

It depends on your data! You should take a look at your data and remove outliers based on what makes sense

If you drop the rows in your data with the missing values, what should you look out for?

Look out for bias!! Dropping some of the rows in your data might add bias to your training data which would lead to a poorly performing model

If you have many outliers in your numerical feature, should you use mean imputation or median imputation?

Median imputation! Mean imputation will be skewed by outliers and might make your training/validation data untruthful or inaccurate.

Why is scaling and normalizing around 0 important?

Most models need data to be scaled and normalized; features with very different ranges can dominate the others and make training slower or less stable

Why is applying transforms important in data preprocessing?

Most models perform better when data coming in is normal

What types of models need data to be scaled/ normalized?

Neural networks

Is mean imputation generally a good method for dealing with missing data?

No! It can cause your model to perform worse, especially if you are missing lots of data for an important feature

Can you use mean imputation to fill NANs in your categorical features?

No! Mean imputation only works with numerical data

If you have to label very sensitive data, should you use Mechanical Turk?

No!! That would be a big security problem. If you have to label very sensitive data (for example medical or financial data) then you should use internal members of your own team to do the labeling

Is dropping the rows with NAs usually a good approach to dealing with missing values?

No, probably not. It is useful when a very small percentage of the data is missing, but it's generally not the best approach to dealing with missing data

Is under-sampling (removing rows in the majority class) usually a good idea?

Not usually. Removing data from your training set is usually not a good idea unless you want to avoid hardware or capacity problems associated with big data

How can you think of what TF-IDF is calculating?

Number of times a word appears in a document divided by the number of times a word appears everywhere

What should you remember to do if you are scaling your data before training?

Re-scale your predictions after inference! Remember that your predictions will be scaled, so you will need to re-scale them back to normal after inference has taken place
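A sketch of that round trip, assuming the target was scaled with scikit-learn's StandardScaler (the model and data are placeholders):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Illustrative data: y is in "real" units (e.g. dollars)
X = np.random.rand(200, 3)
y = 50_000 + 10_000 * X[:, 0] + np.random.randn(200) * 1_000

# Scale the target before training
y_scaler = StandardScaler()
y_scaled = y_scaler.fit_transform(y.reshape(-1, 1))

model = LinearRegression().fit(X, y_scaled)

# Predictions come back in scaled units...
preds_scaled = model.predict(X[:5])
# ...so remember to transform them back to the original units
preds = y_scaler.inverse_transform(preds_scaled)
print(preds.ravel())
```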

What is one method for dealing with an unbalanced dataset which is better than under-sampling and oversampling?

SMOTE! Synthetic Minority Over Sampling Technique! SMOTE will artificially generate new samples in the minority class using the KNN model. Similar to using KNN for imputation, SMOTE will use KNN to find the nearest (most similar) records and then average them together to create new records which will usually yield the same label. Those new records are then added to the training/validation sets to improve model performance
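A minimal sketch with the imbalanced-learn library (assuming it is installed; the synthetic imbalanced data set is just for illustration):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Illustrative imbalanced data: roughly 95% negative, 5% positive
X, y = make_classification(n_samples=1_000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# SMOTE uses nearest neighbours in the minority class to synthesize new samples
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))  # classes are now balanced
```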

Why is shuffling important?

Shuffling helps models learn. Leaving data un-shuffled can cause the model to learn the ordering of the training data as if it were a real pattern, leading to unrealistic build-ups in the learned parameters inside the channels of a network or other highly parameterized model.

Is standard deviation (sigma)? or (sigma^2)?

Sigma! Remember that the standard deviation is just the square root of the variance! So you just find the average of the squared differences from the mean in the data set to get the variance, then you take the square root of that value to get the standard deviation

Is variance (sigma)? or (sigma^2)?

Sigma^2 !

If you have a classification model and you want to adjust the resulting model because your training data was imbalanced, what could you do?

Simply adjust the threshold. Remember that adjusting the threshold just acts as a trade-off between false positives and false negatives. This matters for different applications - for example, for detecting junk mail you could move your threshold for positive predictions up, because you'd rather let a little spam through than flag legitimate mail.
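A sketch of moving the decision threshold on a scikit-learn classifier (the data, model, and the 0.7 cutoff are all illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(max_iter=1_000).fit(X, y)

# predict() uses a 0.5 threshold by default; work with the probabilities instead
proba_positive = clf.predict_proba(X)[:, 1]

# Raise the threshold to 0.7: fewer false positives, more false negatives
preds = (proba_positive >= 0.7).astype(int)
print(preds.sum(), "samples flagged positive at the 0.7 threshold")
```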

If you oversample your data by simply creating copies (clones) of the records which are in the minority class, would that really help a ML or deep learning model?

Sometimes, yes! Simple oversampling of records in the minority class can actually improve performance of neural networks

What is MICE in feature engineering?

Stands for Multivariate Imputation by Chained Equations. It's a very sophisticated method of filling missing data - usually a very good option to have on hand

What's another way to deal with categorical data?

Target encoding! Use information from the label as a way to encode categorical features
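One common way to do this with plain pandas is to encode each category as the mean of the label for that category (the toy DataFrame is made up; in practice you'd compute the means on the training split only, to avoid leakage):

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["NY", "NY", "LA", "LA", "SF"],
    "label": [1,    0,    1,    1,    0],
})

# Mean of the label per category, learned from the training data
city_means = df.groupby("city")["label"].mean()

# Replace each category with that mean
df["city_encoded"] = df["city"].map(city_means)
print(df)
```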

What is the term frequency part of TF-IDF doing?

Term Frequency just finds out how often a term (word) comes up in the document. That's it! It's just counting the number of occurrences of that word. If "Machine Learning" comes up 16 times in a single document, then that document is probably about machine learning.

What does TF-IDF stand for?

Term Frequency - Inverse Document Frequency. It's a way of finding out how relevant a term is in a given document

When calculating TF-IDF, should you use the raw Inverse Document Frequency, or the log of the document frequency?

The Log!! The document frequency usually has an exponential distribution, so using the log of the IDF is usually better than using the raw values for the IDF

What is a limitation of TF-IDF?

The words are unordered. There is usually no attention paid to bigrams or trigrams. However! Bigrams and trigrams can be added to the TF-IDF matrix for better results

If you have categorical data, what's the best way to fill missing values?

Train a deep learning model to fill the missing values in your data. Deep learning usually works very well to fill missing values in a categorical feature!

What is binning in data preprocessing? ("binning" or "bucketing")

Transforming your numerical data into ordinal data

If you have numerical data, what's the best way to fill missing values?

Use KNN to find the K Nearest (most similar) rows and find the mean of the closest rows. Then use that value to fill the missing value

How should you impute missing data in production?

Use ML to impute your missing data.

If you want to use linear or non-linear relationships between your feature with missing data and the rest of your features to impute missing values, what should you do?

Use a multivariate regression model! Use every feature as the independent variables and the feature with the missing data as the response! Filter for all of the rows WITHOUT missing data, then train your linear or non linear regression model (could be OLS or decision tree, etc.) on the training data, and use that trained model to predict the values which are missing! That's actually pretty clever!
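A rough sketch of that trick with pandas and scikit-learn (the column names and toy frame are made up):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative frame: "age" has missing values, the other columns are complete
df = pd.DataFrame({
    "income": [40, 55, 30, 80, 62, 45.0],
    "tenure": [2, 6, 1, 10, 7, 3.0],
    "age":    [25, 38, np.nan, 52, np.nan, 30.0],
})

known = df[df["age"].notna()]
missing = df[df["age"].isna()]

# Train on the rows WITHOUT missing data: other features -> feature being imputed
model = LinearRegression().fit(known[["income", "tenure"]], known["age"])

# Predict the missing values and write them back
df.loc[df["age"].isna(), "age"] = model.predict(missing[["income", "tenure"]])
print(df)
```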

How should you encode ordinal data?

Use integer encoding!

How should you encode categorical data?

Use one hot encoding!
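A small sketch covering both of the last two answers, using pandas and scikit-learn (toy values; the small/medium/large ordering is an assumption for the example):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "size":  ["small", "large", "medium", "small"],  # ordinal: has a natural order
    "color": ["red", "blue", "green", "red"],        # categorical: no order
})

# Integer-encode the ordinal feature, preserving the order small < medium < large
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = encoder.fit_transform(df[["size"]]).ravel()

# One-hot encode the categorical feature
df = pd.get_dummies(df, columns=["color"])
print(df)
```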

What are some good ways to remove outliers from your data?

Use standard deviation clipping! "Everything outside of 2 sigma will get clipped - or 3 sigma..." Be careful here! Just because something lies very far away from the mean in a standard normal distribution, doesn't make it unimportant. Use judgement to decide whether or not something is really an outlier worthy of removal
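A quick NumPy sketch of 2-sigma filtering (the data and the 2-sigma cutoff are illustrative; as the card says, eyeball the data before committing to a cutoff):

```python
import numpy as np

data = np.array([10, 11, 9, 10, 12, 10, 11, 300.0])  # 300 is a suspicious outlier

mean, sigma = data.mean(), data.std()

# Keep only the points within 2 standard deviations of the mean
mask = np.abs(data - mean) <= 2 * sigma
filtered = data[mask]
print(filtered)  # the 300 is dropped
```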

If you have a corpus of text and you want to create derived features - metadata about the corpus - which will then be used in a learned model, how could you create those derived features?

Using Amazon Comprehend!!

If you have one million images which need to be labelled so you can train a custom AlexNet, how could you label those?

Using SageMaker Ground Truth! Ground Truth is a service which lets you employ people to actually label your data and create labelled data sets

If you want to bucket your data so that there are the same number of elements in each bucket, how should you do that?

With quantile binning!

Should you make all of your corpus lower case when applying TF-IDF?

Yes!

If you need to create additional derived features or a label for your images or videos, could you just use Rekognition to recognize the images?

Yes! If you want to pull information from images or video and you only need to know simple things like "is there a human in this image?", then you don't need Ground Truth! You can just save some money and send your images or video through Rekognition and use the outputs from that as your derived feature or as your label.

Is there an AWS service designed to help identify and remove outliers?

Yes! AWS has developed the Random Cut Forest algorithm to help find and remove outliers.

If you need to label some corpus of text, could you use Amazon Comprehend instead of Ground Truth?

Yes!! If you need to save some money, you could pull insights from Amazon Comprehend instead of hiring people to categorize your text

Are there benefits to creating derived features which represent a feature squared and the same feature ^(0.5)?

Yes!! Adding a squared feature and a square root of that feature can allow a model to learn super linear and sub linear patterns from your data

Can SageMaker use Ground Truth to automatically label your data?

Yes!! Remember that SageMaker Ground Truth actually takes each sample as it's labelled to train a model which it uses to predict the new labels for unlabelled data. Eventually, Ground Truth can label the majority of the data and only send the data through to the humans if the model is relatively unsure about the label. (If the confidence is very low)

Is TF-IDF difficult to implement at scale?

Yes. Lots of computations involved when the number of words is very large

What is Quantile binning?

You bin each quantile of your numerical data so that each bin has the exact same number of samples
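A short pandas sketch of quantile binning versus fixed-width binning (toy values, 4 bins chosen arbitrarily):

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 27, 31, 35, 42, 58, 63, 71, 80, 95])

# Fixed-width binning: equal-sized ranges, possibly very uneven counts per bin
equal_width = pd.cut(ages, bins=4)

# Quantile binning: each bin gets the same number of samples
equal_count = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])

print(equal_count.value_counts())  # 3 samples in every bin
```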

How could you use TF-IDF in production?

You could pre-compute all of the values for each word and bigram in a whole corpus. Then, a user could enter a search word and then the relative importance of that word could be provided. This requires a lot of computation to be done up front!

