Machine Learning CP 1
When is a Stochastic Gradient Descent Preferable to a Batch Gradient Descent?
When the cost function is very irregular: the randomness of SGD can help it jump out of local minima where batch gradient descent would get stuck.
What is the difference between min-max scaling (normalization) and standardization
Standardization is z-score scaling: the mean becomes 0 and the standard deviation becomes 1, and values are not bounded to any range. Standardization is much less affected by outliers; min-max scaling (normalization) squeezes values into a fixed range, typically 0 to 1, and might be better if the data isn't bell shaped.
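A minimal sklearn sketch of both scalers (the toy data is made up for illustration):
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np
data = np.array([[1.0], [2.0], [3.0], [100.0]])  # 100 is an outlier
print(StandardScaler().fit_transform(data))  # zero mean, unit variance
print(MinMaxScaler().fit_transform(data))    # all values rescaled into [0, 1]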
How does a confusion matrix help with classification?
A confusion matrix counts the number of times instances of class A are classified as class B
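A quick sklearn sketch with made-up labels:
from sklearn.metrics import confusion_matrix
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
# rows are the actual classes, columns are the predicted classes
print(confusion_matrix(y_true, y_pred))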
Elastic Net
A middle ground between lasso and ridge regression: the regularization term mixes both penalties, and you control both the mix ratio and the overall regularization strength.
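A small sketch of the two knobs in sklearn (the toy data and parameter values are made up):
import numpy as np
from sklearn.linear_model import ElasticNet
X = np.random.rand(100, 3)
y = X @ np.array([2.0, 0.0, -1.0]) + 0.1 * np.random.randn(100)
# alpha sets the overall regularization strength;
# l1_ratio mixes the lasso (1.0) and ridge (0.0) penalties
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X, y)
print(elastic.coef_)  # the useless middle feature is pushed toward 0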
Your data is underfitting. How do you fix this?
Use a more powerful model, feed it better features, or reduce the regularization constraints; adding more training data won't fix underfitting.
What can SVM do?
Linear and nonlinear classification, regression, and outlier detection.
What is a hyper parameter
A parameter of the learning algorithm itself rather than of the model; it is set before training and stays constant during training, e.g. the amount of regularization to apply.
What is unsupervised data mining?
P-hacking: digging through data for patterns without a prior hypothesis until something looks significant.
Make a series with correlation between objects
Step one: make a matrix of all pairwise correlations: corr_matrix = dataframe.corr() Then tease out the column you want: corr_matrix['column']
What is the F1 score?
The F1 measure is the harmonic mean of the precision and recall scores (a mean that gives far more weight to low values). Also called the f-measure or the f-score, the F1 score is calculated using the following formula: F1 = 2PR / (P + R) The F1 measure penalizes classifiers with imbalanced precision and recall scores, like the trivial classifier that always predicts the positive class. A model with perfect precision and recall scores will achieve an F1 score of one.
You're given ordered data, how should you treat it?
You need to shuffle ordered data (unless it's a time series): 1. K-fold validation folds will then contain every necessary class/category 2. You avoid long runs of the same or similar categories in a row
What's a great pandas tool to see correlation, linear and nonlinear
scatter_matrix
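For example (column names are made up):
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
df = pd.DataFrame({"a": np.arange(100), "b": np.arange(100) ** 2, "c": np.random.rand(100)})
# one scatter plot per pair of columns, with each column's histogram on the diagonal
scatter_matrix(df[["a", "b", "c"]], figsize=(8, 8))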
What is unsupervised learning good for?
- Clustering data into groups - potentially separating real reviews from fake reviews by feeding it review length, product price, and review frequency over time - Anomaly detection
What is stratified sampling?
- If you're interviewing 1,000 people, make sure that 513 are female and 487 are male, matching the proportions of the population you're trying to generalize to
In sklearn, you want to work with categories, how do?
1. Make a dataframe containing only the category columns (like df_categories) 2. Call LabelBinarizer on it:
encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)
What does a real data pipeline workflow look like?
1. Run a pipeline on your numbers 2. Run a pipeline on your categories 3. Run a FeatureUnion with the transformer_list of the two+ pipelines
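A rough sketch of that workflow (the column names are made up, and ColumnSelector is a hand-rolled helper, not a built-in sklearn class):
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

class ColumnSelector(BaseEstimator, TransformerMixin):
    # pulls the named columns out of a dataframe as a plain array
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.columns].values

df = pd.DataFrame({"rooms": [3.0, None, 5.0], "price": [100.0, 200.0, 150.0], "ocean": ["near", "far", "near"]})
num_pipeline = Pipeline([("select", ColumnSelector(["rooms", "price"])),
                         ("impute", SimpleImputer(strategy="median")),
                         ("scale", StandardScaler())])
cat_pipeline = Pipeline([("select", ColumnSelector(["ocean"])),
                         ("one_hot", OneHotEncoder())])
full_pipeline = FeatureUnion(transformer_list=[("num", num_pipeline), ("cat", cat_pipeline)])
prepared = full_pipeline.fit_transform(df)
Newer sklearn versions bundle the column selection and the combination step into ColumnTransformer, which replaces this pattern.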
What are the two worst parts of decision trees?
1. They are sensitive to small variations in the training data 2. They are sensitive to rotation of the training data because of their orthogonal decision boundaries
You want to create calculated fields to pass through to sklearn. How do you do this?
1. Use FunctionTransformer 2. Pass through the dataframe as an arg 3. Make the fields The reason we make this as a function is to make it extensible. We want to reuse this code as often as possible.
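A minimal sketch (the dataframe and the derived field are made up):
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({"total_rooms": [10, 20, 30], "households": [2, 4, 5]})

def add_calculated_fields(X):
    # derive a new field and append it as an extra column
    rooms_per_household = X["total_rooms"] / X["households"]
    return np.c_[X.values, rooms_per_household]

attr_adder = FunctionTransformer(add_calculated_fields, validate=False)
df_extra = attr_adder.fit_transform(df)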
What does it mean to be ε -insensitive
In SVM regression, adding more training instances within the margin does not affect the model's predictions, so the model is said to be ε-insensitive (ε controls the width of that margin).
What is reinforcement learning?
An agent observes an environment, performs actions, and gets rewards or penalties in return; over time it learns a policy (the strategy that maximizes reward).
Why is CART a greedy algorithm
CART, or Classification and Regression Tree, searches for the optimum split at the current level only. It picks the split that produces the purest subsets right away, rather than checking whether that choice leads to the best tree several levels down, so the result is reasonably good but not guaranteed to be optimal.
What is the difference between Closed form and GA
Closed form directly computes the model parameters that best fit the model to the training set. GA is an iterative approach that gradually tweaks the model parameters to minimize the cost function over the training set.
What is non representative training data?
Data that doesn't match the new data you want to make predictions on
You're training a shload of models. How do you save them for later analysis?
Either Python's pickle module or joblib, which is more efficient at serializing models containing large NumPy arrays.
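A quick joblib sketch (the file name is arbitrary):
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.random.rand(50, 2)
y = X.sum(axis=1)
model = LinearRegression().fit(X, y)

joblib.dump(model, "my_model.pkl")         # save the fitted model to disk
model_again = joblib.load("my_model.pkl")  # reload it later for analysis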
What is Bias and Variance Tradeoff?
Bias: wrong assumptions, like assuming the data is linear when it's actually quadratic; leads to underfitting the training data. Variance: the model is excessively sensitive to small variations in the training data and overfits it. Reducing one typically increases the other, hence the tradeoff.
Irreducible Error
Error due to the noisiness of the data itself; no model can remove it.
Association rule learning
Examining data to discover new and interesting relationships among attributes that can be stated as business rules.
You're using an SVM, which two kernels should you try first?
First: Linear Second: Gaussian RBF
Ridge Regression
Forces the model to keep the weights as small as possible. The regularization term should only be added to the cost function during training. More useful for noisy data.
What is a pure node in a decision tree (use the keyword, gini)
Gini measures impurity. If all the instances that reach a node in a decision tree belong to a single class, the node is pure and its Gini is 0.
What is logistic regression
Gives a probability that the instance belongs to a class
Hard margin vs soft margin classification
Hard margin classification only works if the data is linearly separable: it requires every instance to be off the street and on the correct side. Soft margin allows some margin violations in exchange for a wider street.
What is an optimal hyperplane?
In SVMs there are many ways to separate clusters of data points. The optimal hyperplane is the separating boundary that gives the maximum margin between the two clusters.
What is an epoch?
In gradient descent, one full pass over the entire training set.
What is online learning?
Data is fed to the model incrementally (individually or in small batches) so it keeps learning as new data arrives.
What is model based learning?
Learning that builds a model from the observed examples and then uses that model to make predictions on new cases (in contrast to instance-based learning, which compares new cases directly to known examples).
How do you effectively grid search a Gradient descent?
Set a large number of iterations but interrupt the algorithm once the norm of the gradient vector drops below a tolerance; at that point gradient descent has almost reached the minimum.
What is a kernel trick?
It lets an SVM get the same result as if you had added many polynomial (or other high-dimensional) features, without actually having to add them, so training stays tractable.
What is a dot product used for?
Among other things, it lets you compute the angle between two non-zero vectors: cos θ = (a · b) / (‖a‖ ‖b‖).
What is an NP-complete problem?
A problem for which no polynomial-time algorithm is known, so in practice every option effectively has to be tried; finding the optimal decision tree is one example, which is why CART settles for a greedy search.
What is the learning rate of a gradient descent?
It determines the size of each step, and therefore how many iterations it takes to reach the bottom of the cost function. Too small and it takes a very long time to converge. Too large and you could jump across the valley and end up higher than where you started.
What is instance based learning
The system learns the training examples by heart, then classifies a new data point by using a measure of similarity to decide whether it is close enough to the known examples.
What is a Decision Tree good for?
It helps find complex nonlinear relationships in data
What is ensemble learning?
It is a model built on top of many other models, e.g. a Random Forest trains many decision trees on random subsets of the data and averages their predictions.
What does logistic regression output?
It is commonly used for classification and gives a probability of something belonging to a class or category. Also used for predicting values based on known inputs.
What is a decision threshold?
It is just what it sounds like. At what percentage of certainty do I classify something?
How does a linear regression model get trained?
It finds the parameter values that minimize the RMSE (in practice the MSE) over the training set, either via a closed-form solution or gradient descent.
What is a cost function?
It measures how bad a model is.
What is a utility function?
It measures how good a model is.
What is a similarity function?
It measures how much an instance resembles a certain landmark. Adding features computed this way can make the data linearly separable, which helps an SVM.
What is an ROC curve?
Receiver Operating Characteristic curve: it plots the true positive rate (recall) against the false positive rate.
What is k-fold cross-validation?
It randomly splits the training data into k folds, then trains and evaluates the model k times, each time training on the other folds and evaluating on the held-out one, giving us k scores.
What is a standard scaler?
It standardizes data: it subtracts the mean and divides by the standard deviation, so every feature ends up with zero mean and unit variance.
What is dimensionality reduction?
It simplifies data without losing too much information. It will merge several correlated features into one.
Why is batch gradient descent slow?
It uses the whole batch of training data at every step
What is a stochastic gradient descent?
It's a gradient descent that picks a random instance at every step and computes the gradients based only on that instance. It's very fast but bounces around the minimum instead of settling on it exactly.
What is polynomial regression
It's a linear model trained on an extended set of features: the powers (and combinations) of each feature are added as new features, and a linear model is fit to that.
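A small sketch using sklearn's PolynomialFeatures (the quadratic toy data is made up):
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.random.rand(100, 1) * 6 - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + 0.1 * np.random.randn(100)

# add x^2 as a new feature, then fit a plain linear model on [x, x^2]
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
lin_reg = LinearRegression().fit(X_poly, y)
print(lin_reg.coef_, lin_reg.intercept_)  # roughly recovers 1, 0.5 and 2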
Tell me about the Predictor in the sklearn API
It's an estimator that can make predictions. It has the .predict() method and the .score() method to measure quality.
Tell me about the Estimators in a sklearn API
It's any object that can estimate parameters based on a dataset, like an imputer which learns the value to fill NAs with. Estimation is performed by the .fit() method.
How does a decision tree work?
It's basically a bunch of IF statements lol
What's a graph to see if a model is underfitting or overfitting, based on how large the training set size is? Why use this?
A learning curve. It plots the model's performance on the training set and the validation set as a function of training set size, so you can see how much adding more training data still helps and whether the model is over- or underfitting.
Lasso Regression
Least Absolute Shrinkage and Selection Operator Regression. Tends to completely eliminate the weights of the least important features; that's the key feature that differentiates it from ridge regression.
Why do you need feature scaling?
ML algorithms don't do well when some features have big numbers and others don't. You need to feature scale so that these dumb algos get it right.
You're about to do data exploration, what should you do?
Make a copy of the dataframe boi!!!
What do you want to use if your dataset has a lot of outliers and you're using regression?
Mean Absolute Error
What is a softmax regression?
Multiclass logistic regression: it generalizes logistic regression to support multiple classes directly, without training many binary classifiers.
In order to keep results and testing consistent, even when you refresh your data, your test set...
Needs to remain the same always. Keep a stable identifier (UUID or similar) for each instance and decide test-set membership from it, as in the sketch below.
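One common hash-based split (the id column is made up; any stable identifier works):
from zlib import crc32
import numpy as np
import pandas as pd

def in_test_set(identifier, test_ratio=0.2):
    # hash the id; if it lands in the bottom 20% of hash space, it's test data
    return crc32(str(identifier).encode()) & 0xffffffff < test_ratio * 2 ** 32

df = pd.DataFrame({"id": range(1000), "value": np.random.rand(1000)})
test_mask = df["id"].apply(in_test_set)
train_set, test_set = df.loc[~test_mask], df.loc[test_mask]
Because the decision depends only on the id's hash, refreshing the dataset never moves an old instance between train and test.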
What is batch learning?
Offline learning- the model is trained before put to production.
When creating categories, you often find that the arbitrarily assigned numbers for the categories wrongly influence machine learning algos. How do you fix this with sklearn?
One Hot Encoding. Rather than having one column with numerical categories, you have multiple with a Boolean field for each category. To do this, use LabelBinarizer directly
How do you make your model better?
One way is grid search, which tries every combination of the hyperparameter values you specify and cross-validates each one. In sklearn, import and use GridSearchCV.
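A short sketch (model, data, and parameter values are all made up):
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X = np.random.rand(200, 3)
y = np.random.rand(200)

# try every combination of these values, cross-validating each one
param_grid = {"n_estimators": [10, 30], "max_features": [2, 3]}
grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5, scoring="neg_mean_squared_error")
grid_search.fit(X, y)
print(grid_search.best_params_)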
When should you prefer an ROC curve to a PR curve?
PR = Precision/Recall curve. Prefer the PR curve when the positive class is rare or when you care more about false positives than false negatives (you don't want to classify things as positive incorrectly); prefer the ROC curve otherwise.
How can you find if you want to choose between RMSE and MAE (plus, what are they?)
RMSE = Root Mean Square Error MAE = Mean Absolute Error If your errors are roughly bell shaped (few outliers), use RMSE; if there are lots of outliers, MAE is more robust.
What is the standard performance measure for regression problems?
Root Mean Square Error
How do you measure effectiveness of an algo with RMSE
Root Mean Squared Error
# Get your predicted values based off of input
prediction_set = lin_reg.predict(cleaned_data)
# Compute the mean squared error against the known labels
linear_mse = mean_squared_error(answer_key, prediction_set)
# Take the square root to get the RMSE
linear_rmse = np.sqrt(linear_mse)
How do you see density in a scatter plot with overlapping values?
Set the alpha to 0.1
What's one way to make stochastic gradient descent better? What are the pitfalls of the method?
Simulated annealing: start with large steps (a high learning rate) and gradually reduce the rate according to a learning schedule. This can help the algorithm settle at the global minimum. Reduce the rate too fast and you can end up stuck in a local minimum, or frozen halfway down. Reduce it too slowly and you can jump around the minimum for too long.
What is early stopping?
Stop training as soon as the error on the validation set reaches its minimum and starts to climb again, i.e. when the model begins to overfit the training set.
How do you do stratified based sampling in sklearn?
StratifiedShuffleSplit
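For example (the income_cat column is made up; stratify on whatever column matters to you):
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

df = pd.DataFrame({"income_cat": np.random.randint(1, 6, size=1000), "value": np.random.rand(1000)})

# one split, 20% test, same income_cat proportions in both halves
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in split.split(df, df["income_cat"]):
    strat_train_set = df.iloc[train_idx]
    strat_test_set = df.iloc[test_idx]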
What is supervised learning?
Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).
You want to predict a target numeric value, or classify data, what kind of machine learning do you use?
Supervised learning.
Define SVM
Support Vector Machine
What is a sklearn data pipeline? Give an example
The idea is that in order to work with data you need to clean it in a certain order. Pipelines allow you to run transformations in a predefined sequence. An example is: 1. Run an imputer with a strategy to fill NAs 2. Run a FunctionTransformer to add calculated fields 3. Run a StandardScaler() to normalize or standardize the data
mypipeline = Pipeline([methods])
my_data = mypipeline.fit_transform(old_data)
After you run something from the sklearn API, like a imputer function, how do you see what it returned?
These are classes that learn stuff; the learned values are exposed as attributes ending in an underscore. Like: imputer.statistics_
What are support vectors?
They are the data points closest to the hyperplane that influence its position in an SVM algo.
Tell me about Transformers in the sklearn API
They are estimators that can change datasets, like an imputer. The transformation is performed by the .transform() method on the transformer itself, returning an ndarray.
Why are decision trees non-parametric models? Why can this be bad?
The number of parameters is not determined before training (unlike a linear regression, whose form is fixed by the algorithm), so the tree structure is free to adapt closely to the data. Left unconstrained, this can lead to overfitting.
What is recall?
Recall = True Positives / (True Positives + False Negatives) A lower value means you're missing instances that should have been classified as positive.
What is Precision?
Precision = True Positives / (True Positives + False Positives) A lower value means many of the instances you classify as positive actually aren't (you're over-classifying positives).
What is mini-batch gradient descent
Uses small random sets of instances called mini-batches. It's typically very fast because the matrix operations on each mini-batch can take advantage of hardware like a GPU.
What is unsupervised learning?
Unsupervised learning is the machine learning task of inferring a function to describe hidden structure from unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. This distinguishes unsupervised learning from supervised learning and reinforcement learning.
You want to test 1,000 hyperparameter methods in sklearn. How do?
Use RandomizedSearchCV
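A rough sketch (distributions and model are made up; n_iter controls how many combinations get sampled):
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X = np.random.rand(200, 3)
y = np.random.rand(200)

param_distribs = {"n_estimators": randint(1, 50), "max_features": randint(1, 4)}
# samples 1,000 random combinations instead of exhaustively trying a grid
rnd_search = RandomizedSearchCV(RandomForestRegressor(), param_distribs, n_iter=1000, cv=5, scoring="neg_mean_squared_error")
rnd_search.fit(X, y)
print(rnd_search.best_params_)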
You want to pass through a dataframe to a sklearn function, how do?
Write a function (or a small transformer) that takes the dataframe and the columns you want, and returns the underlying values as a NumPy array.
What's one way to ensure that your data is representative, given a column for salary? What pandas method should be used here?
You can break down your column into categories, and ensure that each category is represented as it should be. Make sure the categories are discrete. If there are a lot of outliers, let's say people between 100k-100,000,000, just assign those to one category. To do this in pandas, use pd.cut()
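A pd.cut sketch (the salary distribution and bin edges are made up):
import numpy as np
import pandas as pd

salaries = pd.Series(np.random.lognormal(mean=11, sigma=1, size=1000))

# bucket salaries into a few discrete categories; everything above the last
# edge (the big outliers) lands in the single top category
salary_cat = pd.cut(salaries, bins=[0, 30_000, 60_000, 100_000, np.inf], labels=[1, 2, 3, 4])
print(salary_cat.value_counts())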
What is the sklearn API for cross-validation?
cross_val_score
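For example (model and data are made up):
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X = np.random.rand(200, 3)
y = np.random.rand(200)

# 10-fold cross-validation; sklearn expects a utility (greater is better),
# so it returns negative MSE, which you flip back before taking the square root
scores = cross_val_score(DecisionTreeRegressor(), X, y, scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
print(rmse_scores.mean(), rmse_scores.std())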
How do you find the sklearn most important features in a grid search?
grid_search.best_estimator_.feature_importances_
Make a new linear regression machine learning model with sklearn
my_lin_reg = LinearRegression()
my_lin_reg.fit(cleaned_data_with_attributes, answer_key)
Make a simple prediction with sklearn after the data is cleaned
my_predictor_instance = ML_Model()  # placeholder for any sklearn model
my_predictor_instance.fit(cleaned_data, answer_key)  # must be fit before predicting
my_predictions = my_predictor_instance.predict(cleaned_data)
# Then calculate RMSE
# Then look at test data
Make a super simple bar chart based on a series
series.hist()
Can you use null values in a ML algo? What should you do?
Most ML algorithms can't handle nulls, but sklearn has an imputer (SimpleImputer in newer versions) to fill them. It only works on numerical data, so drop (or handle separately) the non-numerical columns and use it.
How do you split out training data in sklearn?
train_test_split()