GCP - ML Engineer
ai platform training CLI flag for standard distributed training
--scale-tier STANDARD_1 (or PREMIUM_1 for a larger cluster); BASIC_GPU / BASIC_TPU are single-worker tiers with one GPU / TPU attached
ai platform training default worker configuration
--scale-tier BASIC (a single worker node)
ai platform training CLI flag for custom machine types?
--scale-tier CUSTOM then add in parameters as flags (--master-machine-type n1-highcpu-16, etc.) or config.yaml
AUTO_CLASS_WEIGHTS
-BQML CREATE MODEL option -use when you need to balance imbalanced classes
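Rough sketch of using it from Python via the BigQuery client (dataset/table/column names here are made up):
from google.cloud import bigquery

client = bigquery.Client()
sql = """
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['churned'],
  auto_class_weights = TRUE  -- weight classes inversely to their frequency
) AS
SELECT * FROM `my_dataset.churn_features`
"""
client.query(sql).result()  # run the CREATE MODEL job and wait for it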
Early Stopping
-Form of regularization -stop training when validation error begins to increase -rising validation error indicates overfitting is beginning
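A minimal Keras sketch of early stopping (toy data just to make it runnable):
import numpy as np
import tensorflow as tf

x = np.random.rand(1000, 10).astype("float32")
y = (x.sum(axis=1) > 5).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# stop as soon as validation loss stops improving (overfitting is starting)
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)
model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])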
L1 Regularization
-Goal: drive unimportant weights to exactly 0 -produces sparse weight vectors (helps when feature crosses create many sparse features) -useful for feature selection
L2 Regularization
-Goal: make weights close to 0 (not exactly 0)
Dropout
-Regularization for neural networks -randomly drops out neurons / unit activations for a single gradient step
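One sketch covering the three regularizers above (layer sizes and rates are arbitrary):
import tensorflow as tf

model = tf.keras.Sequential([
    # L1 pushes unimportant weights to exactly 0 (sparse weights)
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l1(0.01)),
    # Dropout randomly zeroes unit activations for each gradient step
    tf.keras.layers.Dropout(0.3),
    # L2 keeps weights small but nonzero
    tf.keras.layers.Dense(32, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(0.01)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])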
Embeddings
-allow a lower-dimensional dense representation of sparse features / feature crosses, which helps with sparsity
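Sketch: embed sparse integer ids (e.g. a feature cross hashed into buckets) into a small dense vector; the bucket count and dimensions here are made up:
import tensorflow as tf

embedding = tf.keras.layers.Embedding(input_dim=10_000, output_dim=8)
ids = tf.constant([[3, 17, 2048]])   # sparse integer ids for one example
dense = embedding(ids)               # shape (1, 3, 8): dense low-dim representation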
TFDV: purpose and parts
-analyze and validate that data is correct -components: statisticsgen, schemagen, examplevalidator
what is "placement" for recommendations AI?
-area on website where to locate the recommendation
parameterserver worker strategy
-asynchronous distributed training -some machines are workers and some are parameter servers -workers calculate gradients; parameter server updates the weights and passes that to the workers
Accuracy
-(TP + TN) / (TP + FP + TN + FN) -bad for an imbalanced class set
recommendations AI - what is rejoining?
-best practice: ensure product catalog is up-to-date and if you are importing catalog while recording events, you will need to rejoin on product ID -events that can't be associated w/ an ID are not used during training
Recommendations AI
-configure to set up A/B testing -integrate with Google Tag Manager to record events (like clicks, etc.) -integrate with Merchant Center to upload product catalog
rolling average
-dataprep preprocessing function -smooths out noise -preferred over daily min/max
how to enable continuous evaluation w/ ai platform prediction?
-establish ground truth as either yourself or use data labeling service -must already have a model version deployed -then you can run a daily evaluation job. this job will compare online prediction results by storing them in BQ and comparing to existing ground truth -you can then analyze evaluation metrics in console
Relu
-example of an activation function -used between hidden layers -outputs max(0, x): negative inputs are set to 0, positive inputs pass through unchanged
Clipping
-fix for outliers -handles extreme outliers by capping them at a chosen max value ex: if housing data shows a house w/ 500 rooms, clip the 500 down to the dataset max (such as 10)
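The rooms example in code (the cap of 10 is whatever max you pick):
import numpy as np

rooms = np.array([2, 3, 4, 500])                # 500 is an extreme outlier
clipped = np.clip(rooms, a_min=None, a_max=10)  # -> [ 2  3  4 10]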
when to use tf.data?
-if dataset can't fit in memory -if you need preprocessing -need access to different hardware/batches
3 ways to record events in Recommendations AI
-javascript pixel -API: eventStores.userEvents.write -google tag manager (creates a trigger that will fire whenever the event occurs)
When to use TPUs?
-large batches -sharded data -large models -use tf.data -DNNs built on tf.keras
what is tf.data?
-library for reading TFRecords as datasets -significantly reduces latency via prefetching: training runs on the accelerator while the CPU does transformations (reduces idle time)
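Sketch of a typical input pipeline (the GCS path and feature schema are hypothetical):
import tensorflow as tf

feature_spec = {
    "features": tf.io.FixedLenFeature([10], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse(record):
    example = tf.io.parse_single_example(record, feature_spec)
    return example["features"], example["label"]

dataset = (tf.data.TFRecordDataset(["gs://my-bucket/train-00000.tfrecord"])
           .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(10_000)
           .batch(256)
           .prefetch(tf.data.AUTOTUNE))  # CPU prepares the next batch while the accelerator trains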
Logistic Regression
-linear classifier -outputs the probability of something happening -good for low-latency serving
when to use parameterserver worker strategy? (3 reasons)
-low latency -want to continue if a machine crashes (such as using preemptible machines) -machines all have different performance
storage transfer service
-moves data to GCS (from S3, URL, or other GCS bucket) - data > 1 TB
When to use GPUs?
-need lots of parallelization -lots of math ops
in continuous training, if you have model drift, what should be done?
-need to retrain model, redeploy new model -retrigger whole CI/CD pipeline
what is schemagen?
-part of TFDV -takes raw data and infers schema -this is stored as metadata and used later in pipeline to ensure consistency (such as during tf transform)
what is examplevalidator?
-part of TFDV -validates data/schema to make sure data conforms (such as making sure it is an int, etc.) -also used by tf.transform to look for training/serving skew since it knows previous shape of data
statisticsgen
-part of TFDV -visual report/graphical distribution of data -can detect outliers, anomalies, skews, missing data
AUC ROC
-plots TPR vs. FPR -tells you the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example -good default
AUC PR
-plots precision vs. recall -use this if you care more about positive than negative class / dataset is imbalanced example: need to detect fraud
what happens during "trainer" phase in ci/cd?
-produces a serialized "SavedModel" that is stored in GCS -training code is written with Keras/Estimator
what is the data labeling service?
-provide dataset, instructions, list of labels -assigns humans to give labels to data -part of continuous evaluation strategy -assigns ground truth to data
how to optimize online prediction?
-scale out with GKE, GAE, CAIP prediction -make each prediction type its own microservice
SavedModel
-serialized model artifact -allows for model-agnostic deployment (CPU/TPU/GPU)
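Minimal export sketch (assuming TF 2.x; the export path is arbitrary):
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(3,))])

# write a serialized SavedModel directory; the same artifact can then be
# deployed to CPU/GPU/TPU serving or to AI Platform Prediction
tf.saved_model.save(model, "export/my_model")
reloaded = tf.saved_model.load("export/my_model")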
TFRecords
-stores each example as a serialized protocol buffer (tf.Example) rather than raw text -more efficient to read during training
central storage distributed training & when to use?
-synchronous -1 machine/worker attached to multiple GPUs -each GPU calculates gradients and sends them to the machine's CPU; the CPU updates the weights and sends them back to the GPUs for the next step -good for large embeddings that don't fit on a single GPU
multi-worker mirror strategy
-synchronous -multiple machines each with multiple GPUs
mirror strategy
-synchronous -one machine attached to multiple GPUs/TPUs -each GPU/TPU has a copy of the model and computes gradients on its slice of the batch -gradients are aggregated across devices (all-reduce) so every copy applies the same weight update -requires a fast connection between GPUs/TPUs
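Sketch with tf.distribute (for the multi-machine variant, MultiWorkerMirroredStrategy is the analogous call):
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()          # all GPUs on this machine
print("replicas:", strategy.num_replicas_in_sync)

with strategy.scope():                               # build model/optimizer under the strategy
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
# model.fit(...) now runs one synchronized (all-reduced) step across the local GPUs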
what happens if you do feature engineering during "train model"?
-this would mean using tf -this is not helpful if you need to compute averages or any aggregations over multiple inputs
what happens if you do feature engineering during "feature creation" with BQML?
-training/serving skew -you can fix this by adding TRANSFORM clause and putting all SELECT logic inside of it -this bakes it into the prediction graph
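Sketch of a TRANSFORM clause via the Python BigQuery client (table and column names are made up):
from google.cloud import bigquery

client = bigquery.Client()
sql = """
CREATE OR REPLACE MODEL `my_dataset.taxi_fare_model`
TRANSFORM (
  ML.QUANTILE_BUCKETIZE(trip_distance, 10) OVER () AS distance_bucket,
  EXTRACT(DAYOFWEEK FROM pickup_time) AS day_of_week,
  fare_amount
)
OPTIONS (model_type = 'linear_reg', input_label_cols = ['fare_amount']) AS
SELECT trip_distance, pickup_time, fare_amount FROM `my_dataset.taxi_trips`
"""
client.query(sql).result()  # the transforms are now baked into the prediction graph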
what happens if you do feature engineering during "feature creation" phase with beam?
-training/serving skew -you will need to run the same pipeline at prediction time to compute the same aggregations
what happens during transformation phase in ci/cd?
-transform data (such as string -> int, bucketizing, etc.) in dataflow -important to use tf.transform to reduce training/serving skew
3 ways to fix class imbalance
-upsampling (SMOTE) -downsampling -weighted classes (give more attention to minority class)
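Sketch of the weighted-classes option in Keras (toy data; the weight is just inverse class frequency):
import numpy as np
import tensorflow as tf

x = np.random.rand(1000, 10).astype("float32")
y = (np.random.rand(1000) < 0.05).astype("float32")   # ~5% positives

n_pos = y.sum()
n_neg = len(y) - n_pos
class_weight = {0: 1.0, 1: float(n_neg / n_pos)}       # minority class gets more attention

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(x, y, epochs=5, class_weight=class_weight)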
what happens during model evaluation phase in ci/cd?
-use TFMA in dataflow -can compare two models and see how performance differs -can also slice data by certain metrics (such as comparing dates or features)
phases of tf.transform
1. analysis -done during training -for numeric features this might be finding the min/max over the whole dataset; for categorical, finding all unique values -uses Beam 2. transform -applied per example at prediction time (and to each training example) -scales an individual input by the min/max for numerics; converts categoricals to one-hot encodings -uses TensorFlow
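Sketch of a preprocessing_fn (feature names are hypothetical): the analyzers run once over the full dataset in Beam during training, and the resulting ops are replayed per example at serving time.
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    outputs = {}
    # numeric: analysis finds min/max over the whole dataset, transform scales each value
    outputs["income_scaled"] = tft.scale_by_min_max(inputs["income"])
    # categorical: analysis builds the vocabulary, transform maps each value to an id
    outputs["city_id"] = tft.compute_and_apply_vocabulary(inputs["city"])
    outputs["label"] = inputs["label"]
    return outputs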
how does evaluation job work if using own groundtruth?
1. data labeling service creates an ai platform dataset w/ all of the new rows in BQ since the last run 2. you must have already added in groundtruth labels in the column BEFORE the evaluation job runs (evaluation job will skip any rows w/o groundtruth label) 3. data labeling service will then calculate evaluation metrics
how does evaluation job work if using data labeling service?
1. data labeling service creates an ai platform dataset w/ all of the new rows in BQ since the last run --> both input/output of model 2. data labeling service sends labeling request on this new data to generate groundtruth 3. data labeling service will calculate evaluation metrics for the day before it ran (so parallel evaluation jobs will always sample day before's data to ensure different samples)
steps to train a custom model
1. develop tf model/code 2. create dockerfile with model code 3. build the image 4. upload the image to GCR 5. start training job
Recommendations AI - model types
1. others you may like 2. frequently bought together 3. recommended for you 4. recently viewed -each has a default placement & optimization objective (such as CTR, revenue per order, conversion rate)
ai platform training steps
1. train locally 2. upload to gcs 3. submit to ai platform training to run on cloud
which products have explainability built-in?
AutoML & AI Platform *look for keyword "trust"
Regularization
Avoids overfitting; helps generalize
if customer can't move data outside of EDW for compliance what should they choose?
BQML
if customer wants model ASAP/cheapest, which should they choose?
BQML
Linear Regression: loss function
RMSE
Recall
TP / (TP + FN) -out of all actual positives, how many did the model catch? -if you want to minimize false negatives, then maximize recall
Precision
TP / (TP + FP) -out of all positive predictions, how many were actually positive? -if you want to minimize false positives, then maximize precision
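Tiny worked example of the two formulas above:
def precision(tp, fp):
    return tp / (tp + fp)   # of predicted positives, how many were right

def recall(tp, fn):
    return tp / (tp + fn)   # of actual positives, how many were caught

# e.g. 80 true positives, 20 false positives, 40 false negatives
print(precision(80, 20))    # 0.8
print(recall(80, 40))       # 0.666...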
How to fix sparsity?
Use L1 regularization
which performance metric to use if class is balanced and each class is equally important?
accuracy
how to optimize offline prediction?
add more machines
asynchronous distributed training
all workers are independently training over the input data and updating variables asynchronously
synchronous distributed training
all workers train over different slices of the input data in sync, aggregating gradients at each step
what is prefetching?
allows more efficient use of the CPU alongside an accelerator: while the CPU prepares batch 2, the GPU/TPU simultaneously trains on batch 1, so the accelerator doesn't sit idle waiting for data
main difference between automated ML pipeline and full CI/CD pipelining
automatically deploying the model via Cloud Build triggers vs. manually deploying new version
what happens under-the-hood for ai platform training
bayesian optimization for hyperparameter tuning
which distributed training service to use to optimize wall-time?
CentralStorage -> each GPU computes its gradients w/o waiting for the others
Logistic Regression: loss function
cross entropy/log loss
When to use tf.transform
during preprocessing
ai platform training - cloud job submit CLI
gcloud ai-platform jobs submit training $JOB_NAME \ --job-dir $OUTPUT_PATH --runtime-version 1.13 \ --module-name trainer.task --package-path trainer \ --region $REGION \ -- \ --train-files $TRAIN_DATA --eval-files $EVAL_DATA \ --num-epochs 1000 --learning-rate 0.01 (flags after the bare -- separator are passed through to the trainer code)
ai platform CLI for creating job with custom model
gcloud ai-platform jobs submit training my-job \ --region $REGION \ --master-image-uri gcr.io/my-project/my-repo:my-image \ -- --lr=0.01
ai platform training local CLI
gcloud ai-platform local train \ --module-name trainer.task --package-path trainer \ --job-dir $OUTPUT_DIR \ -- \ --train-files $TRAIN_DATA --eval-files $EVAL_DATA
how to send prediction input to ai platform prediction?
gcloud ai-platform predict --model $NAME \ --version $VERSION --json-instances='data.txt' where data.txt is newline-delimited JSON
CLI to use explainability in prediction call?
gcloud beta ai-platform versions create $VERSION \ --model $NAME --explanation-method 'integrated-gradients' then: gcloud beta ai-platform explain --model $NAME \ --version $VERSION --json-instances='data.txt'
when to use normalization?
if the range of values is really large (such as age, income, city population, etc.)
xrai: type of data
images
if you want to do your own hyperparameter tuning in ai platform training how do you do it?
include the --config flag & a config.yaml -under trainingInput, add a hyperparameters section and specify maxTrials, enableTrialEarlyStopping, hyperparameterMetricTag (the metric), the params to search, etc.
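Sketch of what the config.yaml contents might look like, written here as a Python dict (field names follow AI Platform's HyperparameterSpec as best I recall; check the docs before relying on them):
import yaml

config = {
    "trainingInput": {
        "hyperparameters": {
            "goal": "MAXIMIZE",
            "hyperparameterMetricTag": "accuracy",
            "maxTrials": 20,
            "maxParallelTrials": 2,
            "enableTrialEarlyStopping": True,
            "params": [{
                "parameterName": "learning_rate",
                "type": "DOUBLE",
                "minValue": 0.0001,
                "maxValue": 0.1,
                "scaleType": "UNIT_LOG_SCALE",
            }],
        }
    }
}
with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f)   # then pass --config config.yaml to the submit command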
if you increase the classification threshold, what will happen to precision?
it will probably increase b/c false positives will decrease
if you increase the classification threshold, what will happen to recall?
it will stay the same or decrease b/c true positives will decrease or stay the same (and false negatives will increase or stay the same)
what metric do we want to optimize for spam detection?
minimize FP; optimize precision
differentiable models: type of model & explainability framework
neural nets can use integrated gradients or xrai
ai platform prediction - batch - frameworks available?
only tf
what is smote?
oversampling the minority class (by synthesizing new examples) to make the classes more balanced
Transfer Appliance
physical device you connect to your network to ship large datasets to GCS
Which metric is: Did the boy cry wolf too often?
precision
of the things that the system predicted, how correct was it?
precision
Which metric is: Did the boy miss any wolves?
recall
did the system miss anything?
recall
in continuous training, if you find you have data drift, what should be done?
retrain model only
sampled shapley: type of data
tabular
integrated gradients: type of data
tabular, low-resolution images (such as x-rays), text
which frameworks use explainability in ai platform prediction?
tf
ai platform prediction - online - which frameworks available?
tf, xgboost, scikit-learn, etc.
why would you use tf.keras.layers.lambda?
to embed preprocessing directly in the model graph so the same transformation runs at training and serving time, which helps prevent training/serving skew
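Sketch: a Lambda layer that bakes a simple scaling step into the graph (the constants are made up):
import tensorflow as tf

inputs = tf.keras.Input(shape=(1,), name="income")
# the same scaling runs at training and serving time, so there is no skew
scaled = tf.keras.layers.Lambda(lambda x: (x - 50_000.0) / 25_000.0)(inputs)
outputs = tf.keras.layers.Dense(1)(scaled)
model = tf.keras.Model(inputs, outputs)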
if you don't know how to code and want to submit ai platform training job, what should you do?
use UI and choose "prebuilt algorithms" ex: linear learner, XGBoost, wide and deep, object detection, image classification, etc.
BQ DTS
used for scheduled ingestion of Google Ads data (and other Google SaaS sources) into BQ
when to use scaling for normalization?
when the range is evenly distributed & you know the lower/upper bound, e.g. age but NOT income
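Min-max scaling sketch: works for a bounded, evenly spread feature like age; a long-tailed feature like income would need clipping or log-scaling first.
import numpy as np

age = np.array([18, 25, 40, 90], dtype="float32")
age_scaled = (age - age.min()) / (age.max() - age.min())   # values in [0, 1]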
non-differentiable models: type of model & explainability framework
-XGBoost, decision trees -use sampled Shapley