I - 10 End-to-End Machine Learning with TensorFlow on GCP
Method to select % of the dataset
"ABS(FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING)))) AS hashmonth" in query to get a new field Then use: "WHERE MOD(abs(hashmonth), 10) < 8" to get 80% (repeatable) You can then add "AND RAND() < 0.01" to the WHERE clause to get 1/100 of that (not sure if repeatable but seems unlikely)
What are the benefits of using the hashing and modulo operators for creating ML datasets ? 1. It allows you to create datasets in a repeatable manner. 2. It is more computationally efficient than using the rand() function. 3. It provides the best performing split for training and evaluation. 4. None of the above.
1. It allows you to create datasets in a repeatable manner.
What makes a feature 'good'?
1. Should be related to the objective 2. Known at prediction time 3. Should be numeric with meaningful magnitude 4. Have enough examples 5. Bring human insight to the problem
What does the scale-tier represent in CMLE training jobs? 1. The type of hardware you want your model to run on 2. The number of steps your model should scale to 3. The number of distributed machines your model should use
1. The type of hardware you want your model to run on
What is the purpose of task.py when creating a Python package for ML Engine? 1. To parse command-line parameters and to call the model training logic. 2. It defines the task that will be submitted to ML Engine. 3. It contains the hyperparameters that will be used for model training. 4. None of the above.
1. To parse command-line parameters and to call the model training logic.
Three keys to a successful ML model:
1. Training the model on as much data as you can 2. Feature engineering (human insights) 3. Using appropriate model architecture
What are the benefits of the TensorFlow Estimator API?
1. You can train both locally and in a distributed model training environment. 2. It provides a high-level API, simplifying model development. 3. It automatically saves summaries to TensorBoard.
True or False: In ML, you could train using all your data and decide not to hold out a test set and still get a good model
True
Question 2 What are the benefits of using the hashing and modulo operators for creating ML datasets ?
Allows you to create datasets in a repeatable way
What is Google datalab?
Cloud Datalab is a powerful interactive tool created to explore, analyze, transform and visualize data and build machine learning models on Google Cloud Platform. It runs on Google Compute Engine and connects to multiple cloud services easily so you can focus on your data science tasks.
Datalab vs. Data Studio
Cloud Datalab is intended for doing for data science and machine learning. While Data Studio is focused on reports, Datalab is focused on notebooks.
tf.estimator
High level API for distributed training
True or False: Apache Beam can only be executed on a local environment or using Cloud DataFlow.
False
True or False: Cloud DataFlow and ML Engine have servers
False, Cloud DataFlow and ML Engine are both serverless technologies
A useful BigQuery hashing function used to create training, test, and validation datasets
Farm Fingerprint
TensorFlow is a framework for:
Numeric processing
For productionalizing a model, a machine learning framework needs to be _________, _______, and _______.
Repeatable, scalable, and tuned
Describe Cloud ML Engine
The Google Cloud ML Engine is a hosted platform to run machine learning training jobs and predictions at scale. The service treats these two processes (training and predictions) independently. It is possible to use Google Cloud ML Engine just to train a complex model by leveraging the GPU and TPU infrastructure. The outcome from this step — a fully-trained machine learning model — can be hosted in other environments including on-prem infrastructure and public cloud. The service can also be used to deploy a model that is trained in external environments. Cloud ML Engine automates all resource provisioning and monitoring for running the jobs. It can also manage the lifecycle of deployed models and their versions.
Training-Serving Skew
The difference between what a model was trained on and what it is being presented at prediction time
What does the scale-tier represent in CMLE training jobs?
The type of hardware you want your model to run on and the number of distributed machines your model should use
What is the purpose of task.py when creating a Python package for ML Engine?
To parse command-line parameters and to call the model training logic
When do you use TensorFlow, other than for deep learning?
When your datasets won't fit in memory. As your data size increases, batching and distribution become important. Better than sampling (losing data) or using huge hardware.