SIMULATED TEST QUESTIONS - PRACTICE EXAM 2 - (please feel free to submit edits/corrections to Mike!)
Your team is designing a fraud detection system for a major Bank. The requirements are: Various banking applications will send transactions to the new system in real-time and in standard/normalized format. The data will be stored in real-time with some statistical aggregations. An ML model will be periodically trained for outlier detection. The ML model will issue the probability of fraud for each transaction. It is preferable to have no labeling and as little software development as possible. Which kinds of ML models could be used (pick 2)? A. K-means B. Decision Tree C. Random Forest D. Matrix Factorization E. Boosted Tree - XGBoost
A. K-means E. Boosted Tree - XGBoost SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. K-means clustering is a statistical method that partitions n observations, represented as numerical vectors, into k clusters. Each example belongs to the cluster with the closest mean (cluster centroid). In ML, it is an unsupervised method and is widely used to detect unusual or outlier movements. For these reasons, it is one of the main methods for fraud detection. But it is not the only method because not all frauds are linked to strange movements. There may be other factors. XGBoost, which as you can see from the figure is an evolution of decision trees, has recently been widely used in this field and has had many positive results. It is an open-source project and this is the description from its GitHub page: XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Kubernetes, Hadoop, SGE, MPI, Dask) and can solve problems beyond billions of examples. B and C are suboptimal because they rely on plain Decision Trees, which XGBoost improves upon. D is wrong because Matrix Factorization is for recommender systems: it predicts the preference for an item based on the experience of other users, so it is not suitable for us. For any further detail: https://cloud.google.com/solutions/building-anomaly-detection-dataflow-bigqueryml-dlp https://cloud.google.com/architecture/detecting-anomalies-in-financial-transactions https://medium.com/@adityakumar24jun/xgboost-algorithm-the-new-king-c4a64ea677bf
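The clustering idea described above (each example belongs to the nearest centroid, and unusual movements sit far from every centroid) can be sketched in plain Python. This is only an illustration under stated assumptions: the two features, the sample values and the function names are hypothetical, the features are left unscaled, and the score is a raw distance rather than the fraud probability the question asks for.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct training points
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids

def outlier_score(point, centroids):
    """Distance to the closest centroid; larger means more unusual."""
    return min(math.dist(point, c) for c in centroids)

# Hypothetical history: (amount, transactions that day) for two usual behaviours.
normal = [(10.0, 1.0), (12.0, 2.0), (9.0, 1.0), (11.0, 3.0),
          (200.0, 5.0), (210.0, 4.0), (190.0, 6.0), (205.0, 5.0)]
centroids = kmeans(normal, k=2)

suspicious = (5000.0, 40.0)  # hypothetical new transaction
print("largest score among normal points:", max(outlier_score(p, centroids) for p in normal))
print("score of the suspicious point:", outlier_score(suspicious, centroids))
```

No labels are needed at any point, which is why this family of models matches the "no labeling" requirement.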
You are a junior Data Scientist, and you need to create a new classification Machine Learning model with Tensorflow. You have a limited set of data on which you build your model. You know the rule to create training, test and validation datasets, but you're afraid you won't have enough to make something satisfying. Which solution is the best one? A. Use Cross-Validation B. All data for learning C. Split data between Training and Test D. Split data between Training and Test and Validation
A. Use Cross-Validation SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. Cross-validation involves running our modeling process on various subsets of data, called "folds". Obviously, this creates a computational load. Therefore, it can be prohibitive in very large datasets, but it is great when you have small datasets. B is wrong because using all the data for learning leaves nothing to evaluate on, which is the surest way to overfit without noticing. C and D are wrong because with small datasets cross-validation gives a far more reliable estimate than a single fixed split. For any further detail: https://developers.google.com/machine-learning/glossary?hl=en#cross-validation https://www.kaggle.com/alexisbcook/cross-validation
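A minimal sketch of how the "folds" work, in plain Python. The helper names, the toy data and the mean-predictor "model" are assumptions made for illustration; in practice a library routine would normally be used instead.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

def cross_validate(xs, ys, k, fit, score):
    """Train on k-1 folds, score on the held-out fold, and average the k scores."""
    scores = []
    for train, val in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in val], [ys[i] for i in val]))
    return sum(scores) / len(scores)

# Toy stand-ins: the "model" is just the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)

def mean_abs_error(model, val_x, val_y):
    return sum(abs(y - model) for y in val_y) / len(val_y)

xs = list(range(10))
ys = [2.0 * x for x in xs]
print(cross_validate(xs, ys, k=3, fit=fit_mean, score=mean_abs_error))
```

Every example is used for both training and validation across the runs, which is what makes the approach attractive when data is scarce.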
You are using AI Platform, and you are working with a series of demanding training jobs. So, you want to use TPUs instead of CPUs. You are not using Docker images or custom containers. What is the simplest configuration to indicate if you do not have particular needs to customize in the YAML configuration file? A. Use scale-tier to BASIC_TPU B. Set Master-machine-type C. Set Worker-machine-type D. Set parameterServerType
A. Use scale-tier to BASIC_TPU AI Platform lets you perform distributed training and serving with accelerators (TPUs and GPUs). You usually must specify the number and types of machines you need for master and worker VMs. But you can also use scale tiers, which are predefined cluster specifications. In our case, scale-tier=BASIC_TPU covers all the given requirements. B, C and D are wrong because they are not the easiest way: fields such as masterType, workerType, parameterServerType, evaluatorType, workerCount, parameterServerCount and evaluatorCount come into play only when the scale tier is CUSTOM. For any further detail: https://cloud.google.com/ai-platform/training/docs/machine-types#scale_tiers https://cloud.google.com/ai-platform/training/docs https://cloud.google.com/ai-platform/training/docs/using-tpus#configuring_a_custom_tpu_machine https://cloud.google.com/tpu/docs/tpus
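To make the point concrete, here is a hedged sketch of a training job specification that relies on the predefined tier alone. The field names follow the AI Platform Training `trainingInput` structure; the job id, bucket path, module name, region and runtime version are hypothetical placeholders, and the helper function itself is invented for illustration.

```python
import json

def basic_tpu_job_spec(job_id, package_uri, python_module, region, runtime_version):
    """Build a job body that relies on the predefined BASIC_TPU tier.

    No masterType / workerType / parameterServerType is set: those fields
    belong to the CUSTOM tier, which is exactly what the answer avoids.
    """
    return {
        "jobId": job_id,
        "trainingInput": {
            "scaleTier": "BASIC_TPU",
            "packageUris": [package_uri],
            "pythonModule": python_module,
            "region": region,
            "runtimeVersion": runtime_version,
        },
    }

# All argument values below are placeholders, not recommendations.
spec = basic_tpu_job_spec(
    job_id="demo_tpu_job",
    package_uri="gs://example-bucket/trainer-0.1.tar.gz",
    python_module="trainer.task",
    region="us-central1",
    runtime_version="REPLACE_WITH_A_TPU_CAPABLE_VERSION",
)
print(json.dumps(spec, indent=2))
```

The equivalent YAML configuration file would carry the same `trainingInput.scaleTier` key and nothing about individual machine types.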
The purpose of your current project is the recognition of genuine or forged signatures on checks and documents against regular signatures already stored by the Bank. There is obviously a very low incidence of fake signatures. The system must recognize which customer the signature belongs to and whether the signature is identified as genuine or skilled forged. Which of the following technical specifications can't you use with CNN? A. Kernel Selection B. Feature Cross C. Stride D. Max pooling layer
B. Feature Cross SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. A feature cross is a synthetic feature created by multiplying (crossing) two or more features. It has proved to be an important technique and is also used to introduce non-linearity to the model. We don't need it in our case. Filters or kernels are a computation on a sub-matrix of pixels. Stride is the number of pixels by which the kernel slides at each step (1 pixel in the simplest case). A Max pooling layer is created by taking the max value of a small region. It is used for simplification. Dropout is also for simplification or regularization. It randomly zeroes some of the matrix values in order to find out what can be discarded with minor loss (and no overfitting). For any further detail: Convolutional Neural Networks — A Beginner's Guide | by Krut Patel
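The three CNN building blocks named above (kernel, stride, max pooling) can be illustrated with a small plain-Python sketch. The 4x4 "image" and the 2x2 kernel are made-up values, and the function names are only illustrative; real models would use a deep learning framework.

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image, moving `stride` pixels per step (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(0, len(image) - kh + 1, stride):
        row = []
        for c in range(0, len(image[0]) - kw + 1, stride):
            # Kernel computation on the sub-matrix whose top-left corner is (r, c).
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

def max_pool(feature_map, size=2):
    """Keep only the largest value of each non-overlapping size x size region."""
    return [
        [max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
         for c in range(0, len(feature_map[0]) - size + 1, size)]
        for r in range(0, len(feature_map) - size + 1, size)
    ]

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 0],
          [0, -1]]  # top-left pixel minus bottom-right pixel

print(conv2d(image, kernel, stride=1))  # 3x3 output
print(conv2d(image, kernel, stride=2))  # 2x2 output: a larger stride shrinks the map
print(max_pool(image, size=2))
```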
Your client has an e-commerce site for commercial spare parts for cars with competitive prices. It started with the small car sector but is continually adding products. Since 80% of them operate in a B2B market, he wants to ensure that his customers are encouraged to use the new products that he gradually offers on the site quickly and profitably. You decided on Recommendations AI. What specific recommendation model type is not useful for new products? A. Others You May Like B. Frequently Bought Together C. Recommended for You D. Recently Viewed
D. Recently Viewed The "Recently Viewed" recommendation is not for new products, and it is not a recommendation either. It provides the list of products the user has recently viewed, starting with the last. For any further detail: "Others You May Like" "Frequently Bought Together" (shopping cart expansion) "Recommended for You" "Recently Viewed"
You are working on a deep neural network model with Tensorflow on a cluster of VMs for a Bank. Your model is complex, and you work with huge datasets with complex matrix computations. You have a big problem: your training jobs last for weeks. You are not going to deliver your project in time. Which is the best solution that you can adopt? A. Cloud TPU B. Nvidia GPU C. Intel CPU D. AMD CPU
A. Cloud TPU Given these requirements, it is the best solution. GCP documentation states that the use of TPUs is advisable with models that: use TensorFlow; need training for weeks or months; have huge matrix computations; deal with big datasets and large effective batch sizes. So, A is better than B, while C and D are wrong because CPUs are inadequate for our purpose. For any further detail: https://cloud.google.com/tpu/docs/tpus https://cloud.google.com/tpu/docs/how-to
TerramEarth is a company that builds heavy equipment for mining and agriculture. It is developing a series of ML models for different activities: manufacturing, procurement, logistics, marketing, customer service and vehicle tracking. TerramEarth uses Google Cloud AI Platform and wants to scale training and inference processes in a managed way. It is necessary to forecast whether a vehicle, based on the data collected during the maintenance service, has risks of failures in the next six months in order to recommend an extraordinary service operation. Which kind of technology/model should you advise using? A. Feedforward Neural Network B. Convolutional Neural Network C. Recurrent Neural Network D. Transformers E. Reinforcement Learning F. GAN Generative Adversarial Network
A. Feedforward Neural Network SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. Feedforward neural networks are the classic example of neural networks. In fact, they were the first and most elementary type of artificial neural network. Feedforward neural networks are mainly used for supervised learning when the data, mainly numerical, to be learned is neither time-series nor sequential (such as NLP). These networks do not have any cycles or loops. Information moves in one direction only, forward, from the input nodes, through the hidden nodes (if any) and to the output nodes. All the other techniques are more complex and suitable for different applications (images, NLP, recommendations). A brief explanation of each follows. The convolutional neural network (CNN) is a type of artificial neural network extensively used for image recognition and classification. It uses convolutional layers, that is, the reworking of sets of pixels by running filters on the input pixels. A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. A transformer is a deep learning model that can give different importance to each part of the input data. It is used for NLP (natural language processing) and in computer vision. Reinforcement Learning provides a software agent that evaluates possible solutions through a progressive reward in repeated attempts. It does not need labels, but it requires a lot of data and several trials, and the possibility to evaluate the validity of each attempt. A GAN is a special class of machine learning frameworks used to generate synthetic data, for example realistic facial images. An Autoencoder is a neural network that learns a compressed representation of raw data.
For any further detail: https://en.wikipedia.org/wiki/Feedforward_neural_network
In your company, you train and deploy several ML models with Tensorflow. You use on-prem servers, but you often find it challenging to manage the most expensive training and control and update the models. You are looking for a system that can handle all these tasks. Which solutions can you adopt (pick 2)? A. Kubeflow to run on Google Kubernetes Engine B. AI Platform (currently Vertex AI) C. Use Scikit-Learn that is simple and powerful D. Use SageMaker managed services
A. Kubeflow to run on Google Kubernetes Engine B. AI Platform (currently Vertex AI) SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. Kubeflow Pipelines is an open-source platform designed specifically for creating and deploying ML workflows based on Docker containers. Its main features: using packaged templates in Docker images in a K8s environment; managing your various tests/experiments; simplifying the orchestration of ML pipelines; reusing components and pipelines. AI Platform is an integrated suite of managed ML products aimed at: training an ML model; evaluating and tuning a model; deploying models; managing prediction (batch, online and monitoring); managing model versions (workflows and retraining). C is wrong because Scikit-learn is an ML library with many standard algorithms that are easy and immediate to use, but it doesn't manage ML pipelines. TensorFlow (from the official doc) is an end-to-end open-source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state-of-the-art in ML, and developers easily build and deploy ML-powered applications. So, they are 2 different platforms, even if there is Scikit Flow that integrates the two. D is wrong because SageMaker is an AWS ML product. For any further detail: https://cloud.google.com/ai-platform/training/docs/tensorflow-2 https://cloud.google.com/ai-platform/docs/technical-overview
You work in a company that has acquired an advanced consulting services company. Management wants to analyze all past important projects and key customer relationships. The consulting company does not have an application that manages this data in a structured way but is certified for the quality of its services. All its documents follow specific rules. It was decided to acquire structured information on projects, areas of expertise and customers through the analysis of these documents. You're looking for ML methodologies that make this process quicker and easier. What is the best choice in GCP? A. Cloud Vision B. Cloud Natural Language API C. Document AI D. AutoML Natural Language
C. Document AI Document AI is the ideal broad-spectrum solution. It is a service that gives a complete solution with computer vision and OCR, NLP and data management. It allows you to extract and structure information automatically from documents. It can also enrich them with the Google Knowledge Graph to verify company names, addresses, and telephone numbers to draw additional or updated information. All other answers (Cloud Vision, Cloud Natural Language API, AutoML Natural Language) are incorrect because their functions are already built into Document AI. For any further detail: https://cloud.google.com/document-ai
You are training a set of models that should be simple, using regression techniques. During training, your model seems to work. But the tests are giving unsatisfactory results. You discover that you have several data errors and missing data. You need a tool that helps you cope with them. Which of the following problems is not related to Data Validation? A. Omitted values. B. Categories C. Duplicate examples. D. Bad labels. E. Bad feature values
B. Categories SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. Categories are not related to Data Validation. They are categorical string variables that in ML are usually mapped to a numerical set before training. A is OK because omitted values may change fundamental statistics such as the average. C is OK because duplicate examples may change fundamental statistics, too. For example, we may have duplicates when a program loops and creates the same data several times. D and E are OK because having bad labels (with supervised learning) or bad features means obtaining a bad model. For any further detail: https://developers.google.com/machine-learning/crash-course/representation/cleaning-data
You work in a major banking institution. The Management has decided to rapidly launch a bank loan service, as the Government has created a series of "first home" facilities for the younger population. The goal is to carry out the automatic management of the required documents (certificates, origin documents, legal information) so that the practice can be built and verified automatically using the data and documents provided by customers and can be managed in a short time and with the minimum contribution of the scarce specialized personnel. Which of these GCP services can you use? A. Dialogflow B. Document AI C. Cloud Natural Language API D. AutoML
B. Document AI SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. Document AI is the perfect solution because it is a complete service for the automatic understanding of documents and their management. It integrates natural language processing, OCR and computer vision and can create pre-trained templates aimed at intelligent document administration. A is wrong because Dialogflow is for conversational interfaces (chat and voice dialogs), not written documents. C is wrong because NLP is integrated into Document AI. D is wrong because functions like AutoML are integrated into Document AI, too. For any further detail: https://cloud.google.com/document-ai https://cloud.google.com/solutions/lending-doc-ai https://www.qwiklabs.com/focuses/12733?&parent=catalog https://cloud.google.com/automl https://cloud.google.com/blog/products/ai-machine-learning/google-cloud-announces-document-ai-platform
You are working with a Linear Regression model for an important Financial Institution. Your model has many independent variables. You discovered that you could not achieve good results because many variables are correlated. You asked for advice from an experienced Data scientist that explains what you can do. Which techniques or algorithms did he advise to use (pick 3)? A. Multiple linear regression with MLE B. Partial Least Squares C. Principal components D. Maximum Likelihood Estimation E. Multivariate Multiple Regression
B. Partial Least Squares C. Principal components E. Multivariate Multiple Regression SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. If you have many independent variables, some of which are correlated with each other, you have multicollinearity; therefore, you cannot use classical linear regression. Partial Least Squares and Principal components create new variables that are uncorrelated. The Partial Least Squares method projects the original variables onto new ones. The principal components of PCA reduce the number of variables while maintaining most of their variance, that is, the amount of variability contained in the original features. Multivariate regression finds out ways to explain how different elements in variables react together to changes. A is wrong because Multiple linear regression is an OLS (Ordinary Least Squares) method. D is wrong because Maximum Likelihood Estimation requires independence for variables, too. Maximum Likelihood Estimation finds the model parameter values that maximize the likelihood of seeing the examples given the model. For any further detail: https://towardsdatascience.com/partial-least-squares-f4e6714452a https://en.wikipedia.org/wiki/Partial_least_squares_regression https://towardsdatascience.com/maximum-likelihood-estimation-984af2dcfcac https://www.mygreatlearning.com/blog/introduction-to-multivariate-regression/ https://colab.research.google.com/github/kaustubholpadkar/Predicting-House-Price-using-Multivariate-Linear-Regression/blob/master/Multivariate_Linear_Regression_Python.ipynb https://en.wikipedia.org/wiki/Polynomial_regression
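The claim that principal components are uncorrelated can be checked on a tiny two-variable case. The sketch below uses the closed-form rotation that diagonalizes a 2x2 covariance matrix; the data values and function names are made up for illustration, and a real project would use a library PCA on many variables.

```python
import math

def mean(v):
    return sum(v) / len(v)

def covariance(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def correlation(a, b):
    return covariance(a, b) / math.sqrt(covariance(a, a) * covariance(b, b))

def principal_components_2d(a, b):
    """Rotate the centred (a, b) cloud so that the new axes have zero covariance."""
    ma, mb = mean(a), mean(b)
    # Rotation angle that diagonalizes the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * covariance(a, b), covariance(a, a) - covariance(b, b))
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [(x - ma) * c + (y - mb) * s for x, y in zip(a, b)]
    pc2 = [-(x - ma) * s + (y - mb) * c for x, y in zip(a, b)]
    return pc1, pc2

# Two hypothetical, strongly correlated independent variables.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

pc1, pc2 = principal_components_2d(x1, x2)
print("correlation of the original variables:", correlation(x1, x2))
print("covariance of the two components:", covariance(pc1, pc2))
print("variance carried by each component:", covariance(pc1, pc1), covariance(pc2, pc2))
```

Almost all the variability ends up in the first component, so a regression on it avoids the multicollinearity of the original pair.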
You are training a set of models that should be simple, using regression techniques. During training, your model seems to work. But the tests are giving unsatisfactory results. You discover that a lot of data is missing. You need a tool that helps you cope with this. Which GCP product would you choose? A. Dataproc B. Dataprep C. Dataflow D. Data Fusion
B. Dataprep Dataprep is a serverless service that lets you examine, clean and correct structured and unstructured data. So, it is fully compliant with our requirements. Dataproc is a managed Spark and Hadoop service. Therefore, it is for Big Data processing. Cloud Dataflow is a managed service to run Apache Beam-based data pipelines, both batch and streaming. Data Fusion is for data pipelines too. But it is visual and simpler, and it integrates multiple data sources to produce new data. For any further detail: https://cloud.google.com/dataprep https://docs.trifacta.com/display/dp/ https://developers.google.com/machine-learning/crash-course/representation/cleaning-data
Your team is designing a fraud detection system for a major Bank. The requirements are: Various banking applications will send transactions to the new system in real-time and in standard/normalized format. The data will be stored in real-time with some statistical aggregations. An ML model will be periodically trained for outlier detection. The ML model will issue the probability of fraud for each transaction. It is preferable to have no labeling and as little software development as possible. Which products would you choose (pick 3)? A. Dataprep B. Dataproc C. Dataflow Flex D. Pub/Sub E. Composer F. BigQuery
C. Dataflow Flex D. Pub/Sub F. BigQuery The optimal procedure to achieve the goal is: Pub/Sub to capture the data stream; Dataflow Flex to aggregate and extract insights in real-time into BigQuery; BigQuery ML to create the models. All the other solutions will be sub-optimal and will need more effort. Practice with this lab for a detailed experience. For any further detail: https://cloud.google.com/solutions/building-anomaly-detection-dataflow-bigqueryml-dlp https://cloud.google.com/architecture/detecting-anomalies-in-financial-transactions
"Your team is working for a major apparel company that is developing an online business with significant investments.The company adopted Analytics-360. So, it can achieve a lot of data on the activities of its customers and on the interest of the various commercial initiatives of the websites, such as (from Google Analytics-360):" Average bounce rate per dimension Average number of product page views by purchaser type Average number of transactions per purchaser Average amount of money spent per session Sequence of hits (pathing analysis) Multiple custom dimensions at hit or session level Average number of user interactions before purchase The first thing management wants is to categorize customers to determine which types are more likely to buy. Subsequently, further models will be created to incentivize the most interesting customers better and boost sales. You have a lot of work to do and you want to start quickly. What techniques do you use in this first phase (pick 2)? A. BigQuery e BigQueryML B. Cloud Storage con AVRO C. AI Platform and Tensorflow D. Binary Classification E. K-means F. KNN
A. BigQuery and BigQuery ML E. K-means For these requirements, it is necessary to create different groups of customers based on purchases and their characteristics. We are in the field of unsupervised learning. BigQuery is already set up both for data acquisition and for training, validation and use of this kind of model. The K-means model in BigQuery ML uses a technique called clustering. Clustering is a statistical technique that, in our case, automatically groups customers with similar behaviors for marketing. All the other answers address more complex and more cumbersome solutions. Furthermore, the other models are all supervised, whereas we do not have ready-made categories and want the model itself to provide them. For any further detail: https://cloud.google.com/bigquery-ml/docs/kmeans-tutorial https://cloud.google.com/architecture/building-k-means-clustering-model
You are starting to operate as a Data Scientist and are working on a deep neural network model with Tensorflow to optimize customer satisfaction for after-sales services to create greater client loyalty. You are doing Feature Engineering, and your focus is to minimize bias and increase accuracy. Your coordinator has told you that by doing so you risk having problems. He explained to you that, in addition to the bias, you must consider another factor to be optimized. Which one? A. Blending B. Learning Rate C. Feature Cross D. Bagging E. Variance
E. Variance SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. The variance indicates how much the function f(X) can change with a different training dataset. Obviously, different estimates will correspond to different training datasets, but a good model should reduce this gap to a minimum. The bias-variance dilemma is the problem of minimizing both bias and variance at the same time. The bias error comes from erroneous or overly simple assumptions in the learning algorithm. The higher it is, the more underfitting there is. Variance is the sensitivity to differences in the training set. The higher it is, the more overfitting there is. A is wrong because Blending indicates an ensemble of ML models. B is wrong because Learning Rate is a hyperparameter in neural networks. C is wrong because Feature Cross is the method for obtaining new features by multiplying other ones. D is wrong because Bagging is an ensemble method like Blending. For any further detail: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
You work for a large retail company. You are preparing a marketing model. The model will have to make predictions based on the historical and analytical data of the e-commerce site (analytics-360). In particular, customer loyalty and remarketing possibilities should be studied. You work on historical tabular data. You want to quickly create an optimal model, both from the point of view of the algorithm used and the tuning and life cycle of the model. What are the two best services you can use? A. AutoML Tables B. BigQuery ML C. Vertex AI D. GKE
A. AutoML Tables C. Vertex AI SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. AutoML Tables can select the best model for your needs without having to experiment. The architectures currently used (new ones are added over time) are: Linear; Feedforward deep neural network; Gradient Boosted Decision Tree; AdaNet; Ensembles of various model architectures. In addition, AutoML Tables automatically performs feature engineering tasks, too, such as: Normalization; Encoding and embeddings for categorical features; Timestamp columns management (important in our case). So, it has special features for time columns: for example, it can correctly split the input data into training, validation and testing. Vertex AI is a new API that combines AutoML and AI Platform. You can use both AutoML training and custom training in the same environment. B is wrong because AutoML Tables has additional automated feature engineering and is integrated into Vertex AI. D is wrong because GKE doesn't supply all the ML features of Vertex AI. It is an advanced K8s managed environment. For any further detail: https://cloud.google.com/automl-tables/docs/features https://cloud.google.com/vertex-ai/docs/pipelines/introduction https://cloud.google.com/automl-tables/docs/beginners-guide
Your company operates an innovative auction site for furniture from all periods. You have to create a series of ML models that allow you to establish the period, style and type of the piece of furniture depicted starting from the photos. Furthermore, the model must be able to determine whether the furniture is interesting and requires a more detailed estimate. You created the model, but your manager said that he wants to supply this service to mobile users when they go to the flea markets. Which of the following services do you think is the most suitable? A. AutoML Vision Edge B. Vision AI C. Video AI D. AutoML Vision
A. AutoML Vision Edge SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. AutoML Vision Edge lets your model be deployed on edge devices and, therefore, mobile phones, too. All the other answers refer to Cloud solutions; so, they are wrong. Vision AI uses pre-trained models trained by Google. AutoML Vision lets you train models to classify your images with your own characteristics and labels; so, you can tailor your work as you want. Video AI manages videos, not pictures. It can extract metadata from any streaming video, get insights in a far shorter time, and trigger events. For any further detail: https://cloud.google.com/vision/automl/docs/edge-quickstart https://cloud.google.com/vision/automl/docs/beginners-guide https://firebase.google.com/docs/ml/automl-image-labeling
Your company is designing a series of models aimed at optimal customer care management. For this purpose, all written and voice communications with customers are recorded so that they can be classified and managed. The problem is that Clients often provide private information that cannot be distributed and disclosed. Which of the following techniques can you use (pick 3)? A. Cloud Data Loss Prevention API (DLP) B. CNN - Convolutional Neural Network C. Cloud Speech API D. Cloud Vision API
A. Cloud Data Loss Prevention API (DLP) C. Cloud Speech API D. Cloud Vision API Cloud Data Loss Prevention is a managed service specially designed to automatically discover sensitive data that must be protected. It could be used for personal codes, credit card numbers, addresses and any private contact details, etc. Cloud Speech API is useful if you have audio recordings as it is a speech-to-text service. Cloud Vision API has a built-in text-detection service. So you can get text from images. B is wrong because a Convolutional Neural Network is a Deep Neural Network in which the layers are made up of processed sections of the source image. So, it is a successful method for image and shape classification, not a tool for finding private information. For any further detail: https://cloud.google.com/architecture/sensitive-data-and-ml-datasets
TerramEarth is a company that builds heavy equipment for mining and agriculture. During maintenance services for vehicles produced by TerramEarth at the service centers, information relating to their use is downloaded. Every evening, this data flows into the data center, is consolidated and sent to the Cloud. TerramEarth has an ML model that predicts component failures and optimizes the procurement of spare parts for service centers to offer customers the highest level of service. TerramEarth wants to automate the redevelopment and distribution process every time it receives a new file. What is the best service to start the process? A. Cloud Storage trigger with Cloud Functions B. Cloud Scheduler every night C. Pub/Sub D. Cloud Run and Cloud Build
A. Cloud Storage trigger with Cloud Functions SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. Files are received in Cloud Storage, which has native triggers for all the events related to its file management. So, we may start a Cloud Function that may activate any Cloud Service as soon as the file is received. Cloud Storage can also publish a Pub/Sub notification, but that is a little more complex. The Cloud Storage trigger is the simplest and most direct solution of all the answers. For any further detail: https://cloud.google.com/functions/docs/calling/storage https://cloud.google.com/blog/products/gcp/cloud-storage-introduces-cloud-pub-sub-notifications
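A hedged sketch of what such a function might look like, assuming a 1st-gen background function whose event payload carries the `bucket` and `name` of the uploaded object. The `.csv` filter and the `start_retraining` helper are hypothetical; the helper only stands in for whatever service call actually launches the retraining and redistribution process.

```python
def start_retraining(input_uri):
    """Hypothetical placeholder for the call that launches the ML pipeline."""
    return {"action": "retrain", "input_uri": input_uri}

def on_new_file(event, context):
    """Entry point invoked when an object is finalized in the bucket.

    `event` is the Cloud Storage object metadata; only `bucket` and `name`
    are read here. `context` carries event metadata and is unused.
    """
    bucket = event["bucket"]
    name = event["name"]
    if not name.endswith(".csv"):  # assumption: the nightly files are CSV
        return None
    return start_retraining(f"gs://{bucket}/{name}")

# Local dry run with a fake event, no cloud resources involved.
print(on_new_file({"bucket": "example-bucket", "name": "usage/2021-01-01.csv"}, None))
```

No scheduler or polling is needed: the upload itself is the trigger, which is why this option is the most direct.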
Your team prepared a custom model with Tensorflow that forecasts, based on diagnostic images, which cases need more analysis and medical support. The accuracy of the model is very high. But when it is deployed in production, the medical staff is very dissatisfied. What is the most likely reason? A. Logistic regression with a classification threshold too high B. DNN Model with overfitting C. DNN Model with underfitting D. You have to perform feature crosses
A. Logistic regression with a classification threshold too high When there is an imbalance between true and false ratios in binary classification, it is necessary to modify the classification threshold so that the most probable errors are those with minor consequences. In our case, it is better to be wrong with a healthy person than with a sick one. Accuracy is the number of correct predictions out of the total number of predictions. Let's imagine that we have 100 predictions, and 95 of them are correct. That is 95%. It looks almost perfect. But suppose that the system produced 94 true negatives, only 1 true positive, 1 false positive and 4 false negatives. So, the model predicted 98 healthy when they were 95 and 2 suspected cases when they were 5. The problem is that sick patients are, luckily, a minimal percentage. But it is important that they are intercepted. So, our model failed because it correctly identified only 1 case out of the total of 5 real positives, that is 20% (recall). It also flagged 2 positives, one of which was actually negative, i.e. 50% (precision). It's not good at all. Precision: rate of correct positive identifications. Recall: rate of real positives correctly identified. To calibrate the result, we need to change the threshold we use to decide between positive and negative. The model does not return 0 and 1 but a value between 0 and 1 (sigmoid activation function). In our case, we have to choose a threshold lower than 0.5 to classify a case as positive. In this way, we risk carrying out further investigations on healthy people, but we are able to treat more sick patients. It is definitely the desired result. For any further detail: https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall
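The arithmetic of the worked example above can be reproduced in a few lines. The confusion-matrix counts are the ones given in the explanation; the model scores for the five sick patients are invented purely to show how lowering the threshold raises recall.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

def recall_at(positive_scores, threshold):
    """Share of truly positive cases whose score reaches the threshold."""
    caught = sum(1 for s in positive_scores if s >= threshold)
    return caught / len(positive_scores)

# Counts from the explanation: 1 TP, 1 FP, 4 FN, 94 TN.
m = metrics(tp=1, fp=1, fn=4, tn=94)
print(m)  # accuracy 0.95, precision 0.5, recall 0.2

# Hypothetical sigmoid outputs for the 5 sick patients.
sick_scores = [0.90, 0.45, 0.40, 0.35, 0.30]
print(recall_at(sick_scores, 0.5))   # only the 0.90 case is caught
print(recall_at(sick_scores, 0.25))  # all five cases are caught
```

A lower threshold will usually also flag more healthy patients, which is the trade-off the explanation accepts.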
You are a junior Data Scientist. You are working with a linear regression model with sklearn. Your model showed a good R-squared (coefficient of determination), but the final results were poor. When you asked for advice, your mentor laughed and said that you failed because of the Anscombe Quartet problem. What are the possible problems described by the famous Anscombe Quartet? A. Not linear relation between independent and dependent variables B. Outliers that change the result C. Correlation among variables D. Incorrect Data
A. Not linear relation between independent and dependent variables B. Outliers that change the result SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. The most common problems are non-linear relations and outliers. As you may see in the referenced picture, data without a linear relationship between X and Y can still give you good summary statistics. C and D are wrong because correlation among variables and incorrect data do prevent the model from working well, but they do not produce misleadingly good theoretical results. For any further detail: https://en.wikipedia.org/wiki/Anscombe%27s_quartet https://www.r-bloggers.com/2015/01/k-means-clustering-is-not-a-free-lunch/
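As a rough check of this point, the sketch below computes the Pearson correlation for the four published Anscombe datasets using only the standard library. The `pearson` helper is written here for illustration; in practice a statistics library would be used.

```python
import math

# The four datasets of Anscombe's quartet (sets I-III share the same x values).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

correlations = [pearson(x123, y1), pearson(x123, y2), pearson(x123, y3), pearson(x4, y4)]
for name, r in zip(["I", "II", "III", "IV"], correlations):
    # For simple linear regression, R-squared equals r squared.
    print(f"set {name}: r={r:.3f}  R^2={r * r:.3f}")
```

All four sets give nearly the same correlation even though set II is curved and sets III and IV are dominated by a single outlying point, which is why plotting the data matters before trusting R-squared.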
You work for a video game company. Your management came up with the idea of creating a game in which the characteristics of the characters were taken from those of the human players. You have been asked to generate not only the avatars but also the various visual expressions during the game actions. You are working with GAN - Generative Adversarial Network models, but the training is intensive and time-consuming. You want to increase the power of your training quickly, but your management wants to keep costs down. What solutions could you adopt (pick 3)? A. Use preemptible Cloud TPU B. Use AI Platform with TPUs C. Use the Cloud TPU Profiler TensorBoard plugin D. Use one Compute Engine Cloud TPU VM and install TensorFlow
A. Use preemptible Cloud TPU B. Use AI Platform with TPUs C. Use the Cloud TPU Profiler TensorBoard plugin SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. All these solutions are ideal for increasing power and speed at the right cost for your training. You may use preemptible Cloud TPU (70% cheaper) for your fault-tolerant machine learning workloads. You may use TPUs in the AI Platform because TensorFlow APIs and custom templates can allow the managed environment to use TPUs and GPUs using scale tiers. You may optimize your workload using the Profiler with TensorBoard. TensorBoard is a visual tool for ML experimentation for Tensorflow. D is not advisable because there are AI Platform Deep Learning VM Image types. So, you don't have to install your own ML tools and libraries and you can use managed services that help you with more productivity and savings. For any further detail: https://storage.googleapis.com/nexttpu/index.html https://cloud.google.com/ai-platform/training/docs/using-tpus
You work for a digital publishing website with an excellent technical and cultural level, where you have both famous authors and unknown experts who express ideas and insights. You, therefore, have an extremely demanding audience with strong interests that can be of various types. Users have a small set of articles that they can read for free every month. Then they need to sign up for a paid subscription. You have been asked to prepare an ML training model that processes user readings and article preferences. You need to predict trends and topics that users will prefer. But when you train your DNN with Tensorflow, your input data does not fit into RAM memory. What can you do in the simplest way? A. Use tf.data.Dataset B. Use a queue with tf.train.shuffle_batch C. Use pandas.DataFrame D. Use a NumPy array
A. Use tf.data.Dataset SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. The tf.data.Dataset allows you to manage a set of complex elements made up of several inner components. It is designed to create efficient input pipelines and to iterate over the data for their processing. These iterations happen in streaming. So, they work even if the input matrix is very large and doesn't fit in memory. B is wrong because it is far more complex, even if it is feasible. C and D are wrong because they work in real memory, so they don't solve the problem at all. For any further detail: https://www.tensorflow.org/api_docs/python/tf/data/Dataset https://www.kaggle.com/jalammar/intro-to-data-input-pipelines-with-tf-data
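The streaming idea behind this answer can be sketched in plain Python. This is not the tf.data API, only an illustration of reading one batch at a time instead of loading everything into memory; the `batches` generator and the small temporary CSV file are hypothetical stand-ins for a real input pipeline.

```python
import csv
import os
import tempfile

def batches(path, batch_size):
    """Yield lists of parsed rows one batch at a time, never loading the whole file."""
    with open(path, newline="") as handle:
        batch = []
        for row in csv.reader(handle):
            batch.append([float(value) for value in row])
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:                      # last, possibly smaller, batch
            yield batch

# Build a small stand-in file; a real dataset would be far larger than RAM.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    writer = csv.writer(tmp)
    for i in range(10):
        writer.writerow([i, i * 2])
    path = tmp.name

sizes = [len(batch) for batch in batches(path, batch_size=4)]
os.remove(path)
print(sizes)
```

Only one batch of rows is held in memory at any moment, which is the property that makes a streaming input pipeline work when the full dataset does not fit in RAM.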
Your client has a large e-commerce Website that sells sports goods and especially scuba diving equipment. It has a seasonal business and has collected many sales data from its structured ERP and market trend databases. It wants to predict the demand of its customers both to increase business and improve logistics processes. Which of the following types of models and techniques should you focus on to obtain results quickly and with minimum effort? A. Custom Tensorflow model with an autoencoder neural network B. Bigquery ML ARIMA C. BigQuery Boosted Tree D. BigQuery Linear regression
B. Bigquery ML ARIMA We need to manage time-series data. BigQuery ML ARIMA_PLUS can manage time-series forecasts. The model automatically handles anomalies, seasonality, and holidays. A is wrong because a custom TensorFlow model needs more time and effort. Moreover, an autoencoder is a type of artificial neural network that is used with unlabeled data (unsupervised learning). The autoencoder learns a compressed representation of the data, so it is an excellent tool for dimensionality reduction: the network is trained to ignore insignificant data ("noise"). That is not our goal here. C is wrong because a Boosted Tree is an ensemble of Decision Trees, so it is not designed for time series. D is wrong because Linear Regression cuts off seasonality. It is not what the customer wants. For any further detail: https://cloud.google.com/bigquery-ml/docs/arima-single-time-series-forecasting-tutorial https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-time-series https://cloud.google.com/bigquery-ml/docs/introduction
TerramEarth is a company that builds heavy equipment for mining and agriculture. It is developing a series of ML models for different activities: manufacturing, procurement, logistics, marketing, customer service and vehicle tracking. TerramEarth uses Google Cloud AI Platform and wants to scale training and inference processes in a managed way. During the maintenance service, snapshots of the various components of the vehicle will be taken. Your new model should be able to determine both the degree of deterioration and any breakages or possible failures. Which kind of technology/model should you advise using? A. Feedforward Neural Network B. Convolutional Neural Network C. Recurrent Neural Network D. Transformers E. Reinforcement Learning F. GAN Generative Adversarial Network
B. Convolutional Neural Network SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. The convolutional neural network (CNN) is a type of artificial neural network extensively used for image recognition and classification. It uses convolutional layers, which process small sets of neighboring pixels by sliding filters over the input image. All the other technologies are not specialized for images. Feedforward neural networks are the classic example of neural networks. In fact, they were the first and most elementary type of artificial neural network. They are mainly used for supervised learning when the data, mainly numerical, to be learned is neither time-series nor sequential (such as NLP). A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. A transformer is a deep learning model that can give different importance to each part of the input data. It is used for NLP (natural language processing) and in computer vision. Reinforcement Learning provides a software agent that evaluates possible solutions through a progressive reward in repeated attempts. It does not need labels. But it requires a lot of data, several trials, and the possibility to evaluate the validity of each attempt. GAN is a special class of machine learning frameworks used to generate new synthetic data, for example realistic facial images. Autoencoder is a neural network that learns a compressed representation of raw data. For any further detail: https://en.wikipedia.org/wiki/Convolutional_neural_network
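To make "running filters on the input pixels" concrete, here is a minimal sketch of a single filter sliding over a tiny image, in plain Python. The 4x4 image, the `vertical_edge` kernel and the `conv2d` helper are illustrative assumptions; a real CNN layer learns many such filters and adds bias terms and activations.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Toy grayscale image: dark on the left (0), bright on the right (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-written filter that responds where brightness increases from left to right.
vertical_edge = [[-1, 1], [-1, 1]]

feature_map = conv2d(image, vertical_edge)
for row in feature_map:
    print(row)
```

The feature map is non-zero only in the column where the dark-to-bright edge sits, which is how a convolutional layer highlights a local pattern wherever it appears in the picture.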
You are a Data Scientist, and you work in a large organization. A fellow programmer, who is working on a project with Dataflow, asked you what GCP techniques support the computer's ability to entertain almost human dialogues and if these techniques can be used individually. Which of the following choices do you think is wrong? A. Speech to Text B. Polly C. Cloud NLP D. Text to Speech E. Speech Synthesis Markup Language (SSML)
B. Polly Amazon Polly is a text-to-speech service from AWS, not GCP. A is OK because Speech to Text can convert voice to written text. C is OK because Cloud Natural Language API can understand aspects of a text such as syntax, sentiment, content and entities, and can classify it. D is OK because Text to Speech can convert written text to voice. E is OK because Speech Synthesis Markup Language (SSML) is not a service but a language used in Text-to-Speech requests. It gives you the ability to indicate how you want to format the audio: pauses, how to read acronyms, dates, times, abbreviations and so on. It is really useful for getting closer to human dialogue. For any further detail: https://cloud.google.com/speech-to-text https://cloud.google.com/text-to-speech/docs/basics https://cloud.google.com/text-to-speech/docs/ssml https://cloud.google.com/natural-language/
You have a Linear Regression model for the optimal management of supplies to a sales network based on a large number of different driving factors. You want to simplify the model to make it more efficient and faster. Your first goal is to synthesize the features without losing the information content that comes from them. Which of these is the best technique? A. Feature Crosses B. Principal component analysis (PCA) C. Embeddings D. Functional Data Analysis
B. Principal component analysis (PCA) SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. Principal component analysis is a technique to reduce the number of features by creating new variables obtained from linear combinations or mixes of the original variables, which can then replace them but retain most of the information useful for the model. The new variables are called principal components. In addition, the principal components are all uncorrelated with each other. A linear model is assumed as a basis. A is incorrect because Feature Crosses combine features to add non-linearity; they create additional features rather than reduce them. C is incorrect because Embeddings, which transform large sparse vectors into smaller vectors, are used for categorical data. D is incorrect because Functional Data Analysis has the goal of coping with complexity, but it is used when it is possible to substitute features with functions, which is not our case. For any further detail: https://developers.google.com/machine-learning/crash-course/embeddings/categorical-input-data https://builtin.com/data-science/step-step-explanation-principal-component-analysis https://en.wikipedia.org/wiki/Principal_component_analysis
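A minimal sketch of the idea, assuming just two features so that the eigenvalues of the covariance matrix can be written in closed form. The `pca_2d` helper and the sample points are hypothetical; real projects would normally use a library implementation.

```python
import math

def pca_2d(points):
    """Return (first component direction, explained variance ratio, 1-D projection)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the 2x2 sample covariance matrix.
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + spread, half_trace - spread
    # Eigenvector belonging to the larger eigenvalue.
    if sxy != 0:
        vx, vy = sxy, lam1 - sxx
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)
    projected = [(x - mean_x) * direction[0] + (y - mean_y) * direction[1] for x, y in points]
    return direction, lam1 / (lam1 + lam2), projected

# Two strongly related features (roughly y = 2x): one component should be enough.
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]
direction, explained, projected = pca_2d(points)
print(direction, round(explained, 4))
```

Here a single principal component keeps almost all of the variance, so the two original features can be replaced by one derived feature with very little loss of information.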
Your team works on a smart city project with wireless sensor networks and a set of gateways for transmitting sensor data. You have to cope with many design choices. You want, for each of the problems under study, to find the simplest solution. For example, it is necessary to decide on the placement of nodes so that the result is the most economical and inclusive. An algorithm without data tagging must be used. Which of the following choices do you think is the most suitable? A. K-means B. Q-learning C. K-Nearest Neighbors D. Support Vector Machine(SVM)
B. Q-learning SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. Q-learning is an RL (Reinforcement Learning) algorithm. RL provides a software agent that evaluates possible solutions through a progressive reward in repeated attempts. It does not need labels. But it requires a lot of data, several trials, and the possibility to evaluate the validity of each attempt. Other well-known RL algorithms are deep Q-network (DQN) and deep deterministic policy gradient (DDPG). A is wrong because K-means is an unsupervised learning algorithm used for clustering problems. It is useful when you have to create similar groups of entities. So, even if there is no need to label data, it is not suitable for our scope. C is wrong because K-NN is a supervised classification algorithm, and therefore needs labeled data. New classifications are made by finding the closest known examples. D is wrong because SVM is a supervised ML algorithm, too. As in K-NN, distances are computed. However, these distances are not between data points but between the points and the hyperplane that best separates the different classes. For any further detail: A Practical Application of K-Nearest Neighbours Analysis I Velocity Business Solutions Limited https://towardsdatascience.com/reinforcement-learning-101-e24b50e1d292
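The reward-driven, label-free loop described above can be sketched with tabular Q-learning on a toy problem. Everything in this example is an assumption made for illustration: a line of five states, a reward of 1 for reaching the last state, random exploration, and the chosen values of `ALPHA` and `GAMMA`. It is not a model of the sensor-placement problem itself.

```python
import random

N_STATES, GOAL = 5, 4          # states 0..4 on a line; reaching state 4 ends the episode
ACTIONS = (-1, +1)             # index 0 moves left, index 1 moves right
ALPHA, GAMMA = 0.5, 0.9        # learning rate and discount factor (illustrative values)

def step(state, action):
    """Environment: move along the line, reward 1.0 only when the goal is reached."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):                       # cap on episode length
            a = rng.randrange(2)                   # explore with random actions
            next_state, reward = step(state, ACTIONS[a])
            # Q-learning update: move Q(s, a) toward reward + discounted best next value.
            target = reward + GAMMA * max(q[next_state])
            q[state][a] += ALPHA * (target - q[state][a])
            state = next_state
            if state == GOAL:
                break
    return q

q_table = train()
# Greedy policy for the non-terminal states: 1 means "move right".
policy = [max((0, 1), key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(policy)
```

No labeled examples are used at any point; the agent only sees rewards from its own attempts and still learns that moving right is the best action in every state.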
You are a junior Data Scientist, and you are being interviewed for a new job. A senior Data Scientist asked you: Which metric for classification models evaluation gives you the percentage of real spam email that was recognized correctly? What was the exact answer to this question? A. Precision B. Recall C. Accuracy D. F-Score
B. Recall SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. Recall indicates how many of the real positives were recalled (found). A is wrong because Precision is the metric that shows the percentage of true positives among all your positive class predictions. C is wrong because Accuracy is the percentage of correct predictions over all predictions. D is wrong because the F-score (F1) is the harmonic mean of precision and recall. For any further detail: https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall https://en.wikipedia.org/wiki/F-score https://en.wikipedia.org/wiki/Precision_and_recall#/media/File:Precisionrecall.svg
Your customer has an online dating platform that, among other things, analyzes the degree of affinity between the various people. Obviously, it already uses ML models and uses, in particular, XGBoost, the gradient boosting decision tree algorithm, and is obtaining excellent results. All its development processes follow CI / CD specifications and use Docker containers. The requirement is to classify users in various ways and update models frequently, based on new parameters entered into the platform by the users themselves. So, the problem you are called to solve is how to optimize frequently re-trained operations with an optimized workflow system. Which solution among these proposals can best solve your needs? A. Deploy the model on BigQuery ML and setup a job B. Use Kubeflow Pipelines to design and execute your workflow C. Use AI Platform D. Orchestrate activities with Google Cloud Workflows E. Develop procedures with Pub/Sub and Cloud Run F. Schedule processes with Cloud Composer
B. Use Kubeflow Pipelines to design and execute your workflow SEE 2 IMAGES IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. Kubeflow Pipelines is the ideal solution because it is a platform designed specifically for creating and deploying ML workflows based on Docker containers. So, it is the only answer that meets all requirements. The main functions of Kubeflow Pipelines are: Using packaged templates in Docker images in a K8s environment Manage your various tests / experiments Simplifying the orchestration of ML pipelines Reuse components and pipelines SEE IMAGE 1 It is within the Kubeflow ecosystem, which is the machine learning toolkit for Kubernetes. SEE IMAGE 2 The other answers may be partially correct but do not resolve all items or need to add more coding. For any further detail: https://www.kubeflow.org/docs/components/pipelines/overview/pipelines-overview/ https://www.kubeflow.org/docs/started/kubeflow-overview/
Your client has an e-commerce site for commercial spare parts for cars with competitive prices. It started with the small car sector but is continually adding products. Since 80% of them operate in a B2B market, he wants to ensure that his customers are encouraged to use the new products that he gradually offers on the site quickly and profitably. Which GCP service can be valuable in this regard and in what way? A. Create a Tensorflow model using Matrix factorization B. Use Recommendations AI C. Import the Product Catalog D. Record / Import User events
B. Use Recommendations AI SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. Recommendations AI is a ready-to-use service for all the requirements shown in the question. You don't need to create models, tune, train, all that is done by the service with your data. Also, the delivery is automatically done, with high-quality recommendations via web, mobile, email. So, it can be used directly on websites during user sessions. A could be OK, but it needs a lot of work. C and D deal only with data management, not creating recommendations. For any further detail: https://cloud.google.com/retail/recommendations-ai/docs/create-models https://cloud.google.com/recommendations
You have an NLP model for your company's Customer Care and Support Office. This model evaluates the general satisfaction of customers on the main categories of services offered and has always provided satisfactory performances. You have recently expanded the range of your services and want to refine / update your model. You also want to activate procedures that automate these processes. Which choices among the following do you prefer in the Cloud GCP? A. You don't need to change anything. If the model is well made and has no overfitting, it will be able to handle anything. B. Retrain using information from the last week of work only. C. Add examples with new product data and still regularly re-train and evaluate new models. D. Make a separate model with new product data and create the model ensemble.
C. Add examples with new product data and still regularly re-train and evaluate new models. Creating and using models is not a one-shot activity. Like most processes, it is an ongoing one, because the underlying factors can vary over time. Therefore, you need to continuously monitor the processes and retrain the model on newer data as well, if you find that the frequency distributions of the data drift from the original configuration. It may also be necessary or desirable to create a new model. Generally, a periodic schedule is adopted, for example every month or every week. For this very reason, all the other answers are not exact. For any further detail: https://cloud.google.com/ai-platform/pipelines/docs https://medium.com/kubeflow/automated-model-retraining-with-kubeflow-pipelines-691a5f211701
Your business makes excellent use of ML models. Many of these were developed with Tensorflow. But lately, you've been making good use of AutoML to make your design work leaner, faster, and more efficient. You are looking for an environment that organizes and manages training, validation and tuning, and updating models with new data, distribution and monitoring in production. Which of these do you think is the best solution? A. Deploy Tensorflow on Kubernetes B. Leverage Kubeflow Pipelines C. Adopt Vertex AI: custom tooling and pipelines D. Migrate all models to BigQueryML with AutoML E. Migrate all models to AutoML Tables
C. Adopt Vertex AI: custom tooling and pipelines SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. Vertex AI combines AutoML, custom models and ML pipeline management through to production. Vertex AI integrates many GCP ML services, especially AutoML and AI Platform, and includes many different tools to help you in every step of the ML workflow. So, Vertex AI offers two strategies for model training: AutoML and custom training. Machine learning operations (MLOps) is the practice of applying DevOps to machine learning (ML). DevOps strategies automate the release of code changes and the control of systems, resulting in greater security and less time to get systems up and running. All the other solutions are suitable for production. But, given these requirements, Vertex AI, with its strong AutoML integration, is the best and the most productive one. For any further detail: https://cloud.google.com/vertex-ai/docs https://cloud.google.com/vertex-ai/docs/pipelines/introduction https://codelabs.developers.google.com/codelabs/vertex-ai-custom-models#1
Your company supplies environmental management services and has a network of sensors that acquire information uploaded to the Cloud to be pre-processed and managed with some ML models with dynamic dashboards used by customers. Periodically, the models are retrained and re-deployed, with a rather complex pipeline on VM clusters: new data is streamed from Dataflow; data is transformed through aggregations and normalizations (z-scores); the model is periodically retrained and evaluated; new Docker images are created and stored. You want to simplify the pipeline as much as possible and use fully managed or even serverless services as far as you can. Which do you choose from the following services? A. Kubeflow B. Platform AI - Vertex AI C. BigQuery and BigQuery ML D. TFX
C. BigQuery and BigQuery ML SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. BigQuery and BigQuery ML are powerful services for data analysis and machine learning. They are fully serverless services that can process petabytes of data in public and private datasets and even data stored in files. BigQuery works with standard SQL and has a command-line tool: bq. You can use BigQuery jobs to automate and schedule tasks and operations. With BigQuery ML, you can train models with a rich set of algorithms on data already stored in the Cloud. You may perform feature engineering and hyperparameter tuning, and export a BigQuery ML model so that it can be served from a Docker container, as required. All other services are useful in ML pipelines, but they aren't that easy and ready to use. Vertex AI is a new API that combines AutoML and AI Platform. You can use both AutoML training and custom training in the same environment. It obviously has a rich set of features for managing ML pipelines. For any further detail: https://cloud.google.com/bigquery-ml/docs/export-model-tutorial https://cloud.google.com/bigquery/docs/jobs-overview https://cloud.google.com/bigquery/
TerramEarth is a company that builds heavy equipment for mining and agriculture. During maintenance services for vehicles produced by TerramEarth at the service centers, information relating to their use is collected together with administrative and billing data. All this information goes through a data pipeline process that you are asked to automate in the fastest and most managed way, possibly without code. Which service do you advise? A. Cloud Dataproc B. Cloud Dataflow C. Cloud Data Fusion D. Cloud Dataprep
C. Cloud Data Fusion SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. Cloud Data Fusion is a managed service for quickly building data pipelines and ETL processes. It is based on the open-source CDAP project and therefore is portable to any environment. It has a visual interface that allows you to create codeless data pipelines as required. A is wrong because Cloud Dataproc is the managed Hadoop service. So, it could manage data pipelines but in a non-serverless and complex way. B is wrong because Dataflow is more complex, too, even though it has more functionality, such as batch and stream data processing with the same code. D is wrong because Cloud Dataprep is for cleaning, exploration and preparation, and is used primarily for ML processes. For any further detail: https://cloud.google.com/data-fusion https://www.youtube.com/watch?v=kehG0CJw2wo
The purpose of your current project is the recognition of genuine or forged signatures on checks and documents against regular signatures already stored by the Bank. There is obviously a very low incidence of fake signatures. The system must recognize which customer the signature belongs to and whether the signature is genuine or a skilled forgery. What kind of ML model do you think is best to use? A. Binary logistic regression B. Matrix Factorization C. Convolutional Neural Networks D. Multiclass logistic regression
C. Convolutional Neural Networks SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. A Convolutional Neural Network is a Deep Neural Network in which the layers are made up of processed sections of the source image. This technique allows you to simplify images and highlight shapes and features regardless of the physical position in which they may be found. For example, if we have the same signature in the center or at the bottom right of an image, the object will be different. But the signature is the same. A neural network that compares these derived features and can simplify the model achieves the best results. A is wrong because Binary logistic regression deals with a classification problem that may result in true or false, like with spam emails. The issue here is far more complex. B is wrong because Matrix Factorization is used in recommender systems, like movies on Netflix. It is based on a user-item (movie) interaction matrix and the problem of reducing dimensionality. D is not exact because Multiclass logistic regression deals with a classification problem with multiple solutions, fixed and finite classes. It is an extension of binary logistic regression with basically the same principles with the assumption of several independent variables. But in image recognition problems, the best results are achieved with CNN because they are capable of finding and relating patterns positioned in different ways on the images. For any further detail: Convolutional Neural Networks — A Beginner's Guide | by Krut Patel https://research.google.com/pubs/archive/42455.pdf
You are a junior Data Scientist. You need to create a multi-class classification Machine Learning model with Keras Sequential model API. In particular, your model must indicate the main categories of a text. Which of the following techniques should not be used? A. Feedforward Neural Network B. N-grams for tokenize text C. K-means D. Softmax function E. Pre-trained embeddings F. Dropout layer G. Categorical cross-entropy
C. K-means SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. The answers identify the main techniques to be used for a multi-class classification Machine Learning model. For more details, see the step-by-step example. The only unsuitable element is K-means clustering, one of the most popular unsupervised machine learning algorithms. Therefore, it is out of this scope. A is OK because Feedforward Neural Network is a kind of DNN, widely used for many applications. B is OK because N-grams, contiguous sequences of items (usually words), are used to tokenize text in NLP. D is OK because Softmax is an activation function for multi-class classification. E is OK because embeddings are used for reducing high-dimensional tensors, so categories, too. F is OK because the Dropout layer is used for regularization: it randomly drops units during training. G is OK because categorical cross-entropy is a loss function for multi-class classification. For any further detail: https://developers.google.com/machine-learning/guides/text-classification/ https://en.wikipedia.org/wiki/N-gram https://en.wikipedia.org/wiki/K-means_clustering https://en.wikipedia.org/wiki/Multilayer_perceptron https://developers.google.com/machine-learning/crash-course/images/RegularizationTwoLossFunctions.svg
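How Softmax (D) and categorical cross-entropy (G) fit together can be sketched in a few lines of plain Python. The logits and the one-hot target below are made-up values for a three-category text classifier; Keras provides both functions, so this is only meant to show what they compute.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot_target, probs):
    """Loss is the negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))

# Hypothetical output of the last dense layer for one text, three categories.
probs = softmax([2.0, 1.0, 0.1])
loss = categorical_cross_entropy([1, 0, 0], probs)
print([round(p, 3) for p in probs], round(loss, 3))
```

The loss shrinks toward 0 as the probability assigned to the correct category approaches 1, which is what training a multi-class classifier tries to achieve.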
You are starting to operate as a Data Scientist and are working on a price optimization model for products with a lot of categorical features. You don't know how to deal with them. Your manager told you that you had to encode them in a limited set of numbers. Which of the following methods will not help you with this task? A. Ordinal Encoding B. One-Hot Encoding C. Sigmoids D. Embeddings E. Feature Crosses
C. Sigmoids Sigmoids are the most common activation functions (logistic function) for binary classification. They have nothing to do with categorical variables. A is OK for categories because Ordinal encoding simply maps each unique category to an integer. B is OK for categories because One-hot encoding creates a sparse matrix with values (0 and 1, see the picture) that indicate the presence (or absence) of each possible value. D is OK for categories because Embeddings are often used with texts and in Natural Language Processing (NLP) and address the problem of complex categories linked together. E is OK for categories because a Feature cross is a new feature created by joining or multiplying multiple variables to add further predictive capabilities, such as transforming the geographic location of properties into a region of interest. For any further detail: https://developers.google.com/machine-learning/crash-course/embeddings/categorical-input-data https://developers.google.com/machine-learning/crash-course/feature-crosses/crossing-one-hot-vectors https://www.kaggle.com/alexisbcook/categorical-variables
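A small sketch of options A and B in plain Python, assuming a hypothetical `categories` list for one product feature. Libraries such as scikit-learn or Keras offer ready-made encoders; the helpers here only illustrate the mappings.

```python
# Hypothetical categorical feature with three possible values.
categories = ["bronze", "silver", "gold"]

# Ordinal encoding: each unique category is mapped to an integer.
ordinal = {name: index for index, name in enumerate(categories)}

def one_hot(name):
    """One-hot encoding: a vector with a single 1 marking the category."""
    return [1 if name == category else 0 for category in categories]

print(ordinal["gold"], one_hot("silver"))
```

Ordinal encoding keeps a single column but implies an order between the values, while one-hot encoding avoids that assumption at the cost of one column per category.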
An industrial company wants to improve its quality system. It has developed its own deep neural network model with Tensorflow to identify the semi-finished products to be discarded with images taken from the production lines in the various production phases. During training, your custom model converges, but the tests are giving unsatisfactory results. What do you think might be the problem, and how could you proceed to fix it (pick 3)? A. You have used too few examples, you need to re-train with a larger set of images B. You have to change the type of algorithm and use XGBoost C. You have an overfitting problem D. Decrease your Learning Rate hyperparameter E. the model is too complex, you have to regularize the model and then make it simpler F. Use L2 Ridge Regression
C. You have an overfitting problem E. the model is too complex, you have to regularize the model and then make it simpler F. Use L2 Ridge Regression SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. A is wrong because when you have a different trend between training and validation, you have an overfitting problem. More data may help you, but you have to simplify the model first. B is wrong because the problem is not with the algorithm but is within feature management. D is wrong because decreasing the Learning Rate hyperparameter is useless. The model converges in training. For any further detail: https://developers.google.com/machine-learning/crash-course/regularization-for-simplicity/l2-regularization https://developers.google.com/machine-learning/crash-course/images/RegularizationTwoLossFunctions.svg
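The effect of L2 regularization can be illustrated on the simplest possible case, a one-weight linear model without intercept. For the loss sum((y - w*x)^2) + l2 * w^2 the minimizing weight is sum(x*y) / (sum(x^2) + l2). The data points and the `ridge_weight` helper are made up for this sketch; a deep network applies the same kind of penalty to all of its weights.

```python
def ridge_weight(xs, ys, l2):
    """Closed-form minimizer of sum((y - w*x)^2) + l2 * w^2 for a single weight w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2)

# Made-up training data, roughly y = 2x.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

weights = [ridge_weight(xs, ys, l2) for l2 in (0.0, 10.0, 100.0)]
print([round(w, 3) for w in weights])
```

A larger penalty pulls the weight toward zero, which is how L2 regularization keeps a model simpler and less prone to fitting noise in the training set.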
In your company you use Tensorflow and Keras as main libraries for Machine Learning and your data is stored in disk files, so block storage. Recently there has been the migration of all the management computing systems to Google Cloud and management has requested that the files should be stored in Cloud Storage and that the tabular data should be stored in BigQuery and pre-processed with Dataflow. Which of the following techniques is NOT suitable for accessing tabular data as required? A. BigQuery Python client library B. BigQuery I/O Connector C. tf.data.Iterator D. tf.data.dataset reader
C. tf.data.Iterator tf.data.Iterator only enumerates the elements of an existing Dataset through the TensorFlow API, so it is not a way to access tabular data. Option A is wrong because the BigQuery Python client library lets you load BigQuery data into pandas, so it is definitely useful in this environment. Option B is wrong because the BigQuery I/O Connector is used by Dataflow (Apache Beam) to read data for transformation and processing, as required. Option D is wrong because the tf.data.dataset reader for BigQuery is how TensorFlow first accesses the data. For any further detail: https://cloud.google.com/architecture/ml-on-gcp-best-practices#store-tabular-data-in-bigquery https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas https://beam.apache.org/documentation/io/built-in/google-bigquery/
You are working on a deep neural network model with TensorFlow. Your model is complex, and you work with very large datasets full of numbers. You want to improve performance, but you cannot use further resources. You are afraid that you will not deliver your project in time. Your mentor told you that normalization could be a solution. Which of the following choices do you think is not for data normalization? A. Scaling to a range B. Feature Clipping C. z-test D. log scaling E. z-score
C. z-test SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. z-test is not a normalization technique: it is a statistical test used to determine whether a sample mean belongs to a specific population. For example, it is used in medical trials to prove whether a new drug is effective or not. A is OK because Scaling to a range converts numbers into a standard range (0 to 1 or -1 to 1). B is OK because Feature Clipping caps all numbers outside a certain range. D is OK because Log Scaling replaces your values with their logarithms to change the shape of the distribution. This is possible because the log function preserves monotonicity. E is OK because Z-score is a variation of scaling: the mean is subtracted from each value and the result is divided by the standard deviation. It is aimed at obtaining distributions with mean = 0 and std = 1. For any further detail: https://developers.google.com/machine-learning/data-prep/transform/transform-numeric https://developers.google.com/machine-learning/crash-course/images/RegularizationTwoLossFunctions.svg
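The four normalization techniques named in this answer can be sketched in a few lines of plain Python. The function names and sample values below are assumptions chosen for illustration, and the sketch assumes the input values are not all identical (otherwise the range and the standard deviation would be zero).

```python
import math
import statistics


def scale_to_range(xs, lo=0.0, hi=1.0):
    """Linearly map values from [min, max] into [lo, hi]."""
    x_min, x_max = min(xs), max(xs)
    return [lo + (x - x_min) * (hi - lo) / (x_max - x_min) for x in xs]


def clip(xs, lo, hi):
    """Cap every value below lo at lo and every value above hi at hi."""
    return [min(max(x, lo), hi) for x in xs]


def log_scale(xs):
    """Replace each (positive) value with its natural logarithm."""
    return [math.log(x) for x in xs]


def z_score(xs):
    """Subtract the mean, then divide by the (population) standard deviation."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]


values = [1.0, 10.0, 100.0, 1000.0]  # hypothetical, heavily skewed feature

scaled = scale_to_range(values)
clipped = clip(values, 5.0, 500.0)
logged = log_scale(values)
standardized = z_score(values)
```

On this skewed sample, log scaling spaces the values evenly, while z-score produces a list with mean 0 and standard deviation 1.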
Your company runs an e-commerce site. You produced static deep learning models with Tensorflow that process Analytics-360 data. They have been in production for some time. Initially, they gave you excellent results, but gradually, the accuracy has progressively decreased. You retrained the models with the new data and solved the problem. At this point, you want to automate the process using the Google Cloud environment. Which of these solutions allows you to quickly reach your goal? A. Cluster Compute Engine and KubeFlow B. GKE and TFX C. GKE and KubeFlow D. AI Platform and TFX
D. AI Platform and TFX SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. TFX is a platform for creating scalable production ML pipelines for TensorFlow projects, and its pipelines can run on Kubeflow. It therefore lets you manage the entire life cycle seamlessly, from modeling, training and validation up to production start-up and management of the inference service. AI Platform manages TFX under AI Platform Pipelines: you configure a cluster, select basic parameters and click Create, and Kubeflow and Kubernetes are launched for you. All the other answers would work, but they are not optimal for a quick and managed solution. For any further detail: https://cloud.google.com/ai-platform/pipelines/docs https://developers.google.com/machine-learning/crash-course/production-ml-systems https://www.tensorflow.org/tfx/guide https://www.youtube.com/watch?v=Mxk4qmO_1B4
Your company operates an innovative auction site for furniture from all periods. You have to create a series of ML models that allow you, starting from the photos, to establish the period, style and type of the piece of furniture depicted. Furthermore, the model must be able to determine whether the furniture is interesting and should be subject to a more detailed estimate. You want Google Cloud to help you reach this ambitious goal faster. Which of the following services do you think is the most suitable? A. AutoML Vision Edge B. Vision AI C. Video AI D. AutoML Vision
D. AutoML Vision SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. Vision AI uses models pre-trained by Google. This is powerful, but not enough here. AutoML Vision, instead, lets you train models to classify your images with your own characteristics and labels. So, you can tailor your work as you want. A is wrong because AutoML Vision Edge is for local devices. C is wrong because Video AI manages videos, not pictures. It can extract metadata from any streaming video, get insights in a far shorter time, and trigger events. For any further detail: https://cloud.google.com/vision/automl/docs/edge-quickstart https://cloud.google.com/vision/automl/docs/beginners-guide https://cloud.google.com/natural-language/ https://cloud.google.com/automl https://www.youtube.com/watch?v=hUzODH3uGg0
You are working on an NLP model. So, you are dealing with words and sentences, not numbers. Your problem is to categorize these words and make sense of them. Your manager told you that you have to use embeddings. Which of the following techniques are not related to embeddings? A. Count Vector B. TF-IDF Vector C. Co-Occurrence Matrix D. CoVariance Matrix
D. CoVariance Matrix SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. A covariance matrix is a square matrix containing the covariance between each pair of elements. It measures how much two variables change together, so it is not related to embeddings. All the others are embeddings: A Count Vector gives a matrix with the count of every single word in every example (0 if there is no occurrence). It is okay for small vocabularies. TF-IDF vectorization weighs word counts across the entire corpus, not just within a single example or sentence. A Co-Occurrence Matrix records which words occur together, so it is more useful for text understanding. For any further detail: https://developers.google.com/machine-learning/crash-course/embeddings/categorical-input-data https://developers.google.com/machine-learning/crash-course/feature-crosses/crossing-one-hot-vectors https://www.wikiwand.com/en/Covariance_matrix https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/ https://towardsdatascience.com/5-things-you-should-know-about-covariance-26b12a0516f1
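The three frequency-based representations named in this answer can be sketched in plain Python as follows. The toy corpus and function names are hypothetical, and TF-IDF has several variants; the one below (term frequency times log of inverse document frequency, with each whole example used as the co-occurrence window) is just one common choice, not necessarily the one any particular library uses.

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the cat ran"]  # hypothetical corpus
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})  # cat, dog, ran, sat, the


def count_vectors(tokenized, vocab):
    """One row per example: how many times each vocabulary word occurs."""
    return [[Counter(doc)[w] for w in vocab] for doc in tokenized]


def tf_idf_vectors(tokenized, vocab):
    """Term frequency in the example, scaled down for words common in the corpus."""
    n_docs = len(tokenized)
    doc_freq = {w: sum(1 for doc in tokenized if w in doc) for w in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append(
            [(counts[w] / len(doc)) * math.log(n_docs / doc_freq[w]) for w in vocab]
        )
    return vectors


def co_occurrence(tokenized, vocab):
    """Symmetric matrix: number of examples in which two distinct words both appear."""
    index = {w: i for i, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for doc in tokenized:
        words = set(doc)
        for a in words:
            for b in words:
                if a != b:
                    matrix[index[a]][index[b]] += 1
    return matrix


counts = count_vectors(tokenized, vocab)
tfidf = tf_idf_vectors(tokenized, vocab)
cooc = co_occurrence(tokenized, vocab)
```

Note how "the", which appears in every example, gets a TF-IDF weight of zero, while the raw count vector treats it like any other word.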
You work for a digital publishing website with an excellent technical and cultural level, where you have both famous authors and unknown experts who express ideas and insights. You, therefore, have an extremely demanding audience with strong interests of various types. Users have a small set of articles that they can read for free every month; beyond that, they need to sign up for a paid subscription. You aim to provide your audience with pointers to articles that they will indeed find of interest to themselves. Which of these models can be useful to you? A. Hierarchical Clustering B. Autoencoder and self-encoder C. Convolutional Neural Network D. Collaborative filtering using Matrix Factorization
D. Collaborative filtering using Matrix Factorization SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. Collaborative filtering works on the idea that a user may like the same things as people with similar profiles and preferences. So, by exploiting the choices of other users, the recommendation system makes a guess and can advise people on things they have not yet rated. A is wrong because Hierarchical Clustering creates clusters using a hierarchical tree. It may be effective, but it is heavy with lots of data, like in our example. B is wrong because Autoencoder and self-encoder are useful when you need to reduce the number of variables under consideration for the model, that is, for dimensionality reduction. C is wrong because Convolutional Neural Network is used for image classification. For any further detail: https://en.wikipedia.org/wiki/Collaborative_filtering https://www.youtube.com/playlist?list=PLQY2H8rRoyvy2MiyUBz5RWZr5MPFkV3qz
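A minimal sketch of collaborative filtering with matrix factorization, assuming a tiny made-up ratings table and plain stochastic gradient descent: each user and each item gets a short vector of latent factors, and the predicted rating is their dot product. The function names, hyperparameters (k, steps, lr, reg) and data are all illustrative assumptions, not a production recommender or a specific library's API.

```python
import random

# (user, item, rating) triples; hypothetical data. User 0 has not rated item 3.
ratings = [
    (0, 0, 5.0), (0, 1, 4.0), (0, 2, 1.0),
    (1, 0, 4.0), (1, 1, 5.0), (1, 3, 1.0),
    (2, 0, 1.0), (2, 2, 5.0), (2, 3, 4.0),
    (3, 1, 1.0), (3, 2, 4.0), (3, 3, 5.0),
]


def predict(U, V, u, i):
    """Predicted rating = dot product of user and item factor vectors."""
    return sum(a * b for a, b in zip(U[u], V[i]))


def factorize(ratings, n_users, n_items, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Learn user factors U and item factors V by SGD on squared error with L2."""
    rng = random.Random(seed)
    U = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            err = r - predict(U, V, u, i)
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V


def mse(U, V, ratings):
    """Mean squared error over the known ratings."""
    return sum((r - predict(U, V, u, i)) ** 2 for u, i, r in ratings) / len(ratings)


U, V = factorize(ratings, n_users=4, n_items=4)
guess = predict(U, V, 0, 3)  # estimate for an article user 0 has not rated yet
```

The value of guess is what the system would use to decide whether to recommend item 3 to user 0, based only on how other users rated the same items.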
You need to develop and train a model capable of analyzing snapshots taken from a moving vehicle and detecting if obstacles arise. Your work environment is an AI Platform (currently Vertex AI). Which technique or algorithm do you think is best to use? A. TabNet algorithm with TensorFlow B. A linear learner with Tensorflow Estimator API C. XGBoost with BigQueryML D. TensorFlow Object Detection API
D. TensorFlow Object Detection API SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. TensorFlow Object Detection API is designed to identify and localize multiple objects within an image. So it is the best solution. A is wrong because TabNet is used with tabular data, not images. It is a neural network that chooses the best features at each decision step, which makes the model simpler to optimize. B is wrong because a linear learner is not suitable for images either. It can be applied to regression and classification predictions. C is wrong because BigQueryML is designed for structured data, not images. For any further detail: https://github.com/tensorflow/models/tree/master/research/object_detection https://cloud.google.com/ai-platform/training/docs/algorithms https://cloud.google.com/ai-platform/training/docs https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/img/kites_detections_output.jpg
You have an ML model designed for an industrial company that provides the correct price to buy goods based on a series of elements, such as the quantity requested, the level of quality and other specific variables for different types of products. You have built a linear regression model that works well but whose performance you want to optimize. Which of these techniques could you use? A. Clipping B. Log scaling C. Z-score D. Scaling to a range E. All of them
E. All of them SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. Feature clipping caps outliers that are too high or too low at a fixed limit. Scaling means transforming feature values into a standard range, from 0 to 1 or sometimes -1 to +1. It works well when values are spread fairly evenly between minimum and maximum. When you don't have a fairly uniform distribution, you can instead use Log Scaling, which compresses the data range: x' = log(x). Z-Score is similar to scaling, but uses the deviation from the mean divided by the standard deviation, which is the classic index of variability. So, it gives how many standard deviations each value is away from the mean. All these methods preserve the relative ordering of values but limit their range, so the computation is lighter.
Your company does not have an excellent ML experience. They want to start with a service that is as smooth, simple and managed as possible. The idea is to use BigQuery ML. Therefore, you are considering whether it can cover all the functionality you need. Which of the following features are not present in BigQuery ML natively? A. Exploratory data analysis B. Feature selection C. Model building D. Training E. Hyperparameter tuning F. Automatic deployment and serving
F. Automatic deployment and serving SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. BigQuery is perfect for analytics, so exploratory data analysis and feature selection are simple to perform with the power of SQL and the ability to query petabytes of data. BigQuery ML offers all the other features except automatic deployment and serving. BigQuery ML can simply export a model (packaged in a container image) to Cloud Storage. For any further detail: https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-export-model https://cloud.google.com/blog/products/data-analytics/automl-tables-now-generally-available-bigquery-ml
You work for a video game company. Your management came up with the idea of creating a game in which the characteristics of the characters were taken from those of the human players. You have been asked to generate not only the avatars but also various visual expressions during the game actions. Which kind of technology/model should you advise using? A. Feedforward Neural Network B. Convolutional Neural Network C. Recurrent Neural Network D. Transformers E. Reinforcement Learning F. GAN Generative Adversarial Network
F. GAN Generative Adversarial Network SEE IMAGE IN "Images - PRACTICE EXAM 2 - Web-found ML Eng Questions" doc in the "Images for Quizlet Questions" folder. GAN is a special class of machine learning frameworks used for the automatic generation of facial images. GAN can create new characters from the provided images. It is also used with photographs and can generate new photos that look authentic. It is a kind of model highly specialized for this task. So, it is the best solution. Feedforward neural networks are the classic example of neural networks. In fact, they were the first and most elementary type of artificial neural network. Feedforward neural networks are mainly used for supervised learning when the data to be learned, mainly numerical, is neither time-series nor sequential (such as NLP). The convolutional neural network (CNN) is a type of artificial neural network extensively used for image recognition and classification. It uses convolutional layers, which rework sets of pixels by running filters on the input pixels. A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. A transformer is a deep learning model that can give different importance to each part of the input data. It is used for NLP (natural language processing) and in computer vision. Reinforcement Learning provides a software agent that evaluates possible solutions through a progressive reward in repeated attempts. It does not need labels, but it requires a lot of data and several trials, and the possibility to evaluate the validity of each attempt. Autoencoder is a neural network that learns a compressed representation of raw data. For any further detail: https://en.wikipedia.org/wiki/Generative_adversarial_network https://developer.nvidia.com/blog/photo-editing-generative-adversarial-networks-2/