Databricks Certified Machine Learning Associate


MACHINE LEARNING BASIC PART EXAM A data scientist has replaced missing values in their feature set with each respective feature variable's median value. A colleague suggests that the data scientist is throwing away valuable information by doing this. Which of the following approaches can they take to include as much information as possible in the feature set? A. Create a binary feature variable for each feature that contained missing values, indicating whether each row's value has been imputed. B. Refrain from imputing the missing values in favor of letting the machine learning algorithm determine how to handle them. C. Create a constant feature variable for each feature that contained missing values, indicating the percentage of rows from the feature that was originally missing. D. Impute the missing values using each respective feature variable's mean value instead of the median value. E. Remove all feature variables that originally contained missing values.

A. Create a binary feature variable for each feature that contained missing values, indicating whether each row's value has been imputed. The binary indicator preserves the information about which values were originally missing while still letting the model use the imputed median values, so nothing is thrown away. Refraining from imputation (B) is not a general solution: not all machine learning algorithms support missing values, although some are robust to them — k-NN can ignore a column in its distance measure when a value is missing, and naive Bayes can also make predictions with missing values — so only those algorithms could be used on a dataset that still contains nulls.

SPARK ML PART EXAM A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column price is greater than 0. Which of the following code blocks will accomplish this task? A. spark_df.filter(col("price") > 0) B. SELECT * FROM spark_df WHERE price > 0 C. spark_df.loc[spark_df["price"] > 0, :] D. spark_df[spark_df["price"] > 0] E. spark_df.loc[:, spark_df["price"] > 0]

A or D — both accomplish the task in PySpark. spark_df.filter(col("price") > 0) is the canonical DataFrame API approach, and spark_df[spark_df["price"] > 0] also works because the indexing operator accepts a Column condition and returns a new DataFrame containing only the rows where price is greater than 0. Options C and E use the pandas .loc accessor, which Spark DataFrames do not support, and option B is a SQL statement rather than a runnable code block.
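A minimal PySpark sketch of the two working approaches (the DataFrame and column names follow the question; the sample rows are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame([(1, 10.0), (2, 0.0), (3, 250.0)], ["hotel_room_id", "price"])

# Canonical approach: filter with a Column expression
positive_prices = spark_df.filter(col("price") > 0)

# Equivalent: the indexing operator also accepts a Column condition
positive_prices_alt = spark_df[spark_df["price"] > 0]

positive_prices.show()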

MACHINE LEARNING BASIC PART EXAM Which of the following evaluation metrics is not suitable for evaluating a run in an AutoML experiment for a regression problem? A. F1 B. R-Squared C. RMSE D. MAE E. MSE

A. F1 F1 is not a suitable evaluation metric for a regression problem. F1 is a metric that is used to evaluate the performance of a classifier, and it is not appropriate to use in the context of a regression problem. R-squared, RMSE, MAE, and MSE are all suitable evaluation metrics for a regression problem. R-squared is a measure of the goodness of fit of a regression model, RMSE is a measure of the average error of the model, MAE is a measure of the average absolute error of the model, and MSE is a measure of the average squared error of the model.

ML WORKFLOW PART EXAM A data scientist provides a machine learning engineering team with three notebooks for a machine learning pipeline: Notebook A, Notebook B, and Notebook C. Notebook A and Notebook B perform feature engineering. Notebook C, which requires Notebook A and Notebook B to finish running successfully before it can begin, trains a series of models. Notebooks A and B do not affect each other in any way. Which of the following approaches can the machine learning engineering team take to orchestrate the pipeline to run as quickly and reliably as possible using Databricks? A. They can set up a three-task job where each task runs a notebook, the first two tasks run in parallel, and the final task depends on the previous two tasks completing. B. They can set up a single-task job where an orchestration notebook runs each of the three notebooks sequentially. C. They can set up a three-task job where each task runs a notebook and each task dep

A. They can set up a three-task job where each task runs a notebook, the first two tasks run in parallel, and the final task depends on the previous two tasks completing. This is correct: because Notebooks A and B do not depend on each other, they can run in parallel, and once both have completed, Notebook C runs.

SPARK ML PART EXAM A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model. The Spark DataFrame train_df has the following schema: hotel_room_id STRING, price DOUBLE, features UDT The machine learning engineer shares the following code block: lr = LinearRegression(featuresCol="features", labelCol="price") lr_model = lr.fit(train_df) Which of the following changes does the machine learning engineer need to make to complete the task? A. They need to convert the features column to be a vector. B. They need to call the transform method on train_df. C. They do not need to make any changes. D. They need to split the features column out into one column for each feature. E. They need to utilize a Pipeline to fit the model.

A. They need to convert the features column to be a vector. Spark ML's LinearRegression expects featuresCol to be a single column of Vector type, so if the features are stored in another format (for example an array or separate numeric columns), they must first be converted or assembled into a vector.
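A hedged sketch of assembling a features vector before fitting; the feature column names ("rooms" and "rating") are illustrative assumptions, not from the question:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Assemble the individual numeric columns into a single vector column
assembler = VectorAssembler(inputCols=["rooms", "rating"], outputCol="features_vec")
assembled_df = assembler.transform(train_df)

lr = LinearRegression(featuresCol="features_vec", labelCol="price")
lr_model = lr.fit(assembled_df)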

SPARK ML PART EXAM A machine learning engineer has created a Feature Table new_table using the Feature Store client. When creating the table, they specified a metadata description with key information about the Feature Table. They now want to retrieve that metadata programmatically. Which of the following lines of code will return the metadata description? A. fs.get_table("new_table").description B. fs.get_table("new_table").load_df() C. fs.create_training_set("new_table") D. fs.get_table("new_table") E. There is no way to return the metadata description programmatically

A. fs.get_table("new_table").description This is the correct line of code to retrieve the metadata description of the new_table Feature Table. The fs object is an instance of the Feature Store client, get_table() retrieves the metadata object for a specific table from the Feature Store, and its description attribute holds the description supplied when the table was created.
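A short sketch of this lookup, assuming the Databricks Feature Store client that ships with the ML runtime and the table name from the question:

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()
# get_table() returns the table's metadata object; its description attribute
# holds the free-text description supplied when the table was created.
print(fs.get_table("new_table").description)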

ML WORKFLOW PART EXAM A data scientist is utilizing MLflow autologging to automatically track their machine learning experiments. After completing a series of runs for the experiment experiment_id, the data scientist wants to identify the run_id of the run with the best root-mean-squared error (RMSE). Which of the following lines of code can be used to identify the run_id of the run with the best RMSE in experiment_id? A. mlflow.search_runs( experiment_id, order_by = ["metrics.rmse DESC"] )["run_id"][0] B. mlflow.best_run( experiment_id, order_by = ["metrics.rmse"] ) C. mlflow.best_run( experiment_id, order_by = ["metrics.rmse DESC"] ) D. There is no way to programmatically identify the best run from an MLflow experiment. E. mlflow.search_runs( experiment_id, order_by = ["metrics.rmse"] )["run_id"][0]

E. mlflow.search_runs( experiment_id, order_by = ["metrics.rmse"] )["run_id"][0] Because a lower RMSE is better, the runs should be sorted in ascending order of metrics.rmse (ascending is the default when no direction is specified), so the first row of the DataFrame returned by search_runs is the best run and ["run_id"][0] retrieves its run_id. Option A sorts in descending order, which would return the run with the worst RMSE, and mlflow.best_run (options B and C) is not an MLflow operation.
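A minimal sketch of the lookup, assuming the runs in experiment_id logged an rmse metric:

import mlflow

# Sort ascending on RMSE (lower is better); the first row is then the best run.
runs_df = mlflow.search_runs([experiment_id], order_by=["metrics.rmse ASC"])
best_run_id = runs_df["run_id"][0]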

SPARK ML PART EXAM A data scientist has defined a pandas UDF predict to parallelize the inference process for a single-node model: @pandas_udf("double") def predict(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.Series]: model_path = f"runs:/{run.info.run_id}/model" model = mlflow.sklearn.load_model(model_path) for features in iterator: pdf = pd.concat(features, axis=1) yield pd.Series(model.predict(pdf)) They have written the following incomplete code block to use predict to score each record of the Spark DataFrame spark_df: prediction_df = spark_df.withColumn( "prediction", __________ ) Which of the following lines of code can be used to complete the code block and successfully complete the task? A. predict(*spark_df.columns) B. mapInPandas(predict(spark_df.columns)) C. predict(Iterator(spark_df)) D. predict(spark_df.columns) E. mapInPandas(predict)

A. predict(*spark_df.columns) This is correct: the pandas UDF is invoked like a regular column expression, with each column of spark_df passed as a separate argument, so unpacking the column list with * supplies every feature column to the UDF. C. predict(Iterator(spark_df)) is incorrect because Spark feeds the UDF batches of pandas data automatically; the function is not called with an iterator of Spark DataFrames.
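A hedged, more complete sketch of this pattern; run_id, spark_df, and the logged scikit-learn model are assumed from the surrounding context:

from typing import Iterator, Tuple
import pandas as pd
import mlflow
from pyspark.sql.functions import pandas_udf

model_path = f"runs:/{run_id}/model"  # assumes run_id points at a logged sklearn model

@pandas_udf("double")
def predict(iterator: Iterator[Tuple[pd.Series, ...]]) -> Iterator[pd.Series]:
    # Load the model once per task, then score each incoming batch of columns
    model = mlflow.sklearn.load_model(model_path)
    for features in iterator:
        pdf = pd.concat(features, axis=1)
        yield pd.Series(model.predict(pdf))

prediction_df = spark_df.withColumn("prediction", predict(*spark_df.columns))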

ML WORKFLOW PART EXAM A data scientist is developing a machine learning pipeline using AutoML on Databricks Machine Learning. Which of the following steps will the data scientist need to perform outside of the AutoML experiment? A. Model tuning B. Model deployment C. Model evaluation D. Exploratory data analysis E. Model training

B. Model deployment Model deployment refers to the process of taking a trained machine learning model and making it available for use in a production environment, such as a website or application. This typically involves creating a server or service that can load the model and use it to make predictions, as well as handling any necessary scaling, monitoring, or maintenance tasks. Model deployment is typically performed outside of the AutoML experiment, as it involves setting up and managing the infrastructure and systems needed to serve the model in production.

ML WORKFLOW PART EXAM In PySpark, ____ library is provided, which makes integrating Python with Apache Spark easy. A. Py5j B. Py4j C. Py3j D. Py2j

B. Py4j This is the correct answer. In PySpark, the Py4J library is provided, which lets the Python driver program communicate with JVM objects and makes integrating Python with Apache Spark easy.

SCALING ML MODELS EXAM PART A data scientist has written a data cleaning notebook that utilizes the pandas library, but a colleague has suggested that they refactor their notebook to scale with big data. Which of the following approaches can the data scientist take to scale with big data? A. They can refactor their notebook to use Spark SQL B. They can refactor their notebook to utilize the pandas API on Spark C. They can refactor their notebook to process the data in parallel D. They can refactor their notebook to use the Scala Dataset API E. They can refactor their notebook to use the PySpark DataFrame API

B. They can refactor their notebook to utilize the pandas API on Spark. The pandas API on Spark exposes the familiar pandas functions for reading, manipulating, and summarizing data, but executes them on Spark DataFrames distributed across the cluster. Because the existing notebook already uses pandas, refactoring it to the pandas API on Spark requires minimal code changes while allowing it to scale to data that no longer fits in memory on a single machine.
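A small sketch of the refactor; the CSV path and the amount column are hypothetical:

import pyspark.pandas as ps

# pandas-style syntax, executed on Spark under the hood
psdf = ps.read_csv("/data/transactions.csv")
psdf["amount"] = psdf["amount"].fillna(psdf["amount"].mean())
print(psdf.describe())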

ML WORKFLOW PART EXAM Which of the following approaches can be used to determine the exact date and time that an MLflow run was executed? A. There is no way to determine the exact date and time that an MLflow run was executed. B. Viewing the "Start Time" value in the MLflow experiment page. C. Viewing the "Date" value in the MLflow run page. D. Viewing the "Duration" value in the MLflow run page. E. Viewing the "Duration" value in the MLflow experiment page.

B. Viewing the "Start Time" value in the MLflow experiment page. This approach works because the "Start Time" value on the experiment page shows the date and time when each run was started. The other options do not provide this information.

MACHINE LEARNING BASIC PART EXAM A data scientist wants to efficiently tune the hyperparameters of a scikit-learn model. They elect to use the Hyperopt library's fmin operation to facilitate this process. Unfortunately, the final model is not very accurate. The data scientist suspects that there is an issue with the objective_function being passed as an argument to fmin. They use the following code block to create the objective_function: Which of the following changes does the data scientist need to make to their objective_function in order to produce a more accurate model? A. Add a test set validation process B. Remove the mean operation that is wrapping the cross_val_score operation C. Replace the r2 return value with -r2 D. Replace the fmin operation with the fmax operation E. Add a random_state argument to the RandomForestRegressor operation

C. Replace the r2 return value with -r2. Hyperopt's fmin minimizes the value returned by the objective function, but R-squared is a score that should be maximized, so returning r2 directly makes the search converge toward the worst hyperparameter settings. Returning -r2 makes minimizing the objective equivalent to maximizing R-squared, which lets fmin find the most accurate model.

MACHINE LEARNING BASIC PART EXAM A data scientist has created two linear regression models. The first model uses price as its label variable and the second model uses log(price) as its label variable. When evaluating the RMSE of each model by comparing the label predictions to the actual price values, the data scientist notices that the RMSE for the second model is much larger than the RMSE of the first model. Which of the following possible explanations for this difference is invalid? A. The second model is much more accurate than the first model. B. The data scientist failed to take the log of the predictions in the first model prior to computing the RMSE C. The data scientist failed to exponentiate the predictions in the second model prior to computing the RMSE D. The RMSE is an invalid evaluation metric for regression problems E. The first model is much more accurate than the second model.

A. The second model is much more accurate than the first model. The question asks for the invalid explanation: a more accurate second model would be expected to produce a smaller RMSE, not a larger one, so this cannot explain the difference. C and E are both valid explanations — failing to exponentiate the second model's log-scale predictions before computing the RMSE (C) would inflate its error against the raw price values, and the first model simply being more accurate (E) would also leave the second model with the larger RMSE.

ML WORKFLOW PART EXAM Which of the following statements describes a Spark ML Estimator? A. An estimator is an evaluation tool to assess the quality of a model B. An estimator is a trained ML model which turns a DataFrame with features into a DataFrame with predictions C. An estimator is an algorithm which can be fit on a DataFrame to produce a Transformer D. An estimator is a hyperparameter grid that can be used to train a model E. An estimator chains multiple algorithms together to specify an ML workflow

C. An estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. This is correct. For example, a learning algorithm such as LinearRegression is an Estimator: calling fit() on a DataFrame trains and returns a model, and that model is a Transformer that maps a DataFrame with features to a DataFrame with predictions.

MACHINE LEARNING BASIC PART EXAM A data scientist is attempting to tune a logistic regression model logistic using scikit-learn. They want to specify a search space for two hyperparameters and let the tuning process randomly select values for each evaluation. They attempt to run the following code block, but it does not accomplish the desired task: distributions = dict(C=uniform(loc=0, scale=4), penalty=['l2', 'l1']) search_clf = GridSearchCV(logistic, distributions, random_state=0) search = search_clf.fit(feature_data, target_data) Which of the following changes can the data scientist make to accomplish the task? A. Replace the GridSearchCV operation with ParameterGrid B. Replace the GridSearchCV operation with cross_validate C. Replace the GridSearchCV operation with RandomizedSearchCV D. Replace the penalty=['l2', 'l1'] argument with penalty=uniform('l2', 'l1') E. Replace the random_state=0 argument with random

C. Replace the GridSearchCV operation with RandomizedSearchCV. The GridSearchCV operation performs an exhaustive search over a specified parameter grid, trying every combination of the hyperparameters in the grid. On the other hand, RandomizedSearchCV performs a randomized search over the hyperparameters, sampling from a specified distribution for each hyperparameter. Therefore, to accomplish the task of randomly sampling values for the hyperparameters, the data scientist should use the RandomizedSearchCV operation instead of GridSearchCV.
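A hedged sketch of the corrected tuning code, using a synthetic dataset in place of the question's feature_data and target_data:

from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

feature_data, target_data = make_classification(n_samples=200, random_state=0)
logistic = LogisticRegression(solver="saga", max_iter=5000)

# C is sampled from a continuous distribution; penalty is sampled from a discrete list
distributions = dict(C=uniform(loc=0, scale=4), penalty=["l2", "l1"])
search_clf = RandomizedSearchCV(logistic, distributions, random_state=0)
search = search_clf.fit(feature_data, target_data)
print(search.best_params_)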

SPARK ML PART EXAM A machine learning engineer is converting a decision tree from scikit-learn to Spark ML. They notice that they are receiving different results despite all of their data and manually specified hyperparameters being identical. Which of the following describes a reason that the single-node scikit-learn decision tree and the Spark ML decision tree differ? A. Spark ML decision trees test binned feature values as representative split candidates. B. Spark ML decision trees test a random sample of feature variables in the splitting algorithm. C. Spark ML decision trees test more split candidates in the splitting algorithm. D. Spark ML decision trees test every feature variable in the splitting algorithm. E. Spark ML decision trees automatically prune overfit trees.

A. Spark ML decision trees test binned feature values as representative split candidates. To scale to distributed data, Spark ML discretizes continuous features into a limited number of bins (controlled by maxBins) and evaluates only the bin boundaries as candidate split points, whereas single-node scikit-learn considers splits at every distinct feature value. Because the two implementations evaluate different sets of candidate splits, they can produce different trees even with identical data and hyperparameters.

ML WORKFLOW PART EXAM A machine learning engineer has been notified that a new Staging version of a model registered to the MLflow Model Registry has passed all tests. As a result, the machine learning engineer wants to put this model into production by transitioning it to the Production stage in the Model Registry. From which of the following pages in Databricks Machine Learning can the machine learning engineer accomplish this task? A. The home page of the MLflow Model Registry B. The run page in the Experiments observatory C. The model version page in the MLflow Model Registry D. The model page in the MLflow Model Registry E. The experiment page in the Experiments observatory

C. The model version page in the MLflow Model Registry The machine learning engineer can accomplish this task from the model version page in the MLflow Model Registry. This page displays information about a specific version of a registered model, and from it the engineer can view the version's details and perform actions such as transitioning it from the Staging stage to the Production stage. The home page of the MLflow Model Registry is the main page for the registry and does not provide specific information about individual model versions. The run page in the Experiments observatory displays information about a specific run of an experiment, not about model versions. The model page in the MLflow Model Registry displays information about a registered model as a whole, but stage transitions are performed on individual versions, so the model version page is where this task is completed.

ML WORKFLOW PART EXAM Which of the following is a benefit of using vectorized pandas UDFs instead of standard PySpark UDFs? A. The vectorized pandas UDFs process data in memory rather than splitting it into tasks. B. The vectorized pandas UDFs allow for the use of type hints. C. The vectorized pandas UDFs allow for pandas API use inside of the function. D. The vectorized pandas UDFs work on distributed DataFrames. E. The vectorized pandas UDFs process data in batches rather than one row at a time.

E. The vectorized pandas UDFs process data in batches rather than one row at a time. Standard PySpark UDFs serialize and deserialize data one row at a time, which incurs significant overhead. Vectorized pandas UDFs use Apache Arrow to transfer data and operate on whole batches as pandas Series or DataFrames, which greatly reduces serialization overhead and enables vectorized operations. Being able to use the pandas API inside the function (C) is a side effect of this design, but the batch-wise processing is the core benefit.

ML WORKFLOW PART EXAM A machine learning engineer has grown tired of needing to install the MLflow Python library on each of their clusters. They ask a senior machine learning engineer how their notebooks can load the MLflow library without installing it each time. The senior machine learning engineer suggests that they use the Databricks Runtime for Machine Learning. Which of the following approaches describes how the machine learning engineer can use the Databricks Runtime for Machine Learning? A. They can set the runtime-version variable in their Spark session to "ml" B. They can add a line enabling Databricks Runtime ML in their init script when creating their clusters. C. They can select a Databricks Runtime ML version from the Databricks Runtime version drop-down when creating their clusters. D. They can set the runtime-version key in their Databricks widgets to "ml" E. They can check the Databricks Runtime ML box when creating their

C. They can select a Databricks Runtime ML version from the Databricks Runtime version drop-down when creating their clusters. This is correct: Databricks Runtime ML ships with the MLflow library (along with other common ML libraries) preinstalled on the cluster, so notebooks can simply import mlflow without installing it each time.

ML WORKFLOW PART EXAM A data scientist has developed a linear regression model using SparkML and computed the prediction in a Spark DataFrame preds_df with the following schema. prediction DOUBLE actual DOUBLE Which of the following code blocks can be used to compute the root mean-squared error of the model according to the data in preds_df and assign it to the rmse variable? A. rmse = Summarizer(predictionCol="prediction", labelCol="actual", metricName="rmse" ) B. rmse = BinaryClassificationEvaluator(predictionCol="prediction",labelCol="actual", metricName="mse" ) C. regression_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="actual", metricName="rmse") rmse= regression_evaluator.evaluate(pred_df) D. classification_evaluator = BinaryClassition( predictionCol="prediction", labelCol="actual", metricName="rmse") classification_evaluator.evaluate(pred_df) E. rmse = RegressionEvaluator( prediction

C. regression_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="actual", metricName="rmse") rmse = regression_evaluator.evaluate(preds_df) This is correct. This code block uses the RegressionEvaluator class, a Spark MLlib utility for evaluating regression models, to compute the root-mean-squared error (RMSE) of the model on the data in preds_df. The RMSE is a common evaluation metric for regression models and is calculated as the square root of the average of the squared differences between the predicted and actual values. The predictionCol and labelCol arguments specify the names of the prediction and actual-value columns in the DataFrame, and the metricName argument specifies that the RMSE should be computed.

SPARK ML PART EXAM A data scientist wants to tune a set of hyperparameters for a machine learning model. They have wrapped a Spark ML model in the objective function objective_function and they have defined the search space search_space. As a result, they have the following code block: num_evals = 100 trials = SparkTrials() best_hyperparam = fmin( fn=objective_function, space=search_space, algo=tpe.suggest, max_evals=num_evals, trials=trials ) Which of the following changes do they need to make to the above code block in order to accomplish the task? A. Change fmin() to fmax() B. Reduce num_evals to be less than 10 C. Remove the algo=tpe.suggest argument D. Change SparkTrials() to Trials() E. Remove the trials=trials argument

D. Change SparkTrials() to Trials(). SparkTrials is designed to distribute trials of single-machine models (for example scikit-learn) across the cluster's workers. Here the objective function wraps a Spark ML model whose training is itself already distributed across the cluster, so the trials should be driven from the driver with the default Trials class; SparkTrials should not be used with distributed training algorithms such as MLlib.

MACHINE LEARNING BASIC PART EXAM A data scientist has produced two models for a single machine learning problem. One of the models performs well when one of the features has a value of less than 5, and the other model performs well when the value of that feature is greater than or equal to 5. The data scientist decides to combine the two models. Which of the following terms is used to describe this combination of models? A. Bootstrap aggregation B. Stacking C. Support vector machines D. Ensemble learning E. Bucketing

D. Ensemble learning This is correct. Ensemble learning is a method of combining the predictions of multiple models to produce a more accurate final prediction. In the case described, the data scientist has created two models that each perform well under different circumstances (when the value of a particular feature is less than or greater than or equal to 5, respectively). By combining these two models into a single ensemble, the data scientist can create a more robust model that is able to make accurate predictions under a wider range of circumstances. Ensemble learning is a widely used approach in machine learning and can improve the performance of a model by reducing overfitting and improving the generalization of the model to new data.

MACHINE LEARNING BASIC PART EXAM What is the name of the method that transforms a categorical feature into a series of binary indicator feature variables? A. Leave-one-out encoding B. String indexing C. Target encoding D. One-hot encoding E. Categorical embeddings

D. One-hot encoding The method that transforms categorical feature into a series of binary indicator feature variables is called one-hot encoding. One-hot encoding is a way to represent categorical variables as numerical data so that it can be used in machine learning algorithms. It involves creating a new binary column for each unique category in the categorical feature. For example, if a categorical feature has three categories, A, B, and C, then three new columns, one for each category, would be created. If a given data point belongs to category A, then the value in the A column would be 1, and the values in the B and C columns would be 0. One-hot encoding allows the model to treat each category as a separate entity, rather than as a numeric value that can be compared or ordered.
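A small illustration with pandas (one of several libraries that implement one-hot encoding); the column and category values are made up:

import pandas as pd

df = pd.DataFrame({"category": ["A", "B", "C", "A"]})
encoded = pd.get_dummies(df, columns=["category"], dtype=int)
print(encoded)
#    category_A  category_B  category_C
# 0           1           0           0
# 1           0           1           0
# 2           0           0           1
# 3           1           0           0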

MACHINE LEARNING BASIC PART EXAM Which of the following machine learning algorithms typically uses bagging? A. Decision tree B. K-means C. Linear regression D. Random forest E. Gradient boosted trees

D. Random forest This is correct. Random forest typically uses bagging. Bagging is a type of ensemble method in which multiple models are trained on different subsets of the training data and the predictions from all the models are combined to make the final prediction. This can help reduce overfitting and improve the generalizability of the model. Random forests are a type of ensemble model that is composed of multiple decision trees. In a random forest model, bagging is used to train each individual decision tree. Each tree is trained on a different random subset of the training data, and the final prediction is made by aggregating the predictions from all the trees. This can help improve the performance of the model by reducing overfitting and increasing the diversity of the trees.

SPARK ML PART EXAM A data scientist is using Spark SQL to import their data into a machine learning pipeline. Once the data is imported, the data scientist performs machine learning tasks using Spark ML. Which of the following compute tools is best suited for this use case? A. None of these compute tools support this task B. Serverless SQL C. SQL Endpoint D. Standard cluster E. Single Node cluster

D. Standard cluster is the best-suited compute tool for this use case. A standard (multi-node) cluster supports both parts of the workflow: Spark SQL for importing the data and Spark ML for the machine learning tasks, with the work distributed across a set of connected worker nodes. SQL endpoints and serverless SQL only execute SQL queries and cannot run Spark ML code, and a single-node cluster does not distribute computation, so it may not have sufficient resources for large amounts of data. A standard cluster is therefore the best choice.

SCALING ML MODELS EXAM PART A machine learning engineer is trying to scale a machine learning pipeline by distributing its feature engineering process. Which of the following feature engineering tasks will be the least efficient to distribute? A. One-hot encoding categorical features B. Imputing missing feature values with the mean C. Creating binary indicator features for missing values D. Target encoding categorical features E. Imputing missing feature values with the true median

E. Imputing missing feature values with the true median. Tasks that operate row by row, such as one-hot encoding and creating binary missing-value indicators, parallelize efficiently; imputing with the mean only requires a distributed sum and count, and target encoding reduces to a group-by aggregation. Computing the exact (true) median, however, requires a global view of the sorted values, so it involves a full sort or repeated passes over the data rather than a simple aggregation, which makes it the least efficient task to distribute.

ML WORKFLOW PART EXAM A machine learning engineering team has started using Repos in Databricks for its projects. One of the machine learning engineers wants to save a notebook they have developed in Databricks to a remote Git provider. Which of the following approaches can the machine learning engineer use to complete this task? A. They can commit and push the changes from the MLflow experiment page. B. They can commit the changes from the Repos dialog box and push the changes from the notebook revision history. C. They can commit the changes from the notebook's revision history and push the changes from the Repos dialog box. D. They can commit and push the changes from the Repos dialog box. E. They can commit and push the changes from the notebook's revision history.

D. They can commit and push the changes from the Repos dialog box. This is correct because it is the most straightforward approach for the machine learning engineer to save their notebook to a remote Git provider. Using the Repos dialog box, the engineer can commit their changes and push them directly to the remote repository in a single step, without switching between different pages or interfaces, ensuring that the changes are properly recorded and synced with the remote repository.

SPARK ML PART EXAM A data scientist is using Spark ML to engineer features for an exploratory machine learning project. They decide they want to standardize their features using the following code block: scaler = StandardScaler(withMean=True, inputCol="input_features", outputCol="output_features") scaler_model = scaler.fit(features_df) scaled_df = scaler_model.transform(features_df) train_df, test_df = scaled_df.randomSplit([.8, .2], seed=42) Upon code review, a colleague expressed concern with the features being standardized prior to splitting the data into a training set and a test set. Which of the following changes can the data scientist make to address the concern? A. Utilize the Pipeline API to standardize the training data according to the test data's summary statistics. B. Utilize the MinMaxScaler object to standardize the training data according to the global minimum and maximum values. C. Utilize a cross-valida

D. Utilize the Pipeline API to standardize the test data according to the training data's summary statistics. This is correct. Standardizing the data means centering it around the mean and scaling it to unit variance, which can help a model converge faster and perform better, but the same fitted transformation must be applied to both the training and test sets. If the scaler is fit on all of the data before the split (or fit separately on each set), information from the test set leaks into the preprocessing step and the model's performance on the test set will not accurately reflect its performance on unseen data. The Pipeline API allows the data scientist to fit the scaler on the training data only and then apply that same transformation — using the training data's mean and standard deviation — to the test data.
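A hedged sketch of the leakage-safe ordering (split first, then fit the scaler inside a Pipeline on the training split only); features_df and the column names come from the question:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StandardScaler

# Split the raw features first
train_df, test_df = features_df.randomSplit([0.8, 0.2], seed=42)

scaler = StandardScaler(withMean=True, inputCol="input_features", outputCol="output_features")
pipeline = Pipeline(stages=[scaler])

# Summary statistics come from the training data only...
pipeline_model = pipeline.fit(train_df)

# ...and the same fitted transformation is applied to both splits
scaled_train_df = pipeline_model.transform(train_df)
scaled_test_df = pipeline_model.transform(test_df)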

MACHINE LEARNING BASIC PART EXAM In which of the following situations is it preferable to impute missing feature values with their median value over the mean value? A. When the features are of the boolean type B. When the features contain no outliers C. When the features are of the categorical type D. When the features contain a lot of extreme outliers E. When the features contain no missing values

D. When the features contain a lot of extreme outliers It is generally preferable to impute missing feature values with the median value rather than the mean value in situations where the feature contains a lot of extreme outliers. This is because the median is not as sensitive to outliers as the mean, and so it can give a more accurate representation of the central tendency of the data. In other situations, such as when the features are of the boolean type or categorical type, it is not necessary to impute missing values at all, as these types of features do not have a numerical representation and do not lend themselves to statistical calculations like mean and median. If the features contain no outliers and no missing values, then it would not be necessary to impute any values at all.

SPARK ML PART EXAM A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API. Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark? A. import pyspark.pandas as ps; df = ps.to_pandas(spark_df) B. spark_df.to_sql() C. import pandas as pd; df = pd.DataFrame(spark_df) D. import pyspark.pandas as ps; df = ps.DataFrame(spark_df) E. spark_df.to_pandas()

D. import pyspark.pandas as ps; df = ps.DataFrame(spark_df) This is correct. The pyspark.pandas DataFrame constructor takes a required data argument that can be a numpy ndarray (structured or homogeneous), a dict, a pandas DataFrame, a Spark DataFrame, or a pandas-on-Spark Series. When given a Spark DataFrame it wraps it, holding the Spark DataFrame internally, so the data scientist can then work with it through the familiar pandas API.
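A tiny sketch of the conversion, assuming spark_df from the question; on newer Spark versions spark_df.pandas_api() is an equivalent shortcut:

import pyspark.pandas as ps

psdf = ps.DataFrame(spark_df)  # wraps the Spark DataFrame in a pandas-on-Spark DataFrame
print(psdf.head())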

ML WORKFLOW PART EXAM A data scientist is calculating the importance of features as part of an MLflow run. The feature importance values are being stored in the pandas DataFrame importance_df and written as a CSV to the DBFS location importance_path. They would like to log these values with their active MLflow run. Which of the following lines of code can the data scientist use to log the feature importance values with their MLflow run? A. mlflow.log_artifact(importance_df) B. mlflow.log_artifact(importance_df, "importance_df") C. mlflow.log_metric(importance_path, "importance.csv") D. mlflow.log_artifact(importance_path, "importance.csv") E. mlflow.log_metric(importance_df, "importance_df")

D. mlflow.log_artifact(importance_path, "importance.csv") This is correct. The mlflow.log_artifact function logs a local file or directory as an artifact of the run. In this case, the data scientist wants to log the feature importance values stored in the CSV file located at importance_path, so that path is passed as the first argument; the optional second argument is the run-relative artifact directory under which the file is stored.
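A minimal sketch of logging a feature-importance CSV as a run artifact; the path and DataFrame contents are illustrative:

import mlflow
import pandas as pd

importance_df = pd.DataFrame({"feature": ["price", "rooms"], "importance": [0.7, 0.3]})
importance_path = "/dbfs/tmp/importance.csv"  # hypothetical DBFS location
importance_df.to_csv(importance_path, index=False)

with mlflow.start_run():
    # Logs the local file; the optional second argument is the artifact sub-directory within the run
    mlflow.log_artifact(importance_path, "feature_importance")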

ML WORKFLOW PART EXAM A machine learning engineer has identified the best run from an MLflow experiment. They have stored the run ID in the run_id variable and identified the logged model name as "model". They now want to register that model in the MLflow Model Registry with the name "best_model". Which of the following lines of code can they use to register the model associated with run_id to the MLflow Model Registry? A. mlflow.register_model(f"runs:/{run_id}/model", "model") B. mlflow.register_model(run_id, "best_model") C. mlflow.register_model(f"runs:/{run_id}/model") D. mlflow.register_model(f"runs:/{run_id}/model", "best_model") E. mlflow.register_model(run_id, "model")

D. mlflow.register_model(f"runs:/{run_id}/model", "best_model") This is correct. mlflow.register_model() takes two arguments, model_uri and name. The model_uri is built from the run ID in the form runs:/<run_id>/<artifact_path>, and name is the name under which the model is registered in the MLflow Model Registry.
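A short sketch of the registration call, assuming run_id holds the ID of the best run identified earlier:

import mlflow

model_uri = f"runs:/{run_id}/model"
registered = mlflow.register_model(model_uri, "best_model")
print(registered.name, registered.version)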

SCALING ML MODELS EXAM PART The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large numbers of variables. Which of the following approaches does Spark ML use to distribute the training of a linear regression model on large data? A. Singular value decomposition B. Logistic regression C. Spark ML cannot distribute linear regression training D. Least-squares method E. Iterative optimization

E. Iterative optimization. When the number of features is too large to solve with the matrix-decomposition (normal equation) approach, Spark ML falls back to an iterative optimizer such as L-BFGS, which updates the coefficients over repeated passes and distributes the computation for each pass across the cluster's workers.

MACHINE LEARNING BASIC PART EXAM An organization is developing a feature repository and is electing to use one-hot encoding. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository. Which of the following explanations justifies this suggestion? A. One-hot encoding is not supported by most machine learning libraries. B. One-hot encoding is dependent on the target variable's values, which differ for each application. C. One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems. D. One-hot encoding is not a common strategy for representing categorical feature variables numerically. E. One-hot encoding is a potentially problematic categorical variable strategy for some machine learning applications.

E. One-hot encoding is a potentially problematic categorical variable strategy for some machine learning applications. One-hot encoding is a common technique for representing categorical variables numerically, but it expands the dimensionality of the data, which can increase model complexity, reduce interpretability, and hurt algorithms (such as tree-based models) that suffer from the curse of dimensionality. Because a feature repository serves many downstream applications and the best encoding depends on each application and algorithm, categorical features should be stored in their raw form and encoded as needed by each consumer.

SPARK ML PART EXAM A data scientist is trying to train a model with their team's existing Feature Store table features. They want to look up rows from features matching the customer_id column in training_df. They use the following block of code to create the training set using the Feature Store client: lookups = [FeatureLookup("features", "customer_id")] training_set = fs.create_training_set( training_df, lookups, exclude_columns=["customer_id"] ) Which of the following represents the features that will be pulled from the Feature Store table features when the training set is being used? A. No features will be pulled from features. B. All features will be pulled from features. C. Only customer_id will be pulled from features. D. More information is needed to determine which features will be pulled from features. E. All features except customer_id will be pulled from features.

E. All features except customer_id will be pulled from features. When a FeatureLookup does not specify a list of feature names, all columns of the Feature Store table are looked up and joined to training_df on the lookup key. Because exclude_columns=["customer_id"] is passed to create_training_set, the customer_id column is dropped from the resulting training set, so every feature except customer_id is pulled from the table.

MACHINE LEARNING BASIC PART EXAM A data scientist uses 3-fold cross-validation when optimizing model hyperparameters for a regression problem. The following root-mean-squared-error values are calculated on each of the validation folds: 10.0, 12.0, 17.0. Which of the following values represents the overall cross-validation root-mean-squared error? A. 17.0 B. 12.0 C. 39.0 D. 10.0 E. 13.0

E. 13.0 The overall cross-validation root-mean-squared error is the average of the root-mean-squared errors calculated on each of the validation folds: (10.0 + 12.0 + 17.0) / 3 = 39.0 / 3 = 13.0.

ML WORKFLOW PART EXAM Which of the following Spark operations can be used to randomly split a Spark DataFrame into a training DataFrame and a test DataFrame for downstream use? A. TrainValidationSplit B. TrainValidationSplitModel C. CrossValidator D. DataFrame.where E. DataFrame.randomSplit

E. DataFrame.randomSplit This is correct. randomSplit divides a DataFrame into random subsets according to the supplied weights, which is how a training set and a test set are produced. The training set can then be used for training and validating the model (for example with CrossValidator or TrainValidationSplit), and the held-out test set is reserved for downstream evaluation.
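A one-line sketch, assuming a Spark DataFrame spark_df:

# 80/20 split with a fixed seed for reproducibility
train_df, test_df = spark_df.randomSplit([0.8, 0.2], seed=42)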

MACHINE LEARNING BASIC PART EXAM A data scientist wants to parallelize the training of trees in a gradient boosted tree model to speed up the training process. A colleague suggests that parallelizing a boosted tree algorithm can be difficult. Which of the following describes why? A. Gradient boosting is not a linear algebra-based algorithm, which is required for parallelization. B. Gradient boosting uses decision trees in each iteration, which cannot be parallelized. C. Gradient boosting requires access to all data at once, which cannot happen during parallelization. D. Gradient boosting calculates gradients in evaluation metrics using all cores, which prevents parallelization. E. Gradient boosting is an iterative algorithm that requires information from the previous iteration to perform the next step.

E. Gradient boosting is an iterative algorithm that requires information from the previous iteration to perform the next step. The main idea behind gradient boosting is to build models sequentially, with each subsequent model trying to reduce the errors of the previous one; because each iteration depends on the result of the one before it, the individual trees cannot be trained in parallel.

ML WORKFLOW PART EXAM Which of the following tools can be used to parallelize the hyperparameter tuning process for single-node machine learning models using a Spark cluster? A. MLflow Experiment Tracking B. Delta Lake C. Autoscaling clusters D. Spark ML E. Hyperopt

E. Hyperopt This is correct. Hyperopt works with distributed ML algorithms such as Apache Spark MLlib and Horovod, as well as with single-machine ML libraries such as scikit-learn and TensorFlow; with its SparkTrials class, the trials for a single-node model are distributed across the workers of a Spark cluster.
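A hedged sketch of distributing single-node scikit-learn tuning with Hyperopt's SparkTrials; the dataset and search space are illustrative:

from hyperopt import SparkTrials, fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

def objective_function(params):
    model = RandomForestClassifier(max_depth=int(params["max_depth"]), random_state=0)
    accuracy = cross_val_score(model, X, y, cv=3).mean()
    return -accuracy  # fmin minimizes, so negate a score that should be maximized

search_space = {"max_depth": hp.quniform("max_depth", 2, 10, 1)}
best_hyperparam = fmin(
    fn=objective_function,
    space=search_space,
    algo=tpe.suggest,
    max_evals=8,
    trials=SparkTrials(parallelism=2),  # each trial runs as a Spark task on a worker
)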

MACHINE LEARNING BASIC PART EXAM A health organization is developing a classification model to determine whether or not a patient currently has a specific type of infection. The organization's leaders want to maximize the number of cases identified by the model. Which of the following classification metrics should be used to evaluate the model? A. Accuracy B. RMSE C. Area under the residual operating curve D. Precision E. Recall

E. Recall It is important to consider multiple evaluation metrics when evaluating a classification model, but if the organization's leaders want to maximize the number of cases identified by the model, recall is the metric to optimize. Recall measures the proportion of actual positive cases that were correctly identified by the model and is calculated as the number of true positives divided by the sum of the true positives and false negatives; a model with high recall identifies a large proportion of the positive cases, even at the cost of more false positives. RMSE is a regression metric, a "residual operating curve" is not a standard classification metric, and accuracy and precision do not directly reward identifying as many of the actual positive cases as possible.

SCALING ML MODELS EXAM PART A machine learning engineer is trying to scale a machine learning pipeline that contains multiple feature engineering stages and a modeling stage. As part of the cross-validation process, they are using the following code block: cv = CrossValidator( estimator=pipeline, evaluator=evaluator, parallelism=2, estimatorParamMaps=param_grid, numFolds=3, seed=42 ) A colleague suggests that the code block can be changed to speed up the tuning process by passing the model object to the estimator parameter and then placing the updated cv object as the final stage of the pipeline in place of the original model. Which of the following is a negative consequence of the approach suggested by the colleague? E. The model will be refit one more time per cross-validation fold B. The cross-validation process will no longer be reproducible A. The cross-validation process will no lon

E. The model will be refit one more time per cross-validation fold The original code block uses the CrossValidator class from pyspark.ml.tuning, which fits the estimator on a training subset of the data and evaluates it on a separate validation subset, repeating the process numFolds times with different splits. If the model object is passed as the estimator parameter and the updated cv object is placed as the final stage of the pipeline, the model ends up being fit an additional time per cross-validation fold: the CrossValidator fits the model within each fold, and fitting the enclosing pipeline triggers a further refit of the selected model when the cross-validation stage runs.

SPARK ML PART EXAM Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames? A. pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional metadata. B. pandas API on Spark DataFrames are unrelated to Spark DataFrames. C. pandas API on Spark DataFrames are less mutable versions of Spark DataFrames. D. pandas API on Spark DataFrames are more performant than Spark DataFrames. E. pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata.

E. pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata. The pandas API on Spark allows users to apply the familiar pandas API to data stored in Spark DataFrames. It is implemented as a thin wrapper around the Spark DataFrame API: a pandas-on-Spark DataFrame shares the same underlying data and adds the metadata needed to behave like pandas. This makes it easier for users who know pandas to work with data stored in Spark DataFrames and to integrate pandas-based code with Spark-based code.

SPARK ML PART EXAM A data scientist wants to explore summary statistics for a Spark DataFrame spark_df. The data scientist wants to see the count, mean, standard deviation, minimum, maximum, and interquartile range (IQR) for each numeric feature. Which of the following lines of code can the data scientist run to accomplish this task? A. spark_df.stats() B. spark_df.describe().head() C. spark_df.printSchema() D. spark_df.toPandas() E. spark_df.summary()

E. spark_df.summary() This is correct. summary() computes specified statistics for numeric and string columns. Available statistics are count, mean, stddev, min, max, and arbitrary approximate percentiles specified as a percentage (e.g., 75%). By default it also reports the 25%, 50%, and 75% percentiles, from which the interquartile range can be read, whereas describe() only returns count, mean, stddev, min, and max.
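A quick sketch, assuming the Spark DataFrame spark_df from the question:

# summary() includes the 25%, 50%, and 75% percentiles by default
spark_df.summary().show()
# Specific statistics can also be requested explicitly
spark_df.summary("count", "mean", "stddev", "min", "25%", "75%", "max").show()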

