A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column price is greater than 0.
Which of the following code blocks will accomplish this task?
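For reference, a minimal sketch of the kind of filter the question is asking for, assuming the standard PySpark column API:

from pyspark.sql.functions import col

# Keep only the rows where price is strictly greater than 0
filtered_df = spark_df.filter(col("price") > 0)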
Which of the following statements describes a Spark ML estimator?
A. An estimator is a hyperparameter grid that can be used to train a model
B. An estimator chains multiple algorithms together to specify an ML workflow
C. An estimator is a trained ML model which turns a DataFrame with features into a DataFrame with predictions
D. An estimator is an algorithm which can be fit on a DataFrame to produce a Transformer
E. An estimator is an evaluation tool to assess the quality of a model
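For context, a minimal sketch of the estimator/transformer pattern in Spark ML; the algorithm, column names, and DataFrames are illustrative assumptions:

from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol="features", labelCol="label")  # estimator
lr_model = lr.fit(train_df)                  # fit() on a DataFrame returns a Transformer
predictions = lr_model.transform(test_df)    # the Transformer adds a prediction column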
A data scientist wants to tune a set of hyperparameters for a machine learning model. They have wrapped a Spark ML model in the objective function objective_function and they have defined the search space search_space.
As a result, they have the following code block:
Which of the following changes do they need to make to the above code block in order to accomplish the task?
A. Change SparkTrials() to Trials()
B. Reduce num_evals to be less than 10
C. Change fmin() to fmax()
D. Remove the trials=trials argument
E. Remove the algo=tpe.suggest argument
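For context, a hedged sketch of a typical fmin call; num_evals is assumed to be defined alongside objective_function and search_space:

from hyperopt import fmin, tpe, Trials

best_params = fmin(
    fn=objective_function,
    space=search_space,
    algo=tpe.suggest,
    max_evals=num_evals,
    trials=Trials(),  # single-machine Trials; SparkTrials is intended for distributing single-node models
)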
A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column discount is less than or equal to 0.
Which of the following code blocks will accomplish this task?
A. spark_df.loc[:, spark_df["discount"] <= 0]
B. spark_df[spark_df["discount"] <= 0]
C. spark_df.filter(col("discount") <= 0)
D. spark_df.loc[spark_df["discount"] <= 0, :]
A data scientist is utilizing MLflow Autologging to automatically track their machine learning experiments. After completing a series of runs for the experiment experiment_id, the data scientist wants to identify the run_id of the run with the best root-mean-square error (RMSE).
Which of the following lines of code can be used to identify the run_id of the run with the best RMSE in experiment_id?
A–D. (code block options not captured in the text)
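One hedged sketch of how this is commonly done, assuming the runs logged a metric named rmse:

import mlflow

runs_df = mlflow.search_runs([experiment_id], order_by=["metrics.rmse ASC"])
best_run_id = runs_df.loc[0, "run_id"]  # first row has the lowest RMSE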
A data scientist has defined a Pandas UDF function predict to parallelize the inference process for a single-node model:
They have written the following incomplete code block to use predict to score each record of Spark DataFrame spark_df:
Which of the following lines of code can be used to complete the code block to successfully complete the task?
A. predict(*spark_df.columns)
B. mapInPandas(predict)
C. predict(Iterator(spark_df))
D. mapInPandas(predict(spark_df.columns))
E. predict(spark_df.columns)
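For context, a hedged sketch of the mapInPandas pattern the question is built around; the output schema string is an illustrative assumption:

# predict is assumed to map an iterator of pandas DataFrames to an iterator of pandas DataFrames
preds_df = spark_df.mapInPandas(predict, schema="customer_id string, prediction double")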
A machine learning engineer has created a Feature Table new_table using Feature Store Client fs. When creating the table, they specified a metadata description with key information about the Feature Table. They now want to retrieve that metadata programmatically.
Which of the following lines of code will return the metadata description?
A. There is no way to return the metadata description programmatically.
B. fs.create_training_set("new_table")
C. fs.get_table("new_table").description
D. fs.get_table("new_table").load_df()
E. fs.get_table("new_table")
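A hedged sketch of retrieving Feature Table metadata, assuming the databricks.feature_store client API:

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()
feature_table = fs.get_table("new_table")  # returns a FeatureTable object
print(feature_table.description)           # the description supplied when the table was created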
A health organization is developing a classification model to determine whether or not a patient currently has a specific type of infection. The organization's leaders want to maximize the number of positive cases identified by the model.
Which of the following classification metrics should be used to evaluate the model?
A. RMSE
B. Precision
C. Area under the residual operating curve
D. Accuracy
E. Recall
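For reference, recall = TP / (TP + FN) measures the fraction of actual positive cases that the model identifies, while precision = TP / (TP + FP) measures the fraction of predicted positives that are correct.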
In which of the following situations is it preferable to impute missing feature values with their median value over the mean value?
A. When the features are of the categorical type
B. When the features are of the boolean type
C. When the features contain a lot of extreme outliers
D. When the features contain no outliers
E. When the features contain no missing values
An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.
Which of the following explanations justifies this suggestion?
A. One-hot encoding is not supported by most machine learning libraries.
B. One-hot encoding is dependent on the target variable’s values which differ for each application.
C. One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.
D. One-hot encoding is not a common strategy for representing categorical feature variables numerically.
E. One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.
A machine learning engineer is trying to scale a machine learning pipeline by distributing its feature engineering process.
Which of the following feature engineering tasks will be the least efficient to distribute?
A. One-hot encoding categorical features
B. Target encoding categorical features
C. Imputing missing feature values with the mean
D. Imputing missing feature values with the true median
E. Creating binary indicator features for missing values
A data scientist is developing a machine learning pipeline using AutoML on Databricks Machine Learning.
Which of the following steps will the data scientist need to perform outside of their AutoML experiment?
A. Model tuning
B. Model evaluation
C. Model deployment
D. Exploratory data analysis
A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.
Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?
A. import pyspark.pandas as ps; df = ps.DataFrame(spark_df)
B. import pyspark.pandas as ps; df = ps.to_pandas(spark_df)
C. spark_df.to_sql()
D. import pandas as pd; df = pd.DataFrame(spark_df)
E. spark_df.to_pandas()
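A hedged sketch of wrapping an existing Spark DataFrame in the pandas API on Spark (available as pyspark.pandas in Spark 3.2+):

import pyspark.pandas as ps

psdf = ps.DataFrame(spark_df)  # pandas-style API, still backed by Spark
psdf.head()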
Which of the following is a benefit of using vectorized pandas UDFs instead of standard PySpark UDFs?
A. The vectorized pandas UDFs allow for the use of type hints
B. The vectorized pandas UDFs process data in batches rather than one row at a time
C. The vectorized pandas UDFs allow for pandas API use inside of the function
D. The vectorized pandas UDFs work on distributed DataFrames
E. The vectorized pandas UDFs process data in memory rather than spilling to disk
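For context, a minimal Series-to-Series pandas UDF sketch; the column name and the transformation itself are illustrative only:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def double_price(price: pd.Series) -> pd.Series:
    # Receives a whole batch as a pandas Series rather than one row at a time
    return price * 2.0

result_df = spark_df.withColumn("price_doubled", double_price("price"))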
A machine learning engineer is converting a decision tree from sklearn to Spark ML. They notice that they are receiving different results despite all of their data and manually specified hyperparameter values being identical.
Which of the following describes a reason that the single-node sklearn decision tree and the Spark ML decision tree can differ?
A. Spark ML decision trees test every feature variable in the splitting algorithm
B. Spark ML decision trees automatically prune overfit trees
C. Spark ML decision trees test more split candidates in the splitting algorithm
D. Spark ML decision trees test a random sample of feature variables in the splitting algorithm
E. Spark ML decision trees test binned feature values as representative split candidates
A data scientist has replaced missing values in their feature set with each respective feature variable’s median value. A colleague suggests that the data scientist is throwing away valuable information by doing this.
Which of the following approaches can they take to include as much information as possible in the feature set?
A. Impute the missing values using each respective feature variable’s mean value instead of the median value
B. Refrain from imputing the missing values in favor of letting the machine learning algorithm determine how to handle them
C. Remove all feature variables that originally contained missing values from the feature set
D. Create a binary feature variable for each feature that contained missing values indicating whether each row’s value has been imputed
E. Create a constant feature variable for each feature that contained missing values indicating the percentage of rows from the feature that was originally missing
Which of the following Spark operations can be used to randomly split a Spark DataFrame into a training DataFrame and a test DataFrame for downstream use?
A. TrainValidationSplit
B. DataFrame.where
C. CrossValidator
D. TrainValidationSplitModel
E. DataFrame.randomSplit
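A minimal sketch of the random split; the ratios and seed are illustrative:

# 80/20 split; the seed is fixed only for reproducibility
train_df, test_df = spark_df.randomSplit([0.8, 0.2], seed=42)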
A data scientist wants to efficiently tune the hyperparameters of a scikit-learn model. They elect to use the Hyperopt library's fmin operation to facilitate this process. Unfortunately, the final model is not very accurate. The data scientist suspects that there is an issue with the objective_function being passed as an argument to fmin.
They use the following code block to create the objective_function:
Which of the following changes does the data scientist need to make to their objective_function in order to produce a more accurate model?
A. Add a test set validation process
B. Add a random_state argument to the RandomForestRegressor operation
C. Remove the mean operation that is wrapping the cross_val_score operation
D. Replace the r2 return value with -r2
E. Replace the fmin operation with the fmax operation
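For context, a hedged sketch of an objective function shaped for fmin's minimization; the hyperparameter names and the X_train/y_train variables are assumptions:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def objective_function(params):
    model = RandomForestRegressor(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),
    )
    r2 = cross_val_score(model, X_train, y_train, scoring="r2", cv=3).mean()
    return -r2  # fmin minimizes, so a score to be maximized is returned with its sign flipped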
A data scientist is using Spark ML to engineer features for an exploratory machine learning project.
They decide they want to standardize their features using the following code block:
Upon code review, a colleague expressed concern with the features being standardized prior to splitting the data into a training set and a test set.
Which of the following changes can the data scientist make to address the concern?
A. Utilize the MinMaxScaler object to standardize the training data according to global minimum and maximum values
B. Utilize the MinMaxScaler object to standardize the test data according to global minimum and maximum values
C. Utilize a cross-validation process rather than a train-test split process to remove the need for standardizing data
D. Utilize the Pipeline API to standardize the training data according to the test data's summary statistics
E. Utilize the Pipeline API to standardize the test data according to the training data's summary statistics
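A hedged sketch of fitting the scaler on the training split only and reusing it on the test split; the DataFrame and column names are illustrative:

from pyspark.ml.feature import StandardScaler

train_df, test_df = features_df.randomSplit([0.8, 0.2], seed=42)

scaler = StandardScaler(inputCol="features", outputCol="features_scaled")
scaler_model = scaler.fit(train_df)              # summary statistics come from the training data only
scaled_train = scaler_model.transform(train_df)
scaled_test = scaler_model.transform(test_df)    # test data scaled with the training statistics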
A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model.
The Spark DataFrame train_df has the following schema:
The machine learning engineer shares the following code block:
Which of the following changes does the machine learning engineer need to make to complete the task?
A. They need to call the transform method on train_df
B. They need to convert the features column to be a vector
C. They do not need to make any changes
D. They need to utilize a Pipeline to fit the model
E. They need to split the features column out into one column for each feature
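For context, a hedged sketch of assembling individual columns into the single vector column Spark ML's LinearRegression expects; the input column names are illustrative:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

assembler = VectorAssembler(inputCols=["rooms", "rating", "distance"], outputCol="features")
assembled_train_df = assembler.transform(train_df)

lr = LinearRegression(featuresCol="features", labelCol="price")
lr_model = lr.fit(assembled_train_df)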
A data scientist uses 3-fold cross-validation and the following hyperparameter grid when optimizing model hyperparameters via grid search for a classification problem:
Hyperparameter 1: [2, 5, 10]
Hyperparameter 2: [50, 100]
Which of the following represents the number of machine learning models that can be trained in parallel during this process?
A. 3
B. 5
C. 6
D. 18
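For reference, the grid defines 3 × 2 = 6 hyperparameter combinations, and 3-fold cross-validation fits each combination once per fold, so 6 × 3 = 18 model fits are performed in total, each independent of the others.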
A data scientist wants to explore summary statistics for Spark DataFrame spark_df. The data scientist wants to see the count, mean, standard deviation, minimum, maximum, and interquartile range (IQR) for each numerical feature.
Which of the following lines of code can the data scientist run to accomplish the task?
A. spark_df.summary()
B. spark_df.stats()
C. spark_df.describe().head()
D. spark_df.printSchema()
E. spark_df.toPandas()
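A minimal sketch: summary() reports count, mean, stddev, min, max, and the 25%/50%/75% percentiles, from which the IQR can be read:

spark_df.summary().show()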
A data scientist wants to explore the Spark DataFrame spark_df. The data scientist wants visual histograms displaying the distribution of numeric features to be included in the exploration.
Which of the following lines of code can the data scientist run to accomplish the task?
A. spark_df.describe()
B. dbutils.data(spark_df).summarize()
C. This task cannot be accomplished in a single line of code.
D. spark_df.summary()
E. dbutils.data.summarize(spark_df)
A machine learning engineer is trying to perform batch model inference. They want to get predictions using the linear regression model saved at the path model_uri for the DataFrame batch_df. batch_df has the following schema: customer_id STRING
The machine learning engineer runs the following code block to perform inference on batch_df using the linear regression model at model_uri:
In which situation will the machine learning engineer’s code block perform the desired inference?
A. When the Feature Store feature set was logged with the model at model_uri
B. When all of the features used by the model at model_uri are in a Spark DataFrame in the PySpark
C. When the model at model_uri only uses customer_id as a feature
D. This code block will not perform the desired inference in any situation.
E. When all of the features used by the model at model_uri are in a single Feature Store table
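For context, a hedged sketch of Feature Store batch scoring, assuming the databricks.feature_store client API and that the feature lookups were logged with the model:

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()
# batch_df supplies the lookup key(s); the remaining features are joined in from the Feature Store
predictions_df = fs.score_batch(model_uri, batch_df)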
A data scientist uses 3-fold cross-validation when optimizing model hyperparameters for a regression problem. The following root-mean-squared-error values are calculated on each of the validation folds:
• 10.0
• 12.0
• 17.0
Which of the following values represents the overall cross-validation root-mean-squared error?
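For reference, the overall cross-validation score is the average of the per-fold scores: (10.0 + 12.0 + 17.0) / 3 = 13.0.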