Certified Data Engineer Professional

By Databricks
Aug 2025


Question 26

A Databricks SQL dashboard has been configured to monitor the total number of records present in a collection of Delta Lake tables using the following query pattern:

SELECT COUNT(*) FROM table

Which of the following describes how results are generated each time the dashboard is updated?

  • A: The total count of rows is calculated by scanning all data files
  • B: The total count of rows will be returned from cached results unless REFRESH is run
  • C: The total count of records is calculated from the Delta transaction logs
  • D: The total count of records is calculated from the parquet file metadata

Question 27

A Delta Lake table was created with the below query:

Image 1

Consider the following query:

DROP TABLE prod.sales_by_store

If this statement is executed by a workspace admin, which result will occur?

  • A: Data will be marked as deleted but still recoverable with Time Travel.
  • B: The table will be removed from the catalog but the data will remain in storage.
  • C: The table will be removed from the catalog and the data will be deleted.
  • D: An error will occur because Delta Lake prevents the deletion of production data.

Question 28

A developer has successfully configured their credentials for Databricks Repos and cloned a remote Git repository. They do not have privileges to make changes to the main branch, which is the only branch currently visible in their workspace.

Which approach allows this user to share their code updates without the risk of overwriting the work of their teammates?

  • A: Use Repos to create a new branch, commit all changes, and push changes to the remote Git repository.
  • B: Use Repos to create a fork of the remote repository, commit all changes, and make a pull request on the source repository.
  • C: Use Repos to pull changes from the remote Git repository; commit and push changes to a branch that appeared as changes were pulled.
  • D: Use Repos to merge all differences and make a pull request back to the remote repository.

Question 29

The security team is exploring whether or not the Databricks secrets module can be leveraged for connecting to an external database.

After testing the code with all Python variables being defined with strings, they upload the password to the secrets module and configure the correct permissions for the currently active user. They then modify their code to the following (leaving all other variables unchanged).

Image 1

Which statement describes what will happen when the above code is executed?

  • A: The connection to the external table will succeed; the string "REDACTED" will be printed.
  • B: An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the encoded password will be saved to DBFS.
  • C: An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the password will be printed in plain text.
  • D: The connection to the external table will succeed; the string value of password will be printed in plain text.
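
For reference, a minimal sketch of the pattern this question describes, assuming a hypothetical secret scope "db-creds" and key "db-password"; dbutils.secrets.get returns the secret value, and notebook output redacts it when printed:

# Databricks notebook: `spark` and `dbutils` are provided by the runtime.
# Scope, key, and connection details below are hypothetical.
password = dbutils.secrets.get(scope="db-creds", key="db-password")

# Printing a secret value in a notebook displays [REDACTED] rather than the plain text.
print(password)

# The value is still usable programmatically, e.g. for a JDBC connection.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/prod")   # hypothetical URL
      .option("dbtable", "customers")                         # hypothetical table
      .option("user", "service_account")                      # hypothetical user
      .option("password", password)
      .load())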

Question 30

The data science team has created and logged a production model using MLflow. The model accepts a list of column names and returns a new column of type DOUBLE.

The following code correctly imports the production model, loads the customers table containing the customer_id key column into a DataFrame, and defines the feature columns needed for the model.

Image 1

Which code block will output a DataFrame with the schema "customer_id LONG, predictions DOUBLE"?

  • A: df.map(lambda x:model(x[columns])).select("customer_id, predictions")
  • B: df.select("customer_id", model(*columns).alias("predictions"))
  • C: model.predict(df, columns)
  • D: df.apply(model, columns).select("customer_id, predictions")
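
As a reference point, a minimal sketch of applying a logged model as a Spark UDF, the pattern option B assumes; the model URI and feature column names are hypothetical:

import mlflow.pyfunc

# Hypothetical model URI and feature columns for illustration.
model = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn_model/Production")
columns = ["feature_a", "feature_b", "feature_c"]

df = spark.table("customers")

# Produces a DataFrame with schema "customer_id LONG, predictions DOUBLE".
predictions_df = df.select("customer_id", model(*columns).alias("predictions"))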

Question 31

A junior member of the data engineering team is exploring the language interoperability of Databricks notebooks. The intended outcome of the below code is to register a view of all sales that occurred in countries on the continent of Africa that appear in the geo_lookup table.

Before executing the code, running SHOW TABLES on the current database indicates the database contains only two tables: geo_lookup and sales.

Image 1

What will be the outcome of executing these command cells in order in an interactive notebook?

  • A: Both commands will succeed. Executing SHOW TABLES will show that countries_af and sales_af have been registered as views.
  • B: Cmd 1 will succeed. Cmd 2 will search all accessible databases for a table or view named countries_af: if this entity exists, Cmd 2 will succeed.
  • C: Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable representing a PySpark DataFrame.
  • D: Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable containing a list of strings.
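
A sketch of the interoperability point being tested: a DataFrame assigned to a Python variable is not visible to SQL until it is registered as a temporary view. The filter and join logic below are hypothetical, since the original cells appear only as an image:

# Cmd 1 (%python): creates a Python object, not a catalog table or view.
countries_af = spark.table("geo_lookup").filter("continent = 'AF'")   # hypothetical filter

# A SQL cell can only reference the name after it is registered as a view.
countries_af.createOrReplaceTempView("countries_af")

# Cmd 2 (%sql equivalent): the name now resolves against the temp view.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW sales_af AS
    SELECT s.*
    FROM sales s
    JOIN countries_af c ON s.country = c.country    -- hypothetical join key
""")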

Question 32

The data science team has requested assistance in accelerating queries on free-form text from user reviews. The data is currently stored in Parquet with the below schema:

item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING

The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field.

A junior data engineer suggests converting this data to Delta Lake will improve query performance.

Which response to the junior data engineer’s suggestion is correct?

  • A: Delta Lake statistics are not optimized for free text fields with high cardinality.
  • B: Delta Lake statistics are only collected on the first 4 columns in a table.
  • C: ZORDER ON review will need to be run to see performance gains.
  • D: The Delta log creates a term matrix for free text fields to support selective filtering.

Question 33

The data engineering team has configured a job to process customer requests to be forgotten (have their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default table settings.

The team has decided to process all deletions from the previous week as a batch job at 1am each Sunday. The total duration of this job is less than one hour. Every Monday at 3am, a batch job executes a series of VACUUM commands on all Delta Lake tables throughout the organization.

The compliance officer has recently learned about Delta Lake's time travel functionality. They are concerned that this might allow continued access to deleted data.

Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?

  • A: Because the VACUUM command permanently deletes all files containing deleted records, deleted records may be accessible with time travel for around 24 hours.
  • B: Because the default data retention threshold is 24 hours, data files containing deleted records will be retained until the VACUUM job is run the following day.
  • C: Because the default data retention threshold is 7 days, data files containing deleted records will be retained until the VACUUM job is run 8 days later.
  • D: Because Delta Lake's delete statements have ACID guarantees, deleted records will be permanently purged from all storage systems as soon as a delete job completes.
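
A minimal sketch of the retention mechanics this question turns on, using default table settings; the table name and IDs are hypothetical:

# DELETE writes new data files without the removed rows; the old files remain in
# storage and stay reachable through time travel until they are vacuumed.
spark.sql("DELETE FROM prod.user_data WHERE user_id IN (1001, 1002, 1003)")  # hypothetical forget requests

# With default settings (delta.deletedFileRetentionDuration = 7 days), VACUUM only
# removes data files older than the 7-day threshold, so a VACUUM run the next day
# does not yet purge the files written before this week's delete job.
spark.sql("VACUUM prod.user_data")                     # default 7-day retention
spark.sql("VACUUM prod.user_data RETAIN 168 HOURS")    # equivalent explicit form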

Question 34

Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted with DBFS for use with a production job?

  • A: configure
  • B: fs
  • C: workspace
  • D: libraries

Question 35

An upstream system is emitting change data capture (CDC) logs that are being written to a cloud object storage directory. Each record in the log indicates the change type (insert, update, or delete) and the values for each field after the change. The source table has a primary key identified by the field pk_id.

For auditing purposes, the data governance team wishes to maintain a full record of all values that have ever been valid in the source system. For analytical purposes, only the most recent value for each record needs to be recorded. The Databricks job to ingest these records occurs once per hour, but each individual record may have changed multiple times over the course of an hour.

Which solution meets these requirements?

  • A: Create a separate history table for each pk_id; resolve the current state of the table by running a union all and filtering the history tables for the most recent state.
  • B: Use MERGE INTO to insert, update, or delete the most recent entry for each pk_id into a bronze table, then propagate all changes throughout the system.
  • C: Iterate through an ordered set of changes to the table, applying each in turn; rely on Delta Lake's versioning ability to create an audit log.
  • D: Use Delta Lake's change data feed to automatically process CDC data from an external system, propagating all changes to all dependent tables in the Lakehouse.
  • E: Ingest all log information into a bronze table; use MERGE INTO to insert, update, or delete the most recent entry for each pk_id into a silver table to recreate the current table state.
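
A sketch of the bronze-plus-silver pattern described in option E, assuming cdc_batch is the hour's CDC log as a DataFrame and that change_type and change_time are its metadata columns (all names hypothetical):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Bronze: append every CDC record unchanged to preserve the full audit history.
cdc_batch.write.format("delta").mode("append").saveAsTable("bronze.cdc_log")

# Keep only the latest change per pk_id within the batch before merging.
w = Window.partitionBy("pk_id").orderBy(F.col("change_time").desc())
latest = (cdc_batch.withColumn("rn", F.row_number().over(w))
          .filter("rn = 1")
          .drop("rn"))
latest.createOrReplaceTempView("latest_changes")

# Silver: MERGE applies inserts and updates and honors deletes to keep the current state.
spark.sql("""
    MERGE INTO silver.current_state t
    USING latest_changes s
    ON t.pk_id = s.pk_id
    WHEN MATCHED AND s.change_type = 'delete' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED AND s.change_type != 'delete' THEN INSERT *
""")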

Question 36

The following table consists of items found in user carts within an e-commerce website.

Image 1

The following MERGE statement is used to update this table using an updates view, with schema evolution enabled on this table.

Image 2

How would the following update be handled?

Image 3
  • A: The update throws an error because changes to existing columns in the target schema are not supported.
  • B: The new nested field is added to the target schema, and dynamically read as NULL for existing unmatched records.
  • C: The update is moved to a separate "rescued" column because it is missing a column expected in the target schema.
  • D: The new nested field is added to the target schema, and files underlying existing records are updated to include NULL values for the new field.
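
A hedged sketch of a MERGE with automatic schema evolution enabled, which is the behavior this question probes; the table, join key, and new nested field are hypothetical because the original statements appear only as images:

# Enable automatic schema evolution for MERGE statements in this session.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# If `updates` carries a nested field the target lacks, the MERGE adds that field
# to the target schema. Existing rows are not rewritten; they simply read the new
# field as NULL.
spark.sql("""
    MERGE INTO cart_items t
    USING updates s
    ON t.cart_id = s.cart_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")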

Question 37

An upstream system is emitting change data capture (CDC) logs that are being written to a cloud object storage directory. Each record in the log indicates the change type (insert, update, or delete) and the values for each field after the change. The source table has a primary key identified by the field pk_id.

For auditing purposes, the data governance team wishes to maintain a full record of all values that have ever been valid in the source system. For analytical purposes, only the most recent value for each record needs to be recorded. The Databricks job to ingest these records occurs once per hour, but each individual record may have changed multiple times over the course of an hour.

Which solution meets these requirements?

  • A: Iterate through an ordered set of changes to the table, applying each in turn to create the current state of the table; record the change type (insert, update, delete), the timestamp of the change, and the values.
  • B: Use merge into to insert, update, or delete the most recent entry for each pk_id into a table, then propagate all changes throughout the system.
  • C: Deduplicate records in each batch by pk_id and overwrite the target table.
  • D: Use Delta Lake’s change data feed to automatically process CDC data from an external system, propagating all changes to all dependent tables in the Lakehouse.

Question 38

An hourly batch job is configured to ingest data files from a cloud object storage container where each batch represents all records produced by the source system in a given hour. The batch job to process these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field represents a unique key for the data, which has the following schema:

user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT

New records are all ingested into a table named account_history which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id.

Which implementation can be used to efficiently update the described account_current table as part of each hourly batch job assuming there are millions of user accounts and tens of thousands of records processed hourly?

  • A: Filter records in account_history using the last_updated field and the most recent hour processed, making sure to deduplicate on username; write a merge statement to update or insert the most recent value for each username.
  • B: Use Auto Loader to subscribe to new files in the account_history directory; configure a Structured Streaming job with an available-now trigger to batch update newly detected files into the account_current table.
  • C: Overwrite the account_current table with each batch using the results of a query against the account_history table grouping by user_id and filtering for the max value of last_updated.
  • D: Filter records in account_history using the last_updated field and the most recent hour processed, as well as the max last_login by user_id; write a merge statement to update or insert the most recent value for each user_id.
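
A sketch of the incremental upsert pattern the question is targeting, assuming the hour's boundaries are supplied as batch_start and batch_end (hypothetical variables):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Only records that arrived in the most recent hour need to be considered.
new_rows = (spark.table("account_history")
            .filter((F.col("last_updated") >= batch_start) &
                    (F.col("last_updated") < batch_end)))

# Deduplicate to one row per user_id (the unique key), keeping the latest update.
w = Window.partitionBy("user_id").orderBy(F.col("last_updated").desc())
latest = (new_rows.withColumn("rn", F.row_number().over(w))
          .filter("rn = 1")
          .drop("rn"))
latest.createOrReplaceTempView("account_updates")

# Type 1 behavior: overwrite the existing row for each user_id or insert a new one.
spark.sql("""
    MERGE INTO account_current t
    USING account_updates s
    ON t.user_id = s.user_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")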

Question 39

The business intelligence team has a dashboard configured to track various summary metrics for retail stores. This includes total sales for the previous day alongside totals and averages for a variety of time periods. The fields required to populate this dashboard have the following schema:

Image 1

For demand forecasting, the Lakehouse contains a validated table of all itemized sales updated incrementally in near real-time. This table, named products_per_order, includes the following fields:

Image 2

Because reporting on long-term sales trends is less volatile, analysts using the new dashboard only require data to be refreshed once daily. Because the dashboard will be queried interactively by many users throughout a normal business day, it should return results quickly and reduce total compute associated with each materialization.

Which solution meets the expectations of the end users while controlling and limiting possible costs?

  • A: Populate the dashboard by configuring a nightly batch job to save the required values as a table overwritten with each update.
  • B: Use Structured Streaming to configure a live dashboard against the products_per_order table within a Databricks notebook.
  • C: Define a view against the products_per_order table and define the dashboard against this view.
  • D: Use the Delta Cache to persist the products_per_order table in memory to quickly update the dashboard with each query.
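
A sketch of the nightly-batch approach in option A: pre-aggregate products_per_order once per day and overwrite a small summary table that the dashboard queries directly. The aggregation columns are hypothetical, since the schemas appear only as images:

from pyspark.sql import functions as F

# Nightly job: compute the dashboard metrics once, then overwrite a small gold
# table so interactive dashboard queries stay fast and cheap.
daily_summary = (spark.table("products_per_order")
                 .groupBy("store_id",
                          F.to_date("order_timestamp").alias("order_date"))  # hypothetical columns
                 .agg(F.sum("sales_amount").alias("total_sales"),            # hypothetical column
                      F.avg("sales_amount").alias("avg_sale")))

(daily_summary.write
 .format("delta")
 .mode("overwrite")
 .saveAsTable("gold.store_sales_summary"))   # hypothetical table name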

Question 40

A Delta Lake table in the Lakehouse named customer_churn_params, with change data feed (CDF) enabled, is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.

Which approach would simplify the identification of these changed records?

  • A: Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.
  • B: Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.
  • C: Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.
  • D: Modify the overwrite logic to include a field populated by calling spark.sql.functions.current_timestamp() as data are being written; use this field to identify records written on a particular date.
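
A sketch of the change data feed read that option C relies on, assuming the nightly overwrite is replaced with a MERGE so the feed captures row-level changes; the 24-hour lookback follows the question:

from datetime import datetime, timedelta

start_ts = (datetime.utcnow() - timedelta(days=1)).strftime("%Y-%m-%d %H:%M:%S")

# Read only the rows that changed in the last 24 hours from the change data feed.
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingTimestamp", start_ts)
           .table("customer_churn_params"))

# Keep inserts and post-update images; these are the records to score with the model.
to_score = changes.filter("_change_type IN ('insert', 'update_postimage')")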

Question 41

A view is registered with the following code:

Image 1

Both users and orders are Delta Lake tables.

Which statement describes the results of querying recent_orders?

  • A: All logic will execute when the view is defined and store the result of joining tables to the DBFS; this stored data will be returned when the view is queried.
  • B: Results will be computed and cached when the view is defined; these cached results will incrementally update as new records are inserted into source tables.
  • C: All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.
  • D: All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.

Question 42

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

  • A: Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.
  • B: Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.
  • C: Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB bytes, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.
  • D: Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB* 1024*1024/512), and then write to parquet.
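
A sketch of the read-time sizing approach in option A; the paths are hypothetical:

# Each input partition reads roughly 512 MB, and narrow transformations preserve
# partitioning, so the Parquet writer emits part-files near the target size
# without any repartition or shuffle.
spark.conf.set("spark.sql.files.maxPartitionBytes", 512 * 1024 * 1024)

df = spark.read.json("s3://raw-bucket/events/")            # hypothetical source path

cleaned = df.filter("event_type IS NOT NULL")              # narrow transformation only

cleaned.write.mode("overwrite").parquet("s3://lake-bucket/events_parquet/")  # hypothetical target path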

Question 43

Which statement regarding stream-static joins and static Delta tables is correct?

  • A: The checkpoint directory will be used to track updates to the static Delta table.
  • B: Each microbatch of a stream-static join will use the most recent version of the static Delta table as of the job's initialization.
  • C: The checkpoint directory will be used to track state information for the unique keys present in the join.
  • D: Stream-static joins cannot use static Delta tables because of consistency issues.

Question 44

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

Streaming DataFrame df has the following schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

Code block:

Image 1

Which line of code correctly fills in the blank within the code block to complete this task?

  • A: to_interval("event_time", "5 minutes").alias("time")
  • B: window("event_time", "5 minutes").alias("time")
  • C: "event_time"
  • D: lag("event_time", "10 minutes").alias("time")
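
A sketch of the completed aggregation using the window function from option B; the 10-minute watermark is an added assumption for streaming state cleanup, and the columns follow the schema in the question:

from pyspark.sql import functions as F

agg_df = (df.withWatermark("event_time", "10 minutes")                       # assumed watermark for late data
          .groupBy(F.window("event_time", "5 minutes").alias("time"))        # non-overlapping (tumbling) windows
          .agg(F.avg("temp").alias("avg_temp"),
               F.avg("humidity").alias("avg_humidity")))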

Question 45

A Structured Streaming job deployed to production has been resulting in higher than expected cloud storage costs. At present, during normal execution, each microbatch of data is processed in less than 3s; at least 12 times per minute, a microbatch is processed that contains 0 records. The streaming write was configured using the default trigger settings. The production job is currently scheduled alongside many other Databricks jobs in a workspace with instance pools provisioned to reduce start-up time for jobs with batch execution.

Holding all other variables constant and assuming records need to be processed in less than 10 minutes, which adjustment will meet the requirement?

  • A: Set the trigger interval to 3 seconds; the default trigger interval is consuming too many records per batch, resulting in spill to disk that can increase volume costs.
  • B: Use the trigger once option and configure a Databricks job to execute the query every 10 minutes; this approach minimizes costs for both compute and storage.
  • C: Set the trigger interval to 10 minutes; each batch calls APIs in the source storage account, so decreasing trigger frequency to maximum allowable threshold should minimize this cost.
  • D: Set the trigger interval to 500 milliseconds; setting a small but non-zero trigger interval ensures that the source is not queried too frequently.
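
A sketch of the approach in option B: run the stream as a scheduled one-shot job so the source storage account is listed only when the job runs, rather than every few seconds; table names and paths are hypothetical:

# Scheduled (e.g. every 10 minutes) Databricks job: process everything available,
# then stop, instead of polling the source continuously.
query = (spark.readStream
         .table("bronze.events")                                            # hypothetical source table
         .writeStream
         .option("checkpointLocation", "dbfs:/checkpoints/events_silver")   # hypothetical path
         .trigger(availableNow=True)       # or trigger(once=True) on older runtimes
         .toTable("silver.events"))        # hypothetical target table

query.awaitTermination()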
