Certified Data Engineer Professional
Question 1
An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code: df = spark.read.format("parquet").load(f"/mnt/source/{date}")
Which code block should be used to create the date Python variable used in the above code block?
- A: date = spark.conf.get("date")
- B: input_dict = input(); date = input_dict["date"]
- C: import sys; date = sys.argv[1]
- D: date = dbutils.notebooks.getParam("date")
- E: dbutils.widgets.text("date", "null"); date = dbutils.widgets.get("date")
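For context, a minimal sketch of the widget-based approach in option E, assuming the code runs in a Databricks notebook where `dbutils` and `spark` are already defined:

```python
# Job parameters passed through the Jobs API populate notebook widgets with matching names.
dbutils.widgets.text("date", "null")            # default used when running interactively
date = dbutils.widgets.get("date")              # value supplied by the upstream system at run time
df = spark.read.format("parquet").load(f"/mnt/source/{date}")
```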
Question 2
A Delta table of weather records is partitioned by date and has the below schema: date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT
To find all the records from within the Arctic Circle, you execute a query with the below filter: latitude > 66.3
Which statement describes how the Delta engine identifies which files to load?
- A: All records are cached to an operational database and then the filter is applied
- B: The Parquet file footers are scanned for min and max statistics for the latitude column
- C: All records are cached to attached storage and then the filter is applied
- D: The Delta log is scanned for min and max statistics for the latitude column
- E: The Hive metastore is scanned for min and max statistics for the latitude column
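For reference, a sketch of such a filtered read (the table location is illustrative); Delta data skipping consults per-file min/max statistics recorded in the transaction log before reading any Parquet data:

```python
# Illustrative path; the predicate is compared against file-level statistics to prune files.
arctic = (spark.read.format("delta")
          .load("/mnt/weather")                 # hypothetical location of the weather table
          .filter("latitude > 66.3"))
arctic.explain()                                # the physical plan shows the pushed-down filter
```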
Question 3
The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector with Databricks. The external database already has data security configured by group membership. These groups map directly to user groups already created in Databricks that represent various teams within the company.
A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users.
Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to use these credentials?
- A: "Manage" permissions should be set on a secret key mapped to those credentials that will be used by a given team.
- B: "Read" permissions should be set on a secret key mapped to those credentials that will be used by a given team.
- C: "Read" permissions should be set on a secret scope containing only those credentials that will be used by a given team.
- D: "Manage" permissions should be set on a secret scope containing only those credentials that will be used by a given team. No additional configuration is necessary as long as all users are configured as administrators in the workspace where secrets have been added.
Question 4
Which indicators would you look for in the Spark UI’s Storage tab to signal that a cached table is not performing optimally? Assume you are using Spark’s MEMORY_ONLY storage level.
- A: Size on Disk is < Size in Memory
- B: The RDD Block Name includes the “*” annotation signaling a failure to cache
- C: Size on Disk is > 0
- D: The number of Cached Partitions > the number of Spark Partitions
- E: On Heap Memory Usage is within 75% of Off Heap Memory Usage
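To surface these indicators, a DataFrame must first be persisted; a minimal sketch using the MEMORY_ONLY level (the DataFrame itself is illustrative):

```python
from pyspark import StorageLevel

# With MEMORY_ONLY, partitions that do not fit are dropped and recomputed rather than spilled to disk,
# so the Storage tab can report fewer cached partitions than total partitions for the RDD.
df = spark.range(0, 100_000_000)                # illustrative DataFrame
df.persist(StorageLevel.MEMORY_ONLY)
df.count()                                      # materialize the cache so it appears in the Storage tab
```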
Question 5
What is the first line of a Databricks Python notebook when viewed in a text editor?
- A: %python
- B: // Databricks notebook source
- C: # Databricks notebook source
- D: -- Databricks notebook source
- E: # MAGIC %python
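For reference, the start of a Python notebook exported to source format looks roughly like this (cell contents are illustrative):

```python
# Databricks notebook source
# MAGIC %md
# MAGIC An example markdown cell

# COMMAND ----------

print("first Python cell")
```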
Question 6
Which statement describes a key benefit of an end-to-end test?
- A: Makes it easier to automate your test suite
- B: Pinpoints errors in the building blocks of your application
- C: Provides testing coverage for all code paths and branches
- D: Closely simulates real world usage of your application
- E: Ensures code is optimized for a real-life workflow
Question 7
The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating that the job run request was submitted successfully includes a run_id field.
Which statement describes what the number alongside this field represents?
- A: The job_id and number of times the job has been run are concatenated and returned.
- B: The total number of jobs that have been run in the workspace.
- C: The number of times the job definition has been run in this workspace.
- D: The job_id is returned in this field.
- E: The globally unique ID of the newly triggered run.
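A hedged sketch of triggering a run through the REST API (host, token, and job ID are placeholders); the run_id returned identifies the newly triggered run:

```python
import requests

HOST = "https://<workspace-url>"                 # placeholder workspace URL
TOKEN = "<personal-access-token>"                # placeholder token

resp = requests.post(
    f"{HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": 123},                        # illustrative job_id
)
print(resp.json()["run_id"])                     # ID of this particular run
```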
Question 8
The data science team has created and logged a production model using MLflow. The model accepts a list of column names and returns a new column of type DOUBLE.
The following code correctly imports the production model, loads the customers table containing the customer_id key column into a DataFrame, and defines the feature columns needed for the model.
Which code block will output a DataFrame with the schema "customer_id LONG, predictions DOUBLE"?
- A: df.map(lambda x:model(x[columns])).select("customer_id, predictions")
- B: df.select("customer_id", model(*columns).alias("predictions"))
- C: model.predict(df, columns)
- D: df.select("customer_id", pandas_udf(model, columns).alias("predictions"))
- E: df.apply(model, columns).select("customer_id, predictions")
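A minimal sketch of the Spark UDF pattern these options allude to, assuming the model is registered in MLflow (the model URI and feature names are illustrative):

```python
import mlflow.pyfunc

# Wrap the registered model as a Spark UDF that returns DOUBLE (URI and columns are illustrative).
model = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn/Production", result_type="double")
columns = ["feature_a", "feature_b", "feature_c"]

predictions_df = df.select("customer_id", model(*columns).alias("predictions"))
```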
Question 9
A nightly batch job is configured to ingest all data files from a cloud object storage container where records are stored in a nested directory structure YYYY/MM/DD. The data for each date represents all records that were processed by the source system on that date, noting that some records may be delayed as they await moderator approval. Each entry represents a user review of a product and has the following schema:
user_id STRING, review_id BIGINT, product_id BIGINT, review_timestamp TIMESTAMP, review_text STRING
The ingestion job is configured to append all data for the previous date to a target table reviews_raw with an identical schema to the source system. The next step in the pipeline is a batch write to propagate all new records inserted into reviews_raw to a table where data is fully deduplicated, validated, and enriched.
Which solution minimizes the compute costs to propagate this batch of data?
- A: Perform a batch read on the reviews_raw table and perform an insert-only merge using the natural composite key user_id, review_id, product_id, review_timestamp.
- B: Configure a Structured Streaming read against the reviews_raw table using the trigger once execution mode to process new records as a batch job.
- C: Use Delta Lake version history to get the difference between the latest version of reviews_raw and one version prior, then write these records to the next table.
- D: Filter all records in the reviews_raw table based on the review_timestamp; batch append those records produced in the last 48 hours.
- E: Reprocess all records in reviews_raw and overwrite the next table in the pipeline.
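A sketch of the trigger-once pattern under hypothetical table and checkpoint names; the streaming checkpoint records which reviews_raw data has already been processed, so each nightly run touches only new records:

```python
# Names and paths are illustrative; dropDuplicates keys on the natural composite key.
(spark.readStream
     .table("reviews_raw")                                    # incremental read of newly appended records
     .dropDuplicates(["user_id", "review_id", "product_id", "review_timestamp"])
     .writeStream
     .option("checkpointLocation", "/mnt/checkpoints/reviews_clean")
     .trigger(once=True)                                      # process available data once, then stop
     .toTable("reviews_clean"))
```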
Question 10
Which statement describes Delta Lake optimized writes?
- A: Before a Jobs cluster terminates, OPTIMIZE is executed on all tables modified during the most recent job.
- B: An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 1 GB.
- C: Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.
- D: Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.
- E: A shuffle occurs prior to writing to try to group similar data together resulting in fewer files instead of each executor writing multiple files based on directory partitions.
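Optimized writes can be enabled per table; a sketch with an illustrative table name:

```python
# Table name is illustrative; the property triggers a pre-write shuffle that bins rows into larger files.
spark.sql("""
    ALTER TABLE sales
    SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true')
""")
```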
Question 11
Which statement describes the default execution mode for Databricks Auto Loader?
- A: Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; the target table is materialized by directly querying all valid files in the source directory.
- B: New files are identified by listing the input directory; the target table is materialized by directly querying all valid files in the source directory.
- C: Webhooks trigger a Databricks job to run anytime new data arrives in a source directory; new data are automatically merged into target tables using rules inferred from the data.
- D: New files are identified by listing the input directory; new files are incrementally and idempotently loaded into the target Delta Lake table.
- E: Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; new files are incrementally and idempotently loaded into the target Delta Lake table.
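A minimal Auto Loader sketch (paths and file format are illustrative); unless cloudFiles.useNotifications is enabled, new files are discovered by listing the input directory and loaded incrementally and idempotently:

```python
# Paths are illustrative; directory listing is the default file-discovery mode.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/schemas/reviews")
      .load("/mnt/landing/reviews"))
```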
Question 12
A Delta Lake table representing metadata about content posts from users has the following schema:
user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE
Based on the above schema, which column is a good candidate for partitioning the Delta Table?
- A: post_time
- B: latitude
- C: post_id
- D: user_id
- E: date
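For illustration, a write partitioned by the low-cardinality date column (the table name is hypothetical):

```python
# Partitioning on date yields a small number of evenly sized partitions, unlike high-cardinality columns.
(df.write
   .format("delta")
   .partitionBy("date")
   .saveAsTable("user_posts"))                   # hypothetical table name
```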
Question 13
The data engineering team has configured a job to process customer requests to be forgotten (have their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default table settings.
The team has decided to process all deletions from the previous week as a batch job at 1am each Sunday. The total duration of this job is less than one hour. Every Monday at 3am, a batch job executes a series of VACUUM commands on all Delta Lake tables throughout the organization.
The compliance officer has recently learned about Delta Lake's time travel functionality. They are concerned that this might allow continued access to deleted data.
Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?
- A: Because the VACUUM command permanently deletes all files containing deleted records, deleted records may be accessible with time travel for around 24 hours.
- B: Because the default data retention threshold is 24 hours, data files containing deleted records will be retained until the VACUUM job is run the following day.
- C: Because Delta Lake time travel provides full access to the entire history of a table, deleted records can always be recreated by users with full admin privileges.
- D: Because Delta Lake's delete statements have ACID guarantees, deleted records will be permanently purged from all storage systems as soon as a delete job completes.
- E: Because the default data retention threshold is 7 days, data files containing deleted records will be retained until the VACUUM job is run 8 days later.
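For reference, a sketch of the retention behavior (table name illustrative); with default settings, VACUUM removes only data files older than the 7-day retention threshold:

```python
# Table name is illustrative; the default delta.deletedFileRetentionDuration is 7 days.
spark.sql("VACUUM user_data")                    # removes files outside the retention window
spark.sql("DESCRIBE HISTORY user_data").show()   # earlier versions stay queryable until their files are vacuumed
```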
Question 14
A large company seeks to implement a near real-time solution involving hundreds of pipelines with parallel updates of many tables with extremely high volume and high velocity data.
Which of the following solutions would you implement to achieve this requirement?
- A: Use Databricks High Concurrency clusters, which leverage optimized cloud storage connections to maximize data throughput.
- B: Partition ingestion tables by a small time duration to allow for many data files to be written in parallel.
- C: Configure Databricks to save all data to attached SSD volumes instead of object storage, increasing file I/O significantly.
- D: Isolate Delta Lake tables in their own storage containers to avoid API limits imposed by cloud vendors.
- E: Store all tables in a single database to ensure that the Databricks Catalyst Metastore can load balance overall throughput.
Question 15
Which describes a method of installing a Python package scoped at the notebook level to all nodes in the currently active cluster?
- A: Run source env/bin/activate in a notebook setup script
- B: Use b in a notebook cell
- C: Use %pip install in a notebook cell
- D: Use %sh pip install in a notebook cell
- E: Install libraries from PyPI using the cluster UI
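For reference, a notebook-scoped install is a single magic command in a notebook cell (the package is illustrative):

```python
# %pip is a notebook magic; it installs the package for this notebook's environment on every cluster node.
%pip install requests
```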
Question 16
Each configuration below is identical to the extent that each cluster has 400 GB total of RAM, 160 total cores, and only one Executor per VM.
Given an extremely long-running job for which completion must be guaranteed, which cluster configuration will be able to guarantee completion of the job in light of one or more VM failures?
- A: Total VMs: 8; 50 GB per Executor; 20 cores per Executor
- B: Total VMs: 16; 25 GB per Executor; 10 cores per Executor
- C: Total VMs: 1; 400 GB per Executor; 160 cores per Executor
- D: Total VMs: 4; 100 GB per Executor; 40 cores per Executor
- E: Total VMs: 2; 200 GB per Executor; 80 cores per Executor
Question 17
A Delta Lake table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.
Immediately after each update succeeds, the data engineering team would like to determine the difference between the new version and the previous version of the table.
Given the current implementation, which method can be used?
- A: Execute a query to calculate the difference between the new version and the previous version using Delta Lake’s built-in versioning and time travel functionality.
- B: Parse the Delta Lake transaction log to identify all newly written data files.
- C: Parse the Spark event logs to identify those rows that were updated, inserted, or deleted.
- D: Execute DESCRIBE HISTORY customer_churn_params to obtain the full operation metrics for the update, including a log of all records that have been added or modified.
- E: Use Delta Lake’s change data feed to identify those records that have been updated, inserted, or deleted.
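Given the current overwrite-based implementation, the difference can be computed with time travel; a sketch with an illustrative prior version number:

```python
# The version number is illustrative; DESCRIBE HISTORY shows the version written by the previous job.
current = spark.sql("SELECT * FROM customer_churn_params")
previous = spark.sql("SELECT * FROM customer_churn_params VERSION AS OF 10")

added = current.exceptAll(previous)              # rows present now but not in the prior version
removed = previous.exceptAll(current)            # rows dropped by the latest overwrite
```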
Question 18
A data team’s Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new promotion, and they would like to add a new field to track the number of times this promotion code is used for each item. A junior data engineer suggests updating the existing query as follows. Note that proposed changes are in bold.
Original query:
Proposed query:
Which step must also be completed to put the proposed query into production?
- A: Specify a new checkpointLocation
- B: Remove .option('mergeSchema', 'true') from the streaming write
- C: Increase the shuffle partitions to account for additional aggregates
- D: Run REFRESH TABLE delta.`/item_agg`
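Because the aggregation logic changes, the restarted query cannot reuse its old state; a sketch of the write with a fresh checkpoint location (the DataFrame and paths are illustrative, not the hidden queries above):

```python
# updated_agg_df stands in for the revised aggregation; the key change is the new checkpointLocation.
(updated_agg_df.writeStream
    .outputMode("complete")
    .option("checkpointLocation", "/mnt/checkpoints/item_agg_v2")   # new location for the changed query
    .toTable("item_agg"))
```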
Question 19
When using CLI or REST API to get results from jobs with multiple tasks, which statement correctly describes the response structure?
- A: Each run of a job will have a unique job_id; all tasks within this job will have a unique job_id
- B: Each run of a job will have a unique job_id; all tasks within this job will have a unique task_id
- C: Each run of a job will have a unique orchestration_id; all tasks within this job will have a unique run_id
- D: Each run of a job will have a unique run_id; all tasks within this job will have a unique task_id
- E: Each run of a job will have a unique run_id; all tasks within this job will also have a unique run_id
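A hedged sketch of inspecting a multi-task run through the REST API (host, token, and run ID are placeholders); the parent run has one run_id, and each entry in the tasks array carries its own run_id:

```python
import requests

HOST = "https://<workspace-url>"                 # placeholder workspace URL
TOKEN = "<personal-access-token>"                # placeholder token

run = requests.get(
    f"{HOST}/api/2.1/jobs/runs/get",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"run_id": 456},                      # illustrative parent run_id
).json()

for task in run.get("tasks", []):
    print(task["task_key"], task["run_id"])      # each task run has its own run_id
```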
Question 20
The data engineering team is configuring environments for development, testing, and production before beginning migration on a new data pipeline. The team requires extensive testing on both the code and data resulting from code execution, and the team wants to develop and test against data as similar to production data as possible.
A junior data engineer suggests that production data can be mounted to the development and testing environments, allowing pre-production code to execute against production data. Because all users have admin privileges in the development environment, the junior data engineer has offered to configure permissions and mount this data for the team.
Which statement captures best practices for this situation?
- A: All development, testing, and production code and data should exist in a single, unified workspace; creating separate environments for testing and development complicates administrative overhead.
- B: In environments where interactive code will be executed, production data should only be accessible with read permissions; creating isolated databases for each environment further reduces risks.
- C: As long as code in the development environment declares USE dev_db at the top of each notebook, there is no possibility of inadvertently committing changes back to production data sources.
- D: Because Delta Lake versions all data and supports time travel, it is not possible for user error or malicious actors to permanently delete production data; as such, it is generally safe to mount production data anywhere.
- E: Because access to production data will always be verified using passthrough credentials, it is safe to mount data to any Databricks development environment.
Question 21
A data engineer, User A, has promoted a pipeline to production by using the REST API to programmatically create several jobs. A DevOps engineer, User B, has configured an external orchestration tool to trigger job runs through the REST API. Both users authorized the REST API calls using their personal access tokens.
A workspace admin, User C, inherits responsibility for managing this pipeline. User C uses the Databricks Jobs UI to take "Owner" privileges of each job. Jobs continue to be triggered using the credentials and tooling configured by User B.
An application has been configured to collect and parse run information returned by the REST API. Which statement describes the value returned in the creator_user_name field?
- A: Once User C takes "Owner" privileges, their email address will appear in this field; prior to this, User A’s email address will appear in this field.
- B: User B’s email address will always appear in this field, as their credentials are always used to trigger the run.
- C: User A’s email address will always appear in this field, as they still own the underlying notebooks.
- D: Once User C takes "Owner" privileges, their email address will appear in this field; prior to this, User B’s email address will appear in this field.
- E: User C will only ever appear in this field if they manually trigger the job, otherwise it will indicate User B.
Question 22
A member of the data engineering team has submitted a short notebook that they wish to schedule as part of a larger data pipeline. Assume that the commands provided below produce the logically correct results when run as presented.
Which command should be removed from the notebook before scheduling it as a job?
- A: Cmd 2
- B: Cmd 3
- C: Cmd 4
- D: Cmd 5
Question 23
Which statement regarding Spark configuration on the Databricks platform is true?
- A: The Databricks REST API can be used to modify the Spark configuration properties for an interactive cluster without interrupting jobs currently running on the cluster.
- B: Spark configurations set within a notebook will affect all SparkSessions attached to the same interactive cluster.
- C: When the same Spark configuration property is set for an interactive cluster and a notebook attached to that cluster, the notebook setting will always be ignored.
- D: Spark configuration properties set for an interactive cluster with the Clusters UI will impact all notebooks attached to that cluster.
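For contrast, a value set in a notebook affects only that notebook's SparkSession, while a value set in the Clusters UI applies to every notebook attached to the cluster; a minimal sketch:

```python
# Overrides the cluster-level value for this notebook's SparkSession only.
spark.conf.set("spark.sql.shuffle.partitions", "64")
print(spark.conf.get("spark.sql.shuffle.partitions"))
```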
Question 24
A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create.
Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?
- A: Three new jobs named "Ingest new data" will be defined in the workspace, and they will each run once daily.
- B: The logic defined in the referenced notebook will be executed three times on new clusters with the configurations of the provided cluster ID.
- C: Three new jobs named "Ingest new data" will be defined in the workspace, but no jobs will be executed.
- D: One new job named "Ingest new data" will be defined in the workspace, but it will not be executed.
- E: The logic defined in the referenced notebook will be executed three times on the referenced existing all purpose cluster.
Question 25
The business reporting team requires that data for their dashboards be updated every hour. The pipeline that extracts, transforms, and loads the data for their dashboards completes in 10 minutes.
Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?
- A: Configure a job that executes every time new data lands in a given directory
- B: Schedule a job to execute the pipeline once an hour on a new job cluster
- C: Schedule a Structured Streaming job with a trigger interval of 60 minutes
- D: Schedule a job to execute the pipeline once an hour on a dedicated interactive cluster