An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code: df = spark.read.format("parquet").load(f"/mnt/source/{date}")
Which code block should be used to create the date Python variable used in the above code block?
A. date = spark.conf.get("date")
B. input_dict = input()
   date = input_dict["date"]
C. import sys
   date = sys.argv[1]
D. date = dbutils.notebooks.getParam("date")
E. dbutils.widgets.text("date", "null")
   date = dbutils.widgets.get("date")
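For context, a minimal sketch of how a job parameter named "date" can be surfaced inside a scheduled notebook through Databricks widgets; the mount path and default value here are illustrative:

# In a Databricks notebook, spark and dbutils are provided by the runtime.
# Register a text widget so the Jobs API can pass "date" as a notebook parameter;
# "null" is only a placeholder default for interactive runs.
dbutils.widgets.text("date", "null")
date = dbutils.widgets.get("date")

# Use the parameter to build the load path (illustrative mount point).
df = spark.read.format("parquet").load(f"/mnt/source/{date}")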
A Delta table of weather records is partitioned by date and has the below schema: date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT
To find all the records from within the Arctic Circle, you execute a query with the below filter: latitude > 66.3
Which statement describes how the Delta engine identifies which files to load?
A. All records are cached to an operational database and then the filter is applied
B. The Parquet file footers are scanned for min and max statistics for the latitude column
C. All records are cached to attached storage and then the filter is applied
D. The Delta log is scanned for min and max statistics for the latitude column
E. The Hive metastore is scanned for min and max statistics for the latitude column
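As background, a minimal sketch of the table and query in question; the table name weather is assumed. Delta Lake records per-file min and max column statistics in its transaction log, which the engine can consult to skip files whose latitude range cannot match the filter:

from pyspark.sql import functions as F

# Illustrative Delta table partitioned by date (matches the schema above).
spark.sql("""
    CREATE TABLE IF NOT EXISTS weather (
        date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT
    ) USING DELTA
    PARTITIONED BY (date)
""")

# Filter on a non-partition column; file skipping relies on column statistics.
arctic = spark.table("weather").filter(F.col("latitude") > 66.3)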
The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector with Databricks. The external database already has data security configured by group membership. These groups map directly to user groups already created in Databricks that represent various teams within the company.
A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users.
Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to use these credentials?
A"Manage" permissions should be set on a secret key mapped to those credentials that will be used by a given team.
B"Read" permissions should be set on a secret key mapped to those credentials that will be used by a given team.
C"Read" permissions should be set on a secret scope containing only those credentials that will be used by a given team.
D"Manage" permissions should be set on a secret scope containing only those credentials that will be used by a given team.
No additional configuration is necessary as long as all users are configured as administrators in the workspace where secrets have been added.
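For reference, a minimal sketch of how a notebook reads credentials through the Databricks Utilities Secrets module; the scope, key, and connection details are hypothetical:

# A user whose group can read the "team-finance" scope retrieves the
# credential at runtime without it ever appearing in plaintext.
user = dbutils.secrets.get(scope="team-finance", key="db-username")
password = dbutils.secrets.get(scope="team-finance", key="db-password")

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://example-host:5432/sales")  # hypothetical database
      .option("dbtable", "transactions")
      .option("user", user)
      .option("password", password)
      .load())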
Which indicators would you look for in the Spark UI’s Storage tab to signal that a cached table is not performing optimally? Assume you are using Spark’s MEMORY_ONLY storage level.
A. Size on Disk is < Size in Memory
B. The RDD Block Name includes the "*" annotation signaling a failure to cache
C. Size on Disk is > 0
D. The number of Cached Partitions > the number of Spark Partitions
E. On Heap Memory Usage is within 75% of Off Heap Memory Usage
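For context, a minimal sketch of caching a table with the MEMORY_ONLY storage level so it appears in the Storage tab; the table name is hypothetical:

from pyspark import StorageLevel

df = spark.table("item_sales")           # hypothetical table
df.persist(StorageLevel.MEMORY_ONLY)     # this level does not spill to disk
df.count()                               # an action materializes the cache

# The Storage tab then reports Cached Partitions, Fraction Cached,
# Size in Memory, and Size on Disk for this cached entity.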
What is the first line of a Databricks Python notebook when viewed in a text editor?
A. %python
B. // Databricks notebook source
C. # Databricks notebook source
D. -- Databricks notebook source
E. # MAGIC %python
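For reference, a Python notebook exported from Databricks as a source file follows this layout (the cell contents are illustrative):

# Databricks notebook source
print("first cell")

# COMMAND ----------

# MAGIC %sql
# MAGIC SELECT 1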
Which statement describes a key benefit of an end-to-end test?
A. Makes it easier to automate your test suite
B. Pinpoints errors in the building blocks of your application
C. Provides testing coverage for all code paths and branches
D. Closely simulates real world usage of your application
E. Ensures code is optimized for a real-life workflow
The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating that the job run request has been submitted successfully includes a field named run_id.
Which statement describes what the number alongside this field represents?
A. The job_id and the number of times the job has been run are concatenated and returned.
B. The total number of jobs that have been run in the workspace.
C. The number of times the job definition has been run in this workspace.
D. The job_id is returned in this field.
E. The globally unique ID of the newly triggered run.
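As background, a minimal sketch of triggering a run through the Jobs REST API directly; the workspace URL, token, and job_id values are hypothetical:

import requests

resp = requests.post(
    "https://<workspace-url>/api/2.1/jobs/run-now",        # hypothetical workspace
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={"job_id": 123},
)
run_id = resp.json()["run_id"]   # identifier of this specific triggered run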
The data science team has created and logged a production model using MLflow. The model accepts a list of column names and returns a new column of type DOUBLE.
The following code correctly imports the production model, loads the customers table containing the customer_id key column into a DataFrame, and defines the feature columns needed for the model.
Which code block will output a DataFrame with the schema "customer_id LONG, predictions DOUBLE"?
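As context (not the code referenced above), one common pattern for applying a registered MLflow model to a Spark DataFrame uses mlflow.pyfunc.spark_udf; the model name, stage, and feature columns here are illustrative:

import mlflow
from pyspark.sql import functions as F

# Load the production model as a Spark UDF (model name is hypothetical).
predict = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn_model/Production")

columns = ["feature_1", "feature_2", "feature_3"]   # illustrative feature columns
df = spark.table("customers")

preds = df.select("customer_id",
                  predict(*[F.col(c) for c in columns]).alias("predictions"))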
A nightly batch job is configured to ingest all data files from a cloud object storage container where records are stored in a nested directory structure YYYY/MM/DD. The data for each date represents all records that were processed by the source system on that date, noting that some records may be delayed as they await moderator approval. Each entry represents a user review of a product and has the following schema:
The ingestion job is configured to append all data for the previous date to a target table reviews_raw with an identical schema to the source system. The next step in the pipeline is a batch write to propagate all new records inserted into reviews_raw to a table where data is fully deduplicated, validated, and enriched.
Which solution minimizes the compute costs to propagate this batch of data?
A. Perform a batch read on the reviews_raw table and perform an insert-only merge using the natural composite key user_id, review_id, product_id, review_timestamp.
B. Configure a Structured Streaming read against the reviews_raw table using the trigger once execution mode to process new records as a batch job.
C. Use Delta Lake version history to get the difference between the latest version of reviews_raw and one version prior, then write these records to the next table.
D. Filter all records in the reviews_raw table based on the review_timestamp; batch append those records produced in the last 48 hours.
E. Reprocess all records in reviews_raw and overwrite the next table in the pipeline.
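For reference, a minimal sketch of incrementally propagating new records with Structured Streaming and a one-time trigger; the checkpoint path and target table name are hypothetical:

(spark.readStream
      .table("reviews_raw")
      .writeStream
      .option("checkpointLocation", "/checkpoints/reviews_clean")  # hypothetical path
      .trigger(availableNow=True)       # or trigger(once=True) on older runtimes
      .toTable("reviews_clean"))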
Which statement describes Delta Lake optimized writes?
A. Before a Jobs cluster terminates, OPTIMIZE is executed on all tables modified during the most recent job.
B. An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 1 GB.
C. Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.
D. Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.
E. A shuffle occurs prior to writing to try to group similar data together, resulting in fewer files instead of each executor writing multiple files based on directory partitions.
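For context, two common ways optimized writes are enabled, assuming a hypothetical table named item_sales:

# Session-level setting
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

# Table-level property
spark.sql("""
    ALTER TABLE item_sales
    SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true')
""")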
Which statement describes the default execution mode for Databricks Auto Loader?
A. Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; the target table is materialized by directly querying all valid files in the source directory.
B. New files are identified by listing the input directory; the target table is materialized by directly querying all valid files in the source directory.
C. Webhooks trigger a Databricks job to run anytime new data arrives in a source directory; new data are automatically merged into target tables using rules inferred from the data.
D. New files are identified by listing the input directory; new files are incrementally and idempotently loaded into the target Delta Lake table.
E. Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; new files are incrementally and idempotently loaded into the target Delta Lake table.
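For reference, a minimal Auto Loader sketch in its default configuration; the source path, file format, checkpoint locations, and target table are hypothetical:

(spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/checkpoints/reviews_bronze")  # for schema inference
      .load("/mnt/landing/reviews")
      .writeStream
      .option("checkpointLocation", "/checkpoints/reviews_bronze")
      .toTable("reviews_bronze"))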
A Delta Lake table representing metadata about content posts from users has the following schema:
user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE
Based on the above schema, which column is a good candidate for partitioning the Delta Table?
A. post_time
B. latitude
C. post_id
D. user_id
E. date
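For context, a minimal sketch of writing this data partitioned on a low-cardinality column; the source and target table names are hypothetical:

df = spark.table("posts_raw")            # hypothetical source

(df.write
   .format("delta")
   .partitionBy("date")                  # low-cardinality partition column
   .mode("append")
   .saveAsTable("user_posts"))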
The data engineering team has configured a job to process customer requests to be forgotten (have their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default table settings.
The team has decided to process all deletions from the previous week as a batch job at 1am each Sunday. The total duration of this job is less than one hour. Every Monday at 3am, a batch job executes a series of VACUUM commands on all Delta Lake tables throughout the organization.
The compliance officer has recently learned about Delta Lake's time travel functionality. They are concerned that this might allow continued access to deleted data.
Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?
A. Because the VACUUM command permanently deletes all files containing deleted records, deleted records may be accessible with time travel for around 24 hours.
B. Because the default data retention threshold is 24 hours, data files containing deleted records will be retained until the VACUUM job is run the following day.
C. Because Delta Lake time travel provides full access to the entire history of a table, deleted records can always be recreated by users with full admin privileges.
D. Because Delta Lake's delete statements have ACID guarantees, deleted records will be permanently purged from all storage systems as soon as a delete job completes.
E. Because the default data retention threshold is 7 days, data files containing deleted records will be retained until the VACUUM job is run 8 days later.
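As background, a minimal sketch of how VACUUM interacts with the retention threshold; the table name is hypothetical:

# With default table settings, VACUUM removes only data files that are no
# longer referenced by the current version and are older than 7 days.
spark.sql("VACUUM user_messages")                    # uses the default retention
spark.sql("VACUUM user_messages RETAIN 168 HOURS")   # the same threshold, stated explicitly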
A large company seeks to implement a near real-time solution involving hundreds of pipelines with parallel updates of many tables with extremely high volume and high velocity data.
Which of the following solutions would you implement to achieve this requirement?
A. Use Databricks High Concurrency clusters, which leverage optimized cloud storage connections to maximize data throughput.
B. Partition ingestion tables by a small time duration to allow for many data files to be written in parallel.
C. Configure Databricks to save all data to attached SSD volumes instead of object storage, increasing file I/O significantly.
D. Isolate Delta Lake tables in their own storage containers to avoid API limits imposed by cloud vendors.
E. Store all tables in a single database to ensure that the Databricks Catalyst Metastore can load balance overall throughput.
Which describes a method of installing a Python package scoped at the notebook level to all nodes in the currently active cluster?
A. Run source env/bin/activate in a notebook setup script
B. Use b in a notebook cell
C. Use %pip install in a notebook cell
D. Use %sh pip install in a notebook cell
E. Install libraries from PyPI using the cluster UI
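For reference, a notebook-scoped installation looks like this in its own cell; the package name is hypothetical:

%pip install example-package   # installed on all nodes, scoped to this notebook's Python environment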
Each configuration below is identical to the extent that each cluster has 400 GB total of RAM, 160 total cores, and only one Executor per VM.
Given an extremely long-running job for which completion must be guaranteed, which cluster configuration will be able to guarantee completion of the job in light of one or more VM failures?
A. Total VMs: 8; 50 GB per Executor; 20 cores per Executor
B. Total VMs: 16; 25 GB per Executor; 10 cores per Executor
C. Total VMs: 1; 400 GB per Executor; 160 cores per Executor
D. Total VMs: 4; 100 GB per Executor; 40 cores per Executor
E. Total VMs: 2; 200 GB per Executor; 80 cores per Executor
A Delta Lake table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.
Immediately after each update succeeds, the data engineering team would like to determine the difference between the new version and the previous version of the table.
Given the current implementation, which method can be used?
A. Execute a query to calculate the difference between the new version and the previous version using Delta Lake’s built-in versioning and time travel functionality.
B. Parse the Delta Lake transaction log to identify all newly written data files.
C. Parse the Spark event logs to identify those rows that were updated, inserted, or deleted.
D. Execute DESCRIBE HISTORY customer_churn_params to obtain the full operation metrics for the update, including a log of all records that have been added or modified.
E. Use Delta Lake’s change data feed to identify those records that have been updated, inserted, or deleted.
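For context, a minimal sketch of computing the difference between the latest version and the one before it with time travel, using the table name from the question:

# Find the latest version number, then compare it with the version before it.
latest = (spark.sql("DESCRIBE HISTORY customer_churn_params")
               .agg({"version": "max"})
               .collect()[0][0])

changed_rows = spark.sql(f"""
    SELECT * FROM customer_churn_params VERSION AS OF {latest}
    EXCEPT
    SELECT * FROM customer_churn_params VERSION AS OF {latest - 1}
""")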
A data team’s Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new promotion, and they would like to add a new field to track the number of times this promotion code is used for each item. A junior data engineer suggests updating the existing query as follows. Note that proposed changes are in bold.
Original query:
Proposed query:
Which step must also be completed to put the proposed query into production?
A. Specify a new checkpointLocation
B. Remove .option('mergeSchema', 'true') from the streaming write
C. Increase the shuffle partitions to account for additional aggregates
D. Run REFRESH TABLE delta.`/item_agg`
When using the CLI or REST API to get results from jobs with multiple tasks, which statement correctly describes the response structure?
A. Each run of a job will have a unique job_id; all tasks within this job will have a unique job_id
B. Each run of a job will have a unique job_id; all tasks within this job will have a unique task_id
C. Each run of a job will have a unique orchestration_id; all tasks within this job will have a unique run_id
D. Each run of a job will have a unique run_id; all tasks within this job will have a unique task_id
E. Each run of a job will have a unique run_id; all tasks within this job will also have a unique run_id
The data engineering team is configuring environments for development, testing, and production before beginning migration on a new data pipeline. The team requires extensive testing on both the code and data resulting from code execution, and the team wants to develop and test against data as similar to production data as possible.
A junior data engineer suggests that production data can be mounted to the development and testing environments, allowing pre-production code to execute against production data. Because all users have admin privileges in the development environment, the junior data engineer has offered to configure permissions and mount this data for the team.
Which statement captures best practices for this situation?
A. All development, testing, and production code and data should exist in a single, unified workspace; creating separate environments for testing and development complicates administrative overhead.
B. In environments where interactive code will be executed, production data should only be accessible with read permissions; creating isolated databases for each environment further reduces risks.
C. As long as code in the development environment declares USE dev_db at the top of each notebook, there is no possibility of inadvertently committing changes back to production data sources.
D. Because Delta Lake versions all data and supports time travel, it is not possible for user error or malicious actors to permanently delete production data; as such, it is generally safe to mount production data anywhere.
E. Because access to production data will always be verified using passthrough credentials, it is safe to mount data to any Databricks development environment.
A data engineer, User A, has promoted a pipeline to production by using the REST API to programmatically create several jobs. A DevOps engineer, User B, has configured an external orchestration tool to trigger job runs through the REST API. Both users authorized the REST API calls using their personal access tokens.
A workspace admin, User C, inherits responsibility for managing this pipeline. User C uses the Databricks Jobs UI to take "Owner" privileges of each job. Jobs continue to be triggered using the credentials and tooling configured by User B.
An application has been configured to collect and parse run information returned by the REST API. Which statement describes the value returned in the creator_user_name field?
A. Once User C takes "Owner" privileges, their email address will appear in this field; prior to this, User A’s email address will appear in this field.
B. User B’s email address will always appear in this field, as their credentials are always used to trigger the run.
C. User A’s email address will always appear in this field, as they still own the underlying notebooks.
D. Once User C takes "Owner" privileges, their email address will appear in this field; prior to this, User B’s email address will appear in this field.
E. User C will only ever appear in this field if they manually trigger the job; otherwise it will indicate User B.
A member of the data engineering team has submitted a short notebook that they wish to schedule as part of a larger data pipeline. Assume that the commands provided below produce the logically correct results when run as presented.
Which command should be removed from the notebook before scheduling it as a job?
A. Cmd 2
B. Cmd 3
C. Cmd 4
D. Cmd 5
Which statement regarding Spark configuration on the Databricks platform is true?
A. The Databricks REST API can be used to modify the Spark configuration properties for an interactive cluster without interrupting jobs currently running on the cluster.
B. Spark configurations set within a notebook will affect all SparkSessions attached to the same interactive cluster.
C. When the same Spark configuration property is set for an interactive cluster and a notebook attached to that cluster, the notebook setting will always be ignored.
D. Spark configuration properties set for an interactive cluster with the Clusters UI will impact all notebooks attached to that cluster.
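For context, a minimal sketch of a notebook-level Spark configuration; the property value is illustrative:

# Set within a notebook, this affects that notebook's SparkSession; properties
# set in the Clusters UI apply to every notebook attached to the cluster.
spark.conf.set("spark.sql.shuffle.partitions", "64")
spark.conf.get("spark.sql.shuffle.partitions")    # returns "64" for this session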
A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create.
Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?
A. Three new jobs named "Ingest new data" will be defined in the workspace, and they will each run once daily.
B. The logic defined in the referenced notebook will be executed three times on new clusters with the configurations of the provided cluster ID.
C. Three new jobs named "Ingest new data" will be defined in the workspace, but no jobs will be executed.
D. One new job named "Ingest new data" will be defined in the workspace, but it will not be executed.
E. The logic defined in the referenced notebook will be executed three times on the referenced existing all-purpose cluster.
The business reporting team requires that data for their dashboards be updated every hour. The pipeline that extracts, transforms, and loads the data for these dashboards takes a total of 10 minutes to run.
Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?
A. Configure a job that executes every time new data lands in a given directory
B. Schedule a job to execute the pipeline once an hour on a new job cluster
C. Schedule a Structured Streaming job with a trigger interval of 60 minutes
D. Schedule a job to execute the pipeline once an hour on a dedicated interactive cluster