An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable:
Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.
If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?
A. Each write to the orders table will only contain unique records, and only those records without duplicates in the target table will be written.
B. Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.
C. Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, these records will be overwritten.
D. Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, the operation will fail.
E. Each write to the orders table will run deduplication over the union of new and existing records, ensuring no duplicate records are present.
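The ingest code for this question is not reproduced above. A common pattern consistent with the scenario (a sketch only; the table name, path, and `date` variable are assumptions) deduplicates within each daily batch before appending:

```sql
-- Hypothetical sketch: dedupe within the daily batch, then append.
-- Within-batch duplicates are removed, but nothing here checks the target
-- table, so duplicates produced hours (or days) apart can still land in
-- the orders table across separate nightly runs.
INSERT INTO orders
SELECT DISTINCT *
FROM parquet.`/mnt/raw/orders/${date}`;
```

Because the dedup scope is the incoming batch only, each write is internally unique but can collide with records already present in the target.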
An hourly batch job is configured to ingest data files from a cloud object storage container where each batch represents all records produced by the source system in a given hour. The batch job to process these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field represents a unique key for the data, which has the following schema: user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT
New records are all ingested into a table named account_history which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id.
Assuming there are millions of user accounts and tens of thousands of records processed hourly, which implementation can be used to efficiently update the described account_current table as part of each hourly batch job?
A. Use Auto Loader to subscribe to new files in the account_history directory; configure a Structured Streaming trigger-once job to batch update newly detected files into the account_current table.
B. Overwrite the account_current table with each batch using the results of a query against the account_history table grouping by user_id and filtering for the max value of last_updated.
C. Filter records in account_history using the last_updated field and the most recent hour processed, as well as the max last_login by user_id; write a merge statement to update or insert the most recent value for each user_id.
D. Use Delta Lake version history to get the difference between the latest version of account_history and one version prior, then write these records to account_current.
E. Filter records in account_history using the last_updated field and the most recent hour processed, making sure to deduplicate on username; write a merge statement to update or insert the most recent value for each username.
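A merge as described in option C might look like the following sketch (table and variable names are assumed; `${last_hour_processed}` is a hypothetical placeholder for the job's watermark). Only the newest hour of account_history is scanned, and at most one row per user_id survives before the merge, which keeps the operation cheap relative to rewriting millions of accounts:

```sql
-- Hypothetical Type 1 upsert into account_current (names assumed).
MERGE INTO account_current t
USING (
  SELECT user_id, username, user_utc, user_region,
         last_login, auto_pay, last_updated
  FROM account_history
  WHERE last_updated >= ${last_hour_processed}
  QUALIFY row_number() OVER (PARTITION BY user_id
                             ORDER BY last_updated DESC) = 1
) s
ON t.user_id = s.user_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```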
A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.
Streaming DataFrame df has the following schema:
"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"
Code block:
Choose the response that correctly fills in the blank within the code block to complete this task.
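The code block itself is not reproduced above. A sketch of the completed aggregation, assuming a Databricks runtime where `df` and `spark` are already defined, uses `window()` with a single five-minute duration to produce non-overlapping (tumbling) intervals:

```python
# Hypothetical sketch; the watermark duration is an assumption, not part
# of the original question.
from pyspark.sql import functions as F

agg_df = (df
    .withWatermark("event_time", "10 minutes")   # bound state for late events
    .groupBy(F.window("event_time", "5 minutes"), "device_id")
    .agg(F.avg("temp").alias("avg_temp"),
         F.avg("humidity").alias("avg_humidity")))
```

Passing only one duration to `window()` yields tumbling windows; a second (slide) argument would instead produce overlapping windows.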
The data engineering team maintains a table of aggregate statistics through nightly batch updates. This includes total sales for the previous day alongside totals and averages for a variety of time periods including the 7 previous days, year-to-date, and quarter-to-date. This table is named store_sales_summary and the schema is as follows:
The table daily_store_sales contains all the information needed to update store_sales_summary. The schema for this table is: store_id INT, sales_date DATE, total_sales FLOAT
If daily_store_sales is implemented as a Type 1 table and the total_sales column might be adjusted after manual data auditing, which approach is the safest to generate accurate reports in the store_sales_summary table?
A. Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and overwrite the store_sales_summary table with each update.
B. Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and append new rows nightly to the store_sales_summary table.
C. Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.
D. Implement the appropriate aggregate logic as a Structured Streaming read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.
E. Use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update.
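The store_sales_summary schema is not shown above, so the column list below is purely illustrative. The safest pattern when the source is Type 1 and values may be retroactively adjusted is a full recompute and overwrite, since every run re-derives all aggregates from the current state of daily_store_sales:

```sql
-- Hypothetical sketch (summary columns assumed): recompute everything
-- from the audited source and overwrite, so manual adjustments to
-- total_sales are automatically reflected in every period.
INSERT OVERWRITE store_sales_summary
SELECT store_id,
       sum(CASE WHEN sales_date = current_date() - 1
                THEN total_sales END) AS previous_day_sales,
       sum(CASE WHEN sales_date >= current_date() - 7
                THEN total_sales END) AS last_7_day_sales
FROM daily_store_sales
GROUP BY store_id;
```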
Review the following error traceback:
Which statement describes the error being raised?
A. The code executed was PySpark but was executed in a Scala notebook.
B. There is no column in the table named heartrateheartrateheartrate.
C. There is a type error because a column object cannot be multiplied.
D. There is a type error because a DataFrame object cannot be multiplied.
E. There is a syntax error because the heartrate column is not correctly identified as a column.
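The traceback for this question is not reproduced above. A plausible trigger (an assumption, not the confirmed original code) is Python's string repetition: multiplying a quoted column name instead of a Column object simply repeats the string, so Spark is then asked for a column with the repeated name.

```python
# Multiplying a string in Python repeats it; this is ordinary Python
# behavior, independent of Spark.
result = 3 * "heartrate"
print(result)  # heartrateheartrateheartrate
```

An expression like `df.select(3 * "heartrate")` would therefore request a column literally named "heartrateheartrateheartrate", whereas `3 * col("heartrate")` would multiply the column's values.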
A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.
The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications.
The data engineer is trying to determine the best approach for dealing with schema declaration given the highly-nested structure of the data and the numerous fields.
Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?
A. The Tungsten encoding used by Databricks is optimized for storing string data; newly added native support for querying JSON strings means that string types are always most efficient.
B. Because Delta Lake uses Parquet for data storage, data types can be easily evolved by just modifying file footer information in place.
C. Human labor in writing code is the largest cost associated with data engineering workloads; as such, automating table declaration logic should be a priority in all migration workloads.
D. Because Databricks will infer schema using types that allow all observed data to be processed, setting types manually provides greater assurance of data quality enforcement.
E. Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.
A nightly job ingests data into a Delta Lake table using the following code:
The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.
Which code snippet completes this function definition?
def new_records():
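The ingestion code and table name for this question are not reproduced above. Assuming the nightly job appends to a Delta table (here called `orders`, a hypothetical name) and that `spark` is the session a Databricks notebook provides, a streaming read is the idiomatic way to obtain an object that surfaces only not-yet-processed records:

```python
def new_records():
    # A streaming read against a Delta table yields a DataFrame whose
    # micro-batches contain only records appended since the stream's last
    # checkpoint, i.e. rows not yet processed to the next table.
    # ("orders" is an assumed table name.)
    return spark.readStream.table("orders")
```

Downstream, this DataFrame would be written with a checkpoint location so progress persists between runs.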
A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create.
Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?
A. Three new jobs named "Ingest new data" will be defined in the workspace, and they will each run once daily.
B. The logic defined in the referenced notebook will be executed three times on new clusters with the configurations of the provided cluster ID.
C. Three new jobs named "Ingest new data" will be defined in the workspace, but no jobs will be executed.
D. One new job named "Ingest new data" will be defined in the workspace, but it will not be executed.
E. The logic defined in the referenced notebook will be executed three times on the referenced existing all-purpose cluster.
A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max Durations for tasks in a particular stage show the minimum and median time to complete a task as roughly the same, but the max duration for a task to be roughly 100 times as long as the minimum.
Which situation is causing increased duration of the overall job?
A. Task queueing resulting from improper thread pool assignment.
B. Spill resulting from attached volume storage being too small.
C. Network latency due to some cluster nodes being in different regions from the source data.
D. Skew caused by more data being assigned to a subset of Spark partitions.
E. Credential validation errors while pulling data from an external system.
A junior data engineer on your team has implemented the following code block.
The view new_events contains a batch of records with the same schema as the events Delta table. The event_id field serves as a unique key for this table.
When this query is executed, what will happen with new records that have the same event_id as an existing record?
A. They are merged.
B. They are ignored.
C. They are updated.
D. They are inserted.
E. They are deleted.
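The junior engineer's code block is not reproduced above. A reconstruction consistent with the described behavior (an assumption) is an insert-only merge; with no WHEN MATCHED clause, records whose event_id already exists in the target are simply ignored:

```sql
-- Hypothetical insert-only merge: existing keys are left untouched,
-- only genuinely new event_ids are inserted.
MERGE INTO events e
USING new_events n
ON e.event_id = n.event_id
WHEN NOT MATCHED THEN INSERT *;
```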
The data engineering team has configured a Databricks SQL query and alert to monitor the values in a Delta Lake table. The recent_sensor_recordings table contains an identifying sensor_id alongside the timestamp and temperature for the most recent 5 minutes of recordings.
The below query is used to create the alert:
The query is set to refresh each minute and always completes in less than 10 seconds. The alert is set to trigger when mean(temperature) > 120. Notifications are triggered to be sent at most every 1 minute.
If this alert raises notifications for 3 consecutive minutes and then stops, which statement must be true?
A. The total average temperature across all sensors exceeded 120 on three consecutive executions of the query.
B. The recent_sensor_recordings table was unresponsive for three consecutive runs of the query.
C. The source query failed to update properly for three consecutive minutes and then restarted.
D. The maximum temperature recording for at least one sensor exceeded 120 on three consecutive executions of the query.
E. The average temperature recordings for at least one sensor exceeded 120 on three consecutive executions of the query.
To reduce storage and compute costs, the data engineering team has been tasked with curating a series of aggregate tables leveraged by business intelligence dashboards, customer-facing applications, production machine learning models, and ad hoc analytical queries.
The data engineering team has been made aware of new requirements from a customer-facing application, which is the only downstream workload they manage entirely. As a result, an aggregate table used by numerous teams across the organization will need to have a number of fields renamed, and additional fields will also be added.
Which of the solutions addresses the situation while minimally interrupting other teams in the organization without increasing the number of tables that need to be managed?
A. Send all users notice that the schema for the table will be changing; include in the communication the logic necessary to revert the new table schema to match historic queries.
B. Configure a new table with all the requisite fields and new names and use this as the source for the customer-facing application; create a view that maintains the original data schema and table name by aliasing select fields from the new table.
C. Create a new table with the required schema and new fields and use Delta Lake's deep clone functionality to sync up changes committed to one table to the corresponding table.
D. Replace the current table definition with a logical view defined with the query logic currently writing the aggregate table; create a new table to power the customer-facing application.
E. Add a table comment warning all users that the table schema and field names will be changing on a given date; overwrite the table in place to the specifications of the customer-facing application.
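Option B's approach can be sketched as follows (all table and column names here are hypothetical): the renamed and extended table serves the customer-facing application, while a view preserves the original name and schema for everyone else, so no other team's queries break and no second physical table must be maintained:

```sql
-- Hypothetical: agg_orders_v2 is the new table with renamed/added fields;
-- the view keeps the legacy contract alive under the original name.
CREATE OR REPLACE VIEW agg_orders AS
SELECT customer_key AS customer_id,   -- alias new names back to the old ones
       order_total
FROM agg_orders_v2;
```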
Although the Databricks Utilities Secrets module provides tools to store sensitive credentials and avoid accidentally displaying them in plain text, users should still be careful about which credentials are stored there and which users have access to these secrets.
Which statement describes a limitation of Databricks Secrets?
A. Because the SHA256 hash is used to obfuscate stored secrets, reversing this hash will display the value in plain text.
B. Account administrators can see all secrets in plain text by logging on to the Databricks Accounts console.
C. Secrets are stored in an administrators-only table within the Hive Metastore; database administrators have permission to query this table by default.
D. Iterating through a stored secret and printing each character will display secret contents in plain text.
E. The Databricks REST API can be used to list secrets in plain text if the personal access token has proper credentials.
A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown.
Which approach will allow this developer to review the current logic for this notebook?
A. Use Repos to make a pull request; use the Databricks REST API to update the current branch to dev-2.3.9.
B. Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.
C. Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the current branch.
D. Merge all changes back to the main branch in the remote Git repository and clone the repo again.
E. Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull request to sync with the remote repository.
The data engineering team maintains the following code:
Assuming that this code produces logically correct results and the data in the source tables has been de-duplicated and validated, which statement describes what will occur when this code is executed?
A. A batch job will update the enriched_itemized_orders_by_account table, replacing only those rows that have different values than the current version of the table, using accountID as the primary key.
B. The enriched_itemized_orders_by_account table will be overwritten using the current valid version of data in each of the three tables referenced in the join logic.
C. An incremental job will leverage information in the state store to identify unjoined rows in the source tables and write these rows to the enriched_itemized_orders_by_account table.
D. An incremental job will detect if new rows have been written to any of the source tables; if new rows are detected, all results will be recalculated and used to overwrite the enriched_itemized_orders_by_account table.
E. No computation will occur until enriched_itemized_orders_by_account is queried; upon query materialization, results will be calculated using the current valid version of data in each of the three tables referenced in the join logic.
A Delta Lake table representing metadata about content posts from users has the following schema: user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE
This table is partitioned by the date column. A query is run with the following filter: longitude < 20 & longitude > -20
Which statement describes how data will be filtered?
A. Statistics in the Delta log will be used to identify partitions that might include files in the filtered range.
B. No file skipping will occur because the optimizer does not know the relationship between the partition column and the longitude.
C. The Delta Engine will use row-level statistics in the transaction log to identify the files that meet the filter criteria.
D. Statistics in the Delta log will be used to identify data files that might include records in the filtered range.
E. The Delta Engine will scan the Parquet file footers to identify each row that meets the filter criteria.
A Delta Lake table was created with the below query:
Consider the following query:
DROP TABLE prod.sales_by_store
If this statement is executed by a workspace admin, which result will occur?
A. Nothing will occur until a COMMIT command is executed.
B. The table will be removed from the catalog but the data will remain in storage.
C. The table will be removed from the catalog and the data will be deleted.
D. An error will occur because Delta Lake prevents the deletion of production data.
E. Data will be marked as deleted but still recoverable with Time Travel.
The data engineering team is migrating an enterprise system with thousands of tables and views into the Lakehouse. They plan to implement the target architecture using a series of bronze, silver, and gold tables. Bronze tables will almost exclusively be used by production data engineering workloads, while silver tables will be used to support both data engineering and machine learning workloads. Gold tables will largely serve business intelligence and reporting purposes. While personal identifying information (PII) exists in all tiers of data, pseudonymization and anonymization rules are in place for all data at the silver and gold levels.
The organization is interested in reducing security concerns while maximizing the ability to collaborate across diverse teams.
Which statement exemplifies best practices for implementing this system?
A. Isolating tables in separate databases based on data quality tiers allows for easy permissions management through database ACLs and allows physical separation of default storage locations for managed tables.
B. Because databases on Databricks are merely a logical construct, choices around database organization do not impact security or discoverability in the Lakehouse.
C. Storing all production tables in a single database provides a unified view of all data assets available throughout the Lakehouse, simplifying discoverability by granting all users view privileges on this database.
D. Working in the default Databricks database provides the greatest security when working with managed tables, as these will be created in the DBFS root.
E. Because all tables must live in the same storage containers used for the database they're created in, organizations should be prepared to create between dozens and thousands of databases depending on their data isolation requirements.
The data architect has mandated that all tables in the Lakehouse should be configured as external Delta Lake tables.
Which approach will ensure that this requirement is met?
A. Whenever a database is being created, make sure that the LOCATION keyword is used.
B. When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.
C. Whenever a table is being created, make sure that the LOCATION keyword is used.
D. When tables are created, make sure that the EXTERNAL keyword is used in the CREATE TABLE statement.
E. When the workspace is being configured, make sure that external cloud object storage has been mounted.
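Supplying LOCATION at creation time is what makes a Delta table external: the metastore entry points at a path the table does not manage, so DROP TABLE removes only the catalog entry, not the files. A sketch (the storage path is a made-up example):

```sql
-- External Delta table: the LOCATION clause, not an EXTERNAL keyword,
-- determines that the data lives outside the metastore-managed root.
CREATE TABLE sales_external
USING DELTA
LOCATION 'abfss://data@example.dfs.core.windows.net/tables/sales';
```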
A CHECK constraint has been successfully added to the Delta table named activity_details using the following logic:
A batch job is attempting to insert new records to the table, including a record where latitude = 45.50 and longitude = 212.67.
Which statement describes the outcome of this batch insert?
A. The write will fail when the violating record is reached; any records previously processed will be recorded to the target table.
B. The write will fail completely because of the constraint violation and no records will be inserted into the target table.
C. The write will insert all records except those that violate the table constraints; the violating records will be recorded to a quarantine table.
D. The write will include all records in the target table; any violations will be indicated in the boolean column named valid_coordinates.
E. The write will insert all records except those that violate the table constraints; the violating records will be reported in a warning log.
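The constraint logic itself is not reproduced above; a plausible form (an assumption) bounds both coordinates. Delta Lake CHECK constraints are enforced transactionally, so a batch containing longitude = 212.67 would fail atomically, with no rows committed:

```sql
-- Hypothetical reconstruction of the constraint; a single violating row
-- causes the entire write transaction to fail.
ALTER TABLE activity_details ADD CONSTRAINT valid_coordinates
CHECK (latitude BETWEEN -90 AND 90 AND longitude BETWEEN -180 AND 180);
```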
A small company based in the United States has recently contracted a consulting firm in India to implement several new data engineering pipelines to power artificial intelligence applications. All the company's data is stored in regional cloud storage in the United States.
The workspace administrator at the company is uncertain about where the Databricks workspace used by the contractors should be deployed.
Assuming that all data governance considerations are accounted for, which statement accurately informs this decision?
A. Databricks runs HDFS on cloud volume storage; as such, cloud virtual machines must be deployed in the region where the data is stored.
B. Databricks workspaces do not rely on any regional infrastructure; as such, the decision should be made based upon what is most convenient for the workspace administrator.
C. Cross-region reads and writes can incur significant costs and latency; whenever possible, compute should be deployed in the same region the data is stored.
D. Databricks leverages user workstations as the driver during interactive development; as such, users should always use a workspace deployed in a region they are physically near.
E. Databricks notebooks send all executable code from the user's browser to virtual machines over the open internet; whenever possible, choosing a workspace region near the end users is the most secure.
A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.
In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?
A. Set the configuration delta.deduplicate = true.
B. VACUUM the Delta table after each batch completes.
C. Perform an insert-only merge with a matching condition on a unique key.
D. Perform a full outer join on a unique key and overwrite existing data.
E. Rely on Delta Lake schema enforcement to prevent duplicate records.
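An insert-only merge as in option C can be sketched as follows (table and key names are assumptions). Deduplicating the batch first handles within-batch duplicates; the merge condition then keeps late-arriving duplicates of previously processed keys out of the target:

```sql
-- Hypothetical sketch: dedupe the batch, then insert only keys that are
-- not already present in the target table. No WHEN MATCHED clause means
-- existing records are never modified.
MERGE INTO target t
USING (SELECT DISTINCT * FROM batch_updates) s
ON t.unique_key = s.unique_key
WHEN NOT MATCHED THEN INSERT *;
```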
Which of the following is true of Delta Lake and the Lakehouse?
A. Because Parquet compresses data row by row, strings will only be compressed when a character is repeated multiple times.
B. Delta Lake automatically collects statistics on the first 32 columns of each table, which are leveraged in data skipping based on query filters.
C. Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at all times.
D. Primary and foreign key constraints can be leveraged to ensure duplicate values are never entered into a dimension table.
E. Z-order can only be applied to numeric values stored in Delta Lake tables.
A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.
The user_ltv table has the following schema:
email STRING, age INT, ltv INT
The following view definition is executed:
An analyst who is not a member of the marketing group executes the following query:
SELECT * FROM email_ltv
Which statement describes the results returned by this query?
A. Three columns will be returned, but one column will be named "REDACTED" and contain only null values.
B. Only the email and ltv columns will be returned; the email column will contain all null values.
C. The email and ltv columns will be returned with the values in user_ltv.
D. The email, age, and ltv columns will be returned with the values in user_ltv.
E. Only the email and ltv columns will be returned; the email column will contain the string "REDACTED" in each row.
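The view definition is not reproduced above. A reconstruction consistent with the described behavior (an assumption, using the `is_member()` function available in Databricks SQL) redacts email for non-members of the marketing group and omits the age column entirely:

```sql
-- Hypothetical reconstruction of the dynamic view: marketing members see
-- real emails; everyone else sees the literal string 'REDACTED'.
CREATE OR REPLACE VIEW email_ltv AS
SELECT CASE WHEN is_member('marketing') THEN email
            ELSE 'REDACTED' END AS email,
       ltv
FROM user_ltv;
```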
The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.
The following logic is used to process these records.
Which statement describes this implementation?
A. The customers table is implemented as a Type 3 table; old values are maintained as a new column alongside the current value.
B. The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current and new values are inserted.
C. The customers table is implemented as a Type 0 table; all writes are append only with no changes to existing values.
D. The customers table is implemented as a Type 1 table; old values are overwritten by new values and no history is maintained.
E. The customers table is implemented as a Type 2 table; old values are overwritten and new customers are appended.
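The processing logic for this question is not reproduced above. For reference, Type 2 slowly changing dimension processing as described in option B typically looks like the following sketch (all column names are assumptions): matched current rows are closed out rather than overwritten, and the incoming values are inserted as the new current version, preserving history.

```sql
-- Hypothetical Type 2 merge: close out the old current row, insert the
-- new values as the current version.
MERGE INTO customers c
USING updates u
ON c.customer_id = u.customer_id AND c.current = true
WHEN MATCHED AND c.address <> u.address THEN
  UPDATE SET current = false, end_date = u.effective_date
WHEN NOT MATCHED THEN
  INSERT (customer_id, address, current, effective_date, end_date)
  VALUES (u.customer_id, u.address, true, u.effective_date, null);
```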