A data engineer is working with two tables. Each of these tables is displayed below in its entirety.
The data engineer runs the following query to join these tables together:
Which of the following will be returned by the above query?
A
B
C
D
E
Which of the following describes a scenario in which a data engineer will want to use a single-node cluster?
A. When they are working interactively with a small amount of data
B. When they are running automated reports to be refreshed as quickly as possible
C. When they are working with SQL within Databricks SQL
D. When they are concerned about the ability to automatically scale with larger data
E. When they are manually running reports with a large amount of data
A data engineer has realized that the data files associated with a Delta table are incredibly small. They want to compact the small files to form larger files to improve performance.
Which of the following keywords can be used to compact the small files?
A. REDUCE
B. OPTIMIZE
C. COMPACTION
D. REPARTITION
E. VACUUM
Which of the following can be used to simplify and unify siloed data architectures that are specialized for specific use cases?
A. None of these
B. Data lake
C. Data warehouse
D. All of these
E. Data lakehouse
A Delta Live Tables pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.
The pipeline is configured to run in Development mode using the Continuous Pipeline Mode.
Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?
A. All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.
B. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist until the pipeline is shut down.
C. All datasets will be updated once and the pipeline will persist without any processing. The compute resources will persist but go unused.
D. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.
E. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.
A data engineer wants to schedule their Databricks SQL dashboard to refresh every hour, but they only want the associated SQL endpoint to be running when it is necessary. The dashboard has multiple queries on multiple datasets associated with it. The data that feeds the dashboard is automatically processed using a Databricks Job.
Which of the following approaches can the data engineer use to minimize the total running time of the SQL endpoint used in the refresh schedule of their dashboard?
A. They can turn on the Auto Stop feature for the SQL endpoint.
B. They can ensure the dashboard's SQL endpoint is not one of the included queries' SQL endpoints.
C. They can reduce the cluster size of the SQL endpoint.
D. They can ensure the dashboard's SQL endpoint matches each of the queries' SQL endpoints.
E. They can set up the dashboard's SQL endpoint to be serverless.
Which of the following describes a benefit of creating an external table from Parquet rather than CSV when using a CREATE TABLE AS SELECT statement?
A. Parquet files can be partitioned
B. CREATE TABLE AS SELECT statements cannot be used on files
C. Parquet files have a well-defined schema
D. Parquet files have the ability to be optimized
E. Parquet files will become Delta tables
Which of the following SQL keywords can be used to convert a table from a long format to a wide format?
A. TRANSFORM
B. PIVOT
C. SUM
D. CONVERT
E. WHERE
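To make the terms in this question concrete, here is a minimal plain-Python sketch of what reshaping from long format to wide format means. It does not use Spark or SQL; the sample rows, the column meanings (store, month, sales), and the helper name `long_to_wide` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical long-format rows: one row per (store, month) pair.
long_rows = [
    ("store_1", "jan", 100),
    ("store_1", "feb", 120),
    ("store_2", "jan", 80),
    ("store_2", "feb", 95),
]


def long_to_wide(rows):
    """Collapse long rows into one record per key, one field per category."""
    wide = defaultdict(dict)
    for key, category, value in rows:
        # Each distinct category becomes its own "column" for that key.
        wide[key][category] = value
    return dict(wide)


wide_rows = long_to_wide(long_rows)
print(wide_rows)
```

In the long form, every measurement is its own row; in the wide form, each store appears once and each month has become a separate field.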
A data engineer has a Python notebook in Databricks, but they need to use SQL to accomplish a specific task within a cell. They still want all of the other cells to use Python without making any changes to those cells.
Which of the following describes how the data engineer can use SQL within a cell of their Python notebook?
A. It is not possible to use SQL in a Python notebook
B. They can attach the cell to a SQL endpoint rather than a Databricks cluster
C. They can simply write SQL syntax in the cell
D. They can add %sql to the first line of the cell
E. They can change the default language of the notebook to SQL
A data engineer has developed a data pipeline to ingest data from a JSON source using Auto Loader, but the engineer has not provided any type inference or schema hints in their pipeline. Upon reviewing the data, the data engineer has noticed that all of the columns in the target table are of the string type despite some of the fields only including float or boolean values.
Which of the following describes why Auto Loader inferred all of the columns to be of the string type?
A. There was a type mismatch between the specific schema and the inferred schema
B. JSON data is a text-based format
C. Auto Loader only works with string data
D. All of the fields had at least one null value
E. Auto Loader cannot infer the schema of ingested data
Which of the following data workloads will utilize a Gold table as its source?
A. A job that enriches data by parsing its timestamps into a human-readable format
B. A job that aggregates uncleaned data to create standard summary statistics
C. A job that cleans data by removing malformatted records
D. A job that queries aggregated data designed to feed into a dashboard
E. A job that ingests raw data from a streaming source into the Lakehouse
A data engineer who is new to Python needs to create a Python function that adds two integers together and returns the sum.
Which of the following code blocks can the data engineer use to complete this task?
A
B
C
D
E
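For reference, a minimal sketch of a Python function that takes two integers and returns their sum. The function name `add_integers` and the parameter names are illustrative assumptions and are not taken from the answer options.

```python
def add_integers(x: int, y: int) -> int:
    """Return the sum of two integers."""
    # `return` hands the result back to the caller; `print` alone would not.
    return x + y


total = add_integers(2, 3)
print(total)
```

The key points are the `def` keyword, the parameter list in parentheses, and an explicit `return` statement.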
A data engineer needs to apply custom logic to identify employees with more than 5 years of experience in array column employees in table stores. The custom logic should create a new column exp_employees that is an array of all of the employees with more than 5 years of experience for each row. In order to apply this custom logic at scale, the data engineer wants to use the FILTER higher-order function.
Which of the following code blocks successfully completes this task?
A
B
C
D
E
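As a rough analogy for what a FILTER-style higher-order function does, the plain-Python sketch below applies a predicate to each element of an array column, row by row, and stores the surviving elements in a new `exp_employees` field. It is not Spark SQL; the sample data, the `years_exp` field name, and the helper `with_exp_employees` are assumptions made for illustration.

```python
# Hypothetical rows of a `stores` table; each row holds an array of employees.
stores = [
    {
        "store_id": 1,
        "employees": [
            {"name": "Ana", "years_exp": 7},
            {"name": "Bo", "years_exp": 2},
        ],
    },
    {
        "store_id": 2,
        "employees": [{"name": "Cy", "years_exp": 6}],
    },
]


def with_exp_employees(row, min_years=5):
    """Return a copy of the row with an added `exp_employees` array."""
    # Keep only the array elements for which the predicate is true.
    kept = [e for e in row["employees"] if e["years_exp"] > min_years]
    return {**row, "exp_employees": kept}


result = [with_exp_employees(row) for row in stores]
for row in result:
    print(row["store_id"], [e["name"] for e in row["exp_employees"]])
```

The original `employees` array is left untouched; the filtered array is added alongside it, one per row.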
Which of the following describes the type of workloads that are always compatible with Auto Loader?
A. Streaming workloads
B. Machine learning workloads
C. Serverless workloads
D. Batch workloads
E. Dashboard workloads
A dataset has been defined using Delta Live Tables and includes an expectations clause:
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION FAIL UPDATE
What is the expected behavior when a batch containing records that violate this constraint is processed?
A. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
B. Records that violate the expectation cause the job to fail.
C. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.
D. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
E. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.
A data engineer is using the following code block as part of a batch ingestion pipeline to read from a composable table:
Which of the following changes needs to be made so this code block will work when the transactions table is a stream source?
A. Replace predict with a stream-friendly prediction function
B. Replace schema(schema) with option("maxFilesPerTrigger", 1)
C. Replace "transactions" with the path to the location of the Delta table
D. Replace format("delta") with format("stream")
E. Replace spark.read with spark.readStream
Which of the following statements regarding the relationship between Silver tables and Bronze tables is always true?
A. Silver tables contain a less refined, less clean view of data than Bronze data.
B. Silver tables contain aggregates while Bronze data is unaggregated.
C. Silver tables contain more data than Bronze tables.
D. Silver tables contain a more refined and cleaner view of data than Bronze tables.
E. Silver tables contain less data than Bronze tables.
A data engineering team has noticed that their Databricks SQL queries are running too slowly when they are submitted to a non-running SQL endpoint. The data engineering team wants this issue to be resolved.
Which of the following approaches can the team use to reduce the time it takes to return results in this scenario?
A. They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to "Reliability Optimized."
B. They can turn on the Auto Stop feature for the SQL endpoint.
C. They can increase the cluster size of the SQL endpoint.
D. They can turn on the Serverless feature for the SQL endpoint.
E. They can increase the maximum bound of the SQL endpoint's scaling range.
Which of the following approaches should be used to send the Databricks Job owner an email in the case that the Job fails?
A. Manually programming in an alert system in each cell of the Notebook
B. Setting up an Alert in the Job page
C. Setting up an Alert in the Notebook
D. There is no way to notify the Job owner in the case of Job failure
E. MLflow Model Registry Webhooks
A data engineer has been using a Databricks SQL dashboard to monitor the cleanliness of the input data to a data analytics dashboard for a retail use case. The dashboard has a Databricks SQL query that returns the number of store-level records where sales is equal to zero. The data engineer wants their entire team to be notified via a messaging webhook whenever this value is greater than 0.
Which of the following approaches can the data engineer use to notify their entire team via a messaging webhook whenever the number of stores with $0 in sales is greater than zero?
A. They can set up an Alert with a custom template.
B. They can set up an Alert with a new email alert destination.
C. They can set up an Alert with one-time notifications.
D. They can set up an Alert with a new webhook alert destination.
E. They can set up an Alert without notifications.
Which of the following queries is performing a streaming hop from raw data to a Bronze table?
A
B
C
D
E
A data engineer is attempting to drop a Spark SQL table my_table and runs the following command:
DROP TABLE IF EXISTS my_table;
After running this command, the engineer notices that the data files and metadata files have been deleted from the file system.
What is the reason behind the deletion of all these files?
A. The table was managed
B. The table's data was smaller than 10 GB
C. The table did not have a location
D. The table was external
A data engineer who is new to Python needs to create a Python function that adds two integers together and returns the sum.
Which code block can the data engineer use to complete this task?
A
B
C
D
Which file format is used for storing Delta Lake tables?