Which of the following operations can be used to return a DataFrame with no duplicate rows? Please select the most complete answer.
A. DataFrame.distinct()
B. DataFrame.dropDuplicates() and DataFrame.distinct()
C. DataFrame.dropDuplicates()
D. DataFrame.drop_duplicates()
E. DataFrame.dropDuplicates(), DataFrame.distinct() and DataFrame.drop_duplicates()
Which of the following Spark properties is used to configure the maximum size of an automatically broadcasted DataFrame when performing a join?
A. spark.sql.broadcastTimeout
B. spark.sql.autoBroadcastJoinThreshold
C. spark.sql.shuffle.partitions
D. spark.sql.inMemoryColumnarStorage.batchSize
E. spark.sql.adaptive.skewedJoin.enabled
The code block shown below contains an error. The code block is intended to return a DataFrame containing a column dayOfYear, an integer representation of the day of the year from column openDate in DataFrame storesDF. Identify the error.
Note that column openDate is of type integer and represents a date in the UNIX epoch format – the number of seconds since midnight on January 1st, 1970.
A. The dayofyear() operation cannot extract the day of year from a column of type integer – column openDate must first be converted to type Timestamp.
B. The dayofyear() operation takes a quoted column name rather than a Column object as its first argument – the first argument should be "openDate".
C. The dayofyear() operation cannot extract the day of year from a column of type integer – column openDate must first be converted to type Date.
D. The dayofyear() operation is not applicable in a withColumn() call – the newColumn() operation must be used instead.
E. There is no dayofyear() operation – the day of year number must be extracted using substring utilities.
Which of the following statements about slots is incorrect?
A. Slots are the most granular level of execution in the Spark execution hierarchy.
B. Slots are resources for parallelization within an executor.
C. Tasks are assigned to slots for computation.
D. There can be more slots than tasks.
E. There must be at least as many slots as there are executors.
Which of the following storage levels should be used to store as much data as possible in memory on two cluster nodes while storing any data that does not fit in memory on disk to be read in when needed?
A. MEMORY_ONLY_2
B. MEMORY_AND_DISK_SER
C. MEMORY_AND_DISK
D. MEMORY_AND_DISK_2
E. MEMORY_ONLY
In what order should the numbered lines of code below be run to write DataFrame storesDF to file path filePath as parquet, partitioned by the values in column division?
Lines of code:
1. .write() \
2. .partitionBy("division") \
3. .parquet(filePath)
4. storesDF \
5. .repartition("division")
6. .write \
7. .path(filePath, "parquet")
A. 4, 1, 2, 3
B. 4, 1, 5, 7
C. 4, 6, 2, 3
D. 4, 1, 5, 3
E. 4, 6, 2, 7
Which of the following operations can be used to return the number of rows in a DataFrame?
A. DataFrame.numberOfRows()
B. DataFrame.n()
C. DataFrame.sum()
D. DataFrame.count()
E. DataFrame.countDistinct()
Which of the following operations returns a GroupedData object?
A. DataFrame.GroupBy()
B. DataFrame.cubed()
C. DataFrame.group()
D. DataFrame.groupBy()
E. DataFrame.grouping_id()
Which of the following code blocks fails to return a new DataFrame that is the result of an inner join between DataFrame storesDF and DataFrame employeesDF on column storeId and column employeeId?
E. storesDF.alias("s").join(employeesDF.alias("e"), col("s.storeId") === col("e.storeId") and col("s.employeeId") === col("e.employeeId"))
Which of the following cluster configurations is most likely to experience delays due to garbage collection of a large DataFrame?
Note: each configuration has roughly the same compute power using 100GB of RAM and 200 cores.
A. More information is needed to determine an answer.
B. Scenario #5
C. Scenario #4
D. Scenario #1
E. Scenario #2
The code block shown below should cache DataFrame storesDF only in Spark's memory. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
__1__.__2__(__3__).count()
A. 1. storesDF 2. cache 3. StorageLevel.MEMORY_ONLY
B. 1. storesDF 2. storageLevel 3. cache
C. 1. storesDF 2. cache 3. Nothing
D. 1. storesDF 2. persist 3. Nothing
E. 1. storesDF 2. persist 3. StorageLevel.MEMORY_ONLY
Which of the following code blocks returns a DataFrame containing only the rows from DataFrame storesDF where the value in column sqft is less than or equal to 25,000 OR the value in column customerSatisfaction is greater than or equal to 30?
A. storesDF.filter(col("sqft") <= 25000 and col("customerSatisfaction") >= 30)
E. storesDF.filter(col("sqft") <= 25000 or col("customerSatisfaction") >= 30)
The code block shown below contains an error. The code block is intended to return a new DataFrame from DataFrame storesDF where column storeId is of the type string. Identify the error.
A. Calls to withColumn() cannot create a new column of the same name on which it is operating.
B. DataFrame columns cannot be converted to a new type inside of a call to withColumn().
C. The call to StringType should not be followed by parentheses.
D. The column name storeId inside the col() operation should not be quoted.
E. The cast() operation is a method in the Column class rather than a standalone function.
The code block shown below should return a new DataFrame where column division from DataFrame storesDF has been renamed to column state and column managerName from DataFrame storesDF has been renamed to column managerFullName. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
The code block shown below should read a CSV at the file path filePath into a DataFrame with the specified schema schema. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
The code block shown below should return a collection of summary statistics for column sqft in DataFrame storesDF. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
storesDF.__1__(__2__)
A. 1. summary 2. col("sqft")
B. 1. describe 2. col("sqft")
C. 1. summary 2. "sqft"
D. 1. describe 2. "sqft"
E. 1. summary 2. "all"
The code block shown below should return a new DataFrame where rows in DataFrame storesDF with missing values in every column have been dropped. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
storesDF.__1__.__2__(__3__ = __4__)
A. 1. na 2. drop 3. how 4. "any"
B. 1. na 2. drop 3. subset 4. "all"
C. 1. na 2. drop 3. subset 4. "any"
D. 1. na 2. drop 3. how 4. "all"
E. 1. drop 2. na 3. how 4. "all"
Which of the following code blocks returns a new DataFrame with column storeReview where the pattern "End" has been removed from the end of column storeReview in DataFrame storesDF?
The code block shown below contains an error. The code block is intended to create a single-column DataFrame from Python list years which is made up of integers. Identify the error.
Code block:
spark.createDataFrame(years, IntegerType)
A. The column name must be specified.
B. The years list should be wrapped in another list like [years] to make clear that it is a column rather than a row.
C. There is no createDataFrame operation in spark.
D. The IntegerType call must be followed by parentheses.
E. The IntegerType call should not be present – Spark can tell that list years is full of integers.
Which of the following operations will fail to trigger evaluation?
A. DataFrame.collect()
B. DataFrame.count()
C. DataFrame.first()
D. DataFrame.join()
E. DataFrame.take()
Which of the following code blocks returns a new DataFrame where column sqft from DataFrame storesDF has had its missing values replaced with the value 30,000?
A sample of DataFrame storesDF is below:
A. storesDF.na.fill(30000, Seq("sqft"))
B. storesDF.nafill(30000, col("sqft"))
C. storesDF.na.fill(30000, col("sqft"))
D. storesDF.fillna(30000, col("sqft"))
E. storesDF.na.fill(30000, "sqft")
Which of the following statements about the Spark DataFrame is true?
A. Spark DataFrames are mutable unless they've been collected to the driver.
B. A Spark DataFrame is rarely used aside from the import and export of data.
C. Spark DataFrames cannot be distributed into partitions.
D. A Spark DataFrame is a tabular data structure that is the most common Structured API in Spark.
E. A Spark DataFrame is exactly the same as a data frame in Python or R.
Which of the following code blocks returns the number of rows in DataFrame storesDF for each unique value in column division?
A. storesDF.groupBy("division").agg(count())
B. storesDF.agg(groupBy("division").count())
C. storesDF.groupby.count("division")
D. storesDF.groupBy().count("division")
E. storesDF.groupBy("division").count()
Which of the following code blocks will always return a new 4-partition DataFrame from the 8-partition DataFrame storesDF without inducing a shuffle?