Certified Associate Developer for Apache Spark
Question 1
Which of the following describes the Spark driver?
- A: The Spark driver is responsible for performing all execution in all execution modes – it is the entire Spark application.
- B: The Spark driver is fault tolerant – if it fails, it will recover the entire Spark application.
- C: The Spark driver is the coarsest level of the Spark execution hierarchy – it is synonymous with the Spark application.
- D: The Spark driver is the program space in which the Spark application’s main method runs, coordinating the entire Spark application.
- E: The Spark driver is horizontally scaled to increase overall processing throughput of a Spark application.
Question 2
Which of the following DataFrame operations is classified as a wide transformation?
- A: DataFrame.filter()
- B: DataFrame.join()
- C: DataFrame.select()
- D: DataFrame.drop()
- E: DataFrame.union()
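For context, a minimal sketch contrasting a narrow and a wide transformation, assuming storesDF and employeesDF are existing DataFrames that share a storeId column:
import org.apache.spark.sql.functions.col
// filter() is a narrow transformation: each output partition depends on a single input partition.
val largeStoresDF = storesDF.filter(col("sqft") > 25000)
// join() is a wide transformation: rows with matching keys must be shuffled across partitions.
val joinedDF = storesDF.join(employeesDF, Seq("storeId"))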
Question 3
The code block shown below contains an error. The code block is intended to return the exact number of distinct values in column division in DataFrame storesDF. Identify the error.
Code block:
storesDF.agg(approx_count_distinct(col(“division”)).alias(“divisionDistinct”))
- A: The approx_count_distinct() operation needs a second argument to set the rsd parameter to ensure it returns the exact number of distinct values.
- B: There is no alias() operation for the approx_count_distinct() operation's output.
- C: There is no way to return an exact distinct number in Spark because the data is distributed across partitions.
- D: The approx_count_distinct() operation is not a standalone function - it should be used as a method from a Column object.
- E: The approx_count_distinct() operation cannot determine an exact number of distinct values in a column.
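For reference, countDistinct() from org.apache.spark.sql.functions computes an exact distinct count; a minimal sketch, assuming storesDF is already loaded:
import org.apache.spark.sql.functions.{col, countDistinct}
// countDistinct() returns the exact number of distinct values (at the cost of a full shuffle),
// whereas approx_count_distinct() uses HyperLogLog and only approximates it.
storesDF.agg(countDistinct(col("division")).alias("divisionDistinct"))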
Question 4
Which of the following code blocks returns the number of rows in DataFrame storesDF for each distinct combination of values in column division and column storeCategory?
- A: storesDF.groupBy(Seq(col(“division”), col(“storeCategory”))).count()
- B: storesDF.groupBy(division, storeCategory).count()
- C: storesDF.groupBy(“division”, “storeCategory”).count()
- D: storesDF.groupBy(“division”).groupBy(“StoreCategory”).count()
- E: storesDF.groupBy(Seq(“division”, “storeCategory”)).count()
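A minimal sketch of grouping by multiple columns, assuming storesDF is already loaded:
// groupBy() accepts column name strings directly; count() returns one row per distinct combination.
storesDF.groupBy("division", "storeCategory").count()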
Question 5
The code block shown below contains an error. The code block is intended to return a collection of summary statistics for column sqft in DataFrame storesDF. Identify the error.
Code block:
storesDF.describe(col(“sqft”))
- A: The column sqft should be subsetted from DataFrame storesDF prior to computing summary statistics on it alone.
- B: The describe() operation does not accept a Column object as an argument outside of a sequence — the sequence Seq(col(“sqft”)) should be specified instead.
- C: The describe() operation doesn’t compute summary statistics for a single column — the summary() operation should be used instead.
- D: The describe() operation doesn’t compute summary statistics for numeric columns — the summary() operation should be used instead.
- E: The describe() operation does not accept a Column object as an argument — the column name string “sqft” should be specified instead.
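For reference, describe() accepts column names as strings; a minimal sketch, assuming storesDF has a numeric sqft column:
// describe() computes count, mean, stddev, min, and max for the named columns.
storesDF.describe("sqft").show()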
Question 6
The code block shown below should extract the integer value for column sqft from the first row of DataFrame storesDF. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
1.2.3[Int](4)
- A: 1. storesDF 2. first() 3. getAs() 4. “sqft”
- B: 1. storesDF 2. first 3. getAs 4. sqft
- C: 1. storesDF 2. first() 3. getAs 4. col(“sqft”)
- D: 1. storesDF 2. first 3. getAs 4. “sqft”
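A minimal sketch of pulling a typed value out of the first Row, assuming storesDF has an integer sqft column:
// first() returns a Row; getAs[Int] looks the field up by name and casts it to Int.
val sqftValue: Int = storesDF.first().getAs[Int]("sqft")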
Question 7
The code block shown below should print the schema of DataFrame storesDF. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
1.2
- A: 1. storesDF 2. printSchema(“all”)
- B: 1. storesDF 2. schema
- C: 1. storesDF 2. getAs[str]
- D: 1. storesDF 2. printSchema(true)
- E: 1. storesDF 2. printSchema
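A minimal sketch, assuming storesDF is already loaded:
// printSchema() prints the column names, types, and nullability as a tree to standard output.
storesDF.printSchema()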
Question 8
The code block shown below contains an error. The code block is intended to create and register a SQL UDF named “ASSESS_PERFORMANCE” using the Scala function assessPerformance() and apply it to column customerSatisfaction in the table stores. Identify the error.
Code block:
spark.udf.register(“ASSESS_PERFORMANCE”, assessPerformance)
spark.sql(“SELECT customerSatisfaction, assessPerformance(customerSatisfaction) AS result FROM stores”)
- A: The customerSatisfaction column cannot be called twice inside the SQL statement.
- B: Registered UDFs cannot be applied inside of a SQL statement.
- C: The order of the arguments to spark.udf.register() should be reversed.
- D: The wrong SQL function is used to compute column result - it should be ASSESS_PERFORMANCE instead of assessPerformance.
- E: There is no sql() operation - the DataFrame API must be used to apply the UDF assessPerformance().
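For reference, a minimal sketch of registering and calling a SQL UDF; the body of assessPerformance below is only an assumed placeholder, and the stores table is assumed to be registered:
// Assumed placeholder implementation of the Scala function named in the question.
val assessPerformance = (customerSatisfaction: Int) => if (customerSatisfaction >= 30) "good" else "poor"
// The first argument is the SQL-visible name; the second is the Scala function being wrapped.
spark.udf.register("ASSESS_PERFORMANCE", assessPerformance)
spark.sql("SELECT customerSatisfaction, ASSESS_PERFORMANCE(customerSatisfaction) AS result FROM stores")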
Question 9
The code block shown below contains an error. The code block is intended to create the Scala UDF assessPerformanceUDF() and apply it to the integer column customerSatisfaction in DataFrame storesDF. Identify the error.
Code block:
- A: The input type of customerSatisfaction is not specified in the udf() operation.
- B: The return type of assessPerformanceUDF() must be specified.
- C: The withColumn() operation is not appropriate here - UDFs should be applied by iterating over rows instead.
- D: The assessPerformanceUDF() must first be defined as a Scala function and then converted to a UDF.
- E: UDFs can only be applied via SQL and not through the Data Frame API.
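A minimal sketch of a DataFrame-API UDF, assuming storesDF is loaded; the UDF body is an assumed placeholder:
import org.apache.spark.sql.functions.{col, udf}
// udf() wraps an ordinary Scala function; withColumn() applies it to each row of the column.
val assessPerformanceUDF = udf((customerSatisfaction: Int) => customerSatisfaction >= 30)
storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction")))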
Question 10
The code block shown below should create a single-column DataFrame from the Scala list years, which is made up of integers. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
1.2(3).4
- A: 1. spark 2. createDataFrame 3. years 4. IntegerType
- B: 1. spark 2. createDataset 3. years 4. IntegerType
- C: 1. spark 2. createDataset 3. List(years) 4. toDF
- D: 1. spark 2. createDataFrame 3. List(years) 4. IntegerType
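A minimal sketch of building a single-column DataFrame from a local list, assuming spark is a SparkSession; the sample values are assumed:
import spark.implicits._
val years = List(2018, 2019, 2020, 2021)   // assumed sample values
// toDF() on a local Scala collection (enabled by spark.implicits._) builds a single-column DataFrame.
val yearsDF = years.toDF("year")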
Question 11
The code block shown below should cache DataFrame storesDF only in Spark's memory. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
1.2(3).count()
- A: 1. storesDF 2. cache 3. StorageLevel.MEMORY_ONLY
- B: 1. storesDF 2. storageLevel 3. cache
- C: 1. storesDF 2. cache 3. Nothing
- D: 1. storesDF 2. persist 3. Nothing
- E: 1. storesDF 2. persist 3. StorageLevel.MEMORY_ONLY
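A minimal sketch, assuming storesDF is already loaded:
import org.apache.spark.storage.StorageLevel
// persist() accepts an explicit storage level; on DataFrames, cache() is shorthand for MEMORY_AND_DISK.
storesDF.persist(StorageLevel.MEMORY_ONLY)
storesDF.count()   // an action is needed to actually materialize the cached data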
Question 12
Which of the following code blocks returns a DataFrame containing a column month, an integer representation of the month from column openDate in DataFrame storesDF?
Note that column openDate is of type integer and represents a date in the UNIX epoch format – the number of seconds since midnight on January 1st, 1970.
A sample of storesDF accompanied the original question (not reproduced here).
Code block:
storesDF.withColumn(“openTimestamp”, col(“openDate”).cast(1))
.withColumn(2, 3(4))
- A: 1. “Data” 2. month 3. “month” 4. “openTimestamp”
- B: 1. “Timestamp” 2. month 3. “month” 4. col(“openTimestamp”)
- C: 1. “Timestamp” 2. month 3. getMonth 4. col(“openTimestamp”)
- D: 1. “Timestamp” 2. “month” 3. month 4. col(“openTimestamp”)
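A minimal sketch of the cast-then-extract pattern, assuming storesDF has an integer openDate column holding seconds since the epoch:
import org.apache.spark.sql.functions.{col, month}
// Cast the epoch seconds to a timestamp first, then extract the month as an integer.
storesDF.withColumn("openTimestamp", col("openDate").cast("timestamp"))
  .withColumn("month", month(col("openTimestamp")))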
Question 13
Which of the following describes the difference between cluster and client execution modes?
- A: The cluster execution mode runs the driver on a worker node within a cluster, while the client execution mode runs the driver on the client machine (also known as a gateway machine or edge node).
- B: The cluster execution mode is run on a local cluster, while the client execution mode is run in the cloud.
- C: The cluster execution mode distributes executors across worker nodes in a cluster, while the client execution mode runs a Spark job entirely on one client machine.
- D: The cluster execution mode runs the driver on the cluster machine (also known as a gateway machine or edge node), while the client execution mode runs the driver on a worker node within a cluster.
- E: The cluster execution mode distributes executors across worker nodes in a cluster, while the client execution mode submits a Spark job from a remote machine to be run on a remote, unconfigurable cluster.
Question 14
The code block shown below contains an error. The code block is intended to return a new DataFrame that is the result of an inner join between DataFrame storesDF and DataFrame employeesDF on column storeId. Identify the error.
Code block:
storesDF.join(employeesDF, Seq("storeId"))
- A: The key column storeId needs to be a string like “storeId”.
- B: The key column storeId needs to be specified in an expression of both Data Frame columns like storesDF.storeId ===employeesDF.storeId.
- C: The default argument to the joinType parameter is “inner” - an additional argument of “left” must be specified.
- D: There is no DataFrame.join() operation - DataFrame.merge() should be used instead.
- E: The key column storeId needs to be wrapped in the col() operation.
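For reference, a minimal sketch of an inner join on a shared key column, assuming both DataFrames contain storeId:
// With a Seq of column name strings, join() defaults to an inner join and
// keeps a single copy of the key column in the result.
val joinedDF = storesDF.join(employeesDF, Seq("storeId"))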
Question 15
Which of the following pairs of arguments cannot be used in DataFrame.join() to perform an inner join on two DataFrames, named and aliased with "a" and "b" respectively, to specify two key columns column1 and column2?
- A: joinExprs = col(“a.column1”) === col(“b.column1”) and col(“a.column2”) === col(“b.column2”)
- B: usingColumns = Seq(col(“column1”), col(“column2”))
- C: All of these options can be used to perform an inner join with two key columns.
- D: joinExprs = storesDF(“column1”) === employeesDF(“column1”) and storesDF(“column2”) === employeesDF(“column2”)
- E: usingColumns = Seq(“column1”, “column2”)
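A minimal sketch of two common ways to join on two key columns, assuming storesDF and employeesDF both carry column1 and column2:
import org.apache.spark.sql.functions.col
// Join on an explicit boolean expression built from aliased column references...
storesDF.as("a")
  .join(employeesDF.as("b"),
    col("a.column1") === col("b.column1") && col("a.column2") === col("b.column2"))
// ...or on a Seq of column name strings shared by both DataFrames.
storesDF.join(employeesDF, Seq("column1", "column2"))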
Question 16
Which of the following code blocks returns a new DataFrame that is the result of a position-wise union between DataFrame storesDF and DataFrame acquiredStoresDF?
- A: concat(storesDF, acquiredStoresDF)
- B: storesDF.unionByName(acquiredStoresDF)
- C: union(storesDF, acquiredStoresDF)
- D: unionAll(storesDF, acquiredStoresDF)
- E: storesDF.union(acquiredStoresDF)
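A minimal sketch, assuming both DataFrames have the same number of columns in the same positions:
// union() stacks rows position-wise; unionByName() would match columns by name instead.
val combinedDF = storesDF.union(acquiredStoresDF)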
Question 17
Which of the following code blocks writes DataFrame storesDF to file path filePath as parquet overwriting any existing files in that location?
- A: storesDF.write(filePath, mode = “overwrite”)
- B: storesDF.write().mode(“overwrite”).parquet(filePath)
- C: storesDF.write.mode(“overwrite”).parquet(filePath)
- D: storesDF.write.option(“parquet”, “overwrite”).path(filePath)
- E: storesDF.write.mode(“overwrite”).path(filePath)
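A minimal sketch, assuming filePath points at a writable location:
// write returns a DataFrameWriter; mode("overwrite") replaces any existing files at the target path.
storesDF.write.mode("overwrite").parquet(filePath)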
Question 18
Which of the following code blocks reads a CSV at the file path filePath into a DataFrame with the specified schema schema?
- A: spark.read().csv(filePath)
- B: spark.read().schema(“schema”).csv(filePath)
- C: spark.read.schema(schema).csv(filePath)
- D: spark.read.schema(“schema”).csv(filePath)
- E: spark.read().schema(schema).csv(filePath)
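A minimal sketch, assuming schema is a StructType and filePath points at the CSV data:
// read is a property (not a method call), and schema() expects a StructType rather than a string.
val storesDF = spark.read.schema(schema).csv(filePath)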
Question 19
Which of the following code blocks returns a DataFrame containing only the rows from DataFrame storesDF where the value in column sqft is less than or equal to 25,000 AND the value in column customerSatisfaction is greater than or equal to 30?
- A: storesDF.filter(col("sqft") <= 25000 and col("customerSatisfaction") >= 30)
- B: storesDF.filter(col("sqft") <= 25000 or col("customerSatisfaction") >= 30)
- C: storesDF.filter(sqft) <= 25000 and customerSatisfaction >= 30)
- D: storesDF.filter(col("sqft") <= 25000 & col("customerSatisfaction") >= 30)
- E: storesDF.filter(sqft <= 25000) & customerSatisfaction >= 30)
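A minimal sketch of combining two column predicates, assuming storesDF is already loaded:
import org.apache.spark.sql.functions.col
// Column predicates can be combined with && so that both conditions must hold.
storesDF.filter(col("sqft") <= 25000 && col("customerSatisfaction") >= 30)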
Question 20
Which of the following sets of DataFrame methods will both return a new DataFrame only containing rows that meet a specified logical condition?
- A: drop(), where()
- B: filter(), select()
- C: filter(), where()
- D: select(), where()
- E: filter(), drop()
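For reference, where() is an alias of filter(); a minimal sketch, assuming storesDF is already loaded:
import org.apache.spark.sql.functions.col
// Both calls return a new DataFrame containing only the rows that satisfy the predicate.
storesDF.filter(col("sqft") <= 25000)
storesDF.where(col("sqft") <= 25000)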
Question 21
The code block shown below should return a DataFrame containing all columns from DataFrame storesDF except for column sqft and column customerSatisfaction. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
1.2(3)
- A: 1. drop 2. storesDF 3. col(“sqft”), col(“customerSatisfaction”)
- B: 1. storesDF 2. drop 3. sqft, customerSatisfaction
- C: 1. storesDF 2. drop 3. “sqft”, “customerSatisfaction”
- D: 1. storesDF 2. drop 3. col(sqft), col(customerSatisfaction)
- E: 1. drop 2. storesDF 3. col(sqft), col(customerSatisfaction)
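A minimal sketch, assuming storesDF contains both columns:
// drop() accepts column name strings and returns a new DataFrame without those columns.
storesDF.drop("sqft", "customerSatisfaction")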
Question 22
Which of the following describes the difference between DataFrame.repartition(n) and DataFrame.coalesce(n)?
- A: DataFrame.repartition(n) will split a DataFrame into n number of new partitions with data distributed evenly. DataFrame.coalesce(n) will more quickly combine the existing partitions of a DataFrame but might result in an uneven distribution of data across the new partitions.
- B: While the results are similar, DataFrame.repartition(n) will be more efficient than DataFrame.coalesce(n) because it can partition a DataFrame by a column.
- C: DataFrame.repartition(n) will split a DataFrame into any number of new partitions while minimizing shuffling. DataFrame.coalesce(n) will split a DataFrame into any number of new partitions utilizing a full shuffle.
- D: While the results are similar, DataFrame.repartition(n) will be less efficient than DataFrame.coalesce(n) because it can partition a DataFrame by a column.
- E: DataFrame.repartition(n) will combine the existing partitions of a DataFrame but may result in an uneven distribution of data across the new partitions. DataFrame.coalesce(n) will more slowly split a DataFrame into n number of new partitions with data distributed evenly.
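A minimal sketch contrasting the two calls, assuming storesDF is already loaded; the partition counts are arbitrary:
// repartition(n) triggers a full shuffle and produces n roughly evenly sized partitions.
val evenDF = storesDF.repartition(8)
// coalesce(n) merges existing partitions without a full shuffle, which is cheaper
// but can leave the data unevenly distributed across partitions.
val fewerDF = storesDF.coalesce(2)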
Question 23
Which of the following cluster configurations is most likely to experience delays due to garbage collection of a large DataFrame?
Note: each configuration has roughly the same compute power using 100GB of RAM and 200 cores. (The table of scenarios #1–#5 referenced by the answer choices accompanied the original question and is not reproduced here.)
- A: More information is needed to determine an answer.
- B: Scenario #5
- C: Scenario #4
- D: Scenario #1
- E: Scenario #2
Question 24
Which of the following statements about Spark’s stability is incorrect?
- A: Spark is designed to support the loss of any set of worker nodes.
- B: Spark will rerun any failed tasks due to failed worker nodes.
- C: Spark will recompute data cached on failed worker nodes.
- D: Spark will spill data to disk if it does not fit in memory.
- E: Spark will reassign the driver to a worker node if the driver’s node fails.
Question 25
Which of the following DataFrame operations is classified as a transformation?
- A: DataFrame.select()
- B: DataFrame.count()
- C: DataFrame.show()
- D: DataFrame.first()
- E: DataFrame.collect()
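For reference, a minimal sketch of lazy transformations versus eager actions, assuming storesDF is already loaded:
// select() is a transformation: it is recorded lazily and returns a new DataFrame.
val divisionsDF = storesDF.select("division")
// count(), show(), first(), and collect() are actions: they trigger execution and return results to the driver.
val rowCount = storesDF.count()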