Which of the following operations can be used to return a DataFrame with no duplicate rows? Please select the most complete answer.
A. DataFrame.distinct()
B. DataFrame.dropDuplicates() and DataFrame.distinct()
C. DataFrame.dropDuplicates()
D. DataFrame.drop_duplicates()
E. DataFrame.dropDuplicates(), DataFrame.distinct() and DataFrame.drop_duplicates()
Which of the following Spark properties is used to configure the maximum size of an automatically broadcasted DataFrame when performing a join?
A. spark.sql.broadcastTimeout
B. spark.sql.autoBroadcastJoinThreshold
C. spark.sql.shuffle.partitions
D. spark.sql.inMemoryColumnarStorage.batchSize
E. spark.sql.adaptive.skewedJoin.enabled
The code block shown below contains an error. The code block is intended to return a DataFrame containing a column dayOfYear, an integer representation of the day of the year from column openDate in DataFrame storesDF. Identify the error.
Note that column openDate is of type integer and represents a date in the UNIX epoch format – the number of seconds since midnight on January 1st, 1970.
A. The dayofyear() operation cannot extract the day of year from a column of type integer – column openDate must first be converted to type Timestamp.
B. The dayofyear() operation takes a quoted column name rather than a Column object as its first argument – the first argument should be "openDate".
C. The dayofyear() operation cannot extract the day of year from a column of type integer – column openDate must first be converted to type Date.
D. The dayofyear() operation is not applicable in a withColumn() call – the newColumn() operation must be used instead.
E. There is no dayofyear() operation – the day of year number must be extracted using substring utilities.
Which of the following statements about slots is incorrect?
A. Slots are the most granular level of execution in the Spark execution hierarchy.
B. Slots are resources for parallelization within an executor.
C. Tasks are assigned to slots for computation.
D. There can be more slots than tasks.
E. There must be at least as many slots as there are executors.
Which of the following storage levels should be used to store as much data as possible in memory on two cluster nodes while storing any data that does not fit in memory on disk to be read in when needed?
A. MEMORY_ONLY_2
B. MEMORY_AND_DISK_SER
C. MEMORY_AND_DISK
D. MEMORY_AND_DISK_2
E. MEMORY_ONLY
In what order should the numbered lines of code below be run to write DataFrame storesDF to file path filePath as parquet, partitioned by the values in column division?
Lines of code:
1. .write() \
2. .partitionBy("division") \
3. .parquet(filePath)
4. storesDF \
5. .repartition("division")
6. .write \
7. .path(filePath, "parquet")
A. 4, 1, 2, 3
B. 4, 1, 5, 7
C. 4, 6, 2, 3
D. 4, 1, 5, 3
E. 4, 6, 2, 7
Which of the following operations can be used to return the number of rows in a DataFrame?
A. DataFrame.numberOfRows()
B. DataFrame.n()
C. DataFrame.sum()
D. DataFrame.count()
E. DataFrame.countDistinct()
Which of the following operations returns a GroupedData object?
A. DataFrame.GroupBy()
B. DataFrame.cubed()
C. DataFrame.group()
D. DataFrame.groupBy()
E. DataFrame.grouping_id()
Which of the following code blocks fails to return a new DataFrame that is the result of an inner join between DataFrame storesDF and DataFrame employeesDF on column storeId and column employeeId?
E. storesDF.alias("s").join(employeesDF.alias("e"), col("s.storeId") === col("e.storeId") and col("s.employeeId") === col("e.employeeId"))
Which of the following cluster configurations is most likely to experience delays due to garbage collection of a large DataFrame?
Note: each configuration has roughly the same compute power using 100GB of RAM and 200 cores.
A. More information is needed to determine an answer.
B. Scenario #5
C. Scenario #4
D. Scenario #1
E. Scenario #2
The code block shown below should cache DataFrame storesDF only in Spark's memory. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
__1__.__2__(__3__).count()
A. 1. storesDF 2. cache 3. StorageLevel.MEMORY_ONLY
B. 1. storesDF 2. storageLevel 3. cache
C. 1. storesDF 2. cache 3. Nothing
D. 1. storesDF 2. persist 3. Nothing
E. 1. storesDF 2. persist 3. StorageLevel.MEMORY_ONLY
Which of the following code blocks returns a DataFrame containing only the rows from DataFrame storesDF where the value in column sqft is less than or equal to 25,000 OR the value in column customerSatisfaction is greater than or equal to 30?
A. storesDF.filter(col("sqft") <= 25000 and col("customerSatisfaction") >= 30)
E. storesDF.filter(col("sqft") <= 25000 or col("customerSatisfaction") >= 30)
The code block shown below contains an error. The code block is intended to return a new DataFrame from DataFrame storesDF where column storeId is of the type string. Identify the error.
A. Calls to withColumn() cannot create a new column of the same name on which it is operating.
B. DataFrame columns cannot be converted to a new type inside of a call to withColumn().
C. The call to StringType should not be followed by parentheses.
D. The column name storeId inside the col() operation should not be quoted.
E. The cast() operation is a method in the Column class rather than a standalone function.
The code block shown below should return a new DataFrame where column division from DataFrame storesDF has been renamed to column state and column managerName from DataFrame storesDF has been renamed to column managerFullName. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
The code block shown below should read a CSV at the file path filePath into a DataFrame with the specified schema schema. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
The code block shown below should return a collection of summary statistics for column sqft in DataFrame storesDF. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
storesDF.__1__(__2__)
A. 1. summary 2. col("sqft")
B. 1. describe 2. col("sqft")
C. 1. summary 2. "sqft"
D. 1. describe 2. "sqft"
E. 1. summary 2. "all"
The code block shown below should return a new DataFrame where rows in DataFrame storesDF with missing values in every column have been dropped. Choose the response that correctly fills in the numbered blanks within the code block to complete this task.
Code block:
storesDF.__1__.__2__(__3__ = __4__)
A. 1. na 2. drop 3. how 4. "any"
B. 1. na 2. drop 3. subset 4. "all"
C. 1. na 2. drop 3. subset 4. "any"
D. 1. na 2. drop 3. how 4. "all"
E. 1. drop 2. na 3. how 4. "all"
Which of the following code blocks returns a new DataFrame with column storeReview where the pattern "End" has been removed from the end of column storeReview in DataFrame storesDF?
The code block shown below contains an error. The code block is intended to create a single-column DataFrame from Python list years which is made up of integers. Identify the error.
Code block:
spark.createDataFrame(years, IntegerType)
A. The column name must be specified.
B. The years list should be wrapped in another list like [years] to make clear that it is a column rather than a row.
C. There is no createDataFrame operation in spark.
D. The IntegerType call must be followed by parentheses.
E. The IntegerType call should not be present – Spark can tell that list years is full of integers.
Which of the following operations will fail to trigger evaluation?
A. DataFrame.collect()
B. DataFrame.count()
C. DataFrame.first()
D. DataFrame.join()
E. DataFrame.take()
Which of the following code blocks returns a new DataFrame where column sqft from DataFrame storesDF has had its missing values replaced with the value 30,000?
A sample of DataFrame storesDF is below:
A. storesDF.na.fill(30000, Seq("sqft"))
B. storesDF.nafill(30000, col("sqft"))
C. storesDF.na.fill(30000, col("sqft"))
D. storesDF.fillna(30000, col("sqft"))
E. storesDF.na.fill(30000, "sqft")
Which of the following statements about the Spark DataFrame is true?
A. Spark DataFrames are mutable unless they've been collected to the driver.
B. A Spark DataFrame is rarely used aside from the import and export of data.
C. Spark DataFrames cannot be distributed into partitions.
D. A Spark DataFrame is a tabular data structure that is the most common Structured API in Spark.
E. A Spark DataFrame is exactly the same as a data frame in Python or R.
Which of the following code blocks returns the number of rows in DataFrame storesDF for each unique value in column division?
A. storesDF.groupBy("division").agg(count())
B. storesDF.agg(groupBy("division").count())
C. storesDF.groupby.count("division")
D. storesDF.groupBy().count("division")
E. storesDF.groupBy("division").count()
Which of the following code blocks will always return a new 4-partition DataFrame from the 8-partition DataFrame storesDF without inducing a shuffle?