Professional Data Engineer
Question 1
Your company built a TensorFlow neural network model with a large number of neurons and layers. The model fits the training data well. However, when tested against new data, it performs poorly. What method can you employ to address this?
- A: Threading
- B: Serialization
- C: Dropout Methods
- D: Dimensionality Reduction
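For reference, dropout is applied in TensorFlow by inserting Dropout layers between the dense layers of the network; a minimal Keras sketch (layer sizes, rates, and the input shape are illustrative):

```python
import tensorflow as tf

# Illustrative network: Dropout layers randomly zero a fraction of activations
# during training, which reduces overfitting in models with many parameters.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100,)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # drop 50% of activations at training time only
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```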
Question 2
Your company is in a highly regulated industry. One of your requirements is to ensure individual users have access only to the minimum amount of information required to do their jobs. You want to enforce this requirement with Google BigQuery. Which three approaches can you take? (Choose three.)
- A: Disable writes to certain tables.
- B: Restrict access to tables by role.
- C: Ensure that the data is encrypted at all times.
- D: Restrict BigQuery API access to approved users.
- E: Segregate data across multiple tables or databases.
- F: Use Google Stackdriver Audit Logging to determine policy violations.
Question 3
You have a requirement to insert minute-resolution data from 50,000 sensors into a BigQuery table. You expect significant growth in data volume and need the data to be available within 1 minute of ingestion for real-time analysis of aggregated trends. What should you do?
- A: Use bq load to load a batch of sensor data every 60 seconds.
- B: Use a Cloud Dataflow pipeline to stream data into the BigQuery table.
- C: Use the INSERT statement to insert a batch of data every 60 seconds.
- D: Use the MERGE statement to apply updates in batch every 60 seconds.
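For reference, a minimal sketch of a streaming Dataflow (Apache Beam, Python SDK) pipeline that reads sensor messages from Pub/Sub and streams them into BigQuery; the project, topic, table, and JSON payload format are assumptions:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # streaming mode: rows are queryable within seconds

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadSensorData" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/sensor-readings")
        | "ParseJson" >> beam.Map(json.loads)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:sensors.minute_readings",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
    )
```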
Question 4
You need to copy millions of sensitive patient records from a relational database to BigQuery. The total size of the database is 10 TB. You need to design a solution that is secure and time-efficient. What should you do?
- A: Export the records from the database as an Avro file. Upload the file to GCS using gsutil, and then load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
- B: Export the records from the database as an Avro file. Copy the file onto a Transfer Appliance and send it to Google, and then load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
- C: Export the records from the database into a CSV file. Create a public URL for the CSV file, and then use Storage Transfer Service to move the file to Cloud Storage. Load the CSV file into BigQuery using the BigQuery web UI in the GCP Console.
- D: Export the records from the database as an Avro file. Create a public URL for the Avro file, and then use Storage Transfer Service to move the file to Cloud Storage. Load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
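For reference, the load step in the Cloud Storage-based options can also be expressed with the BigQuery client library; a sketch with hypothetical bucket and table names:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Avro is self-describing, so no schema needs to be supplied for the load job.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)

load_job = client.load_table_from_uri(
    "gs://secure-staging-bucket/patients/*.avro",   # hypothetical Cloud Storage path
    "my-project.clinical.patient_records",          # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # block until the load completes
```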
Question 5
You need to create a near real-time inventory dashboard that reads the main inventory tables in your BigQuery data warehouse. Historical inventory data is stored as inventory balances by item and location. You have several thousand updates to inventory every hour. You want to maximize performance of the dashboard and ensure that the data is accurate. What should you do?
- A: Leverage BigQuery UPDATE statements to update the inventory balances as they are changing.
- B: Partition the inventory balance table by item to reduce the amount of data scanned with each inventory update.
- C: Use BigQuery streaming to stream changes into a daily inventory movement table. Calculate balances in a view that joins it to the historical inventory balance table. Update the inventory balance table nightly.
- D: Use the BigQuery bulk loader to batch load inventory changes into a daily inventory movement table. Calculate balances in a view that joins it to the historical inventory balance table. Update the inventory balance table nightly.
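For reference, a sketch (with hypothetical table and column names) of a view that combines a daily movement table with the historical balance table, as described in the last two options:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Current balances = historical balance + movements recorded since the last nightly update.
view_sql = """
CREATE OR REPLACE VIEW `my-project.inventory.current_balance` AS
SELECT
  b.item_id,
  b.location_id,
  b.balance + IFNULL(SUM(m.quantity_delta), 0) AS current_balance
FROM `my-project.inventory.historical_balance` AS b
LEFT JOIN `my-project.inventory.daily_movements` AS m
  USING (item_id, location_id)
GROUP BY b.item_id, b.location_id, b.balance
"""
client.query(view_sql).result()
```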
Question 6
You have data stored in BigQuery. The data in the BigQuery dataset must be highly available. You need to define a storage, backup, and recovery strategy for this data that minimizes cost. How should you configure the BigQuery tables to have a recovery point objective (RPO) of 30 days?
- A: Set the BigQuery dataset to be regional. In the event of an emergency, use a point-in-time snapshot to recover the data.
- B: Set the BigQuery dataset to be regional. Create a scheduled query to make copies of the data to tables suffixed with the time of the backup. In the event of an emergency, use the backup copy of the table.
- C: Set the BigQuery dataset to be multi-regional. In the event of an emergency, use a point-in-time snapshot to recover the data.
- D: Set the BigQuery dataset to be multi-regional. Create a scheduled query to make copies of the data to tables suffixed with the time of the backup. In the event of an emergency, use the backup copy of the table.
Question 7
You used Dataprep to create a recipe on a sample of data in a BigQuery table. You want to reuse this recipe on a daily upload of data with the same schema, after the load job with variable execution time completes. What should you do?
- A: Create a cron schedule in Dataprep.
- B: Create an App Engine cron job to schedule the execution of the Dataprep job.
- C: Export the recipe as a Dataprep template, and create a job in Cloud Scheduler.
- D: Export the Dataprep job as a Dataflow template, and incorporate it into a Composer job.
Question 8
You want to automate execution of a multi-step data pipeline running on Google Cloud. The pipeline includes Dataproc and Dataflow jobs that have multiple dependencies on each other. You want to use managed services where possible, and the pipeline will run every day. Which tool should you use?
- A: cron
- B: Cloud Composer
- C: Cloud Scheduler
- D: Workflow Templates on Dataproc
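For reference, a Cloud Composer (Airflow) DAG sketch chaining a Dataproc job and a Dataflow template run; the operator arguments, bucket paths, and names are assumptions based on the apache-airflow-providers-google package:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

with DAG(
    dag_id="daily_data_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # the pipeline runs every day
    catchup=False,
) as dag:
    prepare = DataprocSubmitJobOperator(
        task_id="dataproc_prepare",
        project_id="my-project",
        region="us-central1",
        job={
            "placement": {"cluster_name": "etl-cluster"},
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/prepare.py"},
        },
    )
    load = DataflowTemplatedJobStartOperator(
        task_id="dataflow_load",
        project_id="my-project",
        location="us-central1",
        job_name="daily-load",
        template="gs://my-bucket/templates/load_template",
    )

    prepare >> load   # Composer handles the dependency, scheduling, and monitoring
```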
Question 9
You are managing a Cloud Dataproc cluster. You need to make a job run faster while minimizing costs, without losing work in progress on your clusters. What should you do?
- A: Increase the cluster size with more non-preemptible workers.
- B: Increase the cluster size with preemptible worker nodes, and configure them to forcefully decommission.
- C: Increase the cluster size with preemptible worker nodes, and use Cloud Stackdriver to trigger a script to preserve work.
- D: Increase the cluster size with preemptible worker nodes, and configure them to use graceful decommissioning.
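For reference, a sketch of resizing a cluster's secondary (preemptible) worker pool with a graceful decommission timeout via the Dataproc Python client; the project, region, and cluster names are assumptions:

```python
from google.cloud import dataproc_v1

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)

# Grow the secondary (preemptible) worker pool; the graceful decommission timeout
# lets removed nodes finish in-flight work instead of being torn down immediately.
operation = client.update_cluster(
    request={
        "project_id": "my-project",
        "region": "us-central1",
        "cluster_name": "analytics-cluster",
        "cluster": {"config": {"secondary_worker_config": {"num_instances": 20}}},
        "update_mask": {"paths": ["config.secondary_worker_config.num_instances"]},
        "graceful_decommission_timeout": {"seconds": 3600},
    }
)
operation.result()
```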
Question 10
You work for a shipping company that uses handheld scanners to read shipping labels. Your company has strict data privacy standards that require scanners to only transmit tracking numbers when events are sent to Kafka topics. A recent software update caused the scanners to accidentally transmit recipients' personally identifiable information (PII) to analytics systems, which violates user privacy rules. You want to quickly build a scalable solution using cloud-native managed services to prevent exposure of PII to the analytics systems. What should you do?
- A: Create an authorized view in BigQuery to restrict access to tables with sensitive data.
- B: Install a third-party data validation tool on Compute Engine virtual machines to check the incoming data for sensitive information.
- C: Use Cloud Logging to analyze the data passed through the total pipeline to identify transactions that may contain sensitive information.
- D: Build a Cloud Function that reads the topics and makes a call to the Cloud Data Loss Prevention (Cloud DLP) API. Use the tagging and confidence levels to either pass or quarantine the data in a bucket for review.
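For reference, a sketch of a Pub/Sub-triggered Cloud Function calling the Cloud DLP API; the info types, likelihood threshold, and routing logic are illustrative:

```python
import base64

from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()

def inspect_scanner_event(event, context):
    """Background Cloud Function triggered by a Pub/Sub message."""
    payload = base64.b64decode(event["data"]).decode("utf-8")

    response = dlp.inspect_content(
        request={
            "parent": "projects/my-project",
            "inspect_config": {
                "info_types": [{"name": "PERSON_NAME"}, {"name": "EMAIL_ADDRESS"},
                               {"name": "STREET_ADDRESS"}],
                "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
            },
            "item": {"value": payload},
        }
    )

    # Route to quarantine if any PII finding comes back; otherwise pass downstream.
    return "quarantine" if response.result.findings else "pass"
```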
Question 11
You have developed three data processing jobs. One executes a Cloud Dataflow pipeline that transforms data uploaded to Cloud Storage and writes results to BigQuery. The second ingests data from on-premises servers and uploads it to Cloud Storage. The third is a Cloud Dataflow pipeline that gets information from third-party data providers and uploads the information to Cloud Storage. You need to be able to schedule and monitor the execution of these three workflows and manually execute them when needed. What should you do?
- A: Create a Direct Acyclic Graph in Cloud Composer to schedule and monitor the jobs.
- B: Use Stackdriver Monitoring and set up an alert with a Webhook notification to trigger the jobs.
- C: Develop an App Engine application to schedule and request the status of the jobs using GCP API calls.
- D: Set up cron jobs in a Compute Engine instance to schedule and monitor the pipelines using GCP API calls.
Question 12
You have Cloud Functions written in Node.js that pull messages from Cloud Pub/Sub and send the data to BigQuery. You observe that the message processing rate on the Pub/Sub topic is orders of magnitude higher than anticipated, but there is no error logged in Cloud Logging. What are the two most likely causes of this problem? (Choose two.)
- A: Publisher throughput quota is too small.
- B: Total outstanding messages exceed the 10-MB maximum.
- C: Error handling in the subscriber code is not handling run-time errors properly.
- D: The subscriber code cannot keep up with the messages.
- E: The subscriber code does not acknowledge the messages that it pulls.
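For reference, a sketch of a pull subscriber that acknowledges messages only after successful processing; `process` and the subscription name are hypothetical:

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "scanner-events-sub")

def process(data: bytes) -> None:
    ...  # hypothetical transform / BigQuery insert

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    try:
        process(message.data)
        message.ack()    # unacknowledged messages are redelivered, inflating the observed rate
    except Exception:
        message.nack()   # make failures visible instead of silently dropping them

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull_future.result()
```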
Question 13
You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules:
- No interaction by the user on the site for 1 hour
- Has added more than $30 worth of products to the basket
- Has not completed a transaction
You use Google Cloud Dataflow to process the data and decide if a message should be sent. How should you design the pipeline?
- A: Use a fixed-time window with a duration of 60 minutes.
- B: Use a sliding time window with a duration of 60 minutes.
- C: Use a session window with a gap time duration of 60 minutes.
- D: Use a global window with a time based trigger with a delay of 60 minutes.
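For reference, a sketch of session windowing in an Apache Beam (Python SDK) pipeline keyed by user; the topic, field names, and aggregation are assumptions:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/site-events")
        | "ParseJson" >> beam.Map(json.loads)
        | "KeyByUser" >> beam.Map(lambda e: (e["user_id"], e["basket_value"]))
        # A session closes after 60 minutes with no activity from that user.
        | "SessionWindow" >> beam.WindowInto(window.Sessions(gap_size=60 * 60))
        | "BasketValuePerSession" >> beam.CombinePerKey(sum)
    )
```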
Question 14
You are creating a new pipeline in Google Cloud to stream IoT data from Cloud Pub/Sub through Cloud Dataflow to BigQuery. While previewing the data, you notice that roughly 2% of the data appears to be corrupt. You need to modify the Cloud Dataflow pipeline to filter out this corrupt data. What should you do?
- A: Add a SideInput that returns a Boolean if the element is corrupt.
- B: Add a ParDo transform in Cloud Dataflow to discard corrupt elements.
- C: Add a Partition transform in Cloud Dataflow to separate valid data from corrupt data.
- D: Add a GroupByKey transform in Cloud Dataflow to group all of the valid data together and discard the rest.
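For reference, a sketch of a ParDo that drops corrupt elements; the parsing and field checks are hypothetical:

```python
import json

import apache_beam as beam

class DropCorrupt(beam.DoFn):
    """Emit only elements that parse as JSON and carry the expected fields."""

    def process(self, element):
        try:
            record = json.loads(element)
        except (TypeError, ValueError):
            return  # corrupt: emit nothing
        if "device_id" in record and "timestamp" in record:  # hypothetical schema check
            yield record

# Used inside the pipeline, e.g.:
# cleaned = raw_messages | "FilterCorrupt" >> beam.ParDo(DropCorrupt())
```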
Question 15
You have historical data covering the last three years in BigQuery and a data pipeline that delivers new data to BigQuery daily. You have noticed that when the Data Science team runs a query filtered on a date column and limited to 30 to 90 days of data, the query scans the entire table. You also noticed that your bill is increasing more quickly than you expected. You want to resolve the issue as cost-effectively as possible while maintaining the ability to conduct SQL queries. What should you do?
- A: Re-create the tables using DDL. Partition the tables by a column containing a TIMESTAMP or DATE type.
- B: Recommend that the Data Science team export the table to a CSV file on Cloud Storage and use Cloud Datalab to explore the data by reading the files directly.
- C: Modify your pipeline to maintain the last 30 to 90 days of data in one table and the longer history in a different table to minimize full table scans over the entire history.
- D: Write an Apache Beam pipeline that creates a BigQuery table per day. Recommend that the Data Science team use wildcards on the table name suffixes to select the data they need.
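For reference, a sketch of re-creating a table as date-partitioned with DDL, using hypothetical table and column names:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Queries that filter on DATE(event_timestamp) then prune to the matching partitions
# instead of scanning the full three-year history.
ddl = """
CREATE TABLE `my-project.analytics.events_partitioned`
PARTITION BY DATE(event_timestamp)
AS
SELECT * FROM `my-project.analytics.events`
"""
client.query(ddl).result()
```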
Question 16
You operate a logistics company, and you want to improve event delivery reliability for vehicle-based sensors. You operate small data centers around the world to capture these events, but leased lines that provide connectivity from your event collection infrastructure to your event processing infrastructure are unreliable, with unpredictable latency. You want to address this issue in the most cost-effective way. What should you do?
- A: Deploy small Kafka clusters in your data centers to buffer events.
- B: Have the data acquisition devices publish data to Cloud Pub/Sub.
- C: Establish a Cloud Interconnect between all remote data centers and Google.
- D: Write a Cloud Dataflow pipeline that aggregates all data in session windows.
Question 17
You are a retailer that wants to integrate your online sales capabilities with different in-home assistants, such as Google Home. You need to interpret customer voice commands and issue an order to the backend systems. Which solution should you choose?
- A: Speech-to-Text API
- B: Cloud Natural Language API
- C: Dialogflow Enterprise Edition
- D: AutoML Natural Language
Question 18
Your company has a hybrid cloud initiative. You have a complex data pipeline that moves data between cloud provider services and leverages services from each of the cloud providers. Which cloud-native service should you use to orchestrate the entire pipeline?
- A: Cloud Dataflow
- B: Cloud Composer
- C: Cloud Dataprep
- D: Cloud Dataproc
Question 19
You use a dataset in BigQuery for analysis. You want to provide third-party companies with access to the same dataset. You need to keep the costs of data sharing low and ensure that the data is current. Which solution should you choose?
- A: Use Analytics Hub to control data access, and provide third-party companies with access to the dataset.
- B: Use Cloud Scheduler to export the data on a regular basis to Cloud Storage, and provide third-party companies with access to the bucket.
- C: Create a separate dataset in BigQuery that contains the relevant data to share, and provide third-party companies with access to the new dataset.
- D: Create a Dataflow job that reads the data in frequent time intervals, and writes it to the relevant BigQuery dataset or Cloud Storage bucket for third-party companies to use.
Question 20
Your company is in the process of migrating its on-premises data warehousing solutions to BigQuery. The existing data warehouse uses trigger-based change data capture (CDC) to apply updates from multiple transactional database sources on a daily basis. With BigQuery, your company hopes to improve its handling of CDC so that changes to the source systems are available to query in BigQuery in near-real time using log-based CDC streams, while also optimizing for the performance of applying changes to the data warehouse. Which two steps should they take to ensure that changes are available in the BigQuery reporting table with minimal latency while reducing compute overhead? (Choose two.)
- A: Perform a DML INSERT, UPDATE, or DELETE to replicate each individual CDC record in real time directly on the reporting table.
- B: Insert each new CDC record and corresponding operation type to a staging table in real time.
- C: Periodically DELETE outdated records from the reporting table.
- D: Periodically use a DML MERGE to perform several DML INSERT, UPDATE, and DELETE operations at the same time on the reporting table.
- E: Insert each new CDC record and corresponding operation type in real time to the reporting table, and use a materialized view to expose only the newest version of each unique record.
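For reference, a sketch of a periodic MERGE that applies staged CDC records to a reporting table; table names, keys, and the operation-type column are assumptions:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Collapse the staging table to the latest operation per key, then apply
# inserts, updates, and deletes to the reporting table in a single statement.
merge_sql = """
MERGE `my-project.dw.customers` AS t
USING (
  SELECT * EXCEPT(rn)
  FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY change_ts DESC) AS rn
    FROM `my-project.dw.customers_cdc_staging`
  )
  WHERE rn = 1
) AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'D' THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET t.name = s.name, t.email = s.email, t.change_ts = s.change_ts
WHEN NOT MATCHED AND s.op != 'D' THEN
  INSERT (customer_id, name, email, change_ts)
  VALUES (s.customer_id, s.name, s.email, s.change_ts)
"""
client.query(merge_sql).result()
```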
Question 21
You are designing a data processing pipeline. The pipeline must be able to scale automatically as load increases. Messages must be processed at least once and must be ordered within windows of 1 hour. How should you design the solution?
- A: Use Apache Kafka for message ingestion and use Cloud Dataproc for streaming analysis.
- B: Use Apache Kafka for message ingestion and use Cloud Dataflow for streaming analysis.
- C: Use Cloud Pub/Sub for message ingestion and Cloud Dataproc for streaming analysis.
- D: Use Cloud Pub/Sub for message ingestion and Cloud Dataflow for streaming analysis.
Question 22
You need to set access to BigQuery for different departments within your company. Your solution should comply with the following requirements:
- Each department should have access only to their data.
- Each department will have one or more leads who need to be able to create and update tables and provide them to their team.
- Each department has data analysts who need to be able to query but not modify data.
How should you set access to the data in BigQuery?
- A: Create a dataset for each department. Assign the department leads the role of OWNER, and assign the data analysts the role of WRITER on their dataset.
- B: Create a dataset for each department. Assign the department leads the role of WRITER, and assign the data analysts the role of READER on their dataset.
- C: Create a table for each department. Assign the department leads the role of Owner, and assign the data analysts the role of Editor on the project the table is in.
- D: Create a table for each department. Assign the department leads the role of Editor, and assign the data analysts the role of Viewer on the project the table is in.
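For reference, a sketch of granting dataset-level roles with the BigQuery client library; the dataset, user, and group identities are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my-project.finance_dept")   # one dataset per department

entries = list(dataset.access_entries)
# Department lead: can create and update tables in the dataset.
entries.append(bigquery.AccessEntry(
    role="WRITER", entity_type="userByEmail", entity_id="finance-lead@example.com"))
# Analysts: can query but not modify data.
entries.append(bigquery.AccessEntry(
    role="READER", entity_type="groupByEmail", entity_id="finance-analysts@example.com"))

dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```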
Question 23
You operate a database that stores stock trades and an application that retrieves average stock price for a given company over an adjustable window of time. The data is stored in Cloud Bigtable where the datetime of the stock trade is the beginning of the row key. Your application has thousands of concurrent users, and you notice that performance is starting to degrade as more stocks are added. What should you do to improve the performance of your application?
- A: Change the row key syntax in your Cloud Bigtable table to begin with the stock symbol.
- B: Change the row key syntax in your Cloud Bigtable table to begin with a random number per second.
- C: Change the data pipeline to use BigQuery for storing stock trades, and update your application.
- D: Use Cloud Dataflow to write a summary of each day's stock trades to an Avro file on Cloud Storage. Update your application to read from Cloud Storage and Cloud Bigtable to compute the responses.
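For reference, a sketch of a row key that leads with the stock symbol and a per-symbol prefix scan with the Bigtable client; the instance, table, column family, and key layout are assumptions:

```python
from google.cloud import bigtable
from google.cloud.bigtable.row_set import RowSet

client = bigtable.Client(project="my-project")
table = client.instance("trading-instance").table("stock_trades")

# Leading with the symbol spreads writes across tablets (avoiding a hotspot on the
# current timestamp) and keeps each company's trades in one contiguous key range.
row_key = b"GOOG#2024-05-01T14:32:09.123Z#trade-42"   # hypothetical key layout
row = table.direct_row(row_key)
row.set_cell("trade", "price", b"172.35")             # hypothetical column family/value
row.commit()

row_set = RowSet()
row_set.add_row_range_with_prefix("GOOG#2024-05-01")  # all GOOG trades on that day
for result in table.read_rows(row_set=row_set):
    pass  # compute the average price over the requested window
```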
Question 24
Your company handles data processing for a number of different clients. Each client prefers to use their own suite of analytics tools, with some allowing direct query access via Google BigQuery. You need to secure the data so that clients cannot see each other's data. You want to ensure appropriate access to the data. Which three steps should you take? (Choose three.)
- A: Load data into different partitions.
- B: Load data into a different dataset for each client.
- C: Put each client's BigQuery dataset into a different table.
- D: Restrict a client's dataset to approved users.
- E: Only allow a service account to access the datasets.
- F: Use the appropriate identity and access management (IAM) roles for each client's users.
Question 25
You are operating a Cloud Dataflow streaming pipeline. The pipeline aggregates events from a Cloud Pub/Sub subscription source, within a window, and sinks the resulting aggregation to a Cloud Storage bucket. The source has consistent throughput. You want to monitor and alert on the behavior of the pipeline with Cloud Stackdriver to ensure that it is processing data. Which Stackdriver alerts should you create?
- A: An alert based on a decrease of subscription/num_undelivered_messages for the source and a rate of change increase of instance/storage/used_bytes for the destination
- B: An alert based on an increase of subscription/num_undelivered_messages for the source and a rate of change decrease of instance/storage/used_bytes for the destination
- C: An alert based on a decrease of instance/storage/used_bytes for the source and a rate of change increase of subscription/num_undelivered_messages for the destination
- D: An alert based on an increase of instance/storage/used_bytes for the source and a rate of change decrease of subscription/num_undelivered_messages for the destination
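For reference, a sketch of reading the Pub/Sub backlog metric that such an alert would watch, via the Cloud Monitoring client; the project and subscription IDs are hypothetical:

```python
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 600}, "end_time": {"seconds": now}}
)

# A growing subscription/num_undelivered_messages backlog means the pipeline
# has stopped consuming from its Pub/Sub source.
series = client.list_time_series(
    request={
        "name": "projects/my-project",
        "filter": (
            'metric.type = "pubsub.googleapis.com/subscription/num_undelivered_messages" '
            'AND resource.labels.subscription_id = "events-sub"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for ts in series:
    print(ts.points[0].value.int64_value)
```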