Professional Data Engineer

Question 26

You currently have a single on-premises Kafka cluster in a data center in the us-east region that is responsible for ingesting messages from IoT devices globally.
Because large parts of the globe have poor internet connectivity, messages sometimes batch at the edge, come in all at once, and cause a spike in load on your
Kafka cluster. This is becoming difficult to manage and prohibitively expensive. What is the Google-recommended cloud native architecture for this scenario?

  • A: Edge TPUs as sensor devices for storing and transmitting the messages.
  • B: Cloud Dataflow connected to the Kafka cluster to scale the processing of incoming messages.
  • C: An IoT gateway connected to Cloud Pub/Sub, with Cloud Dataflow to read and process the messages from Cloud Pub/Sub.
  • D: A Kafka cluster virtualized on Compute Engine in us-east with Cloud Load Balancing to connect to the devices around the world.

Question 27

You decided to use Cloud Datastore to ingest vehicle telemetry data in real time. You want to build a storage system that will account for the long-term data growth, while keeping the costs low. You also want to create snapshots of the data periodically, so that you can make a point-in-time (PIT) recovery, or clone a copy of the data for Cloud Datastore in a different environment. You want to archive these snapshots for a long time. Which two methods can accomplish this?
(Choose two.)

  • A: Use managed export, and store the data in a Cloud Storage bucket using Nearline or Coldline class.
  • B: Use managed export, and then import to Cloud Datastore in a separate project under a unique namespace reserved for that export.
  • C: Use managed export, and then import the data into a BigQuery table created just for that export, and delete temporary export files.
  • D: Write an application that uses Cloud Datastore client libraries to read all the entities. Treat each entity as a BigQuery table row via BigQuery streaming insert. Assign an export timestamp for each export, and attach it as an extra column for each row. Make sure that the BigQuery table is partitioned using the export timestamp column.
  • E: Write an application that uses Cloud Datastore client libraries to read all the entities. Format the exported data into a JSON file. Apply compression before storing the data in Cloud Source Repositories.
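
For reference, the managed export mentioned in several options can be triggered programmatically. A minimal sketch using the Datastore Admin client, assuming the google-cloud-datastore library is installed; the project ID and archive bucket name are placeholders:

    from google.cloud import datastore_admin_v1

    client = datastore_admin_v1.DatastoreAdminClient()
    operation = client.export_entities(
        request={
            "project_id": "my-project",
            # Hypothetical bucket, e.g. one using the Nearline or Coldline class.
            "output_url_prefix": "gs://my-datastore-archive",
        }
    )
    response = operation.result()  # blocks until the managed export completes
    print(response.output_url)     # prefix a later managed import can point at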

Question 28

You need to create a data pipeline that copies time-series transaction data so that it can be queried from within BigQuery by your data science team for analysis.
Every hour, thousands of transactions are updated with a new status. The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per day. The data is heavily structured, and your data science team will build machine learning models based on this data. You want to maximize performance and usability for your data science team. Which two strategies should you adopt? (Choose two.)

  • A: Denormalize the data as much as possible.
  • B: Preserve the structure of the data as much as possible.
  • C: Use BigQuery UPDATE to further reduce the size of the dataset.
  • D: Develop a data pipeline where status updates are appended to BigQuery instead of updated.
  • E: Copy a daily snapshot of transaction data to Cloud Storage and store it as an Avro file. Use BigQuery's support for external data sources to query.
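
To make the append-only idea in option D concrete, here is a hedged sketch of a query that reconstructs the latest status per transaction from appended rows; the table and column names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()
    # Each status change is appended as a new row; pick the most recent row per transaction.
    sql = """
    SELECT * EXCEPT(rn)
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY transaction_id ORDER BY updated_at DESC) AS rn
      FROM `my_project.sales.transaction_events`
    )
    WHERE rn = 1
    """
    for row in client.query(sql).result():
        print(row["transaction_id"], row["status"])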

Question 29

You are designing a cloud-native historical data processing system to meet the following conditions:
✑ The data being analyzed is in CSV, Avro, and PDF formats and will be accessed by multiple analysis tools including Dataproc, BigQuery, and Compute Engine.
✑ A batch pipeline moves daily data.
✑ Performance is not a factor in the solution.
✑ The solution design should maximize availability.
How should you design data storage for this solution?

  • A: Create a Dataproc cluster with high availability. Store the data in HDFS, and perform analysis as needed.
  • B: Store the data in BigQuery. Access the data using the BigQuery Connector on Dataproc and Compute Engine.
  • C: Store the data in a regional Cloud Storage bucket. Access the bucket directly using Dataproc, BigQuery, and Compute Engine.
  • D: Store the data in a multi-regional Cloud Storage bucket. Access the data directly using Dataproc, BigQuery, and Compute Engine.
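
For reference, a minimal sketch of creating a multi-regional Cloud Storage bucket with the Python client; the bucket name is a placeholder:

    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-historical-data")              # hypothetical name
    bucket.storage_class = "STANDARD"
    new_bucket = client.create_bucket(bucket, location="US")  # "US" is a multi-region location
    print(new_bucket.location, new_bucket.location_type)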

Question 30

You have a petabyte of analytics data and need to design a storage and processing platform for it. You must be able to perform data warehouse-style analytics on the data in Google Cloud and expose the dataset as files for batch analysis tools in other cloud providers. What should you do?

  • A: Store and process the entire dataset in BigQuery.
  • B: Store and process the entire dataset in Bigtable.
  • C: Store the full dataset in BigQuery, and store a compressed copy of the data in a Cloud Storage bucket.
  • D: Store the warm data as files in Cloud Storage, and store the active data in BigQuery. Keep this ratio as 80% warm and 20% active.

Question 31

You work for a manufacturing company that sources up to 750 different components, each from a different supplier. You've collected a labeled dataset that has on average 1000 examples for each unique component. Your team wants to implement an app to help warehouse workers recognize incoming components based on a photo of the component. You want to implement the first working version of this app (as a proof of concept) within a few working days. What should you do?

  • A: Use Cloud Vision AutoML with the existing dataset.
  • B: Use Cloud Vision AutoML, but reduce your dataset twice.
  • C: Use Cloud Vision API by providing custom labels as recognition hints.
  • D: Train your own image recognition model leveraging transfer learning techniques.

Question 32

You are working on a niche product in the image recognition domain. Your team has developed a model that is dominated by custom C++ TensorFlow ops your team has implemented. These ops are used inside your main training loop and are performing bulky matrix multiplications. It currently takes up to several days to train a model. You want to decrease this time significantly and keep the cost low by using an accelerator on Google Cloud. What should you do?

  • A: Use Cloud TPUs without any additional adjustment to your code.
  • B: Use Cloud TPUs after implementing GPU kernel support for your custom ops.
  • C: Use Cloud GPUs after implementing GPU kernel support for your custom ops.
  • D: Stay on CPUs, and increase the size of the cluster you're training your model on.

Question 33

You work on a regression problem in a natural language processing domain, and you have 100M labeled examples in your dataset. You have randomly shuffled your data and split your dataset into train and test samples (in a 90/10 ratio). After you trained the neural network and evaluated your model on a test set, you discover that the root-mean-squared error (RMSE) of your model is twice as high on the train set as on the test set. How should you improve the performance of your model?

  • A: Increase the share of the test sample in the train-test split.
  • B: Try to collect more data and increase the size of your dataset.
  • C: Try out regularization techniques (e.g., dropout or batch normalization) to avoid overfitting.
  • D: Increase the complexity of your model by, e.g., introducing an additional layer or increasing the size of the vocabularies or n-grams used.
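
Option C names dropout and batch normalization; as a hedged illustration only, a minimal Keras sketch of adding both to a regression network (layer sizes and rates are arbitrary):

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.3),   # randomly zeroes 30% of activations during training
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(1),       # single output for regression
    ])
    model.compile(optimizer="adam", loss="mse",
                  metrics=[tf.keras.metrics.RootMeanSquaredError()])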

Question 34

You use BigQuery as your centralized analytics platform. New data is loaded every day, and an ETL pipeline modifies the original data and prepares it for the final users. This ETL pipeline is regularly modified and can generate errors, but sometimes the errors are detected only after 2 weeks. You need to provide a method to recover from these errors, and your backups should be optimized for storage costs. How should you organize your data in BigQuery and store your backups?

  • A: Organize your data in a single table, export, and compress and store the BigQuery data in Cloud Storage.
  • B: Organize your data in separate tables for each month, and export, compress, and store the data in Cloud Storage.
  • C: Organize your data in separate tables for each month, and duplicate your data on a separate dataset in BigQuery.
  • D: Organize your data in separate tables for each month, and use snapshot decorators to restore the table to a time prior to the corruption.
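
As an illustration of the export-based backup in options A and B, a sketch that writes one monthly table to compressed files in Cloud Storage; the project, dataset, table, and bucket names are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.ExtractJobConfig(
        compression="GZIP",
        destination_format="NEWLINE_DELIMITED_JSON",
    )
    extract_job = client.extract_table(
        "my_project.analytics.transactions_2024_06",            # one table per month
        "gs://my-bq-backups/transactions_2024_06/*.json.gz",
        job_config=job_config,
    )
    extract_job.result()  # wait for the export to finish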

Question 35

You want to process payment transactions in a point-of-sale application that will run on Google Cloud Platform. Your user base could grow exponentially, but you do not want to manage infrastructure scaling.
Which Google database service should you use?

  • A: Cloud SQL
  • B: BigQuery
  • C: Cloud Bigtable
  • D: Cloud Datastore

Question 36

The marketing team at your organization provides regular updates of a segment of your customer dataset. The marketing team has given you a CSV with 1 million records that must be updated in BigQuery. When you use the UPDATE statement in BigQuery, you receive a quotaExceeded error. What should you do?

  • A: Reduce the number of records updated each day to stay within the BigQuery UPDATE DML statement limit.
  • B: Increase the BigQuery UPDATE DML statement limit in the Quota management section of the Google Cloud Platform Console.
  • C: Split the source CSV file into smaller CSV files in Cloud Storage to reduce the number of BigQuery UPDATE DML statements per BigQuery job.
  • D: Import the new records from the CSV file into a new BigQuery table. Create a BigQuery job that merges the new records with the existing records and writes the results to a new BigQuery table.
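
One way to apply a large batch of updates without issuing per-row UPDATE statements, as described in option D, is a single MERGE after loading the CSV into a staging table. A hedged sketch; the table and column names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()
    # Assumes the CSV was already loaded into my_dataset.customer_updates.
    merge_sql = """
    MERGE `my_project.my_dataset.customers` AS target
    USING `my_project.my_dataset.customer_updates` AS updates
    ON target.customer_id = updates.customer_id
    WHEN MATCHED THEN
      UPDATE SET target.segment = updates.segment, target.updated_at = updates.updated_at
    WHEN NOT MATCHED THEN
      INSERT (customer_id, segment, updated_at)
      VALUES (updates.customer_id, updates.segment, updates.updated_at)
    """
    client.query(merge_sql).result()  # one statement instead of a million single-row UPDATEs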

Question 37

As your organization expands its usage of GCP, many teams have started to create their own projects. Projects are further multiplied to accommodate different stages of deployments and target audiences. Each project requires unique access control configurations. The central IT team needs to have access to all projects.
Furthermore, data from Cloud Storage buckets and BigQuery datasets must be shared for use in other projects in an ad hoc way. You want to simplify access control management by minimizing the number of policies. Which two steps should you take? (Choose two.)

  • A: Use Cloud Deployment Manager to automate access provision.
  • B: Introduce resource hierarchy to leverage access control policy inheritance.
  • C: Create distinct groups for various teams, and specify groups in Cloud IAM policies.
  • D: Only use service accounts when sharing data for Cloud Storage buckets and BigQuery datasets.
  • E: For each Cloud Storage bucket or BigQuery dataset, decide which projects need access. Find all the active members who have access to these projects, and create a Cloud IAM policy to grant access to all these users.

Question 38

Your United States-based company has created an application for assessing and responding to user actions. The primary table's data volume grows by 250,000 records per second. Many third parties use your application's APIs to build the functionality into their own frontend applications. Your application's APIs should comply with the following requirements:
✑ Single global endpoint
✑ ANSI SQL support
✑ Consistent access to the most up-to-date data
What should you do?

  • A: Implement BigQuery with no region selected for storage or processing.
  • B: Implement Cloud Spanner with the leader in North America and read-only replicas in Asia and Europe.
  • C: Implement Cloud SQL for PostgreSQL with the master in North America and read replicas in Asia and Europe.
  • D: Implement Bigtable with the primary cluster in North America and secondary clusters in Asia and Europe.

Question 39

A data scientist has created a BigQuery ML model and asks you to create an ML pipeline to serve predictions. You have a REST API application with the requirement to serve predictions for an individual user ID with latency under 100 milliseconds. You use the following query to generate predictions: SELECT predicted_label, user_id FROM ML.PREDICT (MODEL 'dataset.model', table user_features). How should you create the ML pipeline?

  • A: Add a WHERE clause to the query, and grant the BigQuery Data Viewer role to the application service account.
  • B: Create an Authorized View with the provided query. Share the dataset that contains the view with the application service account.
  • C: Create a Dataflow pipeline using BigQueryIO to read results from the query. Grant the Dataflow Worker role to the application service account.
  • D: Create a Dataflow pipeline using BigQueryIO to read predictions for all users from the query. Write the results to Bigtable using BigtableIO. Grant the Bigtable Reader role to the application service account so that the application can read predictions for individual users from Bigtable.
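
For the low-latency lookup described in option D, a sketch of a point read of a precomputed prediction from Bigtable; the instance, table, column family, and qualifier names are placeholders:

    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("serving-instance").table("user_predictions")

    def get_prediction(user_id: str):
        # Row key is the user ID; predictions were written earlier by a batch job.
        row = table.read_row(user_id.encode("utf-8"))
        if row is None:
            return None
        cell = row.cells["predictions"][b"predicted_label"][0]
        return cell.value.decode("utf-8")

    print(get_prediction("user-123"))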

Question 40

You are building an application to share financial market data with consumers, who will receive data feeds. Data is collected from the markets in real time.
Consumers will receive the data in the following ways:
✑ Real-time event stream
✑ ANSI SQL access to real-time stream and historical data
✑ Batch historical exports
Which solution should you use?

  • A: Cloud Dataflow, Cloud SQL, Cloud Spanner
  • B: Cloud Pub/Sub, Cloud Storage, BigQuery
  • C: Cloud Dataproc, Cloud Dataflow, BigQuery
  • D: Cloud Pub/Sub, Cloud Dataproc, Cloud SQL

Question 41

You are building a new application that you need to collect data from in a scalable way. Data arrives continuously from the application throughout the day, and you expect to generate approximately 150 GB of JSON data per day by the end of the year. Your requirements are:
✑ Decoupling producer from consumer
✑ Space and cost-efficient storage of the raw ingested data, which is to be stored indefinitely
✑ Near real-time SQL query
✑ Maintain at least 2 years of historical data, which will be queried with SQL
Which pipeline should you use to meet these requirements?

  • A: Create an application that provides an API. Write a tool to poll the API and write data to Cloud Storage as gzipped JSON files.
  • B: Create an application that writes to a Cloud SQL database to store the data. Set up periodic exports of the database to write to Cloud Storage and load into BigQuery.
  • C: Create an application that publishes events to Cloud Pub/Sub, and create Spark jobs on Cloud Dataproc to convert the JSON data to Avro format, stored on HDFS on Persistent Disk.
  • D: Create an application that publishes events to Cloud Pub/Sub, and create a Cloud Dataflow pipeline that transforms the JSON event payloads to Avro, writing the data to Cloud Storage and BigQuery.
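
A minimal Beam (Python) sketch of the streaming path in option D from Pub/Sub into BigQuery; the Cloud Storage/Avro branch and error handling are omitted, and the topic, table, and schema are placeholders:

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)  # plus --runner/--project/--region flags in practice

    with beam.Pipeline(options=options) as p:
        (p
         | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
         | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
         | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
               "my_project:analytics.events",
               schema="event_id:STRING,payload:STRING,event_ts:TIMESTAMP",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           ))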

Question 42

You are running a pipeline in Dataflow that receives messages from a Pub/Sub topic and writes the results to a BigQuery dataset in the EU. Currently, your pipeline is located in europe-west4 and has a maximum of 3 workers, instance type n1-standard-1. You notice that during peak periods, your pipeline is struggling to process records in a timely fashion, when all 3 workers are at maximum CPU utilization. Which two actions can you take to increase performance of your pipeline? (Choose two.)

  • A: Increase the number of max workers
  • B: Use a larger instance type for your Dataflow workers
  • C: Change the zone of your Dataflow pipeline to run in us-central1
  • D: Create a temporary table in Bigtable that will act as a buffer for new data. Create a new step in your pipeline to write to this table first, and then create a new pipeline to write from Bigtable to BigQuery
  • E: Create a temporary table in Cloud Spanner that will act as a buffer for new data. Create a new step in your pipeline to write to this table first, and then create a new pipeline to write from Cloud Spanner to BigQuery
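
The worker settings mentioned in options A and B correspond to Beam pipeline options. A sketch with illustrative values; flag names follow the Beam Python SDK's worker options and should be checked against your SDK version:

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        "--runner=DataflowRunner",
        "--project=my-project",
        "--region=europe-west4",          # keep the pipeline near the EU BigQuery dataset
        "--max_num_workers=10",           # raise the autoscaling ceiling above 3
        "--machine_type=n1-standard-4",   # larger workers than n1-standard-1
        "--streaming",
    ])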

Question 43

You have a data pipeline with a Dataflow job that aggregates and writes time series metrics to Bigtable. You notice that data is slow to update in Bigtable. This data feeds a dashboard used by thousands of users across the organization. You need to support additional concurrent users and reduce the amount of time required to write the data. Which two actions should you take? (Choose two.)

  • A: Configure your Dataflow pipeline to use local execution
  • B: Increase the maximum number of Dataflow workers by setting maxNumWorkers in PipelineOptions
  • C: Increase the number of nodes in the Bigtable cluster
  • D: Modify your Dataflow pipeline to use the Flatten transform before writing to Bigtable
  • E: Modify your Dataflow pipeline to use the CoGroupByKey transform before writing to Bigtable
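
Resizing a Bigtable cluster, as in option C, can be scripted with the Bigtable admin client. A hedged sketch; the instance ID, cluster ID, and node count are placeholders:

    from google.cloud import bigtable

    client = bigtable.Client(project="my-project", admin=True)
    cluster = client.instance("metrics-instance").cluster("metrics-cluster-c1")
    cluster.reload()                 # fetch the current cluster settings
    cluster.serve_nodes = 6          # scale up from, e.g., 3 nodes
    operation = cluster.update()
    operation.result()               # wait for the resize to complete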

Question 44

You have several Spark jobs that run on a Cloud Dataproc cluster on a schedule. Some of the jobs run in sequence, and some of the jobs run concurrently. You need to automate this process. What should you do?

  • A: Create a Cloud Dataproc Workflow Template
  • B: Create an initialization action to execute the jobs
  • C: Create a Directed Acyclic Graph in Cloud Composer
  • D: Create a Bash script that uses the Cloud SDK to create a cluster, execute jobs, and then tear down the cluster
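
A mix of sequential and concurrent Spark jobs maps naturally onto a DAG, as in option C. A hedged sketch of a Cloud Composer (Airflow) DAG using the Dataproc submit-job operator; the project, region, cluster, class names, and jar paths are placeholders:

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

    def spark_job(main_class, jar):
        return {
            "reference": {"project_id": "my-project"},
            "placement": {"cluster_name": "etl-cluster"},
            "spark_job": {"main_class": main_class, "jar_file_uris": [jar]},
        }

    with DAG("nightly_spark_jobs", start_date=datetime(2025, 1, 1),
             schedule_interval="0 2 * * *", catchup=False):
        extract = DataprocSubmitJobOperator(
            task_id="extract", project_id="my-project", region="us-east1",
            job=spark_job("com.example.Extract", "gs://my-jars/etl.jar"))
        enrich = DataprocSubmitJobOperator(
            task_id="enrich", project_id="my-project", region="us-east1",
            job=spark_job("com.example.Enrich", "gs://my-jars/etl.jar"))
        report = DataprocSubmitJobOperator(
            task_id="report", project_id="my-project", region="us-east1",
            job=spark_job("com.example.Report", "gs://my-jars/etl.jar"))

        extract >> [enrich, report]   # enrich and report run concurrently after extract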

Question 45

You are building a new data pipeline to share data between two different types of applications: job generators and job runners. Your solution must scale to accommodate increases in usage and must accommodate the addition of new applications without negatively affecting the performance of existing ones. What should you do?

  • A: Create an API using App Engine to receive and send messages to the applications
  • B: Use a Cloud Pub/Sub topic to publish jobs, and use subscriptions to execute them
  • C: Create a table on Cloud SQL, and insert and delete rows with the job information
  • D: Create a table on Cloud Spanner, and insert and delete rows with the job information
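
A sketch of the publish/subscribe pattern in option B: generators publish job messages to a topic and each runner type consumes its own subscription. The project, topic, and subscription names are placeholders:

    import json
    from google.cloud import pubsub_v1

    # Job generator side: publish a job message.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "jobs")
    publisher.publish(topic_path, json.dumps({"job_id": "42", "type": "resize"}).encode()).result()

    # Job runner side: pull jobs from a subscription and acknowledge them when done.
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path("my-project", "jobs-runner-a")

    def handle(message):
        job = json.loads(message.data.decode())
        print("running", job["job_id"])
        message.ack()

    streaming_pull = subscriber.subscribe(subscription_path, callback=handle)
    # streaming_pull.result() would block here to keep receiving messages.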

Question 46

You want to use a database of information about tissue samples to classify future tissue samples as either normal or mutated. You are evaluating an unsupervised anomaly detection method for classifying the tissue samples. Which two characteristics support this method? (Choose two.)

  • A: There are very few occurrences of mutations relative to normal samples.
  • B: There are roughly equal occurrences of both normal and mutated samples in the database.
  • C: You expect future mutations to have different features from the mutated samples in the database.
  • D: You expect future mutations to have similar features to the mutated samples in the database.
  • E: You already have labels for which samples are mutated and which are normal in the database.
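
As an illustration of the unsupervised anomaly-detection approach being evaluated (not of any particular option), a small scikit-learn sketch: an isolation forest is fit without labels and flags the rare samples that look unlike the majority. The data here is synthetic:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    normal = rng.normal(0.0, 1.0, size=(990, 8))   # the bulk of the tissue samples
    rare = rng.normal(4.0, 1.0, size=(10, 8))      # a handful of unusual samples
    samples = np.vstack([normal, rare])

    # No labels are used: the forest isolates points that differ from the majority.
    model = IsolationForest(contamination=0.01, random_state=0).fit(samples)
    flags = model.predict(samples)                 # +1 = inlier, -1 = flagged as anomalous
    print("flagged:", int((flags == -1).sum()))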

Question 47

You need to create a new transaction table in Cloud Spanner that stores product sales data. You are deciding what to use as a primary key. From a performance perspective, which strategy should you choose?

  • A: The current epoch time
  • B: A concatenation of the product name and the current epoch time
  • C: A random universally unique identifier number (version 4 UUID)
  • D: The original order identification number from the sales system, which is a monotonically increasing integer
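
A sketch of option C, generating a version 4 UUID as the primary key when inserting a sale into Cloud Spanner; the instance, database, table, and column names are placeholders:

    import uuid
    from google.cloud import spanner

    client = spanner.Client(project="my-project")
    database = client.instance("sales-instance").database("sales-db")

    with database.batch() as batch:
        batch.insert(
            table="ProductSales",
            columns=("SaleId", "ProductName", "Quantity"),
            values=[(str(uuid.uuid4()), "widget-a", 3)],
        )
    # Random UUIDv4 keys spread writes across splits, unlike a monotonically
    # increasing key such as an epoch time or an order counter.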

Question 48

Data Analysts in your company have the Cloud IAM Owner role assigned to them in their projects to allow them to work with multiple GCP products in their projects. Your organization requires that all BigQuery data access logs be retained for 6 months. You need to ensure that only audit personnel in your company can access the data access logs for all projects. What should you do?

  • A: Enable data access logs in each Data Analyst's project. Restrict access to Stackdriver Logging via Cloud IAM roles.
  • B: Export the data access logs via a project-level export sink to a Cloud Storage bucket in the Data Analysts' projects. Restrict access to the Cloud Storage bucket.
  • C: Export the data access logs via a project-level export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project with the exported logs.
  • D: Export the data access logs via an aggregated export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project that contains the exported logs.

Question 49

Each analytics team in your organization is running BigQuery jobs in their own projects. You want to enable each team to monitor slot usage within their projects.
What should you do?

  • A: Create a Cloud Monitoring dashboard based on the BigQuery metric query/scanned_bytes
  • B: Create a Cloud Monitoring dashboard based on the BigQuery metric slots/allocated_for_project
  • C: Create a log export for each project, capture the BigQuery job execution logs, create a custom metric based on the totalSlotMs, and create a Cloud Monitoring dashboard based on the custom metric
  • D: Create an aggregated log export at the organization level, capture the BigQuery job execution logs, create a custom metric based on the totalSlotMs, and create a Cloud Monitoring dashboard based on the custom metric
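
The metric named in option B can also be read programmatically. A hedged sketch with the Cloud Monitoring client; the project ID is a placeholder, and the metrics actually available depend on the BigQuery reservation model in use:

    import time
    from google.cloud import monitoring_v3

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
    )
    results = client.list_time_series(
        request={
            "name": "projects/my-analytics-project",
            "filter": 'metric.type = "bigquery.googleapis.com/slots/allocated_for_project"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    for series in results:
        for point in series.points:
            print(point.interval.end_time, point.value)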

Question 50

You are operating a streaming Cloud Dataflow pipeline. Your engineers have a new version of the pipeline with a different windowing algorithm and triggering strategy. You want to update the running pipeline with the new version. You want to ensure that no data is lost during the update. What should you do?

  • A: Update the Cloud Dataflow pipeline inflight by passing the --update option with the --jobName set to the existing job name
  • B: Update the Cloud Dataflow pipeline inflight by passing the --update option with the --jobName set to a new unique job name
  • C: Stop the Cloud Dataflow pipeline with the Cancel option. Create a new Cloud Dataflow job with the updated code
  • D: Stop the Cloud Dataflow pipeline with the Drain option. Create a new Cloud Dataflow job with the updated code
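
For reference, the --update flow referenced in options A and B is driven by pipeline options in the Beam Python SDK. A sketch only; the job name, project, region, and bucket are placeholders:

    from apache_beam.options.pipeline_options import PipelineOptions

    # Relaunch the modified pipeline code against the job that is already running.
    options = PipelineOptions([
        "--runner=DataflowRunner",
        "--project=my-project",
        "--region=europe-west1",
        "--streaming",
        "--update",                         # request an in-place update
        "--job_name=market-data-pipeline",  # must match the existing job's name
        "--temp_location=gs://my-bucket/tmp",
    ])
    # beam.Pipeline(options=options) would then be built with the new windowing code.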