You are building a model to make clothing recommendations. You know a user's fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available. How should you use this data to train the model?
A. Continuously retrain the model on just the new data.
B. Continuously retrain the model on a combination of existing data and the new data.
C. Train on the existing data while using the new data as your test set.
D. Train on the new data while using the existing data as your test set.
Your infrastructure includes a set of YouTube channels. You have been tasked with creating a process for sending the YouTube channel data to Google Cloud for analysis. You want to design a solution that allows your worldwide marketing teams to perform ANSI SQL and other types of analysis on up-to-date YouTube channel log data. How should you set up the log data transfer into Google Cloud?
A. Use Storage Transfer Service to transfer the offsite backup files to a Cloud Storage Multi-Regional storage bucket as a final destination.
B. Use Storage Transfer Service to transfer the offsite backup files to a Cloud Storage Regional bucket as a final destination.
C. Use BigQuery Data Transfer Service to transfer the offsite backup files to a Cloud Storage Multi-Regional storage bucket as a final destination.
D. Use BigQuery Data Transfer Service to transfer the offsite backup files to a Cloud Storage Regional storage bucket as a final destination.
You store historic data in Cloud Storage. You need to perform analytics on the historic data. You want to use a solution to detect invalid data entries and perform data transformations that will not require programming or knowledge of SQL.
What should you do?
A. Use Cloud Dataflow with Beam to detect errors and perform transformations.
B. Use Cloud Dataprep with recipes to detect errors and perform transformations.
C. Use Cloud Dataproc with a Hadoop job to detect errors and perform transformations.
D. Use federated tables in BigQuery with queries to detect errors and perform transformations.
You work for an economic consulting firm that helps companies identify economic trends as they happen. As part of your analysis, you use Google BigQuery to correlate customer data with the average prices of the 100 most common goods sold, including bread, gasoline, milk, and others. The average prices of these goods are updated every 30 minutes. You want to make sure this data stays up to date so you can combine it with other data in BigQuery as cheaply as possible.
What should you do?
A. Load the data every 30 minutes into a new partitioned table in BigQuery.
B. Store and update the data in a regional Google Cloud Storage bucket and create a federated data source in BigQuery.
C. Store the data in Google Cloud Datastore. Use Google Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Cloud Datastore.
D. Store the data in a file in a regional Google Cloud Storage bucket. Use Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Google Cloud Storage.
You want to process payment transactions in a point-of-sale application that will run on Google Cloud Platform. Your user base could grow exponentially, but you do not want to manage infrastructure scaling.
Which Google database service should you use?
A. Cloud SQL
B. BigQuery
C. Cloud Bigtable
D. Cloud Datastore
You want to use a database of information about tissue samples to classify future tissue samples as either normal or mutated. You are evaluating an unsupervised anomaly detection method for classifying the tissue samples. Which two characteristics support this method? (Choose two.)
A. There are very few occurrences of mutations relative to normal samples.
B. There are roughly equal occurrences of both normal and mutated samples in the database.
C. You expect future mutations to have different features from the mutated samples in the database.
D. You expect future mutations to have similar features to the mutated samples in the database.
E. You already have labels for which samples are mutated and which are normal in the database.
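To make the unsupervised-detection idea concrete, here is a minimal, hypothetical sketch (not tied to any particular ML library) of why rarity of mutations matters: the detector is fitted only on the abundant normal samples and flags anything far from that profile, so it needs no labels and can catch mutations unlike any seen before. The feature values and the 3-sigma threshold are illustrative assumptions.

```python
from statistics import mean, stdev

def fit_normal_profile(normal_samples):
    """Unsupervised anomaly-detection sketch: model only the abundant
    normal samples, then flag anything far from that profile. This works
    precisely because mutations are rare and may not resemble past ones."""
    mu, sigma = mean(normal_samples), stdev(normal_samples)
    return lambda x: abs(x - mu) / sigma > 3  # True means anomalous

# Hypothetical single-feature measurements from known-normal tissue samples.
normals = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
is_anomaly = fit_normal_profile(normals)
print(is_anomaly(10.1), is_anomaly(14.0))  # False True
```

A supervised classifier, by contrast, would need labeled examples of both classes and would only recognize mutations resembling those in its training set.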
You work for a car manufacturer and have set up a data pipeline using Google Cloud Pub/Sub to capture anomalous sensor events. You are using a push subscription in Cloud Pub/Sub that calls a custom HTTPS endpoint that you have created to take action on these anomalous events as they occur. Your custom HTTPS endpoint keeps receiving an inordinate number of duplicate messages. What is the most likely cause of these duplicate messages?
A. The message body for the sensor event is too large.
B. Your custom endpoint has an out-of-date SSL certificate.
C. The Cloud Pub/Sub topic has too many messages published to it.
D. Your custom endpoint is not acknowledging messages within the acknowledgement deadline.
You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change the data type of DT to TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive. What should you do?
A. Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.
B. Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate the numeric values from the column DT for each row. Reference the column TS instead of the column DT from now on.
C. Create a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.
D. Add two columns to the table CLICK_STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, reference the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW is true.
E. Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is of the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.
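The core conversion all of these options perform is casting an epoch-seconds string to a timestamp. As an illustrative sketch only (plain Python standing in for BigQuery's `TIMESTAMP_SECONDS(SAFE_CAST(DT AS INT64))`; the row data and function name are hypothetical), the cast and its failure mode on a corrupt row look like this:

```python
from datetime import datetime, timezone

def cast_epoch_string(dt_string):
    """Convert an epoch-seconds STRING to a timezone-aware timestamp,
    returning None for unparseable rows (the SAFE_CAST behavior)."""
    try:
        return datetime.fromtimestamp(int(dt_string), tz=timezone.utc)
    except (TypeError, ValueError):
        return None

rows = [{"DT": "1700000000"}, {"DT": "not-a-number"}]
casted = [cast_epoch_string(r["DT"]) for r in rows]
print(casted[0].isoformat())  # 2023-11-14T22:13:20+00:00
print(casted[1])              # None
```

Note the trade-off the question probes: a view re-runs this cast on every query, while a one-time rewrite into a new table pays the conversion cost once.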
You want to use Google Stackdriver Logging to monitor Google BigQuery usage. You need an instant notification to be sent to your monitoring tool when new data is appended to a certain table using an insert job, but you do not want to receive notifications for other tables. What should you do?
A. Make a call to the Stackdriver API to list all logs, and apply an advanced filter.
B. In the Stackdriver logging admin interface, enable a log sink export to BigQuery.
C. In the Stackdriver logging admin interface, enable a log sink export to Google Cloud Pub/Sub, and subscribe to the topic from your monitoring tool.
D. Using the Stackdriver API, create a project sink with an advanced log filter to export to Pub/Sub, and subscribe to the topic from your monitoring tool.
You are working on a sensitive project involving private user data. You have set up a project on Google Cloud Platform to house your work internally. An external consultant is going to assist with coding a complex transformation in a Google Cloud Dataflow pipeline for your project. How should you maintain users' privacy?
A. Grant the consultant the Viewer role on the project.
B. Grant the consultant the Cloud Dataflow Developer role on the project.
C. Create a service account and allow the consultant to log on with it.
D. Create an anonymized sample of the data for the consultant to work with in a different project.
You are training a spam classifier. You notice that you are overfitting the training data. Which three actions can you take to resolve this problem? (Choose three.)
A. Get more training examples.
B. Reduce the number of training examples.
C. Use a smaller set of features.
D. Use a larger set of features.
E. Increase the regularization parameters.
F. Decrease the regularization parameters.
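The regularization options can be made concrete with a small sketch. This is an illustrative, from-scratch example (a one-feature linear model with an L2 penalty trained by gradient descent; the data, learning rate, and function name are all assumptions, not part of the question): a larger regularization parameter pulls weights toward zero, which is what combats overfitting.

```python
def fit_ridge(xs, ys, lam, lr=0.01, steps=2000):
    """Fit y ~ w*x by gradient descent on squared error plus an
    L2 penalty lam * w**2. Larger lam shrinks the learned weight."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n + 2 * lam * w
        w -= lr * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]            # true slope is 2
w_weak = fit_ridge(xs, ys, lam=0.0)    # no regularization: w near 2.0
w_strong = fit_ridge(xs, ys, lam=10.0) # strong penalty shrinks w toward 0
print(abs(w_strong) < abs(w_weak))     # True
```

Fewer features and more training examples act in the same direction: both reduce the model's capacity to memorize noise in the training set.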
You are building a data pipeline on Google Cloud. You need to prepare data using a casual method for a machine-learning process. You want to support a logistic regression model. You also need to monitor and adjust for null values, which must remain real-valued and cannot be removed. What should you do?
A. Use Cloud Dataprep to find null values in sample source data. Convert all nulls to 'none' using a Cloud Dataproc job.
B. Use Cloud Dataprep to find null values in sample source data. Convert all nulls to 0 using a Cloud Dataprep job.
C. Use Cloud Dataflow to find null values in sample source data. Convert all nulls to 'none' using a Cloud Dataprep job.
D. Use Cloud Dataflow to find null values in sample source data. Convert all nulls to 0 using a custom script.
You are selecting services to write and transform JSON messages from Cloud Pub/Sub to BigQuery for a data pipeline on Google Cloud. You want to minimize service costs. You also want to monitor and accommodate input data volume that will vary in size with minimal manual intervention. What should you do?
A. Use Cloud Dataproc to run your transformations. Monitor CPU utilization for the cluster. Resize the number of worker nodes in your cluster via the command line.
B. Use Cloud Dataproc to run your transformations. Use the diagnose command to generate an operational output archive. Locate the bottleneck and adjust cluster resources.
C. Use Cloud Dataflow to run your transformations. Monitor the job system lag with Stackdriver. Use the default autoscaling setting for worker instances.
D. Use Cloud Dataflow to run your transformations. Monitor the total execution time for a sampling of jobs. Configure the job to use non-default Compute Engine machine types when needed.
You've migrated a Hadoop job from an on-prem cluster to Dataproc and GCS. Your Spark job is a complicated analytical workload consisting of many shuffle operations, and the initial data are Parquet files (on average 200-400 MB each). You see some performance degradation after the migration to Dataproc, so you'd like to optimize it. Keep in mind that your organization is very cost-sensitive, so you'd like to continue using Dataproc on preemptible VMs (with only 2 non-preemptible workers) for this workload.
What should you do?
A. Increase the size of your Parquet files so that each is at least 1 GB.
B. Switch to TFRecord format (approx. 200 MB per file) instead of Parquet files.
C. Switch from HDDs to SSDs, copy the initial data from GCS to HDFS, run the Spark job, and copy the results back to GCS.
D. Switch from HDDs to SSDs, and override the preemptible VM configuration to increase the boot disk size.
Your company has hired a new data scientist who wants to perform complicated analyses across very large datasets stored in Google Cloud Storage and in a Cassandra cluster on Google Compute Engine. The scientist primarily wants to create labelled data sets for machine learning projects, along with some visualization tasks. She reports that her laptop is not powerful enough to perform her tasks and it is slowing her down. You want to help her perform her tasks.
What should you do?
A. Run a local version of Jupyter on the laptop.
B. Grant the user access to Google Cloud Shell.
C. Host a visualization tool on a VM on Google Compute Engine.
D. Deploy Google Cloud Datalab to a virtual machine (VM) on Google Compute Engine.
Your company has a hybrid cloud initiative. You have a complex data pipeline that moves data between cloud provider services and leverages services from each of the cloud providers. Which cloud-native service should you use to orchestrate the entire pipeline?
A. Cloud Dataflow
B. Cloud Composer
C. Cloud Dataprep
D. Cloud Dataproc
You are designing a data processing pipeline. The pipeline must be able to scale automatically as load increases. Messages must be processed at least once and must be ordered within windows of 1 hour. How should you design the solution?
A. Use Apache Kafka for message ingestion and use Cloud Dataproc for streaming analysis.
B. Use Apache Kafka for message ingestion and use Cloud Dataflow for streaming analysis.
C. Use Cloud Pub/Sub for message ingestion and Cloud Dataproc for streaming analysis.
D. Use Cloud Pub/Sub for message ingestion and Cloud Dataflow for streaming analysis.
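The "ordered within windows of 1 hour" requirement is worth unpacking: Pub/Sub does not guarantee delivery order, so ordering is imposed downstream when the stream is windowed. As a minimal, library-free sketch of that idea (the message tuples and function name are illustrative assumptions, not Dataflow API calls):

```python
from collections import defaultdict

WINDOW_SECONDS = 3600  # fixed 1-hour windows

def window_and_order(messages):
    """Group (timestamp, payload) messages into fixed 1-hour windows,
    then sort by timestamp within each window — imposing order on an
    unordered stream, as windowed streaming pipelines do."""
    windows = defaultdict(list)
    for ts, payload in messages:
        windows[ts // WINDOW_SECONDS].append((ts, payload))
    return {w: sorted(msgs) for w, msgs in sorted(windows.items())}

# Messages arrive out of order, as with Pub/Sub delivery.
msgs = [(3700, "b"), (100, "a"), (3650, "c"), (50, "d")]
result = window_and_order(msgs)
# Window 0 holds (50, 'd'), (100, 'a'); window 1 holds (3650, 'c'), (3700, 'b').
print(result)
```

In a real pipeline, a fixed-window transform plus event-time sorting achieves this; the autoscaling and at-least-once delivery come from the managed services themselves.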
You are designing storage for very large text files for a data pipeline on Google Cloud. You want to support ANSI SQL queries. You also want to support compression and parallel load from the input locations using Google recommended practices. What should you do?
A. Transform text files to compressed Avro using Cloud Dataflow. Use BigQuery for storage and query.
B. Transform text files to compressed Avro using Cloud Dataflow. Use Cloud Storage and BigQuery permanent linked tables for query.
C. Compress text files to gzip using the Grid Computing Tools. Use BigQuery for storage and query.
D. Compress text files to gzip using the Grid Computing Tools. Use Cloud Storage, and then import into Cloud Bigtable for query.
You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics. Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded. The database must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources. How should you adjust the database design?
A. Add capacity (memory and disk space) to the database server by a factor of 200.
B. Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.
C. Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-joins.
D. Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.
An external customer provides you with a daily dump of data from their database. The data flows into Google Cloud Storage (GCS) as comma-separated values (CSV) files. You want to analyze this data in Google BigQuery, but the data could have rows that are formatted incorrectly or corrupted. How should you build this pipeline?
A. Use federated data sources, and check data in the SQL query.
B. Enable BigQuery monitoring in Google Stackdriver and create an alert.
C. Import the data into BigQuery using the gcloud CLI and set max_bad_records to 0.
D. Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.
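The dead-letter pattern in option D is the key idea: each row is validated in flight, good rows go to the main table, and malformed rows are diverted to a separate table instead of failing the load. A minimal, hypothetical sketch of that routing step in plain Python (the schema width and function name are assumptions; a real pipeline would express this as a Beam transform with side outputs):

```python
import csv
import io

EXPECTED_COLUMNS = 3  # hypothetical schema width for the daily dump

def split_good_and_bad(csv_text):
    """Route well-formed rows to the main output and malformed rows to
    a dead-letter list, mirroring a Dataflow pipeline that writes bad
    rows to a dead-letter BigQuery table for later analysis."""
    good, dead_letter = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) == EXPECTED_COLUMNS and all(field.strip() for field in row):
            good.append(row)
        else:
            dead_letter.append(row)
    return good, dead_letter

sample = "1,alice,10\n2,bob\n3,carol,30\n4,,40\n"
good, bad = split_good_and_bad(sample)
print(len(good), len(bad))  # 2 2
```

The payoff is that a handful of corrupt rows never blocks the daily load, and the dead-letter table preserves them for debugging with the customer.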
Your weather app queries a database every 15 minutes to get the current temperature. The frontend is powered by Google App Engine and serves millions of users. How should you design the frontend to respond to a database failure?
A. Issue a command to restart the database servers.
B. Retry the query with exponential backoff, up to a cap of 15 minutes.
C. Retry the query every second until it comes back online to minimize staleness of data.
D. Reduce the query frequency to once every hour until the database comes back online.
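Exponential backoff with a cap (option B) is easy to sketch. The following is an illustrative helper, not production retry code; the function name, base delay, and jitter range are assumptions. The wait roughly doubles on each failed attempt but never exceeds 15 minutes (900 s), and a small random jitter keeps millions of clients from retrying in lockstep:

```python
import random

def backoff_schedule(max_retries, base=1.0, cap=900.0, seed=0):
    """Compute capped exponential-backoff delays (in seconds) with
    jitter: delay = min(cap, base * 2**attempt) + uniform jitter."""
    rng = random.Random(seed)  # fixed seed for a reproducible example
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        delays.append(delay + rng.uniform(0, 1))  # jitter avoids thundering herds
    return delays

schedule = backoff_schedule(12)
print(all(d <= 901.0 for d in schedule))  # True: capped near 15 minutes
```

Retrying every second (option C) would hammer a struggling database; the doubling-with-cap schedule backs off quickly while still recovering promptly once the database returns.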
You are creating a model to predict housing prices. Due to budget constraints, you must run it on a single resource-constrained virtual machine. Which learning algorithm should you use?
A. Linear regression
B. Logistic classification
C. Recurrent neural network
D. Feedforward neural network
You are building a new real-time data warehouse for your company and will use Google BigQuery streaming inserts. There is no guarantee that data will only be sent in once, but you do have a unique ID for each row of data and an event timestamp. You want to ensure that duplicates are not included while interactively querying data. Which query type should you use?
A. Include ORDER BY DESC on the timestamp column and LIMIT to 1.
B. Use GROUP BY on the unique ID column and timestamp column and SUM on the values.
C. Use the LAG window function with PARTITION BY unique ID along with WHERE LAG IS NOT NULL.
D. Use the ROW_NUMBER window function with PARTITION BY unique ID along with WHERE row equals 1.
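The logic behind option D — `ROW_NUMBER() OVER (PARTITION BY unique_id ORDER BY event_ts DESC)` filtered to `row_num = 1` — keeps exactly one row per unique ID, the one with the latest timestamp. As an illustrative plain-Python equivalent (row fields and function name are assumptions for the sketch):

```python
def dedupe_latest(rows):
    """Keep one row per unique ID — the one with the greatest event
    timestamp — mirroring ROW_NUMBER() partitioned by ID and ordered
    by timestamp DESC, filtered to row number 1."""
    latest = {}
    for row in rows:
        key = row["id"]
        if key not in latest or row["event_ts"] > latest[key]["event_ts"]:
            latest[key] = row
    return sorted(latest.values(), key=lambda r: r["id"])

rows = [
    {"id": 1, "event_ts": 100, "value": "first"},
    {"id": 1, "event_ts": 200, "value": "retry"},  # duplicate send, later ts
    {"id": 2, "event_ts": 150, "value": "only"},
]
print(dedupe_latest(rows))  # one row per ID, the latest for ID 1
```

GROUP BY with SUM (option B) would instead combine the duplicates' values, silently double-counting the re-sent rows.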
Your company is in a highly regulated industry. One of your requirements is to ensure individual users have access only to the minimum amount of information required to do their jobs. You want to enforce this requirement with Google BigQuery. Which three approaches can you take? (Choose three.)
A. Disable writes to certain tables.
B. Restrict access to tables by role.
C. Ensure that the data is encrypted at all times.
D. Restrict BigQuery API access to approved users.
E. Segregate data across multiple tables or databases.
F. Use Google Stackdriver Audit Logging to determine policy violations.
Your company handles data processing for a number of different clients. Each client prefers to use their own suite of analytics tools, with some allowing direct query access via Google BigQuery. You need to secure the data so that clients cannot see each other's data. You want to ensure appropriate access to the data.
Which three steps should you take? (Choose three.)
A. Load data into different partitions.
B. Load data into a different dataset for each client.
C. Put each client's BigQuery dataset into a different table.
D. Restrict a client's dataset to approved users.
E. Only allow a service account to access the datasets.
F. Use the appropriate identity and access management (IAM) roles for each client's users.