AWS Certified Big Data - Specialty

By Amazon
Aug, 2025

Question 1

A data engineer in a manufacturing company is designing a data processing platform that receives a large volume of unstructured data. The data engineer must populate a well-structured star schema in Amazon Redshift.
What is the most efficient architecture strategy for this purpose?

  • A: Transform the unstructured data using Amazon EMR and generate CSV data. COPY the CSV data into the analysis schema within Redshift.
  • B: Load the unstructured data into Redshift, and use string parsing functions to extract structured data for inserting into the analysis schema.
  • C: When the data is saved to Amazon S3, use S3 Event Notifications and AWS Lambda to transform the file contents. Insert the data into the analysis schema on Redshift.
  • D: Normalize the data using an AWS Marketplace ETL tool, persist the results to Amazon S3, and use AWS Lambda to INSERT the data into Redshift.

Question 2

A company has several teams of analysts. Each team of analysts has its own cluster. The teams need to run SQL queries using Hive, Spark-SQL, and Presto with Amazon EMR. The company needs to enable a centralized metadata layer to expose the Amazon S3 objects as tables to the analysts.
Which approach meets the requirement for a centralized metadata layer?

  • A: EMRFS consistent view with a common Amazon DynamoDB table
  • B: Bootstrap action to change the Hive Metastore to an Amazon RDS database
  • C: s3distcp with the outputManifest option to generate RDS DDL
  • D: Naming scheme support with automatic partition discovery from Amazon S3

Question 3

An administrator needs to manage a large catalog of items from various external sellers. The administrator needs to determine if the items should be identified as minimally dangerous, dangerous, or highly dangerous based on their textual descriptions. The administrator already has some items with the danger attribute, but receives hundreds of new item descriptions every day without such classification.
The administrator has a system that captures dangerous goods reports from the customer support team or from user feedback.
What is a cost-effective architecture to solve this issue?

  • A: Build a set of regular expression rules that are based on the existing examples, and run them on the DynamoDB Streams as every new item description is added to the system.
  • B: Build a Kinesis Streams process that captures and marks the relevant items in the dangerous goods reports using a Lambda function once more than two reports have been filed.
  • C: Build a machine learning model to properly classify dangerous goods and run it on the DynamoDB Streams as every new item description is added to the system.
  • D: Build a machine learning model with binary classification for dangerous goods and run it on the DynamoDB Streams as every new item description is added to the system.

Question 4

A company receives data sets coming from external providers on Amazon S3. Data sets from different providers are dependent on one another. Data sets will arrive at different times and in no particular order.
A data architect needs to design a solution that enables the company to do the following:
  • Rapidly perform cross data set analysis as soon as the data becomes available
  • Manage dependencies between data sets that arrive at different times
Which architecture strategy offers a scalable and cost-effective solution that meets these requirements?

  • A: Maintain data dependency information in Amazon RDS for MySQL. Use an AWS Data Pipeline job to load an Amazon EMR Hive table based on task dependencies and event notification triggers in Amazon S3.
  • B: Maintain data dependency information in an Amazon DynamoDB table. Use Amazon SNS and event notifications to publish data to a fleet of Amazon EC2 workers. Once the task dependencies have been resolved, process the data with Amazon EMR.
  • C: Maintain data dependency information in an Amazon ElastiCache Redis cluster. Use Amazon S3 event notifications to trigger an AWS Lambda function that maps the S3 object to Redis. Once the task dependencies have been resolved, process the data with Amazon EMR.
  • D: Maintain data dependency information in an Amazon DynamoDB table. Use Amazon S3 event notifications to trigger an AWS Lambda function that maps the S3 object to the task associated with it in DynamoDB. Once all task dependencies have been resolved, process the data with Amazon EMR.
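
Several of the options above map each arriving S3 object to the task that depends on it. The bookkeeping can be sketched without any AWS calls. This is a minimal illustration, not a reference design: the in-memory dictionaries stand in for whichever dependency store is chosen, and the names `Task`, `DependencyTracker`, and `on_object_created` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """A downstream job and the S3 keys it needs before it can run (hypothetical)."""
    name: str
    required_keys: frozenset
    arrived_keys: set = field(default_factory=set)

    def is_ready(self) -> bool:
        # Ready once every required key has been seen at least once.
        return self.required_keys <= self.arrived_keys


class DependencyTracker:
    """In-memory stand-in for a dependency table (assumption, not an AWS API)."""

    def __init__(self, tasks):
        self.tasks = {t.name: t for t in tasks}
        # Reverse index: S3 key -> names of the tasks waiting on it.
        self.by_key = {}
        for t in tasks:
            for key in t.required_keys:
                self.by_key.setdefault(key, []).append(t.name)

    def on_object_created(self, key):
        """Record one arrival event and return the tasks that just became ready."""
        newly_ready = []
        for name in self.by_key.get(key, []):
            task = self.tasks[name]
            was_ready = task.is_ready()
            task.arrived_keys.add(key)
            if task.is_ready() and not was_ready:
                newly_ready.append(name)
        return newly_ready
```

Because a task is reported only on the event that completes its inputs, arrival order does not matter and a duplicate notification does not trigger the same job twice.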

Question 5

A media advertising company handles a large number of real-time messages sourced from over 200 websites. Processing latency must be kept low. Based on calculations, a 60-shard Amazon Kinesis stream is more than sufficient to handle the maximum data throughput, even with traffic spikes. The company also uses an Amazon Kinesis Client Library (KCL) application running on Amazon Elastic Compute Cloud (EC2) managed by an Auto Scaling group. Amazon CloudWatch indicates an average of 25% CPU and a modest level of network traffic across all running servers.
The company reports a 150% to 200% increase in latency of processing messages from Amazon Kinesis during peak times. There are NO reports of delay from the sites publishing to Amazon Kinesis.
What is the appropriate solution to address the latency?

  • A: Increase the number of shards in the Amazon Kinesis stream to 80 for greater concurrency.
  • B: Increase the size of the Amazon EC2 instances to increase network throughput.
  • C: Increase the minimum number of instances in the Auto Scaling group.
  • D: Increase Amazon DynamoDB throughput on the checkpoint table.

Question 6

A Redshift data warehouse has different user teams that need to query the same table with very different query types. These user teams are experiencing poor performance.
Which action improves performance for the user teams in this situation?

  • A: Create custom table views.
  • B: Add interleaved sort keys per team.
  • C: Maintain team-specific copies of the table.
  • D: Add support for workload management queue hopping.

Question 7

A company operates an international business served from a single AWS region. The company wants to expand into a new country. The regulator for that country requires the Data Architect to maintain a log of financial transactions in the country within 24 hours of the product transaction. The production application is latency insensitive. The new country contains another AWS region.
What is the most cost-effective way to meet this requirement?

  • A: Use CloudFormation to replicate the production application to the new region.
  • B: Use Amazon CloudFront to serve application content locally in the country; Amazon CloudFront logs will satisfy the requirement.
  • C: Continue to serve customers from the existing region while using Amazon Kinesis to stream transaction data to the regulator.
  • D: Use Amazon S3 cross-region replication to copy and persist production transaction logs to a bucket in the new country's region.

Question 8

An administrator needs to design the event log storage architecture for events from mobile devices. The event data will be processed by an Amazon EMR cluster daily for aggregated reporting and analytics before being archived.
How should the administrator recommend storing the log data?

  • A: Create an Amazon S3 bucket and write log data into folders by device. Execute the EMR job on the device folders.
  • B: Create an Amazon DynamoDB table partitioned on the device and sorted on date, write log data to table. Execute the EMR job on the Amazon DynamoDB table.
  • C: Create an Amazon S3 bucket and write data into folders by day. Execute the EMR job on the daily folder.
  • D: Create an Amazon DynamoDB table partitioned on EventID, write log data to table. Execute the EMR job on the table.
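
The S3 options above differ only in how object keys are laid out. As a rough sketch of a day-based layout, the helper below builds keys under a `logs/YYYY/MM/DD/` prefix; the prefix, the file naming, and both function names are assumptions made for illustration, and the timestamp is expected to be timezone-aware.

```python
from datetime import datetime, timezone


def daily_prefix(day: datetime, prefix: str = "logs") -> str:
    """Return the key prefix that holds every log object for one UTC day."""
    utc = day.astimezone(timezone.utc)
    return f"{prefix}/{utc:%Y/%m/%d}/"


def daily_log_key(device_id: str, event_time: datetime, prefix: str = "logs") -> str:
    """Build an object key for one device's log file under the daily prefix.

    event_time must be timezone-aware so the UTC day is unambiguous.
    """
    utc = event_time.astimezone(timezone.utc)
    return f"{daily_prefix(utc, prefix)}{device_id}-{utc:%H%M%S}.json"
```

With this layout a daily batch job can be pointed at a single prefix, and archiving a finished day is a matter of handling one prefix rather than scanning every device folder.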

Question 9

A data engineer wants to use Amazon Elastic MapReduce (Amazon EMR) for an application. The data engineer needs to make sure it complies with regulatory requirements. The auditor must be able to confirm at any point which servers are running and which network access controls are deployed.
Which action should the data engineer take to meet this requirement?

  • A: Provide the auditor IAM accounts with the SecurityAudit policy attached to their group.
  • B: Provide the auditor with SSH keys for access to the Amazon EMR cluster.
  • C: Provide the auditor with CloudFormation templates.
  • D: Provide the auditor with access to AWS Direct Connect to use their existing tools.

Question 10

A social media customer has data from different data sources including RDS running MySQL, Redshift, and Hive on EMR. To support better analysis, the customer needs to be able to analyze data from different data sources and to combine the results.
What is the most cost-effective solution to meet these requirements?

  • A: Load all data from the different databases and warehouses to S3. Use the Redshift COPY command to copy the data to Redshift for analysis.
  • B: Install Presto on the EMR cluster where Hive sits. Configure the MySQL and PostgreSQL connectors to select from different data sources in a single query.
  • C: Spin up an Elasticsearch cluster. Load data from all three data sources and use Kibana to analyze.
  • D: Write a program running on a separate EC2 instance to run queries against the three different systems. Aggregate the results after getting the responses from all three systems.

Question 11

An Amazon EMR cluster using EMRFS has access to petabytes of data on Amazon S3, originating from multiple unique data sources. The customer needs to query common fields across some of the data sets to be able to perform interactive joins and then display results quickly.
Which technology is most appropriate to enable this capability?

  • A: Presto
  • B: MicroStrategy
  • C: Pig
  • D: RStudio

Question 12

A new algorithm has been written in Python to identify SPAM e-mails. The algorithm analyzes the free text contained within a sample set of 1 million e-mails stored on Amazon S3. The algorithm must be scaled across a production dataset of 5 PB, which also resides in Amazon S3 storage.
Which AWS service strategy is best for this use case?

  • A: Copy the data into Amazon ElastiCache to perform text analysis on the in-memory data and export the results of the model into Amazon Machine Learning.
  • B: Use Amazon EMR to parallelize the text analysis tasks across the cluster using a streaming program step.
  • C: Use Amazon Elasticsearch Service to store the text and then use the Python Elasticsearch Client to run analysis against the text index.
  • D: Initiate a Python job from AWS Data Pipeline to run directly against the Amazon S3 text files.
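
Option B mentions a streaming program step. In Hadoop streaming, a mapper is a program that reads input lines from standard input and writes tab-separated key/value records to standard output. The sketch below shows that shape only; `looks_like_spam` is a hypothetical placeholder for the real algorithm, and the one-message-per-line input format is an assumption.

```python
def looks_like_spam(text: str) -> bool:
    """Hypothetical placeholder for the real spam-detection algorithm."""
    keywords = ("free money", "click here", "winner")
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def map_lines(lines):
    """Yield one 'label<TAB>1' record per non-empty input line.

    Assumes each input line holds the text of one e-mail.
    """
    for line in lines:
        text = line.rstrip("\n")
        if not text:
            continue  # skip blank lines rather than classifying them
        label = "spam" if looks_like_spam(text) else "ham"
        yield f"{label}\t1"


# In a streaming step the mapper script would feed standard input through it:
#     import sys
#     for record in map_lines(sys.stdin):
#         print(record)
```

Each mapper instance sees only its own slice of the input, which is what lets the same Python logic be spread across many nodes.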

Question 13

A game company needs to properly scale its game application, which is backed by DynamoDB. Amazon Redshift has the past two years of historical data. Game traffic varies throughout the year based on various factors such as season, movie release, and holiday season. An administrator needs to calculate how much read and write throughput should be provisioned for the DynamoDB table for each week in advance.
How should the administrator accomplish this task?

  • A: Feed the data into Amazon Machine Learning and build a regression model.
  • B: Feed the data into Spark MLlib and build a random forest model.
  • C: Feed the data into Apache Mahout and build a multi-classification model.
  • D: Feed the data into Amazon Machine Learning and build a binary classification model.

Question 14

A data engineer is about to perform a major upgrade to the DDL contained within an Amazon Redshift cluster to support a new data warehouse application. The upgrade scripts will include user permission updates, view and table structure changes as well as additional loading and data manipulation tasks.
The data engineer must be able to restore the database to its existing state in the event of issues.
Which action should be taken prior to performing this upgrade task?

  • A: Run an UNLOAD command for all data in the warehouse and save it to S3.
  • B: Create a manual snapshot of the Amazon Redshift cluster.
  • C: Make a copy of the automated snapshot on the Amazon Redshift cluster.
  • D: Call the waitForSnapshotAvailable command from either the AWS CLI or an AWS SDK.

Question 15

A large oil and gas company needs to provide near real-time alerts when peak thresholds are exceeded in its pipeline system. The company has developed a system to capture pipeline metrics such as flow rate, pressure, and temperature using millions of sensors. The sensors deliver to AWS IoT.
What is a cost-effective way to provide near real-time alerts on the pipeline metrics?

  • A: Create an AWS IoT rule to generate an Amazon SNS notification.
  • B: Store the data points in an Amazon DynamoDB table and poll it for peak metrics data from an Amazon EC2 application.
  • C: Create an Amazon Machine Learning model and invoke it with AWS Lambda.
  • D: Use Amazon Kinesis Streams and a KCL-based application deployed on AWS Elastic Beanstalk.
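
Whichever service evaluates the readings, the alerting condition in this scenario is a per-message threshold comparison. The sketch below shows that check in isolation; the threshold values, the message field names, and the function name are all hypothetical.

```python
# Hypothetical peak thresholds; real limits would come from the pipeline operators.
PEAK_THRESHOLDS = {"flow_rate": 500.0, "pressure": 120.0, "temperature": 90.0}


def exceeded_metrics(reading: dict, thresholds: dict = PEAK_THRESHOLDS) -> list:
    """Return the sorted names of metrics in one reading that exceed their peak.

    Metrics missing from the reading are ignored rather than treated as alerts.
    """
    return sorted(
        name
        for name, limit in thresholds.items()
        if name in reading and reading[name] > limit
    )
```

An empty result means no alert for that message; a non-empty result names the metrics that an alert would need to mention.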

Question 16

A company is using Amazon Machine Learning as part of a medical software application. The application will predict the most likely blood type for a patient based on a variety of other clinical tests that are available when blood type knowledge is unavailable.
What is the appropriate model choice and target attribute combination for this problem?

  • A: Multi-class classification model with a categorical target attribute.
  • B: Regression model with a numeric target attribute.
  • C: Binary classification model with a categorical target attribute.
  • D: K-Nearest Neighbors model with a multi-class target attribute.

Question 17

A data engineer is running a data warehouse (DWH) on a 25-node Redshift cluster for a SaaS service. The data engineer needs to build a dashboard that will be used by customers. Five big customers represent 80% of usage, and there is a long tail of dozens of smaller customers. The data engineer has selected the dashboarding tool.
How should the data engineer make sure that the larger customer workloads do NOT interfere with the smaller customer workloads?

  • A: Apply query filters based on customer-id that can NOT be changed by the user and apply distribution keys on customer-id.
  • B: Place the largest customers into a single user group with a dedicated query queue and place the rest of the customers into a different query queue.
  • C: Push aggregations into an RDS for Aurora instance. Connect the dashboard application to Aurora rather than Redshift for faster queries.
  • D: Route the largest customers to a dedicated Redshift cluster. Raise the concurrency of the multi-tenant Redshift cluster to accommodate the remaining customers.
