Amazon AWS Certified Data Engineer - Associate (DEA-C01)

Vendor: Amazon
Certification: Associate Certifications
Content: 351 questions
Status: Verified
Updated: 1 day ago

Exam Overview

The AWS Certified Data Engineer - Associate (DEA-C01) certification validates the ability to design, build, manage, and monitor data pipelines on Amazon Web Services (AWS). It covers the core AWS data services used to ingest, transform, store, and process data for analytical workloads. Beyond confirming technical proficiency in current data engineering practice, the credential can strengthen a candidate's standing for data engineering roles in organizations that build on cloud-native data services.

Questions: 65
Passing Score: 720/1000
Duration: 130 Minutes
Difficulty: Intermediate
Level: Associate

Skills Measured

Data Ingestion and Transformation: Designing and implementing solutions for ingesting data from various sources into AWS, including streaming and batch data, and performing necessary transformations for analysis.
Data Storage and Management: Selecting and managing appropriate AWS data storage services (e.g., S3, RDS, DynamoDB, Redshift) based on data characteristics, access patterns, and compliance requirements.
Data Processing and Analysis: Utilizing AWS services like Glue, EMR, Kinesis, and Athena to process, query, and analyze large datasets efficiently and at scale.
Data Monitoring and Orchestration: Implementing solutions for monitoring data pipelines, ensuring data quality, and orchestrating complex workflows using services like AWS Step Functions and Apache Airflow on AWS.
Data Security and Governance: Applying best practices for data security, encryption, access control, and compliance within AWS data environments, including IAM, KMS, and Lake Formation.

Career Path

Target Roles

Data Engineer, ETL Developer, Data Architect

Common Questions

Is the material up to date?

Yes. We update our question bank weekly to match the latest Amazon standards. You get free updates for 90 days.

What format do I get?

You get instant access to both the **PDF** (for reading) and our **Premium Test Engine** (for exam simulation).

Is there a guarantee?

Absolutely. If you fail the DEA-C01 exam using our materials, we offer a full money-back guarantee.

When do I get the download?

Instantly. The download link is available in your dashboard immediately after payment is confirmed.

Free Study Guide Samples

Previewing updated DEA-C01 bank (71 Questions).

QUESTION 1

A financial services company stores financial data in Amazon Redshift. A data engineer wants to run real-time queries on the financial data to support a web-based trading application. The data engineer wants to run the queries from within the trading application.

Which solution will meet these requirements with the LEAST operational overhead?

A
Establish WebSocket connections to Amazon Redshift.
B
Use the Amazon Redshift Data API.
C
Set up Java Database Connectivity (JDBC) connections to Amazon Redshift.
D
Store frequently accessed data in Amazon S3. Use Amazon S3 Select to run the queries.

Correct Option: B

✅ **Use the Amazon Redshift Data API.**

Reasoning: The Amazon Redshift Data API exposes an HTTPS endpoint that applications can call to query Redshift without managing JDBC/ODBC drivers or persistent connections. It authenticates through IAM or AWS Secrets Manager and returns results through the same API, so it adds the least operational overhead when a web application needs to run queries directly.

❌ Why the other choices are incorrect:

  • Option A is incorrect: Redshift does not natively support direct WebSocket connections for queries. Implementing this would require building an intermediary service to translate requests, significantly increasing development and operational overhead.
  • Option C is incorrect: Managing JDBC connections, drivers, and connection pools within a web application introduces more complexity and operational overhead compared to the fully managed Data API, especially for concurrent, real-time requests.
  • Option D is incorrect: This option requires moving data from Redshift to S3, introducing ETL processes, potential data staleness, and fundamentally changing the query target. This increases, rather than decreases, operational overhead for querying the data initially in Redshift.
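
The Data API flow described above (submit a statement, wait for it to finish, fetch the rows) can be sketched as follows. This is a minimal illustration, not the vendor's or AWS's reference code: the function name `run_query` and its parameters are hypothetical, and `client` is assumed to be an object with the same methods as boto3's `redshift-data` client.

```python
import time

# Statement states after which polling can stop.
TERMINAL_STATES = {"FINISHED", "FAILED", "ABORTED"}


def run_query(client, sql, cluster_id, database, secret_arn, poll_seconds=0.5):
    """Submit SQL through the Redshift Data API and return the result records.

    `client` is assumed to behave like boto3.client("redshift-data").
    Only the first page of results is returned; a fuller version would
    follow NextToken.
    """
    submitted = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        SecretArn=secret_arn,
        Sql=sql,
    )
    statement_id = submitted["Id"]

    # The Data API is asynchronous, so poll until the statement settles.
    while True:
        status = client.describe_statement(Id=statement_id)
        if status["Status"] in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)

    if status["Status"] != "FINISHED":
        raise RuntimeError(
            f"statement {statement_id} ended as {status['Status']}: "
            f"{status.get('Error')}"
        )
    return client.get_statement_result(Id=statement_id)["Records"]
```

With boto3 installed and credentials configured, `boto3.client("redshift-data")` could be passed as `client`; no driver or connection pool is involved.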


QUESTION 2

A company uses Amazon Athena for one-time queries against data that is in Amazon S3. The company has several use cases. The company must implement permission controls to separate query processes and access to query history among users, teams, and applications that are in the same AWS account.

Which solution will meet these requirements?

A
Create an S3 bucket for each use case. Create an S3 bucket policy that grants permissions to appropriate individual IAM users. Apply the S3 bucket policy to the S3 bucket.
B
Create an Athena workgroup for each use case. Apply tags to the workgroup. Create an IAM policy that uses the tags to apply appropriate permissions to the workgroup.
C
Create an IAM role for each use case. Assign appropriate permissions to the role for each use case. Associate the role with Athena.
D
Create an AWS Glue Data Catalog resource policy that grants permissions to appropriate individual IAM users for each use case. Apply the resource policy to the specific tables that Athena uses.

Correct Option: B

✅ **Create an Athena workgroup for each use case. Apply tags to the workgroup. Create an IAM policy that uses the tags to apply appropriate permissions to the workgroup.**

Reasoning: Athena workgroups separate query execution, control output locations, and manage query history. IAM policies with tags enforce granular permissions on workgroups, effectively isolating query processes and history access for different users, teams, or applications.

❌ Why the other choices are incorrect:

  • Option A is incorrect: S3 bucket policies control access to the raw data, not Athena query execution settings, process separation, or query history within Athena itself.
  • Option C is incorrect: IAM roles define permissions but don't provide the intrinsic separation of query processes and history that Athena workgroups offer within the Athena service.
  • Option D is incorrect: AWS Glue Data Catalog resource policies manage access to table metadata. They do not control Athena's query execution environment, resource limits, or query history separation.
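
A tag-based workgroup policy of the kind the answer describes might look like the sketch below. The helper name `workgroup_policy` and the tag key and value are hypothetical; the action list is a plausible minimum for running queries and reading history, and a real policy would also need S3 and Glue Data Catalog permissions.

```python
import json


def workgroup_policy(tag_key, tag_value):
    """Build an IAM policy document that allows query operations only on
    Athena workgroups carrying the given tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "athena:StartQueryExecution",
                    "athena:StopQueryExecution",
                    "athena:GetQueryExecution",
                    "athena:GetQueryResults",
                    "athena:ListQueryExecutions",
                    "athena:GetWorkGroup",
                ],
                "Resource": "arn:aws:athena:*:*:workgroup/*",
                # Access is granted only when the workgroup's tag matches.
                "Condition": {
                    "StringEquals": {f"aws:ResourceTag/{tag_key}": tag_value}
                },
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical tag: each use case gets its own value.
    print(json.dumps(workgroup_policy("use-case", "marketing"), indent=2))
```

Attaching one such policy per team means a user can see query history only for workgroups tagged for that team.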


QUESTION 3

A data engineer needs to schedule a workflow that runs a set of AWS Glue jobs every day. The data engineer does not require the Glue jobs to run or finish at a specific time.

Which solution will run the Glue jobs in the MOST cost-effective way?

A
Choose the FLEX execution class in the Glue job properties.
B
Use the Spot Instance type in Glue job properties.
C
Choose the STANDARD execution class in the Glue job properties.
D
Choose the latest version in the GlueVersion field in the Glue job properties.

Correct Option: A

✅ **Choose the FLEX execution class in the Glue job properties.**

Reasoning: The FLEX execution class is specifically designed for non-time-sensitive jobs, leveraging spare capacity to significantly lower costs per DPU-hour compared to the Standard class. Since the scenario explicitly states no specific run or finish time is required, Flex provides the most cost-effective solution.

❌ Why the other choices are incorrect:

  • Option B is incorrect: AWS Glue job properties do not directly offer "Spot Instance type" as a configurable option for workers. While Glue may use Spot Instances internally for Flex, it's not a user-selectable setting.
  • Option C is incorrect: The STANDARD execution class offers predictable performance but is more expensive per DPU-hour than Flex. Since time sensitivity is not a requirement, this is not the most cost-effective choice.
  • Option D is incorrect: Choosing the latest Glue version can bring performance enhancements, but it doesn't directly provide the significant, dedicated cost reduction mechanism for non-time-sensitive workloads that the FLEX execution class offers.
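
As a rough sketch, the execution class is one field in the job definition. The helper below only builds the keyword arguments that boto3's Glue `create_job` call accepts; the function name, worker settings, and Glue version are illustrative assumptions rather than values from the question.

```python
def flex_job_definition(name, role_arn, script_s3_path):
    """Return keyword arguments for glue.create_job with the Flex class.

    Worker type, worker count, and Glue version are illustrative choices.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": 2,
        # The only setting the answer turns on: FLEX instead of STANDARD.
        "ExecutionClass": "FLEX",
    }
```

With boto3 available, the result could be passed as `boto3.client("glue").create_job(**flex_job_definition(...))`.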


QUESTION 4

A data engineer needs to create an AWS Lambda function that converts the format of data from .csv to Apache Parquet. The Lambda function must run only if a user uploads a .csv file to an Amazon S3 bucket.

Which solution will meet these requirements with the LEAST operational overhead?

A
Create an S3 event notification that has an event type of s3:ObjectCreated:*. Use a filter rule to generate notifications only when the suffix includes .csv. Set the Amazon Resource Name (ARN) of the Lambda function as the destination for the event notification.
B
Create an S3 event notification that has an event type of s3:ObjectTagging:* for objects that have a tag set to .csv. Set the Amazon Resource Name (ARN) of the Lambda function as the destination for the event notification.
C
Create an S3 event notification that has an event type of s3:*. Use a filter rule to generate notifications only when the suffix includes .csv. Set the Amazon Resource Name (ARN) of the Lambda function as the destination for the event notification.
D
Create an S3 event notification that has an event type of s3:ObjectCreated:*. Use a filter rule to generate notifications only when the suffix includes .csv. Set an Amazon Simple Notification Service (Amazon SNS) topic as the destination for the event notification. Subscribe the Lambda function to the SNS topic.

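Correct Option: A

✅ **Create an S3 event notification that has an event type of s3:ObjectCreated:*. Use a filter rule to generate notifications only when the suffix includes .csv. Set the Amazon Resource Name (ARN) of the Lambda function as the destination for the event notification.**

Reasoning: This solution correctly uses s3:ObjectCreated:* to trigger only on new object uploads, and a suffix filter for .csv ensures only relevant files invoke Lambda. Directly setting the Lambda ARN as the destination is the most direct and lowest-overhead integration for S3 event notifications.

❌ Why the other choices are incorrect: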

  • Option B is incorrect: s3:ObjectTagging:* triggers on tag changes, not object creation. Relying on tags for file type identification adds unnecessary complexity and operational overhead.
  • Option C is incorrect: s3:* is too broad an event type. Even with the .csv suffix filter, the function would be invoked for events other than uploads on matching keys (deletions, for example), which conflicts with the requirement that it run only when a .csv file is uploaded.
  • Option D is incorrect: Introducing an Amazon SNS topic adds an additional component to manage (SNS topic, subscriptions, permissions). This increases operational overhead compared to directly invoking the Lambda function from the S3 event notification.
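
The notification described in option A can be expressed as a small configuration document. The sketch below builds the structure that boto3's `put_bucket_notification_configuration` accepts; the helper name and the ARN in the usage line are placeholders.

```python
def csv_upload_notification(lambda_arn):
    """Build an S3 NotificationConfiguration that invokes a Lambda function
    only for newly created objects whose key ends in .csv."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [{"Name": "suffix", "Value": ".csv"}]
                    }
                },
            }
        ]
    }
```

It would be applied with `s3.put_bucket_notification_configuration(Bucket=..., NotificationConfiguration=csv_upload_notification(arn))`; the function's resource policy must also allow S3 to invoke it.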


QUESTION 5

A data engineer needs Amazon Athena queries to finish faster. The data engineer notices that all the files the Athena queries use are currently stored in uncompressed .csv format. The data engineer also notices that users perform most queries by selecting a specific column.

Which solution will MOST speed up the Athena query performance?

A
Change the data format from .csv to JSON format. Apply Snappy compression.
B
Compress the .csv files by using Snappy compression.
C
Change the data format from .csv to Apache Parquet. Apply Snappy compression.
D
Compress the .csv files by using gzip compression.

Correct Option: C

✅ **Change the data format from .csv to Apache Parquet. Apply Snappy compression.**

Reasoning: Parquet is a columnar format, allowing Athena to read only selected columns, greatly reducing I/O and scan time. Snappy compression further decreases data scanned and improves performance. This combination directly addresses the "selecting a specific column" pattern and provides the most significant speedup.

❌ Why the other choices are incorrect:

  • Option A is incorrect: JSON is a row-based format; it still requires reading entire rows even when only specific columns are needed, negating the primary benefit of columnar storage. Snappy helps, but the format limits optimization.
  • Option B is incorrect: While Snappy compression reduces data scanned, the underlying CSV format is row-based. Athena still reads full rows, missing the significant performance gains from columnar pruning for "selecting specific columns."
  • Option D is incorrect: Gzip compresses data, but CSV remains a row-based format, preventing columnar pruning. Gzip also has slower decompression than Snappy in parallel environments like Athena, potentially hindering performance.
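
One common way to perform this conversion inside Athena itself is a CREATE TABLE AS SELECT (CTAS) statement. The helper below only assembles such a statement as a string; the function name and the table and bucket names in the usage line are hypothetical, and the inputs are assumed to be trusted identifiers because they are interpolated directly.

```python
def parquet_ctas(source_table, target_table, s3_location):
    """Return an Athena CTAS statement that rewrites a table as
    Snappy-compressed Parquet at the given S3 location."""
    return (
        f"CREATE TABLE {target_table}\n"
        "WITH (\n"
        "  format = 'PARQUET',\n"
        "  write_compression = 'SNAPPY',\n"
        f"  external_location = '{s3_location}'\n"
        ")\n"
        f"AS SELECT * FROM {source_table}"
    )


if __name__ == "__main__":
    print(parquet_ctas("sales_csv", "sales_parquet", "s3://example-bucket/sales_parquet/"))
```

Queries that select a single column from the new table then scan only that column's data instead of every full row.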


QUESTION 6

A manufacturing company collects sensor data from its factory floor to monitor and enhance operational efficiency. The company uses Amazon Kinesis Data Streams to publish the data that the sensors collect to a data stream. Then Amazon Kinesis Data Firehose writes the data to an Amazon S3 bucket.

The company needs to display a real-time view of operational efficiency on a large screen in the manufacturing facility.

Which solution will meet these requirements with the LOWEST latency?

A
Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to process the sensor data. Use a connector for Apache Flink to write data to an Amazon Timestream database. Use the Timestream database as a source to create a Grafana dashboard.
B
Configure the S3 bucket to send a notification to an AWS Lambda function when any new object is created. Use the Lambda function to publish the data to Amazon Aurora. Use Aurora as a source to create an Amazon QuickSight dashboard.
C
Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to process the sensor data. Create a new Data Firehose delivery stream to publish data directly to an Amazon Timestream database. Use the Timestream database as a source to create an Amazon QuickSight dashboard.
D
Use AWS Glue bookmarks to read sensor data from the S3 bucket in real time. Publish the data to an Amazon Timestream database. Use the Timestream database as a source to create a Grafana dashboard.

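Correct Option: A

✅ **Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to process the sensor data. Use a connector for Apache Flink to write data to an Amazon Timestream database. Use the Timestream database as a source to create a Grafana dashboard.**

Reasoning: Amazon Managed Service for Apache Flink (Kinesis Data Analytics) processes data directly from Kinesis Data Streams with very low latency. Amazon Timestream is a purpose-built time-series database optimized for high-ingestion rates and fast queries, ideal for real-time dashboards like Grafana. This combination provides the most direct, real-time, and lowest-latency solution.

❌ Why the other choices are incorrect: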

  • Option B is incorrect: This approach introduces latency by writing data to S3 first, then triggering Lambda. Amazon Aurora is a relational database not optimized for time-series data or the lowest-latency ingestion/querying compared to Timestream.
  • Option C is incorrect: While Flink is good, creating an additional Kinesis Data Firehose stream after Flink to Timestream adds an unnecessary processing hop, increasing latency. Flink can directly write to Timestream. Amazon QuickSight typically has higher dashboard refresh latency than Grafana with Timestream for pure real-time streaming.
  • Option D is incorrect: Reading from S3 using AWS Glue, even with bookmarks, introduces higher latency as S3 is object storage and Glue is typically for batch/micro-batch processing, not true real-time streaming at the lowest latency.


QUESTION 7

A company stores daily records of the financial performance of investment portfolios in .csv format in an Amazon S3 bucket. A data engineer uses AWS Glue crawlers to crawl the S3 data.

The data engineer must make the S3 data accessible daily in the AWS Glue Data Catalog.

Which solution will meet these requirements?

A
Create an IAM role that includes the AmazonS3FullAccess policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Create a daily schedule to run the crawler. Configure the output destination to a new path in the existing S3 bucket.
B
Create an IAM role that includes the AWSGlueServiceRole policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Create a daily schedule to run the crawler. Specify a database name for the output.
C
Create an IAM role that includes the AmazonS3FullAccess policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Allocate data processing units (DPUs) to run the crawler every day. Specify a database name for the output.
D
Create an IAM role that includes the AWSGlueServiceRole policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Allocate data processing units (DPUs) to run the crawler every day. Configure the output destination to a new path in the existing S3 bucket.

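Correct Option: B

✅ **Create an IAM role that includes the AWSGlueServiceRole policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Create a daily schedule to run the crawler. Specify a database name for the output.**

Reasoning: An AWSGlueServiceRole IAM role grants the necessary permissions for a Glue crawler to read S3 data and write metadata to the AWS Glue Data Catalog. A daily schedule ensures the data is accessible daily. Specifying a database name directs the crawler's output to the Data Catalog.

❌ Why the other choices are incorrect: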

  • Option A is incorrect: The AmazonS3FullAccess policy is insufficient as Glue crawlers require AWSGlueServiceRole to interact with the Data Catalog. Also, crawlers output to the Data Catalog, not back to an S3 path.
  • Option C is incorrect: The AmazonS3FullAccess policy is insufficient for Glue operations. Allocating DPUs is a resource aspect, not the method to schedule a daily run, which requires setting a schedule.
  • Option D is incorrect: Allocating DPUs is not the method for daily scheduling. More critically, Glue crawlers populate metadata in the AWS Glue Data Catalog, they do not output data to a new path in an S3 bucket.
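
The crawler settings in option B map onto a handful of fields. The sketch below builds keyword arguments for boto3's Glue `create_crawler` call; the helper name and the default run hour are assumptions made for illustration.

```python
def daily_crawler_definition(name, role_arn, database, s3_path, hour_utc=1):
    """Return keyword arguments for glue.create_crawler.

    The crawler reads the S3 path, writes table metadata to the named
    Data Catalog database, and runs once a day at `hour_utc`.
    """
    return {
        "Name": name,
        "Role": role_arn,  # role that carries the AWSGlueServiceRole policy
        "DatabaseName": database,  # output goes to the Data Catalog, not S3
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Glue cron fields: minute hour day-of-month month day-of-week year
        "Schedule": f"cron(0 {hour_utc} * * ? *)",
    }
```

Note that there is no DPU setting here: the daily run comes from the schedule expression alone.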


QUESTION 8

A company loads transaction data for each day into Amazon Redshift tables at the end of each day. The company wants to have the ability to track which tables have been loaded and which tables still need to be loaded.

A data engineer wants to store the load statuses of Redshift tables in an Amazon DynamoDB table. The data engineer creates an AWS Lambda function to publish the details of the load statuses to DynamoDB.

How should the data engineer invoke the Lambda function to write load statuses to the DynamoDB table?

A
Use a second Lambda function to invoke the first Lambda function based on Amazon CloudWatch events.
B
Use the Amazon Redshift Data API to publish an event to Amazon EventBridge. Configure an EventBridge rule to invoke the Lambda function.
C
Use the Amazon Redshift Data API to publish a message to an Amazon Simple Queue Service (Amazon SQS) queue. Configure the SQS queue to invoke the Lambda function.
D
Use a second Lambda function to invoke the first Lambda function based on AWS CloudTrail events.

Correct Option: B

✅ **Use the Amazon Redshift Data API to publish an event to Amazon EventBridge. Configure an EventBridge rule to invoke the Lambda function.**

Reasoning: When a load statement is run through the Redshift Data API, the API can send an event to Amazon EventBridge after the statement finishes (the WithEvent option on ExecuteStatement). An EventBridge rule can match that event and invoke the Lambda function, which then writes the load status to DynamoDB. This is event-driven, requires no polling, and adds no extra component to operate.

❌ Why the other choices are incorrect:

  • Option A is incorrect: Relying on a second Lambda function invoked by CloudWatch events would typically involve polling or scheduling, which is less efficient and not truly event-driven in response to the completion of a Redshift load.
  • Option C is incorrect: While the application could publish to SQS, EventBridge is generally a more suitable event bus for "publishing events" and routing "load statuses," offering broader routing capabilities and a clearer semantic fit for event-driven architectures than a simple queue.
  • Option D is incorrect: AWS CloudTrail logs API calls for auditing. While Redshift API calls are logged, using CloudTrail for real-time application-specific event triggers for data load completion is inefficient, complex, and not its intended primary use case.
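
A sketch of the two pieces involved: the Data API request with event publishing switched on, and an EventBridge rule pattern that matches the resulting events. The helper name and statement-naming scheme are hypothetical, and the `source` and `detail-type` strings reflect my understanding of the events the Data API emits; they should be checked against the current EventBridge documentation before use.

```python
def load_statement_request(cluster_id, database, secret_arn, copy_sql, table_name):
    """Return keyword arguments for redshift-data execute_statement that
    ask the Data API to publish a completion event to EventBridge."""
    return {
        "ClusterIdentifier": cluster_id,
        "Database": database,
        "SecretArn": secret_arn,
        "Sql": copy_sql,
        # Naming the statement after the table lets the Lambda function
        # work out which table the status belongs to (assumed convention).
        "StatementName": f"load-{table_name}",
        "WithEvent": True,
    }


# Rule pattern assumed to match Data API statement status events.
STATUS_EVENT_PATTERN = {
    "source": ["aws.redshift-data"],
    "detail-type": ["Redshift Data Statement Status Change"],
}
```

An EventBridge rule created with this pattern would list the status-writing Lambda function as its target.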


QUESTION 9

A data engineer needs to securely transfer 5 TB of data from an on-premises data center to an Amazon S3 bucket. Approximately 5% of the data changes every day. Updates to the data need to be regularly proliferated to the S3 bucket. The data includes files that are in multiple formats. The data engineer needs to automate the transfer process and must schedule the process to run periodically.

Which AWS service should the data engineer use to transfer the data in the MOST operationally efficient way?

A
AWS DataSync
B
AWS Glue
C
AWS Direct Connect
D
Amazon S3 Transfer Acceleration

Correct Option: A

✅ **AWS DataSync**

Reasoning: DataSync is purpose-built for secure, automated, and scheduled online data transfer from on-premises storage to S3. It efficiently handles large initial datasets (5 TB) and regularly synchronizes incremental changes (5% daily), optimizing for operational efficiency and meeting all specified requirements for file formats and automation.

❌ Why the other choices are incorrect:

  • Option B is incorrect: AWS Glue is an ETL service designed for data transformation and cataloging, not for automated, incremental file transfer and synchronization from on-premises file systems to S3. It focuses on data processing rather than reliable, scheduled file replication.
  • Option C is incorrect: AWS Direct Connect provides a dedicated network connection to AWS, improving bandwidth and latency. However, it is an underlying network service, not a data transfer tool. It does not automate, schedule, or manage the actual file movement or incremental updates.
  • Option D is incorrect: Amazon S3 Transfer Acceleration speeds up data transfers over long distances by leveraging AWS Edge Locations. It improves transfer performance but does not provide the automation, scheduling, or incremental synchronization capabilities needed for ongoing on-premises file replication.
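
To make the incremental-sync point concrete, the sketch below works out the daily change volume from the question's numbers and builds keyword arguments for boto3's DataSync `create_task` call with a schedule. The helper names, task name, and run hour are illustrative assumptions; the location ARNs would come from locations created beforehand.

```python
def daily_delta_tb(total_tb, changed_fraction):
    """Approximate volume that changes, and must be re-sent, each day."""
    return total_tb * changed_fraction


def daily_sync_task(source_location_arn, dest_location_arn, hour_utc=2):
    """Return keyword arguments for datasync.create_task that copy only
    changed data on a daily schedule."""
    return {
        "SourceLocationArn": source_location_arn,
        "DestinationLocationArn": dest_location_arn,
        "Name": "onprem-to-s3-daily",  # hypothetical task name
        "Schedule": {"ScheduleExpression": f"cron(0 {hour_utc} * * ? *)"},
        "Options": {
            "TransferMode": "CHANGED",  # skip files that did not change
            "VerifyMode": "ONLY_FILES_TRANSFERRED",
        },
    }


if __name__ == "__main__":
    # 5 TB with 5% daily churn is roughly a quarter of a terabyte per run.
    print(daily_delta_tb(5, 0.05))
```

After the initial 5 TB copy, each scheduled run moves only the changed portion rather than the full dataset.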


QUESTION 10

A company uses an on-premises Microsoft SQL Server database to store financial transaction data. The company migrates the transaction data from the on-premises database to AWS at the end of each month. The company has noticed that the cost to migrate data from the on-premises database to an Amazon RDS for SQL Server database has increased recently.

The company requires a cost-effective solution to migrate the data to AWS. The solution must cause minimal downtime for the applications that access the database.

Which AWS service should the company use to meet these requirements?

A
AWS Lambda
B
AWS Database Migration Service (AWS DMS)
C
AWS Direct Connect
D
AWS DataSync

Correct Option: B

✅ **AWS Database Migration Service (AWS DMS)**

Reasoning: AWS DMS is specifically designed for migrating databases to AWS, supporting homogeneous migrations like SQL Server to RDS for SQL Server. It offers continuous data replication (CDC) to ensure minimal downtime for applications during migration, and is a cost-effective solution for large-scale database transfers.

❌ Why the other choices are incorrect:

  • Option A is incorrect: AWS Lambda is a serverless compute service for running code, not a dedicated service for migrating databases with transactional integrity and minimal downtime. Custom solutions using Lambda would be complex and costly.
  • Option C is incorrect: AWS Direct Connect provides a dedicated network connection to AWS. While it can reduce network costs and improve transfer speed, it is a networking service, not a database migration service itself. It doesn't perform the migration.
  • Option D is incorrect: AWS DataSync is for migrating large amounts of file or object data between on-premises storage and AWS storage services. It is not designed for transactional database migrations to services like Amazon RDS, which require specific database replication capabilities.
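
The minimal-downtime behavior comes from combining a full load with ongoing change data capture. The sketch below builds keyword arguments for boto3's DMS `create_replication_task` call; the helper name, rule name, and the default `dbo` schema are assumptions for illustration, and the endpoint and instance ARNs would come from resources created beforehand.

```python
import json


def replication_task_request(task_id, source_arn, target_arn, instance_arn, schema="dbo"):
    """Return keyword arguments for dms.create_replication_task that run a
    full load followed by continuous replication (CDC)."""
    table_mappings = {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-schema",
                "object-locator": {"schema-name": schema, "table-name": "%"},
                "rule-action": "include",
            }
        ]
    }
    return {
        "ReplicationTaskIdentifier": task_id,
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        # Copy existing rows, then keep applying changes until cutover.
        "MigrationType": "full-load-and-cdc",
        "TableMappings": json.dumps(table_mappings),  # the API expects a JSON string
    }
```

Because changes keep flowing after the initial copy, applications can stay on the source database until the target has caught up.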


QUESTION 11

A data engineer is building a data pipeline on AWS by using AWS Glue extract, transform, and load (ETL) jobs. The data engineer needs to process data from Amazon RDS and MongoDB, perform transformations, and load the transformed data into Amazon Redshift for analytics. The data updates must occur every hour.

Which combination of tasks will meet these requirements with the LEAST operational overhead? (Choose two.)

A
Configure AWS Glue triggers to run the ETL jobs every hour.
B
Use AWS Glue DataBrew to clean and prepare the data for analytics.
C
Use AWS Lambda functions to schedule and run the ETL jobs every hour.
D
Use AWS Glue connections to establish connectivity between the data sources and Amazon Redshift.
E
Use the Redshift Data API to load transformed data into Amazon Redshift.

Premium Solution Locked

Unlock all 351 answers & explanations

QUESTION 12

A company uses an Amazon Redshift cluster that runs on RA3 nodes. The company wants to scale read and write capacity to meet demand. A data engineer needs to identify a solution that will turn on concurrency scaling.

Which solution will meet this requirement?

A
Turn on concurrency scaling in workload management (WLM) for Redshift Serverless workgroups.
B
Turn on concurrency scaling at the workload management (WLM) queue level in the Redshift cluster.
C
Turn on concurrency scaling in the settings during the creation of any new Redshift cluster.
D
Turn on concurrency scaling for the daily usage quota for the Redshift cluster.


QUESTION 13

A data engineer must orchestrate a series of Amazon Athena queries that will run every day. Each query can run for more than 15 minutes.

Which combination of steps will meet these requirements MOST cost-effectively? (Choose two.)

A
Use an AWS Lambda function and the Athena Boto3 client start_query_execution API call to invoke the Athena queries programmatically.
B
Create an AWS Step Functions workflow and add two states. Add the first state before the Lambda function. Configure the second state as a Wait state to periodically check whether the Athena query has finished using the Athena Boto3 get_query_execution API call. Configure the workflow to invoke the next query when the current query has finished running.
C
Use an AWS Glue Python shell job and the Athena Boto3 client start_query_execution API call to invoke the Athena queries programmatically.
D
Use an AWS Glue Python shell script to run a sleep timer that checks every 5 minutes to determine whether the current Athena query has finished running successfully. Configure the Python shell script to invoke the next query when the current query has finished running.
E
Use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate the Athena queries in AWS Batch.


QUESTION 14

A company is migrating on-premises workloads to AWS. The company wants to reduce overall operational overhead. The company also wants to explore serverless options.

The company's current workloads use Apache Pig, Apache Oozie, Apache Spark, Apache Hbase, and Apache Flink. The on-premises workloads process petabytes of data in seconds. The company must maintain similar or better performance after the migration to AWS.

Which extract, transform, and load (ETL) service will meet these requirements?

A
AWS Glue
B
Amazon EMR
C
AWS Lambda
D
Amazon Redshift


QUESTION 15

A data engineer must use AWS services to ingest a dataset into an Amazon S3 data lake. The data engineer profiles the dataset and discovers that the dataset contains personally identifiable information (PII). The data engineer must implement a solution to profile the dataset and obfuscate the PII.

Which solution will meet this requirement with the LEAST operational effort?

A
Use an Amazon Kinesis Data Firehose delivery stream to process the dataset. Create an AWS Lambda transform function to identify the PII. Use an AWS SDK to obfuscate the PII. Set the S3 data lake as the target for the delivery stream.
B
Use the Detect PII transform in AWS Glue Studio to identify the PII. Obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake.
C
Use the Detect PII transform in AWS Glue Studio to identify the PII. Create a rule in AWS Glue Data Quality to obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake.
D
Ingest the dataset into Amazon DynamoDB. Create an AWS Lambda function to identify and obfuscate the PII in the DynamoDB table and to transform the data. Use the same Lambda function to ingest the data into the S3 data lake.


QUESTION 16

A company maintains multiple extract, transform, and load (ETL) workflows that ingest data from the company's operational databases into an Amazon S3 based data lake. The ETL workflows use AWS Glue and Amazon EMR to process data.

The company wants to improve the existing architecture to provide automated orchestration and to require minimal manual effort.

Which solution will meet these requirements with the LEAST operational overhead?

A
AWS Glue workflows
B
AWS Step Functions tasks
C
AWS Lambda functions
D
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) workflows


QUESTION 17

A company currently stores all of its data in Amazon S3 by using the S3 Standard storage class.

A data engineer examined data access patterns to identify trends. During the first 6 months, most data files are accessed several times each day. Between 6 months and 2 years, most data files are accessed once or twice each month. After 2 years, data files are accessed only once or twice each year.

The data engineer needs to use an S3 Lifecycle policy to develop new data storage rules. The new storage solution must continue to provide high availability.

Which solution will meet these requirements in the MOST cost-effective way?

A
Transition objects to S3 One Zone-Infrequent Access (S3 One Zone-IA) after 6 months. Transfer objects to S3 Glacier Flexible Retrieval after 2 years.
B
Transition objects to S3 Standard-Infrequent Access (S3 Standard-IA) after 6 months. Transfer objects to S3 Glacier Flexible Retrieval after 2 years.
C
Transition objects to S3 Standard-Infrequent Access (S3 Standard-IA) after 6 months. Transfer objects to S3 Glacier Deep Archive after 2 years.
D
Transition objects to S3 One Zone-Infrequent Access (S3 One Zone-IA) after 6 months. Transfer objects to S3 Glacier Deep Archive after 2 years.
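
For reference, the sketch below builds the lifecycle configuration that each of the four options describes, in the dictionary layout that boto3's `put_bucket_lifecycle_configuration` accepts. It does not indicate which option is correct. The rule ID, the empty prefix filter, and the day counts (180 days for 6 months, 730 days for 2 years) are illustrative assumptions.

```python
import json

# Storage class identifiers that the S3 API uses for the classes named in the options.
STORAGE_CLASS = {
    "S3 Standard-IA": "STANDARD_IA",
    "S3 One Zone-IA": "ONEZONE_IA",
    "S3 Glacier Flexible Retrieval": "GLACIER",
    "S3 Glacier Deep Archive": "DEEP_ARCHIVE",
}


def build_lifecycle(first_class, second_class, first_days=180, second_days=730):
    """Return a lifecycle configuration with two transitions (hypothetical rule ID)."""
    return {
        "Rules": [
            {
                "ID": "tiering-example",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: the rule applies to every object
                "Transitions": [
                    {"Days": first_days, "StorageClass": STORAGE_CLASS[first_class]},
                    {"Days": second_days, "StorageClass": STORAGE_CLASS[second_class]},
                ],
            }
        ]
    }


# The (first transition, second transition) pair that each option describes.
OPTIONS = {
    "A": ("S3 One Zone-IA", "S3 Glacier Flexible Retrieval"),
    "B": ("S3 Standard-IA", "S3 Glacier Flexible Retrieval"),
    "C": ("S3 Standard-IA", "S3 Glacier Deep Archive"),
    "D": ("S3 One Zone-IA", "S3 Glacier Deep Archive"),
}

configs = {letter: build_lifecycle(*classes) for letter, classes in OPTIONS.items()}
print(json.dumps(configs["A"], indent=2))
```

When comparing the options, note that the One Zone-IA class stores data in a single Availability Zone, which is relevant to the availability requirement in the question.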

QUESTION 18

A company maintains an Amazon Redshift provisioned cluster that the company uses for extract, transform, and load (ETL) operations to support critical analysis tasks. A sales team within the company maintains a Redshift cluster that the sales team uses for business intelligence (BI) tasks.

The sales team recently requested access to the data that is in the ETL Redshift cluster so the team can perform weekly summary analysis tasks. The sales team needs to join data from the ETL cluster with data that is in the sales team's BI cluster.

The company needs a solution that will share the ETL cluster data with the sales team without interrupting the critical analysis tasks. The solution must minimize usage of the computing resources of the ETL cluster.

Which solution will meet these requirements?

A
Set up the sales team BI cluster as a consumer of the ETL cluster by using Redshift data sharing.
B
Create materialized views based on the sales team's requirements. Grant the sales team direct access to the ETL cluster.
C
Create database views based on the sales team's requirements. Grant the sales team direct access to the ETL cluster.
D
Unload a copy of the data from the ETL cluster to an Amazon S3 bucket every week. Create an Amazon Redshift Spectrum table based on the content of the ETL cluster.

QUESTION 19

A data engineer needs to join data from multiple sources to perform a one-time analysis job. The data is stored in Amazon DynamoDB, Amazon RDS, Amazon Redshift, and Amazon S3.

Which solution will meet this requirement MOST cost-effectively?

A
Use an Amazon EMR provisioned cluster to read from all sources. Use Apache Spark to join the data and perform the analysis.
B
Copy the data from DynamoDB, Amazon RDS, and Amazon Redshift into Amazon S3. Run Amazon Athena queries directly on the S3 files.
C
Use Amazon Athena Federated Query to join the data from all data sources.
D
Use Redshift Spectrum to query data from DynamoDB, Amazon RDS, and Amazon S3 directly from Redshift.

QUESTION 20

A company is planning to use a provisioned Amazon EMR cluster that runs Apache Spark jobs to perform big data analysis. The company requires high reliability. A big data team must follow best practices for running cost-optimized and long-running workloads on Amazon EMR. The team must find a solution that will maintain the company's current level of performance.

Which combination of resources will meet these requirements MOST cost-effectively? (Choose two.)

A
Use Hadoop Distributed File System (HDFS) as a persistent data store.
B
Use Amazon S3 as a persistent data store.
C
Use x86-based instances for core nodes and task nodes.
D
Use Graviton instances for core nodes and task nodes.
E
Use Spot Instances for all primary nodes.

QUESTION 21

A company wants to implement real-time analytics capabilities. The company wants to use Amazon Kinesis Data Streams and Amazon Redshift to ingest and process streaming data at the rate of several gigabytes per second. The company wants to derive near real-time insights by using existing business intelligence (BI) and analytics tools.

Which solution will meet these requirements with the LEAST operational overhead?

A
Use Kinesis Data Streams to stage data in Amazon S3. Use the COPY command to load data from Amazon S3 directly into Amazon Redshift to make the data immediately available for real-time analysis.
B
Access the data from Kinesis Data Streams by using SQL queries. Create materialized views directly on top of the stream. Refresh the materialized views regularly to query the most recent stream data.
C
Create an external schema in Amazon Redshift to map the data from Kinesis Data Streams to an Amazon Redshift object. Create a materialized view to read data from the stream. Set the materialized view to auto refresh.
D
Connect Kinesis Data Streams to Amazon Kinesis Data Firehose. Use Kinesis Data Firehose to stage the data in Amazon S3. Use the COPY command to load the data from Amazon S3 to a table in Amazon Redshift.

QUESTION 22

A company uses an Amazon QuickSight dashboard to monitor usage of one of the company's applications. The company uses AWS Glue jobs to process data for the dashboard. The company stores the data in a single Amazon S3 bucket. The company adds new data every day.

A data engineer discovers that dashboard queries are becoming slower over time. The data engineer determines that the root cause of the slowing queries is long-running AWS Glue jobs.

Which actions should the data engineer take to improve the performance of the AWS Glue jobs? (Choose two.)

A
Partition the data that is in the S3 bucket. Organize the data by year, month, and day.
B
Increase the AWS Glue instance size by scaling up the worker type.
C
Convert the AWS Glue schema to the DynamicFrame schema class.
D
Adjust AWS Glue job scheduling frequency so the jobs run half as many times each day.
E
Modify the IAM role that grants access to AWS Glue to grant access to all S3 features.
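
As an illustration of the year, month, and day layout that one of the options mentions, the sketch below builds a Hive-style key prefix for a given date. The base prefix `usage-data` and the function name are hypothetical, and the example makes no claim about which options are correct.

```python
from datetime import date


def partition_prefix(base_prefix, day):
    """Build a Hive-style key prefix partitioned by year, month, and day."""
    base = base_prefix.rstrip("/")
    return f"{base}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


print(partition_prefix("usage-data", date(2024, 3, 7)))
# usage-data/year=2024/month=03/day=07/
```

With keys laid out this way, a job that needs only one day of data can read a single prefix instead of scanning the whole bucket.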

QUESTION 23

A data engineer needs to use AWS Step Functions to design an orchestration workflow. The workflow must process a large collection of data files in parallel and apply a specific transformation to each file.

Which Step Functions state should the data engineer use to meet these requirements?

A
Parallel state
B
Choice state
C
Map state
D
Wait state
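
To make the difference between two of these state types concrete, the sketch below builds Amazon States Language definitions for a Map state and a Parallel state as Python dictionaries. The state names, the `$.files` items path, the concurrency limit, and the Lambda ARNs are hypothetical placeholders.

```python
import json


def map_state(items_path, task_resource, max_concurrency=10):
    """A Map state: runs the same sub-workflow once for each item in an input array."""
    return {
        "Type": "Map",
        "ItemsPath": items_path,
        "MaxConcurrency": max_concurrency,
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "INLINE"},
            "StartAt": "TransformFile",
            "States": {
                "TransformFile": {"Type": "Task", "Resource": task_resource, "End": True}
            },
        },
        "End": True,
    }


def parallel_state(branch_resources):
    """A Parallel state: runs a fixed set of distinct branches on the same input."""
    branches = []
    for index, resource in enumerate(branch_resources):
        name = f"Branch{index}"
        branches.append(
            {
                "StartAt": name,
                "States": {name: {"Type": "Task", "Resource": resource, "End": True}},
            }
        )
    return {"Type": "Parallel", "Branches": branches, "End": True}


# Hypothetical Lambda function ARNs, used only to fill in the Resource fields.
TRANSFORM_ARN = "arn:aws:lambda:us-east-1:123456789012:function:transform-file"
AUDIT_ARN = "arn:aws:lambda:us-east-1:123456789012:function:audit-file"

print(json.dumps(map_state("$.files", TRANSFORM_ARN), indent=2))
print(json.dumps(parallel_state([TRANSFORM_ARN, AUDIT_ARN]), indent=2))
```

The Map definition iterates over whatever array arrives at the items path, whereas the number of Parallel branches is fixed when the state machine is written. Step Functions also provides a distributed mode for Map states that targets very large collections.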

QUESTION 24

A company is migrating a legacy application to an Amazon S3 based data lake. A data engineer reviewed data that is associated with the legacy application. The data engineer found that the legacy data contained some duplicate information.

The data engineer must identify and remove duplicate information from the legacy application data.

Which solution will meet these requirements with the LEAST operational overhead?

A
Write a custom extract, transform, and load (ETL) job in Python. Use the DataFrame.drop_duplicates() function by importing the Pandas library to perform data deduplication.
B
Write an AWS Glue extract, transform, and load (ETL) job. Use the FindMatches machine learning (ML) transform to transform the data to perform data deduplication.
C
Write a custom extract, transform, and load (ETL) job in Python. Import the Python dedupe library. Use the dedupe library to perform data deduplication.
D
Write an AWS Glue extract, transform, and load (ETL) job. Import the Python dedupe library. Use the dedupe library to perform data deduplication.

QUESTION 25

A company is building an analytics solution. The solution uses Amazon S3 for data lake storage and Amazon Redshift for a data warehouse. The company wants to use Amazon Redshift Spectrum to query the data that is in Amazon S3.

Which actions will provide the FASTEST queries? (Choose two.)

A
Use gzip compression to compress individual files to sizes that are between 1 GB and 5 GB.
B
Use a columnar storage file format.
C
Partition the data based on the most common query predicates.
D
Split the data into files that are less than 10 KB.
E
Use file formats that are not splittable.

QUESTION 26

A company uses Amazon RDS to store transactional data. The company runs an RDS DB instance in a private subnet. A developer wrote an AWS Lambda function with default settings to insert, update, or delete data in the DB instance.

The developer needs to give the Lambda function the ability to connect to the DB instance privately without using the public internet.

Which combination of steps will meet this requirement with the LEAST operational overhead? (Choose two.)

A
Turn on the public access setting for the DB instance.
B
Update the security group of the DB instance to allow only Lambda function invocations on the database port.
C
Configure the Lambda function to run in the same subnet that the DB instance uses.
D
Attach the same security group to the Lambda function and the DB instance. Include a self-referencing rule that allows access through the database port.
E
Update the network ACL of the private subnet to include a self-referencing rule that allows access through the database port.

QUESTION 27

A company has a frontend ReactJS website that uses Amazon API Gateway to invoke REST APIs. The APIs perform the functionality of the website. A data engineer needs to write a Python script that can be occasionally invoked through API Gateway. The code must return results to API Gateway.

Which solution will meet these requirements with the LEAST operational overhead?

A
Deploy a custom Python script on an Amazon Elastic Container Service (Amazon ECS) cluster.
B
Create an AWS Lambda Python function with provisioned concurrency.
C
Deploy a custom Python script that can integrate with API Gateway on Amazon Elastic Kubernetes Service (Amazon EKS).
D
Create an AWS Lambda function. Ensure that the function is warm by scheduling an Amazon EventBridge rule to invoke the Lambda function every 5 minutes by using mock events.

QUESTION 28

A company has a production AWS account that runs company workloads. The company's security team created a security AWS account to store and analyze security logs from the production AWS account. The security logs in the production AWS account are stored in Amazon CloudWatch Logs.

The company needs to use Amazon Kinesis Data Streams to deliver the security logs to the security AWS account.

Which solution will meet these requirements?

A
Create a destination data stream in the production AWS account. In the security AWS account, create an IAM role that has cross-account permissions to Kinesis Data Streams in the production AWS account.
B
Create a destination data stream in the security AWS account. Create an IAM role and a trust policy to grant CloudWatch Logs the permission to put data into the stream. Create a subscription filter in the security AWS account.
C
Create a destination data stream in the production AWS account. In the production AWS account, create an IAM role that has cross-account permissions to Kinesis Data Streams in the security AWS account.
D
Create a destination data stream in the security AWS account. Create an IAM role and a trust policy to grant CloudWatch Logs the permission to put data into the stream. Create a subscription filter in the production AWS account.

QUESTION 29

A company uses Amazon S3 to store semi-structured data in a transactional data lake. Some of the data files are small, but other data files are tens of terabytes.

A data engineer must perform a change data capture (CDC) operation to identify changed data from the data source. The data source sends a full snapshot as a JSON file every day and ingests the changed data into the data lake.

Which solution will capture the changed data MOST cost-effectively?

A
Create an AWS Lambda function to identify the changes between the previous data and the current data. Configure the Lambda function to ingest the changes into the data lake.
B
Ingest the data into Amazon RDS for MySQL. Use AWS Database Migration Service (AWS DMS) to write the changed data to the data lake.
C
Use an open source data lake format to merge the data source with the S3 data lake to insert the new data and update the existing data.
D
Ingest the data into an Amazon Aurora MySQL DB instance that runs Aurora Serverless. Use AWS Database Migration Service (AWS DMS) to write the changed data to the data lake.
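
To clarify what "changed data" means when the source delivers full daily snapshots, the sketch below compares two snapshots keyed by record ID and classifies the differences. It is a conceptual illustration only: the record layout and function name are hypothetical, and an in-memory comparison like this would not suit files of the size the question describes.

```python
def diff_snapshots(previous, current):
    """Compare two full snapshots keyed by record ID and return the changes."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {
        k: v for k, v in current.items() if k in previous and previous[k] != v
    }
    return {"inserts": inserts, "updates": updates, "deletes": deletes}


yesterday = {1: {"name": "a"}, 2: {"name": "b"}, 3: {"name": "c"}}
today = {1: {"name": "a"}, 2: {"name": "B"}, 4: {"name": "d"}}

changes = diff_snapshots(yesterday, today)
print(changes)
# {'inserts': {4: {'name': 'd'}}, 'updates': {2: {'name': 'B'}}, 'deletes': {3: {'name': 'c'}}}
```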

QUESTION 30

A data engineer runs Amazon Athena queries on data that is in an Amazon S3 bucket. The Athena queries use AWS Glue Data Catalog as a metadata table.

The data engineer notices that the Athena query plans are experiencing a performance bottleneck. The data engineer determines that the cause of the performance bottleneck is the large number of partitions that are in the S3 bucket. The data engineer must resolve the performance bottleneck and reduce Athena query planning time.

Which solutions will meet these requirements? (Choose two.)

A
Create an AWS Glue partition index. Enable partition filtering.
B
Bucket the data based on a column that user queries commonly reference in a WHERE clause.
C
Use Athena partition projection based on the S3 bucket prefix.
D
Transform the data that is in the S3 bucket to Apache Parquet format.
E
Use the Amazon EMR S3DistCP utility to combine smaller objects in the S3 bucket into larger objects.
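
For readers unfamiliar with Athena partition projection, which one option mentions, the sketch below assembles the table properties for a date-typed partition column and renders them as an `ALTER TABLE` statement. The table name, the column name `dt`, the bucket path, and the start date are hypothetical; the example is not a statement about which options are correct.

```python
def projection_properties(column, location_template, start="2020/01/01"):
    """Table properties that enable partition projection for one date column."""
    return {
        "projection.enabled": "true",
        f"projection.{column}.type": "date",
        f"projection.{column}.range": f"{start},NOW",
        f"projection.{column}.format": "yyyy/MM/dd",
        f"projection.{column}.interval": "1",
        f"projection.{column}.interval.unit": "DAYS",
        "storage.location.template": location_template,
    }


def alter_table_sql(table, props):
    """Render the properties as an ALTER TABLE ... SET TBLPROPERTIES statement."""
    pairs = ",\n  ".join(f"'{key}' = '{value}'" for key, value in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES (\n  {pairs}\n)"


props = projection_properties("dt", "s3://example-bucket/logs/${dt}/")
print(alter_table_sql("access_logs", props))
```

With projection enabled, Athena computes partition locations from these properties instead of retrieving partition metadata from the catalog.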

QUESTION 31

A company uses Amazon Athena to run SQL queries for extract, transform, and load (ETL) tasks by using Create Table As Select (CTAS). The company must use Apache Spark instead of SQL to generate analytics.

Which solution will give the company the ability to use Spark to access Athena?

A
Athena query settings
B
Athena workgroup
C
Athena data source
D
Athena query editor

QUESTION 32

A company uses Amazon Redshift for its data warehouse. The company must automate refresh schedules for Amazon Redshift materialized views.

Which solution will meet this requirement with the LEAST effort?

A
Use Apache Airflow to refresh the materialized views.
B
Use an AWS Lambda user-defined function (UDF) within Amazon Redshift to refresh the materialized views.
C
Use the query editor v2 in Amazon Redshift to refresh the materialized views.
D
Use an AWS Glue workflow to refresh the materialized views.

QUESTION 33

A company stores data from an application in an Amazon DynamoDB table that operates in provisioned capacity mode. The workloads of the application have predictable throughput load on a regular schedule. Every Monday, there is an immediate increase in activity early in the morning. The application has very low usage during weekends.

The company must ensure that the application performs consistently during peak usage times.

Which solution will meet these requirements in the MOST cost-effective way?

A
Increase the provisioned capacity to the maximum capacity that is currently present during peak load times.
B
Divide the table into two tables. Provision each table with half of the provisioned capacity of the original table. Spread queries evenly across both tables.
C
Use AWS Application Auto Scaling to schedule higher provisioned capacity for peak usage times. Schedule lower capacity during off-peak times.
D
Change the capacity mode from provisioned to on-demand. Configure the table to scale up and scale down based on the load on the table.
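
As background for the option that mentions scheduled capacity changes, the sketch below builds the arguments that Application Auto Scaling's `put_scheduled_action` call takes for a DynamoDB table. The table name, action names, cron times, and capacity values are hypothetical, and the example does not indicate which option is correct.

```python
def scheduled_action(name, table, schedule, min_capacity, max_capacity,
                     dimension="dynamodb:table:WriteCapacityUnits"):
    """Build keyword arguments for Application Auto Scaling put_scheduled_action."""
    return {
        "ServiceNamespace": "dynamodb",
        "ScheduledActionName": name,
        "ResourceId": f"table/{table}",
        "ScalableDimension": dimension,
        "Schedule": schedule,  # cron fields: minute hour day-of-month month day-of-week year
        "ScalableTargetAction": {
            "MinCapacity": min_capacity,
            "MaxCapacity": max_capacity,
        },
    }


actions = [
    # Raise capacity shortly before the Monday morning peak (assumed times).
    scheduled_action("monday-peak", "AppTable", "cron(0 5 ? * MON *)", 500, 1000),
    # Lower capacity at the start of the weekend.
    scheduled_action("weekend-low", "AppTable", "cron(0 0 ? * SAT *)", 5, 50),
]

for action in actions:
    print(action["ScheduledActionName"], action["Schedule"])
```

With boto3, each dictionary could be passed as `client.put_scheduled_action(**action)` on an `application-autoscaling` client, after the table has been registered as a scalable target.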

QUESTION 34

A company stores data in a data lake that is in Amazon S3. Some data that the company stores in the data lake contains personally identifiable information (PII). Multiple user groups need to access the raw data. The company must ensure that user groups can access only the PII that they require.

Which solution will meet these requirements with the LEAST effort?

A
Use Amazon Athena to query the data. Set up AWS Lake Formation and create data filters to establish levels of access for the company's IAM roles. Assign each user to the IAM role that matches the user's PII access requirements.
B
Use Amazon QuickSight to access the data. Use column-level security features in QuickSight to limit the PII that users can retrieve from Amazon S3 by using Amazon Athena. Define QuickSight access levels based on the PII access requirements of the users.
C
Build a custom query builder UI that will run Athena queries in the background to access the data. Create user groups in Amazon Cognito. Assign access levels to the user groups based on the PII access requirements of the users.
D
Create IAM roles that have different levels of granular access. Assign the IAM roles to IAM user groups. Use an identity-based policy to assign access levels to user groups at the column level.

QUESTION 35

A data engineer must build an extract, transform, and load (ETL) pipeline to process and load data from 10 source systems into 10 tables that are in an Amazon Redshift database. All the source systems generate .csv, JSON, or Apache Parquet files every 15 minutes. The source systems all deliver files into one Amazon S3 bucket. The file sizes range from 10 MB to 20 GB. The ETL pipeline must function correctly despite changes to the data schema.

Which data pipeline solutions will meet these requirements? (Choose two.)

A
Use an Amazon EventBridge rule to run an AWS Glue job every 15 minutes. Configure the AWS Glue job to process and load the data into the Amazon Redshift tables.
B
Use an Amazon EventBridge rule to invoke an AWS Glue workflow job every 15 minutes. Configure the AWS Glue workflow to have an on-demand trigger that runs an AWS Glue crawler and then runs an AWS Glue job when the crawler finishes running successfully. Configure the AWS Glue job to process and load the data into the Amazon Redshift tables.
C
Configure an AWS Lambda function to invoke an AWS Glue crawler when a file is loaded into the S3 bucket. Configure an AWS Glue job to process and load the data into the Amazon Redshift tables. Create a second Lambda function to run the AWS Glue job. Create an Amazon EventBridge rule to invoke the second Lambda function when the AWS Glue crawler finishes running successfully.
D
Configure an AWS Lambda function to invoke an AWS Glue workflow when a file is loaded into the S3 bucket. Configure the AWS Glue workflow to have an on-demand trigger that runs an AWS Glue crawler and then runs an AWS Glue job when the crawler finishes running successfully. Configure the AWS Glue job to process and load the data into the Amazon Redshift tables.
E
Configure an AWS Lambda function to invoke an AWS Glue job when a file is loaded into the S3 bucket. Configure the AWS Glue job to read the files from the S3 bucket into an Apache Spark DataFrame. Configure the AWS Glue job to also put smaller partitions of the DataFrame into an Amazon Kinesis Data Firehose delivery stream. Configure the delivery stream to load data into the Amazon Redshift tables.

QUESTION 36

A company has used an Amazon Redshift table that is named Orders for 6 months. The company performs weekly updates and deletes on the table. The table has an interleaved sort key on a column that contains AWS Regions.

The company wants to reclaim disk space so that the company will not run out of storage space. The company also wants to analyze the sort key column.

Which Amazon Redshift command will meet these requirements?

A
VACUUM FULL Orders
B
VACUUM DELETE ONLY Orders
C
VACUUM REINDEX Orders
D
VACUUM SORT ONLY Orders

QUESTION 37

A company stores details about transactions in an Amazon S3 bucket. The company wants to log all writes to the S3 bucket into another S3 bucket that is in the same AWS Region.

Which solution will meet this requirement with the LEAST operational effort?

A
Configure an S3 Event Notifications rule for all activities on the transactions S3 bucket to invoke an AWS Lambda function. Program the Lambda function to write the event to Amazon Kinesis Data Firehose. Configure Kinesis Data Firehose to write the event to the logs S3 bucket.
B
Create a trail of management events in AWS CloudTrail. Configure the trail to receive data from the transactions S3 bucket. Specify an empty prefix and write-only events. Specify the logs S3 bucket as the destination bucket.
C
Configure an S3 Event Notifications rule for all activities on the transactions S3 bucket to invoke an AWS Lambda function. Program the Lambda function to write the events to the logs S3 bucket.
D
Create a trail of data events in AWS CloudTrail. Configure the trail to receive data from the transactions S3 bucket. Specify an empty prefix and write-only events. Specify the logs S3 bucket as the destination bucket.

QUESTION 38

A data engineer needs to maintain a central metadata repository that users access through Amazon EMR and Amazon Athena queries. The repository needs to provide the schema and properties of many tables. Some of the metadata is stored in Apache Hive. The data engineer needs to import the metadata from Hive into the central metadata repository.

Which solution will meet these requirements with the LEAST development effort?

A
Use Amazon EMR and Apache Ranger.
B
Use a Hive metastore on an EMR cluster.
C
Use the AWS Glue Data Catalog.
D
Use a metastore on an Amazon RDS for MySQL DB instance.

QUESTION 39

A company needs to build a data lake in AWS. The company must provide row-level data access and column-level data access to specific teams. The teams will access the data by using Amazon Athena, Amazon Redshift Spectrum, and Apache Hive from Amazon EMR.

Which solution will meet these requirements with the LEAST operational overhead?

A
Use Amazon S3 for data lake storage. Use S3 access policies to restrict data access by rows and columns. Provide data access through Amazon S3.
B
Use Amazon S3 for data lake storage. Use Apache Ranger through Amazon EMR to restrict data access by rows and columns. Provide data access by using Apache Pig.
C
Use Amazon Redshift for data lake storage. Use Redshift security policies to restrict data access by rows and columns. Provide data access by using Apache Spark and Amazon Athena federated queries.
D
Use Amazon S3 for data lake storage. Use AWS Lake Formation to restrict data access by rows and columns. Provide data access through AWS Lake Formation.

QUESTION 40

A manufacturing company wants to collect data from sensors. A data engineer needs to implement a solution that ingests sensor data in near real time.

The solution must store the data to a persistent data store. The solution must store the data in nested JSON format. The company must have the ability to query from the data store with a latency of less than 10 milliseconds.

Which solution will meet these requirements with the LEAST operational overhead?

A
Use a self-hosted Apache Kafka cluster to capture the sensor data. Store the data in Amazon S3 for querying.
B
Use AWS Lambda to process the sensor data. Store the data in Amazon S3 for querying.
C
Use Amazon Kinesis Data Streams to capture the sensor data. Store the data in Amazon DynamoDB for querying.
D
Use Amazon Simple Queue Service (Amazon SQS) to buffer incoming sensor data. Use AWS Glue to store the data in Amazon RDS for querying.

QUESTION 41

A data engineering team is using an Amazon Redshift data warehouse for operational reporting. The team wants to prevent performance issues that might result from long-running queries. A data engineer must choose a system table in Amazon Redshift to record anomalies when a query optimizer identifies conditions that might indicate performance issues.

Which table view should the data engineer use to meet this requirement?

A
STL_USAGE_CONTROL
B
STL_ALERT_EVENT_LOG
C
STL_QUERY_METRICS
D
STL_PLAN_INFO

QUESTION 42

A company is migrating its database servers from Amazon EC2 instances that run Microsoft SQL Server to Amazon RDS for Microsoft SQL Server DB instances. The company's analytics team must export large data elements every day until the migration is complete. The data elements are the result of SQL joins across multiple tables. The data must be in Apache Parquet format. The analytics team must store the data in Amazon S3.

Which solution will meet these requirements in the MOST operationally efficient way?

A
Create a view in the EC2 instance-based SQL Server databases that contains the required data elements. Create an AWS Glue job that selects the data directly from the view and transfers the data in Parquet format to an S3 bucket. Schedule the AWS Glue job to run every day.
B
Schedule SQL Server Agent to run a daily SQL query that selects the desired data elements from the EC2 instance-based SQL Server databases. Configure the query to direct the output .csv objects to an S3 bucket. Create an S3 event that invokes an AWS Lambda function to transform the output format from .csv to Parquet.
C
Use a SQL query to create a view in the EC2 instance-based SQL Server databases that contains the required data elements. Create and run an AWS Glue crawler to read the view. Create an AWS Glue job that retrieves the data and transfers the data in Parquet format to an S3 bucket. Schedule the AWS Glue job to run every day.
D
Create an AWS Lambda function that queries the EC2 instance-based databases by using Java Database Connectivity (JDBC). Configure the Lambda function to retrieve the required data, transform the data into Parquet format, and transfer the data into an S3 bucket. Use Amazon EventBridge to schedule the Lambda function to run every day.

QUESTION 43

A data engineer needs to build an extract, transform, and load (ETL) job. The ETL job will process daily incoming .csv files that users upload to an Amazon S3 bucket. The size of each S3 object is less than 100 MB.

Which solution will meet these requirements MOST cost-effectively?

A
Write a custom Python application. Host the application on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster.
B
Write a PySpark ETL script. Host the script on an Amazon EMR cluster.
C
Write an AWS Glue PySpark job. Use Apache Spark to transform the data.
D
Write an AWS Glue Python shell job. Use pandas to transform the data.

QUESTION 44

A company has multiple applications that use datasets that are stored in an Amazon S3 bucket. The company has an ecommerce application that generates a dataset that contains personally identifiable information (PII). The company has an internal analytics application that does not require access to the PII.

To comply with regulations, the company must not share PII unnecessarily. A data engineer needs to implement a solution that will redact PII dynamically, based on the needs of each application that accesses the dataset.

Which solution will meet the requirements with the LEAST operational overhead?

A
Create an S3 bucket policy to limit the access each application has. Create multiple copies of the dataset. Give each dataset copy the appropriate level of redaction for the needs of the application that accesses the copy.
B
Create an S3 Object Lambda endpoint. Use the S3 Object Lambda endpoint to read data from the S3 bucket. Implement redaction logic within an S3 Object Lambda function to dynamically redact PII based on the needs of each application that accesses the data.
C
Use AWS Glue to transform the data for each application. Create multiple copies of the dataset. Give each dataset copy the appropriate level of redaction for the needs of the application that accesses the copy.
D
Create an API Gateway endpoint that has custom authorizers. Use the API Gateway endpoint to read data from the S3 bucket. Initiate a REST API call to dynamically redact PII based on the needs of each application that accesses the data.

QUESTION 45

A data engineer is configuring Amazon SageMaker Studio to use AWS Glue interactive sessions to prepare data for machine learning (ML) models.

The data engineer receives an access denied error when the data engineer tries to prepare the data by using SageMaker Studio.

Which change should the engineer make to gain access to SageMaker Studio?

A
Add the AWSGlueServiceRole managed policy to the data engineer's IAM user.
B
Add a policy to the data engineer's IAM user that includes the sts:AssumeRole action for the AWS Glue and SageMaker service principals in the trust policy.
C
Add the AmazonSageMakerFullAccess managed policy to the data engineer's IAM user.
D
Add a policy to the data engineer's IAM user that allows the sts:AddAssociation action for the AWS Glue and SageMaker service principals in the trust policy.

QUESTION 46

A healthcare company uses Amazon Kinesis Data Streams to stream real-time health data from wearable devices, hospital equipment, and patient records.

A data engineer needs to find a solution to process the streaming data. The data engineer needs to store the data in an Amazon Redshift Serverless warehouse. The solution must support near real-time analytics of the streaming data and the previous day's data.

Which solution will meet these requirements with the LEAST operational overhead?

A
Load data into Amazon Kinesis Data Firehose. Load the data into Amazon Redshift.
B
Use the streaming ingestion feature of Amazon Redshift.
C
Load the data into Amazon S3. Use the COPY command to load the data into Amazon Redshift.
D
Use the Amazon Aurora zero-ETL integration with Amazon Redshift.

QUESTION 47

A company is planning to upgrade its Amazon Elastic Block Store (Amazon EBS) General Purpose SSD storage from gp2 to gp3. The company wants to prevent any interruptions in its Amazon EC2 instances that will cause data loss during the migration to the upgraded storage.

Which solution will meet these requirements with the LEAST operational overhead?

A
Create snapshots of the gp2 volumes. Create new gp3 volumes from the snapshots. Attach the new gp3 volumes to the EC2 instances.
B
Create new gp3 volumes. Gradually transfer the data to the new gp3 volumes. When the transfer is complete, mount the new gp3 volumes to the EC2 instances to replace the gp2 volumes.
C
Change the volume type of the existing gp2 volumes to gp3. Enter new values for volume size, IOPS, and throughput.
D
Use AWS DataSync to create new gp3 volumes. Transfer the data from the original gp2 volumes to the new gp3 volumes.

QUESTION 48

A data engineer must manage the ingestion of real-time streaming data into AWS. The data engineer wants to perform real-time analytics on the incoming streaming data by using time-based aggregations over a window of up to 30 minutes. The data engineer needs a solution that is highly fault tolerant.

Which solution will meet these requirements with the LEAST operational overhead?

A
Use an AWS Lambda function that includes both the business and the analytics logic to perform time-based aggregations over a window of up to 30 minutes for the data in Amazon Kinesis Data Streams.
B
Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to analyze the data that might occasionally contain duplicates by using multiple types of aggregations.
C
Use an AWS Lambda function that includes both the business and the analytics logic to perform aggregations for a tumbling window of up to 30 minutes, based on the event timestamp.
D
Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to analyze the data by using multiple types of aggregations to perform time-based analytics over a window of up to 30 minutes.

QUESTION 49

A company extracts approximately 1 TB of data every day from data sources such as SAP HANA, Microsoft SQL Server, MongoDB, Apache Kafka, and Amazon DynamoDB. Some of the data sources have undefined data schemas or data schemas that change.

A data engineer must implement a solution that can detect the schema for these data sources. The solution must extract, transform, and load the data to an Amazon S3 bucket. The company has a service level agreement (SLA) to load the data into the S3 bucket within 15 minutes of data creation.

Which solution will meet these requirements with the LEAST operational overhead?

A
Use Amazon EMR to detect the schema and to extract, transform, and load the data into the S3 bucket. Create a pipeline in Apache Spark.
B
Use AWS Glue to detect the schema and to extract, transform, and load the data into the S3 bucket. Create a pipeline in Apache Spark.
C
Create a PySpark program in AWS Lambda to extract, transform, and load the data into the S3 bucket.
D
Create a stored procedure in Amazon Redshift to detect the schema and to extract, transform, and load the data into a Redshift Spectrum table. Access the table from Amazon S3.

QUESTION 50

A company stores datasets in JSON format and .csv format in an Amazon S3 bucket. The company has Amazon RDS for Microsoft SQL Server databases, Amazon DynamoDB tables that are in provisioned capacity mode, and an Amazon Redshift cluster. A data engineering team must develop a solution that will give data scientists the ability to query all data sources by using syntax similar to SQL.

Which solution will meet these requirements with the LEAST operational overhead?

A
Use AWS Glue to crawl the data sources. Store metadata in the AWS Glue Data Catalog. Use Amazon Athena to query the data. Use SQL for structured data sources. Use PartiQL for data that is stored in JSON format.
B
Use AWS Glue to crawl the data sources. Store metadata in the AWS Glue Data Catalog. Use Redshift Spectrum to query the data. Use SQL for structured data sources. Use PartiQL for data that is stored in JSON format.
C
Use AWS Glue to crawl the data sources. Store metadata in the AWS Glue Data Catalog. Use AWS Glue jobs to transform data that is in JSON format to Apache Parquet or .csv format. Store the transformed data in an S3 bucket. Use Amazon Athena to query the original and transformed data from the S3 bucket.
D
Use AWS Lake Formation to create a data lake. Use Lake Formation jobs to transform the data from all data sources to Apache Parquet format. Store the transformed data in an S3 bucket. Use Amazon Athena or Redshift Spectrum to query the data.

QUESTION 51

A data engineer needs to use an Amazon QuickSight dashboard that is based on Amazon Athena queries on data that is stored in an Amazon S3 bucket. When the data engineer connects to the QuickSight dashboard, the data engineer receives an error message that indicates insufficient permissions.

Which factors could cause the permissions-related errors? (Choose two.)

A
There is no connection between QuickSight and Athena.
B
The Athena tables are not cataloged.
C
QuickSight does not have access to the S3 bucket.
D
QuickSight does not have access to decrypt S3 data.
E
There is no IAM role assigned to QuickSight.

QUESTION 52

A company receives a daily file that contains customer data in .xls format. The company stores the file in Amazon S3. The daily file is approximately 2 GB in size.

A data engineer concatenates the column in the file that contains customer first names and the column that contains customer last names. The data engineer needs to determine the number of distinct customers in the file.

Which solution will meet this requirement with the LEAST operational effort?

A
Create and run an Apache Spark job in an AWS Glue notebook. Configure the job to read the S3 file and calculate the number of distinct customers.
B
Create an AWS Glue crawler to create an AWS Glue Data Catalog of the S3 file. Run SQL queries from Amazon Athena to calculate the number of distinct customers.
C
Create and run an Apache Spark job in Amazon EMR Serverless to calculate the number of distinct customers.
D
Use AWS Glue DataBrew to create a recipe that uses the COUNT_DISTINCT aggregate function to calculate the number of distinct customers.

QUESTION 53

During a security review, a company identified a vulnerability in an AWS Glue job. The company discovered that credentials to access an Amazon Redshift cluster were hard coded in the job script.

A data engineer must remediate the security vulnerability in the AWS Glue job. The solution must securely store the credentials.

Which combination of steps should the data engineer take to meet these requirements? (Choose two.)

A
Store the credentials in the AWS Glue job parameters.
B
Store the credentials in a configuration file that is in an Amazon S3 bucket.
C
Access the credentials from a configuration file that is in an Amazon S3 bucket by using the AWS Glue job.
D
Store the credentials in AWS Secrets Manager.
E
Grant the AWS Glue job IAM role access to the stored credentials.

QUESTION 54

A company receives call logs as Amazon S3 objects that contain sensitive customer information. The company must protect the S3 objects by using encryption. The company must also use encryption keys that only specific employees can access.

Which solution will meet these requirements with the LEAST effort?

A
Use an AWS CloudHSM cluster to store the encryption keys. Configure the process that writes to Amazon S3 to make calls to CloudHSM to encrypt and decrypt the objects. Deploy an IAM policy that restricts access to the CloudHSM cluster.
B
Use server-side encryption with customer-provided keys (SSE-
C
Use server-side encryption with AWS KMS keys (SSE-KMS) to encrypt the objects that contain customer information. Configure an IAM policy that restricts access to the KMS keys that encrypt the objects.
D
Use server-side encryption with Amazon S3 managed keys (SSE-S3) to encrypt the objects that contain customer information. Configure an IAM policy that restricts access to the Amazon S3 managed keys that encrypt the objects.

QUESTION 55

A company's data engineer needs to optimize the performance of table SQL queries. The company stores data in an Amazon Redshift cluster. The data engineer cannot increase the size of the cluster because of budget constraints.

The company stores the data in multiple tables and loads the data by using the EVEN distribution style. Some tables are hundreds of gigabytes in size. Other tables are less than 10 MB in size.

Which solution will meet these requirements?

A
Keep using the EVEN distribution style for all tables. Specify primary and foreign keys for all tables.
B
Use the ALL distribution style for large tables. Specify primary and foreign keys for all tables.
C
Use the ALL distribution style for rarely updated small tables. Specify primary and foreign keys for all tables.
D
Specify a combination of distribution, sort, and partition keys for all tables.

QUESTION 56

An application uses an AWS Lambda function that is configured with managed runtimes. The Lambda function successfully writes logs to the default Amazon CloudWatch Logs log group. A data engineer wants to modify the logging behavior to show only ERROR level logs for application logs and WARN level logs for system logs.

Which solution will meet these requirements?

A
Add additional permissions to the Lambda execution role.
B
Set the log level to ERROR in the Lambda function code.
C
Configure the Lambda function to use the JSON log format.
D
Configure the Lambda function to send logs to a custom log group.

QUESTION 57

A financial company wants to use Amazon Athena to run on-demand SQL queries on a petabyte-scale dataset to support a business intelligence (BI) application. An AWS Glue job that runs during non-business hours updates the dataset once every day. The BI application has a standard data refresh frequency of 1 hour to comply with company policies.

A data engineer wants to cost optimize the company's use of Amazon Athena without adding any additional infrastructure costs.

Which solution will meet these requirements with the LEAST operational overhead?

A
Configure an Amazon S3 Lifecycle policy to move data to the S3 Glacier Deep Archive storage class after 1 day.
B
Use the query result reuse feature of Amazon Athena for the SQL queries.
C
Add an Amazon ElastiCache cluster between the BI application and Athena.
D
Change the format of the files that are in the dataset to Apache Parquet.

QUESTION 58

A data engineer needs to build an interactive system that answers complex questions about customer feedback. The system must generate comprehensive reports that summarize feedback trends and provide natural language explanations of product issues with supporting evidence. The system must give users the ability to ask follow-up questions in natural language. The system must also dynamically explore feedback patterns without predefined categories. The data must remain within the data engineer's AWS account.

Which solution will meet these requirements with the LEAST operational overhead?

A
Use Amazon SageMaker AI to build a custom sentiment analysis model. Use Amazon EMR to run batch processing jobs that categorize the feedback data into themes.
B
Create an Amazon Bedrock knowledge base linked to an Amazon S3 bucket that contains survey data. Use Retrieval Augmented Generation (RAG) to analyze feedback trends.
C
Use Amazon Comprehend to detect sentiment and key phrases. Use Amazon OpenSearch Service to index and search the feedback data for specific themes.
D
Store survey data in Amazon DynamoDB. Use AWS Lambda functions with custom code to analyze sentiment and categorize the feedback.

QUESTION 59

An airline company is collecting metrics about flight activities for analytics. The company is conducting a proof of concept (POC) test to show how analytics can provide insights that the company can use to increase on-time departures.

The POC test uses objects in Amazon S3 that contain the metrics in .csv format. The POC test uses Amazon Athena to query the data. The data is partitioned in the S3 bucket by date.

As the amount of data increases, the company wants to optimize the storage solution to improve query performance.

Which combination of solutions will meet these requirements? (Choose two.)

A
Add a randomized string to the beginning of the keys in Amazon S3 to get more throughput across partitions.
B
Use an S3 bucket that is in the same account that uses Athena to query the data.
C
Use an S3 bucket that is in the same AWS Region where the company runs Athena queries.
D
Preprocess the .csv data to JSON format by fetching only the document keys that the query requires.
E
Preprocess the .csv data to Apache Parquet format by fetching only the data blocks that are needed for predicates.

QUESTION 60

A data engineer needs to share a dataset that contains customer transaction data with a machine learning (ML) team in another AWS account. The dataset is stored in Amazon S3 and contains sensitive information that requires governance controls. The data engineer wants to ensure that the ML team can discover and request access to the dataset. The solution must maintain appropriate security controls and track data lineage.

Which solution will meet these requirements?

A
Create cross-account IAM roles that grant the ML team direct access to the S3 bucket where the data is stored. Use AWS CloudTrail to track data access.
B
Configure an Amazon SageMaker Unified Studio data catalog project that contains the dataset with appropriate metadata and project-based access controls.
C
Set up cross-account S3 bucket replication to copy the dataset to the ML team's account. Use S3 server access logging to monitor usage.
D
Create an AWS Lake Formation data catalog. Define tag-based access controls that allow the ML team to query the data directly from the team's account.

QUESTION 61

A security company stores IoT data that is in JSON format in an Amazon S3 bucket. The data structure can change when the company upgrades the IoT devices. The company wants to create a data catalog that includes the IoT data. The company's analytics department will use the data catalog to index the data.

Which solution will meet these requirements MOST cost-effectively?

A
Create an AWS Glue Data Catalog. Configure an AWS Glue Schema Registry. Create a new AWS Glue workload to orchestrate the ingestion of the data that the analytics department will use into Amazon Redshift Serverless.
B
Create an Amazon Redshift provisioned cluster. Create an Amazon Redshift Spectrum database for the analytics department to explore the data that is in Amazon S3. Create Redshift stored procedures to load the data into Amazon Redshift.
C
Create an Amazon Athena workgroup. Explore the data that is in Amazon S3 by using Apache Spark through Athena. Provide the Athena workgroup schema and tables to the analytics department.
D
Create an AWS Glue Data Catalog. Configure an AWS Glue Schema Registry. Create AWS Lambda user defined functions (UDFs) by using the Amazon Redshift Data API. Create an AWS Step Functions job to orchestrate the ingestion of the data that the analytics department will use into Amazon Redshift Serverless.

QUESTION 62

A company has an application that is deployed on AWS. The application uses Amazon Simple Notification Service (Amazon SNS) with multiple topics. The company's security team needs to be able to audit all Publish and PublishBatch API actions for all the SNS topics. The company's application team and security team must also be able to query the audit data. The company has already established an event data store in AWS CloudTrail Lake to collect all events.

Which solution will meet these requirements with the LEAST operational overhead?

A
Enable management events for the SNS topics. Create a table in AWS Glue Data Catalog. Query the data by using Amazon Athena.
B
Enable management events for the SNS topics. Use CloudTrail Lake to query the audit data.
C
Enable data events for the SNS topics. Use CloudTrail Lake to query the audit data.
D
Enable data events for the SNS topics. Create a table in AWS Glue Data Catalog. Query the data by using Amazon Athena.

QUESTION 63

A company uses an Amazon Redshift provisioned cluster as its database. The Redshift cluster has five reserved ra3.4xlarge nodes and uses key distribution.

A data engineer notices that one of the nodes frequently has a CPU load over 90%. SQL queries that run on the node are queued. The other four nodes usually have a CPU load under 15% during daily operations.

The data engineer wants to maintain the current number of compute nodes. The data engineer also wants to balance the load more evenly across all five compute nodes.

Which solution will meet these requirements?

A
Change the sort key to be the data column that is most often used in a WHERE clause of the SQL SELECT statement.
B
Change the distribution key to the table column that has the largest dimension.
C
Upgrade the reserved node from ra3.4xlarge to ra3.16xlarge.
D
Change the primary key to be the data column that is most often used in a WHERE clause of the SQL SELECT statement.

QUESTION 64

A company processes a CSV file that contains millions of transaction records every day. The file is stored in Amazon S3. Each transaction must be validated before updating a database. The company needs a solution that will process the data in parallel. The solution must use error handling that stops the entire process if more than 15% of the records fail validation.

Which solution will meet these requirements with the LEAST operational overhead?

A
Create an AWS Batch job that processes chunks of the file in parallel with a custom error tracking mechanism.
B
Use AWS Step Functions Distributed Map state with the ToleratedFailurePercentage field set to 15%.
C
Deploy an Amazon EMR cluster with Spark to process the file. Configure a custom failure threshold to 15%.
D
Use AWS Lambda with S3 Batch Operations to process the file and track validation failures to be less than 15%.

QUESTION 65

A company is planning to migrate on-premises Apache Hadoop clusters to Amazon EMR. The company also needs to migrate a data catalog into a persistent storage solution.

The company currently stores the data catalog in an on-premises Apache Hive metastore on the Hadoop clusters. The company requires a serverless solution to migrate the data catalog.

Which solution will meet these requirements MOST cost-effectively?

A
Use AWS Database Migration Service (AWS DMS) to migrate the Hive metastore into Amazon S3. Configure AWS Glue Data Catalog to scan Amazon S3 to produce the data catalog.
B
Configure a Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use AWS Glue Data Catalog to store the company's data catalog as an external data catalog.
C
Configure an external Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use Amazon Aurora MySQL to store the company's data catalog.
D
Configure a new Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use the new metastore as the company's data catalog.

QUESTION 66

A global ecommerce company occasionally receives customer data files in its Amazon S3 data lake. The company needs to automatically detect new data and mask sensitive data before making the data available to the company's analytics team.

Which solution will meet these requirements with the LEAST operational overhead?

A
Configure Amazon S3 Event Notifications to detect the new data and trigger an AWS Glue job. Use Amazon Macie to detect and mask the sensitive data. Store the processed data in Amazon Redshift.
B
Use Amazon Managed Streaming for Apache Kafka (Amazon MSK) to detect incoming data. Use Amazon EMR workflows to detect and mask sensitive data. Store the processed data in Amazon S3.
C
Use Amazon Kinesis Data Streams to capture new data. Use Amazon Comprehend to detect and mask the sensitive data. Store the processed data in Amazon DynamoDB tables.
D
Use Amazon EventBridge to detect new data and run AWS Glue workflows. Use AWS Glue DataBrew to detect and mask the sensitive data. Store the processed data in an S3 bucket.

QUESTION 67

A company needs to set up a data catalog and metadata management for data sources that run in the AWS Cloud. The company will use the data catalog to maintain the metadata of all the objects that are in a set of data stores. The data stores include structured sources such as Amazon RDS and Amazon Redshift. The data stores also include semistructured sources such as JSON files and .xml files that are stored in Amazon S3.

The company needs a solution that will update the data catalog on a regular basis. The solution also must detect changes to the source metadata.

Which solution will meet these requirements with the LEAST operational overhead?

A
Use Amazon Aurora as the data catalog. Create AWS Lambda functions that will connect to the data catalog. Configure the Lambda functions to gather the metadata information from multiple sources and to update the Aurora data catalog. Schedule the Lambda functions to run periodically.
B
Use the AWS Glue Data Catalog as the central metadata repository. Use AWS Glue crawlers to connect to multiple data stores and to update the Data Catalog with metadata changes. Schedule the crawlers to run periodically to update the metadata catalog.
C
Use Amazon DynamoDB as the data catalog. Create AWS Lambda functions that will connect to the data catalog. Configure the Lambda functions to gather the metadata information from multiple sources and to update the DynamoDB data catalog. Schedule the Lambda functions to run periodically.
D
Use the AWS Glue Data Catalog as the central metadata repository. Extract the schema for Amazon RDS and Amazon Redshift sources, and build the Data Catalog. Use AWS Glue crawlers for data that is in Amazon S3 to infer the schema and to automatically update the Data Catalog.

QUESTION 68

A company is designing an AWS analytics platform to centralize data that contains personally identifiable information (PII). The platform must support SQL-based queries on semi-structured data that is stored in Amazon S3. The platform must also provide fine-grained access control, including column-level restrictions and data masking, based on user roles. The platform requires minimal operational overhead and must scale securely as the company adds more analysts and data sources.

Which solution will meet these requirements?

A
Use Amazon Athena to query the S3 data. Manage access to PII columns by using IAM policies. Enforce masking by restricting access to views with predefined column sets.
B
Use AWS Glue to catalog datasets in Amazon S3. Provide SQL access through Amazon Athena. Implement field-level masking by using custom views and conditional logic in queries.
C
Use AWS Lake Formation to register the S3 data. Enforce column-level permissions and dynamic data masking with Lake Formation policies. Query by using Amazon Athena.
D
Use Amazon Redshift Spectrum to query the S3 data. Implement column-level restrictions by using Amazon Redshift views. Apply masking logic in the query results.

QUESTION 69

A data engineer must orchestrate a data pipeline that consists of one AWS Lambda function and one AWS Glue job. The solution must integrate with AWS services.

Which solution will meet these requirements with the LEAST management overhead?

A
Use an AWS Step Functions workflow that includes a state machine. Configure the state machine to run the Lambda function and then the AWS Glue job.
B
Use an Apache Airflow workflow that is deployed on an Amazon EC2 instance. Define a directed acyclic graph (DAG) in which the first task is to call the Lambda function and the second task is to call the AWS Glue job.
C
Use an AWS Glue workflow to run the Lambda function and then the AWS Glue job.
D
Use an Apache Airflow workflow that is deployed on Amazon Elastic Kubernetes Service (Amazon EKS). Define a directed acyclic graph (DAG) in which the first task is to call the Lambda function and the second task is to call the AWS Glue job.

QUESTION 70

A company has a data pipeline that processes transaction data in real time. The company needs a notification system that alerts different teams based on the type of processing error without any delay. For security-related errors, the system must immediately notify the security team. For data validation errors, the system must notify the data quality team. For system errors, the system must notify the operations team.

Which solution will meet these requirements with the LEAST operational overhead?

A
Create an Amazon Simple Notification Service (Amazon SNS) topic with an AWS Lambda function subscriber that evaluates the error type and forwards the error to the appropriate email addresses.
B
Configure Amazon EventBridge rules with distinct event patterns for each error type. Route each error type to a dedicated Amazon Simple Notification Service (Amazon SNS) topic for team-specific alerts.
C
Use Amazon Simple Queue Service (Amazon SQS) with message attributes to categorize errors. Allow each team to poll their respective SQS queue for relevant errors.
D
Set up Amazon CloudWatch alarms with different metrics for each error type. Invoke a different Amazon Simple Notification Service (Amazon SNS) notification each time a metrics threshold is crossed.

QUESTION 71

A data engineer has a one-time task to read data from objects that are in Apache Parquet format in an Amazon S3 bucket. The data engineer needs to query only one column of the data.

Which solution will meet these requirements with the LEAST operational overhead?

A
Configure an AWS Lambda function to load data from the S3 bucket into a pandas dataframe. Write a SQL SELECT statement on the dataframe to query the required column.
B
Use S3 Select to write a SQL SELECT statement to retrieve the required column from the S3 objects.
C
Prepare an AWS Glue DataBrew project to consume the S3 objects and to query the required column.
D
Run an AWS Glue crawler on the S3 objects. Use a SQL SELECT statement in Amazon Athena to query the required column.

Customer Reviews

5 / 5 (15,000+ verified)
5 stars: 100%
4 stars: 0%
3 stars: 0%
2 stars: 0%
1 star: 0%

Global Community Feedback

David M.

Verified Student

"The practice engine is incredible. It feels exactly like the real testing environment and helped me build so much confidence."

Sarah J.

Premium Member

"The PDF is very well organized and the explanations for the answers are actually helpful, not just random text."

Michael C.

Verified Buyer

"I was skeptical, but the content is high quality and definitely worth the price. I passed on my first try!"
