Amazon AWS Certified Data Engineer - Associate (DEA-C01)
Get full access to the updated question bank and confidently prepare for your exam.
Vendor: Amazon
Certification: Associate Certifications
Content: 351 Qs
Status: Verified
Updated: 1 day ago
Test the Practice Engine
Experience our interactive testing environment with free demo questions
Premium Bundle
Complete Success Suite
Save $44 Instantly
- Full PDF + Interactive Engine: Everything you need to pass
- All Advanced Question Types: Drag & Drop, Hotspots, Case Studies
- Priority 24/7 Expert Support: Direct line to certification leads
- 90 Days Free Priority Updates: Stay current as exams change
Success Metric
98.4% Pass Rate
Standard Simulation
Practice Engine
One-Time Payment
- Web-Based (Zero Install)
- Real Testing Environment: Virtual & Practice Modes
- Interactive Engine: Drag & Drop, Hotspots
- 60 Days Free Updates
Compatible with All Devices
Basic Tier
PDF Study Guide
Digital Access
- Exam Questions (PDF)
- Mobile Friendly
- 60 Days Updates
Verified 71-Question Preview (DEA-C01)
Exam Overview
The AWS Certified Data Engineer - Associate certification is a pivotal credential for professionals aiming to validate their expertise in designing, building, managing, and monitoring robust data pipelines on the Amazon Web Services (AWS) platform. This certification demonstrates a deep understanding of core AWS data services, enabling candidates to effectively ingest, transform, store, and process data for analytical workloads. Achieving this certification not only signifies your technical proficiency in modern data engineering practices but also significantly enhances your marketability, opening doors to advanced career opportunities and positioning you as a valuable asset in organizations leveraging cloud-native data solutions. It's a testament to your ability to drive data-driven insights and innovation.
Questions: 65
Passing Score: 720/1000
Duration: 130 Minutes
Difficulty: Intermediate
Level: Associate
Common Questions
Is the material up to date?
Yes. We update our question bank weekly to match the latest Amazon standards. You get free updates for 90 days.
What format do I get?
You get instant access to both the **PDF** (for reading) and our **Premium Test Engine** (for exam simulation).
Is there a guarantee?
Absolutely. If you fail the DEA-C01 exam using our materials, we offer a full money-back guarantee.
When do I get the download?
Instantly. The download link is available in your dashboard immediately after payment is confirmed.
Free Study Guide Samples
Previewing updated DEA-C01 bank (71 Questions).
A financial services company stores financial data in Amazon Redshift. A data engineer wants to run real-time queries on the financial data to support a web-based trading application. The data engineer wants to run the queries from within the trading application.
Which solution will meet these requirements with the LEAST operational overhead?
Correct Option: B
Reasoning: The Amazon Redshift Data API provides a simple HTTP interface for applications to query Redshift without managing JDBC/ODBC drivers or persistent connections. It handles authentication through IAM or AWS Secrets Manager, connection management, and result retrieval, so it offers the least operational overhead for integrating queries directly into a web application.
Why the other choices are incorrect:
- Option A is incorrect: Redshift does not natively support direct WebSocket connections for queries. Implementing this would require building an intermediary service to translate requests, significantly increasing development and operational overhead.
- Option C is incorrect: Managing JDBC connections, drivers, and connection pools within a web application introduces more complexity and operational overhead compared to the fully managed Data API, especially for concurrent, real-time requests.
- Option D is incorrect: This option requires moving data from Redshift to S3, introducing ETL processes, potential data staleness, and fundamentally changing the query target. This increases, rather than decreases, operational overhead for querying the data initially in Redshift.
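The Data API flow described above can be sketched with boto3's `redshift-data` client. This is a minimal sketch, not the exam's reference solution: the helper names and the polling interval are assumptions, and authentication here uses the `DbUser` option (temporary credentials) rather than a Secrets Manager ARN.

```python
import time


def build_execute_statement_request(cluster_id, database, db_user, sql, params=None):
    """Build keyword arguments for the Redshift Data API ExecuteStatement call."""
    request = {
        "ClusterIdentifier": cluster_id,
        "Database": database,
        "DbUser": db_user,
        "Sql": sql,
    }
    if params:
        # Named parameters are referenced in the SQL text as :name.
        request["Parameters"] = [
            {"name": name, "value": str(value)} for name, value in params.items()
        ]
    return request


def run_query(client, request, poll_seconds=0.5):
    """Submit a statement, poll until it finishes, and return the first page of rows."""
    statement_id = client.execute_statement(**request)["Id"]
    while True:
        status = client.describe_statement(Id=statement_id)["Status"]
        if status == "FINISHED":
            return client.get_statement_result(Id=statement_id)["Records"]
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(f"Statement {statement_id} ended with status {status}")
        time.sleep(poll_seconds)
```

In an application, `client` would typically be `boto3.client("redshift-data")`. The Data API is asynchronous, which is why the sketch polls `describe_statement`; large result sets are paginated and would need the `NextToken` handling omitted here.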
A company uses Amazon Athena for one-time queries against data that is in Amazon S3. The company has several use cases. The company must implement permission controls to separate query processes and access to query history among users, teams, and applications that are in the same AWS account.
Which solution will meet these requirements?
Correct Option: B
Reasoning: Athena workgroups separate query execution, control output locations, and manage query history. IAM policies with tags enforce granular permissions on workgroups, effectively isolating query processes and history access for different users, teams, or applications.
Why the other choices are incorrect:
- Option A is incorrect: S3 bucket policies control access to the raw data, not Athena query execution settings, process separation, or query history within Athena itself.
- Option C is incorrect: IAM roles define permissions but don't provide the intrinsic separation of query processes and history that Athena workgroups offer within the Athena service.
- Option D is incorrect: AWS Glue Data Catalog resource policies manage access to table metadata. They do not control Athena's query execution environment, resource limits, or query history separation.
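A tag-based IAM policy for workgroups, as described in the reasoning above, might look like the following sketch. The tag key `team` and the function name are hypothetical; the action names and the `aws:ResourceTag` condition key are standard IAM elements.

```python
import json


def workgroup_policy(region, account_id, team_tag_value):
    """Return an IAM policy that allows query actions only on workgroups tagged for one team."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "athena:StartQueryExecution",
                    "athena:StopQueryExecution",
                    "athena:GetQueryExecution",
                    "athena:GetQueryResults",
                    "athena:ListQueryExecutions",
                    "athena:GetWorkGroup",
                ],
                "Resource": f"arn:aws:athena:{region}:{account_id}:workgroup/*",
                # "team" is a hypothetical tag key chosen for this example.
                "Condition": {
                    "StringEquals": {"aws:ResourceTag/team": team_tag_value}
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(workgroup_policy("us-east-1", "111122223333", "analytics"), indent=2))
```

Because query history is scoped to a workgroup, a principal holding only this policy can list and read executions in workgroups carrying its team's tag and nothing else.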
A data engineer needs to schedule a workflow that runs a set of AWS Glue jobs every day. The data engineer does not require the Glue jobs to run or finish at a specific time.
Which solution will run the Glue jobs in the MOST cost-effective way?
Correct Option: A
Reasoning: The FLEX execution class is specifically designed for non-time-sensitive jobs, leveraging spare capacity to significantly lower costs per DPU-hour compared to the Standard class. Since the scenario explicitly states no specific run or finish time is required, Flex provides the most cost-effective solution.
Why the other choices are incorrect:
- Option B is incorrect: AWS Glue job properties do not directly offer "Spot Instance type" as a configurable option for workers. While Glue may use Spot Instances internally for Flex, it's not a user-selectable setting.
- Option C is incorrect: The STANDARD execution class offers predictable performance but is more expensive per DPU-hour than Flex. Since time sensitivity is not a requirement, this is not the most cost-effective choice.
- Option D is incorrect: Choosing the latest Glue version can bring performance enhancements, but it doesn't directly provide the significant, dedicated cost reduction mechanism for non-time-sensitive workloads that the FLEX execution class offers.
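As a rough illustration, the execution class can be set per run through the `ExecutionClass` parameter of the Glue `StartJobRun` API. The helper names below are assumptions, and the cost function takes the per-DPU-hour rate as an input because actual rates vary by Region and are not stated in this guide.

```python
def flex_job_run_request(job_name, arguments=None):
    """Build keyword arguments for glue.start_job_run using the FLEX execution class."""
    request = {"JobName": job_name, "ExecutionClass": "FLEX"}
    if arguments:
        request["Arguments"] = dict(arguments)
    return request


def run_cost(dpus, hours, rate_per_dpu_hour):
    """Estimate the cost of one run; the rate must be supplied by the caller."""
    return dpus * hours * rate_per_dpu_hour
```

With a boto3 Glue client, the request would be passed as `glue.start_job_run(**flex_job_run_request("nightly-etl"))`, where `nightly-etl` is a placeholder job name. Comparing `run_cost` at the Standard and Flex rates for the same DPU count and duration gives the expected saving.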
A data engineer needs to create an AWS Lambda function that converts the format of data from .csv to Apache Parquet. The Lambda function must run only if a user uploads a .csv file to an Amazon S3 bucket.
Which solution will meet these requirements with the LEAST operational overhead?
Correct Option: A
Reasoning: This solution correctly uses s3:ObjectCreated:* to trigger only on new object uploads, and a suffix filter for .csv ensures only relevant files invoke Lambda. Directly setting the Lambda ARN as the destination is the most direct and lowest-overhead integration for S3 event notifications.
Why the other choices are incorrect:
- Option B is incorrect: s3:ObjectTagging:* triggers on tag changes, not object creation. Relying on tags for file type identification adds unnecessary complexity and operational overhead.
- Option C is incorrect: s3:* is too broad an event type, triggering for all S3 events. While the filter limits Lambda execution, it generates notifications for many irrelevant actions, increasing S3 service overhead compared to s3:ObjectCreated:*.
- Option D is incorrect: Introducing an Amazon SNS topic adds an additional component to manage (SNS topic, subscriptions, permissions). This increases operational overhead compared to directly invoking the Lambda function from the S3 event notification.
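The notification setup in the correct option can be sketched as the configuration document that S3 expects. The function name and the ARN in the usage note are placeholders; the dictionary layout follows the S3 `PutBucketNotificationConfiguration` request shape.

```python
def csv_upload_notification(lambda_arn):
    """Build an S3 notification configuration that invokes Lambda for new .csv objects only."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                # Fire on any object-creation event (PUT, POST, COPY, multipart upload).
                "Events": ["s3:ObjectCreated:*"],
                # Restrict invocations to keys that end in .csv.
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": ".csv"}]}
                },
            }
        ]
    }
```

With boto3 this would be applied as `s3.put_bucket_notification_configuration(Bucket=..., NotificationConfiguration=csv_upload_notification(arn))`. The Lambda function's resource policy must also allow `s3.amazonaws.com` to invoke it, which this sketch does not cover.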
A data engineer needs Amazon Athena queries to finish faster. The data engineer notices that all the files the Athena queries use are currently stored in uncompressed .csv format. The data engineer also notices that users perform most queries by selecting a specific column.
Which solution will MOST speed up the Athena query performance?
Correct Option: C
**Change the data format from .csv to Apache Parquet. Apply Snappy compression.**
Reasoning: Parquet is a columnar format, allowing Athena to read only selected columns, greatly reducing I/O and scan time. Snappy compression further decreases data scanned and improves performance. This combination directly addresses the "selecting a specific column" pattern and provides the most significant speedup.
Why the other choices are incorrect:
- Option A is incorrect: JSON is a row-based format; it still requires reading entire rows even when only specific columns are needed, negating the primary benefit of columnar storage. Snappy helps, but the format limits optimization.
- Option B is incorrect: While Snappy compression reduces data scanned, the underlying CSV format is row-based. Athena still reads full rows, missing the significant performance gains from columnar pruning for "selecting specific columns."
- Option D is incorrect: Gzip compresses data, but CSV remains a row-based format, preventing columnar pruning. Gzip also has slower decompression than Snappy in parallel environments like Athena, potentially hindering performance.
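One way to perform the conversion is an Athena CTAS statement that rewrites the CSV-backed table as Snappy-compressed Parquet. The sketch below only builds the SQL text; the table names and S3 location in the usage note are placeholders, and the helper interpolates its inputs directly, so it assumes trusted values.

```python
def parquet_ctas(source_table, target_table, external_location):
    """Return an Athena CTAS statement that copies a table into Snappy-compressed Parquet."""
    return (
        f"CREATE TABLE {target_table}\n"
        "WITH (\n"
        "  format = 'PARQUET',\n"
        "  write_compression = 'SNAPPY',\n"
        f"  external_location = '{external_location}'\n"
        ")\n"
        f"AS SELECT * FROM {source_table}"
    )


if __name__ == "__main__":
    print(parquet_ctas("sales_csv", "sales_parquet", "s3://example-bucket/sales_parquet/"))
```

After the new table exists, a query such as `SELECT amount FROM sales_parquet` reads only the `amount` column chunks instead of every full row.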
A manufacturing company collects sensor data from its factory floor to monitor and enhance operational efficiency. The company uses Amazon Kinesis Data Streams to publish the data that the sensors collect to a data stream. Then Amazon Kinesis Data Firehose writes the data to an Amazon S3 bucket.
The company needs to display a real-time view of operational efficiency on a large screen in the manufacturing facility.
Which solution will meet these requirements with the LOWEST latency?
Correct Option: A
Reasoning: Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) processes data directly from Kinesis Data Streams with very low latency. Amazon Timestream is a purpose-built time-series database optimized for high ingestion rates and fast queries, which suits real-time dashboards built with tools like Grafana. This combination provides the most direct, real-time, and lowest-latency solution.
Why the other choices are incorrect:
- Option B is incorrect: This approach introduces latency by writing data to S3 first, then triggering Lambda. Amazon Aurora is a relational database not optimized for time-series data or the lowest-latency ingestion/querying compared to Timestream.
- Option C is incorrect: While Flink is good, creating an additional Kinesis Data Firehose stream after Flink to Timestream adds an unnecessary processing hop, increasing latency. Flink can directly write to Timestream. Amazon QuickSight typically has higher dashboard refresh latency than Grafana with Timestream for pure real-time streaming.
- Option D is incorrect: Reading from S3 using AWS Glue, even with bookmarks, introduces higher latency as S3 is object storage and Glue is typically for batch/micro-batch processing, not true real-time streaming at the lowest latency.
A company stores daily records of the financial performance of investment portfolios in .csv format in an Amazon S3 bucket. A data engineer uses AWS Glue crawlers to crawl the S3 data.
The data engineer must make the S3 data accessible daily in the AWS Glue Data Catalog.
Which solution will meet these requirements?
Correct Option: B
Reasoning: An IAM role with the AWSGlueServiceRole policy grants the necessary permissions for a Glue crawler to read S3 data and write metadata to the AWS Glue Data Catalog. A daily schedule ensures the data is accessible daily. Specifying a database name directs the crawler's output to the Data Catalog.
Why the other choices are incorrect:
- Option A is incorrect: The AmazonS3FullAccess policy is insufficient because Glue crawlers require AWSGlueServiceRole to interact with the Data Catalog. Also, crawlers output to the Data Catalog, not back to an S3 path.
- Option C is incorrect: The AmazonS3FullAccess policy is insufficient for Glue operations. Allocating DPUs is a resource aspect, not the method to schedule a daily run, which requires setting a schedule.
- Option D is incorrect: Allocating DPUs is not the method for daily scheduling. More critically, Glue crawlers populate metadata in the AWS Glue Data Catalog; they do not output data to a new path in an S3 bucket.
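The crawler settings named in the reasoning (role, target database, daily schedule) map onto the Glue `CreateCrawler` request roughly as follows. The function name and the default run hour are assumptions made for this sketch.

```python
def daily_crawler_request(name, role_arn, database_name, s3_path, hour_utc=1):
    """Build keyword arguments for glue.create_crawler with a once-a-day schedule."""
    if not 0 <= hour_utc <= 23:
        raise ValueError("hour_utc must be between 0 and 23")
    return {
        "Name": name,
        # Role that carries the AWSGlueServiceRole policy plus read access to the bucket.
        "Role": role_arn,
        # Data Catalog database that receives the crawled table definitions.
        "DatabaseName": database_name,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Glue cron fields: minutes hours day-of-month month day-of-week year.
        "Schedule": f"cron(0 {hour_utc} * * ? *)",
    }
```

The resulting dictionary would be passed to a boto3 Glue client as `glue.create_crawler(**request)`.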
A company loads transaction data for each day into Amazon Redshift tables at the end of each day. The company wants to have the ability to track which tables have been loaded and which tables still need to be loaded.
A data engineer wants to store the load statuses of Redshift tables in an Amazon DynamoDB table. The data engineer creates an AWS Lambda function to publish the details of the load statuses to DynamoDB.
How should the data engineer invoke the Lambda function to write load statuses to the DynamoDB table?
Correct Option: B
Reasoning: After performing the Redshift load using the Data API, the orchestrating application can publish a custom event detailing the load status to Amazon EventBridge. EventBridge's primary purpose is to route events, and a rule can easily be configured to filter for this specific event pattern and invoke the Lambda function. This provides a robust, scalable, and event-driven solution for tracking load statuses.
Why the other choices are incorrect:
- Option A is incorrect: Relying on a second Lambda function invoked by CloudWatch events would typically involve polling or scheduling, which is less efficient and not truly event-driven in response to the completion of a Redshift load.
- Option C is incorrect: While the application could publish to SQS, EventBridge is generally a more suitable event bus for "publishing events" and routing "load statuses," offering broader routing capabilities and a clearer semantic fit for event-driven architectures than a simple queue.
- Option D is incorrect: AWS CloudTrail logs API calls for auditing. While Redshift API calls are logged, using CloudTrail for real-time application-specific event triggers for data load completion is inefficient, complex, and not its intended primary use case.
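The event-driven flow above has three small pieces: the custom event the loader publishes, the rule pattern that matches it, and the Lambda-side mapping into a DynamoDB item. The sketch below shows all three as plain data. The source name, detail type, and attribute names are hypothetical.

```python
import json

SOURCE = "custom.redshift.loader"           # hypothetical event source name
DETAIL_TYPE = "Redshift Table Load Status"  # hypothetical detail type


def load_status_entry(table_name, status, bus_name="default"):
    """Build one entry for events.put_events(Entries=[...])."""
    return {
        "Source": SOURCE,
        "DetailType": DETAIL_TYPE,
        # PutEvents expects Detail as a JSON string, not a dict.
        "Detail": json.dumps({"table": table_name, "status": status}),
        "EventBusName": bus_name,
    }


def load_status_rule_pattern():
    """Event pattern for the EventBridge rule that targets the Lambda function."""
    return {"source": [SOURCE], "detail-type": [DETAIL_TYPE]}


def item_from_event(event):
    """Map the event Lambda receives (detail already parsed) to a DynamoDB-style item."""
    detail = event["detail"]
    return {
        "table_name": detail["table"],
        "load_status": detail["status"],
        "updated_at": event["time"],
    }
```

Inside the Lambda handler, the item returned by `item_from_event` could be written with a DynamoDB table resource's `put_item(Item=...)`.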
A data engineer needs to securely transfer 5 TB of data from an on-premises data center to an Amazon S3 bucket. Approximately 5% of the data changes every day. Updates to the data need to be regularly propagated to the S3 bucket. The data includes files that are in multiple formats. The data engineer needs to automate the transfer process and must schedule the process to run periodically.
Which AWS service should the data engineer use to transfer the data in the MOST operationally efficient way?
Correct Option: A
**AWS DataSync**
Reasoning: DataSync is purpose-built for secure, automated, and scheduled online data transfer from on-premises storage to S3. It efficiently handles large initial datasets (5 TB) and regularly synchronizes incremental changes (5% daily), optimizing for operational efficiency and meeting all specified requirements for file formats and automation.
Why the other choices are incorrect:
- Option B is incorrect: AWS Glue is an ETL service designed for data transformation and cataloging, not for automated, incremental file transfer and synchronization from on-premises file systems to S3. It focuses on data processing rather than reliable, scheduled file replication.
- Option C is incorrect: AWS Direct Connect provides a dedicated network connection to AWS, improving bandwidth and latency. However, it is an underlying network service, not a data transfer tool. It does not automate, schedule, or manage the actual file movement or incremental updates.
- Option D is incorrect: Amazon S3 Transfer Acceleration speeds up data transfers over long distances by leveraging AWS Edge Locations. It improves transfer performance but does not provide the automation, scheduling, or incremental synchronization capabilities needed for ongoing on-premises file replication.
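A scheduled, incremental DataSync task corresponds roughly to the `CreateTask` request below. The function names and the default cron expression are assumptions for this sketch; the location ARNs would come from previously created source and destination locations.

```python
def scheduled_sync_task_request(name, source_location_arn, destination_location_arn,
                                schedule="cron(0 2 * * ? *)"):
    """Build keyword arguments for datasync.create_task with a recurring schedule."""
    return {
        "Name": name,
        "SourceLocationArn": source_location_arn,
        "DestinationLocationArn": destination_location_arn,
        "Options": {
            # Copy only data that differs between source and destination.
            "TransferMode": "CHANGED",
            "VerifyMode": "ONLY_FILES_TRANSFERRED",
        },
        "Schedule": {"ScheduleExpression": schedule},
    }


def daily_change_tb(total_tb, change_fraction):
    """Approximate volume that changes per day, in TB."""
    return total_tb * change_fraction
```

For the scenario's numbers, `daily_change_tb(5, 0.05)` comes to about 0.25 TB per day, which is the volume each scheduled run needs to move after the initial 5 TB copy.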
A company uses an on-premises Microsoft SQL Server database to store financial transaction data. The company migrates the transaction data from the on-premises database to AWS at the end of each month. The company has noticed that the cost to migrate data from the on-premises database to an Amazon RDS for SQL Server database has increased recently.
The company requires a cost-effective solution to migrate the data to AWS. The solution must cause minimal downtime for the applications that access the database.
Which AWS service should the company use to meet these requirements?
Correct Option: B
**AWS Database Migration Service (AWS DMS)**
Reasoning: AWS DMS is specifically designed for migrating databases to AWS, supporting homogeneous migrations like SQL Server to RDS for SQL Server. It offers continuous data replication (CDC) to ensure minimal downtime for applications during migration, and is a cost-effective solution for large-scale database transfers.
Why the other choices are incorrect:
- Option A is incorrect: AWS Lambda is a serverless compute service for running code, not a dedicated service for migrating databases with transactional integrity and minimal downtime. Custom solutions using Lambda would be complex and costly.
- Option C is incorrect: AWS Direct Connect provides a dedicated network connection to AWS. While it can reduce network costs and improve transfer speed, it is a networking service, not a database migration service itself. It doesn't perform the migration.
- Option D is incorrect: AWS DataSync is for migrating large amounts of file or object data between on-premises storage and AWS storage services. It is not designed for transactional database migrations to services like Amazon RDS, which require specific database replication capabilities.
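The full-load-plus-CDC behavior mentioned in the reasoning is selected through the `MigrationType` of a DMS replication task. The following sketch builds such a request; the function name, rule name, and the `dbo` default schema are assumptions, and the endpoint and instance ARNs are expected to exist already.

```python
import json


def cdc_task_request(task_id, source_endpoint_arn, target_endpoint_arn,
                     replication_instance_arn, schema="dbo"):
    """Build keyword arguments for dms.create_replication_task (full load, then ongoing CDC)."""
    table_mappings = {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-schema",
                # "%" is the DMS wildcard: include every table in the schema.
                "object-locator": {"schema-name": schema, "table-name": "%"},
                "rule-action": "include",
            }
        ]
    }
    return {
        "ReplicationTaskIdentifier": task_id,
        "SourceEndpointArn": source_endpoint_arn,
        "TargetEndpointArn": target_endpoint_arn,
        "ReplicationInstanceArn": replication_instance_arn,
        "MigrationType": "full-load-and-cdc",
        # The API expects the mapping rules as a JSON string.
        "TableMappings": json.dumps(table_mappings),
    }
```

Because CDC keeps the target in step with the source after the initial copy, applications can keep using the on-premises database until a brief cutover.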
A data engineer is building a data pipeline on AWS by using AWS Glue extract, transform, and load (ETL) jobs. The data engineer needs to process data from Amazon RDS and MongoDB, perform transformations, and load the transformed data into Amazon Redshift for analytics. The data updates must occur every hour.
Which combination of tasks will meet these requirements with the LEAST operational overhead? (Choose two.)
Premium Solution Locked
Unlock all 351 answers & explanations
A company uses an Amazon Redshift cluster that runs on RA3 nodes. The company wants to scale read and write capacity to meet demand. A data engineer needs to identify a solution that will turn on concurrency scaling.
Which solution will meet this requirement?
A data engineer must orchestrate a series of Amazon Athena queries that will run every day. Each query can run for more than 15 minutes.
Which combination of steps will meet these requirements MOST cost-effectively? (Choose two.)
A company is migrating on-premises workloads to AWS. The company wants to reduce overall operational overhead. The company also wants to explore serverless options.
The company's current workloads use Apache Pig, Apache Oozie, Apache Spark, Apache HBase, and Apache Flink. The on-premises workloads process petabytes of data in seconds. The company must maintain similar or better performance after the migration to AWS.
Which extract, transform, and load (ETL) service will meet these requirements?
A data engineer must use AWS services to ingest a dataset into an Amazon S3 data lake. The data engineer profiles the dataset and discovers that the dataset contains personally identifiable information (PII). The data engineer must implement a solution to profile the dataset and obfuscate the PII.
Which solution will meet this requirement with the LEAST operational effort?
A company maintains multiple extract, transform, and load (ETL) workflows that ingest data from the company's operational databases into an Amazon S3 based data lake. The ETL workflows use AWS Glue and Amazon EMR to process data.
The company wants to improve the existing architecture to provide automated orchestration and to require minimal manual effort.
Which solution will meet these requirements with the LEAST operational overhead?
A company currently stores all of its data in Amazon S3 by using the S3 Standard storage class.
A data engineer examined data access patterns to identify trends. During the first 6 months, most data files are accessed several times each day. Between 6 months and 2 years, most data files are accessed once or twice each month. After 2 years, data files are accessed only once or twice each year.
The data engineer needs to use an S3 Lifecycle policy to develop new data storage rules. The new storage solution must continue to provide high availability.
Which solution will meet these requirements in the MOST cost-effective way?
A company maintains an Amazon Redshift provisioned cluster that the company uses for extract, transform, and load (ETL) operations to support critical analysis tasks. A sales team within the company maintains a Redshift cluster that the sales team uses for business intelligence (BI) tasks.
The sales team recently requested access to the data that is in the ETL Redshift cluster so the team can perform weekly summary analysis tasks. The sales team needs to join data from the ETL cluster with data that is in the sales team's BI cluster.
The company needs a solution that will share the ETL cluster data with the sales team without interrupting the critical analysis tasks. The solution must minimize usage of the computing resources of the ETL cluster.
Which solution will meet these requirements?
A data engineer needs to join data from multiple sources to perform a one-time analysis job. The data is stored in Amazon DynamoDB, Amazon RDS, Amazon Redshift, and Amazon S3.
Which solution will meet this requirement MOST cost-effectively?
A company is planning to use a provisioned Amazon EMR cluster that runs Apache Spark jobs to perform big data analysis. The company requires high reliability. A big data team must follow best practices for running cost-optimized and long-running workloads on Amazon EMR. The team must find a solution that will maintain the company's current level of performance.
Which combination of resources will meet these requirements MOST cost-effectively? (Choose two.)
A company wants to implement real-time analytics capabilities. The company wants to use Amazon Kinesis Data Streams and Amazon Redshift to ingest and process streaming data at the rate of several gigabytes per second. The company wants to derive near real-time insights by using existing business intelligence (BI) and analytics tools.
Which solution will meet these requirements with the LEAST operational overhead?
A company uses an Amazon QuickSight dashboard to monitor usage of one of the company's applications. The company uses AWS Glue jobs to process data for the dashboard. The company stores the data in a single Amazon S3 bucket. The company adds new data every day.
A data engineer discovers that dashboard queries are becoming slower over time. The data engineer determines that the root cause of the slowing queries is long-running AWS Glue jobs.
Which actions should the data engineer take to improve the performance of the AWS Glue jobs? (Choose two.)
A data engineer needs to use AWS Step Functions to design an orchestration workflow. The workflow must parallel process a large collection of data files and apply a specific transformation to each file.
Which Step Functions state should the data engineer use to meet these requirements?
A company is migrating a legacy application to an Amazon S3 based data lake. A data engineer reviewed data that is associated with the legacy application. The data engineer found that the legacy data contained some duplicate information.
The data engineer must identify and remove duplicate information from the legacy application data.
Which solution will meet these requirements with the LEAST operational overhead?
A company is building an analytics solution. The solution uses Amazon S3 for data lake storage and Amazon Redshift for a data warehouse. The company wants to use Amazon Redshift Spectrum to query the data that is in Amazon S3.
Which actions will provide the FASTEST queries? (Choose two.)
A company uses Amazon RDS to store transactional data. The company runs an RDS DB instance in a private subnet. A developer wrote an AWS Lambda function with default settings to insert, update, or delete data in the DB instance.
The developer needs to give the Lambda function the ability to connect to the DB instance privately without using the public internet.
Which combination of steps will meet this requirement with the LEAST operational overhead? (Choose two.)
A company has a frontend ReactJS website that uses Amazon API Gateway to invoke REST APIs. The APIs perform the functionality of the website. A data engineer needs to write a Python script that can be occasionally invoked through API Gateway. The code must return results to API Gateway.
Which solution will meet these requirements with the LEAST operational overhead?
A company has a production AWS account that runs company workloads. The company's security team created a security AWS account to store and analyze security logs from the production AWS account. The security logs in the production AWS account are stored in Amazon CloudWatch Logs.
The company needs to use Amazon Kinesis Data Streams to deliver the security logs to the security AWS account.
Which solution will meet these requirements?
A company uses Amazon S3 to store semi-structured data in a transactional data lake. Some of the data files are small, but other data files are tens of terabytes.
A data engineer must perform a change data capture (CDC) operation to identify changed data from the data source. The data source sends a full snapshot as a JSON file every day and ingests the changed data into the data lake.
Which solution will capture the changed data MOST cost-effectively?
A data engineer runs Amazon Athena queries on data that is in an Amazon S3 bucket. The Athena queries use AWS Glue Data Catalog as a metadata table.
The data engineer notices that the Athena query plans are experiencing a performance bottleneck. The data engineer determines that the cause of the performance bottleneck is the large number of partitions that are in the S3 bucket. The data engineer must resolve the performance bottleneck and reduce Athena query planning time.
Which solutions will meet these requirements? (Choose two.)
A company uses Amazon Athena to run SQL queries for extract, transform, and load (ETL) tasks by using Create Table As Select (CTAS). The company must use Apache Spark instead of SQL to generate analytics.
Which solution will give the company the ability to use Spark to access Athena?
A company uses Amazon Redshift for its data warehouse. The company must automate refresh schedules for Amazon Redshift materialized views.
Which solution will meet this requirement with the LEAST effort?
A company stores data from an application in an Amazon DynamoDB table that operates in provisioned capacity mode. The workloads of the application have predictable throughput load on a regular schedule. Every Monday, there is an immediate increase in activity early in the morning. The application has very low usage during weekends.
The company must ensure that the application performs consistently during peak usage times.
Which solution will meet these requirements in the MOST cost-effective way?
A company stores data in a data lake that is in Amazon S3. Some data that the company stores in the data lake contains personally identifiable information (PII). Multiple user groups need to access the raw data. The company must ensure that user groups can access only the PII that they require.
Which solution will meet these requirements with the LEAST effort?
A data engineer must build an extract, transform, and load (ETL) pipeline to process and load data from 10 source systems into 10 tables that are in an Amazon Redshift database. All the source systems generate .csv, JSON, or Apache Parquet files every 15 minutes. The source systems all deliver files into one Amazon S3 bucket. The file sizes range from 10 MB to 20 GB. The ETL pipeline must function correctly despite changes to the data schema.
Which data pipeline solutions will meet these requirements? (Choose two.)
A company has used an Amazon Redshift table that is named Orders for 6 months. The company performs weekly updates and deletes on the table. The table has an interleaved sort key on a column that contains AWS Regions.
The company wants to reclaim disk space so that the company will not run out of storage space. The company also wants to analyze the sort key column.
Which Amazon Redshift command will meet these requirements?
A company stores details about transactions in an Amazon S3 bucket. The company wants to log all writes to the S3 bucket into another S3 bucket that is in the same AWS Region.
Which solution will meet this requirement with the LEAST operational effort?
A data engineer needs to maintain a central metadata repository that users access through Amazon EMR and Amazon Athena queries. The repository needs to provide the schema and properties of many tables. Some of the metadata is stored in Apache Hive. The data engineer needs to import the metadata from Hive into the central metadata repository.
Which solution will meet these requirements with the LEAST development effort?
A company needs to build a data lake in AWS. The company must provide row-level data access and column-level data access to specific teams. The teams will access the data by using Amazon Athena, Amazon Redshift Spectrum, and Apache Hive from Amazon EMR.
Which solution will meet these requirements with the LEAST operational overhead?
A manufacturing company wants to collect data from sensors. A data engineer needs to implement a solution that ingests sensor data in near real time.
The solution must store the data to a persistent data store. The solution must store the data in nested JSON format. The company must have the ability to query from the data store with a latency of less than 10 milliseconds.
Which solution will meet these requirements with the LEAST operational overhead?
A data engineering team is using an Amazon Redshift data warehouse for operational reporting. The team wants to prevent performance issues that might result from long-running queries. A data engineer must choose a system table in Amazon Redshift to record anomalies when a query optimizer identifies conditions that might indicate performance issues.
Which table views should the data engineer use to meet this requirement?
A company is migrating its database servers from Amazon EC2 instances that run Microsoft SQL Server to Amazon RDS for Microsoft SQL Server DB instances. The company's analytics team must export large data elements every day until the migration is complete. The data elements are the result of SQL joins across multiple tables. The data must be in Apache Parquet format. The analytics team must store the data in Amazon S3.
Which solution will meet these requirements in the MOST operationally efficient way?
A data engineer needs to build an extract, transform, and load (ETL) job. The ETL job will process daily incoming .csv files that users upload to an Amazon S3 bucket. The size of each S3 object is less than 100 MB.
Which solution will meet these requirements MOST cost-effectively?
A company has multiple applications that use datasets that are stored in an Amazon S3 bucket. The company has an ecommerce application that generates a dataset that contains personally identifiable information (PII). The company has an internal analytics application that does not require access to the PII.
To comply with regulations, the company must not share PII unnecessarily. A data engineer needs to implement a solution that will redact PII dynamically, based on the needs of each application that accesses the dataset.
Which solution will meet the requirements with the LEAST operational overhead?
A data engineer is configuring Amazon SageMaker Studio to use AWS Glue interactive sessions to prepare data for machine learning (ML) models.
The data engineer receives an access denied error when the data engineer tries to prepare the data by using SageMaker Studio.
Which change should the engineer make to gain access to SageMaker Studio?
A healthcare company uses Amazon Kinesis Data Streams to stream real-time health data from wearable devices, hospital equipment, and patient records.
A data engineer needs to find a solution to process the streaming data. The data engineer needs to store the data in an Amazon Redshift Serverless warehouse. The solution must support near real-time analytics of the streaming data and the previous day's data.
Which solution will meet these requirements with the LEAST operational overhead?
A company is planning to upgrade its Amazon Elastic Block Store (Amazon EBS) General Purpose SSD storage from gp2 to gp3. The company wants to prevent any interruptions in its Amazon EC2 instances that will cause data loss during the migration to the upgraded storage.
Which solution will meet these requirements with the LEAST operational overhead?
A data engineer must manage the ingestion of real-time streaming data into AWS. The data engineer wants to perform real-time analytics on the incoming streaming data by using time-based aggregations over a window of up to 30 minutes. The data engineer needs a solution that is highly fault tolerant.
Which solution will meet these requirements with the LEAST operational overhead?
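The scenario does not say which kind of window is meant, and the sketch below does not settle which AWS service the question is after. As a rough illustration only, assuming fixed (tumbling) 30-minute windows and events given as hypothetical `(epoch_seconds, value)` pairs, a time-based aggregation could look like this:

```python
from collections import defaultdict

# Assumption: tumbling windows of 30 minutes, the upper bound in the scenario.
WINDOW_SECONDS = 30 * 60


def tumbling_window_sums(events, window_seconds=WINDOW_SECONDS):
    """Sum values per fixed window.

    events: iterable of (epoch_seconds, value) pairs (hypothetical shape).
    Returns a dict that maps each window start time to the sum of its values.
    """
    sums = defaultdict(float)
    for timestamp, value in events:
        # Round the timestamp down to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        sums[window_start] += value
    return dict(sums)


events = [(0, 1.0), (1799, 2.0), (1800, 5.0), (3599, 1.0), (3600, 4.0)]
print(tumbling_window_sums(events))  # {0: 3.0, 1800: 6.0, 3600: 4.0}
```

A managed streaming service would also have to handle late or out-of-order events and checkpointing for fault tolerance, which this in-memory sketch leaves out.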
A company extracts approximately 1 TB of data every day from data sources such as SAP HANA, Microsoft SQL Server, MongoDB, Apache Kafka, and Amazon DynamoDB. Some of the data sources have undefined data schemas or data schemas that change.
A data engineer must implement a solution that can detect the schema for these data sources. The solution must extract, transform, and load the data to an Amazon S3 bucket. The company has a service level agreement (SLA) to load the data into the S3 bucket within 15 minutes of data creation.
Which solution will meet these requirements with the LEAST operational overhead?
A company stores datasets in JSON format and .csv format in an Amazon S3 bucket. The company has Amazon RDS for Microsoft SQL Server databases, Amazon DynamoDB tables that are in provisioned capacity mode, and an Amazon Redshift cluster. A data engineering team must develop a solution that will give data scientists the ability to query all data sources by using syntax similar to SQL.
Which solution will meet these requirements with the LEAST operational overhead?
A data engineer needs to use an Amazon QuickSight dashboard that is based on Amazon Athena queries on data that is stored in an Amazon S3 bucket. When the data engineer connects to the QuickSight dashboard, the data engineer receives an error message that indicates insufficient permissions.
Which factors could cause the permissions-related errors? (Choose two.)
A company receives a daily file that contains customer data in .xls format. The company stores the file in Amazon S3. The daily file is approximately 2 GB in size.
A data engineer concatenates the column in the file that contains customer first names and the column that contains customer last names. The data engineer needs to determine the number of distinct customers in the file.
Which solution will meet this requirement with the LEAST operational effort?
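Independent of which AWS service the question expects, the computation itself is small. A minimal sketch, assuming the rows have already been read from the .xls file into hypothetical `(first_name, last_name)` tuples and that names should match regardless of case or surrounding whitespace:

```python
def count_distinct_customers(rows):
    """Count distinct customers keyed on the concatenated first and last name."""
    full_names = set()
    for first_name, last_name in rows:
        # Assumption: trim outer whitespace and ignore case so that
        # ("ana ", "lopez") and ("Ana", "Lopez") count as one customer.
        full_name = f"{first_name.strip()} {last_name.strip()}".casefold()
        full_names.add(full_name)
    return len(full_names)


sample_rows = [
    ("Ana", "Lopez"),
    ("ana ", "lopez"),
    ("Ben", "Okafor"),
    ("Ana", "Lopes"),
]
print(count_distinct_customers(sample_rows))  # 3
```

Two different people who share the same first and last name are counted once here; that limitation comes from the scenario's choice of key, not from the code.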
During a security review, a company identified a vulnerability in an AWS Glue job. The company discovered that credentials to access an Amazon Redshift cluster were hard coded in the job script.
A data engineer must remediate the security vulnerability in the AWS Glue job. The solution must securely store the credentials.
Which combination of steps should the data engineer take to meet these requirements? (Choose two.)
A company receives call logs as Amazon S3 objects that contain sensitive customer information. The company must protect the S3 objects by using encryption. The company must also use encryption keys that only specific employees can access.
Which solution will meet these requirements with the LEAST effort?
A company's data engineer needs to optimize the performance of table SQL queries. The company stores data in an Amazon Redshift cluster. The data engineer cannot increase the size of the cluster because of budget constraints.
The company stores the data in multiple tables and loads the data by using the EVEN distribution style. Some tables are hundreds of gigabytes in size. Other tables are less than 10 MB in size.
Which solution will meet these requirements?
An application uses an AWS Lambda function that is configured with managed runtimes. The Lambda function successfully writes logs to the default Amazon CloudWatch Logs log group. A data engineer wants to modify the logging behavior to show only ERROR level logs for application logs and WARN level logs for system logs.
Which solution will meet these requirements?
A financial company wants to use Amazon Athena to run on-demand SQL queries on a petabyte-scale dataset to support a business intelligence (BI) application. An AWS Glue job that runs during non-business hours updates the dataset once every day. The BI application has a standard data refresh frequency of 1 hour to comply with company policies.
A data engineer wants to cost optimize the company's use of Amazon Athena without adding any additional infrastructure costs.
Which solution will meet these requirements with the LEAST operational overhead?
A data engineer needs to build an interactive system that answers complex questions about customer feedback. The system must generate comprehensive reports that summarize feedback trends and provide natural language explanations of product issues with supporting evidence. The system must give users the ability to ask follow-up questions in natural language. The system must also dynamically explore feedback patterns without predefined categories. The data must remain within the data engineer's AWS account.
Which solution will meet these requirements with the LEAST operational overhead?
An airline company is collecting metrics about flight activities for analytics. The company is conducting a proof of concept (POC) test to show how analytics can provide insights that the company can use to increase on-time departures.
The POC test uses objects in Amazon S3 that contain the metrics in .csv format. The POC test uses Amazon Athena to query the data. The data is partitioned in the S3 bucket by date.
As the amount of data increases, the company wants to optimize the storage solution to improve query performance.
Which combination of solutions will meet these requirements? (Choose two.)
A data engineer needs to share a dataset that contains customer transaction data with a machine learning (ML) team in another AWS account. The dataset is stored in Amazon S3 and contains sensitive information that requires governance controls. The data engineer wants to ensure that the ML team can discover and request access to the dataset. The solution must maintain appropriate security controls and track data lineage.
Which solution will meet these requirements?
A security company stores IoT data that is in JSON format in an Amazon S3 bucket. The data structure can change when the company upgrades the IoT devices. The company wants to create a data catalog that includes the IoT data. The company's analytics department will use the data catalog to index the data.
Which solution will meet these requirements MOST cost-effectively?
A company has an application that is deployed on AWS. The application uses Amazon Simple Notification Service (Amazon SNS) with multiple topics. The company's security team needs to be able to audit all Publish and PublishBatch API actions for all the SNS topics. The company's application team and security team must also be able to query the audit data. The company has already established an event data store in AWS CloudTrail Lake to collect all events.
Which solution will meet these requirements with the LEAST operational overhead?
A company uses an Amazon Redshift provisioned cluster as its database. The Redshift cluster has five reserved ra3.4xlarge nodes and uses key distribution.
A data engineer notices that one of the nodes frequently has a CPU load over 90%. SQL queries that run on the node are queued. The other four nodes usually have a CPU load under 15% during daily operations.
The data engineer wants to maintain the current number of compute nodes. The data engineer also wants to balance the load more evenly across all five compute nodes.
Which solution will meet these requirements?
A company processes a CSV file that contains millions of transaction records every day. The file is stored in Amazon S3. Each transaction must be validated before updating a database. The company needs a solution that will process the data in parallel. The solution must use error handling that stops the entire process if more than 15% of the records fail validation.
Which solution will meet these requirements with the LEAST operational overhead?
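The failure rule in this scenario is easy to misread at the boundary: exactly 15% failures is still allowed, and only more than 15% stops the run. A minimal local sketch of that rule follows. It assumes a hypothetical validation rule and record shape, uses a thread pool as a stand-in for parallel processing, and does not represent any particular AWS service:

```python
from concurrent.futures import ThreadPoolExecutor

# From the scenario: stop if MORE than 15% of the records fail validation.
FAILURE_THRESHOLD_PERCENT = 15


class TooManyFailures(Exception):
    """Raised when the failure rate exceeds the tolerated percentage."""


def is_valid(record):
    # Hypothetical rule: a positive amount and a non-empty account ID.
    return record.get("amount", 0) > 0 and bool(record.get("account_id"))


def validate_all(records, max_workers=4):
    """Validate records in parallel and return only the valid ones.

    Raises TooManyFailures before anything is returned, so a caller
    never updates the database with a batch that broke the threshold.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(is_valid, records))
    failures = results.count(False)
    # Integer comparison avoids floating-point edge cases at exactly 15%.
    if failures * 100 > FAILURE_THRESHOLD_PERCENT * len(records):
        raise TooManyFailures(f"{failures} of {len(records)} records failed")
    return [record for record, ok in zip(records, results) if ok]


good = [{"account_id": "a1", "amount": 10}] * 85
bad = [{"account_id": "", "amount": 10}] * 15
print(len(validate_all(good + bad)))  # 85
```

This version checks the threshold only after every record has been validated. A production pipeline might abort as soon as the threshold can no longer be met, which this sketch does not attempt.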
A company is planning to migrate on-premises Apache Hadoop clusters to Amazon EMR. The company also needs to migrate a data catalog into a persistent storage solution.
The company currently stores the data catalog in an on-premises Apache Hive metastore on the Hadoop clusters. The company requires a serverless solution to migrate the data catalog.
Which solution will meet these requirements MOST cost-effectively?
A global ecommerce company occasionally receives customer data files in its Amazon S3 data lake. The company needs to automatically detect new data and mask sensitive data before making the data available to the company's analytics team.
Which solution will meet these requirements with the LEAST operational overhead?
A company needs to set up a data catalog and metadata management for data sources that run in the AWS Cloud. The company will use the data catalog to maintain the metadata of all the objects that are in a set of data stores. The data stores include structured sources such as Amazon RDS and Amazon Redshift. The data stores also include semistructured sources such as JSON files and .xml files that are stored in Amazon S3.
The company needs a solution that will update the data catalog on a regular basis. The solution also must detect changes to the source metadata.
Which solution will meet these requirements with the LEAST operational overhead?
A company is designing an AWS analytics platform to centralize data that contains personally identifiable information (PII). The platform must support SQL-based queries on semi-structured data that is stored in Amazon S3. The platform must also provide fine-grained access control, including column-level restrictions and data masking, based on user roles. The platform requires minimal operational overhead and must scale securely as the company adds more analysts and data sources.
Which solution will meet these requirements?
A data engineer must orchestrate a data pipeline that consists of one AWS Lambda function and one AWS Glue job. The solution must integrate with AWS services.
Which solution will meet these requirements with the LEAST management overhead?
A company has a data pipeline that processes transaction data in real time. The company needs a notification system that alerts different teams based on the type of processing error without any delay. For security-related errors, the system must immediately notify the security team. For data validation errors, the system must notify the data quality team. For system errors, the system must notify the operations team.
Which solution will meet these requirements with the LEAST operational overhead?
A data engineer has a one-time task to read data from objects that are in Apache Parquet format in an Amazon S3 bucket. The data engineer needs to query only one column of the data.
Which solution will meet these requirements with the LEAST operational overhead?
Customer Reviews
Global Community Feedback
David M.
"The practice engine is incredible. It feels exactly like the real testing environment and helped me build so much confidence."
Sarah J.
"The PDF is very well organized and the explanations for the answers are actually helpful, not just random text."
Michael C.
"I was skeptical, but the content is high quality and definitely worth the price. I passed on my first try!"