Amazon AWS Certified Machine Learning - Specialty (MLS-C01)
Get full access to the updated question bank and confidently prepare for your exam.
Vendor
Amazon
Certification
Specialty Certifications
Content
390 Qs
Status
Verified
Updated
3 days ago
Test the Practice Engine
Experience our interactive testing environment with free demo questions
Premium Bundle
Complete Success Suite
Save $44 Instantly
- Full PDF + Interactive Engine: Everything you need to pass
- All Advanced Question Types: Drag & Drop, Hotspots, Case Studies
- Priority 24/7 Expert Support: Direct line to certification leads
- 90 Days Free Priority Updates: Stay current as exams change
Success Metric
98.4% Pass Rate
Standard Simulation
Practice Engine
One-Time Payment
- Web-Based (Zero Install)
- Real Testing Environment: Virtual & Practice Modes
- Interactive Engine: Drag & Drop, Hotspots
- 60 Days Free Updates
Compatible with All Devices
Basic Tier
PDF Study Guide
Digital Access
- Exam Questions (PDF)
- Mobile Friendly
- 60 Days Updates
Verified 78-Question Preview (MLS-C01)
Verified Community
The CertoMetrics Standard.
Recommend the #1 platform for verified Amazon certification resources.
Success Network
Help a Colleague Succeed.
Invite a peer to get their own updated MLS-C01 prep kit.
Exam Overview
The AWS Certified Machine Learning - Specialty certification is a rigorous validation of your expertise in designing, implementing, deploying, and maintaining machine learning solutions on the Amazon Web Services platform. Achieving this credential signifies a deep understanding of core machine learning concepts, AWS ML services like Amazon SageMaker, and the ability to solve complex business problems using AI/ML. This certification not only enhances your professional credibility and marketability but also empowers you to drive innovation, optimize workflows, and deliver significant business value by leveraging AWS's comprehensive suite of machine learning tools. It's a testament to your capability in operationalizing ML models from data preparation to production deployment.
Questions
65
Passing Score
750/1000
Duration
170 Minutes
Difficulty
Expert
Level
Specialist
Skills Measured
Career Path
Target Roles
Common Questions
Is the material up to date?
Yes. We update our question bank weekly to match the latest Amazon standards. You get free updates for 90 days.
What format do I get?
You get instant access to both the **PDF** (for reading) and our **Premium Test Engine** (for exam simulation).
Is there a guarantee?
Absolutely. If you fail the MLS-C01 exam using our materials, we offer a full money-back guarantee.
When do I get the download?
Instantly. The download link is available in your dashboard immediately after payment is confirmed.
Free Study Guide Samples
Previewing updated MLS-C01 bank (78 Questions).
An interactive online dictionary wants to add a widget that displays words used in similar contexts. A Machine Learning Specialist is asked to provide word features for the downstream nearest neighbor model powering the widget.
What should the Specialist do to meet these requirements?
Correct Option: D
Download word embeddings pre-trained on a large corpus.
Description: Pre-trained word embeddings are vector representations of words that have been learned from massive text corpora (e.g., Wikipedia, Common Crawl, Google News). These embeddings, such as Word2Vec, GloVe, or fastText, capture semantic relationships and syntactic regularities between words based on their co-occurrence patterns in the training data. Each word is mapped to a vector in a continuous vector space, where words with similar meanings are located closer together.
Why this fits: The widget must surface words that are used in similar contexts, and the downstream nearest neighbor model needs features in which such words sit close together. Pre-trained word embeddings provide exactly that representation without any training effort. Training custom word embeddings from scratch requires a very large amount of text data and significant computational resources, which is often not feasible. Pre-trained embeddings provide a strong baseline, generalize well across many tasks, and save considerable time and effort. Options A and C describe methods for creating embeddings but are generally less efficient than leveraging pre-trained models, while Option B is related to generating synonyms, not directly creating word embeddings for semantic representation.
Example: An NLP engineer is building a sentiment analysis model for customer reviews. Instead of training word embeddings from scratch on their relatively small dataset of reviews, they can download pre-trained GloVe embeddings. They then use these pre-trained embeddings as the initial layer in their neural network, fine-tuning them slightly during the training process with their review data to adapt them to the specific domain.
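The nearest neighbor lookup that such a widget relies on can be sketched as follows. This is a minimal illustration, assuming a tiny hand-written embedding table: the `EMBEDDINGS` values and the `nearest_words` helper are hypothetical, and real pre-trained vectors such as GloVe or fastText have hundreds of dimensions and would be loaded from a downloaded file.

```python
import math

# Hypothetical 3-dimensional vectors standing in for pre-trained embeddings.
EMBEDDINGS = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.88, 0.82, 0.12],
    "apple": [0.10, 0.20, 0.95],
    "pear":  [0.12, 0.18, 0.90],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_words(word, k=1):
    """Return the k words whose vectors are most similar to `word`."""
    query = EMBEDDINGS[word]
    scored = [
        (cosine_similarity(query, vec), other)
        for other, vec in EMBEDDINGS.items()
        if other != word
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]
```

Calling `nearest_words("king")` on this toy table returns the word whose vector points in the most similar direction, which is the behavior the widget needs from the word features.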
A data engineer is preparing a dataset that a retail company will use to predict the number of visitors to stores. The data engineer created an Amazon S3 bucket. The engineer subscribed the S3 bucket to an AWS Data Exchange data product for general economic indicators. The data engineer wants to join the economic indicator data to an existing table in Amazon Athena to merge with the business data. All these transformations must finish running in 30-60 minutes.
Which solution will meet these requirements MOST cost-effectively?
Correct Option: C
Use an S3 event on the AWS Data Exchange S3 bucket to invoke an AWS Lambda function. Program the Lambda function to run an AWS Glue job that will merge the existing business data with the Athena table. Write the results back to Amazon S3.
Description: This solution outlines an event-driven, serverless data integration and transformation pipeline. AWS Data Exchange provides data products, typically delivering data files to an Amazon S3 bucket within the subscriber's account. An S3 event notification can then automatically trigger an AWS Lambda function when new data arrives. Lambda acts as an orchestrator, initiating an AWS Glue job. AWS Glue is a fully managed extract, transform, and load (ETL) service that can process large datasets, integrating new data with existing business data often cataloged in the AWS Glue Data Catalog and queryable via Amazon Athena. The processed and merged data is then written back to Amazon S3.
Why this fits: This approach is highly scalable, cost-effective, and fits a common data lake pattern.
- Event-Driven Ingestion: S3 event notifications provide an immediate, automated trigger when new data from AWS Data Exchange lands in the S3 bucket, eliminating manual checks.
- Serverless Orchestration: AWS Lambda is a lightweight, serverless compute service ideal for orchestrating tasks like triggering an AWS Glue job. It executes code in response to events without provisioning or managing servers.
- Robust ETL: AWS Glue is specifically designed for ETL workloads. It can read data from S3 (where existing and new data reside), understand schemas via the Data Catalog, perform complex merges and transformations, and write the results back to S3. This makes it ideal for merging data from various sources and formats.
- Data Lake Integration: By storing data in S3 and making it queryable via Athena, the solution leverages a data lake architecture, which is flexible and cost-efficient for analytics and machine learning workloads. Merging with an Athena table implies the existing business data is already part of this data lake.
Example: A financial institution subscribes to a market data product on AWS Data Exchange. When new daily market data files are delivered to their S3 bucket, an S3 event triggers a Lambda function. This Lambda function then starts an AWS Glue job. The Glue job reads the new market data from the S3 bucket, along with the existing historical market data stored in another S3 location (and cataloged by Athena). It merges the new data, deduplicates entries, and potentially enriches it before writing the updated, merged dataset back to a designated S3 prefix, making it immediately available for queries via Athena for financial analysts and ML models.
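A minimal sketch of the Lambda orchestration step described above, assuming a hypothetical Glue job named `merge-economic-indicators` and hypothetical job argument names (`--source_bucket`, `--source_key`). The `glue_client` parameter is an illustrative addition so the handler can be exercised without AWS credentials; inside Lambda it would default to a real boto3 Glue client.

```python
from urllib.parse import unquote_plus

# Hypothetical job name; the real job would be defined separately in AWS Glue.
GLUE_JOB_NAME = "merge-economic-indicators"


def lambda_handler(event, context, glue_client=None):
    """Start one Glue job run for every object reported in an S3 event."""
    if glue_client is None:
        import boto3  # imported lazily so the sketch can be tested offline
        glue_client = boto3.client("glue")

    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue_client.start_job_run(
            JobName=GLUE_JOB_NAME,
            # Argument names are assumptions; the Glue script must read them.
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
        run_ids.append(response["JobRunId"])
    return {"job_run_ids": run_ids}
```

The function only starts the job and returns, so the Lambda invocation stays short while the Glue job performs the merge.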
A Machine Learning Specialist is configuring Amazon SageMaker so multiple Data Scientists can access notebooks, train models, and deploy endpoints. To ensure the best operational performance, the Specialist needs to be able to track how often the Scientists are deploying models, GPU and CPU utilization on the deployed SageMaker endpoints, and all errors that are generated when an endpoint is invoked.
Which services are integrated with Amazon SageMaker to track this information? (Choose two.)
Correct Option: A,D
AWS CloudTrail
Description: AWS CloudTrail is a service that enables governance, compliance, operational auditing, and risk auditing of your AWS account. It records actions taken by a user, role, or an AWS service as events. These events include actions performed through the AWS Management Console, AWS SDKs, command line tools, and other AWS services.
Why this fits: For monitoring AWS Machine Learning workloads, CloudTrail is critical for auditing API calls made to ML services like Amazon SageMaker, Amazon Rekognition, or Amazon Comprehend. This allows organizations to track administrative activities, such as who initiated a training job, who created or modified a model endpoint, or who changed an S3 bucket used by an ML pipeline. In this scenario, the logged endpoint creation and update calls show how often the Scientists are deploying models. This provides a comprehensive security and operational audit trail.
Example: A data scientist updates the configuration of an existing SageMaker endpoint. CloudTrail records the UpdateEndpoint API call, capturing details like the user identity, the exact timestamp, the source IP address, and the specific parameters of the update. This record is invaluable for security auditing and compliance.
Amazon CloudWatch
Description: Amazon CloudWatch is a monitoring and observability service that provides data and actionable insights to monitor applications, respond to system-wide performance changes, optimize resource utilization, and get a unified view of operational health. It collects monitoring and operational data in the form of logs, metrics, and events.
Why this fits: CloudWatch is essential for real-time performance and health monitoring of ML workloads. It collects various metrics from ML services (e.g., SageMaker training job CPU/GPU utilization, memory usage, endpoint invocation latency, error rates). It allows for the creation of custom dashboards to visualize these metrics, and the configuration of alarms to proactively notify stakeholders of potential issues, performance degradation, or operational failures. It also centralizes logs from ML applications and services, aiding in troubleshooting.
Example: During a SageMaker model training job, CloudWatch monitors resource utilization metrics like CPUUtilization, GPUUtilization, and MemoryUtilization. An alarm can be configured to trigger if CPUUtilization drops below a certain threshold for an extended period, potentially indicating an inefficient training job or a stalled process, prompting investigation.
A financial services company wants to automate its loan approval process by building a machine learning (ML) model. Each loan data point contains credit history from a third-party data source and demographic information about the customer. Each loan approval prediction must come with a report that contains an explanation for why the customer was approved for a loan or was denied for a loan. The company will use Amazon SageMaker to build the model.
Which solution will meet these requirements with the LEAST development effort?
Correct Option: C
Use SageMaker Clarify to generate the report. Attach the report to the predicted results.
Description: Amazon SageMaker Clarify is a machine learning (ML) capability that helps detect potential bias in ML models and provides tools to explain model predictions. It allows for analysis of data for bias prior to training, and provides post-training explainability for model predictions. It can generate various types of reports, including bias reports and explainability reports (e.g., SHAP values), which can be integrated into model workflows.
Why this fits: The requirement is that every loan prediction comes with a report explaining why the customer was approved or denied, which is a model explainability task. SageMaker Clarify is purpose-built for this: it generates explainability reports for individual predictions as well as for overall model behavior, so no custom explanation code has to be developed. Attaching the Clarify report to the predicted results provides the required transparency and directly uses Clarify's core functionality.
Example: A financial institution uses a machine learning model to approve loan applications. To ensure fairness and compliance, they use SageMaker Clarify after model training to detect potential biases in the model's decisions based on demographic features. Clarify generates an explainability report showing the feature importance for each loan approval decision, which is then attached to the decision record to provide transparency to auditors and customers.
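To make the idea of SHAP-style feature attributions concrete, the sketch below computes exact Shapley values for a toy scoring function by enumerating every feature ordering. It is not the SageMaker Clarify API: the `WEIGHTS`, the linear `model`, and the feature names are all hypothetical, and exhaustive enumeration is only practical for a handful of features.

```python
from itertools import permutations

# Hypothetical linear credit-scoring function used only for illustration.
WEIGHTS = {"credit_score": 0.5, "income": 0.3, "age": 0.2}


def model(features):
    """Toy model: weighted sum of the feature values."""
    return sum(WEIGHTS[name] * value for name, value in features.items())


def shapley_values(instance, baseline):
    """Average each feature's marginal contribution over all orderings."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)          # start from the baseline input
        previous_output = model(current)
        for name in order:
            current[name] = instance[name]  # switch one feature to its real value
            output = model(current)
            totals[name] += output - previous_output
            previous_output = output
    return {name: total / len(orderings) for name, total in totals.items()}
```

The attributions always sum to the difference between the prediction for the instance and the prediction for the baseline, which is what lets a report state how much each feature pushed a single decision up or down.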
A retail chain has been ingesting purchasing records from its network of 20,000 stores to Amazon S3 using Amazon Kinesis Data Firehose. To support training an improved machine learning model, training records will require new but simple transformations, and some attributes will be combined. The model needs to be retrained daily.
Given the large number of stores and the legacy data ingestion, which change will require the LEAST amount of development effort?
Correct Option: D
Insert an Amazon Kinesis Data Analytics stream downstream of the Kinesis Data Firehose stream that transforms raw record attributes into simple transformed values using SQL.
Description: Amazon Kinesis Data Analytics (KDA) is a fully managed service designed to process and analyze streaming data in real time. It allows users to write SQL queries against incoming data streams to perform transformations, aggregations, and enrichments without needing to manage servers or infrastructure. It integrates with Kinesis Data Firehose as both a streaming source and a destination.
Why this fits: This solution directly addresses the requirement for transforming raw record attributes into simple transformed values using SQL, specifically within a streaming context that leverages Kinesis Data Firehose. Kinesis Data Analytics for SQL applications provides a serverless and scalable way to perform these transformations in real time, making it ideal for processing data immediately after ingestion via Kinesis Data Firehose. It eliminates the operational overhead associated with managing compute instances (like EC2 or EMR) while offering robust real-time processing capabilities through a familiar SQL interface.
Example: A retail company streams raw customer clickstream data from its website via Kinesis Data Firehose. Before archiving this data to Amazon S3, they need to extract the product ID and categorize the event type (e.g., 'view', 'add_to_cart') from a complex JSON payload and calculate a simplified session duration. An Amazon Kinesis Data Analytics SQL application can be inserted downstream of the Firehose. It reads the raw JSON, applies SQL queries like SELECT JSON_VALUE(raw_data, '$.product.id') AS productId, CASE WHEN JSON_VALUE(raw_data, '$.event') LIKE '%view%' THEN 'view' ELSE 'other' END AS eventCategory FROM SOURCE_SQL_STREAM_001 to perform the necessary transformations, and then outputs the clean, transformed data to another Kinesis Data Firehose stream for delivery to S3.
A manufacturing company wants to build a machine learning (ML) model to predict defects in the screws that the company produces. There are three types of defects: bent, brittle, and cracked. Each screw can have zero or more of these defects. The company has collected data on all the screws produced in the past 6 months, including any features or defects associated with the screws.
Which algorithm will meet this requirement?
Correct Option: B
Multilayer perceptron (MLP) with a Softmax function at the output layer
Description: A Multilayer Perceptron (MLP) is a class of feedforward artificial neural networks. It typically consists of an input layer, one or more hidden layers, and an output layer. MLPs are widely used for supervised learning tasks, including classification. The Softmax function is an activation function applied to the output layer of a neural network that is used for multi-class classification problems. It takes a vector of real numbers (the network's raw outputs, often called logits) and normalizes them into a probability distribution, where each value is between 0 and 1, and all values sum to 1. Each output value represents the predicted probability of the input belonging to a specific class.
Why this fits: The company has labeled historical data, so this is a supervised classification problem, and an MLP can learn the non-linear relationship between screw features and defect labels. A Softmax output layer turns the network's outputs into a probability distribution over classes. Softmax assumes the classes are mutually exclusive, whereas each screw here can have zero or more defects, so to use a Softmax output each of the 2^3 = 8 possible combinations of bent, brittle, and cracked (including no defect) is treated as its own class. This is distinct from binary classification (which might use a single sigmoid output) and from unsupervised methods such as K-means clustering, which do not learn from the labeled defect data.
Example: Consider a machine learning model designed to classify customer reviews into one of three sentiment categories: "positive", "neutral", or "negative". An MLP would process the textual features of a review, and its output layer, activated by a Softmax function, would produce probabilities such as [0.8 (positive), 0.15 (neutral), 0.05 (negative)], indicating a high likelihood that the review is positive.
A Machine Learning Specialist is building a convolutional neural network (CNN) that will classify 10 types of animals. The Specialist has built a series of layers in a neural network that will take an input image of an animal, pass it through a series of convolutional and pooling layers, and then finally pass it through a dense and fully connected layer with 10 nodes. The Specialist would like to get an output from the neural network that is a probability distribution of how likely it is that the input image belongs to each of the 10 classes.
Which function will produce the desired output?
Correct Option: C
Softmax
Description: Softmax is an activation function used in the output layer of a neural network primarily for multi-class classification problems. It takes a vector of arbitrary real-valued scores (logits) and squashes them into a probability distribution over multiple classes. Each element of the output vector is a probability between 0 and 1, and the sum of all elements in the output vector is 1.
Why this fits: For multi-class classification, where an input belongs to exactly one of several possible classes, Softmax is the appropriate choice for the output layer's activation function. It ensures that the model's output can be interpreted as the probability of the input belonging to each class, which is crucial for decision-making and often paired with a cross-entropy loss function during training.
Example: In an image classification task aiming to identify whether an image contains a "cat," "dog," or "bird," the output layer of the neural network would typically have three neurons, one for each class. Applying the Softmax function to the outputs of these three neurons would yield a probability distribution, for instance, [0.1, 0.85, 0.05], indicating an 85% probability that the image is a "dog."
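The Softmax computation described above can be sketched in a few lines of plain Python. The logit values in the usage note are made up for illustration; subtracting the maximum logit before exponentiating is a common numerical-stability step and does not change the result.

```python
import math


def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, `softmax([1.0, 3.0, 0.5])` assigns the largest probability to the second class, and the three probabilities add up to 1.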
A company's machine learning (ML) specialist is building a computer vision model to classify 10 different traffic signs. The company has stored 100 images of each class in Amazon S3, and the company has another 10,000 unlabeled images. All the images come from dash cameras and are a size of 224 pixels x 224 pixels. After several training runs, the model is overfitting on the training data.
Which actions should the ML specialist take to address this problem? (Select TWO.)
Correct Option: A,C
Use Amazon SageMaker Ground Truth to label the unlabeled images.
Description: Amazon SageMaker Ground Truth is a fully managed data labeling service that makes it easy to build high-quality training datasets for machine learning. It can use human annotators (via Mechanical Turk, private workforce, or vendor workforce) or machine learning to label data, accelerating the data labeling process for various data types, including images.
Why this fits: The model is overfitting because it has only 100 labeled images per class. Labeling the additional 10,000 unlabeled images with Ground Truth turns them into usable training data, and a larger, more varied training set is one of the most direct ways to reduce overfitting. Ground Truth provides a scalable and efficient way to obtain these labels, which matters for tasks like image classification, object detection, or segmentation where manual labeling can be tedious and error-prone.
Example: For a computer vision task requiring a model to identify different types of vehicles in images, a developer could use Amazon SageMaker Ground Truth to have human workers draw bounding boxes around vehicles and label them as "car," "truck," or "motorcycle" in a large dataset of raw images.
Use data augmentation to rotate and translate the labeled images.
Description: Data augmentation is a technique used to artificially increase the size and diversity of a training dataset by creating modified versions of existing data. For image data, common augmentation techniques include rotation, translation (shifting), scaling, flipping, shearing, and adjusting brightness or contrast. These transformations create new, slightly different training examples from the original labeled data.
Why this fits: When training a machine learning model, especially with a limited dataset, data augmentation helps improve the model's robustness and generalization capabilities, reducing overfitting. By exposing the model to various orientations and positions of the objects within images (e.g., via rotation and translation), the model learns to recognize patterns regardless of their exact presentation, leading to better performance on unseen data.
Example: If a dataset of cat images is used to train an image classifier, applying data augmentation could involve rotating some cat images by 15 degrees, translating others slightly to the left or right, or flipping a subset horizontally. This creates a larger, more varied training set without collecting new original images, helping the model become more resilient to variations in real-world cat photos.
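A simplified sketch of rotation and translation on an image stored as a nested list of pixel values. It only covers a 90-degree rotation and a whole-pixel shift with zero padding; arbitrary angles and sub-pixel shifts would normally come from an image library. The helper names are hypothetical.

```python
def translate_right(image, shift):
    """Shift every row right by `shift` pixels, padding with zeros on the left."""
    width = len(image[0])
    return [([0] * shift + row)[:width] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    # Reversing the row order and then transposing yields a clockwise turn.
    return [list(row) for row in zip(*image[::-1])]


def augment(image):
    """Return the original image plus two transformed copies."""
    return [image, translate_right(image, 1), rotate_90_clockwise(image)]
```

Each labeled image yields several training examples that keep the original label, which is how augmentation enlarges a small dataset without new data collection.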
A Machine Learning Specialist trained a regression model, but the first iteration needs optimizing. The Specialist needs to understand whether the model is more frequently overestimating or underestimating the target.
What option can the Specialist use to determine whether it is overestimating or underestimating the target value?
Correct Option: B
Residual plots
Description: Residual plots are graphical tools used in regression analysis to visualize the difference between the observed and predicted values (residuals) from a model. These plots typically display the residuals on the y-axis against the predicted values, an independent variable, or the order of observations on the x-axis.
Why this fits: A residual is the observed value minus the predicted value, so its sign shows the direction of each error: negative residuals mean the model overestimated the target, and positive residuals mean it underestimated. A residual plot therefore shows at a glance whether points fall mostly below or above the zero line. Residual plots are also a diagnostic tool for key regression assumptions such as linearity, homoscedasticity (constant variance of errors), and independence of errors. Systematic patterns (e.g., a curve, a funnel shape, or unusual clusters), outliers, or trends in the residuals indicate model inadequacies or the need for a different model form (e.g., non-linear). Unlike aggregate metrics such as RMSE, which discard the sign of the error, residual plots reveal the direction and distribution of the errors.
Example: After training a linear regression model to predict housing prices using Amazon SageMaker, a data scientist might generate a residual plot. If the plot shows residuals randomly scattered around zero with no clear pattern, it suggests that the linear model is a good fit and its assumptions are largely met. However, if the plot reveals a distinct "U" shape, it indicates that the linear model might be underfitting and that a non-linear relationship (e.g., quadratic term for a feature) may exist, requiring model refinement.
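The quantities that a residual plot displays can be computed directly. The sketch below uses the convention residual = observed minus predicted, so a negative residual means the model overestimated; the `residual_summary` helper and the numbers in the usage note are made up for illustration.

```python
def residual_summary(observed, predicted):
    """Count over- and underestimates and report the mean residual."""
    residuals = [y - y_hat for y, y_hat in zip(observed, predicted)]
    over = sum(1 for r in residuals if r < 0)   # prediction above the target
    under = sum(1 for r in residuals if r > 0)  # prediction below the target
    mean_residual = sum(residuals) / len(residuals)
    return {
        "overestimates": over,
        "underestimates": under,
        "mean_residual": mean_residual,
    }
```

With observed values `[10, 20, 30, 40]` and predictions `[12, 25, 29, 45]`, three of the four residuals are negative and the mean residual is below zero, so this hypothetical model mostly overestimates.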
A machine learning (ML) engineer is integrating a production model with a customer metadata repository for real-time inference. The repository is hosted in Amazon SageMaker Feature Store. The engineer wants to retrieve only the latest version of the customer metadata record for a single customer at a time.
Which solution will meet these requirements?
Correct Option: D
Use the SageMaker Feature Store GetRecord API with the record identifier.
Description: The GetRecord API in Amazon SageMaker Feature Store is designed for low-latency, real-time retrieval of a single record from the Online Store. It requires the FeatureGroupName and the RecordIdentifierValueAsString (which uniquely identifies the entity). The Online Store is optimized to serve the latest available features for a given record identifier.
Why this fits: When an application needs the most up-to-date features for a specific entity (identified by its record identifier) quickly, GetRecord is the most direct and efficient method. The SageMaker Feature Store's Online Store automatically ensures that GetRecord returns the absolute latest version of the record, eliminating the need for manual filtering by write_time or other timestamps. This is crucial for real-time inference where low latency is paramount.
Example: A real-time recommendation engine needs to fetch the latest browsing history and user preferences for a user entering a website. By calling GetRecord on a user_features Feature Group with the user's ID as the record identifier, the engine can instantly retrieve the most recent feature values for that user to personalize recommendations.
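A minimal sketch of a GetRecord lookup through boto3's `sagemaker-featurestore-runtime` client. The feature group name `customer-metadata` and the helper name are assumptions, and the `client` parameter is an illustrative addition so the function can be tried without AWS access.

```python
def get_latest_customer_record(customer_id,
                               feature_group_name="customer-metadata",
                               client=None):
    """Fetch the latest online-store record for one customer as a dict."""
    if client is None:
        import boto3  # imported lazily so the sketch can be tested offline
        client = boto3.client("sagemaker-featurestore-runtime")

    response = client.get_record(
        FeatureGroupName=feature_group_name,  # hypothetical feature group
        RecordIdentifierValueAsString=str(customer_id),
    )
    # "Record" is a list of {"FeatureName": ..., "ValueAsString": ...} items;
    # it is absent when no record exists for the identifier.
    return {
        feature["FeatureName"]: feature["ValueAsString"]
        for feature in response.get("Record", [])
    }
```

All feature values come back as strings, so the caller is responsible for converting them to the types the model expects.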
A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a Machine Learning Specialist would like to build a binary classifier based on two features: age of account and transaction month. The class distribution for these features is illustrated in the figure provided.

Based on this information, which model would have the HIGHEST recall with respect to the fraudulent class?
Premium Solution Locked
Unlock all 390 answers & explanations
A data scientist is using an Amazon SageMaker notebook instance and needs to securely access data stored in a specific Amazon S3 bucket.
How should the data scientist accomplish this?
A Machine Learning Specialist kicks off a hyperparameter tuning job for a tree-based ensemble model using Amazon SageMaker with Area Under the ROC Curve (AUC) as the objective metric. This workflow will eventually be deployed in a pipeline that retrains and tunes hyperparameters each night to model click-through on data that goes stale every 24 hours.
With the goal of decreasing the amount of time it takes to train these models, and ultimately to decrease costs, the Specialist wants to reconfigure the input hyperparameter range(s).
Which visualization will accomplish this?
A machine learning (ML) specialist uploads 5 TB of data to an Amazon SageMaker Studio environment. The ML specialist performs initial data cleansing. Before the ML specialist begins to train a model, the ML specialist needs to create and view an analysis report that details potential bias in the uploaded data.
Which combination of actions will meet these requirements with the LEAST operational overhead? (Choose two.)
A Machine Learning Specialist is creating a new natural language processing application that processes a dataset comprised of 1 million sentences. The aim is to then run Word2Vec to generate embeddings of the sentences and enable different types of predictions.
Here is an example from the dataset:
"The quck BROWN FOX jumps over the lazy dog."
Which of the following are the operations the Specialist needs to perform to correctly sanitize and prepare the data in a repeatable manner? (Choose three.)
A company wants to conduct targeted marketing to sell solar panels to homeowners. The company wants to use machine learning (ML) technologies to identify which houses already have solar panels. The company has collected 8,000 satellite images as training data and will use Amazon SageMaker Ground Truth to label the data.
The company has a small internal team that is working on the project. The internal team has no ML expertise and no ML experience.
Which solution will meet these requirements with the LEAST amount of effort from the internal team?
A company is using Amazon Polly to translate plaintext documents to speech for automated company announcements. However, company acronyms are being mispronounced in the current documents.
How should a Machine Learning Specialist address this issue for future documents?
A company is building a line-counting application for use in a quick-service restaurant. The company wants to use video cameras pointed at the line of customers at a given register to measure how many people are in line and deliver notifications to managers if the line grows too long. The restaurant locations have limited bandwidth for connections to external services and cannot accommodate multiple video streams without impacting other operations.
Which solution should a machine learning specialist implement to meet these requirements?
When submitting Amazon SageMaker training jobs using one of the built-in algorithms, which common parameters MUST be specified? (Choose three.)
A company is planning a marketing campaign to promote a new product to existing customers. The company has data for past promotions that are similar. The company decides to try an experiment to send a more expensive marketing package to a smaller number of customers. The company wants to target the marketing campaign to customers who are most likely to buy the new product. The experiment requires that at least 90% of the customers who are likely to purchase the new product receive the marketing materials.
The company trains a model by using the linear learner algorithm in Amazon SageMaker. The model has a recall score of 80% and a precision of 75%.
How should the company retrain the model to meet these requirements?
A monitoring service generates 1 TB of scale metrics record data every minute. A Research team performs queries on this data using Amazon Athena. The queries run slowly due to the large volume of data, and the team requires better performance.
How should the records be stored in Amazon S3 to improve query performance?
A wildlife research company has a set of images of lions and cheetahs. The company created a dataset of the images. The company labeled each image with a binary label that indicates whether an image contains a lion or cheetah. The company wants to train a model to identify whether new images contain a lion or cheetah.
Which Amazon SageMaker algorithm will meet this requirement?
A Machine Learning Specialist is working with a media company to perform classification on popular articles from the company's website. The company is using random forests to classify how popular an article will be before it is published. A sample of the data being used is below.

Given the dataset, the Specialist wants to convert the Day_Of_Week column to binary values. What technique should be used to convert this column to binary values?
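One common way to turn a categorical column into binary indicator columns is one-hot encoding. The sketch below assumes the Day_Of_Week column holds English day names (the sample table is not shown in this preview, so that is an assumption), and `one_hot_day` is a hypothetical helper name.

```python
# Assumed category values; the real column contents are not shown in the preview.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def one_hot_day(day):
    """Map a day name to seven 0/1 indicators, one per weekday."""
    if day not in DAYS:
        raise ValueError(f"unknown day: {day!r}")
    return [1 if day == d else 0 for d in DAYS]


encoded = [one_hot_day(d) for d in ["Monday", "Sunday", "Wednesday"]]
```

Each encoded row contains exactly one 1, so no artificial ordering between days is implied.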
A data scientist for a medical diagnostic testing company has developed a machine learning (ML) model to identify patients who have a specific disease. The dataset that the scientist used to train the model is imbalanced. The dataset contains a large number of healthy patients and only a small number of patients who have the disease. The model should consider that patients who are incorrectly identified as positive for the disease will increase costs for the company.
Which metric will MOST accurately evaluate the performance of this model?
A gaming company has launched an online game where people can start playing for free, but they need to pay if they choose to use certain features. The company needs to build an automated system to predict whether or not a new user will become a paid user within 1 year. The company has gathered a labeled dataset from 1 million users.
The training dataset consists of 1,000 positive samples (from users who ended up paying within 1 year) and 999,000 negative samples (from users who did not use any paid features). Each data sample consists of 200 features including user age, device, location, and play patterns.
Using this dataset for training, the Data Science team trained a random forest model that converged with over 99% accuracy on the training set. However, the prediction results on a test dataset were not satisfactory.
Which of the following approaches should the Data Science team take to mitigate this issue? (Choose two.)
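A short worked calculation shows why accuracy alone says little on this dataset. Using only the class counts given in the scenario, a trivial model that predicts "will not pay" for every user already exceeds 99% accuracy while finding no paying users. The variable names are illustrative.

```python
positives = 1_000      # users who paid within 1 year (from the scenario)
negatives = 999_000    # users who did not pay (from the scenario)
total = positives + negatives

# A trivial model that predicts "will not pay" for every user.
true_negatives = negatives
false_negatives = positives
true_positives = 0

accuracy = (true_positives + true_negatives) / total        # 0.999
recall = true_positives / (true_positives + false_negatives)  # 0.0
```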
A network security vendor needs to ingest telemetry data from thousands of endpoints that run all over the world. The data is transmitted every 30 seconds in the form of records that contain 50 fields. Each record is up to 1 KB in size. The security vendor uses Amazon Kinesis Data Streams to ingest the data. The vendor requires hourly summaries of the records that Kinesis Data Streams ingests. The vendor will use Amazon Athena to query the records and to generate the summaries. The Athena queries will target 7 to 12 of the available data fields.
Which solution will meet these requirements with the LEAST amount of customization to transform and store the ingested data?
A Data Scientist is developing a machine learning model to predict future patient outcomes based on information collected about each patient and their treatment plans. The model should output a continuous value as its prediction. The data available includes labeled outcomes for a set of 4,000 patients. The study was conducted on a group of individuals over the age of 65 who have a particular disease that is known to worsen with age.
Initial models have performed poorly. While reviewing the underlying data, the Data Scientist notices that, out of 4,000 patient observations, there are 450 where the patient age has been input as 0. The other features for these observations appear normal compared to the rest of the sample population.
How should the Data Scientist correct this issue?
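As an illustration of one common way to handle implausible values, the sketch below replaces ages recorded as 0 with the median of the valid ages. This is only one option among several (others include the mean or a model-based estimate); the sample ages are made up, and `impute_invalid_ages` is a hypothetical helper.

```python
from statistics import median


def impute_invalid_ages(ages, minimum_valid=1):
    """Replace ages below minimum_valid with the median of the valid ages."""
    valid = [a for a in ages if a >= minimum_valid]
    if not valid:
        raise ValueError("no valid ages to impute from")
    fill = median(valid)
    return [a if a >= minimum_valid else fill for a in ages]


ages = [70, 0, 82, 66, 0, 91, 75]   # made-up sample; 0 marks a bad entry
cleaned = impute_invalid_ages(ages)
```

Keeping the rows preserves the other features, which the scenario describes as normal for these observations.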
A machine learning (ML) specialist is training a multilayer perceptron (MLP) on a dataset with multiple classes. The target class of interest is unique compared to the other classes in the dataset, but it does not achieve an acceptable recall metric. The ML specialist varies the number and size of the MLP's hidden layers, but the results do not improve significantly.
Which solution will improve recall in the LEAST amount of time?
A Data Science team is designing a dataset repository where it will store a large amount of training data commonly used in its machine learning models. As Data Scientists may create an arbitrary number of new datasets every day, the solution has to scale automatically and be cost-effective. Also, it must be possible to explore the data using SQL.
Which storage scheme is MOST adapted to this scenario?
A company has a podcast platform that has thousands of users. The company implemented an algorithm to detect low podcast engagement based on a 10-minute running window of user events such as listening to, pausing, and closing the podcast. A machine learning (ML) specialist is designing the ingestion process for these events. The ML specialist needs to transform the data to prepare the data for inference.
How should the ML specialist design the transformation step to meet these requirements with the LEAST operational effort?
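To make the "10-minute running window" concrete, here is a minimal in-memory sketch of a trailing-window event counter. It is not tied to any AWS service; the class name, event names, and timestamps (seconds, assumed to arrive in non-decreasing order) are all illustrative assumptions.

```python
from collections import deque

WINDOW_SECONDS = 10 * 60  # 10-minute running window


class RunningWindowCounter:
    """Count events per type over the trailing window (timestamps in seconds)."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, event_type), oldest first

    def add(self, timestamp, event_type):
        self.events.append((timestamp, event_type))
        # Drop events that fell out of the window ending at this timestamp.
        while self.events and self.events[0][0] <= timestamp - self.window_seconds:
            self.events.popleft()

    def counts(self):
        result = {}
        for _, event_type in self.events:
            result[event_type] = result.get(event_type, 0) + 1
        return result


counter = RunningWindowCounter()
counter.add(0, "listen")
counter.add(120, "pause")
counter.add(300, "listen")
counter.add(660, "close")  # the event at t=0 is now older than 10 minutes
window_counts = counter.counts()
```

A managed streaming service would do the equivalent windowing without this bookkeeping, which is what "least operational effort" is pointing at.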
A Machine Learning Specialist deployed a model that provides product recommendations on a company's website. Initially, the model was performing very well and resulted in customers buying more products on average. However, within the past few months, the Specialist has noticed that the effect of product recommendations has diminished and customers are starting to return to their original habits of spending less. The Specialist is unsure of what happened, as the model has not changed from its initial deployment over a year ago.
Which method should the Specialist try to improve model performance?
A machine learning (ML) specialist uploads a dataset to an Amazon S3 bucket that is protected by server-side encryption with AWS KMS keys (SSE-KMS). The ML specialist needs to ensure that an Amazon SageMaker notebook instance can read the dataset that is in Amazon S3.
Which solution will meet these requirements?
A Machine Learning Specialist working for an online fashion company wants to build a data ingestion solution for the company's Amazon S3-based data lake.
The Specialist wants to create a set of ingestion mechanisms that will enable future capabilities comprised of:




Which services should the Specialist use?
A company wants to enhance audits for its machine learning (ML) systems. The auditing system must be able to perform metadata analysis on the features that the ML models use. The audit solution must generate a report that analyzes the metadata. The solution also must be able to set the data sensitivity and authorship of features.
Which solution will meet these requirements with the LEAST development effort?
A company is observing low accuracy while training on the default built-in image classification algorithm in Amazon SageMaker. The Data Science team wants to use an Inception neural network architecture instead of a ResNet architecture.
Which of the following will accomplish this? (Choose two.)
A company is deploying a new machine learning (ML) model in a production environment. The company is concerned that the ML model will drift over time, so the company creates a script to aggregate all inputs and predictions into a single file at the end of each day. The company stores the file as an object in an Amazon S3 bucket. The total size of the daily file is 100 GB. The daily file size will increase over time.
Four times a year, the company samples the data from the previous 90 days to check the ML model for drift. After the 90-day period, the company must keep the files for compliance reasons.
The company needs to use S3 storage classes to minimize costs. The company wants to maintain the same storage durability of the data.
Which solution will meet these requirements?
A Machine Learning Specialist built an image classification deep learning model. However, the Specialist ran into an overfitting problem in which the training and testing accuracies were 99% and 75%, respectively.
How should the Specialist address this issue and what is the reason behind it?
A company hosts a machine learning (ML) dataset repository on Amazon S3. A data scientist is preparing the repository to train a model. The data scientist needs to redact personally identifiable information (PII) from the dataset.
Which solution will meet these requirements with the LEAST development effort?
A Machine Learning team uses Amazon SageMaker to train an Apache MXNet handwritten digit classifier model using a research dataset. The team wants to receive a notification when the model is overfitting.
Auditors want to view the Amazon SageMaker log activity report to ensure there are no unauthorized API calls.
What should the Machine Learning team do to address the requirements with the least amount of code and fewest steps?
A media company wants to create a solution that identifies celebrities in pictures that users upload. The company also wants to identify the IP address and the timestamp details from the users so the company can prevent users from uploading pictures from unauthorized locations.
Which solution will meet these requirements with LEAST development effort?
A Machine Learning Specialist is building a prediction model for a large number of features using linear models, such as linear regression and logistic regression. During exploratory data analysis, the Specialist observes that many features are highly correlated with each other. This may make the model unstable.
What should be done to reduce the impact of having such a large number of features?
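The observation that "many features are highly correlated" can be checked numerically. The sketch below computes pairwise Pearson correlation and flags feature pairs above a threshold. The feature names, values, and the 0.9 cutoff are made-up assumptions, and this only detects the problem; it does not by itself reduce the number of features.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / sqrt(var_x * var_y)


def correlated_pairs(features, threshold=0.9):
    """Return feature-name pairs whose absolute correlation >= threshold."""
    names = sorted(features)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(features[a], features[b])) >= threshold
    ]


# Made-up data: two features carry the same information in different units.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "visits": [3, 9, 1, 7, 4],
}
flagged = correlated_pairs(features)
```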
A social media company wants to develop a machine learning (ML) model to detect inappropriate or offensive content in images. The company has collected a large dataset of labeled images and plans to use the built-in Amazon SageMaker image classification algorithm to train the model. The company also intends to use SageMaker pipe mode to speed up the training.
The company splits the dataset into training, validation, and testing datasets. The company stores the training and validation images in folders that are named Training and Validation, respectively. The folders contain subfolders that correspond to the names of the dataset classes. The company resizes the images to the same size and generates two input manifest files named training.lst and validation.lst, for the training dataset and the validation dataset, respectively. Finally, the company creates two separate Amazon S3 buckets for uploads of the training dataset and the validation dataset.
Which additional data preparation steps should the company take before uploading the files to Amazon S3?
A Machine Learning Specialist is implementing a full Bayesian network on a dataset that describes public transit in New York City. One of the random variables is discrete, and represents the number of minutes New Yorkers wait for a bus given that the buses cycle every 10 minutes, with a mean of 3 minutes.
Which prior probability distribution should the ML Specialist use for this variable?
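For orientation, the Poisson distribution is one discrete distribution that is fully specified by its mean. The sketch below evaluates its probability mass function with a mean of 3, matching the figure in the scenario. Treating the wait time as Poisson is an assumption made here for illustration (a Poisson variable is unbounded, whereas the scenario's waits are capped by the 10-minute cycle), not a statement of the locked answer.

```python
from math import exp, factorial


def poisson_pmf(k, mean):
    """P(X = k) for a Poisson random variable with the given mean."""
    if k < 0:
        return 0.0
    return exp(-mean) * mean ** k / factorial(k)


# Probability of waiting k whole minutes, for k = 0..10, assuming mean = 3.
pmf = [poisson_pmf(k, 3.0) for k in range(11)]
```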
A financial company sends special offers to customers through weekly email campaigns. A bulk email marketing system takes the list of email addresses as an input and sends the marketing campaign messages in batches. Few customers use the offers from the campaign messages. The company does not want to send irrelevant offers to customers.
A machine learning (ML) team at the company is using Amazon SageMaker to build a model to recommend specific offers to each customer based on the customer's profile and the offers that the customer has accepted in the past.
Which solution will meet these requirements with the MOST operational efficiency?
A Data Science team within a large company uses Amazon SageMaker notebooks to access data stored in Amazon S3 buckets. The IT Security team is concerned that internet-enabled notebook instances create a security vulnerability where malicious code running on the instances could compromise data privacy. The company mandates that all instances stay within a secured VPC with no internet access, and data communication traffic must stay within the AWS network.
How should the Data Science team configure the notebook instance placement to meet these requirements?
A company is creating an application to identify, count, and classify animal images that are uploaded to the company's website. The company is using the Amazon SageMaker image classification algorithm with an ImageNetV2 convolutional neural network (CNN). The solution works well for most animal images but does not recognize many animal species that are less common.
The company obtains 10,000 labeled images of less common animal species and stores the images in Amazon S3. A machine learning (ML) engineer needs to incorporate the images into the model by using Pipe mode in SageMaker.
Which combination of steps should the ML engineer take to train the model? (Choose two.)
A Machine Learning Specialist has created a deep learning neural network model that performs well on the training data but performs poorly on the test data.
Which of the following methods should the Specialist consider using to correct this? (Choose three.)
An automotive company uses computer vision in its autonomous cars. The company trained its object detection models successfully by using transfer learning from a convolutional neural network (CNN). The company trained the models by using PyTorch through the Amazon SageMaker SDK.
The vehicles have limited hardware and compute power. The company wants to optimize the model to reduce memory, battery, and hardware consumption without a significant sacrifice in accuracy.
Which solution will improve the computational efficiency of the models?
A Data Scientist needs to create a serverless ingestion and analytics solution for high-velocity, real-time streaming data.
The ingestion process must buffer and convert incoming records from JSON to a query-optimized, columnar format without data loss. The output datastore must be highly available, and Analysts must be able to run SQL queries against the data and connect to existing business intelligence dashboards.
Which solution should the Data Scientist build to satisfy the requirements?
A manufacturing company has a production line with sensors that collect hundreds of quality metrics. The company has stored sensor data and manual inspection results in a data lake for several months. To automate quality control, the machine learning team must build an automated mechanism that determines whether the produced goods are good quality, replacement market quality, or scrap quality based on the manual inspection results.
Which modeling approach will deliver the MOST accurate prediction of product quality?
An online reseller has a large, multi-column dataset with one column missing 30% of its data. A Machine Learning Specialist believes that certain columns in the dataset could be used to reconstruct the missing data.
Which reconstruction approach should the Specialist use to preserve the integrity of the dataset?
A company's data scientist has trained a new machine learning model that performs better on test data than the company's existing model performs in the production environment. The data scientist wants to replace the existing model that runs on an Amazon SageMaker endpoint in the production environment. However, the company is concerned that the new model might not work well on the production environment data.
The data scientist needs to perform A/B testing in the production environment to evaluate whether the new model performs well on production environment data.
Which combination of steps must the data scientist take to perform the A/B testing? (Choose two.)
A company is setting up an Amazon SageMaker environment. The corporate data security policy does not allow communication over the internet.
How can the company enable the Amazon SageMaker service without enabling direct internet access to Amazon SageMaker notebook instances?
Each morning, a data scientist at a rental car company creates insights about the previous day's rental car reservation demands. The company needs to automate this process by streaming the data to Amazon S3 in near real time. The solution must detect high-demand rental cars at each of the company's locations. The solution also must create a visualization dashboard that automatically refreshes with the most recent data.
Which solution will meet these requirements with the LEAST development time?
A Machine Learning Specialist is training a model to identify the make and model of vehicles in images. The Specialist wants to use transfer learning and an existing model trained on images of general objects. The Specialist collated a large custom dataset of pictures containing different vehicle makes and models.
What should the Specialist do to initialize the model to re-train it with the custom data?
A company is using a legacy telephony platform and has several years remaining on its contract. The company wants to move to AWS and wants to implement the following machine learning features:
• Call transcription in multiple languages
• Categorization of calls based on the transcript
• Detection of the main customer issues in the calls
• Customer sentiment analysis for each line of the transcript, with positive or negative indication and scoring of that sentiment
Which AWS solution will meet these requirements with the LEAST amount of custom model training?
An office security agency conducted a successful pilot using 100 cameras installed at key locations within the main office. Images from the cameras were uploaded to Amazon S3 and tagged using Amazon Rekognition, and the results were stored in Amazon ES. The agency is now looking to expand the pilot into a full production system using thousands of video cameras in its office locations globally. The goal is to identify activities performed by non-employees in real time.
Which solution should the agency consider?
A company wants to predict the classification of documents that are created from an application. New documents are saved to an Amazon S3 bucket every 3 seconds. The company has developed three versions of a machine learning (ML) model within Amazon SageMaker to classify document text. The company wants to deploy these three versions to predict the classification of each document.
Which approach will meet these requirements with the LEAST operational overhead?
A Marketing Manager at a pet insurance company plans to launch a targeted marketing campaign on social media to acquire new customers. Currently, the company has the following data in Amazon Aurora:





What steps should be taken to implement a machine learning model to identify potential new customers on social media?
A healthcare company wants to create a machine learning (ML) model to predict patient outcomes. A data science team developed an ML model by using a custom ML library. The company wants to use Amazon SageMaker to train this model. The data science team creates a custom SageMaker image to train the model. When the team tries to launch the custom image in SageMaker Studio, the data scientists encounter an error within the application.
Which service can the data scientists use to access the logs for this error?
A manufacturing company has a large set of labeled historical sales data. The manufacturer would like to predict how many units of a particular part should be produced each quarter.
Which machine learning approach should be used to solve this problem?
A company wants to create a data repository in the AWS Cloud for machine learning (ML) projects. The company wants to use AWS to perform complete ML lifecycles and wants to use Amazon S3 for the data storage. All of the company's data currently resides on premises and is 40 TB in size.
The company wants a solution that can transfer and automatically update data between the on-premises object storage and Amazon S3. The solution must support encryption, scheduling, monitoring, and data integrity validation.
Which solution meets these requirements?
A financial services company is building a robust serverless data lake on Amazon S3. The data lake should be flexible and meet the following requirements:



Which approach meets these requirements?
A data scientist is using the Amazon SageMaker Neural Topic Model (NTM) algorithm to build a model that recommends tags from blog posts. The raw blog post data is stored in an Amazon S3 bucket in JSON format. During model evaluation, the data scientist discovered that the model recommends certain stopwords such as "a," "an," and "the" as tags to certain blog posts, along with a few rare words that are present only in certain blog entries. After a few iterations of tag review with the content team, the data scientist notices that the rare words are unusual but feasible. The data scientist also must ensure that the tag recommendations of the generated model do not include the stopwords.
What should the data scientist do to meet these requirements?
A company's Machine Learning Specialist needs to improve the training speed of a time-series forecasting model using TensorFlow. The training is currently implemented on a single-GPU machine and takes approximately 23 hours to complete. The training needs to be run daily.
The model accuracy is acceptable, but the company anticipates a continuous increase in the size of the training data and a need to update the model on an hourly, rather than a daily, basis. The company also wants to minimize coding effort and infrastructure changes.
What should the Machine Learning Specialist do to the training solution to allow it to scale for future demand?
A machine learning (ML) specialist wants to create a data preparation job that uses a PySpark script with complex window aggregation operations to create data for training and testing. The ML specialist needs to evaluate the impact of the number of features and the sample count on model performance.
Which approach should the ML specialist use to determine the ideal data transformations for the model?
Which of the following metrics should a Machine Learning Specialist generally use to compare/evaluate machine learning classification models against each other?
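As an example of a threshold-independent way to compare two classifiers, the sketch below computes the area under the ROC curve (AUC) as the fraction of positive/negative pairs that a model ranks correctly. It is offered as one illustration rather than as the locked answer; the scores, labels, and the `roc_auc` helper are made up for this purpose.

```python
def roc_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and scores from two hypothetical models on the same examples.
labels = [1, 1, 0, 1, 0, 0]
model_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
model_b = [0.9, 0.4, 0.8, 0.3, 0.7, 0.2]

auc_a = roc_auc(model_a, labels)
auc_b = roc_auc(model_b, labels)
```

The pairwise loop is quadratic in the number of examples, which is fine for a small illustration but not for large evaluation sets.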
A developer wants to build an application that detects when customers enter personally identifiable information (PII), such as bank account numbers, into a customer survey before those responses are saved into a third-party database as records. The survey responses allow 100 words maximum and are less than 1 KB in size. The developer has never built a machine learning (ML) model before and wants a solution that requires the least development effort to build.
Which solution will meet these requirements with the LEAST development effort?
A company is running a machine learning prediction service that generates 100 TB of predictions every day. A Machine Learning Specialist must generate a visualization of the daily precision-recall curve from the predictions, and forward a read-only version to the Business team.
Which solution requires the LEAST coding effort?
A data science team is working with a tabular dataset that the team stores in Amazon S3. The team wants to experiment with different feature transformations such as categorical feature encoding. Then the team wants to visualize the resulting distribution of the dataset. After the team finds an appropriate set of feature transformations, the team wants to automate the workflow for feature transformations.
Which solution will meet these requirements with the MOST operational efficiency?
A Machine Learning Specialist is required to build a supervised image-recognition model to identify a cat. The ML Specialist performs some tests and records the following results for a neural network-based image classifier:
Total number of images available = 1,000
Test set images = 100 (constant test set)
The ML Specialist notices that, in over 75% of the misclassified images, the cats were held upside down by their owners.
Which techniques can be used by the ML Specialist to improve this specific test error?
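When errors cluster on one orientation of the subject, one commonly used remedy is to augment the training set with transformed copies of existing images. The sketch below adds a 180-degree-rotated copy of each training image, representing an image as a plain list of pixel rows. The data and function names are illustrative assumptions, and a real pipeline would normally use an image library instead.

```python
def rotate_180(image):
    """Rotate a 2D image (list of pixel rows) by 180 degrees."""
    return [row[::-1] for row in image[::-1]]


def augment_with_rotations(dataset):
    """Return the original (image, label) pairs plus a rotated copy of each."""
    augmented = list(dataset)
    augmented.extend((rotate_180(image), label) for image, label in dataset)
    return augmented


tiny_image = [[1, 2, 3],
              [4, 5, 6]]          # stand-in for real pixel data
train = [(tiny_image, "cat")]
augmented = augment_with_rotations(train)
```

Only the training images are augmented here; the test set in the scenario is stated to be constant.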
A data engineer is evaluating customer data in Amazon SageMaker Data Wrangler. The data engineer will use the customer data to create a new model to predict customer behavior.
The engineer needs to increase the model performance by checking for multicollinearity in the dataset.
Which steps can the data engineer take to accomplish this with the LEAST operational effort? (Select TWO.)
A Machine Learning Specialist needs to be able to ingest streaming data and store it in Apache Parquet files for exploration and analysis.
Which of the following services would both ingest and store this data in the correct format?
A company hosts a public web application on AWS. The application provides a user feedback feature that consists of free-text fields where users can submit text to provide feedback. The company receives a large amount of free-text user feedback from the online web application. The product managers at the company classify the feedback into a set of fixed categories including user interface issues, performance issues, new feature request, and chat issues for further actions by the company's engineering teams.
A machine learning (ML) engineer at the company must automate the classification of new user feedback into these fixed categories by using Amazon SageMaker. A large set of accurate data is available from the historical user feedback that the product managers previously classified.
Which solution should the ML engineer apply to perform multi-class text classification of the user feedback?
A Data Scientist is developing a machine learning model to classify whether a financial transaction is fraudulent. The labeled data available for training consists of 100,000 non-fraudulent observations and 1,000 fraudulent observations.
The Data Scientist applies the XGBoost algorithm to the data, resulting in the following confusion matrix when the trained model is applied to a previously unseen validation dataset. The accuracy of the model is 99.1%, but the Data Scientist has been asked to reduce the number of false negatives.

Which combination of steps should the Data Scientist take to reduce the number of false negative predictions by the model? (Choose two.)
A company is building a new supervised classification model in an AWS environment. The company's data science team notices that the dataset has a large quantity of variables. All the variables are numeric.
The model accuracy for training and validation is low. The model's processing time is affected by high latency. The data science team needs to increase the accuracy of the model and decrease the processing time.
What should the data science team do to meet these requirements?
A Machine Learning Specialist is assigned a TensorFlow project using Amazon SageMaker for training, and needs to continue working for an extended period with no Wi-Fi access.
Which approach should the Specialist use to continue working?
A company's data scientist has built a machine learning (ML) classification system that can determine whether the company's promotional items are present in an image. The company wants the ML classification system to also determine how many times each type of promotional item appears in an image and the exact position of each item in the image.
Which solution will provide all the annotation data that the data scientist needs to train a supervised model to accomplish this task?
Customer Reviews
Global Community Feedback
David M.
"The practice engine is incredible. It feels exactly like the real testing environment and helped me build so much confidence."
Sarah J.
"The PDF is very well organized and the explanations for the answers are actually helpful, not just random text."
Michael C.
"I was skeptical, but the content is high quality and definitely worth the price. I passed on my first try!"