Flashcards for reviewing ML exam topics
A company wants to read, write, and transform Apache Parquet files from an Amazon S3 bucket. An ML engineer has been asked to automate this process by creating a custom transform script in Python. Which solution meets these requirements with the LEAST operational effort?
AWS Glue
A company is building a fraudulent transaction detection solution on Amazon SageMaker. The company wants to use a SageMaker built-in algorithm for the fraud detection model by using customer transaction data. Which algorithm should the company use?
Random Cut Forest (RCF)
An agriculture company plans to use drone imagery to count their flock of sheep. The company is experimenting with Amazon SageMaker built-in algorithms, and needs advice on what type of ML model to use. Which ML model should the company use?
Object detection model
A global company wants to build an internal chatbot that employees can use to answer questions about company-relevant information. The chatbot will use a retrieval augmented generation (RAG) approach to retrieve relevant information from internal documents. The chatbot will use a large language model (LLM) to answer the employees' questions. The company expects the chatbot to consistently have a high number of queries. The chatbot must be available 24 hours a day, 7 days a week. The company wants to use a fully managed RAG solution. Which solution will meet these requirements MOST cost-effectively?
Amazon Bedrock Knowledge Bases with provisioned throughput
A retail company is using an Amazon SageMaker recommendation model to generate personalized in-application notifications. The product catalog changes on a weekly basis, and the merchandise categories are highly imbalanced. The company is concerned that data drift is too high and introduces significant bias, despite regular model re-training. Which action can the company take to detect bias in production?
SageMaker Clarify
A company designed a classification system. The system uses an ML model deployed on an Amazon SageMaker endpoint. The company wants to assess system performance by implementing a feedback mechanism to track the model's performance. Which solution will meet these requirements with the LEAST development effort?
Use SageMaker Model Monitor to ingest and merge captured data from the endpoint and the processed feedback. Create and schedule a baseline job and a model quality monitoring job
A library designed a book recommendation system. The system was deployed by using an Amazon SageMaker endpoint. The endpoint has a target tracking scaling policy to auto-scale based on the number of invocations metric. After system deployment, traffic has seen intermittent spikes that caused over-scaling. An ML engineer must implement a solution to handle the spike in traffic. Which solution will meet these requirements with the LEAST operational overhead?
Specify a cooldown period in the target tracking scaling policy
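As a rough illustration of the answer above, the sketch below builds the kind of target tracking configuration that the Application Auto Scaling `put_scaling_policy` call accepts for a SageMaker endpoint variant. The helper name and the numeric values are assumptions chosen for the example, not values from the question.

```python
def target_tracking_config(target_invocations, scale_in_cooldown_s, scale_out_cooldown_s):
    """Build a TargetTrackingScalingPolicyConfiguration dict (hypothetical helper)."""
    return {
        # Desired average invocations per instance (illustrative value).
        "TargetValue": float(target_invocations),
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        # Seconds to wait after a scaling activity before the next one may start.
        "ScaleInCooldown": int(scale_in_cooldown_s),
        "ScaleOutCooldown": int(scale_out_cooldown_s),
    }


config = target_tracking_config(70, 300, 120)
print(config)
```

A longer scale-out cooldown keeps a short spike from triggering repeated scale-out actions before the previous one has taken effect.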
A company is developing a TensorFlow model by using Amazon SageMaker framework estimators. The model is experiencing heavy system utilization. An ML engineer must identify system utilization bottlenecks in real time. Which solution will meet these requirements?
Use Amazon CloudWatch to monitor SageMaker instance metrics that are used by the model.
A company is using Amazon SageMaker to train and evaluate an ML model. The company will use the model to predict if an email is spam or not. Because the model will be used for internal emails, the company wants to ensure that legitimate emails are not incorrectly flagged as spam. Which model evaluation metric will meet these requirements?
Precision
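A small sketch of why precision fits this scenario: precision is TP / (TP + FP), so it drops whenever a legitimate email is flagged as spam. The labels below are made-up toy data (1 = spam, 0 = legitimate).

```python
def precision(y_true, y_pred):
    """Fraction of predicted-spam emails that really are spam."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    """Fraction of actual spam emails that were caught."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy data: a cautious filter that flags only one email, and gets it right.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

print(precision(y_true, y_pred), recall(y_true, y_pred))
```

In this toy case no legitimate email is flagged, so precision is perfect even though recall is low, which matches the stated priority.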
A data scientist at a bank wants to train a model to predict loan approvals by using XGBoost on Amazon SageMaker. The training dataset is a tabular dataset. The dataset includes a column named 'approved' that indicates if the loan is approved, with 1 indicating approved and 0 indicating not approved. When setting up the hyperparameter tuning, the data scientist needs to provide an evaluation metric. Which evaluation metric is correct for this scenario?
validation:f1
An ML engineer uses Amazon SageMaker Data Wrangler to prepare a dataset for training an ML model to predict housing prices. The dataset consists of housing data from the past 10 years sorted by date and includes features such as location, size, and price. The ML engineer wants to reduce prediction bias and ensure that the model generalizes well on future, unseen data. Which SageMaker Data Wrangler split transform will meet this requirement?
Ordered split
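The idea behind an ordered split can be sketched in plain Python: sort by date, then cut without shuffling so that the test set contains only the most recent rows. This is an illustration of the concept, not Data Wrangler's implementation; the field names and the 80/20 ratio are assumptions.

```python
from datetime import date


def ordered_split(rows, train_fraction=0.8):
    """Split rows chronologically: oldest rows train, newest rows test."""
    ordered = sorted(rows, key=lambda row: row["sold_on"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical housing records, one per year for 10 years.
rows = [
    {"sold_on": date(2015 + i, 1, 1), "price": 200_000 + 10_000 * i}
    for i in range(10)
]
train, test = ordered_split(rows)
print(len(train), len(test))
```

Because no test row predates any training row, the evaluation resembles predicting genuinely future prices.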
An ML engineer uses Amazon SageMaker Data Wrangler to pre-process a housing dataset for a Support Vector Machine (SVM) regression model. The dataset includes a Property_Age feature with values ranging from 1-10. The dataset also includes a Property_Price feature with most values around 300,000 and several outliers with values up to 15,000,000. The model requires features to be on a similar scale for optimal performance. Which SageMaker Data Wrangler scaling function should be applied to the model?
Robust Scaler
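A minimal sketch of robust scaling, assuming the usual definition (subtract the median, divide by the interquartile range), compared with min-max scaling. The price list is invented to mirror the scenario.

```python
import statistics


def robust_scale(values):
    """Center on the median and scale by the interquartile range (IQR)."""
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - med) / iqr for v in values]


def min_max_scale(values):
    """Scale to [0, 1] using the minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Toy prices: most near 300,000 plus one extreme outlier.
prices = [280_000, 290_000, 300_000, 310_000, 320_000, 15_000_000]
robust = robust_scale(prices)
minmax = min_max_scale(prices)
print(robust[:5])
print(minmax[:5])
```

With min-max scaling the outlier squeezes the typical prices into a sliver near zero, while robust scaling keeps them spread across roughly [-1, 1] because the median and IQR are barely affected by the outlier.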
An ML engineer must prepare data from car rental contracts for model training. The car rental contracts that are used to train the model are in plain text format and are stored in Amazon S3. The contracts include the renter's name, age, email address, driver's license ID, car model, and vehicle identification number. In preparation for model training, the contracts must be processed to detect personally identifiable information (PII). Which solution will meet these requirements with the LEAST operational overhead?
Use an Amazon SageMaker Canvas ready-to-use model to detect PII.
A data scientist successfully used Amazon Comprehend from an Amazon SageMaker notebook instance through the boto3 Python APIs. Later, a security team configures a VPC endpoint dedicated to Amazon Comprehend. Security requirements state that AWS services should not be reached through the public internet. The data scientist attempts to update the SageMaker notebook to reach the DNS entry of the VPC endpoint, but the service call fails. How can the data scientist resolve the error to access Amazon Comprehend from the SageMaker notebook instance?
Verify if the SageMaker notebook instance is configured to run inside the same VPC as the VPC endpoint.
A company built a deep learning model for climate modeling by using Amazon SageMaker. In each invocation, the model processes 400 MB of data for 30 minutes to return a prediction. The climate model is invoked automatically when a new climate event is detected. The company needs a deployment strategy to move the deep learning model to production. A cold start can be tolerated. What is the MOST cost-effective solution?
Deploy the model by using an asynchronous endpoint.
An ML engineer must maintain an existing Amazon SageMaker Pipelines pipeline to build an ML model. The ML engineer must modify the current pipeline to implement custom model training logic. The training code is written in Python. Which modification should the ML engineer make to meet these requirements?
Wrap the custom training logic into a function and use the @step decorator in the function. Add the function as a step in the current pipeline.
A social media company wants to build a content moderation system to detect inappropriate or offensive material in user-uploaded images. Which solution will meet this requirement?
Use Amazon Rekognition moderation APIs.
A data scientist is training a deep learning neural network by using Amazon SageMaker. The data scientist wants to debug the model to identify and address model convergence issues. The data scientist wants to use real-time monitoring to determine if there is a sampling imbalance between classes. Which solution will meet these requirements with the LEAST operational overhead?
Set up a SageMaker training job that is configured to include SageMaker Debugger. Start the training job and monitor for sampling imbalance by using SageMaker Debugger built-in rules.
A company is planning to develop an ML model by using Amazon SageMaker. The training dataset is sensitive and is stored in Amazon S3 in a different AWS Region from where the company plans to run SageMaker. The training dataset cannot be exposed to the public internet during processing. Which solution will meet these requirements?
Disable direct internet access for SageMaker instances. Enable an interface VPC endpoint within the VPC. Encrypt the S3 data by using AWS Key Management Service (AWS KMS).
An ML engineer is developing a semantic segmentation computer vision product. The ML engineer has an unlabeled image dataset that is stored in Amazon S3. The images must be labeled to prepare the dataset to train a built-in classification ML model on Amazon SageMaker. Which solution will meet these requirements with the LEAST operational overhead?
Create a SageMaker Ground Truth labeling job.
A financial services company created a feature group in Amazon SageMaker Feature Store. The feature group manages user-related ML features across different areas. Data has already been loaded into SageMaker Feature Store. An ML engineer needs to add a new feature to the feature group. The feature group will be used in several ML marketing models. The ML engineer must update the historical records to include the values of the new feature. Which step should the ML engineer take to add the new feature?
Use the UpdateFeatureGroup operation to add the new feature to the feature group. Specify the name and type.
A financial services company created a feature group in Amazon SageMaker Feature Store. The feature group manages user-related ML features across different areas. Data has already been loaded into SageMaker Feature Store. An ML engineer needs to add a new feature to the feature group. The feature group will be used in several ML marketing models. The ML engineer must update the historical records to include the values of the new feature. Which step should the ML engineer take to update the histo...
Use the PutRecord operation to overwrite the records that do not have data for the new feature.
An ML engineer built an ML solution that was deployed in an AWS account. The account is shared by the company's ML team and already runs additional projects. The company needs to use AWS Cost Center to track costs across all the AWS resources that are used in the solution. These resources include training and batch inference workflows in Amazon SageMaker Pipelines, Amazon S3 buckets, and AWS Glue tables. Which solution will meet the requirements to group and track the project costs?
Assign a user-defined tag to the project AWS resources that includes a project identifier. Activate user-defined tags in the AWS Billing and Cost Management console and use AWS Cost Explorer to filter costs by the project identifier.
A car company wants to build an ML model by using Amazon SageMaker to predict the prices of pre-owned cars. The company provides a dataset to a data scientist that includes thousands of observations and 10 features based on past sales data. Which ML algorithm should the data scientist use to meet these requirements?
Linear learner algorithm
A research team collects data from 10 universities that are participating in a research study. The data consists of many large .csv files that are uploaded from each university into Amazon S3. An ML engineer notices that files are taking a long time to upload. The ML engineer needs to increase the upload speed. Which solution will meet these requirements?
Use Amazon S3 Transfer Acceleration.
A data scientist is exploring a dataset by using an Amazon SageMaker Studio notebook. The data scientist wants to visualize the correlation between different input features. Which correlation metric should the data scientist use to investigate non-linear relationships between numeric features?
Spearman
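To see why Spearman suits non-linear (monotonic) relationships, the sketch below computes Spearman as the Pearson correlation of ranks and compares both metrics on a cubic relationship. The data is invented, and the rank helper assumes there are no tied values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; assumes no ties (a simplification for this sketch)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


x = list(range(1, 11))
y = [v ** 3 for v in x]  # monotonic but non-linear

print(pearson(x, y), spearman(x, y))
```

Spearman reports a perfect monotonic relationship here, while Pearson is lower because the relationship is not a straight line.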
An online retail company is using an Amazon SageMaker endpoint to deliver product recommendations to customers directly in a web application. An ML specialist needs to ensure that the ML model remains available during seasonal sale events. The ML model must be able to accommodate the expected increase in endpoint invocations. Which solution provides the HIGHEST scalability capabilities to meet these requirements?
Configure auto scaling on the SageMaker ML model endpoint
A telecommunications company uses an Amazon SageMaker ML model to predict customer turnover. The model is an XGBoost tree-based model. The tabular dataset includes both nominal categorical variables and numerical variables. A data scientist must transform the variables so that the data can be analyzed in the SageMaker environment. Which solution should the data scientist use to help analyze the data?
Use SageMaker Data Wrangler to perform encoding on the categorical variables.
How can the model engineer train the built-in SageMaker ML model in the MOST cost-effective manner?
In the SageMaker training job, set EnableManagedSpotTraining to True.
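As a hedged sketch, the fragment below shows the spot-related fields of a `CreateTrainingJob` request. With managed spot training, `MaxWaitTimeInSeconds` must be at least `MaxRuntimeInSeconds`; the helper name and the durations are assumptions for illustration.

```python
def spot_training_settings(max_runtime_s, max_wait_s):
    """Return the spot-related part of a CreateTrainingJob request (hypothetical helper)."""
    if max_wait_s < max_runtime_s:
        raise ValueError("MaxWaitTimeInSeconds must be >= MaxRuntimeInSeconds")
    return {
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            # Upper bound on actual training time.
            "MaxRuntimeInSeconds": max_runtime_s,
            # Upper bound on training time plus time spent waiting for spot capacity.
            "MaxWaitTimeInSeconds": max_wait_s,
        },
    }


settings = spot_training_settings(3600, 7200)
print(settings)
```

These fields would be merged into the rest of the training job request (algorithm, role, input and output locations), which is omitted here.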
An ML engineer wants to train a model to analyze customer turnover in a telecommunications company. The ML engineer created a script to fit a Cox model in an Amazon SageMaker training job by using a dataset available in Amazon S3. The training code requires access to third-party Python libraries including scikit-learn, NumPy, pandas, and a proprietary library. The proprietary library code cannot be modified and is available in a private artifact repository. Which solution will run the training job with the LEAST operational overhead?
Extend the prebuilt SageMaker scikit-learn framework container to include custom dependencies.
An ML engineer wants to create a text summarization model that is based on the Amazon SageMaker seq2seq algorithm. The ML model training data includes 1 TB of flat files. The ML engineer must convert the data to RecordIO-Protobuf format. Which solution will meet these requirements?
Launch an Apache Spark Amazon EMR cluster to transform the training data to RecordIO-Protobuf format on Amazon S3.
An ML engineer must monitor a production ML model that has an endpoint that is configured for real-time inference. Model training data and inference I/O data are stored in Amazon S3. The ML engineer needs to track data drift in production to see if the quality of predictions changes from when the model was trained. Which solution should the ML engineer use to create a baseline of the training data?
Use an Amazon SageMaker Model Monitor prebuilt container with SageMaker Python SDK to generate statistics from the training data.
A data scientist is developing a forecasting model by using Amazon SageMaker. The data scientist has 3 years of daily time series data, including days with missing data. The data is stored in Amazon S3. The data scientist wants to perform feature engineering by filling in missing values with various substitutes. What is the MOST operationally efficient method to fill in missing values?
Use SageMaker Data Wrangler within the SageMaker Canvas environment to fill missing values.
A data scientist must train an ML model in Amazon SageMaker. The model should be trained with customer purchasing data to classify customer segments based on behavior. The data scientist must evaluate multiple algorithms and track model performance. Which solution will meet these requirements with the LEAST effort?
Use SageMaker built-in algorithms to train the model. Use SageMaker Experiments to track model runs and results.
A company is using an ML model that runs inferences in real time in Amazon SageMaker as part of an online application. Lately, the accuracy of the model has been decreasing. The company has developed three new versions of the model. The company wants to perform A/B testing on the new versions of the model and deploy the model that has the highest accuracy. Which solution will meet these requirements with the LEAST operational overhead?
Deploy the three new versions of the model behind a single SageMaker endpoint. Define a traffic percentage for each version.
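A rough sketch of how traffic percentages follow from variant weights: in a SageMaker endpoint configuration, each entry in `ProductionVariants` carries an `InitialVariantWeight`, and a variant's share of traffic is its weight divided by the sum of all weights. The variant names, model names, instance type, and weights below are hypothetical.

```python
def traffic_shares(variants):
    """Map each variant name to its fraction of endpoint traffic."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


# Hypothetical production variants for three candidate model versions.
variants = [
    {"VariantName": "version-a", "ModelName": "model-a", "InitialInstanceCount": 1,
     "InstanceType": "ml.m5.large", "InitialVariantWeight": 2.0},
    {"VariantName": "version-b", "ModelName": "model-b", "InitialInstanceCount": 1,
     "InstanceType": "ml.m5.large", "InitialVariantWeight": 1.0},
    {"VariantName": "version-c", "ModelName": "model-c", "InitialInstanceCount": 1,
     "InstanceType": "ml.m5.large", "InitialVariantWeight": 1.0},
]

shares = traffic_shares(variants)
print(shares)
```

Adjusting the weights changes the split without creating separate endpoints for each version.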
A data scientist needs to deploy an ML model. The model will be invoked every 24 hours. The model takes 30 minutes to process requests. Which solution will meet these requirements MOST cost-effectively?
Create an Amazon SageMaker batch transform job.
An ML engineer is developing a computer vision ML model to identify visual defects on the products. The engineer is using a dataset with 50,000 images of products that will be split for training and evaluation. During the validation step, the model is not accurately capturing the underlying relationship in the training dataset. Which approach will improve the model performance?
Increase the amount of domain-specific features in the training dataset.
A data scientist wants to train a model to predict housing prices by using XGBoost on Amazon SageMaker. The training dataset is a tabular dataset. The dataset includes a column named 'price' that indicates the sales price of each house. The data scientist is setting up the hyperparameter tuning, and needs to provide an evaluation metric. Which evaluation metric is appropriate for this scenario?
validation:mse
A company wants to implement predictive maintenance for critical equipment by using ML algorithms. The company requires development, deployment, and management of the predictive maintenance solutions. Which action will help the company's own code and dependencies run feature engineering in Amazon SageMaker?
Build a Dockerfile and push the image to Amazon Elastic Container Registry (Amazon ECR).
An ML engineer must implement a solution that processes hundreds of thousands of text inputs once every 24 hours. Each of the inputs is inserted into a prompt and sent to a large language model (LLM) for inference. The LLM response must be stored in an Amazon S3 bucket. Which solution will meet these requirements with the LEAST operational overhead?
Create a batch inference job in Amazon Bedrock. Store the input file in an S3 bucket and specify the stored file as an input to a CreateModelInvocationJob request. Specify the output locations for the request as the target S3 bucket.
The company wants to deploy a new version of a model into production. The company wants to shift only a small portion of traffic at first. If the results are satisfactory, the company will shift the remainder of the traffic to the new version. Which solution will meet these requirements?
Implement a blue/green deployment strategy in canary mode.
A company wants to build an ML model to predict future sales based on historical trends. The company has several years of sales data stored in an SQL server in an on-premises database. The company uses AWS and has dedicated network connectivity between AWS and the on-premises database. An ML engineer is using Amazon SageMaker pre-built Docker images to train the model. Which approach must the engineer use to ingest the data to train the model?
Use AWS Database Migration Service (AWS DMS) to export the data to Amazon S3. Provide the S3 location within the SageMaker notebook.
An ML engineer wants to use Amazon SageMaker to create a model that predicts whether a student will pass an exam. The ML engineer is developing a logistic regression model and needs to find an optimal model with the most accurate classification threshold. The ML engineer must select a model evaluation technique to analyze the performance of the model based on the defined threshold. The dataset contains an equal number of observations for passed and failed exam attempts. Which model evaluation technique meets the requirements?
Receiver operating characteristic (ROC) curve
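A minimal sketch of how an ROC curve relates to the classification threshold: each threshold yields one (false positive rate, true positive rate) point. The labels and scores are invented toy data (1 = passed, 0 = failed).

```python
def roc_point(labels, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return fp / (fp + tn), tp / (tp + fn)


# Toy data: balanced classes, as in the scenario.
labels = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

curve = [roc_point(labels, scores, t) for t in (0.85, 0.5, 0.15)]
print(curve)
```

Sweeping the threshold traces the curve, which makes it possible to compare candidate thresholds by their trade-off between true and false positives.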
A company urgently wants to deploy a newly trained ML model to improve customer experience for an existing application that has a custom traffic pattern. An MLOps engineer must build a deployment pipeline to host the model on a persistent, scalable endpoint that provides consistently low latency. The MLOps engineer must identify the instance type to use to host the model. Which solution will meet these requirements with the LEAST operational overhead?
Use Amazon SageMaker Inference Recommender to run an inference recommendation job.
A financial company has a compliance policy that states that direct internet access from an Amazon SageMaker notebook instance is not allowed. An ML engineer disabled direct internet access on the SageMaker notebook instance and hosted the instance in a private subnet in a VPC. However, internet access is required to update the SageMaker instance. Which solution will meet these requirements?
Set up a NAT gateway within the VPC. Configure security groups and network access control lists (network ACLs) to allow outbound connections.
A global automotive company is managing a fleet of hundreds of thousands of vehicles. For each new vehicle, the company receives a scan of the vehicle registration information card. The company uses Amazon Textract to extract the text from the scans. An ML engineer must redact the vehicle identification number (VIN) from the extracted text before the text is used for modeling. Which solution will meet these requirements with the LEAST operational overhead?
Use Amazon Comprehend to run an asynchronous job to redact personally identifiable information (PII) entities. Set the redaction configuration to redact VINs from the text.
A data scientist created an Amazon SageMaker Processing job that processes CSV files in an Amazon S3 bucket. The SageMaker Processing job can access the S3 bucket. However, when the job tries to access the CSV files, the job receives a 403 error. What is the cause of the error?
The SageMaker Processing job execution role does not have the necessary permissions.
An ML engineer is experimenting with a large language model (LLM) for text generation on the Amazon Bedrock text playground. The ML engineer tests inference on different prompts and discovers high randomness and variation of responses to the same repeated questions. The ML engineer must change the inference parameters to standardize answers and generate more consistent responses. Which change to the inference parameters will meet these requirements?
Reduce the temperature parameter of the model.
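The effect of temperature can be illustrated with a temperature-scaled softmax over token scores. This is a conceptual sketch with made-up logits, not a description of how any particular Bedrock model samples internally.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
default_probs = softmax_with_temperature(logits, 1.0)
cool_probs = softmax_with_temperature(logits, 0.2)

print(default_probs)
print(cool_probs)
```

At the lower temperature almost all probability mass sits on the top-scoring token, so repeated runs of the same prompt tend to produce the same output.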
A data science team has built over 50 models on Amazon SageMaker during the last several years. The models support different business groups within a company. The data science team wants to document the critical details of the models. The critical details include the background and purpose of the models. The critical details will give model consumers the ability to browse and understand available models within the company. Which solution will meet these requirements with the LEAST operational overhead?
Configure SageMaker Model Cards.
A financial services company is developing a new ML model to automatically assign credit limits to customers when they apply for a credit card. To train the model, the company has gathered a large dataset of customers and their history. The data includes credit transactions, credit scores, and other relevant financial and demographic information from the last year. The first results of the new model in training show that the model is returning inaccurate predictions for specific types of customer...
Configure a SageMaker Clarify processing job to identify bias in the training data.
An ML engineer must reuse features across different ML applications for training and low-latency inference. The ML engineer must ensure that the features can be shared with multiple team members in different accounts. Which solution will meet these requirements with the LEAST operational effort?
Use Amazon SageMaker Feature Store to store features for reuse and to provide access for team members across different accounts.
A company hosts many ML models that support unique use cases that have dynamic workloads. All the models were trained by using the same ML framework. The models are hosted on Amazon SageMaker dedicated endpoints that are underutilized. The company has a goal to optimize its environment for cost. Which solution will meet these requirements MOST cost-effectively?
Configure a SageMaker multi-model endpoint.
A company is training a model on 50,000 images. A first evaluation of the model shows that the training error rapidly decreases as the number of epochs increases, which causes the model to generalize poorly on the evaluation data. An ML engineer must improve the generalization of the model. Which method will meet the requirements?
Increase the value of the regularization hyperparameter.