Machine Learning (ML)
Field of study in AI concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, performing tasks without explicit instructions.
ML models can be deterministic, probabilistic, or a mix of both.
Types: Supervised, Unsupervised and Deep Learning.
*Note: At the start of an ML project, identify the business goal and frame the problem.*
Types of Data
Structured: Data that is organized in a defined manner or schema. Typically found in relational databases (database tables, Excel spreadsheets) (RDS, Aurora)
Unstructured: Data that does not have a pre-defined structure or schema (videos, images, emails) (DynamoDB)
Semi-structured: Data that is not as organized as structured data but has some form of structure (tags, hierarchies) (XML, JSON)
Supervised vs. Unsupervised learning
Supervised learning: relies on labeled training data: input variables paired with known outputs.
Unsupervised learning: processes unlabeled/raw data.
Mean/Median/Mode
Mean (average): best for normal distributions, where the mean represents the average of the cumulative data set.
Median (middle number): better option when outliers are present. Works only on numerical data; cannot be used on categorical features. Not very accurate.
Mode (most frequent): used with categorical values.
Data Repositories
Data Warehouses: Centralized repository optimized for querying and analysis with structured data where data is ingested from different sources. (Redshift)
Data Lakes: Storage repository that holds vast amounts of raw data in its native format, including structured, unstructured and semi-structured data. (S3)
Data Lakehouse: Hybrid of DW and DL. Primarily used in ML/AI. Combines the flexibility/cost effectiveness of a DL and the performance of a DW. (AWS Lake Formation w/S3, Redshift Spectrum)
Data Processing Frameworks
Python (Pandas): for SMBs. Perfect for local, in-memory merge tasks. Used with small amounts of structured data.
PySpark (Python API for Apache Spark): for large-scale, distributed processing of datasets stored in cloud environments such as S3 or RDS (Recommended)
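A minimal sketch of the same join done both ways, assuming a small local CSV for Pandas and a Spark session for PySpark; the file names and the customer_id key are hypothetical:
```python
import pandas as pd

# Pandas: local, in-memory join -- fine for small, structured data
orders = pd.read_csv("orders.csv")        # hypothetical file
customers = pd.read_csv("customers.csv")  # hypothetical file
joined = orders.merge(customers, on="customer_id", how="inner")

# PySpark: the same join, distributed across a cluster
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("join-example").getOrCreate()
orders_df = spark.read.csv("s3://my-bucket/orders/", header=True)      # hypothetical path
customers_df = spark.read.csv("s3://my-bucket/customers/", header=True)
joined_df = orders_df.join(customers_df, on="customer_id", how="inner")
```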
Convolutional Neural Network (CNN)
Scans image for specific features by passing an image through a series of hierarchical filters/layers
Image classification, object detection regardless of placement.
Includes: LeNet (handwriting), AlexNet (image recognition), GoogLeNet (deeper, better performance), ResNet (deepest)
*NOTE: CNNs are difficult to administer: resource intensive, lots of hyperparameters, and acquiring (as well as storing and accessing) training data.*

Recurrent Neural Network (RNN)
Designed for sequential data such as text, time series and speech.
Has a Memory mechanism: Remembers previous inputs to determine future outputs using time-series data.
Uses a "hidden state" to loop information from previous steps in the sequence, helping carry context forward.
Vanishing gradient problem: Simple RNNs have "short-term memory." Major flaw in RNN architecture
Sequential; cannot handle parallel learning.
*NOTE: Build upon previous research when creating since they are very resource intensive.

Activation Functions
Takes the weighted input signals for a node and determines its final output (or "activation") in a neural network
Activation functions introduce non-linearity in neural networks, allowing for increased complexity.
Popular tools: Sigmoid function, ReLU
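A quick NumPy sketch of both functions, with illustrative inputs:
```python
import numpy as np

def sigmoid(x):
    # Squashes any input into (0, 1); useful for probabilities
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Outputs x when positive, 0 otherwise; cheap and non-linear
    return np.maximum(0.0, x)

z = np.array([-2.0, 0.0, 3.0])   # example weighted inputs to a node
print(sigmoid(z))  # approx. [0.119 0.5 0.953]
print(relu(z))     # [0. 0. 3.]
```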
Loss Function (*)
Aggregates the residuals of all data points in a model and provides a single output to calculate overall error rate.
Quantifies how well a model's overall predictions align with the actual, correct data.
Unbalanced Data
When there is a large discrepancy between the number of samples of what you are testing for and what you are not.
*Common issue when dealing with ML algorithms*
To resolve:
1) Correctly identify the minority group in data sample
2) Use Oversampling and Undersampling to increase accuracy
Oversampling: increase the number of minority-class samples (what you are testing for) to match the majority class
Undersampling: decrease the number of majority-class samples (what you are not testing for) to match the minority class
Tools Used:
Synthetic Minority Oversampling Technique (SMOTE): generates a wide variety of new, artificial samples of the minority class (*advanced form of oversampling based on KNN)
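A minimal SMOTE sketch, assuming the imbalanced-learn library is installed; the dataset is synthetic:
```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic, heavily imbalanced binary dataset (95% / 5%)
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print(Counter(y))  # e.g. {0: 950, 1: 50}

# SMOTE synthesizes new minority samples from nearest neighbors
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))  # classes now balanced, e.g. {0: 950, 1: 950}
```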
Common Data Formats (*)
CSV (Simplicity): simple text file, used for spreadsheets; separated by commas/semicolons.
(Data Import from Spreadsheets)
JSON (Compatibility): simple, schema-less, key-value pair, text format used to structure data.
Parquet (Performance): structured, binary data is stored in columns, not rows. Excellent compression and performance. Broad ecosystem support.
*Gold standard for big data analytics*
(Hadoop, Apache Spark, Redshift Spectrum)
NOTE: Does not work well with unsorted data
Apache ORC (Optimized Row Columnar) (Data Volume): Specialized, columnar database used with Apache Hive. Fast performance for tasks that require a lot of reading. Used when dealing with large amounts of data.
(Apache Hive)
Apache AVRO (Streaming, Storage, Schema Evolution) Compact, fast, binary, row-based data. Great for schema evolution (adding/deleting/renaming features/columns), allowing it to change over time.
RecordIO (Deep Learning): Combines multiple data sets into a single file. Stores and moves big datasets. Common use case is AWS Sagemaker.
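A small sketch converting CSV to Parquet with pandas (assumes the pyarrow engine is available); file names are hypothetical:
```python
import pandas as pd

df = pd.read_csv("sales.csv")                 # row-oriented text format
df.to_parquet("sales.parquet", index=False)   # columnar, compressed binary

# Columnar reads can pull just the columns a query needs:
subset = pd.read_parquet("sales.parquet", columns=["region", "revenue"])
```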
Feature Engineering
The art of selecting, modifying or creating new features from raw data to improve a model's performance.
Includes:
-Feature Creation from Missing Data
-Combining and splitting features
-Feature Scaling
-Encoding Categorical Variables
Watch for the Curse of Dimensionality: the more features you have, the sparser your data.
*Choose the features that are most relevant to the problem.*
Imputing Missing Feature Data: taking the mean of the column and replacing all missing values with that value (limited with outliers and categorical features). Not an overall best practice.
Other Options:
KNN: find the most similar rows and average their values
Deep Learning model: great for categorical values
Regression: find linear or non-linear relationships
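A short scikit-learn sketch of mean imputation vs. KNN imputation on toy data:
```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])

# Mean imputation: replace NaN with the column mean
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation: fill NaN from the most similar rows' values
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```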
Regression Analysis (*)
Method for modeling relationships between variables.
*Primary benefit in Machine Learning is to predict future values.*
Simple Linear: one independent variable is used
Multiple Linear: multiple independent variables are used
Logistic: categorical outcome variables
Binary Logistic: (yes/no) categorical variables
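A minimal scikit-learn sketch of simple linear vs. binary logistic regression on hypothetical data:
```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Simple linear regression: one independent variable
X = np.array([[5], [10], [15], [20]])   # e.g. house age (hypothetical)
y = np.array([300, 280, 250, 230])      # price in $k
linreg = LinearRegression().fit(X, y)
print(linreg.predict([[12]]))           # predict a future value

# Binary logistic regression: yes/no outcome
y_cls = np.array([1, 1, 0, 0])          # e.g. "sells within a month"
logreg = LogisticRegression().fit(X, y_cls)
print(logreg.predict_proba([[12]]))     # class probabilities
```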
Epoch/Batches/Steps
Epoch: One complete pass-through of entire training dataset by the ML algorithm. Multiple epochs are needed for the model to learn the underlying patterns in the data.
Batches: Small, manageable chunk of training data. Primary reason is memory efficiency, allowing for quicker iterations of the model in a single step.
Steps/Iterations: Process of training a model on one batch. After the step ends, updates are made to the model's parameters.
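A worked example of how the three relate, with hypothetical numbers:
```python
import math

n_samples = 10_000   # size of the training set (hypothetical)
batch_size = 32
epochs = 5

steps_per_epoch = math.ceil(n_samples / batch_size)   # 313 steps = 1 epoch
total_steps = steps_per_epoch * epochs                # 1565 parameter updates
print(steps_per_epoch, total_steps)
```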
Batch Sizes/Learning Rate
In hyperparameter tuning: how many examples the model "sees" before it updates its internal parameters.
Small batch: "bumpy, noisy" learning process. Tends not to get stuck in local minima (a point that looks like the best solution but really isn't).
Large batch: "smoother, faster" learning process. Can get stuck in local minima.
Small learning rates: increased learning over time. Slower process, but does not "overshoot" the optimal solution.
Large learning rates: "large steps, faster process." Can overshoot the optimal solution and cause performance to diverge.
*NOTE: Watch for too high or low of a learning rate.
Underfitting (*)
Responses are too simplistic to capture the underlying patterns in the data
Responds poorly both to existing data and new, unseen data
Fails to capture the specific nuances, correlations or details of the query. High bias.
*Typically occurs when the model is not trained on enough relevant data or does not understand the context.
How to resolve:
Increase model complexity
Add epochs
Add more features
Overfitting (*)
Excels at training data but fails to predict results correctly from new, unseen data.
When the model learns the data too well. including noise and outliers.
*Typically occurs when the model is trained too heavily on a specific set of data or is overly reliant on memorization rather than understanding.
How to resolve: (LEEDCP)
L1/L2 Regularization: adds specific penalties when coefficients become too large, preventing a model from becoming overly complex (see the sketch after this list).
L1: can shrink some feature coefficients all the way to zero
L2: shrinks all coefficients evenly (but not to zero)
Early stopping: stops training before too much noise is learned. Timing is essential.
Ensembling: averaging predictions over multiple models for more robust predictions.
Data Augmentation: changes the data set slightly each time the model processes it. Keeps data fresh and reduces rote memorization.
Cross-validation: data is split into equal-sized subsets (folds). One fold is held out for test/validation and the others are used for training/evaluation. This is repeated so each fold gets at least one turn as the test/validation set, and a performance metric is calculated.
Pruning/Feature Selection: identifies most important features and removes unnecessary ones.
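A minimal sketch of L1 vs. L2 regularization using scikit-learn's Lasso and Ridge; the data and alpha values are illustrative, not tuned:
```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] * 3.0 + rng.normal(size=100)   # only feature 0 truly matters

l1 = Lasso(alpha=0.5).fit(X, y)   # L1: irrelevant coefficients driven to 0
l2 = Ridge(alpha=0.5).fit(X, y)   # L2: coefficients shrunk, but stay nonzero

print(np.round(l1.coef_, 2))  # mostly zeros except feature 0
print(np.round(l2.coef_, 2))  # small values everywhere
```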
Binning/Encoding/Decoding
Binning: Converting continuous data into categorical "bins" (think of school grades: 90-100 = "A", 80-89 = "B", etc.)
Encoding: Transforming data into some numerical format that the model can understand.
Decoding: Taking the numerical format and converting it back into meaningful data.
Encoding techniques
One-hot encoding: Used for data pre-processing. Converts each category into a binary vector representation (where there is no inherent order) that neural networks can process.
Binary encoding: A compact method that first assigns a unique integer to each category, then converts that integer into its binary code.
Label encoding: Used when categorical variables have a "meaningful rank" to their categories.
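A short pandas sketch of one-hot encoding, label encoding, and binning on toy categories:
```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category (no inherent order)
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: map each category to an integer -- suits ranked
# categories like "small" < "medium" < "large"
sizes = pd.Series(["small", "large", "medium"])
labels = sizes.map({"small": 0, "medium": 1, "large": 2})

# Binning: convert continuous scores into grade categories
scores = pd.Series([95, 82, 67])
grades = pd.cut(scores, bins=[0, 70, 80, 90, 100], labels=["D", "C", "B", "A"])
```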

Confusion Matrix
Model evaluation tool used to compare predicted outcome vs actual outcome in simple, binary response (ie. is the picture a fish? a: yes/no) (TP/FP/TN/FN)
Helps provide precision/accuracy of the model.

Classification (*)
Accuracy: How many predictions the model got right overall.
*Not suitable for imbalanced datasets*
Precision: (Positive Predictions Pct) What pct of positive identifications were actually correct? The higher the better. (TP/(TP+FP))
"Of these 10 patients we flagged, how many actually were sick?"
*Important when the cost of a false positive is high*
(Being right when you say yes)
Recall: What proportion of actual positives were identified correctly? The higher the better. (TP/(TP+FN))
"Of these 100 sick patients, how many did our test successfully identify?"
(Not missing any yes's)
Specificity: (Recall's Counterpart) What proportion of actual negatives were identified correctly? (TN/(TN+FP))
(Not missing any no's)
F1 Score: Overall performance of Precision and Recall.
2 * (precision * recall) / (precision + recall)
AUC-ROC curve: A performance measurement for binary classification model across all classification thresholds
(1.0 = perfect model, 0.5 = random guessing)
Precision/Recall Curve (PR Curve): *Similar to AUC-ROC* Good for zooming in on specific use cases. Better for imbalanced datasets.
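A small scikit-learn sketch computing these metrics from hypothetical predictions:
```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # actual labels (hypothetical)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

print(confusion_matrix(y_true, y_pred))   # rows: actual, cols: predicted
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of the two

y_scores = [0.9, 0.4, 0.8, 0.2, 0.1, 0.7, 0.95, 0.3]  # predicted probabilities
print(roc_auc_score(y_true, y_scores))    # 1.0 = perfect, 0.5 = guessing
```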
Measuring/evaluating regression models:
Mean Absolute Error (MAE): Calculates the average absolute difference between actual and predicted values. Used for simple metrics; weights all errors, including outliers, equally.
Root Mean Squared Error (RMSE): Common metric to assess the performance of a regression model (the lower the better). Penalizes large model errors more heavily.
NOTE: Both MAE and RMSE are used to evaluate performance of regression models. Key difference is how they handle errors/outliers.
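A quick NumPy sketch showing how an outlier inflates RMSE more than MAE:
```python
import numpy as np

actual = np.array([10.0, 12.0, 15.0, 40.0])
pred = np.array([11.0, 11.0, 14.0, 20.0])   # one large miss (an outlier)

mae = np.mean(np.abs(actual - pred))            # weights all errors equally
rmse = np.sqrt(np.mean((actual - pred) ** 2))   # large errors dominate
print(mae, rmse)   # 5.75 vs. ~10.04 -- RMSE inflated by the outlier
```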

CPU vs GPU
CPU: low cost, generalized processor, designed for basic ML tasks, cleans data, handles small jobs. Used with training sets (AWS: T3, M5, C5)
GPU: specialized processing unit for real-time, computationally heavy tasks (deep learning). Ideal for computer graphics, vectors, multiple high-throughput, time-sensitive ML tasks (P-series NVIDIA)
ML Instance Types
Utilizes C-series, G-series, R-series, P-series, Inf-series and Trainium instances
C-series: Optimized for intensive Compute tasks. Ideal for model training plus big simulations. Cost-effective compute power.
G-series: Graphic optimized instances. Image recognition, object detection and video analysis
P-series: Performance optimized for Deep Learning. Heavily computational, large scale model training.
R-series/X1e: Provides superior memory capacity.
Inf-series: Optimized for faster prediction or model inference tasks. Real-time suggestions, chatbots
AWS Trainium instances: meets sustainability goals when training ultra-large models. Not used for inference.
Amazon Titan Models
Amazon Titan Text Embeddings: natural language converted to numerical representations called "embeddings" or "vectors"
Use case: real-time, low-latency applications via the InvokeModel API and for large-scale processing tasks using batch jobs, which are optimized for high throughput.
Amazon Titan Image Generator G1: text to image with simple language prompts.
Use case: Advertising, marketing, design, architecture
Amazon Titan Text: text related tasks for Enterprise use cases.
Use case: Open-ended text generation, conversational chat, summarization, RAG
Amazon Titan Multimodal Embeddings G1 model: multi-modal model (both image and text), converting to embeddings/vectors.
Use case: Perform searches across various data types
SageMaker AI
Fully managed, one-click, comprehensive and scalable solution for building, training, and deploying high-quality models at scale for data scientists and developers.

Sagemaker Studio
Visual IDE for teams used for end-to-end ML collaboration and development
AWS SSO/IAM based access; Used with a single domain
ie. CodeEditor, JupyterLab, RStudio, Canvas
Sagemaker Notebooks
Fully managed, Jupyter environment
Comes with: pre-installed libraries, easy set-up, automatic saving and scalability.
Audience: Individual data scientists (simple, focused environment)
Sagemaker Domain
Dedicated, collaborative workspace for your data scientist teams, streamlining administration, enhancing security, and seamlessly integrating with Sagemaker Studio.
Centralized management hub for your Machine learning environments.
Provides: Shared spaces, Shared EFS volumes, 2 default VPCs
SageMaker Canvas
Perform visual data preparation and feature engineering, create and compare models, and generate predictions by simply interacting with the data via a visual, no-code, point-and-click interface.
Perfect for non-technical users (part of Sagemaker Studio)
Sagemaker Jumpstart
Open-source hub for pre-trained models with full control and customization to jump-start the modeling process.
Provides blogs, videos and training.
SageMaker Data Wrangler (*)
Prepares data using a simple visual interface for modeling/ML deployment (ETL Pipeline)
Verify data quality and detect anomalies
Balance "class imbalance" issues (over/under sampling)
Can use preconfigured templates
Quick ML accuracy estimates
Can be automated
Works with a variety of tools: SageMaker Studio, Sagemaker Pipelines and Sagemaker Feature Store
(*)Types of splits: Randomized, Ordered, Split-by-key, Stratified
Support features such as binning, scaling and one-hot encoding
SageMaker Feature Store
Central repository to share, store, update and retrieve feature definitions for your models from multiple sources.
On-line/off-line repositories:
On-line: real-time inference, low latency access
Off-line: model training/batch inference/historical data
Works with streaming data (Kinesis), CI/CD pipelines and batch data
SageMaker Processing
Highly flexible, managed service that assists with pre-processing/post-processing and model evaluation, before model deployment.
*Used by developers and data scientists for complex, custom code-based workloads.
SageMaker Training Compiler
Accelerates the training of deep learning models by making the code run faster on the GPU.
Resources:
Warm Pools: retrain/re-use provisioned infrastructure
Checkpointing: Creates snapshots during your training
Distributed Training: Allow multiple jobs to run in parallel
Cluster Healthchecks: monitoring system for the compute cluster
Automatic Restarts: recovery mechanism when a job fails
Sagemaker Automatic Model Tuning (AMT)(*)
Automatically fine tunes hyper-parameter ranges based on objective metrics which you pre-define.
Set performance metric and training algorithm to start tuning job.
Uses a Bayesian optimizer to choose the optimal hyper-parameters.
Features: Parallel training, Cost-effective, Automated, Fully-customizable
"It learns as it goes" so it doesn't have to try every possible combination of the parameters.
Automated Hyperparameter Tuning
Early stopping
Warm start
Resource Limits
Transfer Learning
*NOTE: Similar to Canvas. The key difference is that Canvas is a graphical user interface, whereas Autopilot uses SageMaker API or SDK, giving developers and data scientists more programmatic control
Sagemaker Debugger
Automated quality control as your ML models learn.
Automatically monitors and analyzes your training data in real-time, catching issues early in the process.
Define rules for detecting unwanted conditions while training
Auto-generated performance reports and insights dashboard.
Sagemaker Profiler (*)
Provides detailed information about the resource utilization of your training job, including CPU, GPU, memory, and I/O operations
Identify bottlenecks and optimize training process
Bias vs Variance (*)
Bias: error due to overly simplistic assumptions in the model not allowing it to capture underlying patterns in the data (underfitting)
Variance: error due to the model being too complex and sensitive to small fluctuations in the training data (overfitting)
Augmented vs Synthetic Data
Augmented: data is augmented from existing data set
Synthetic: new data set is created
SageMaker Clarify (*)
Provides greater insight into fairness, explainability and biases of your models
Increase transparency to your stakeholders, informs human decision making, and tracks whether a model is performing as intended.
Explains how input features affect the output of your model, ranked by importance.
Provides a comprehensive suite of tools to help developers build more transparent and less biased machine learning models.
Uses continuous monitoring for detecting changes in feature/bias attribution drift after model is deployed.
*NOTE: Can also use SHAP (local and global) and LIME (local) for model explainability.
*Integrated with SageMaker Studio and SageMaker Model Monitor
SageMaker Autopilot (ie. AutoML)(*)
Fully managed, end-to-end ML service.
Automates algorithm selection, data preprocessing, model tuning, "trial and error" testing, and incorporates all associated infrastructure.
*Provides a model leaderboard that shows which ML model works best.*
Three Training modes to Choose From:
Ensemble: trains several base models (XGBoost, RandomForest) to provide predictions.
Hyperparameter Optimization (HPO): precision tuning one, single best algorithm.
Auto: Lets Autopilot choose the most appropriate strategy.
(if dataset < 100 MB = Ensembling; if > 100 MB = HPO)
NOTE: Integrates with Sagemaker Clarify
Sagemaker Model Optimizer
Automatically lessens the model's complexity while maintaining accuracy.
When low latency and high throughput are essential.
Sagemaker Neo
Optimizes machine learning models for edge devices.
Key Features: Model Optimization, Increased Speed and Efficiency (smaller, less complex), Cross-Platform Deployment
Used with AWS IoT Greengrass
Sagemaker Model Parallel Library
Trains very large deep learning models that are too big to fit into the memory of a single GPU.
SageMaker Ground Truth
Fully managed, crowd-sourced data labeling service for image classification and object detection.
A combination of machine learning and human feedback where human workers only intervene in complex tasks, improving labeling efficiency.
*Focuses on preparing data before model training*
Reinforcement learning through machine or human feedback (internal employees, 3rd party contractors, Amazon Turk)
*Ground Truth Plus: AWS fully managed, turn-key solution.
Use cases: build high-quality training data sets for ML, reducing time and cost
Sagemaker Pipelines
CI/CD service dedicated to the automation of ML Workflow.
Manages raw data preparation to model release along with performance tracking.
Supports manual approval workflows
*Includes retraining based on data/concept drift detected by Sagemaker Model Monitor*
SageMaker Model Registry
Organize, catalog and track versioning of models, metadata and approval status.
Manages approval status of models and deploys models.
Sagemaker Experiments
Organize, compare, and search experiment trials and historical ML training jobs.
Provides a structured way to manage the inputs, parameters, configurations, and results of your training runs, making it easier to analyze performance and reproduce results.
Deeply woven into the fabric of Sagemaker.
Sagemaker Endpoints /Inference Endpoints
Fully managed service, creating a dedicated secure web address for real-time inferences of your trained model.
Key Features in ML:
Autoscaling: automatically adjusts the number of instances based on request volume or latency targets.
Multi-Model Endpoints: It places several models at one address, sharing resources/cutting costs.
Production Variants: deploy multiple models behind a single endpoint. Primarily used for testing and comparing models.
Traffic Shifting: It sends part of the traffic to a new model during an initial run.
Secure Access: It works with VPC, uses SSL encryption plus IAM-based entry.
SageMaker Serverless Inference
Fully managed, serverless, cost-effective inference tool used when the model has idle time and uneven traffic.
Ideal for workloads that can handle idle time/cold starts.
*NOTE: Specify container, memory, and concurrency requirements.
Sagemaker Inference Recommender
Recommends the best instance type/configuration for your model, automates load testing and model tuning, and deploys to the optimal endpoint.
Good for cost-optimization.
*NOTE: Use for pre-deployment. Use Cost Optimizer after free trials are complete.
Shapley vs PDP models
Shapley: Provides a local, individualized explanation of the contribution each feature makes to a model prediction.
PDP: Provides global explanation to understand overall relationship between a feature and the model behavior across a dataset.
Inference Pipeline
Combines pre-processing, predictions, and post-processing.
Linear sequence of containers.
SageMaker Tensorboard
Helps ML engineers visualize, understand and debug during model training/development.
Used for manual inspection rather than real-time monitoring.
Problem it solves: helps you understand if your model is learning effectively, if its architecture is correct, and how different hyperparameters are affecting performance
SageMaker Model Monitor
Continuous monitoring system used to detect deviations in data quality and model quality after a model has been deployed.
Assesses quality of the models in production on a scheduled basis.
Can be used to configure alerts for anomalies, outliers, data drift and new features.
*Integrated with Cloudwatch*
Supervised vs Unsupervised training
Supervised: Learns from labeled datasets
Unsupervised: raw, unstructured data
Sagemaker Built-in Algorithms (part I)(*)
*Different models/algorithms serve different purposes*
*First understand the use case
Linear Learner: determines simple linear relationship between variables. Baseline for classification and regression tasks. (supervised)
Use Cases: Predicting house prices based on age
Seq2Seq: Sequence of tokenized text files (encoder/decoder) (supervised)
Use case: Speech-to-text (AWS Transcribe)
XGBoost/LightGBM: Boosted group of decision trees used to sequentially learn from/correct errors/update from previous versions. Highly accurate, interpretable with emphasis on features. (supervised)
Use case: Sales forecasting, Customer churn
Image Classification: Assigns labels to the objects within an image.
Use Case: Classifying images
Object Detection: Detects, identifies and categorizes all objects in an image with bounding boxes. Provides confidence scores.
Use case: AWS Rekognition, Self-driving cars
Semantic Segmentation: High-precision, image analysis. More specialized than object detection.
Use case: Medical
Sagemaker Built-in Algorithms (part II)(*)
BlazingText/Word2Vec/Doc2Vec:
(BT) High-performance word embeddings and text classification for large datasets. Customized for your environment (compared to AWS Comprehend)
(W2V) Vector embeddings for words. Smaller datasets.
(D2V) Similar to W2V but used with many object types.
Use case: Sentiment analysis, Genre predictions
K-Nearest Neighbor (KNN): returns the most common classification among the nearest neighbors in latent space, based on proximity to labeled data. (supervised)
K-Means Clustering: large scale clustering of data points into groups in latent space (unsupervised)
DeepAR: Time-series forecasting. Finds frequencies and seasonality. (supervised using RNN)
Use case: Demand forecasting, Financial predictions
Factorization Machines: Extension of linear learning, dealing with features and high-dimensionality, sparse data (both supervised and unsupervised)
Use case: Recommendation engines where user has rated very few items.
Random Cut Forest: Anomaly detection (unsupervised)
Use case: Unstructured, streaming data
Neural Topic Model: Neural-network topic model. Organizes documents into topics. Expensive and computationally heavy. (unsupervised)
Use case: Nuanced sentiment analysis
LDA: Another topic model (not using neural networks). More general purpose. Cheaper, efficient and more widely used. (unsupervised)
Use case: Discovering research trends
PCA: dimensionality reduction (unsupervised)
IP Insights: Security tool. Finds suspicious behavior. (unsupervised)
Use case: Security teams
Sagemaker Input Modes (*)
S3 File Mode: copies training data from S3 to a local directory in the Docker container. Good for a small amount of data.
S3 Fast File Mode: streaming from S3 source. Speeds up training time. Can do random access or sequential. (recommended)
Pipe Mode: streaming from S3 sequentially, continuous, forward-moving. Used for streaming high throughput, large datasets.
S3 Express One-Zone: high performance within a single Availability Zone.
FSx for Lustre: High performance, parallel distributed shared file server for compute heavy workloads.
Retrieval Augmented Generation (RAG)
A technique which allows a Foundation Model (FM) to retrieve data from an external source outside of its training data, producing an augmented prompt.
With RAG, language models can go beyond their original training data to use up-to-date, real-world information.
Use Case Examples: Customer Service Chatbots, Legal Research and Analysis, Healthcare Q/A
Products: Amazon Bedrock, AWS Kendra, AWS Sagemaker Jumpstart
Note: One of the challenges in question answering is the limited number of tokens in the context; RAG resolves it. This architecture pattern produces reliable, accurate responses for personalized and specific use cases.
Note2: Does not change the weights of the model (unlike fine-tuning)
Note3: Very sensitive to the prompt templates.
Amazon Bedrock
Serverless, fully managed service that makes pre-trained FMs from Amazon and leading AI startups, available via an API
Keep control of all data used to train the model (data does not leave your account)
Security, Privacy and Governance and responsible AI features included.
Pricing: On-demand/Batch, Provisioned Throughput/Custom (provision throughput for performance requirements in exchange for a time-based term commitment)
*Charged by the number of input tokens received and the number of output tokens that are generated // or image generated*
NOTE: Custom models can only use Provisioned Throughput (time-based commitment)
Fine-tuning Amazon Bedrock
Instruction-based Fine tuning: adapting a pre-trained model to follow instructions and perform specific tasks
Continuous Pre-training: continually updating model with the latest data using unlabeled data.
Single-turn messaging: single interaction between user and LLM (chatbots)
Multi-turn messaging: conversation between user and LLM, maintaining conversational context and understanding.
Usage: chat bots, training with exclusive data, targeted use cases, training with more up to date data.
Higher accuracy and lower latency. Loses generalization.
Fine tuning will change the weighting of the base foundational model (FM). Requires a higher budget and prepared data.
Training data: must adhere to a specific format and be stored in S3.
Use VPC with Privatelink to ensure secure data transfer during fine-tuning.
Note: not all models can be fine tuned.
Amazon Bedrock - Agents
Fully managed, securely connected and stateful piece of software that manages and carries out multi-step tasks automatically.
Agents orchestrate multiple, complex interactions between foundation models (FMs), data sources, software applications, and user conversations.
Agents help users execute steps by calling appropriate tools, APIs and services.
Agents are proactive and autonomous, using a reasoning engine to plan and decide what to do.
*Note: Can automate data analysis, provide actionable insights, trace through chain-of-thought reasoning, gather external information, and enable proactive decision making within the enterprise.
Amazon Bedrock - Guardrails
Control the interaction between users and the FMs and for monitoring model outputs.
Filter any undesirable and harmful content or key words/topics
Remove Personally Identifiable Information (PII)
Reduce hallucinations
Use only internal vs external information
Amazon Bedrock Studio
Web-based IDE environment for developers to quickly build and iterate GenAI applications using existing foundation models (FMs).
Specialized workshop where you take powerful, pre-built engines (the FMs) and quickly assemble them into applications.
Amazon Bedrock - Action Groups
Define the tools available to the LLM and guide it on when to use each based on the user's query
Amazon Augmented AI (A2I)
Combines ML automation (for easy tasks) and human review workflow (for tougher problems) in ML models
Used for analysis and predictions.
Amazon Q Business
Fully managed, Enterprise AI-powered solution trained on your internal data that answers questions, generates content, creates summaries, and automates routine tasks
Built with Amazon Bedrock (you cannot choose the underlying FM)
Third party plug-ins for data source integrations (Slack, MS Teams, etc)
Delivered using a built-in web experience or through APIs.
Admin controls are available; similar to guardrails.
Customizable security. User logins with IAM Identity Center
Amazon Q Apps
Create Gen-AI apps (without coding) by using natural language.
Amazon Q Developer
AI coding companion which helps increase productivity by developing code directly within the developer's IDE, tracking dependencies, and ensuring compliance with open-source licenses
Broader tool which helps you understand your AI environment (compared to Code Whisperer)
Answers questions about AWS documentation and AWS service selection.
Features include: assisting with developer onboarding, writing boilerplate code (AI Code Companion), using unfamiliar languages, and detecting security vulnerabilities
AWS Glue
Fully managed, serverless extract, transform, and load (ETL) service that pulls from multiple sources, allowing for fast data integration.
Primarily for data warehouse & data lake integration
Easy for customers to prepare and load their data for analytics into a Redshift database
Use case:
Transform data from one format to another (.CSV to Parquet),
Migrate data from one system to another,
Prepare data for analytics/ML,
Improve data quality (error correcting),
Visualize data for BI
Works with Lambda functions via EventBridge
AWS Glue Data Catalog
A central repository to store structural and operational metadata for all your data assets.
AWS Glue Crawler
1) Scans/analyzes data,
2) Extracts schema info
3) Creates metadata tables in the AWS Glue Data Catalog.
AWS Glue Studio
Visual interface for defining and creating complex ETL workflows in an integrated development environment (IDE).
AWS Glue Jobs
Where the ETL is executed.
ie. AWS Glue FindMatches transform: identifies and merges duplicate records within a dataset.
AWS Glue Data Quality
Rules that allow you to assess and monitor the quality of your data, which is crucial for making informed business decisions.
AWS Glue DataBrew (*)
Visual UI for cleaning and pre-processing large data sets.
Over 250 ready-made transformations.
Handles PII and automatic outlier detection.
Good for non-technical users.
EMR (Elastic MapReduce)/ EMR Serverless
Big data cloud platform for large scale, complex data processing using open source tools (Hadoop/Spark) running on EC2.
Help you to analyze, transform and move large amounts of data, identify patterns, and make predictions.
Highly flexible and customizable
Consists of a master node (manages the cluster), core node (hosts HDFS data) and task node (runs tasks).
Can be used on transient and long-running clusters.
EMR Serverless: pre-initialized capacity, fully automated solution
*NOTE: EMR is the platform and Spark is the engine that runs on EMR
Amazon Lookout (*)
Uses ML (Detectors) to detect and diagnose anomalies in datasets.
Tailored for business users rather than data scientists (compare SageMaker Model Monitor)
Uses: S3, Athena, Redshift, Cloudwatch, third-party (Salesforce)

Amazon Fraud Detector
Fully managed fraud detection, customized to your data.
Provides insights into which model feature variables are relevant when detecting fraud.
Ingests data from S3 or API
Apache Spark
Lightning-fast, open-source engine used for large-scale data processing, query optimization and analytics.
Uses in-memory caching for optimized query execution.
Handles large, heterogeneous data sets.
Breaks streams into micro-batches for processing.
Supports code reuse across batch processing, queries, Machine Learning.
Spark Streaming: integrates with Kinesis, Kafka, Redshift and EMR.
Can be integrated with Athena.
*NOTE: Used as a better alternative to MapReduce
Apache Flink(*)
Stateful, open-source tool (available fully managed on AWS) used with big data flows for quick event-time processing and minimal delays.
Can handle out-of order events using "event timestamps" (rather than processing time)
Ideal for fast processing: streaming ETL, continuous metric generation, real-time analytics, fraud detection, and monitoring
Apache Kafka / Amazon Managed Streaming for Apache Kafka (MSK)
Apache Kafka: Open-Source platform for building real-time streaming data pipelines and applications.
Fully managed Apache Kafka on AWS - *Alternative to Kinesis*
Can create custom configurations for your cluster (unlike Kinesis)
Authentication and Authorization allows you to determine who can read/write to specific topics.
Monitoring through Cloudwatch and Prometheus
Apache Airflow(*)
Open-source tool for scheduling, defining and monitoring complex workflows.
DAG (Directed Acyclic Graph) is a key component used for providing the sequence/dependencies for how the tasks should be executed
Provides flexibility, version control, and collaboration benefits.
Training Parameter Datasets
Small dataset: You can use a smaller batch size with more epochs to make the model more accurate.
Large dataset: You can use a larger batch size with fewer epochs for faster model training.
Complex models: You should use more epochs and be careful with batch size tuning if your model is complex.
Transformers
Giant neural networks enabling parallel computation
ie. Chat, Question and Answer, Sentiment Analysis, GPT (OpenAI)
Self Attention
Learns the contextual meaning of words
Masked Self Attention
A mask is used to prevent tokens from "peeking" into the future tokens (words)
BERT & ROUGE Score (*)
Measures for semantic similarity between generated and reference text.
*Used for predicting missing words from texts, filling in gaps in transcripts.*
Less reliable for short texts
Tokenization
Converting raw text into a sequence of tokens
Tokens can be words, phrases, or individual characters (like a period).
Each token has an ID which is easier for LLMs to process.
1000 tokens is approx 750 words
Token Embeddings
Captures semantic relationship between tokens
Inference Parameters (*)
System Prompts: describes how the model should behave and reply
Temperature (0 to 1): creativity and randomness of the output.
Lower values = more consistent, predictable
High values = more random and creative
Top P (0 to 1) cuts off low probability word choices based on cumulative probability. It tightens overall response distribution.
Top K (amount:#): how many words are to be considered for choosing the next word. Higher K increases sample size.
Maximum token count/Length: maximum length of the answer (response length, length penalty, stop sequences)
Stop Sequence: tokens that signal the model to stop generating output. (ie. ".", "?")
Prompt Latency: how fast the model responds.
Image-to-Image Prompt: inputting one image to get variations
Inpainting: reconstructing the missing parts of an image
Outpainting: constructing a seamless extension of an existing image.
Prompt Strength: Determines variability between prompt and output of the picture
Generation Steps: sequential, iterative process of predicting and refining the output (more steps generally means higher quality but slower generation)
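A hedged boto3 sketch of setting these parameters when invoking a Bedrock text model; the request body follows the Amazon Titan Text schema, and parameter names vary by model family:
```python
import boto3, json

client = boto3.client("bedrock-runtime", region_name="us-east-1")

body = {
    "inputText": "Summarize the benefits of columnar file formats.",
    "textGenerationConfig": {
        "temperature": 0.2,     # low = consistent, predictable output
        "topP": 0.9,            # cut off low-probability word choices
        "maxTokenCount": 256,   # maximum length of the answer
        "stopSequences": []     # tokens that halt generation
    }
}

response = client.invoke_model(
    modelId="amazon.titan-text-express-v1",
    body=json.dumps(body)
)
print(json.loads(response["body"].read())["results"][0]["outputText"])
```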
Hyperparameter tuning strategies (*)
In model training, used as a final step in ML workflow to fine-tune configuration settings to achieve the optimal performance
Hyperband: Multi-fidelity based tuning strategy which dynamically reallocates resources. Highly efficient and fast. Used for early stopping when resources are under-performing.
Grid search: Defines all possible values. Exhaustive and computationally heavy, yet provides best performance.
Random search: Samples a fixed number of random combinations. Addresses complexity of grid search
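A minimal scikit-learn sketch contrasting grid search and random search; the model and parameter ranges are illustrative:
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)
params = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

# Grid search: tries all 9 combinations -- exhaustive but expensive
grid = GridSearchCV(RandomForestClassifier(random_state=0), params, cv=3)
grid.fit(X, y)

# Random search: samples a fixed number of combinations -- cheaper
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), params,
                          n_iter=4, cv=3, random_state=0)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)
```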
Transfer Learning
Fine-tunes a pre-trained model for a new task, such as feature extraction.
Embedding vectors
The ML model/algorithm that creates the vector embeddings.
Transforms text, images or audio into numerical embeddings, capturing the semantic meaning and relationships between data points.
Visualization can be used to plot the relationships between embeddings.
Use cases: Q&A, Personalized recommendations
Chunking/Clustering
Used with longer text, dividing text into smaller groups, ensuring the model can process the text efficiently.
Helps manage the limitations of how many tokens can be processed at one time.
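A naive chunking sketch; word counts stand in for exact token counts, and the sizes are arbitrary:
```python
def chunk_text(text, chunk_size=200, overlap=20):
    # Split a long text into fixed-size, overlapping word chunks so each
    # piece fits within a model's token limit
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks

document = "word " * 1000            # placeholder long document
print(len(chunk_text(document)))     # number of chunks produced
```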
ML Deployments
Blue/green: two separate, production environments. Traffic is changed using a binary switch. Allows for near-instantaneous switching/rollback, minimizing risks.
Canary: small subset of users exposed to new environment
Shadow Testing: Simulates a copy of the production model under real-world conditions without affecting live traffic/users.
VPC Endpoints
Keeps traffic within a private network and connects to other VPCs and services without going through public internet
Provisions an ENI with a Private IP in the subnet
Can be associated with security groups to allow/deny traffic
Utilizes PrivateLink for connectivity
Connect to Amazon Bedrock privately.
*VPC endpoints are a security tool first, and then a connectivity device*
PrivateLink
Securely connect your VPC to AWS services/marketplace and other VPCs
Note: Does not support IPv6