Artificial Intelligence is defined as
Machine behavior and functions that exhibit the intelligence and behavior of humans.
CPMAI Six Phase Methodology
Business Understanding — Define the problem, evaluate AI fit, set KPIs and Go/No-Go criteria
Data Understanding — Identify data needs, assess quality, availability, and data roles
Data Preparation — Clean, label, transform, and engineer data for model training
Model Development — Select algorithms, train, tune, and validate the model
Model Evaluation — Test against business and technical KPIs; determine readiness for deployment
Model Operationalization — Deploy, monitor, manage, and iterate the live AI system
Seven Patterns of AI?
Conversational & Human Interaction: Machines interact with humans using natural language (voice, text, images). Examples: chatbots, voice assistants, content generation, sentiment analysis, machine translation.
Recognition: Machines identify/understand real-world unstructured data. Examples: facial recognition, image classification, handwriting ID, object detection, gesture recognition.
Predictive Analytics & Decision Support: Uses ML to predict future outcomes from past/existing data. Supports human decisions — does NOT remove the human from the loop. Examples: dynamic pricing, equipment failure prediction, fraud detection, demand forecasting.
Goal-Driven Systems: Uses reinforcement learning to find optimal solutions through trial and error. Examples: game playing, resource optimization, robo-advising, bidding/auctions, simulations, iterative problem-solving.
Hyper-Personalization: Develops a unique, evolving profile of each individual using ML. Examples: personalized healthcare treatments, personalized finance plans, personalized education/training, behavior profiling.
Autonomous Systems: Physical and virtual systems that accomplish goals with minimal/no human involvement. NOT the same as automation. Examples: autonomous vehicles, autonomous robots, autonomous software systems.
Patterns & Anomalies: Identifies patterns in data and anomalies that deviate from those patterns. Examples: fraud detection, outlier detection, cybersecurity, predictive time-series analysis, content summarization.
- Perception (sensing and processing the environment)
- Prediction (foreseeing what might happen next)
- Planning (acting based on perception and prediction)
Together they form an intelligent feedback loop.
What is the difference between Automation and Autonomous?
Automation handles repetitive, programmed, predictable tasks (it follows defined rules).
Autonomous means a system can perform dynamic, complex tasks with minimal or no human involvement, using intelligence to handle variability (think of a Waymo vehicle).
What is Algorithmic Explainability (XAI)?
The ability to understand and explain WHY a model arrived at a specific prediction or decision. Required by law in many regulated industries. Deep learning models are often NOT explainable (black box), making XAI compliance a Phase I constraint to identify early.
The Four V’s of Big Data
Volume: The challenge of dealing with very large amounts of data, often spread across different locations.
Velocity: The challenge of data that is rapidly changing or moving, requiring processing at high speed with necessary accuracy. Examples: streaming data, stock prices, sensor data from aircraft.
Variety: The challenge of data in different formats, from different locations, and with varying levels of structure: structured, unstructured, and semistructured.
Veracity: The challenge of data with varying levels of quality, accuracy, trustworthiness, and consistency. Problematic especially at scale.
AI Data Types: The 10/80/10 Rule
Structured (~10%): Data with a defined format and rigid schema (rows and columns). Examples: relational databases, spreadsheets (CSV, Excel), SQL, Parquet, ERP/CRM systems. Key Characteristic: Easiest to work with. Standard BI tools and SQL queries apply. Easy to extract specific fields.
Unstructured (~80%): Data without a predefined schema — highly variable even within a single domain. Examples: images, videos, audio, emails, PDFs, social media posts, text documents. Key Characteristic: Requires specialized tools (ML, text/vision analytics). Cannot be queried with SQL. Holds the most untapped organizational value.
Semistructured (~10%): Data with partial structure via tags or metadata, but with variable content. Examples: JSON, XML, HTML, NoSQL databases, invoices, system log files. Key Characteristic: Requires parsers or NoSQL queries. Has elements of both structured and unstructured data.
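The practical difference between the three data types shows up in tooling. Here is a minimal pure-Python sketch (the records and field names are invented for illustration): structured data can be addressed by column, semistructured data needs a parser that tolerates variable fields, and unstructured text yields to neither approach.

```python
import csv
import io
import json

# Structured: fixed schema, addressable by column name (CSV stands in for SQL).
csv_text = "id,amount\n1,10.00\n2,24.50\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(float(r["amount"]) for r in rows)

# Semistructured: JSON carries tags/metadata, but fields vary per record,
# so the parser must tolerate absent keys.
json_text = '[{"id": 1, "note": "paid"}, {"id": 2}]'
records = json.loads(json_text)
notes = [r.get("note", "<missing>") for r in records]

# Unstructured: free text has no schema at all; extracting meaning from it
# needs heuristics or ML, not a query language.
free_text = "Customer emailed to say the invoice looked wrong."

print(total)   # 34.5
print(notes)   # ['paid', '<missing>']
```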
Data Steward
Primary Responsibilities: Ensures data is accessible, trustworthy, usable, and secure. Enforces data governance policies. Manages data quality, lineage, cataloging, and monitoring. Serves as data advocate.
Key Distinguisher: Governance/quality/access role. Has both technical AND soft skills. Works across IT and business units. Owns data lineage documentation.
Data Engineer
Primary Responsibility: Builds and maintains data pipelines, data ingestion systems, ETL processes, and data infrastructure. Focuses on technical movement, transformation, and storage of data.
Key Distinguisher: Technical pipeline role. Moves data from source to destination. Builds the infrastructure the data scientist and AI model depend on. Most of the 80% data engineering effort lives here.
Data Scientist
Primary Responsibility: Extracts analytical insights from data. Builds and applies ML models. Translates business needs into mathematical/statistical approaches. Tests hypotheses using algorithms.
Key Distinguisher: Analytical modeling role. Works on classification, regression, clustering, and prediction. Focuses on insight and model quality, not pipeline infrastructure.
Data Governance
The set of processes, procedures, standards, roles, and tools an organization implements to ensure data is properly stored, managed, accurate, available, secured, and controlled over its life cycle. Deals with the WHAT and WHY — defining the policies, rules, and accountability structures. Addresses: security risks, privacy risks, data ownership, auditing, access control, data sharing, compliance.
Data Stewardship
The PRACTICE of ensuring the organization's data is accessible, trustworthy, usable, and secure. Focuses on the HOW — ensuring governance policies are actually enforced and followed. Encompasses the full data life cycle: collecting, transforming, using, storing, archiving, deleting.
3 Meanings of Bias in AI
Neural network bias — a mathematical adjustment factor in model weights.
Bias-variance trade-off — model tendency to underfit or overfit.
Informational bias — overrepresentation/underrepresentation in data.
Types of Analytics - 4 Categories
Descriptive Analytics: 'What happened?' Focuses on understanding historical data, relationships, and comparisons. NOT forward-looking. Uses reporting tools, charts, summaries. Best for deterministic, historical reporting.
Diagnostic Analytics: 'Why did that happen?' Focuses on identifying cause-and-effect relationships. Explores root causes of historical events. More sophisticated than descriptive.
Predictive Analytics: 'What could happen?' Uses past and current data to make predictions about future or unknown events. Requires ML models and data science techniques.
Prescriptive/Projective Analytics: 'What if?' Focuses on identifying the potential impact of decisions based on current data. Simulates scenarios and recommends optimal actions. Most sophisticated analytics type.
What is Ground Truth Data?
Training data gathered from real-world sources, reflecting actual users, transactions, and events. Key challenges: incompleteness, need for labels/annotation (especially for supervised learning), format and cleanliness issues, accuracy problems, labeling effort, and privacy/confidentiality concerns.
What is Training Data?
A data set of prepared, cleaned, and appropriately labeled data (if used for supervised learning) that is used to incrementally train an ML model to perform a particular task. It must be representative of real-world conditions to produce a reliable model.
What 3 core aspects of data must be addressed in Phase 2 (Data Understanding)?
Data Sources: where does the data come from, who owns or stewards it, and how is it accessed?
Data Description: what is its structure, type, and format — structured, unstructured, or semistructured?
Data Quality: how clean, complete, and representative is it?
Phase 2 (Data Understanding) Go/No-Go Criteria?
You can advance to Phase III when:
Data requirements have been adequately determined
Adequate workbook responses exist for Phase II items
No critical roadblocks prevent project success.
You do NOT need final answers — but you must know what data you have, where it is, and how to access it.
80/20 Rule
80% of an AI project's total effort is data engineering (Phases I-III), and only approximately 20% is model development (Phases IV-VI). This has direct consequences for team composition and resource planning.
Implication for Team Composition: You need significantly MORE data engineers than data scientists on an AI project team. Overstaffing data scientists and understaffing data engineers is one of the most common misallocation mistakes.
Implication for Project Planning: Most of the time, budget, and effort is spent BEFORE the model is ever trained. Phase III alone (data preparation) is where the majority of hands-on work occurs.
Implication for Sponsors: Sponsors who expect quick model results underestimate the data engineering investment. CPMAI requires communicating the 80/20 reality to stakeholders at Phase I.
The Exam Trap: GenAI projects do NOT reduce the 80/20 rule. CPMAI states that GenAI projects often require MORE data management intensity, not less — foundation models still need domain-specific data preparation, prompt engineering data curation, and RAG infrastructure.
Data: Splitting / Augmentation / Multiplication / Transformation / Labeling
Data Splitting: Separating data into training, validation, and test sets. Includes data sampling for very large data sets and data attribute pruning to reduce size and complexity.
Data Augmentation: Enhancing the QUALITY of existing data with additional manipulations or combinations that add necessary information to the original data. Especially important for unstructured data (text, images, video, audio).
Data Multiplication: Increasing the QUANTITY of prepared data by transforming and manipulating existing data. Example: starting with 1,000 images and generating 100,000–200,000 through transformations. Distinct from augmentation (which adds quality/information); multiplication adds volume.
Data Transformation: Changing data from one state to another — converting format, altering metadata. Part of the ETL/ELT process to change data into the right format for storage or model training.
Data Labeling: Adding descriptive tags or metadata to training data — especially for supervised learning. Provides meaning so models can learn by example. Without labels, supervised learning cannot proceed.
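The splitting step above can be sketched in a few stdlib lines. This is a hypothetical helper (the 80/10/10 fractions and fixed seed are illustrative defaults, not CPMAI prescriptions); real projects often reach for a library routine such as scikit-learn's train_test_split instead.

```python
import random

def split_data(records, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle, then split into train / validation / test sets."""
    rng = random.Random(seed)          # fixed seed => reproducible split
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_data(list(range(100)))
print(len(train), len(val), len(test))  # 80 10 10
```

Shuffling before splitting matters: if the source data is ordered (by date, by class), an unshuffled split produces unrepresentative sets.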
ETL vs. ELT - Data Pipeline Approaches
ETL (Extract, Transform, Load): Traditional method for loading data into DATA WAREHOUSES. Transform step occurs BEFORE loading because data warehouses require data in a specific format. Steps: Extract from source → Transform (merge, combine, convert) → Load into warehouse. Best for: structured data, relational systems, when data must conform to a fixed schema.
ELT (Extract, Load, Transform): Modern method introduced with DATA LAKES. Data is loaded as-is first, then transformed when needed for specific tasks. Steps: Extract from sources → Load into data lake → Transform on demand. Best for: big data environments, diverse formats, when flexibility is needed and transformation can be deferred.
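A toy sketch of the ordering difference, with in-memory lists standing in for a warehouse and a lake (all names here are invented for illustration): ETL cleans rows before they are loaded, while ELT loads raw rows as-is and defers the same transformation until needed.

```python
# ETL: transform happens BEFORE the load, so only schema-conforming
# rows ever reach the warehouse.
def etl(source_rows, warehouse):
    for row in source_rows:
        transformed = {"name": row["name"].strip().title(),
                       "amount": float(row["amount"])}
        warehouse.append(transformed)          # load already-clean rows

# ELT: load raw rows into the lake first; transform later, on demand.
def elt_load(source_rows, lake):
    lake.extend(source_rows)                   # no upfront schema enforcement

def elt_transform(lake):
    return [{"name": row["name"].strip().title(),
             "amount": float(row["amount"])} for row in lake]

source = [{"name": " ada lovelace ", "amount": "12.5"}]
warehouse, lake = [], []
etl(source, warehouse)
elt_load(source, lake)
print(warehouse == elt_transform(lake))  # True: same result, different timing
```

Note the lake still holds the untouched raw row, which is exactly the flexibility ELT trades for upfront schema guarantees.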
Move Data to Processing vs. Move Processing to Data?
Move Data to Processing: Used when data is NOT too large AND does not change frequently. Export the data from its source to the processing location. Methods: manual export, automated export, replication.
Move Processing to Data: Used when data IS too large to move OR changes too quickly to keep copies current. Apply processing technology directly where the data already lives. Example: running ML inference on an edge device or at a data center rather than moving petabytes of data.
Change Data Capture (CDC)
Detects and captures ONLY the incremental changes (inserts, updates, deletes) in source data, then moves those changes to the target system efficiently. Best for: high-velocity data where moving full data sets repeatedly is impractical. Minimizes data movement cost and latency. CDC is the correct choice when data changes FREQUENTLY and volume is HIGH. Moving all data repeatedly is inefficient — CDC moves only what changed.
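A toy illustration of the CDC idea, with dictionaries standing in for source and target tables (production CDC tools typically read the database transaction log rather than diffing snapshots): only the inserts, updates, and deletes travel to the target.

```python
# Compare the previous snapshot against the current source state and
# ship only the changes to the target system.
def capture_changes(prev, curr):
    inserts = {k: v for k, v in curr.items() if k not in prev}
    updates = {k: v for k, v in curr.items() if k in prev and prev[k] != v}
    deletes = [k for k in prev if k not in curr]
    return inserts, updates, deletes

def apply_changes(target, inserts, updates, deletes):
    target.update(inserts)
    target.update(updates)
    for k in deletes:
        del target[k]

prev = {1: "alice", 2: "bob", 3: "carol"}
curr = {1: "alice", 2: "bobby", 4: "dave"}   # update 2, delete 3, insert 4
target = dict(prev)
apply_changes(target, *capture_changes(prev, curr))
print(target == curr)  # True, having moved only 3 changes, not the full table
```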
Four Triggers for using Synthetic Data
Privacy & Security Constraints: Real-world data contains PII or sensitive information that would violate privacy laws (GDPR, HIPAA) or create security risks if used for training. Synthetic data provides a safe alternative without exposing real individuals.
Intellectual Property (IP) Constraints: Real data is proprietary or licensed in ways that prohibit its use for AI training. Synthetic data can replicate the statistical properties without the IP exposure.
Insufficient Real-World Data: The required data does not exist in sufficient quantity, variety, or completeness. Common in niche, emerging, or rare-event domains.
Class Imbalance (Rare Events): One class is severely underrepresented in training data (e.g., fraud cases = 2% of all transactions). Synthetic data generates additional examples of the minority class to balance training. Also known as oversampling.
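The class-imbalance trigger can be illustrated with simple random oversampling: duplicate minority-class records until the classes balance. This is a stdlib sketch (the function name and label values are invented); true synthetic-data techniques such as SMOTE generate new examples rather than duplicating existing ones.

```python
import random

def oversample_minority(records, label_key="label", seed=0):
    """Randomly duplicate minority-class records until classes balance."""
    rng = random.Random(seed)
    by_class = {}
    for r in records:
        by_class.setdefault(r[label_key], []).append(r)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for cls, rows in by_class.items():
        balanced.extend(rows)
        # top up with random duplicates of this class's rows
        balanced.extend(rng.choice(rows) for _ in range(target - len(rows)))
    return balanced

# Mirrors the fraud example: 2% minority class before balancing.
data = [{"label": "legit"}] * 98 + [{"label": "fraud"}] * 2
balanced = oversample_minority(data)
counts = {c: sum(1 for r in balanced if r["label"] == c)
          for c in ("legit", "fraud")}
print(counts)  # {'legit': 98, 'fraud': 98}
```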
Data Labeling Approaches:
Data labeling is the process of adding descriptive tags or metadata to training data to provide meaning for ML models — REQUIRED for supervised learning. Labels tell the model what the correct output should be for a given input.
Manual Labeling: Humans use their knowledge to apply labels. Higher quality but slower and more expensive. Required when domain expertise is needed (e.g., medical imaging, legal documents).
Automated Labeling: Systems infer labels based on previously trained models. Faster and cheaper but may introduce label errors. Requires validation to ensure quality.
Bounding Boxes: Used in image labeling to identify and locate specific objects within an image. Machines see images as grids of pixels — bounding boxes tell the model which pixels constitute a meaningful object.
Sensor Fusion: Correlates data from multiple sensor sources simultaneously. Example: autonomous vehicles combining lidar, radar, ultrasonic sensors, and cameras into a unified 3D point cloud model of the environment.
Representational Fairness vs. Technical Cleanliness
Technical data cleanliness (removing errors, duplicates, noise) is NOT the same as representational fairness (ensuring all relevant groups are adequately represented).
Technical Cleanliness: Data is accurate, consistent, deduplicated, correctly formatted, and free of errors. A technically clean data set can still be biased. Addresses: Veracity, noise, format inconsistencies.
Representational Fairness: All relevant groups, categories, demographics, and scenarios are adequately represented in the training data. A representationally fair data set may still have technical quality issues. Addresses: Who or what is included in the data — and in what proportions.
Why Both Matter: A model trained on technically clean but representationally unfair data will perform well on the majority group and poorly on underrepresented groups. CPMAI requires both technical cleanliness AND representational fairness for trustworthy AI. Both must be verified during Phase III — they are separate checks, not the same thing.
What is sensor fusion, and which AI pattern does it primarily support?
Correlating data from multiple sensor sources simultaneously — combining lidar, radar, ultrasonic, and camera inputs into a unified 3D point cloud. Primarily supports the Autonomous Systems AI pattern, where vehicles or robots must understand their full environment in real time.
Phase 3 (Data Preparation) Go/No-Go
All data preparation tasks have been actually executed (not just planned)
Training, validation, and test data sets are prepared and split
Data quality checks and verification are complete
Data pipelines for both training AND inference are built and operational
Labeling, augmentation, and enhancement requirements have been addressed
Difference between a Data Warehouse and a Data Lake?
Data Warehouse: stores data in a FIXED, structured schema — requires ETL (transform before loading).
Data Lake: stores data in its NATIVE format (raw) — uses ELT (load first, transform later). Data lakes support big data environments with diverse, variable data formats.
Three Types of Machine Learning?
Supervised Learning: Trains on LABELED data to predict outputs. Human-annotated examples teach the model the correct answer. Tasks & CPMAI Pattern Alignment: Classification, Regression. Aligns with: Recognition, Predictive Analytics & Decision Support patterns.
Unsupervised Learning: Finds patterns in UNLABELED data through discovery. No ground truth labels required. Tasks & CPMAI Pattern Alignment: Clustering, pattern discovery. Aligns with: Patterns & Anomalies pattern.
Reinforcement Learning: Learns through trial and error in an interactive environment. Maximizes rewards over time. Tasks & CPMAI Pattern Alignment: Goal-driven optimization. Aligns with: Goal-Driven Systems pattern.
ML Algorithm vs. ML Model?
ML Algorithm: The SET OF STEPS that tells the computer HOW to learn from data. Used during training. The method, not the result. Example: a neural network architecture specification.
ML Model: The TRAINED OUTPUT produced when an algorithm runs on data. Used in PRODUCTION to generate predictions. What you are actually using when you say you are 'using an AI system.' Examples: GPT-4, a fraud detection classifier.
Key ML Algorithm Concepts
Classification: Determines which category data belongs to. Binary (yes/no) or Multiclass (one of many). Examples: spam detection, image recognition. Uses supervised learning with labeled examples.
Clustering: Automatically groups similar data without predefined categories. Unsupervised — no labels required. Example: customer segmentation. Finds hidden patterns humans might not discover manually.
Regression: Predicts a continuous numerical value from input data. Discovers relationships between variables. Examples: home price prediction, demand forecasting, sales projections.
Neural Network: A versatile architecture applicable to supervised AND unsupervised tasks. Interconnected layers: input, one or more hidden layers, output. More hidden layers = more sophisticated learning capability.
Deep Learning: Neural networks with MORE THAN ONE hidden layer. 'Deep' refers to the number of hidden layers. Powerful and popular but computationally expensive and often NOT explainable (black box). If a simpler model achieves similar results, prefer the simpler option.
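Regression, the simplest of the concepts above, can be shown end to end: ordinary least squares for a single feature, in plain Python. The house-price numbers are fabricated to lie exactly on a line so the fit is easy to verify by eye.

```python
# Simple linear regression (one feature) fitted by ordinary least squares.
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: house size (sq ft) vs price (k$), exactly linear here:
# price = 0.15 * size + 50.
sizes = [1000, 1500, 2000, 2500]
prices = [200, 275, 350, 425]
slope, intercept = fit_line(sizes, prices)
print(round(slope, 2), round(intercept, 2))  # 0.15 50.0
```

The trained (slope, intercept) pair is the "model"; `fit_line` is the "algorithm" — the same distinction the ML Algorithm vs. ML Model card draws.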
ML Model Taxonomy (Types)
Pretrained Model: Already trained on a large relevant data set. Ideal when speed is critical, data is limited, or technical capability is constrained. Requires due diligence on provider — may have bias, privacy, or compliance risks from unknown training data. Readily available for Conversational and Recognition patterns; limited for Autonomous Systems.
Foundation Model: Large pretrained model adaptable for a broad range of applications. Trained using SELF-SUPERVISED learning on massive data (billions of parameters, petabytes of data) — no labeled data required. Used directly for generic tasks OR fine-tuned for specific downstream tasks. Eliminates the need to train from scratch.
Transformer Model: Type of deep learning neural network that transforms input sequences into output sequences. Uses encoder-decoder architecture with attention mechanism. Processes sequential data efficiently; tracks context within a context window. The basis for most modern LLMs.
Large Language Model (LLM): Deep learning foundation models trained to generate human-understandable text or images from input prompts. Focus on natural language for input/output/training. Built on transformer architecture. PROS: ease of use, flexibility, fast results, wide knowledge base. CONS: not tailored to proprietary data, prone to hallucination, lack explainability (black box).
Generative AI (GenAI): A type of ML that creates NEW data based on patterns learned from existing data. Text generators optimized for text; image generators for images. Applications: chatbots, synthetic data, code generation, content creation.
AI Agent: Software that perceives its environment, makes decisions, and acts autonomously to achieve specific goals. Examples: virtual assistants, autonomous vehicles, recommendation systems.
Agentic AI: Advanced AI where agents autonomously define, optimize, and iterate on workflows with minimal human involvement. Agents evaluate their own performance, redesign processes, and collaborate with other agents. Requires orchestration platforms, agent lifecycle management, and sophisticated monitoring.
Model Bias vs. Variance (Dartboard Framework)
Bias: Degree predictions differ from target values. An ACCURACY problem. High bias = model makes faulty assumptions, misses patterns consistently. Causes UNDERFITTING (model too simple). Dartboard: arrows all miss in the same wrong direction.
Variance: Degree model is sensitive to fluctuations in training data. A GENERALIZATION problem. High variance = model over-adapts to training data, fails on new data. Causes OVERFITTING (model too complex). Dartboard: arrows scattered widely around the target.
Underfitting vs Overfitting
Underfitting (High Bias): Model too SIMPLE. Poor performance on BOTH training AND test data. Fix: use a more complex model.
Overfitting (High Variance): Model too COMPLEX. Excellent on training data, poor on new/unseen data. Fix: regularization, ensemble methods, or reduce model complexity.
Well-Generalized Model: The 'sweet spot.' Captures meaningful patterns without over-adapting to training data. Performs consistently across training, validation, and test sets.
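The underfitting/overfitting contrast can be demonstrated with two extreme toy models (both invented for illustration): a constant predictor that ignores its input, and a 1-nearest-neighbour memorizer that reproduces the training data perfectly but cannot generalize.

```python
# Two extreme "models" on a noisy linear trend: a constant predictor
# (underfits) and a 1-nearest-neighbour memorizer (overfits).
def mean_model(train_y):
    avg = sum(train_y) / len(train_y)
    return lambda x: avg                      # ignores the input entirely

def one_nn_model(train_x, train_y):
    pairs = list(zip(train_x, train_y))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# y ≈ 2x with noise; train and test drawn from the same trend.
train_x, train_y = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
test_x, test_y = [1.5, 3.5], [3.0, 7.0]

flat = mean_model(train_y)
memo = one_nn_model(train_x, train_y)
print(mse(flat, train_x, train_y) > 1)   # True: high train error = underfitting
print(mse(memo, train_x, train_y) == 0)  # True: zero train error...
print(mse(memo, test_x, test_y) > 0.5)   # True: ...poor test error = overfitting
```

A well-generalized model (here, an actual line fit) would land between the two: small but nonzero train error, and similar error on test data.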
Hyperparameter: Configuration settings determined by HUMANS BEFORE training begins. Controls how the model learns. Examples: learning rate, number of epochs, number of hidden layers, number of decision tree branches. Distinct from model parameters, which are LEARNED BY THE MACHINE during training.