1/7
Looks like no tags are added yet.
Name | Mastery | Learn | Test | Matching | Spaced | Call with Kai |
|---|
No analytics yet
Send a link to your students to track their progress
Unimodal AI
System that can process, understand and generate information from only one type of data → One “modality”
Multimodal AI
AI system that can process, understand and generate information across multiple modalities

Challenges of multimodal AI
Generating a plausible world
Must satisfy physics, aesthetics, and language meaning
Not just realism
Control over structure, space, time and continuity
How does Multimodal AI work?
Input is tokenised
Transformer + Diffusion model work together to generate output
Tokenisation in Multimodal AI
Text tokenisation: Text → Sequence of (partial words)
E.g. “Un” “believ” “able”
Image tokenisation: Raw pixels → Visual patches
Audio tokenisation: Sound waves → Sequence of time slices
Transformer (Thinker)
Understands the meaning of the input
Turns input (image / text / audio) → Concepts → Conditioning signals
Diffusion model (Creator)
Starts from pure noise and conditioning signals
Gradually denoises it into a realistic input (image, video, sound)
Diffusion model
Making pictures messy
Add random noise to images and observe what noisy images look like
Learn to clean up
Use many images from step 1 to learn how to remove noise
Training → Prediction
Create from scratch
Start with random noise
Create a new image