Multimodal Retrieval Notes
Multimodal Data Types
- Multimodal data encompasses various forms of information:
  - Text
  - Images
  - Videos
  - EEG signals
  - Sensor data
  - Charts
  - Graphs
  - Smell data
- It leverages fields like:
  - Computer Vision (CV)
  - Natural Language Processing (NLP)
Retrieval Queries and Answers
- Query: the user's request for information, which can be expressed in one or more modalities:
- Query by Image
- Query by Text-Image
- Query by Visual Question Answering (VQA)
- Query by Text
- Answer: the system's response, which should satisfy the query accurately.
Query by Image Example
- Find images similar to a given query image.
- Example uses a similarity metric to rank results.
- Similarity scores (e.g., 2.025973, 0.167937, 0.125448) indicate how closely each result matches the query image.
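The ranking step above can be sketched in a few lines. This is a minimal illustration, not the method behind the scores in the example: the 3-D "feature vectors" and image names below are invented toy data standing in for real image descriptors, and cosine similarity is just one common choice of metric.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two feature vectors (1.0 = same direction).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank_by_similarity(query_vec, gallery):
    # gallery maps image id -> feature vector; return ids, most similar first.
    scores = {img: cosine_similarity(query_vec, vec) for img, vec in gallery.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy 3-D "features" standing in for real image descriptors.
query = [1.0, 0.2, 0.0]
gallery = {
    "img_a": [0.9, 0.1, 0.05],  # fairly close to the query
    "img_b": [0.0, 1.0, 0.0],   # mostly orthogonal to the query
    "img_c": [1.0, 0.2, 0.0],   # identical direction to the query
}
ranking = rank_by_similarity(query, gallery)
```

In a real system the vectors would come from an image feature extractor rather than being hand-written, but the ranking logic is the same.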
Query by Text-Image Examples
- Combining text and image inputs to refine search.
- Example 1: User looks for a dress similar to a given image but in white with a ribbon sash. The system retrieves and suggests an alternative.
- Example 2: Instructions to modify an image (e.g., "No people and switch to night-time," "Add red cube to bottom-middle").
- Example 3: Complex scene description: "A horse in a city, occluding a bike and a car…"
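One simple way to combine a text instruction with a reference image, as in the dress example above, is to embed both into a shared vector space and fuse them before searching. The sketch below assumes such a shared space exists (as in CLIP-style models); the 2-D embeddings and catalogue entries are made-up toy values for illustration only.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse(text_vec, image_vec, alpha=0.5):
    # Weighted late fusion: alpha steers the query between text and image.
    return [alpha * t + (1 - alpha) * v for t, v in zip(text_vec, image_vec)]

# Toy embeddings: dimension 0 ~ "dress shape", dimension 1 ~ "white colour".
image_query = [1.0, 0.0]   # the reference dress photo
text_query = [0.0, 1.0]    # "in white, with a ribbon sash"
fused = fuse(text_query, image_query, alpha=0.5)

catalogue = {
    "blue_dress":  [1.0, 0.0],
    "white_dress": [1.0, 1.0],
    "white_shirt": [0.2, 1.0],
}
# Retrieve the item that best matches BOTH the photo and the instruction.
best = max(catalogue, key=lambda k: dot(fused, catalogue[k]))
```

With equal weighting, the fused query prefers the item that satisfies both modalities (the white dress) over items matching only one.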
Query by Visual Question Answering (VQA)
- Involves answering questions about an image.
- Examples:
- "What type of animal is this?"
- "Is this animal alone?"
- "Is it snowing?"
- "Is this picture taken during the day?"
- "What kind of oranges are these?"
- "Is the fruit sliced?"
- "What is leaning on the wall?"
- "How many boards are there?"
Query by Text
- Example using Google Search: User searches for "Tiger Woods".
- The search results include:
- Overview information (American professional golfer).
- Videos.
- Age (48 years, born December 30, 1975).
- Caddy (Joe LaCava).
- Wikipedia link.
- Top stories and news.
- Homepage link.
- Profiles on Instagram, X (Twitter), Facebook, YouTube.
- People also ask questions (e.g., "What is Tiger Woods famous for?").
What is Multimodal Retrieval?
- Multimodal retrieval is the process of retrieving information from data containing multiple modalities (text, images, videos, audio, etc.).
- The goal is to allow users to search using queries that combine modalities (text, image, spoken query).
- Techniques from information retrieval, computer vision, NLP, and audio processing are combined.
- This involves feature extraction, similarity measurement, fusion of information, and relevance ranking.
- Applications include multimedia search engines, content-based image retrieval, video retrieval, cross-modal retrieval, and multimodal information retrieval in healthcare.
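The four steps listed above (feature extraction, similarity measurement, fusion, relevance ranking) can be sketched end-to-end. Everything here is a deliberately simplified stand-in: a real system would use learned encoders (e.g., a CNN for images, a transformer for text) instead of the toy bag-of-words features and hand-assigned image scores below.

```python
def extract_text_features(text):
    # Stage 1: feature extraction (toy bag-of-words over a fixed vocabulary).
    vocab = ["cat", "dog", "car"]
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def similarity(a, b):
    # Stage 2: similarity measurement (dot product here).
    return sum(x * y for x, y in zip(a, b))

def fuse_scores(score_text, score_image, w=0.5):
    # Stage 3: fusion of per-modality scores (simple weighted sum).
    return w * score_text + (1 - w) * score_image

def rank(query_text, items):
    # Stage 4: relevance ranking over (caption_features, image_score) items.
    q = extract_text_features(query_text)
    scored = [(name, fuse_scores(similarity(q, feats), img_score))
              for name, (feats, img_score) in items.items()]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)

items = {
    "photo1": ([1, 0, 0], 0.9),  # caption mentions "cat"; image score is high
    "photo2": ([0, 1, 0], 0.2),  # caption mentions "dog"; image score is low
}
results = rank("a cat on a sofa", items)
```

The point is the pipeline shape, not the components: each stage can be swapped for a stronger model without changing the overall flow.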
Applications of Multimodal Retrieval
- Databases
- Journalism and TV broadcasting
- Interactive TV (Astro B.yond, VOD, Netflix)
- Searching on YouTube, Facebook
- Photo gallery on your phone
Timeline of Multimodal Retrieval
- Traditional Era:
- Up to 2015
- Methods: KNN, SVM, Decision tree, GLCM, Colour histogram, Filter, Supervised learning, Similarity distance
- Deep Learning Era:
- 2015 - Today
- Methods: CNN, RNN, LSTM, BERT, LLM
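Two of the traditional-era methods listed above, the colour histogram and a similarity distance, are easy to show concretely. The sketch below is a toy single-channel version (real colour histograms cover three channels); the "images" are just short lists of pixel values invented for illustration, and histogram intersection is one classic choice of similarity.

```python
def colour_histogram(pixels, bins=4):
    # Quantise each 0-255 pixel value into `bins` buckets and count them.
    hist = [0] * bins
    for value in pixels:
        hist[min(value * bins // 256, bins - 1)] += 1
    # Normalise so images of different sizes are comparable.
    total = sum(hist)
    return [h / total for h in hist]

def histogram_intersection(h1, h2):
    # Classic histogram similarity: 1.0 means identical distributions.
    return sum(min(a, b) for a, b in zip(h1, h2))

# Toy single-channel "images" (lists of pixel intensities).
dark_image = [10, 20, 30, 40]
dark_image2 = [15, 25, 35, 45]
bright_image = [200, 210, 220, 230]

sim_dark = histogram_intersection(colour_histogram(dark_image),
                                  colour_histogram(dark_image2))
sim_bright = histogram_intersection(colour_histogram(dark_image),
                                    colour_histogram(bright_image))
```

The two dark images land in the same histogram bucket and score as highly similar, while the bright image scores low: exactly the kind of hand-crafted, non-learned matching that characterised the traditional era.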
Traditional Era Retrieval
- Data was stored in structured form and retrieved through SQL (Structured Query Language).
- Example SQL query:
  SELECT city, COUNT(*) FROM customer WHERE state = 'CA' GROUP BY city ORDER BY city;
- This query counts the customers in each Californian city, one row per city.
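A grouped query of this kind can be run against an in-memory SQLite database with Python's standard library. The `customer` table schema and rows below are assumptions made up for the demonstration; only the shape of the query matches the example above.

```python
import sqlite3

# In-memory database with a toy `customer` table (schema invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, city TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [("Ann", "San Diego", "CA"),
     ("Bob", "Fresno", "CA"),
     ("Cal", "San Diego", "CA"),
     ("Dee", "Reno", "NV")],
)

# Count customers per Californian city, one row per city.
rows = conn.execute(
    "SELECT city, COUNT(*) FROM customer "
    "WHERE state = 'CA' GROUP BY city ORDER BY city"
).fetchall()
```

Note what the structured form buys and costs: the query is exact and fast, but it can only match fields that were written into the schema, which is why this style of retrieval struggles with images, video, and other unstructured modalities.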
Deep Learning Era Applications
- Image captioning
- Text generation
- Translation
- Object detection
- Knowledge transfer
- Etc.
- Concepts and methods that can be adapted into multimodal retrieval: CNN, RNN, LSTM, BERT, LLM.
Deep Learning Examples
- Image caption: Automatically create image descriptions (Karpathy, 2015).
- Example: "man in black shirt is playing guitar"
- Text generation: Computer-generated handwriting (Graves, 2014).
- Restoring photo: Damaged photos are restored (Dahl, 2017).
- Colorizing photo: Black and white photos are colorized (Iizuka, 2016).
- Object detection & YOLO: Various objects are detected and named (Tesla, 2017; Redmon 2018b; Krizhevsky, 2017).
- Knowledge transfer: Art photo is created by combining an original Van Gogh art piece and a photo through style transfer (Gatys, 2016).
- Create imagery photo: Original photo is transformed into an imaginary photo using deep dreaming (The Telegraph, 2015).
- Medical diagnosis: COVID-19 is detected in Chest X-ray (Mukherjee, 2020).
- Sound for desired scene: Sound is added to silent video by selecting, from a collection, sounds compatible with the depicted scene (Owens et al., 2016).
- Fake Video: New video is created to re-enact a politician from the original video with lip synchronization (Suwajanakorn, 2017).
- Face recognition: Face detection using pre-trained models.
- Translation: Language translation is performed to produce the desired language with automated machine translation (Good, 2015).