Multimodal Retrieval Notes

Multimodal Data Types

  • Multimodal data encompasses various forms of information:
    • Text
    • Images
    • Videos
    • EEG signals
    • Sensor data
    • Charts
    • Graphs
    • Smell data
  • It leverages fields like:
    • Computer Vision (CV)
    • Natural Language Processing (NLP)

Retrieval Queries and Answers

  • Query: the user's request for information, expressed in one or more modalities.
    • Query by Image
    • Query by Text-Image
    • Query by Visual Question Answering (VQA)
    • Query by Text
  • Answer: the system's response, which should satisfy the query accurately.

Query by Image Example

  • Find images similar to a given query image.
  • Example uses a similarity metric to rank results.
  • Similarity scores are provided (e.g., 2.025973, 0.167937, 0.125448, etc.) to indicate how closely each result matches the query image.
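The ranking step above can be sketched as follows. This is a minimal illustration, not the system used in the example: it assumes feature vectors have already been extracted (e.g. by a CNN) and ranks the gallery by cosine similarity, whereas the example's scores may come from a different metric.

```python
import numpy as np

def rank_by_similarity(query_vec, gallery_vecs):
    """Rank gallery images by cosine similarity to the query image.

    query_vec: 1-D feature vector for the query image (e.g. from a CNN).
    gallery_vecs: 2-D array, one feature vector per gallery image.
    Returns (indices, scores) sorted from most to least similar.
    """
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    scores = g @ q                  # cosine similarity per gallery image
    order = np.argsort(-scores)     # descending
    return order, scores[order]

# Toy 3-image gallery with hand-made 4-D "features" (illustrative only)
gallery = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.9, 0.1, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0]])
query = np.array([1.0, 0.05, 0.0, 0.0])
order, scores = rank_by_similarity(query, gallery)
print(order)   # most similar gallery image first
```

The same two steps (feature extraction, then nearest-neighbour ranking) underlie most content-based image retrieval systems; only the feature extractor changes.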

Query by Text-Image Examples

  • Combining text and image inputs to refine search.
  • Example 1: User looks for a dress similar to a given image but in white with a ribbon sash. The system retrieves and suggests an alternative.
  • Example 2: Instructions to modify an image (e.g., "No people and switch to night-time," "Add red cube to bottom-middle").
  • Example 3: Complex scene description: "A horse in a city, occluding a bike and a car…"
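A text-image query like Example 1 can be sketched as "late fusion": combine the image embedding with an embedding of the text modifier, then retrieve as usual. The encoders and catalogue below are hypothetical stand-ins; real systems learn the fusion function rather than using a fixed weighted sum.

```python
import numpy as np

def compose_query(image_vec, text_vec, alpha=0.5):
    """Fuse an image embedding with a text-modifier embedding.

    A simple weighted sum, used here only for illustration; learned
    composers replace this step but the retrieval step is the same.
    """
    fused = alpha * image_vec + (1.0 - alpha) * text_vec
    return fused / np.linalg.norm(fused)

def retrieve(fused_query, catalogue):
    catalogue = catalogue / np.linalg.norm(catalogue, axis=1, keepdims=True)
    return np.argsort(-(catalogue @ fused_query))   # best match first

# Toy example: dimension 0 ~ "dress shape", dimension 1 ~ "white colour"
image_vec = np.array([1.0, 0.0])   # embedding of the reference dress photo
text_vec = np.array([0.0, 1.0])    # embedding of "in white with a ribbon sash"
catalogue = np.array([[1.0, 0.0],  # same dress, original colour
                      [0.7, 0.7],  # similar dress in white  <- desired hit
                      [0.0, 1.0]]) # unrelated white item
ranking = retrieve(compose_query(image_vec, text_vec), catalogue)
print(ranking)  # the white variant of the dress ranks first
```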

Query by Visual Question Answering (VQA)

  • Involves answering questions about an image.
  • Examples:
    • "What type of animal is this?"
    • "Is this animal alone?"
    • "Is it snowing?"
    • "Is this picture taken during the day?"
    • "What kind of oranges are these?"
    • "Is the fruit sliced?"
    • "What is leaning on the wall?"
    • "How many boards are there?"
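VQA is often framed as classification over a fixed answer vocabulary: encode the image and the question, fuse the two, and score each candidate answer. The sketch below uses hand-made feature vectors in place of the CNN/BERT encoders a real system would use, so everything is illustrative.

```python
import numpy as np

answers = ["dog", "cat", "yes", "no"]

def answer_vqa(image_feat, question_feat, W):
    """Fuse image and question features, then score each candidate answer."""
    fused = image_feat * question_feat      # element-wise fusion
    logits = W @ fused                      # one score per answer
    return answers[int(np.argmax(logits))]

# Hand-crafted toy features: dim 0 ~ "animal is a dog", dim 1 ~ "is alone"
image_feat = np.array([1.0, 1.0])           # image shows one dog
what_animal = np.array([1.0, 0.0])          # "What type of animal is this?"
W = np.array([[1.0, 0.0],    # "dog"
              [-1.0, 0.0],   # "cat"
              [0.0, 1.0],    # "yes"
              [0.0, -1.0]])  # "no"
print(answer_vqa(image_feat, what_animal, W))   # "dog"
```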

Query by Text

  • Example using Google Search: User searches for "Tiger Woods".
  • The search results include:
    • Overview information (American professional golfer).
    • Videos.
    • Age (48 years, born December 30, 1975).
    • Caddy (Joe LaCava).
    • Wikipedia link.
    • Top stories and news.
    • Homepage link.
    • Profiles on Instagram, X (Twitter), Facebook, YouTube.
    • People also ask questions (e.g., "What is Tiger Woods famous for?").

What is Multimodal Retrieval?

  • Multimodal retrieval is the process of retrieving information from data containing multiple modalities (text, images, videos, audio, etc.).
  • The goal is to allow users to search using queries that combine modalities (text, image, spoken query).
  • Techniques from information retrieval, computer vision, NLP, and audio processing are combined.
  • This involves feature extraction, similarity measurement, fusion of information, and relevance ranking.
  • Applications include multimedia search engines, content-based image retrieval, video retrieval, cross-modal retrieval, and multimodal information retrieval in healthcare.
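Cross-modal retrieval, mentioned above, relies on mapping different modalities into one shared embedding space (as CLIP-style models do), so a text query can be matched directly against image embeddings. The "encoders" below are hypothetical lookup tables standing in for learned networks.

```python
import numpy as np

# Hypothetical encoders: real systems use learned text and image networks
# that project both modalities into the same embedding space.
text_encoder = {"a photo of a dog": np.array([1.0, 0.0]),
                "a photo of a car": np.array([0.0, 1.0])}
image_embeddings = {"dog.jpg": np.array([0.9, 0.1]),
                    "car.jpg": np.array([0.1, 0.9])}

def cross_modal_search(text_query):
    """Return the image whose embedding best matches the text query."""
    q = text_encoder[text_query]
    q = q / np.linalg.norm(q)
    best, best_score = None, -np.inf
    for name, v in image_embeddings.items():
        score = float(v / np.linalg.norm(v) @ q)
        if score > best_score:
            best, best_score = name, score
    return best

print(cross_modal_search("a photo of a dog"))   # dog.jpg
```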

Applications of Multimodal Retrieval

  • Databases
  • Journalism and TV broadcasting
  • Interactive TV (Astro byond, VOD, Netflix)
  • Searching on YouTube, Facebook
  • Google
  • Gallery in your handphone

Timeline of Multimodal Retrieval

  • Traditional Era:
    • Up to 2015
    • Methods: KNN, SVM, Decision tree, GLCM, Colour histogram, Filter, Supervised learning, Similarity distance
  • Deep Learning Era:
    • 2015 - Today
    • Methods: CNN, RNN, LSTM, BERT, LLM
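Two of the traditional-era methods above, colour histograms and nearest-neighbour search, can be combined into a minimal content-based retrieval sketch. The "images" here are tiny synthetic arrays of 8-bit grey values standing in for real photos.

```python
import numpy as np

def histogram(img, bins=8):
    """Represent an image by its normalised intensity histogram."""
    h, _ = np.histogram(img, bins=bins, range=(0, 256))
    return h / h.sum()

def knn_retrieve(query_img, gallery, k=1):
    """Rank gallery images by L1 histogram distance (no learning involved)."""
    q = histogram(query_img)
    dists = [np.abs(histogram(g) - q).sum() for g in gallery]
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
dark = rng.integers(0, 64, size=(16, 16))       # mostly dark image
bright = rng.integers(192, 256, size=(16, 16))  # mostly bright image
query = rng.integers(0, 64, size=(16, 16))      # another dark image
print(knn_retrieve(query, [dark, bright]))      # matches the dark image
```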

Traditional Era Retrieval

  • Data was stored in structured form and retrieved through SQL (Structured Query Language).

  • Example SQL query:

    SELECT city, COUNT(*)
    FROM customer
    WHERE state = 'CA'
    GROUP BY city
    ORDER BY city;

  • This query counts customers per city in the state of California, listing the cities alphabetically.
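The query above can be run end-to-end with Python's built-in sqlite3 module; the customer rows below are made up for illustration.

```python
import sqlite3

# In-memory SQLite database with a small, made-up customer table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, city TEXT, state TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", [
    ("Alice", "Los Angeles", "CA"),
    ("Bob", "Los Angeles", "CA"),
    ("Carol", "San Diego", "CA"),
    ("Dave", "Austin", "TX"),
])
rows = conn.execute(
    "SELECT city, COUNT(*) FROM customer "
    "WHERE state = 'CA' GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('Los Angeles', 2), ('San Diego', 1)]
```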

Deep Learning Era Applications

  • Image captioning
  • Text generation
  • Translation
  • Object detection
  • Knowledge transfer
  • Etc.
  • Concepts and methods that can be adapted into multimodal retrieval: CNN, RNN, LSTM, BERT, LLM.

Deep Learning Examples

  • Image caption: Automatically create image descriptions (Karpathy, 2015).
    • Example: "man in black shirt is playing guitar"
  • Text generation: Computer-generated handwriting (Graves, 2014).
  • Restoring photo: Damaged photos are restored (Dahl, 2017).
  • Colorizing photo: Black and white photos are colorized (Iizuka, 2016).
  • Object detection & YOLO: Various objects are detected and named (Tesla, 2017; Redmon, 2018b; Krizhevsky, 2017).
  • Knowledge transfer: Art photo is created by combining an original Van Gogh art piece and a photo through style transfer (Gatys, 2016).
  • Create imagery photo: Original photo is transformed into an imaginary photo using deep dreaming (The Telegraph, 2015).
  • Medical diagnosis: COVID-19 is detected in Chest X-ray (Mukherjee, 2020).
  • Sound for desired scene: Sound matching the desired scene is generated from a collection of compatible sounds (Owens et al., 2016).
  • Fake Video: New video is created to re-enact a politician from the original video with lip synchronization (Suwajanakorn, 2017).
  • Face recognition: Face detection using pre-trained models.
  • Translation: Language translation is performed to produce the desired language with automated machine translation (Good, 2015).