Multimodal Retrieval Notes

Multimodal Data Types

  • Multimodal data encompasses various forms of information:
    • Text
    • Images
    • Videos
    • EEG signals
    • Sensor data
    • Charts
    • Graphs
    • Smell data
  • It leverages fields like:
    • Computer Vision (CV)
    • Natural Language Processing (NLP)

Retrieval Queries and Answers

  • Query: the user's request for information, expressed in one or more modalities.
    • Query by Image
    • Query by Text-Image
    • Query by Visual Question Answering (VQA)
    • Query by Text
  • Answer: the system's response, which should satisfy the query accurately.

Query by Image Example

  • Find images similar to a given query image.
  • Example uses a similarity metric to rank results.
  • Similarity scores are provided (e.g., 2.025973, 0.167937, 0.125448, etc.) to indicate how closely each result matches the query image.
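The ranking step above can be sketched as follows. This is a minimal illustration, not the system used in the example: it assumes feature vectors have already been extracted (e.g. by a CNN) and ranks the gallery by cosine similarity, whereas the example's scores may come from a different metric.

```python
import numpy as np

def rank_by_similarity(query_vec, gallery_vecs):
    """Rank gallery images by cosine similarity to the query image.

    query_vec: 1-D feature vector for the query image (e.g. from a CNN).
    gallery_vecs: 2-D array, one feature vector per gallery image.
    Returns (indices, scores) sorted from most to least similar.
    """
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    scores = g @ q                  # cosine similarity per gallery image
    order = np.argsort(-scores)     # descending
    return order, scores[order]

# Toy 3-image gallery with hand-made 4-D "features" (illustrative only)
gallery = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.9, 0.1, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0]])
query = np.array([1.0, 0.05, 0.0, 0.0])
order, scores = rank_by_similarity(query, gallery)
print(order)   # most similar gallery image first
```

The same two steps (feature extraction, then nearest-neighbour ranking) underlie most content-based image retrieval systems; only the feature extractor changes.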

Query by Text-Image Examples

  • Combining text and image inputs to refine search.
  • Example 1: User looks for a dress similar to a given image but in white with a ribbon sash. The system retrieves and suggests an alternative.
  • Example 2: Instructions to modify an image (e.g., "No people and switch to night-time," "Add red cube to bottom-middle").
  • Example 3: Complex scene description: "A horse in a city, occluding a bike and a car…"
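A text-image query like Example 1 can be sketched as "late fusion": combine the image embedding with an embedding of the text modifier, then retrieve as usual. The encoders and catalogue below are hypothetical stand-ins; real systems learn the fusion function rather than using a fixed weighted sum.

```python
import numpy as np

def compose_query(image_vec, text_vec, alpha=0.5):
    """Fuse an image embedding with a text-modifier embedding.

    A simple weighted sum, used here only for illustration; learned
    composers replace this step but the retrieval step is the same.
    """
    fused = alpha * image_vec + (1.0 - alpha) * text_vec
    return fused / np.linalg.norm(fused)

def retrieve(fused_query, catalogue):
    catalogue = catalogue / np.linalg.norm(catalogue, axis=1, keepdims=True)
    return np.argsort(-(catalogue @ fused_query))   # best match first

# Toy example: dimension 0 ~ "dress shape", dimension 1 ~ "white colour"
image_vec = np.array([1.0, 0.0])   # embedding of the reference dress photo
text_vec = np.array([0.0, 1.0])    # embedding of "in white with a ribbon sash"
catalogue = np.array([[1.0, 0.0],  # same dress, original colour
                      [0.7, 0.7],  # similar dress in white  <- desired hit
                      [0.0, 1.0]]) # unrelated white item
ranking = retrieve(compose_query(image_vec, text_vec), catalogue)
print(ranking)  # the white variant of the dress ranks first
```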

Query by Visual Question Answering (VQA)

  • Involves answering questions about an image.
  • Examples:
    • "What type of animal is this?"
    • "Is this animal alone?"
    • "Is it snowing?"
    • "Is this picture taken during the day?"
    • "What kind of oranges are these?"
    • "Is the fruit sliced?"
    • "What is leaning on the wall?"
    • "How many boards are there?"
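VQA is often framed as classification over a fixed answer vocabulary: encode the image and the question, fuse the two, and score each candidate answer. The sketch below uses hand-made feature vectors in place of the CNN/BERT encoders a real system would use, so everything is illustrative.

```python
import numpy as np

answers = ["dog", "cat", "yes", "no"]

def answer_vqa(image_feat, question_feat, W):
    """Fuse image and question features, then score each candidate answer."""
    fused = image_feat * question_feat      # element-wise fusion
    logits = W @ fused                      # one score per answer
    return answers[int(np.argmax(logits))]

# Hand-crafted toy features: dim 0 ~ "animal is a dog", dim 1 ~ "is alone"
image_feat = np.array([1.0, 1.0])           # image shows one dog
what_animal = np.array([1.0, 0.0])          # "What type of animal is this?"
W = np.array([[1.0, 0.0],    # "dog"
              [-1.0, 0.0],   # "cat"
              [0.0, 1.0],    # "yes"
              [0.0, -1.0]])  # "no"
print(answer_vqa(image_feat, what_animal, W))   # "dog"
```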

Query by Text

  • Example using Google Search: User searches for "Tiger Woods".
  • The search results include:
    • Overview information (American professional golfer).
    • Videos.
    • Age (48 years, born December 30, 1975).
    • Caddy (Joe LaCava).
    • Wikipedia link.
    • Top stories and news.
    • Homepage link.
    • Profiles on Instagram, X (Twitter), Facebook, YouTube.
    • People also ask questions (e.g., "What is Tiger Woods famous for?").

What is Multimodal Retrieval?

  • Multimodal retrieval is the process of retrieving information from data containing multiple modalities (text, images, videos, audio, etc.).
  • The goal is to allow users to search using queries that combine modalities (text, image, spoken query).
  • Techniques from information retrieval, computer vision, NLP, and audio processing are combined.
  • This involves feature extraction, similarity measurement, fusion of information, and relevance ranking.
  • Applications include multimedia search engines, content-based image retrieval, video retrieval, cross-modal retrieval, and multimodal information retrieval in healthcare.
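Cross-modal retrieval, mentioned above, relies on mapping different modalities into one shared embedding space (as CLIP-style models do), so a text query can be matched directly against image embeddings. The "encoders" below are hypothetical lookup tables standing in for learned networks.

```python
import numpy as np

# Hypothetical encoders: real systems use learned text and image networks
# that project both modalities into the same embedding space.
text_encoder = {"a photo of a dog": np.array([1.0, 0.0]),
                "a photo of a car": np.array([0.0, 1.0])}
image_embeddings = {"dog.jpg": np.array([0.9, 0.1]),
                    "car.jpg": np.array([0.1, 0.9])}

def cross_modal_search(text_query):
    """Return the image whose embedding best matches the text query."""
    q = text_encoder[text_query]
    q = q / np.linalg.norm(q)
    best, best_score = None, -np.inf
    for name, v in image_embeddings.items():
        score = float(v / np.linalg.norm(v) @ q)
        if score > best_score:
            best, best_score = name, score
    return best

print(cross_modal_search("a photo of a dog"))   # dog.jpg
```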

Applications of Multimodal Retrieval

  • Databases
  • Journalism and TV broadcasting
  • Interactive TV (Astro byond, VOD, Netflix)
  • Searching on YouTube, Facebook
  • Google
  • Gallery in your handphone

Timeline of Multimodal Retrieval

  • Traditional Era:
    • Up to 2015
    • Methods: KNN, SVM, Decision tree, GLCM, Colour histogram, Filter, Supervised learning, Similarity distance
  • Deep Learning Era:
    • 2015 - Today
    • Methods: CNN, RNN, LSTM, BERT, LLM
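Two of the traditional-era methods above, colour histograms and nearest-neighbour search, can be combined into a minimal content-based retrieval sketch. The "images" here are tiny synthetic arrays of 8-bit grey values standing in for real photos.

```python
import numpy as np

def histogram(img, bins=8):
    """Represent an image by its normalised intensity histogram."""
    h, _ = np.histogram(img, bins=bins, range=(0, 256))
    return h / h.sum()

def knn_retrieve(query_img, gallery, k=1):
    """Rank gallery images by L1 histogram distance (no learning involved)."""
    q = histogram(query_img)
    dists = [np.abs(histogram(g) - q).sum() for g in gallery]
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
dark = rng.integers(0, 64, size=(16, 16))       # mostly dark image
bright = rng.integers(192, 256, size=(16, 16))  # mostly bright image
query = rng.integers(0, 64, size=(16, 16))      # another dark image
print(knn_retrieve(query, [dark, bright]))      # matches the dark image
```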

Traditional Era Retrieval

  • Data was stored in structured form and retrieved through SQL (Structured Query Language).

  • Example SQL query:

    SELECT city, COUNT(*)
    FROM customer
    WHERE state = 'CA'
    GROUP BY city
    ORDER BY city;

  • This query counts customers per city in the state of California, listing the cities alphabetically.
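The query above can be run end-to-end with Python's built-in sqlite3 module; the customer rows below are made up for illustration.

```python
import sqlite3

# In-memory SQLite database with a small, made-up customer table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, city TEXT, state TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", [
    ("Alice", "Los Angeles", "CA"),
    ("Bob", "Los Angeles", "CA"),
    ("Carol", "San Diego", "CA"),
    ("Dave", "Austin", "TX"),
])
rows = conn.execute(
    "SELECT city, COUNT(*) FROM customer "
    "WHERE state = 'CA' GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('Los Angeles', 2), ('San Diego', 1)]
```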

Deep Learning Era Applications

  • Image captioning
  • Text generation
  • Translation
  • Object detection
  • Knowledge transfer
  • Etc.
  • Concepts and methods that can be adapted into multimodal retrieval: CNN, RNN, LSTM, BERT, LLM.

Deep Learning Examples

  • Image caption: Automatically create image descriptions (Karpathy, 2015).
    • Example: "man in black shirt is playing guitar"
  • Text generation: Computer-generated handwriting (Graves, 2014).
  • Restoring photo: Damaged photos are restored (Dahl, 2017).
  • Colorizing photo: Black and white photos are colorized (Iizuka, 2016).
  • Object detection & YOLO: Various objects are detected and named (Tesla, 2017; Redmon, 2018b; Krizhevsky, 2017).
  • Knowledge transfer: Art photo is created by combining an original Van Gogh art piece and a photo through style transfer (Gatys, 2016).
  • Create imagery photo: Original photo is transformed into an imaginary photo using deep dreaming (The Telegraph, 2015).
  • Medical diagnosis: COVID-19 is detected in Chest X-ray (Mukherjee, 2020).
  • Sound for desired scene: Sound matching the desired scene is generated from a collection of compatible sounds (Owens et al., 2016).
  • Fake Video: New video is created to re-enact a politician from the original video with lip synchronization (Suwajanakorn, 2017).
  • Face recognition: Face detection using pre-trained models.
  • Translation: Language translation is performed to produce the desired language with automated machine translation (Good, 2015).