Lecture 04

Lecture Overview

  • Course: Machine Learning for Physical Systems (lv2987+2988)

  • Instructor: Trushal Sardhara (tsardhara@ukaachen.de)

  • Topic: Synthetic Data for Machine Learning

  • Date: 5th Nov 2023

Introduction to Synthetic Data

What is Synthetic Data?

  • Data generated using mathematical models or algorithms on computers.

Why Use Synthetic Data?

  • Real data acquisition can be expensive or time-consuming.

  • Machine learning applications often require extensive datasets.

  • Real data may lack known ground-truth values.

  • New conditions that are rare can be simulated.

Example Data Sources

  • GTA-5 gameplay footage.

  • BBBC005v1 from the Broad Bioimage Benchmark Collection [Ljosa et al., Nature Methods, 2012].

Advantages of Synthetic Data

  • Can typically be generated on a standard computer.

  • Easily reproducible when needed.

  • Ground-truth values are directly available.

  • There is no limit to the quantity of data generated.
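
A minimal sketch of how reproducibility and directly available ground truth come together in practice. The disc-shaped objects, the Gaussian noise model, and the function name make_synthetic_image are illustrative assumptions, not part of the lecture material.

```python
import numpy as np

def make_synthetic_image(size=64, n_discs=5, noise_std=0.1, seed=0):
    """Return (noisy_image, ground_truth_mask) for randomly placed bright discs."""
    rng = np.random.default_rng(seed)  # fixed seed -> reproducible data
    yy, xx = np.mgrid[0:size, 0:size]
    mask = np.zeros((size, size), dtype=bool)
    for _ in range(n_discs):
        cy, cx = rng.uniform(0, size, size=2)  # disc centre
        r = rng.uniform(3, 8)                  # disc radius in pixels
        mask |= (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2
    # Observed image: ideal intensities (0 or 1) plus Gaussian noise
    image = mask.astype(float) + rng.normal(0.0, noise_std, size=mask.shape)
    return image, mask

image_a, mask_a = make_synthetic_image(seed=42)
image_b, mask_b = make_synthetic_image(seed=42)
print(np.array_equal(image_a, image_b))  # same seed, same data
print(mask_a.mean())                     # foreground fraction, known exactly
```

Because the mask is created before the noise is added, every pixel label is known by construction, and calling the function with new seeds yields as many image/label pairs as needed.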

Disadvantages of Synthetic Data

  • Often too idealized ("too good"), so it may misrepresent real data.

  • Generation can be challenging and may require complex modeling.

  • Generation may be computationally intensive.

Real-World Applications

  • Autonomous vehicle simulations (e.g., the IDDA dataset).

  • Synthetic cell images (e.g., generated with SIMCEP).

FIB-SEM Tomography Basics

Fundamentals of FIB-SEM

  • FIB-SEM: Focused Ion Beam Scanning Electron Microscope.

  • Key concepts include:

    • Signal: Backscattered electrons

    • Imaging: The SEM provides 2D in-plane images of the exposed surface.

    • Material Removal: Achieved via the focused ion beam.

    • Resolution: Ranges from 1 to 10 nm.

    • Output: Stacks of images for 3D reconstruction.

Characteristics of FIB-SEM

  • Destructive imaging technique: each slice is milled away after it has been imaged, so the sample cannot be imaged again.

  • Ground truth for the acquired data is hard to establish because of cost and time constraints.

3D Reconstruction from FIB-SEM Imaging

Process of 3D Reconstruction

  • The 3D structure is reconstructed from the stack of image slices:

    • Segmentation Techniques:

      • Thresholding (classifying each pixel by comparing its intensity to a threshold value).

      • Clustering methods (e.g., k-means clustering).
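
The two segmentation techniques listed above can be sketched as follows. This is a minimal illustration on a toy two-phase image; the function names, the threshold of 0.5, and the toy intensities (0.2 and 0.8) are assumptions chosen for the example, and the k-means routine is a plain Lloyd's algorithm on pixel intensities rather than any specific library implementation.

```python
import numpy as np

def threshold_segment(image, threshold):
    """Label a pixel as foreground if its intensity exceeds the threshold."""
    return image > threshold

def kmeans_segment(image, k=2, n_iter=20):
    """Cluster pixel intensities with 1-D k-means (Lloyd's algorithm)."""
    values = image.ravel().astype(float)
    # Initialise the cluster centres evenly between min and max intensity
    centers = np.linspace(values.min(), values.max(), k)
    for _ in range(n_iter):
        # Assignment step: nearest centre for every pixel
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Update step: each centre moves to the mean of its pixels
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
    return labels.reshape(image.shape), centers

# Toy image: dark left half, bright right half, plus mild Gaussian noise
rng = np.random.default_rng(0)
truth = np.zeros((32, 32), dtype=bool)
truth[:, 16:] = True
image = np.where(truth, 0.8, 0.2) + rng.normal(0.0, 0.05, size=truth.shape)

thr_mask = threshold_segment(image, 0.5)
km_labels, km_centers = kmeans_segment(image, k=2)
print((thr_mask == truth).mean())                # pixel accuracy of thresholding
print((km_labels.astype(bool) == truth).mean())  # pixel accuracy of k-means
```

Thresholding needs a hand-picked threshold, whereas k-means derives the class intensities from the data; both rely purely on pixel intensity.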

Challenges of Reconstruction

  • Shine-through effect: Structures that lie behind the current slice are visible in the image of the current slice.

  • Intensity ambiguity: Brightness levels are not consistent across the images, which makes them harder to interpret.

Ambiguity in Intensity

  • Different structural features can yield the same intensity readings, so segmentation based on intensity alone is unreliable.
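
A toy model can make the shine-through effect and the resulting ambiguity concrete. Everything in this sketch is an assumption made for illustration: the random porous volume, the rule that a pore pixel shows the first solid voxel behind it, and the attenuation factor decay per slice. It is not the imaging model used in the lecture.

```python
import numpy as np

def render_with_shine_through(volume, decay=0.7):
    """Toy imaging model for a boolean volume (slice, row, col), True = solid.

    Solid pixels in the slice plane get intensity 1. Pore pixels show the
    first solid voxel lying d slices behind them with intensity decay**d.
    """
    n_slices = volume.shape[0]
    images = np.zeros(volume.shape, dtype=float)
    for z in range(n_slices):
        img = volume[z].astype(float)
        unresolved = ~volume[z]               # pore pixels still "looking deeper"
        for d in range(1, n_slices - z):
            hit = unresolved & volume[z + d]  # first solid voxel behind the pore
            img[hit] = decay ** d
            unresolved &= ~hit
        images[z] = img
    return images

rng = np.random.default_rng(0)
volume = rng.random((8, 32, 32)) < 0.5        # hypothetical porous structure
images = render_with_shine_through(volume, decay=0.7)

segmented = images > 0.5                      # intensity-only segmentation
false_pos = segmented[0] & ~volume[0]         # pores wrongly labelled as solid
print(false_pos.mean())                       # error rate caused by shine-through
```

In this toy setting a pore with material directly behind it has intensity 0.7 and passes the threshold, so an intensity-only rule cannot separate it from material that is only weakly bright in the slice plane.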

Synthetic Data Generation Methods

Addressing Lack of Ground Truth

  • More sophisticated, machine-learning-based segmentation techniques are recommended.

  • These techniques require:

    • Large training datasets.

    • Known ground-truth values for effective training and validation.

  • Synthetic data can potentially meet both requirements.

Monte Carlo Simulations

  • A class of computational algorithms based on repeated random sampling, used for example for:

    • Fluid simulations, cellular structures, electron trajectories, etc.
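
As a rough illustration of repeated random sampling applied to electron trajectories, the sketch below estimates the fraction of electrons that leave a flat sample again through its surface. It is a deliberately simplified toy: the exponential step lengths, isotropic scattering, the fixed step budget, and all parameter names are assumptions, and real electron-matter simulations use energy-dependent scattering physics.

```python
import random

def backscatter_fraction(n_electrons=10000, mean_free_path=1.0,
                         max_steps=50, seed=0):
    """Estimate the fraction of electrons that re-emerge from the surface (z < 0)."""
    rng = random.Random(seed)
    backscattered = 0
    for _electron in range(n_electrons):
        z = 0.0
        cos_theta = 1.0  # the beam enters straight into the sample (+z)
        for _step in range(max_steps):
            # Free path between scattering events: exponentially distributed
            step = rng.expovariate(1.0 / mean_free_path)
            z += step * cos_theta
            if z < 0.0:  # the electron has left through the surface
                backscattered += 1
                break
            # Isotropic scattering: cos(theta) is uniform on [-1, 1]
            cos_theta = rng.uniform(-1.0, 1.0)
        # Electrons that use up all their steps are treated as absorbed
    return backscattered / n_electrons

estimate = backscatter_fraction(n_electrons=2000, seed=1)
print(estimate)
```

The estimate becomes less noisy as the number of simulated electrons grows, which is why the electron count is a key parameter of such simulations.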

Creating Synthetic FIB-SEM Images

  • Use Monte Carlo methods to generate virtual initial structures.

  • Simulation parameters include:

    • Accelerating voltage, number of electrons, and working distance.

GANs and Synthetic Data

Introduction to Generative Adversarial Networks (GANs)

  • Generator: Creates new data samples.

  • Discriminator: Judges whether a sample is real or generated (fake).

GAN Training Procedure

  • The generator produces images while the discriminator critiques them; the two networks are trained against each other.

  • The goal is a generator that outputs high-quality samples that were not seen in the training data.
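
The adversarial dynamic is commonly written as a two-player minimax game. The formula below is the standard objective from the original GAN formulation (Goodfellow et al., 2014); whether the lecture uses exactly this loss is an assumption. Here G is the generator, D the discriminator, x a real sample drawn from the data distribution, and z a random noise vector.

```latex
\min_{G}\,\max_{D}\; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator maximizes V by assigning high probability to real samples and low probability to generated ones, while the generator minimizes V by producing samples that the discriminator accepts as real.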

Conditional GANs

  • Allow control over the generated samples by conditioning on additional information, which yields more relevant synthetic images.
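
One common way to condition a generator is to append an encoding of the condition to its noise input. The sketch below shows only this input construction, not a trained network; the function name, the one-hot label encoding, and the three hypothetical classes are assumptions for illustration.

```python
import numpy as np

def conditional_generator_input(noise, labels, n_classes):
    """Concatenate each noise vector with the one-hot encoding of its condition."""
    one_hot = np.eye(n_classes)[labels]  # shape: (batch, n_classes)
    return np.concatenate([noise, one_hot], axis=1)

rng = np.random.default_rng(0)
noise = rng.normal(size=(4, 16))         # 4 noise vectors of length 16
labels = np.array([0, 2, 1, 2])          # hypothetical class per sample
g_input = conditional_generator_input(noise, labels, n_classes=3)
print(g_input.shape)                     # (4, 19)
```

The discriminator typically receives the same condition alongside each sample, so it can check whether a sample matches its condition as well as whether it looks real.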

Limitations of GANs

  • Prone to mode collapse: the generator produces only very similar samples.

  • Difficulty generating truly novel data.

  • Training and optimization are often difficult.

Summary & Take-home Message

  • Synthetic data is invaluable where real data is hard to obtain.

  • Synthetic FIB tomography data can be generated with Monte Carlo methods and GANs.

  • The gap between synthetic and real data distributions can be reduced with machine learning techniques (domain adaptation).