Lecture_04

Lecture Overview

Course: Machine Learning for Physical Systems (lv2987+2988)
Instructor: Trushal Sardhara (tsardhara@ukaachen.de)
Topic: Synthetic Data for Machine Learning
Date: 5th Nov 2023

Introduction to Synthetic Data

What is Synthetic Data?

Data generated using mathematical models or algorithms on computers.

Why Use Synthetic Data?

Real data acquisition can be expensive or time-consuming.
Machine learning applications often require extensive datasets.
Real data may lack known ground-truth values.
New conditions that are rare can be simulated.

Example Data Source

GTA-5 gameplay.
BBBC005v1 from the Broad Bioimage Benchmark Collection [Ljosa et al., Nature Methods, 2012].

Advantages of Synthetic Data

Can typically be generated on a standard computer.
Easily reproducible when needed.
Ground-truth values are directly available.
The quantity of data is limited only by the available compute time and storage.
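A minimal sketch of reproducibility and built-in ground truth, assuming a toy two-phase image model. The function name, grey levels, and noise level are illustrative choices, not taken from the lecture. A fixed seed makes the data reproducible, and the label mask is the ground truth by construction.

```python
import random

def make_synthetic_image(size=16, radius=5, noise=0.1, seed=0):
    """Return (image, mask) for a bright disk on a dark background.

    mask[y][x] is the ground-truth label (1 = disk, 0 = background);
    image[y][x] is the noisy grey value a segmentation method would see.
    """
    rng = random.Random(seed)  # fixed seed -> reproducible data
    centre = (size - 1) / 2
    image, mask = [], []
    for y in range(size):
        img_row, mask_row = [], []
        for x in range(size):
            inside = (x - centre) ** 2 + (y - centre) ** 2 <= radius ** 2
            # Assumed grey levels: 0.8 for the disk, 0.2 for the background.
            value = (0.8 if inside else 0.2) + rng.gauss(0.0, noise)
            img_row.append(value)
            mask_row.append(1 if inside else 0)
        image.append(img_row)
        mask.append(mask_row)
    return image, mask

image_a, mask_a = make_synthetic_image(seed=42)
image_b, mask_b = make_synthetic_image(seed=42)
print(image_a == image_b)  # compares two runs with the same seed
```

Calling the function with a different seed yields a new noisy image of the same structure, so the dataset can be grown as far as compute allows.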

Disadvantages of Synthetic Data

Often too idealized ("too good"), so it may misrepresent real data.
Generation can be challenging and may require complex modeling.
Generation may be computationally intensive.

Real-World Applications

Autonomous vehicle simulations (e.g., the IDDA dataset).
Synthetic cell images generated with SIMCEP.

FIB-SEM Tomography Basics

Fundamentals of FIB-SEM

FIB-SEM: Focused Ion Beam Scanning Electron Microscope.
Key concepts include:
- Signal: Backscattered electrons
- Imaging: SEM offers in-plane images.
- Material Removal: Achieved via focused ion beam.
- Resolution: Ranges from 1-10 nm.
- Output: Stacks of images for 3D reconstruction.

Characteristics of FIB-SEM

Destructive imaging technique: each slice is milled away after imaging, so the same volume cannot be imaged again to verify the data.
Ground truth is hard to establish because acquisition is costly and time-consuming.

3D Reconstruction from FIB-SEM Imaging

Process of 3D Reconstruction

Utilizes image slices for reconstruction:
- Segmentation Techniques:
  - Thresholding (e.g., classifying based on pixel values).
  - Clustering methods (e.g., k-means clustering).
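The two segmentation techniques above can be sketched on a flat list of grey values. This is a minimal illustration using only the standard library. The function names, the sample pixel values, and the threshold of 0.5 are assumptions for the example, not part of the lecture.

```python
def threshold_segment(pixels, threshold):
    """Label each pixel 1 if its grey value exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in pixels]

def kmeans_1d(pixels, k=2, iterations=20):
    """Cluster grey values into k groups; return (labels, centres)."""
    lo, hi = min(pixels), max(pixels)
    # Start with centres spread evenly over the intensity range.
    centres = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(pixels)
    for _ in range(iterations):
        # Assignment step: nearest centre for every pixel.
        labels = [min(range(k), key=lambda c: abs(p - centres[c])) for p in pixels]
        # Update step: move each centre to the mean of its pixels.
        for c in range(k):
            members = [p for p, label in zip(pixels, labels) if label == c]
            if members:
                centres[c] = sum(members) / len(members)
    return labels, centres

pixels = [0.10, 0.15, 0.20, 0.80, 0.85, 0.90, 0.12, 0.88]
print(threshold_segment(pixels, 0.5))
labels, centres = kmeans_1d(pixels)
print(labels, centres)
```

Thresholding needs a hand-picked cut-off, whereas k-means finds the two intensity groups from the data. Both rely on grey value alone.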

Challenges of Reconstruction

Shine-through effect: structures from deeper regions, behind the current slice, remain visible in the image of the current slice.
Intensity ambiguity: brightness levels are not consistent across the image, which makes interpretation by grey value unreliable.

Ambiguity in Intensity

Different characteristics can yield the same intensity readings, making segmentation based on intensity alone problematic.
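A toy illustration of this ambiguity, with hypothetical grey values chosen for the example: two pore pixels appear as bright as in-plane material, so a threshold on intensity alone mislabels them. The errors are only measurable here because the ground-truth labels are known.

```python
# Ground-truth labels for one row of a slice: 1 = material, 0 = pore.
ground_truth = [1, 1, 0, 0, 0, 1, 1, 0]

# Hypothetical grey values. The pores at index 3 and 7 are assumed to
# have material lying just behind the slice, so they appear as bright
# as the in-plane material (shine-through).
intensity = [0.82, 0.79, 0.15, 0.78, 0.18, 0.85, 0.80, 0.81]

# Intensity-only segmentation with an assumed threshold of 0.5.
predicted = [1 if value > 0.5 else 0 for value in intensity]

# Indices where the prediction disagrees with the ground truth.
wrong = [i for i, (p, t) in enumerate(zip(predicted, ground_truth)) if p != t]
print(wrong)
```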

Synthetic Data Generation Methods

Addressing Lack of Ground Truth

More intricate machine-learning-based segmentation techniques are recommended.
Requirements for implementation:
- Large training datasets.
- Known ground-truth values for effective training and validation.
Synthetic data can potentially meet both requirements.

Monte Carlo Simulations

A class of computational algorithms that rely on repeated random sampling, used for example to simulate:
- Fluids, cellular structures, electron trajectories, etc.
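A minimal sketch of repeated random sampling, using a problem with a known answer rather than any of the applications listed above: the area fraction of a disk inside the unit square is estimated by counting random points that fall inside it. The function name and parameters are illustrative.

```python
import math
import random

def estimate_disk_fraction(radius, n_samples, seed=0):
    """Estimate the area fraction of a centred disk in the unit square."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()  # uniform random point
        if (x - 0.5) ** 2 + (y - 0.5) ** 2 <= radius ** 2:
            hits += 1
    return hits / n_samples

exact = math.pi * 0.4 ** 2  # analytic area of a disk with radius 0.4
estimate = estimate_disk_fraction(0.4, 100_000)
print(estimate, exact)
```

The statistical error of such an estimate shrinks roughly as one over the square root of the number of samples, which is why Monte Carlo simulations can become computationally intensive.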

Creating Synthetic FIB-SEM Images

Utilize Monte Carlo methods to generate virtual initial structures.
Parameters for simulations include:
- Accelerating voltage, number of electrons, working distance.

GANs and Synthetic Data

Introduction to Generative Adversarial Networks (GANs)

Generator: Creates new data points.
Discriminator: Evaluates data authenticity, differentiating between real and fake data.

GAN Training Procedure

The generator produces images while the discriminator critiques them; each network is trained against the other.
The goal is a generator that outputs high-quality, previously unseen data.
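The lecture does not specify a loss function. As a sketch, the snippet below assumes the commonly used binary cross-entropy losses (with the non-saturating generator loss) and evaluates them on hypothetical discriminator outputs, where each value is the discriminator's estimated probability that an input is real.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: D should output 1 on real, 0 on generated."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating loss: G wants D to output 1 on generated samples."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs early in training: D separates
# real from generated samples easily.
early_d_real, early_d_fake = [0.9, 0.8], [0.1, 0.2]
# Hypothetical outputs late in training: D can no longer tell them apart.
late_d_real, late_d_fake = [0.5, 0.5], [0.5, 0.5]

print(discriminator_loss(early_d_real, early_d_fake), generator_loss(early_d_fake))
print(discriminator_loss(late_d_real, late_d_fake), generator_loss(late_d_fake))
```

In this toy comparison the generator loss falls and the discriminator loss rises as the generated samples become harder to distinguish from real ones.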

Conditional GANs

Conditional GANs allow control over the output samples by conditioning generation on specific inputs (e.g., labels), producing more relevant synthetic images.
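One common way to supply the condition, shown here as an assumption rather than as the method used in the lecture, is to append a one-hot label vector to the generator's noise input. The helper names and dimensions below are hypothetical.

```python
import random

def one_hot(label, num_classes):
    """Return a list with 1.0 at the label index and 0.0 elsewhere."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def generator_input(label, num_classes, noise_dim, rng):
    """Concatenate random noise with a one-hot condition vector."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return noise + one_hot(label, num_classes)

rng = random.Random(0)
z = generator_input(label=2, num_classes=3, noise_dim=4, rng=rng)
print(len(z), z[-3:])
```

The noise part varies from sample to sample, while the appended condition stays fixed for a requested class.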

Limitations of GANs

Prone to mode collapse: the generator produces only a narrow set of similar samples.
Difficulty generating truly novel data.
Training and optimization can be difficult.

Summary & Take-home Message

Synthetic data is invaluable where real data is hard to obtain.
Synthetic FIB tomography data can be generated with Monte Carlo methods and GANs.
The gap between synthetic and real data distributions can be narrowed with machine learning techniques.