Lecture_04
Lecture Overview
Course: Machine Learning for Physical Systems (lv2987+2988)
Instructor: Trushal Sardhara (tsardhara@ukaachen.de)
Topic: Synthetic Data for Machine Learning
Date: 5th Nov 2023
Introduction to Synthetic Data
What is Synthetic Data?
Data generated using mathematical models or algorithms on computers.
Why Use Synthetic Data?
Real data acquisition can be expensive or time-consuming.
Machine learning applications often require extensive datasets.
Real data may lack known ground-truth values.
Rare or entirely new conditions can be simulated.
Example Data Source
GTA-5 gameplay footage.
BBBC005v1 from the Broad Bioimage Benchmark Collection [Ljosa et al., Nature Methods, 2012].
Advantages of Synthetic Data
Can typically be generated on a standard computer.
Easily reproducible when needed.
Ground-truth values are directly available.
In principle, there is no limit to the amount of data that can be generated.
Disadvantages of Synthetic Data
Often too idealized ("too good"), so it may not represent real data faithfully.
Generation can be challenging and require complex modeling.
Computationally intensive processes may be necessary.
Real-World Applications
Autonomous vehicle simulations (e.g., the IDDA dataset).
Synthetic cell images (e.g., generated with SIMCEP).
FIB-SEM Tomography Basics
Fundamentals of FIB-SEM
FIB-SEM: Focused Ion Beam Scanning Electron Microscope.
Key concepts include:
Signal: Backscattered electrons
Imaging: SEM offers in-plane images.
Material Removal: Achieved via focused ion beam.
Resolution: Approximately 1 to 10 nm.
Output: Stacks of images for 3D reconstruction.
Characteristics of FIB-SEM
Destructive imaging technique: each slice is milled away after it is imaged, so the sample cannot be imaged again.
Ground truth is hard to establish for acquired data because acquisition is costly and time-consuming.
3D Reconstruction from FIB-SEM Imaging
Process of 3D Reconstruction
Utilizes image slices for reconstruction:
Segmentation Techniques:
Thresholding (e.g., classifying based on pixel values).
Clustering methods (e.g., k-means clustering).
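The two segmentation techniques above can be sketched on a toy image. This is a minimal NumPy illustration, not the pipeline used in the lecture: the function names, the toy image, the threshold of 0.5, and the plain 1D k-means on pixel intensities are all assumptions made for the example.

```python
import numpy as np

def threshold_segment(image, threshold):
    """Label pixels brighter than `threshold` as 1 (material), others as 0 (pore)."""
    return (image > threshold).astype(np.uint8)

def kmeans_segment(image, k=2, n_iter=20):
    """Cluster pixel intensities into k groups with plain 1D k-means."""
    values = image.ravel().astype(float)
    # Initialise the centres evenly between the darkest and brightest pixel.
    centers = np.linspace(values.min(), values.max(), k)
    for _ in range(n_iter):
        # Assignment step: nearest centre for each pixel.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Update step: move each centre to the mean of its pixels.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    return labels.reshape(image.shape), centers

# Toy slice (hypothetical): dark left half (pore), bright right half (material), plus noise.
rng = np.random.default_rng(0)
truth = np.zeros((64, 64), dtype=np.uint8)
truth[:, 32:] = 1
image = np.where(truth == 1, 0.8, 0.2) + rng.normal(0.0, 0.05, truth.shape)

seg_threshold = threshold_segment(image, threshold=0.5)
seg_kmeans, centers = kmeans_segment(image, k=2)
```

Both methods look only at pixel intensity, which is why they work on this clean toy image and struggle once intensities become ambiguous.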
Challenges of Reconstruction
Shine-through effect: Structures lying behind the current slice are visible in the current image.
Intensity ambiguity: Brightness levels are not consistent, which complicates interpretation of the images.
Ambiguity in Intensity
Different features (e.g., material in the current slice and material shining through from behind) can yield the same intensity, so segmentation based on intensity alone is unreliable.
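A small numerical sketch of why intensity alone is not enough. The intensity model is hypothetical (illustrative numbers, not measured values): 30% of pore pixels are assumed to show bright material from deeper layers, so they look like material in the current slice.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000  # number of pixels per class in this toy example

# Assumed intensity model: material is bright; most pore pixels are dark,
# but some pores show bright material from behind (shine-through).
shine_through = rng.random(n) < 0.3
pore_intensity = np.where(shine_through,
                          rng.normal(0.7, 0.05, n),   # looks like material
                          rng.normal(0.2, 0.05, n))   # truly dark pore
material_intensity = rng.normal(0.7, 0.05, n)

threshold = 0.45
pore_error = np.mean(pore_intensity > threshold)            # pores labelled as material
material_error = np.mean(material_intensity <= threshold)   # material labelled as pore

print(f"pore pixels misclassified:     {pore_error:.1%}")
print(f"material pixels misclassified: {material_error:.1%}")
```

No choice of threshold separates the shine-through pore pixels from real material here, because both are drawn from the same intensity distribution.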
Synthetic Data Generation Methods
Addressing Lack of Ground Truth
More sophisticated, machine-learning-based segmentation techniques are recommended.
Requirements for implementation:
Large training datasets.
Known ground-truth values for effective training and validation.
Synthetic data can potentially meet both requirements.
Monte Carlo Simulations
A class of computational algorithms based on repeated random sampling, used for:
Fluid simulations, cellular structures, electron trajectories, etc.
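To illustrate the idea of repeated random sampling, here is the textbook Monte Carlo estimate of pi. It is a generic example and not one of the physical simulations listed above; the function name `estimate_pi` is made up for this sketch.

```python
import numpy as np

def estimate_pi(n_samples, seed=0):
    """Estimate pi from the fraction of random points that fall in a quarter circle."""
    rng = np.random.default_rng(seed)
    x = rng.random(n_samples)  # uniform samples in [0, 1)
    y = rng.random(n_samples)
    inside = (x**2 + y**2) <= 1.0
    # Area of quarter circle / area of unit square = pi / 4.
    return 4.0 * inside.mean()

estimate = estimate_pi(100_000)
print(estimate)
```

The error of such an estimate shrinks roughly with the square root of the number of samples, which is why Monte Carlo simulations can become computationally expensive.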
Creating Synthetic FIB-SEM Images
Utilize Monte Carlo methods to generate virtual initial structures.
Parameters for simulations include:
Accelerating voltage, number of electrons, working distance.
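A heavily simplified sketch of the idea: build a random virtual structure whose ground truth is known by construction, then render noisy image slices from it. Everything here is an assumption for illustration. The spherical pores, the function names, and the intensity values are hypothetical, and the rendering step is a crude stand-in that does not simulate electron trajectories or model accelerating voltage, number of electrons, or working distance.

```python
import numpy as np

def make_virtual_structure(shape=(32, 64, 64), n_pores=40, radius_range=(3, 8), seed=0):
    """Return a binary volume: 1 = solid material, 0 = pore (the ground truth)."""
    rng = np.random.default_rng(seed)
    volume = np.ones(shape, dtype=np.uint8)
    zz, yy, xx = np.indices(shape)
    for _ in range(n_pores):
        # Randomly sample the centre and radius of each spherical pore.
        center = [rng.integers(0, s) for s in shape]
        radius = rng.uniform(*radius_range)
        dist2 = (zz - center[0])**2 + (yy - center[1])**2 + (xx - center[2])**2
        volume[dist2 <= radius**2] = 0
    return volume

def render_slice(volume, z, noise_std=0.05, seed=0):
    """Crude image model: bright material, dark pores, additive Gaussian noise."""
    rng = np.random.default_rng(seed)
    clean = np.where(volume[z] == 1, 0.8, 0.2)
    return clean + rng.normal(0.0, noise_std, clean.shape)

ground_truth = make_virtual_structure()
image_stack = np.stack([render_slice(ground_truth, z, seed=z)
                        for z in range(ground_truth.shape[0])])
```

Each synthetic image in `image_stack` comes with a pixel-exact label in `ground_truth`, which is the property that makes such data usable for training and validating segmentation models.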
GANs and Synthetic Data
Introduction to Generative Adversarial Networks (GANs)
Generator: Creates new data points.
Discriminator: Evaluates data authenticity, differentiating between real and fake data.
GAN Training Procedure
The generator produces images while the discriminator critiques them; each network is trained to outperform the other.
The goal is a generator that outputs high-quality, previously unseen data.
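For reference, the adversarial game is usually written as the minimax objective from the original GAN formulation (Goodfellow et al., 2014). The notes do not state which variant the lecture uses, so treat this as the standard form rather than the specific one presented. Here G is the generator, D the discriminator, x a real sample, and z a noise vector.

```latex
\min_{G}\,\max_{D}\; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator maximizes V by assigning high probability to real samples and low probability to generated ones; the generator minimizes V by producing samples the discriminator accepts as real.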
Conditional GANs
Allow control over the generated samples by conditioning on additional information (e.g., a class label), which yields more relevant synthetic images.
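One common way to feed a condition to the generator is to append a one-hot encoded label to the noise vector. The sketch below shows only this input construction, not a trained network; the function name, the noise dimension of 16, and the three classes are assumptions for the example.

```python
import numpy as np

def conditional_generator_input(noise, labels, num_classes):
    """Concatenate a one-hot encoding of `labels` to each noise vector."""
    one_hot = np.eye(num_classes)[labels]
    return np.concatenate([noise, one_hot], axis=1)

rng = np.random.default_rng(0)
noise = rng.normal(size=(4, 16))   # 4 samples, 16-dimensional noise
labels = np.array([0, 2, 1, 2])    # hypothetical class per sample
gen_input = conditional_generator_input(noise, labels, num_classes=3)
print(gen_input.shape)  # (4, 19)
```

The discriminator receives the same condition alongside each image, so it can judge whether a sample is both realistic and consistent with its label.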
Limitations of GANs
Susceptible to producing similar samples (mode collapse).
Difficulty generating truly novel data.
Training and optimization can be arduous tasks.
Summary & Take-home Message
Synthetic data is invaluable where real data is hard to obtain.
Synthetic FIB tomography data can be generated with Monte Carlo methods and GANs.
The gap between synthetic and real data distributions can be reduced with machine learning techniques (domain adaptation).