IP Genomics: Bioinformatics Visualization, Clustering, and DNA Methylation
Logistics and Proteomics Group Presentations
Group Assignment Management:
Group assignments for the proteomics presentation have been finalized and communicated via email and a shared table/Excel sheet.
The Group ID listed in the final column of the sheet determines the presentation date and topic.
There are currently groups, each consisting of students.
Three students who did not sign up will be combined to form the final group; these students should contact the instructor to be connected.
Presentation Schedule and Topics:
Presentation dates are assigned by Group ID. For example, groups through are scheduled to present on Friday, May 22.
Topics are assigned sequentially: Group takes the first topic, Group takes the second, and so on.
Presentation Requirements:
Duration: minutes for the presentation followed by minutes for Q&A.
Slide Count: Exactly slides, excluding the title slide and the final reference slide.
Content: The title slide must include the presentation title, the specific topic, and the names of all team members.
Grading Criteria: Grades are based on presentation quality, style, timing, and the ability to handle questions.
Due Date: All slides must be submitted by noon on May 21, regardless of the assigned presentation day.
Late Policy: Point deductions will occur based on the amount of time elapsed after the deadline.
Audience Participation:
All students are required to submit questions for each topic by May 21; these submissions contribute to the final grade.
Visualizing Next-Generation Sequencing (NGS) Data
Volcano Plots:
Definition: A scatter plot used to visualize statistical significance versus the magnitude of change (effect size) for thousands of genes.
X-axis: Represents the log fold change (log FC), which signifies the difference in expression between two groups (e.g., treatment vs. control).
Y-axis: Represents the negative log10 of the p-value ($-\log_{10} p$).
Higher values on the Y-axis indicate greater statistical significance (smaller p-values).
Data Points: Each dot represents a single gene.
Color Coding: Genes crossing specific thresholds for both fold change and p-value are typically colored (e.g., red or blue). For instance, a dot on the far left with a high Y-value represents a gene that is significantly down-regulated.
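The coordinates and color classes described above can be sketched in a few lines of Python. This is a minimal illustration, not a plotting routine: the cutoffs (`LFC_CUTOFF`, `P_CUTOFF`), the use of base-2 logs for the fold change, and the example genes are all assumptions made for the sketch, since the lecture did not specify thresholds.

```python
import math

# Hypothetical thresholds; the lecture did not give specific cutoffs.
LFC_CUTOFF = 1.0   # |log2 fold change| >= 1, i.e. at least a two-fold change
P_CUTOFF = 0.05    # p-value below 0.05


def volcano_point(log2_fc, p_value):
    """Return (x, y, label) for one gene on a volcano plot."""
    y = -math.log10(p_value)  # smaller p-value -> larger y
    if p_value < P_CUTOFF and log2_fc >= LFC_CUTOFF:
        label = "up"
    elif p_value < P_CUTOFF and log2_fc <= -LFC_CUTOFF:
        label = "down"
    else:
        label = "not significant"
    return log2_fc, y, label


# Made-up example genes: (name, log2 fold change, p-value)
genes = [
    ("geneA", -3.2, 1e-8),  # far left, high on the y-axis
    ("geneB", 2.5, 1e-4),
    ("geneC", 0.3, 0.6),
    ("geneD", 2.0, 0.2),    # large change, but not significant
]

for name, lfc, p in genes:
    x, y, label = volcano_point(lfc, p)
    print(f"{name}: x={x:+.1f}, y={y:.2f}, {label}")
```

A gene is colored only when it passes both thresholds, which is why `geneD` stays in the "not significant" class despite its large fold change.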
MA Plots:
Definition: A plot used to visualize gene expression differences relative to their average intensity.
X-axis: Represents the mean expression (average expression across all samples) on a log scale.
Y-axis: Represents the log2 fold change ($\log_2 \mathrm{FC}$).
Purpose: To evaluate the stability of estimates. Low-expressed genes (far left of the X-axis) often show high variability and stochastic noise, making fold change estimates less reliable. Highly expressed genes tend to provide more stable and statistically significant data points.
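As a rough sketch of how the two MA coordinates relate, the snippet below computes them for one gene from raw counts. It assumes the classic M/A definition (A is the average of the two log2 group means, M is their difference) and a pseudocount of 1 to avoid taking the log of zero; tools such as DESeq2 define the x-axis slightly differently, and the counts are invented for illustration.

```python
import math


def ma_point(control_counts, treatment_counts, pseudocount=1.0):
    """Return (A, M) for one gene.

    A = average log2 expression of the two groups (x-axis)
    M = log2 fold change, treatment vs. control (y-axis)
    The pseudocount is an assumption of this sketch.
    """
    mean_c = sum(control_counts) / len(control_counts)
    mean_t = sum(treatment_counts) / len(treatment_counts)
    log_c = math.log2(mean_c + pseudocount)
    log_t = math.log2(mean_t + pseudocount)
    a = (log_c + log_t) / 2
    m = log_t - log_c
    return a, m


# Invented counts: both genes have the same fold change (M),
# but very different average expression (A).
low_gene = ma_point([1, 0, 2], [3, 4, 2])
high_gene = ma_point([1023, 1023, 1023], [2047, 2047, 2047])

print("low-expressed gene  (A, M):", low_gene)
print("high-expressed gene (A, M):", high_gene)
```

Both genes land at the same height on the plot, but the low-expressed one rests on a handful of reads, so a change of one or two counts would move it substantially. That is the instability the left side of an MA plot makes visible.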
Experimental Design and Batch Effects
Batch Effects (Artifacts):
Occur when samples are processed differently (e.g., different days, different RNA extraction batches, or different sequencing runs).
Perfect Confounding: A severe design flaw where the treatment group is processed in one batch and the control group in another. This makes it impossible to distinguish biological signals from processing artifacts.
Balanced Design: The recommended approach where control and treatment samples are randomly mixed across all processing batches to ensure the condition is balanced across days.
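The difference between a perfectly confounded and a balanced design can be checked mechanically. The sketch below is illustrative only: the function names, the `(sample, condition, batch)` tuple layout, and the sample and batch labels are hypothetical, and the round-robin assignment is just one simple way to spread each condition across batches.

```python
import random
from collections import defaultdict


def conditions_per_batch(design):
    """Map each batch to the set of conditions processed in it."""
    seen = defaultdict(set)
    for _sample, condition, batch in design:
        seen[batch].add(condition)
    return dict(seen)


def is_perfectly_confounded(design):
    """True when no batch contains more than one condition."""
    return all(len(c) == 1 for c in conditions_per_batch(design).values())


def balanced_design(samples_by_condition, batches, seed=0):
    """Shuffle the samples of each condition, then deal them across batches."""
    rng = random.Random(seed)
    design = []
    for condition, samples in samples_by_condition.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        for i, sample in enumerate(shuffled):
            design.append((sample, condition, batches[i % len(batches)]))
    return design


# Flawed layout: all controls on day 1, all treated samples on day 2.
flawed = [
    ("c1", "control", "day1"), ("c2", "control", "day1"),
    ("t1", "treatment", "day2"), ("t2", "treatment", "day2"),
]
print(is_perfectly_confounded(flawed))  # True

balanced = balanced_design(
    {"control": ["c1", "c2", "c3", "c4"], "treatment": ["t1", "t2", "t3", "t4"]},
    batches=["day1", "day2"],
)
print(is_perfectly_confounded(balanced))  # False
```

In the balanced layout every processing day contains both conditions, so a day-to-day artifact affects control and treatment samples alike and can be separated from the biological signal.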
Other Confounders:
Factors such as RNA quality, biological sex, and patient age must be matched or accounted for during analysis to avoid misleading results.
Downstream Bioinformatic Analysis
Pathway and Ontology Analysis:
Once a list of differentially expressed genes is generated using tools like DESeq2 or edgeR, researchers perform pathway analysis or Gene Ontology (GO) analysis.
The goal is to determine the biological functions or pathways affected by the observed changes, such as identifying biomarkers for cancer.
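One common way to ask whether a pathway or GO term is affected is an over-representation test: given how many genes were tested and how many belong to the pathway, how surprising is the overlap with the differentially expressed list? The lecture did not name a specific test, so the hypergeometric test below is an assumption of this sketch, and the gene counts in the example are invented.

```python
from math import comb


def enrichment_p_value(n_background, n_in_pathway, n_de, n_overlap):
    """Upper-tail hypergeometric p-value.

    Probability of seeing at least `n_overlap` pathway genes among
    `n_de` differentially expressed genes drawn from `n_background`
    tested genes, of which `n_in_pathway` belong to the pathway.
    """
    total = comb(n_background, n_de)
    upper = min(n_in_pathway, n_de)
    tail = sum(
        comb(n_in_pathway, k) * comb(n_background - n_in_pathway, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Invented numbers: 20,000 tested genes, 200 in the pathway,
# 500 differentially expressed, 15 of which are in the pathway.
# By chance alone we would expect about 500 * 200 / 20000 = 5.
p = enrichment_p_value(20000, 200, 500, 15)
print(f"enrichment p-value: {p:.3g}")
```

In practice this test is run for many pathways at once, so the resulting p-values would also need a multiple-testing correction, which is omitted here.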
Clustering:
Definition: Grouping genes with similar expression profiles together.
Logic: Genes with similar expression patterns often share similar regulatory mechanisms (e.g., same transcription factors) or similar biological functions.
Benefit: Reduces the complexity of the data from thousands of individual genes to a smaller number of clusters (e.g., or clusters).
K-means Clustering Algorithm:
Initial Step: Specify the number of clusters, $k$, beforehand.
Seed Selection: Randomly pick $k$ data points to serve as initial cluster centers (centroids).
Assignment: Calculate the distance (similarity metric) between every gene and the centroids; assign each gene to the closest centroid.
Recalculation: Compute the new geometric centers of the resulting groups.
Iteration: Repeat the assignment and recalculation steps until the cluster centers stabilize (converge) or reach a set number of iterations (e.g., ).
Mathematical Objective: To minimize the within-cluster sum of squares, reducing variability within groups.
Limitations: The user must know or guess $k$; the algorithm is sensitive to outliers and can be less robust if data is sparse.
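The steps above (seed selection, assignment, recalculation, iteration) can be sketched in plain Python. This is a teaching sketch rather than a production implementation: it uses squared Euclidean distance as the similarity metric, a default cap of 100 iterations, and a fixed random seed, all of which are assumptions, and the expression profiles are made up. The number of clusters is called `k` here.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Geometric center (per-dimension mean) of a group of profiles."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # seed selection: k random data points
    labels = None
    for _ in range(max_iter):
        # Assignment: each gene goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # nothing moved, so the centers are stable
            break
        labels = new_labels
        # Recalculation: move each centroid to the center of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    # Objective being minimized: within-cluster sum of squares.
    wcss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss


# Made-up expression profiles (3 conditions per gene), two obvious groups.
profiles = [
    (0.0, 0.1, 0.0), (0.1, 0.0, 0.1), (0.0, 0.0, 0.2),
    (5.0, 5.1, 4.9), (5.1, 5.0, 5.0), (4.9, 5.0, 5.1),
]
labels, centroids, wcss = kmeans(profiles, k=2)
print(labels, round(wcss, 3))
```

Because the starting centroids are random, different seeds can give different cluster numbering and, on less clean data, different final clusters. A common precaution is to run the algorithm several times and keep the run with the lowest within-cluster sum of squares.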
Introduction to Epigenetics and Epigenomics
Terminology:
Epigenetics: The study of heritable changes in gene function that do not involve alterations to the DNA sequence ("beyond genetics").
Epigenomics: The global study of these changes across the entire genome.
Core Requirements of Epigenetic Marks:
Heritable: Passed from one cell generation to the next.
Reversible: Can be added or removed.
No Sequence Change: Does not alter the DNA code.
Functional: Regulates gene expression (acting as "switches" or "tuners").
Historical Case Study (Överkalix):
A study of an isolated farming community revealed that shocks in food supply (starvation vs. abundance) experienced by ancestors were associated with the health and longevity of their grandsons, a finding often cited as evidence of transgenerational epigenetic inheritance.
Biological Examples:
Identical Twins: Despite identical genomes, twins can have different health outcomes (e.g., one getting cancer) due to differing epigenomes.
Agouti Mice: Genetically identical mice can appear fat and yellow or skinny and brown. In the yellow mouse, the agouti gene is "on"; in the brown mouse, a methyl group attaches to the promoter, shutting the gene down.
Chromatin and Gene Regulation
Structural Hierarchies:
Histones: Proteins that DNA wraps around, appearing like "beads on a string."
Nucleosome: The basic unit of chromatin, consisting of DNA wrapped around a histone core.
Chromatin: The complex of DNA and protein.
Chromatin Remodeling:
Open Chromatin: Allows transcription factors (TFs) and RNA polymerase to access DNA, turning genes on.
Closed (Tightly Packed) Chromatin: Physically blocks access to DNA, preventing transcription and silencing genes.
Cell Specification: While every cell in the body shares the same genome, the epigenome differs between cell types (e.g., neurons vs. heart cells) to regulate specific gene expression.
DNA Methylation Mechanisms
Chemical Process:
Involves the addition of a methyl group to the 5-carbon of a cytosine residue, forming 5-methylcytosine (5mC).
Typically occurs at CpG dinucleotides (Cytosine-phosphate-Guanine).
Catalyzed by the enzyme DNA cytosine methyltransferase.
Gene Silencing:
CpG islands, regions with a high density of CpG sites, are often located in promoter regions; when an island is heavily methylated, the associated gene is typically silenced.
Direct Interference: Methyl groups can physically reduce the binding affinity of transcription factors.
Recruitment: Methylated DNA can be recognized by Methyl Binding Domain (MBD) proteins, which recruit co-repressors to further silence the gene.
Tumor Suppressors: In many cancers, the promoter regions of tumor suppressor genes become heavily methylated, silencing their protective functions.
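To make the idea of CpG dinucleotides and CpG-dense regions concrete, the sketch below locates CpG sites in a DNA string and computes two summary numbers often used when screening for CpG islands: GC content and the observed/expected CpG ratio. The formula (number of CpGs times sequence length, divided by the product of C and G counts) and the commonly cited screening criteria (length of at least 200 bp, GC content above 50%, ratio above 0.6) are not from the lecture; they are included as background assumptions. The sequences are invented, and the snippet only reads sequence, so it says nothing about whether a site is actually methylated.

```python
def cpg_positions(seq):
    """0-based positions of every CpG (a C immediately followed by a G)."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    n_cpg = len(cpg_positions(seq))
    gc_content = (c_count + g_count) / n if n else 0.0
    # Expected CpG count if C and G were placed independently: C * G / N
    if c_count and g_count:
        obs_exp = (n_cpg * n) / (c_count * g_count)
    else:
        obs_exp = 0.0
    return gc_content, obs_exp


# Invented toy sequences, far shorter than a real CpG island.
cpg_rich = "GCGCGACGTCGCGGCGCCGACGCG"
cpg_poor = "ATTTAGCATTGACTTAAGCTTTAG"

for name, seq in [("CpG-rich", cpg_rich), ("CpG-poor", cpg_poor)]:
    gc, ratio = cpg_stats(seq)
    print(f"{name}: {len(cpg_positions(seq))} CpG sites, "
          f"GC={gc:.2f}, obs/exp={ratio:.2f}")
```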
Biological Applications:
X-chromosome Inactivation: One of the two X chromosomes in female cells is silenced, with DNA methylation helping to maintain the inactive state.
Imprinting: Only one parental copy of a gene is expressed while the other is silenced via epigenetic tags.
Questions & Discussion
Question: When might the midterm grades be available?
Response: They should be graded in about a week.
Question: Can you provide the slides from the previous lecture (Monday)?
Response: Yes, the instructor noted that access was restricted and will send the slides out so students can take notes.