
Lab 1 Genomics and Bioinformatics for Gene Identification and Primer Design

Genomics and Bioinformatics: Understanding Gene Complexity and Primer Design

The Deeper Dive into Genomics

  • Beyond Basic Sequencing: While the human genome is well-characterized, modern genomics requires a much deeper understanding of genetic sequences and their complexity than simply identifying genes.

  • Genes as Annotations: Genes are not inherent physical entities but rather human annotations applied to specific genetic sequences within the genome. This perspective is crucial for understanding the high complexity of genetic operations.

  • Complexity in Action: Even relatively small regions of chromosome 12, spanning 2,000,000 to 12,000,000 base pairs, are remarkably intricate. Because DNA is double-stranded, a 2,600,000 bp sequence actually comprises 5,200,000 nucleotides when both strands are counted.

  • Bidirectional Gene Expression: Genes are not exclusively expressed in a single direction (5' to 3' or 'sense'). They can also be expressed in the opposite direction, known as the 'antisense' direction, complicating the understanding of DNA function.

    • Sense vs. Antisense: This terminology is preferred over 'coding' and 'template' strands, as 'sense' and 'antisense' carry no bias regarding the direction of gene expression.

  • Navigating Complexity with Databases: The human mind struggles with this level of complexity. Therefore, scientists rely on sophisticated databases and computational tools to store, manage, and access genomic information.

    • Key Principle: Genes are expressed from both strands (5' to 3' on one strand and 5' to 3' on the opposite strand), a concept fundamental to advanced genomic studies.
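The relationship between the two strands can be sketched in a few lines of Python. This is a minimal illustration, not part of the lab protocol: the function name and the seven-base sequence are chosen for the example only. It shows that the antisense strand, read in its own 5' to 3' direction, is the reverse complement of the sense strand.

```python
# Map each base to its Watson-Crick partner (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand of `seq`, written 5' to 3'."""
    # Complement every base, then reverse so the result reads 5' -> 3'.
    return seq.translate(COMPLEMENT)[::-1]


sense = "ATAATGG"                      # illustrative sequence, read 5' -> 3'
antisense = reverse_complement(sense)  # opposite strand, also read 5' -> 3'

print(antisense)                       # CCATTAT
print(len(sense) + len(antisense))     # 14 nucleotides across both strands
```

Applying the function twice returns the original sequence, which is a quick sanity check when working with either strand.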

Bioinformatics Lab - Gene Identification and Primer Design

Lab Objectives/Tasks:
  • Identify any given gene using databases (NCBI, Ensembl).

  • Identify basic gene structure: 5' UTR, promoter, exons, introns, 3' UTR. (The term 'terminator' is mentioned but not fully elaborated within the structure discussion).

  • Identify specific transcripts present on a gene.

  • Navigate databases effectively and accurately under time pressure (critical for assessment).

  • Prepare basic PCR primers and understand the design process.

  • Critically review and analyze primer design properties (degenerate primers are excluded from this module).

Introduction to Databases: Ensembl
  • Ensembl Database: A user-friendly database particularly useful for examining the genetic structure of specific genes. Access via embedded links in course materials.

  • Species Selection: Ensembl allows selection of various species (primates, rodents, fish, etc.), highlighting the need to understand how to apply bioinformatics skills across different organisms.

    • Mice: A clinically relevant, cheap small animal model of disease. Cost: approx. €20 per cage (several years ago). While individual costs are low, large-scale studies can involve thousands of cages, leading to substantial overall expenses (€40,000-60,000 for mice alone, excluding extensive care and regulatory compliance).

    • Zebrafish: Common model, especially in cardiovascular medicine, due to their ability to regenerate heart tissue after injury, unlike humans who lose cardiomyocytes permanently. Used to seek pathways for cardiomyocyte regeneration therapies.

    • Ethical Considerations: Significant ethical concerns exist when studying higher primates due to visible similarities with humans, impacting research.

  • Gene Search Example (TMEM179): Searching for TMEM179 (Transmembrane Protein 179) leads to gene and transcript entries.

Gene vs. Transcript: Understanding Expression and Diagnosis
  • Transcripts: RNA molecules transcribed from a gene. Protein-coding transcripts (mRNA) are translated into proteins at the ribosome. Transcripts represent the active expression of a gene.

  • Gene Expression Measurement: To examine gene expression (how much a gene is 'on' or 'off'), one must look at the transcript level. Cells regulate gene expression by increasing or decreasing the number of transcripts made from a gene.

    • Example (Weightlifting): When muscles grow, genes associated with muscle/protein synthesis increase their expression, leading to more transcripts being made. However, the number of gene copies (e.g., two, one maternal, one paternal) remains constant.

    • Mnemonic: Expression = Transcripts. These two concepts must be linked.

  • Genomic Diagnosis: For diagnosing genetic diseases based on changes in the core genetic sequence, genomic data (gDNA) is required.

  • Treatment Response/Disease State: To assess how a patient responds to treatment or their current disease state in terms of gene activity, transcript levels are examined.
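The difference between gene copy number and transcript level can be made concrete with a small calculation. The sketch below uses invented transcript counts purely for illustration; the function name is hypothetical and no real measurement is implied.

```python
def fold_change(after: float, before: float) -> float:
    """Ratio of transcript abundance after vs. before a stimulus."""
    if before <= 0:
        raise ValueError("baseline transcript count must be positive")
    return after / before


# Hypothetical numbers for a muscle-growth gene before and after training.
gene_copies_before, gene_copies_after = 2, 2   # one maternal, one paternal copy
transcripts_before, transcripts_after = 150, 600

print(gene_copies_after == gene_copies_before)             # True: gDNA is unchanged
print(fold_change(transcripts_after, transcripts_before))  # 4.0: expression rose
```

The gene copy number answers a diagnostic question about the sequence itself, while the fold change answers a question about gene activity.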

Exploring TMEM179 in Ensembl
  • The TMEM179 human gene is identified by a gene number and is named Transmembrane Protein 179.

  • Transcripts: This specific gene has six transcripts. These details include amino acid count and whether transcripts are protein-coding or non-protein-coding.

  • Transcript Structure Visualization: Ensembl displays transcript structures:

    • Untranslated Regions (UTRs): White boxes.

    • Exons: Dark or yellow boxes; these are the coding regions that are translated into amino acids.

    • Introns: Lines between exons, representing regions that are typically skipped during splicing.

    • Alternative Splicing: Introns are usually removed, but exons can be joined in different combinations and an intron is occasionally retained. This 'alternative splicing' generates different transcripts and protein isoforms from the same gene.

  • Transcript Information Summary: Clicking on a transcript provides a summary:

    • Location, number of exons (e.g., TMEM179-212 has 4 exons and 223 amino acids).

    • Transcript ID, length (e.g., 1,430 base pairs), biotype (e.g., protein-coding).

    • Chromosomal base pair coordinates (e.g., 25,361 to 163,407).

  • Gene Length Calculation: The length of a gene is determined by subtracting the start base pair coordinate from the end base pair coordinate (adding 1 when both coordinates are inclusive).

  • Downloading Sequences (Fasta Format):

    • To get the genetic sequence, use the 'Export Data' feature.

    • Select 'Fasta' format, which starts with a line containing > followed by the gene name and then the sequence itself.

    • Set flanking sequences to 0 to get only the gene sequence.

    • Choose 'genomic unmasked sequence' to get the raw DNA sequence without feature annotations.

    • Note: For retrieving cDNA (transcript) sequences, Ensembl's export function is considered less user-friendly or 'messy' compared to other tools.
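Once a sequence has been exported, the Fasta layout and the length calculation described above can be checked with a short script. This is a minimal sketch: the function names, the header text, and the coordinates are placeholders invented for the example, and it assumes start and end coordinates are both inclusive.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse Fasta text into {header: sequence}; headers start with '>'."""
    parts: Dict[str, List[str]] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # drop the leading '>'
            parts[header] = []
        elif header is not None:
            parts[header].append(line)    # sequence may wrap over many lines
    return {h: "".join(chunks) for h, chunks in parts.items()}


def span_length(start: int, end: int, inclusive: bool = True) -> int:
    """Length of a region; add 1 when both endpoints are counted."""
    return end - start + (1 if inclusive else 0)


# Placeholder record, not a real Ensembl export.
example = ">example_gene hypothetical header\nATGGCC\nTTAGGA\n"
records = parse_fasta(example)
print(records)
print(span_length(1000, 1999))                   # 1000 bases, endpoints included
print(span_length(1000, 1999, inclusive=False))  # 999 by plain subtraction
```

Comparing the length of the downloaded sequence with the coordinate-based length is a quick way to confirm that the flanking sequence was set to 0.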

Introduction to NCBI
  • NCBI (National Center for Biotechnology Information): A continuously evolving public database. It is a collection of multiple, often linked, biological databases.

  • Gene Database: The primary database for this lecture, but many other resources are available.

  • Navigating NCBI: A specific search (e.g., TMEM179) can lead to various entries.

    • Beta Views: Users should avoid 'beta' (pre-release/testing) user interfaces for critical tasks or assessments, as they may not be stable or fully supported.

  • cDNA vs. gDNA Terminology:

    • cDNA (complementary DNA): Refers to DNA sequences derived from mRNA transcripts (lacking introns).

    • gDNA (genomic DNA): Refers to DNA sequences directly from the genome (including introns).

  • Accession Numbers: Every entry in NCBI has an accession (catalog) number (e.g., mRNA entries start with NM, protein entries with NP).

  • Hover-and-Pop-up Feature: Hovering the mouse over an accession number brings up a pop-up menu with key information (e.g., mRNA, translated to protein, protein length: 233 amino acids).

  • Expression Visualization: NCBI can display tissue-specific gene expression patterns. For TMEM179, it shows biased expression in the brain and increased expression in the cortex during human fetal development (e.g., from ten weeks, with an increase at sixteen weeks and a spike at eighteen weeks).

  • Related Information: Provides direct links to the relevant TMEM179 entries in other databases.
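The accession prefixes mentioned above lend themselves to a simple lookup. The sketch below is illustrative only: the function name is hypothetical, the accession strings are placeholders rather than real TMEM179 entries, and the NR_, XM_ and XP_ prefixes are RefSeq conventions added here that the lecture did not cover.

```python
# RefSeq-style prefixes. NM_ and NP_ come from the notes; the rest are
# additional RefSeq conventions included for context.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NP_": "protein",
    "NR_": "non-coding RNA",
    "XM_": "predicted mRNA",
    "XP_": "predicted protein",
}


def classify_accession(accession: str) -> str:
    """Return the molecule type implied by a RefSeq-style accession prefix."""
    prefix = accession.strip().upper()[:3]
    return REFSEQ_PREFIXES.get(prefix, "unknown")


# Placeholder accessions, not real database entries.
for acc in ["NM_000000.0", "NP_000000.0", "ABC123"]:
    print(acc, "->", classify_accession(acc))
```

A check like this is useful when a search returns many entries and only the mRNA (cDNA) records are wanted for primer design.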

Polymerase Chain Reaction (PCR) and Primer Design

Applications of PCR:
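  1. Diagnosis: Detecting the presence or absence of specific sequences (e.g., viral nucleic acid such as SARS-CoV-2, the RNA virus that causes COVID-19 and which requires a reverse-transcription step before PCR, or bacterial DNA).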

  2. Genotyping: Identifying specific genetic variations (genotypes) associated with traits (e.g., cilantro tasting like soap) or diseases. While next-generation sequencing is more modern, PCR can still be used. (Example given: 23andMe for genotyping, with discussion on GDPR and data ownership).

  3. Gene Expression Analysis: Quantifying the number of transcripts present (a more advanced PCR technique, like qPCR, covered in a later lecture).

Basic PCR Mechanism:
  • Taq Polymerase: A thermostable DNA polymerase, originally isolated from the hot-spring bacterium Thermus aquaticus. It can withstand the high temperatures required for DNA denaturation.

  • DNA Synthesis, Not Copying: PCR does not directly 'copy' a DNA strand. Instead, it synthesizes a complementary strand to a given template strand.

    • Example: For a 5'-ATAATGG-3' template, the polymerase synthesizes a 3'-TATTACC-5' complementary strand. Two original strands lead to two new complementary strands, effectively doubling the DNA.

  • Controlling the Reaction: The PCR reaction is controlled by:

    1. Primers: Short DNA sequences that bind to specific regions, defining the amplified area.

    2. Temperature Cycling: Controls the three main steps.

  • Basic PCR Steps:

    1. Denaturation: Heating the double-stranded DNA (e.g., to 95 °C) to separate it into two single strands, which serve as templates.

    2. Annealing: Cooling the reaction (e.g., to 55-65 °C) to allow primers (forward and reverse) to bind complementarily to their target sequences on the single-stranded DNA.

    3. Extension: Raising the temperature (e.g., to 72 °C) for Taq polymerase to synthesize new complementary DNA strands, starting from the bound primers and extending in the 5' to 3' direction.

  • Amplification: This cycle is repeated multiple times (e.g., 20-40 cycles), exponentially amplifying the specific DNA sequence between the two primers, known as the amplicon.

  • Polymerase Fidelity: Taq polymerase lacks proofreading activity and therefore makes occasional errors (lower fidelity). More expensive proofreading polymerases exist with higher fidelity (fewer errors).

  • Practical Notes (DKIT Specific): For the qPCR machine, optimal primer design aims for an annealing temperature between 55 °C and 60 °C for a quick, 2-hour run.
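The three-step cycle and its doubling effect can be summarised in a short sketch. The temperatures are the example values from the notes; the class and function names, and the `efficiency` parameter, are assumptions introduced for illustration. Real reactions fall short of perfect doubling and plateau as reagents run out, which this idealised model ignores.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    temp_c: float  # degrees Celsius


# One PCR cycle using the example temperatures from the notes.
CYCLE = [
    Step("denaturation", 95.0),
    Step("annealing", 60.0),
    Step("extension", 72.0),
]


def ideal_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Idealised amplicon count: each cycle multiplies copies by (1 + efficiency)."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


for step in CYCLE:
    print(f"{step.name}: {step.temp_c} °C")

print(f"{ideal_copies(1, 30):,.0f}")  # 1,073,741,824 copies after 30 perfect cycles
```

Even a single starting molecule reaches roughly a billion copies after 30 ideal cycles, which is why 20-40 cycles are sufficient in practice.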

Primer Design Principles and Rules:
  • Purpose: Primers are typically short DNA sequences (never RNA for this course) that