Chapter 18: Genomics, Bioinformatics, and Proteomics
Pseudogenes
Former genes that have accumulated mutations and are nonfunctional. Nonfunctional duplications of protein coding regions.
Human Microbiome Project (HMP)
$170 million effort to sequence genomes of bacteria, viruses, yeasts, and other microorganisms that live inside humans. Characterize microbes living on human bodies when healthy. Determine how the microbiome differs in various diseases. UCLA study: DNA sequences from 101 college students analyzed; 49 had acne and 52 did not, and 1,000 strains of Propionibacterium acnes were isolated. WGS and bioinformatics revealed 10 related strain types. Strain types help dermatologists develop new drugs.
Applications of Bioinformatics
- Comparing DNA sequences - Identifying genes in genomic sequence - Finding gene-regulatory regions (promoters and enhancers) - Identifying telomeric sequences - Predicting amino acid sequences - Deducing evolutionary relationships between genes
Two Surprises of HGP
- Less than 2% of the genome encodes proteins - There are only about 20,000 protein-coding genes
Major Features of the Human Genome Project
- The human genome contains ~3.1 billion nucleotides, but protein-coding sequences make up only about 2%.
- The genome sequence is ~99.9% similar in individuals of all nationalities. SNPs and copy number variations (CNVs) account for genome diversity from person to person.
- The genome is dynamic. At least 50% of the genome is derived from transposable elements, such as LINE and Alu sequences, and other repetitive DNA sequences.
- The human genome contains ~20,000 protein-coding genes, far fewer than the originally predicted number of 80,000-100,000.
- The average size of a human gene is ~25 kb, including gene regulatory regions, introns, and exons. On average, mRNAs produced by human genes are 3,000 nt long.
- Many human genes produce more than one protein through alternative splicing, thus enabling human cells to produce a much larger number of proteins (perhaps as many as 200,000) from only ~20,000 genes.
- More than 50% of human genes show a high degree of sequence similarity to genes in other organisms; however, more than 40% of the genes identified have no known molecular function.
- Genes are not uniformly distributed on the 24 human chromosomes. Gene-rich clusters are separated by gene-poor "deserts" that account for 20 percent of the genome. These deserts correlate with G bands seen in stained chromosomes. Chromosome 19 has the highest gene density, and chromosome 13 and the Y chromosome have the lowest gene densities.
- Chromosome 1 contains the largest number of genes, and the Y chromosome contains the smallest number.
- Human genes are larger and contain more and larger introns than genes of invertebrates, such as Drosophila. The largest known human gene encodes dystrophin, a muscle protein. This gene, associated in mutant form with muscular dystrophy (Chapter 14), is 2.5 megabases in length (Chapter 12), larger than many bacterial chromosomes. Most of this gene is composed of introns.
- The number of introns in human genes ranges from 0 (in histone genes) to 234 (in the gene for titin, which encodes a muscle protein).
Most genetic differences result from...
...Single-nucleotide polymorphisms (SNPs) and copy number variations (CNVs).
Haemophilus influenzae
1815 genes
Neanderthal (Homo neanderthalensis)
1997: mitochondrial DNA sequenced from a fossil. 2006: nuclear DNA sequenced from a bone sample. 2010: rough draft of the Neanderthal genome encompassed 4 billion bp. Comparative genomic analysis identified where humans have undergone rapid evolution since diverging from Neanderthals. 99% identical to humans. 78 new protein-coding sequences since divergence. Genomic studies suggest interbreeding took place between Neanderthals and modern humans an estimated 45,000 to 80,000 years ago. Genome of non-African H. sapiens contains approximately 1-4% of sequences inherited from Neanderthals.
Human Genome Project Write
2016: quest to create a synthetic human genome. Proposed to synthesize an entire human genome. Ethical implications questioned. 10-year timeline and $100 million committed to the project. In 2018 the project was scaled back to focus on recoding the genome to create microbe-resistant cells.
DNA Microarray Analysis
A biochip used to measure changes in expression or mRNA levels, to detect single nucleotide polymorphisms (SNPs), or to genotype.
Open Reading Frames (ORF)
A continuous stretch of DNA containing codons that specify an amino acid sequence. Protein-coding sequences contain ORFs because they are sequences of nucleotides that are translated into the amino acid sequence of a protein. Typically begins with the initiation sequence ATG. Ends with a termination sequence (TAA, TAG, or TGA), corresponding to the stop codons UAA, UAG, and UGA. In eukaryotic ORFs, exons and introns can be identified.
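As a rough illustration of the ORF definition above, this Python sketch scans the three forward reading frames of a DNA string for stretches that start with ATG and end at the first in-frame TAA, TAG, or TGA, then translates them with the standard genetic code. The names `find_orfs` and `translate`, the `min_codons` parameter, and the example sequence are assumptions made for illustration. Real gene-finding software also scans the reverse strand and, in eukaryotes, has to deal with introns.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, '*' marks a stop codon.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for ATG...stop stretches on the forward strand.

    min_codons is a hypothetical filter: the minimum number of codons
    (counting ATG, not counting the stop codon) for an ORF to be reported.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return orfs


def translate(orf):
    """Translate an ORF (A/C/G/T only) into one-letter amino acid codes, dropping the stop."""
    protein = "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, len(orf) - 2, 3))
    return protein.rstrip("*")


example = "CCATGGCTTGGAAATAGGC"
for start, end, seq in find_orfs(example):
    print(start, end, seq, translate(seq))  # prints: 2 17 ATGGCTTGGAAATAG MAWK
```

Positions are 0-based, with `end` exclusive, so `example[start:end]` reproduces the ORF.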
GenBank
A database of previously sequenced and identified genes available through the National Center for Biotechnology Information (NCBI). Largest publicly available database of DNA sequences. Each sequence deposited in GenBank receives an accession number.
RNA Sequencing
A method used to determine the transcribed regions of a genome within a specific cell population, tissue sample, or organism. Also called whole-transcriptome shotgun sequencing. Allows for quantitative analysis of all RNAs expressed in a tissue. Provides actual sequence data. Can be carried out in situ (inside the cell). Is increasingly replacing microarrays for gene-expression analysis.
Mycoplasma genitalium
A parasitic pathogen. Genome of 580 kb, 525 genes. One of the smallest bacterial genomes.
Celera Genomics
A privately funded human genome project led by J. Craig Venter. Its strategy was to use whole-genome shotgun sequencing with computer-automated, high-throughput DNA sequencers.
ELSI Program
A program established by the National Human Genome Research Institute in 1990 as part of the Human Genome Project to sponsor research on the ethical, legal, and social implications of genomic research and its impact on individuals and social institutions. Program to ensure personal genetic information would be safeguarded.
Basic Local Alignment Search Tool (BLAST)
A software application used to compare a segment of genomic DNA to sequences throughout major databases. Calculates: - Identity value - E-value
Isoelectric Focusing
A specialized method of separating proteins by their isoelectric point using electrophoresis; the gel is modified to possess a pH gradient. Causes proteins to migrate based on charge.
Mass Spectrometry (MS)
A type of spectrometry that determines the mass-to-charge ratio of ions formed from the molecule being analyzed. The ratio can be used to identify structures and determine sequences of proteins. Instrumental in the development of proteomics. Analyzes ionized samples in gaseous form. Measures the mass-to-charge (m/z) ratio of the different ions in a sample. One ionization method is matrix-assisted laser desorption ionization (MALDI).
Human Proteome Map (HPM)
Aimed to catalog the human proteome via proteomic analysis. Revealed: ~20,000 protein-coding genes in the human genome, which can produce 290,000 different proteins through co-translational or post-translational modifications such as methylation, acetylation, and phosphorylation.
Human Genome Project (HGP)
An international collaborative effort to map and sequence the DNA of the entire human genome. Coordinated effort to sequence and identify all genes of human genome.
Synthetic Biology
Applies engineering design principles to biological systems. JCVI's goal is to create microorganisms that can synthesize biofuels. Other applications in creating synthetic microbes: - Bioremediation - Biopharmaceutical products - Synthetic chemicals and fuels - Semisynthetic crops
Omics
Areas of biological research having an "omics" connection are continually developing. They include: proteomics, metabolomics, glycomics, toxicogenomics, metagenomics, pharmacogenomics, and transcriptomics.
How are functional categories assigned for human genes?
Based on: - Functions determined previously - Comparison to known genes and predicted protein sequences from other species - Predictions based on annotation and analysis of protein functional domains and motifs
Origin of HGP
Began in 1990 under the direction of James Watson. Dr. Francis Collins later led the project, coordinated by the Department of Energy and the National Center for Human Genome Research (NCHGR). Budget of $3 billion and a 15-year plan in which all human genes were to be sequenced and mapped.
Accessing HGP
Databases available on the internet display maps for all human chromosomes. This contributed to: - Identification of disease genes - Development of new treatment strategies - Extensive maps developed for genes implicated in human disease conditions
Contigs
Continuous fragments. Overlapping fragments from adjoining segments that collectively form one continuous DNA molecule within the chromosome.
Hallmark Feature
Can be recognized during annotation. Gene sequences in bacteria or eukaryotes can be identified using bioinformatics software. Gene-regulatory sequences found upstream of genes are marked by identifiable sequences such as promoters, enhancers, and silencers.
Number of Genes Essential for Life
Comparative genetics estimated that 256 genes might represent the minimum set of genes essential for life. Transposon-based methods were used to confirm the minimum number of genes necessary for life by selectively mutating each gene. Mutations that produce a lethal phenotype indicate essential genes; mutations in non-essential genes would not be lethal. 2016: JCVI announced 473 genes as the minimal number required for life in a bacterial genome. Future outlook: gene-editing approaches like CRISPR-Cas are expected to make genome altering easier. CRISPR-Cas has already been used to knock out bacterial genes, making it possible to screen for phenotypic changes.
Whole-Genome Sequencing (WGS)
Determines the complete DNA sequence of an organism's genome at a single time; used to generate accurate reference genomes for microbial identification and other comparative genomic studies. Also called shotgun sequencing/cloning. Most widely used strategy for sequencing and assembling an entire genome. 1. Genomic DNA is cut with restriction enzymes to create a series of overlapping fragments. 2. Overlapping fragments are aligned using computer programs to assemble the entire chromosome. 3. Fragments are aligned based on identical DNA sequences, creating a contig. Method developed by J. Craig Venter. The first genome sequenced this way was that of the bacterium Haemophilus influenzae. WGS methods are now predominant for sequencing genomes. Computer-automated sequencers made genomics possible and were essential for the Human Genome Project.
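The alignment steps above (overlapping fragments joined through identical sequence to form a contig) can be sketched with a toy greedy assembler. This is a minimal illustration under simplifying assumptions, not how production assemblers work: reads are error-free, all come from the same strand, and the function names `overlap` and `greedy_assemble` and the `min_len` threshold are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    contigs = list(fragments)
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps: the fragments stay as separate contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["ATGCGTAC", "GTACCTTA", "CTTAGGCA"]
print(greedy_assemble(reads))  # ['ATGCGTACCTTAGGCA']
```

The three example reads overlap by four bases each (GTAC and CTTA), so they collapse into a single 16-base contig. Fragments with no shared sequence are returned unmerged.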
Comparative Genomics
Computer-aided comparison of DNA sequences between different organisms to reveal genes with related functions. Gene discovery and development of model organisms to study human diseases. Incorporates study of gene and genomic evolution. Explores relationship between organisms and environment. As of 2018, 23,000 whole genomes had been sequenced. Model organisms and viruses: - S. cerevisiae - E. coli - Caenorhabditis elegans - Arabidopsis thaliana - M. musculus - Danio rerio - Drosophila
Encyclopedia of DNA Elements (ENCODE)
Created with the aim of using both experimental approaches and bioinformatics to identify and analyze functional elements that regulate expression of human genes. Aimed to identify every functional element in the human genome. Functional elements: - Transcription start sites - Promoters - Enhancers
Identity Value
The number of identical matches between aligned sequences divided by the total number of bases aligned.
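The ratio described above is simple enough to compute by hand. The sketch below assumes the two sequences are already aligned to the same length, with "-" marking gaps, and uses the full alignment length as the denominator. Tools differ on how gapped columns are counted, so treat that choice and the name `identity_value` as assumptions.

```python
def identity_value(aligned_a, aligned_b):
    """Fraction of alignment columns where both sequences carry the same base."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("sequences must be aligned to the same non-zero length")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return matches / len(aligned_a)


# Nine of the ten aligned positions match (position 5 differs: T vs A).
print(identity_value("ATGCTTAGCA", "ATGCATAGCA"))  # 0.9
```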
Protein Domains
Discrete structural units in a protein, often associated with a particular function(s). Can be used for predicting a protein function. Includes ion channels, membrane-spanning regions, secretion, and export signals.
Orthologs
Genes from different species thought to have descended from a common ancestor. Homologous genes separated by a speciation event.
Whole-Exome Sequencing (WES)
Genomic technique for sequencing all of the protein-coding regions (exons) in the genome. Sequences only the 180,000 exons in a person's genome. Reveals mutations by focusing on protein-coding segments of the genome. More disease-related genetic variations occur in the exome. Limitation: fails to identify gene-regulatory regions that influence gene expression.
Transcriptome Analysis (Transcriptomics)
Global analysis of gene expression. Studies expression of genes by genome qualitatively and quantitatively. Qualitatively: identifies which genes are expressed and which are not. Quantitatively: measure varying levels of expression of different genes.
Synthetic Genome Synthesis
Homologous recombination technique used to assemble cassettes, which were combined to completely span the 1.08-Mb M. mycoides genome. Genome transplantation: - Synthetic genome of M. mycoides, named JCVI-syn1.0, transplanted into M. capricolum recipient cells - Transplantation resulted in the JCVI-syn1.0 genotype and the new phenotype of M. mycoides - Transplantation was verified by expression of the lacZ gene.
Sea Urchin Genome
In 2006, researchers completed the 814 million bp genome of the sea urchin Strongylocentrotus purpuratus. It is estimated to have 23,500 genes. Many genes have important functions in humans. 25-30% of the genome consists of pseudogenes: nonfunctional duplications of protein-coding regions. 1,000 genes for sensing light and odor. Humans and sea urchins share ~7,000 orthologs.
Mycoplasma mycoides
In 2010, JCVI scientists chemically synthesized over 1,000 1,080-bp segments (cassettes) to cover the entire genome.
E-Value
Indicates how likely it is that a match between the two sequences being compared arose at random. Based on the number of matching sequences in the database expected by chance; the lower the E-value, the more significant the match.
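For a feel of how "matches expected by chance" scales, here is a small sketch based on the commonly quoted Karlin-Altschul approximation E = m × n × 2^(−S'), where m is the query length, n is the total database length, and S' is the alignment's bit score. The formula is not given in these notes and is brought in as an assumption. BLAST itself applies corrections (effective lengths, scoring-matrix parameters) that are ignored here, and `expected_hits` is a hypothetical name.

```python
def expected_hits(query_len, db_len, bit_score):
    """Approximate E-value: chance hits expected at or above the given bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)


# A 100-base query against a 1,000-base database with a bit score of 10.
print(expected_hits(100, 1000, 10))  # 97.65625

# Raising the bit score by 1 halves the expected number of chance hits.
print(expected_hits(100, 1000, 11) / expected_hits(100, 1000, 10))  # 0.5
```

The sketch shows two things: a larger database raises the E-value for the same score, and each extra bit of score halves it.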
Similarity Searches
Used for inferring gene function. A genome sequence that is statistically similar to a gene with a known function likely encodes a protein with a similar function.
Accession Number
Unique identifier assigned to each sequence record when it is deposited in a database such as GenBank. Used to access and retrieve the sequence for analysis.
Algorithm-Based Software Programs
One of the earliest bioinformatics applications developed for genomic purposes. Create DNA-sequence alignments: - Similar sequences are lined up for comparison - Alignment identifies overlapping sequences - Allows scientists to reconstruct their order in the chromosome
Whole-Genome Sequencing Mechanism
One strategy (shown here) involves using restriction enzymes to digest genomic DNA into contigs, which are then sequenced and aligned using bioinformatics to identify overlapping fragments based on sequence identity. Notice that EcoRI digestion produces two fragments (contigs 1 and 2-4), whereas digestion with BamHI produces three fragments (contigs 1-2, 3, and 4).
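The digestion step described above can be mimicked in a few lines. EcoRI recognizes GAATTC and BamHI recognizes GGATCC, and each cuts after the first G of its site. The sketch below treats the DNA as a linear, single-stranded string and ignores sticky ends. The `ENZYMES` table layout, the `digest` function, and the 26-base example sequence are assumptions chosen so that, as in the figure description, EcoRI yields two fragments and BamHI yields three.

```python
import re

# Recognition site and cut offset (the cut falls after this many bases of the site).
ENZYMES = {"EcoRI": ("GAATTC", 1), "BamHI": ("GGATCC", 1)}


def digest(dna, enzyme):
    """Return the fragments produced by cutting a linear DNA string at every site."""
    site, offset = ENZYMES[enzyme]
    cuts = [m.start() + offset for m in re.finditer(site, dna)]
    bounds = [0] + cuts + [len(dna)]
    return [dna[a:b] for a, b in zip(bounds, bounds[1:])]


dna = "AAGAATTCCCGGATCCTTGGATCCAA"
print(digest(dna, "EcoRI"))  # ['AAG', 'AATTCCCGGATCCTTGGATCCAA']
print(digest(dna, "BamHI"))  # ['AAGAATTCCCG', 'GATCCTTG', 'GATCCAA']
```

Because the two enzymes cut at different positions, their fragment sets overlap one another, which is what lets alignment software place the fragments in order.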
HMP Venn Diagram
Represents overlapping data in metagenomic datasets. Human gut microbe genes from people with liver cirrhosis, type 2 diabetes, and irritable bowel syndrome. Each disease shows a unique profile of microbial genes. Out of 580,000 microbial genes, 403 were shared common markers for all three diseases.
Genome 10K Plan
Project that is designed to create a genomic "zoo" which contains the sequences of 10,000 vertebrate species. Genome scientists and museum curators have proposed sequencing 10,000 vertebrate genomes. Will provide insight into genome evolution and speciation.
Matrix-Assisted Laser Desorption Ionization (MALDI)
Proteomic analysis of tissue samples treated under different conditions.
Annotation
Relies heavily on bioinformatics. Process of identifying gene regulatory sequences, sequences of interest in genome, and enables scientists to map out genes.
Sodium Dodecyl Sulfate Polyacrylamide Gel Electrophoresis (SDS-PAGE)
Second dimension of 2-D gel electrophoresis (after isoelectric focusing). Proteins are separated by molecular mass. An electric current is applied to the gel.
Protein Motifs
Short, recurring structural patterns, typically combinations of secondary-structure elements, that help predict a protein's function. Examples: helix-turn-helix, leucine zipper, and zinc-finger motifs.
Copy Number Variations (CNVs)
Segments of DNA that are duplicated or deleted.
Two-Dimensional Gel Electrophoresis (2DGE)
Separates proteins based on isoelectric point and then size. Technique for separating hundreds to thousands of proteins with high resolution: 1. Proteins are isolated from a cell or tissue. 2. Loaded on a polyacrylamide tube gel. 3. Separated by isoelectric focusing, which causes proteins to migrate based on charge. 4. Separated in the second dimension by SDS-PAGE, based on molecular mass.
Alternative Splicing
Splicing of introns in a pre-mRNA that occurs in different ways, leading to different mRNAs that code for different proteins or protein isoforms. Increases the diversity of proteins. The HGP revealed the number of genes is lower than the number of predicted proteins. Many genes code for multiple proteins. Over 50% of genes undergo alternative splicing to produce multiple transcripts and proteins.
Stone-Age Genomics
Study of ancient DNA. Uses small amounts of ancient DNA from bone and other tissue. Data used to study evolutionary relatedness of various extinct and present-day species such as Egyptian mummy, mosses, platypuses, and mammoths.
Proteomics
Study of the structure and function of proteins in the human body. Identification, characterization, and quantitative analysis of all proteins in a proteome. Allows for comparison of proteins in normal and diseased tissues. Can lead to identification of proteins as biomarkers for disease conditions. Provides information on: - Protein structure and function - Post-translational modifications - Protein-protein, protein-nucleic acid, protein-metabolite interactions - Cellular localization of proteins - Protein stability and aspects of translational and post-translational level of gene-expression regulation - Shared domains and evolutionary history of proteins
Single Nucleotide Polymorphisms (SNPs)
Variations in the DNA sequence due to the change of a single nitrogen base. Single-base changes in genome. Variations associated with disease conditions.
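To make "single-base changes" concrete, the following sketch compares a sample sequence with a reference of the same length and reports every position where one base differs. It assumes the sequences are already aligned without gaps, which real variant callers do not require. The name `find_snps` and the example sequences are hypothetical.

```python
def find_snps(reference, sample):
    """Return (position, reference_base, sample_base) for each single-base difference.

    Positions are 1-based. Both sequences must have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(reference, sample), start=1)
        if ref != alt
    ]


print(find_snps("ATGCCGTA", "ATGTCGTA"))  # [(4, 'C', 'T')]
```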
Synthetic or Artificial Genome
What is the minimum number of genes (core genes) to support life? Comparative genetics estimated 256 genes might represent the minimum set of genes essential for life.
Proteomes
The complete complement of proteins that a cell or organism can make. Complete set of proteins encoded by a genome.
Pangenome
The core genome of a bacterial species plus all genes found in some strains but not others. Attempts to visualize all genomic segments and gene variations found in a species. Notice that there are variations in individual genomes not represented in the reference genome, but these variations are included in the pangenome.
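The relationship between core genome and pangenome maps directly onto set operations: the pangenome is the union of every strain's genes, the core genome is their intersection, and the remainder is the set of genes found in some strains but not others. The strain and gene names below are invented placeholders, and `pangenome` is a hypothetical helper.

```python
def pangenome(strains):
    """Return (pan, core, accessory) gene sets from a {strain: genes} mapping."""
    if not strains:
        raise ValueError("at least one strain is required")
    gene_sets = [set(genes) for genes in strains.values()]
    pan = set().union(*gene_sets)            # every gene seen in any strain
    core = set.intersection(*gene_sets)      # genes present in all strains
    return pan, core, pan - core             # accessory = present in only some strains


strains = {
    "strain_1": ["geneA", "geneB", "geneC"],
    "strain_2": ["geneA", "geneB", "geneD"],
    "strain_3": ["geneA", "geneC", "geneE"],
}
pan, core, accessory = pangenome(strains)
print(sorted(pan))        # ['geneA', 'geneB', 'geneC', 'geneD', 'geneE']
print(sorted(core))       # ['geneA']
print(sorted(accessory))  # ['geneB', 'geneC', 'geneD', 'geneE']
```

If `strain_1` were used as the reference genome, `geneD` and `geneE` would be missing from it but still captured in the pangenome.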
Personal Genome Projects
The cost of sequencing is expected to fall, leading to WGS for individual people. It costs less than $1,000 to sequence a genome, but analyzing it genetically will be expensive. By 2018, an estimated 400,000 people had had their genomes sequenced.
Protein Domains and Motifs
When a gene sequence is used to predict a polypeptide sequence, the polypeptide can be analyzed for specific protein domains and motifs. Protein domains: ion channels, membrane-spanning regions, secretion, and export signals. Motifs: helix-turn-helix, leucine zipper, or zinc-finger motifs.
Somatic Mosaicism
When the somatic cells of the body are of more than one genotype, typically due to mitotic DNA replication errors at first or later cleavages. Cells in an individual person do not all contain identical genomes. Individuals are made up of a population of cells, each cell with its own unique personal genome. Can result from errors in DNA replication, creating aneuploidy, CNVs, and SNPs. Somatic cell variations are passed to daughter cells during mitosis, but NOT to offspring.
Genomic Analysis
The study of genomes. One of the most rapidly advancing areas of modern genetics. Provides unprecedented information about genomes of different organisms.
Nutrigenomics
The study of how nutrition interacts with specific genes to influence a person's health. Considers genetics and diet. Focuses on understanding interactions between nutrition and genes. Nutrigenomic tests analyze a person's genome and provide a customized report on nutritional metabolism based on genes associated with medical conditions.
Functional Genomics
The study of the relationship between genes and their function. Interprets a DNA sequence and establishes gene functions. Based on RNAs or proteins they encode. Identifies gene-regulatory elements in genome. Involves experimental approaches to confirm or refute computational predictions.
Bioinformatics
The use of computers, software, and mathematical models to process and integrate biological information from large data sets. Organize, share, and analyze data related to gene structure, sequence and expression, and protein structure and function.
Metagenomics (Environmental Genomics)
The use of whole-genome shotgun (WGS) approaches to sequence genomes from entire communities of microbes in environmental samples of water, air, and soil. Oceans, glaciers, deserts, and so on are being sampled. Involves isolating DNA from an environmental sample without culturing microbes and viruses (microbiota of New York City subways studied): - Most were non-disease-causing bacteria - Occasionally pathogens (Bacillus anthracis) were identified - Half of the DNA sequenced did not match any organism in genome database
Microarrays
Thousands of nucleic acid sequences are arranged in grids on glass or silicon. DNA or RNA probes are hybridized to the chip, and a scanner detects the relative amounts of complementary binding. Used to profile gene expression levels or to detect single nucleotide polymorphisms (SNPs). Prepared by "spotting" ssDNA molecules onto a glass microscope slide using a high-speed robotic arm (arrayer). Arrayers are fitted with tiny pins. Pins are immersed in a solution of ssDNA molecules. The arrayer fixes the DNA onto the slide, which is then scanned by computer.
Homologous Genes
Two or more genes derived from the same ancestral gene. Genes that are evolutionarily related. Similarity searches are able to identify homologous genes.
BLAST Searches
Used to screen databases and compare a sequence to a known sequence.
