Lecture 25 - The Human Genome Project
Illumina HiSeq 2000
Highest output, fastest data rate, highest number of reads
The idea of a coordinated effort to sequence the human genome
first raised at a meeting at the University of California at Santa Cruz in 1985
BRACAnalysis® is a genetic test that confirms the presence of a BRCA1 or BRCA2 gene mutation
BRCA mutations are responsible for the majority of hereditary breast and ovarian cancers. People with a mutation in either the BRCA1 or BRCA2 gene have risks of up to 87% for developing breast cancer and up to 44% for developing ovarian cancer by age 70. Mutation carriers previously diagnosed with cancer also have a significantly increased risk of developing a second primary cancer. Genetic testing, specifically the BRACAnalysis test, identifies patients who have these mutations. Objections to patenting the genes: genes occur naturally in every human, so beyond the moral questions raised, patenting them would be an obstacle to biomedical research worldwide, and the discovery of their relevance to breast cancer was funded by the public. Myriad Genetics, the company behind the test, was selling it for a price many described as "outrageous": $4000, the price of whole genome sequencing (around 20,000 genes analyzed), when the test only looked at two genes. Many universities and hospitals were offering the test at a much lower cost until Myriad forced them to stop, which sharply increased the cost for patients.
Bioinformatics
Because of the massively parallel nature of next-gen sequencers, huge amounts of data are produced quickly, requiring terabytes of storage. Each run produces 1.5 TB of data.
Illumina/Solexa
Bought by Illumina in 2007 ($615 million). Also uses sequencing by synthesis.
Celera Genomics vs UCSC
Celera also promised to publish their findings, by releasing new data annually (the HGP released its new data daily), although, unlike the publicly funded project, they would not permit free redistribution or scientific use of the data. The publicly funded competitor UC Santa Cruz was compelled to publish the first draft of the human genome before Celera for this reason (Jim Kent). On July 7, 2000, the UCSC Genome Bioinformatics Group released a first working draft on the web. The scientific community downloaded one-half trillion bytes of information from the UCSC genome server in the first 24 hours of free and unrestricted access to the first ever assembled blueprint of our human species.
Views of public effort vs. Celera
Celera's view of the International Consortium (IC):
-Unfair competition: IC delivering the same goods but with state funding.
International Consortium's view of Celera:
-Unfair competition: Celera delivering the same goods but can use IC data, while IC cannot use Celera data.
ENCODE
ENCyclopedia Of DNA Elements
EST Projects
EST = Expressed Sequence Tag
-Short, single-pass reads from bits of mRNA
-In practice, random reads from cDNA libraries
-Libraries are polyA primed or random primed
-Sometimes libraries are tissue specific
Advantages of Sanger sequencing
-Each individual reaction is fairly cheap (~$1-$25).
-Each reaction sequences ~500 bp very, VERY accurately. This is perfect for small applications that require high accuracy.
-Each reaction requires a single pair of primers spanning that ~500 bp region. In practice this requires a lot of groundwork before any sequencing can be done.
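A rough back-of-the-envelope sketch of why these per-reaction numbers suit small projects better than whole genomes. It assumes a single pass over 3 billion bases with no overlap between reads, which understates what a real project needs; the function names are illustrative only.

```python
def reactions_needed(genome_bp, read_bp=500):
    """Number of ~500 bp reactions to cover genome_bp once (ceiling division)."""
    return -(-genome_bp // read_bp)


def cost_range(reactions, low=1.0, high=25.0):
    """Total cost bounds using the ~$1-$25 per-reaction range quoted above."""
    return reactions * low, reactions * high


reactions = reactions_needed(3_000_000_000)
low, high = cost_range(reactions)
print(f"{reactions:,} reactions, roughly ${low:,.0f} to ${high:,.0f} for one pass")
```

Even this idealized single pass needs millions of reactions, before counting the primer design each one requires.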
Cost and funding
Estimated to have cost $3 billion over a project planned to run 15 years, funded by the Department of Energy and the National Institutes of Health. The first draft was announced in 2000, with the more complete version released in 2003 (2 years ahead of schedule).
Sanger Method
Fred Sanger, 1918-2013
-Was originally a protein chemist
-Made his first mark in sequencing proteins
-Made his second mark in sequencing RNA
Sanger sequencing:
-Partial copies of DNA fragments are made with DNA polymerase
-This gives a collection of DNA fragments that terminate with A, C, G or T, using ddNTPs
-Separate the fragments by gel electrophoresis
-Read the DNA sequence
Maps
Genetic map
-determined from recombination frequencies
Physical map
-based on physical distances
-the physical location of a particular cloned sequence of DNA
-BAC shotgun sequencing
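A minimal sketch of how a genetic map distance follows from a recombination frequency. It uses the standard convention that 1% recombinant offspring corresponds to 1 centimorgan (cM), which only holds approximately and for closely linked loci; the counts in the example are made up.

```python
def map_distance_cm(recombinant, total):
    """Genetic distance in centimorgans from offspring counts.

    Assumes 1% recombination = 1 cM, a good approximation only for
    short distances where double crossovers are rare.
    """
    if total <= 0:
        raise ValueError("total offspring must be positive")
    if not 0 <= recombinant <= total:
        raise ValueError("recombinant count must be between 0 and total")
    return 100.0 * recombinant / total


# Hypothetical cross: 12 recombinant offspring out of 200 scored.
print(map_distance_cm(12, 200), "cM")
```

A physical map, by contrast, would report the separation of the same two loci in base pairs.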
What Goals Were Established for the Human Genome Project When it Began?
-Identify all of the genes in human DNA.
-Determine the sequence of the 3 billion chemical nucleotide bases that make up human DNA.
-Store this information in databases.
-Develop faster, more efficient sequencing technologies.
-Develop tools for data analysis.
-Address the ethical, legal, and social issues (ELSI) that arise from the project.
Celera Genomics
In 1998, Craig Venter founded Celera Genomics. The $300M Celera effort was intended to proceed at a faster pace and at a fraction of the cost of the roughly $3 billion publicly funded project. Celera used a technique called whole genome shotgun sequencing, employing pairwise end sequencing. Celera initially announced that it would seek patent protection on "only 200-300 genes", but later amended this to seeking "intellectual property protection" on "fully-characterized important structures" amounting to 100-300 targets. The firm eventually filed preliminary ("place-holder") patent applications on 6,500 whole or partial genes.
Celera Genomics - presidential orders
In March 2000, President Clinton announced that the genome sequence could not be patented and should be made freely available to all researchers. The statement sent Celera's stock plummeting and dragged down the biotechnology-heavy Nasdaq. The biotechnology sector lost about $50 billion in market capitalization in two days. But the public release of the data ensured its fair use and availability to all mankind. The competition proved to be very good for the project, spurring the public groups to modify their strategy in order to accelerate progress. UC Santa Cruz and Celera initially agreed to pool their data, but the agreement fell apart when Celera refused to deposit its data in the unrestricted public database GenBank. Celera had incorporated the public data into its genome, but forbade the public effort from using Celera data.
BAC Sequencing
It was far too expensive at that time to think of sequencing patients' whole genomes. The genome was broken into smaller pieces, approximately 150,000 base pairs in length. These pieces were then ligated into a type of vector known as "bacterial artificial chromosomes", or BACs. The vectors containing the genes can be inserted into bacteria, where they are copied by the bacterial DNA replication machinery. Each of these pieces was then sequenced separately as a small "shotgun" project and then assembled.
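The "shotgun, then assemble" step can be illustrated with a toy greedy overlap assembler. This is only a sketch under strong assumptions (error-free reads, one strand, no repeats); it is not the software used by either sequencing effort, and the function names and example reads are invented.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads fully contained in another read.
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing overlaps any more; the remaining contigs stay separate
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for n, r in enumerate(reads) if n not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping reads from one small, made-up clone.
print(greedy_assemble(["GTTAGC", "ATGCGTAC", "GTACGTTA"]))
```

Sequencing each BAC separately keeps every assembly problem small, which is why a simple overlap idea like this is more plausible per clone than across a whole genome full of repeats.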
Next Generation Sequencers
Next (or 2nd) generation sequencers came onto the scene in the early 2000s. General characteristics include:
-Amplification of genetic material by PCR
-Ligation of amplified material to a solid surface
-Sequence of the target genetic material is determined using sequencing by synthesis (using labelled nucleotides or pyrosequencing for detection) or sequencing by ligation
-Sequencing is done in a massively parallel fashion and sequence information is captured by a computer
23andMe
On November 22, 2013 the FDA ordered 23andMe to stop marketing its Saliva Collection Kit and Personal Genome Service (PGS), as 23andMe had not demonstrated that they have "analytically or clinically validated the PGS for its intended uses" and the FDA was concerned about the public health consequences of inaccurate results from the PGS device.
Roche / 454 : GS FLX - Parallel Sequencing
-Owned by Roche ($115 million)
-Shipping machines since ~2006
-Many publications (Neanderthal, James Watson re-sequencing)
-Sequencing by synthesis
ESTs pros and cons
Pros:
-Represent the part of the genome (most) people care about
-Do not require a sequenced genome
-Find genes
-Find SNPs
-Find splice isoforms
Cons:
-Libraries are highly biased
-Can be hard to know when two ESTs are derived from the same gene
-(Generally) high error rates
Public effort and Celera strategies
Public - BAC shotgun sequencing
Celera - whole genome shotgun sequencing, employing pairwise end sequencing
Single Clone Molecule Array
Random array of clusters: ~1,000 molecules per ~1 µm cluster, ~40M clusters per flowcell
Solexa Chemistry
Sequencing by synthesis:
-Add four-color reversible terminators
-Image fluorophore
-Remove 3' block and fluorophore
-Add next set of bases
Takes 48-72 h/run plus 8 h analysis
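The cycle above can be sketched as a loop that extends a read by exactly one base per round. This is an idealized model of a single cluster (no imaging noise, no phasing errors), and the function name and the convention that `template` lists bases in the order the polymerase meets them are assumptions made for illustration.

```python
# Watson-Crick pairing decides which labelled terminator is incorporated.
PAIR = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_synthesis(template, cycles):
    """Return the read produced after `cycles` rounds on one cluster."""
    read = []
    for position in range(min(cycles, len(template))):
        # Add four-color reversible terminators: only the complementary
        # base incorporates, and its 3' block stops further extension.
        incorporated = PAIR[template[position]]
        # Image fluorophore: the color observed identifies the base.
        read.append(incorporated)
        # Remove 3' block and fluorophore: modeled here by simply moving
        # on to the next template position in the next cycle.
    return "".join(read)


print(sequence_by_synthesis("TACGGA", cycles=4))
```

Because each cycle adds one base to every cluster at once, read length is set by the number of cycles, which is why a run takes days.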
Two Different Groups Worked to Obtain the DNA Sequence of the Human Genome
The HGP is a multinational consortium established by government research agencies and funded publicly. Celera Genomics is a private company whose former CEO, J. Craig Venter, ran an independent sequencing project. Differences arose regarding who should receive the credit for this scientific milestone. On June 26, 2000, the HGP and Celera Genomics held a joint press conference to announce that TOGETHER they had completed ~97% of the human genome.
Your Genome is Published
The International Human Genome Sequencing Consortium published their results in Nature 409 (6822): 860-921, 2001, "Initial Sequencing and Analysis of the Human Genome".
Celera Genomics published their results in Science 291 (5507): 1304-1351, 2001, "The Sequence of the Human Genome".
Capillary Sequencing
Trace files (dye signals) are analyzed and bases called to create chromatograms. Chromatograms from opposite strands are reconciled with software to create double-stranded sequence data.
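A minimal sketch of the two steps described here: calling a base at each peak from the four dye signals, then reconciling reads from opposite strands. The trace layout (one dict of intensities per peak) and the rule of writing N where the strands disagree are simplifying assumptions; real base callers also model peak shape and report quality scores.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def call_bases(trace):
    """Pick the strongest dye signal at each peak position."""
    return "".join(max("ACGT", key=lambda base: peak[base]) for peak in trace)


def reconcile(forward, reverse):
    """Merge forward and reverse strand reads; disagreements become N."""
    flipped = reverse_complement(reverse)
    if len(flipped) != len(forward):
        raise ValueError("this sketch expects reads covering the same region")
    return "".join(f if f == r else "N" for f, r in zip(forward, flipped))


# Made-up dye intensities for three peaks.
trace = [
    {"A": 900, "C": 20, "G": 15, "T": 30},
    {"A": 10, "C": 850, "G": 40, "T": 25},
    {"A": 5, "C": 30, "G": 700, "T": 60},
]
print(call_bases(trace))
print(reconcile("ACGT", "ACCT"))
```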
What are some major concerns about sequencing?
Who will fund it? What impact will it have on biology? Whose DNA should be sequenced?
Sanger Method - Greater detail
-In vitro DNA synthesis using "terminators": dideoxynucleotides that do not permit chain elongation after their integration
-DNA synthesis using deoxy- and dideoxynucleotides results in termination of synthesis at specific nucleotides
-Requires a primer, DNA polymerase, a template, a mixture of nucleotides, and a detection system
-Incorporation of dideoxynucleotides into the growing strand terminates synthesis
-Synthesized strand sizes are determined for each dideoxynucleotide by using a gel
*It is more efficient to run everything in one lane
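The logic of reading a sequence from terminated fragments can be sketched as follows. This is an idealized model that assumes a terminated fragment exists for every position of the newly synthesized strand; the function names are illustrative.

```python
def chain_terminated_fragments(synthesized):
    """Group fragment lengths by the dideoxynucleotide that ended them.

    Each ddNTP reaction yields fragments ending wherever that base
    occurs in the newly synthesized strand.
    """
    lanes = {base: [] for base in "ACGT"}
    for length in range(1, len(synthesized) + 1):
        terminating_base = synthesized[length - 1]
        lanes[terminating_base].append(length)
    return lanes


def read_gel(lanes):
    """Read the sequence by ordering fragments from shortest to longest."""
    base_at_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(base_at_length[n] for n in sorted(base_at_length))


lanes = chain_terminated_fragments("GATTACA")
print(lanes)
print(read_gel(lanes))
```

Sorting by length is what the gel does physically; with a distinct label per ddNTP, the four groups can share a single lane.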