Genomics 468 Test 1

If I sequenced 53,000 plasmids of approximately 800 bp, what is the probability that I have sequenced any given sequence of a 1000 Mb genome (see slides 8-10)?

0.0415, or a 4.15% chance that you sequenced it.
• N = ln(1-P) / ln(1-a/b)
  - N is the number of clones (53,000)
  - P is the probability of the library containing the desired piece of DNA (the unknown)
  - a is the average size of the DNA insert (800 bp)
  - b is the size of the genome (1000 Mb, or 1,000,000,000 bases)
• Rearranged for P: P = 1 - (1 - a/b)^N
• P = 1 - (1 - 800/1,000,000,000)^53,000
• P = 1 - 0.958486 = 0.0415, or a 4.15% chance of the piece of DNA being in the library
Or use: probability of missing a spot = e^-coverage
• Coverage = (53,000 x 800) / 1,000,000,000 = 0.0424
• Probability of missing a spot = e^-0.0424 = 0.9585, or 95.85%
• 100% - 95.85% = 4.15% chance of the piece being there
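As a rough check of the arithmetic above, the two approaches can be computed side by side. This is only a sketch; the function names `library_hit_probability` and `poisson_hit_probability` are made up for illustration and are not from the course material.

```python
import math


def library_hit_probability(n_clones, insert_bp, genome_bp):
    """P = 1 - (1 - a/b)^N: chance a given sequence is in the library."""
    return 1.0 - (1.0 - insert_bp / genome_bp) ** n_clones


def poisson_hit_probability(n_clones, insert_bp, genome_bp):
    """1 - e^-coverage, where coverage = N * a / b."""
    coverage = n_clones * insert_bp / genome_bp
    return 1.0 - math.exp(-coverage)


# Values from the question: 53,000 clones, 800 bp inserts, 1000 Mb genome.
p_exact = library_hit_probability(53_000, 800, 1_000_000_000)
p_poisson = poisson_hit_probability(53_000, 800, 1_000_000_000)
print(f"exact: {p_exact:.4f}  e^-coverage: {p_poisson:.4f}")
```

Both routes should agree to several decimal places here because a/b is tiny, which is why the e^-coverage shortcut is usually good enough.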

If a genome is sequenced at 30x coverage, approximately what percentage of the genome remains unsequenced?

Probability of missing a spot = e^-coverage = e^-30 ≈ 9.36x10^-14 (about 9.36x10^-12 percent), so essentially zero percent of the genome is left unsequenced.
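To see how quickly the unsequenced fraction shrinks as coverage grows, here is a small sketch of the e^-coverage rule. The helper name `fraction_unsequenced` is hypothetical, and the model assumes reads land uniformly at random, which real libraries only approximate.

```python
import math


def fraction_unsequenced(coverage):
    """Expected fraction of the genome hit by zero reads: e^-coverage."""
    return math.exp(-coverage)


for c in (1, 5, 10, 30):
    print(f"{c:>2}x coverage -> fraction missed = {fraction_unsequenced(c):.3e}")
```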

Describe three differences between 454, Ion Torrent, Illumina, and Sanger sequencing.

454: 3-5 µg starting material; bead-based sequencing; luciferase-emitted light
Ion Torrent: 100 ng-1 µg starting material; pH- and voltage-based sequencing; no light (voltage)
Illumina: 1-5 µg starting material; flow-cell-based sequencing; fluorescent reversible terminating bases
Sanger: 1-40 ng starting material; ddNTP-based sequencing; fluorescent terminating bases

PCR duplication

A bias originating from PCR. During PCR some fragments may be preferentially amplified and thus more present within your library. These duplicate fragments can waste sequencing resources by having the same fragment sequenced multiple times.

Chimeric sequence

A chimeric sequence has pieces of DNA from 2 distinct genomic positions. This usually takes place when a DNA strand under amplification is terminated early and binds to a foreign DNA and replication continues.

Cluster generation

The generation of clusters of bridge-amplified DNA on the flow cell (very similar to PCR, but with bridge amplification). Takes place inside the cBOT machine.

What is a flow-cycle (or more generally a 'sequencing cycle')?

A cycle in which a single type of dNTP is washed over the beads containing the template DNA within wells in the PTP. If the dNTP is incorporated into the template DNA, light is released, detected by the machine, and the base call is recorded. Remaining dNTPs are removed and the next type is washed over.

Phospholinked nucleotides

A different colored fluorescent molecule is attached to the gamma phosphate of each nucleotide, which is naturally cleaved off as the polymerase incorporates the nucleotide

Bioanalyzer

A machine which quantifies the size range of DNA or RNA in a sample. It also gives a quantification of the concentration, but this is not reliable, and other methods should be used.

Pyrosequencing

A method of DNA sequencing based on the sequencing by synthesis principle developed in 1996. Pyrosequencing relies on the production of pyrophosphate after a nucleotide is added to the DNA copy of the template. This pyrophosphate is converted to ATP and then the energy of the ATP is released as light as the result of many different enzymatic reactions. The remaining unincorporated nucleotides and ATP are then degraded by apyrase so the next nucleotide wash and sequencing by synthesis can begin.

cyclic reversible termination

A method in which nucleotide attachment, fluorescence detection, and cleavage are repeated as a cycle, one base at a time (see question 5 below).

What is a physical map and how does it differ from a genome sequence?

A physical map is based on the use of restriction enzyme cutting sites, STS markers, FISH, and other data to map which contigs overlap, whereas genome sequences rely on the combination of DNA sequence 'letters' to determine overlap.

PTP or chip (Pico-titer Plate)

A plate that has many microscopic wells that allow for sequencing of millions of fragments in a 454 reaction.

SPRI beads

A simple method of DNA cleanup and purification using paramagnetic beads to prep for sequencing. It is simple and reproducible enough to be automated. SPRI beads are also used for size selection.

max. read length

The maximum length of a sequence read output by Illumina.

Why is amplification of templates often required for the fluorescent detection of an added base during sequencing?

A single fluorescent molecule isn't visible to the Illumina computer sensor, so multiple amplified templates are necessary to create a fluorescence signal strong enough for the computer to visualize.

sff file

An SFF file encodes flowgrams from 454 pyrosequencing.

What step used human manipulation during the construction of the FPC fingerprint data?

Assembling contigs together by comparing end clones from each contig to other end contigs and allowing a lower (50%) stringency. This reduced the number of contigs by five fold. Also identifying chimeric contigs and getting rid of them.

Gene Ontology

Consists of a controlled vocabulary that gives a consistent description of gene products and their functions.

Describe the activities that take place during the pyrosequencing reaction (enzymes, beads, and substrates).

DNA Polymerase - Adds nucleotides. Only one nucleotide type is added at a time.
ATP Sulfurylase - Generates ATP from pyrophosphate released by incorporation of dNTPs.
Luciferase - ATP bonds broken and converted to light.
Apyrase - Degrades excess nucleotides and ATP.
Beads (see question 5 below)

Dephasing and sequence phase

Dephasing is "noise" within the fluorescence. It could happen from a portion of the templates on a bead not incorporating the maximal number of nucleotides, so they are one step behind the other templates on the same bead. This can happen successively causing more and more dephasing.

How is the nucleotide label different with PacBio technology than other platforms?

Each of the four types of nucleotides is labeled with a different fluorophore, attached to the gamma phosphate of each dNTP. The fluorophore is cleaved off upon nucleotide incorporation by the polymerase anchored at the bottom of the ZMW well. Its fluorescence is detected by the machine.

emPCR

Emulsion PCR takes place when sheared gDNA fragments are attached to adaptors, which are bound to beads containing complementary adaptors (one fragment per bead). Oil is then added to the water/DNA/buffer mixture to compartmentalize each bead in tiny aqueous droplets. The typical PCR process of denaturation, annealing, and replication then takes place on each bead.

FPC

FingerPrinted Contigs. Fingerprint data are the pattern of DNA migration in a gel, used to determine which stretch of DNA is found in the gel (size, restriction enzyme cuts).

FISH

Fluorescent In Situ Hybridization - a technique which allows fluorescent probes to be hybridized to a complementary DNA sequence to show the location of that site or gene on a chromosome.

HMW DNA

High Molecular Weight DNA (genomic DNA larger than 150Kb)

Describe three key differences between Illumina and Sanger sequencing.

In Illumina sequencing small fragments must be created in order to be bridge amplified, but the total output is very large (Gb) though composed of small fragments, whereas in Sanger sequencing much larger fragments can be used and amplified, resulting in longer reads, but a smaller total output. Also they have two different amplification methods. Illumina uses flow cell bridge amplification and Sanger uses PCR-like amplification. You also get much higher coverage in Illumina sequencing than in Sanger. Illumina is massively parallel whereas Sanger is one sequence at a time.

What are sequencing adapters?

In the case of Illumina sequencing, they are the Y-shaped adapters that have a double-stranded end and a non-basepaired end (the Y-end) and that are ligated onto both ends of your sheared and repaired DNA. The Y-shape ensures that after PCR amplification, each molecule will have a P5 and a P7 adaptor on its respective ends. In general, sequencing adapters are DNA of known sequence that is added to the ends of genomic DNA fragments and that is required by the high-throughput sequencing platform: Y-shaped adaptors for Illumina (or P5 and P7), A and B adaptors for 454 and Ion Torrent, etc.

SMRT sequencing

It is literally Single Molecule Real-Time sequencing. The polymerase is immobilized at the base of the ZMW chamber and each of the four types of nucleotides is labeled with a different fluorophore, attached to the gamma phosphate of each dNTP. The fluorophore is cleaved off upon nucleotide incorporation. Its fluorescence is detected by the machine.

SOLiD sequencing does not use DNA polymerase. What does it use instead?

It uses DNA ligase (SBL) to ligate hybridized oligonucleotides to the growing DNA molecule during sequencing. Unlike most other sequencing technologies, no DNA polymerase is involved.

Why was the BAC map constructed and how was it used?

It was constructed to put the genome fragments from the BAC library in order based on overlapping fragments. It allows the researchers (especially in the human genome) to be able to get past all the tandem repeats that fill the genome and construct a minimum tiling path

How was accuracy of contig chromosomal positions determined?

It was tested by FISH. The use of FISH allowed the researchers to pinpoint the location of the BAC-inserted DNA on the chromosomes by binding the same probe. This allowed the researchers to know not only where each clone's DNA was located on the map, but also the order of their BAC clones to create the human genome.

Name two types of DNA shearing.

Mechanical shearing (nebulizing), and acoustic shearing (Covaris)

Was the H. influenzae genome assembled only from the shotgun clones? Explain.

No. Approximately 78% of the genome was covered by lambda clones.

How many plasmids of 1.5 kb are needed to cover a 145 kb BAC at 15x coverage (not a trick question)?

(145 kb / 1.5 kb) x 15 = 1450 plasmids
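The same calculation generalizes to any target size, insert size, and coverage. A minimal sketch, with the hypothetical helper `clones_for_coverage` rounding up because a fraction of a clone cannot be sequenced:

```python
import math


def clones_for_coverage(target_bp, insert_bp, coverage):
    """Number of clones so that clones * insert_bp = coverage * target_bp."""
    return math.ceil(target_bp * coverage / insert_bp)


# 145 kb BAC, 1.5 kb plasmid inserts, 15x coverage.
n_plasmids = clones_for_coverage(145_000, 1_500, 15)
print(n_plasmids)
```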

homo-polymer

A homopolymer is formed from multiple repeats of the same monomer (AAAAAAAAAAAAAAAAAAAA...)

nucleotide ambiguities

A location on a DNA sequence where the nucleotide at a specific position is not clear. Instead we use ambiguity codes (R, Y, W, S...) showing that it could be any of a set of nucleotides, depending on the letter used. These ambiguities can be resolved by looking at enough individual sequences that cover the position to see which nucleotide is most common (the consensus).
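A small sketch of how the ambiguity codes and the consensus idea fit together. The table follows the standard IUPAC nucleotide codes; the function names `ambiguity_code` and `consensus_base` are hypothetical, and picking the single most common base is only the simplest possible consensus rule.

```python
from collections import Counter

# Standard IUPAC nucleotide codes, keyed by the set of bases they stand for.
_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}
IUPAC = {frozenset(bases): code for bases, code in _CODES.items()}


def ambiguity_code(bases):
    """Return the IUPAC letter for the set of bases seen at one position."""
    return IUPAC[frozenset(bases.upper())]


def consensus_base(column):
    """Resolve an ambiguity by taking the most common base among the reads."""
    return Counter(column.upper()).most_common(1)[0][0]


print(ambiguity_code("AG"))       # purine ambiguity
print(consensus_base("AAGAA"))    # five reads cover this position
```

With only two reads (one A, one G) the position stays ambiguous, while with more coverage the majority base becomes the consensus.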

Genscan

A program that identifies gene structures within a genome, their location, and intron-exon boundaries.

Gap closure

Closing the gaps left in the consensus sequence after assembly

Why is size selection used in the preparation of DNA libraries?

It is used to have a defined size range of DNA molecules that you use to make your libraries so that you know that the DNA is the correct size for your sequencing platform and also you know the distance between paired-end reads during genome assembly. No sequencing platform could sequence whole intact chromosomes thus they need to be broken down into known size fragments.

Ionogram

The name of the format you can view base calls in after an Ion Torrent sequencing run.

Scaffolds

a compilation of DNA sequence contigs into one digital chromosome, or section of a chromosome. This still contains sequence gaps or physical gaps

DNA shearing

breaking genomic DNA into small multi-hundred bp fragments

Describe the enzymatic activities that take place on each Illumina colony.

On each Illumina colony there is a lawn of forward and reverse oligonucleotides with single stranded DNA bound to some of them. These single stranded DNA molecules bind to the opposite oligonucleotide and DNA polymerases duplicate the strands, which are then denatured so those strands can bind again and duplicate again with the DNA polymerase. Enzymes also play a role in the sequencing reaction by adding single nucleotides with blockers with reversible terminator fluorescent molecules. These fluorescent molecules are cleaved by other enzymes to allow the next nucleotide to add.

How was the whole-genome BAC data integrated with other map data?

Other maps were integrated by hybridizing markers for other maps against colony filter replicas and by using STSs to link other maps to the BAC data.

PGM

Personal Genome Machine, Ion Torrent's current sequencing machine meant for small genomes and targeted sequencing.

STS marker

Sequence Tagged Sites are short sequences of gDNA that can be uniquely identified and recognized in a DNA sequence.

Color space

Sequences that are detected and encoded as colors rather than bases. They are then either aligned by color rather than by sequence, or directly translated into sequence space (As, Ts, Gs and Cs). Color space is used in SOLiD sequencing and assembly.

Why does DNA need to be sheared?

So the NGS machine has fragments small enough that it can sequence the length of the fragment.

BAC library

Stands for Bacterial Artificial Chromosome. It is a group of bacterial clones that have the DNA from a single organism inserted into their own. These libraries can cover any amount of DNA, proportional to the number of colonies. Each individual BAC can contain an ~130kb fragment of genomic DNA

What needs to be 'reversed' on the incorporated nucleotide?

The 3' blocker (3′-O-azidomethyl) and fluorescent dye attached to the incorporated nucleotide need to be cleaved off after each round of incorporation and detection to allow a free 3'-hydroxyl and the 3' end to continue extension.

flow cell (or channel)

The glass base plate with 8 channels upon which oligonucleotides are bound which then are where the sheared DNA fragments are immobilized.

How were lambda libraries used in the assembly process?

The lambda clones were used to close the gaps in the assembly consensus sequence by looking at the ends of the contigs surrounding a gap and designing oligonucleotide probes for those sites. Upon finding two probes a good distance apart, primers were ordered based on those sequences to close the gaps from the lambda clone library DNAs.

What are the differences between 454 and Ion Torrent sequencing?

The library prep methods for the two are almost identical, involving emPCR and hairy beads. Both use PTPs to trap the beads for sequencing, both flow in one type of nucleotide at a time (SNA) and can have multiple bases added at once, but the difference is the method of detection. 454 converts the pyrophosphate released from the incorporation of a nucleotide into light that is detected, whereas Ion Torrent detects hydrogen ions that are released upon nucleotide incorporation and measures a pH change.

shotgun sequencing

The method of sequencing in which the genomic DNA is sheared into multiple fragments and then those seemingly random product fragments are sequenced and assembled into a complete genome

PostLight

The method of sequencing involving semiconductor chips. This greatly reduces the cost of sequencing because light and fluorescence are not involved. The signal is written directly to a computer chip.

What is meant by 'PostLight' sequencing?

The method of sequencing involving semiconductor chips. This greatly reduces the time of sequencing because light and fluorescence are not involved. The signal is written directly to a computer chip.

chip density

The more dense the chip used in the 316 or 318 format the more individual beads and sequences you can get off a single chip/run.

FlyBase

The online database where the fly genome is stored, and accessible. Gene annotations can also be made there

pyrogram/flowgram/ionogram

The output file that shows base calls for the 454 reaction (usually in the sff file format described above).

3R

The right side of the 3rd chromosome (in this case in the genome of fly)

minimum tiling path

The tiling path is the minimum set of BACs that contain a whole chromosome with the minimum overlap possible.
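One way to picture choosing a minimum tiling path is as an interval-covering problem. The sketch below uses a simple greedy rule (always take the clone that reaches furthest); it is an illustration only, not the method used in any particular genome project. The clone names and coordinates are invented, and it treats clones that merely touch as contiguous, whereas real tiling paths need some overlap to confirm the join.

```python
def minimum_tiling_path(clones, region_start, region_end):
    """Greedy cover of [region_start, region_end] by (name, start, end) clones."""
    clones = sorted(clones, key=lambda c: c[1])  # order by start coordinate
    path = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        # Among clones starting inside the covered region, keep the one
        # that extends furthest to the right.
        while i < len(clones) and clones[i][1] <= covered:
            if best is None or clones[i][2] > best[2]:
                best = clones[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        path.append(best[0])
        covered = best[2]
    return path


# Hypothetical BACs with coordinates in kb.
bacs = [("BAC1", 0, 130), ("BAC2", 50, 180), ("BAC3", 100, 240),
        ("BAC4", 170, 300), ("BAC5", 230, 360)]
tiling = minimum_tiling_path(bacs, 0, 360)
print(tiling)
```

In this made-up example BAC2 and BAC4 are redundant, so only three of the five clones would need to be sequenced.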

What is the maximum output of an Illumina run? Per lane?

The total output of an Illumina run on a HiSeq 2500 HT v4 is ~1 Tb and an individual lane can sequence ~62.5 Gb.

CMOS factory

The type of factory that makes common semiconductor chips. These chips are used in Ion Torrent sequencing.

2-base encoding

The type of encoding for sequencing done with the SOLiD method, where each incorporation of an oligonucleotide gives you the sequence for two nucleotides instead of just one.

How many types of beads are used in 454 sequencing and what is their function?

There are four types of beads: The first is the streptavidin coated beads that are used to ensure that only DNA fragments with both an A and a B adapter are in the library. The second type is the primer coated capture beads that capture the original strand of DNA and are the base of replication for emPCR. The third type of bead is enzyme coated with ATP Sulfurylase and Luciferase to create the light used for base detection. The fourth type are packing beads used to fill remaining space in all the wells of the pico-titer plate.

How many fluorescent colors are used in pyrosequencing? Why?

There is just one color of light because the bases are washed across one at a time and there is only one enzyme producing light. In fact it is not even fluorescent light that is emitted.

In addition to 'glutamate', name two other unique things identified in the genome sequence.

There were none of the NtrC class regulators found in E. coli, suggesting a different regulatory system from E. coli. Also there was no CpxR regulator found.

How were the sequences of Chr. 21 and 22 used to estimate the physical map coverage of the genome?

They estimated the coverage of the FPC map and then took the completed DNA sequences for Chr. 21 & 22 and broke them up into simulated fragments in silico. They then assembled those fragments and compared them with the FPC map and found them to be about 96% identical. Therefore, they assumed they had about 96% coverage on their physical map.

Why is a large amount of glutamate needed to culture H. influenzae?

They found in the paper that it lacks specific genes that code for enzymes in the TCA (tricarboxylic acid) cycle required to gain carbon for the synthesis of amino acids. Glutamate can be converted to alpha-ketoglutarate to be usable in that cycle.

What is two base encoding in SOLiD sequencing and why is it theoretically superior to other methods?

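Two base encoding means that each oligonucleotide that hybridizes to the unknown sequence you are sequencing matches two consecutive bases at a time. The color of the fluorophore that is detected therefore represents not one nucleotide but a combination of two nucleotides, and has to be translated into sequence by knowing the identity of the previous nucleotide. Because each base is covered by two overlapping dinucleotide probes, a true variant changes two adjacent colors while a measurement error changes only one, which in theory lets errors be told apart from real differences.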
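The translation step can be sketched in a few lines. This assumes the usual SOLiD color table (color 0 for AA/CC/GG/TT, 1 for AC/CA/GT/TG, 2 for AG/GA/CT/TC, 3 for AT/TA/CG/GC), which happens to equal the XOR of the two bases when A, C, G, T are numbered 0 to 3. The function names are hypothetical.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3


def encode_color_space(seq):
    """First base as a letter, then one color (0-3) per adjacent base pair."""
    colors = [BASES.index(a) ^ BASES.index(b) for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(str(c) for c in colors)


def decode_color_space(cs):
    """Walk the colors, using the previous base to recover each next base."""
    prev = cs[0]
    out = [prev]
    for color in cs[1:]:
        prev = BASES[BASES.index(prev) ^ int(color)]
        out.append(prev)
    return "".join(out)


encoded = encode_color_space("ATGGC")
print(encoded)
print(decode_color_space(encoded))
# A single miscalled color (second color 1 -> 2) corrupts every later base,
# which is why color-space reads are usually aligned as colors.
print(decode_color_space("A3203"))
```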

316/318

Two types of chips for the PGM

bridge amplification

When a DNA fragment on an Illumina colony with ligated adaptor ends hybridizes to other complementary attached primers on the plate. The sequence between these attached primers is duplicated (forming a bridge between the two) and then the ds molecules are denatured and they hybridize to other complementary primers, thus amplifying again and again.

A-tailing

Where a sheared DNA sequence is adenylated with an overhanging 3'- A at each end at which site the adaptors will bind.

Annotation Jamboree

Where all the experts on a particular organism get together to annotate a genome all at once.

WGS

Whole Genome Shotgun sequencing method, in which the genome is fragmented and a specific size range of fragments is sequenced.

physical map contig

a contig made up of physical map markers (STS, FISH, Restriction Enzyme Cut Sites) NOT DNA sequence data.

sequence contig

a contig that is made of DNA sequence data

Fastq

a file format used to store a nucleotide sequence and its quality scores
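For a concrete picture of the format, here is a minimal sketch that reads one four-line FASTQ record. It assumes the common Phred+33 quality encoding (older Illumina pipelines used a different offset), and the record itself and the function name `parse_fastq_record` are made up for illustration.

```python
def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record into (name, sequence, quality scores)."""
    header, seq, plus, qual = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    # Assumption: Phred+33, so quality = ASCII code minus 33.
    scores = [ord(ch) - 33 for ch in qual]
    return header[1:], seq, scores


record = ["@read1\n", "GATTACA\n", "+\n", "IIII#5?\n"]
name, seq, scores = parse_fastq_record(record)
print(name, seq, scores)
# A Phred score Q corresponds to an error probability of 10^(-Q/10).
print([10 ** (-q / 10) for q in scores])
```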

Physical gap

a gap between contigs/scaffolds whose DNA sequence is not found within our clone library

Sequence gap

a gap in the DNA sequence whose DNA is in our clone library

MinION

a new third-generation sequencing format that uses nano-pores and the detection of electrical changes as single stranded DNA passes through the pore one nucleotide at a time to accomplish single molecule sequencing. This format is portable, plugs into a usb port and can sequence long read lengths which are determined by the size of the template DNA and the amount of time the sequencing is allowed to run.

restriction patterns

a pattern from a BAC that is cut by a restriction enzyme and run out on a gel to determine the exact cutting sites (and position/amount of overlap) in relation to the other BACs

Describe the process of Illumina DNA library preparation (i.e. up to, but not including the sequencing cycles).

a. HMW genomic DNA is sheared (see question 2), size selected, and quantified.
b. The ends of the DNA fragments are repaired so they are blunt and have a 5'-phosphate on each end.
c. An additional adenosine is added to the 3' ends.
d. Y-shaped adaptors are ligated to each end of the DNA molecules.
e. A few rounds of PCR are performed so each molecule now has a double stranded P5 adaptor on one end and a double stranded P7 adaptor on the other.
f. The complete library is now quantified and denatured to make it single stranded.
g. The library is added to the flow cell and bridge amplification or cluster generation is performed on the cBOT machine.

biotinylated-adapters

attaching a biotin molecule to the 5' end of the adapter so that end will bond to the "streptavidin coated beads" to select for correctly linkered templates.

primer-coated capture beads

beads coated with oligonucleotides that are complementary to the adapters that bind to the bead. This allows the bead to bind the DNA, allowing duplication during emPCR. These beads after replication (when they are 'hairy') are deposited on the picotiter plate and covered in enzyme coated beads and packing beads

Contig

overlapping sequence reads that have been assembled by a computer program to form a continuous sequence

Sequence contigs

overlapping sequence reads that have been assembled by a computer program to form contigs

Singleton

sequence reads that do not assemble to any contig. Often these are contaminants or sequencing errors

SBL

sequencing by ligation. The method of sequencing used by SOLiD where no DNA polymerase is used, but instead, DNA ligase is used to ligate labeled oligonucleotides that reveal the sequence through hybridization

SBS

sequencing by synthesis. Sequencing like Illumina, Sanger, 454, Ion Torrent where DNA polymerase is used to add nucleotides to the growing DNA molecule which are then detected to indicate the sequence (with fluorescence, pyrophosphate production or voltage/pH change).

SNA

single nucleotide addition, sequencing like 454 or Ion Torrent where one type of nucleotide is added at a time and then washed off. Homopolymeric runs in the sequence result in multiple nucleotides of the same kind being added in a row, giving off twice, three times, four times, etc. the amount of light or pH change.
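The link between signal strength and homopolymer length can be sketched as a toy base caller. The flow order "TACG", the signal values, and the function name `call_bases_from_flows` are assumptions for illustration; real base callers also correct for noise and dephasing rather than just rounding.

```python
def call_bases_from_flows(flow_values, flow_order="TACG"):
    """Turn per-flow signal intensities into a sequence.

    Each flow adds one nucleotide type; a signal near 2.0 means two
    identical bases were incorporated in a row, near 0.0 means none.
    """
    seq = []
    for i, signal in enumerate(flow_values):
        base = flow_order[i % len(flow_order)]
        n_incorporated = int(signal + 0.5)  # round to nearest whole base
        seq.append(base * n_incorporated)
    return "".join(seq)


# Made-up flowgram: T x1, A x0, C x2, G x1, T x0, A x3, C x0, G x1
flows = [1.02, 0.05, 2.10, 0.97, 0.03, 3.05, 0.00, 1.10]
called = call_bases_from_flows(flows)
print(called)
```

A signal such as 4.5 sits exactly between two homopolymer lengths, which is why long homopolymers are the main source of insertion and deletion errors on these platforms.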

Consensus sequence

the assembled sequences with ambiguities resolved by coverage

Assembly

the assembly of all the contigs sequenced to create one long continuous strand of DNA composed of many contigs and scaffolds.

DNA library

the library of sheared DNA fragments that will be attached to the flow cell base plate

Coverage

the number of times that a sequence was sequenced. (individual overlapping reads)

finishing phase

the phase that is after the basic assembly where all the gaps are revealed and site specific sequencing begins to fill in those gaps.

size selection

the selection of fragment sizes that will be used in the Illumina sequencing reaction

Scaffold

the type of file that holds related unitigs (high-confidence contig) or proteins in succession, although gaps may exist between them.

Heterochromatin

tightly packed sequences of DNA in a compressed form so fewer regions are available for transcription machinery. For Drosophila, regions of highly repetitive DNA and transposons that contain few genes

Dephasing

when a sequence becomes out of phase. In the case of Illumina sequencing this is usually when some of the templates in an individual cluster incorporate nucleotides that either lack a fluorophore or a terminator so the extension doesn't stop with only one base and it skips a space in the frame. Thus a portion of the cluster will incorporate the wrong nucleotide in the next round. This is also caused by homopolymeric runs in 454 sequencing

