Genomics and Bioinformatics
Steps in Microarray Analysis
(Part 2) Once the amount of repression or induction for each gene is determined, the data must be "sorted" computationally using algorithms, e.g. hierarchical clustering. 3. By "clustering" similarly expressed genes, groups of genes that "turn on/off" in response to the same variable can be identified. 4. Hierarchical clustering finds the pair of genes that are most similar, joins them together, and then identifies the next most similar pair of genes.
Extrinsic method of Gene Annotation
- 'been there, seen that' - Recognizes regions corresponding to previously known genes (similarity of translated amino acid sequence to known proteins) - Matching expressed sequence tags (ESTs) to sequences of known genes • Few hundred bases from a cDNA.
C-value
- Genome size • amount of DNA in a haploid cell • C refers to constancy of the amount of DNA/cell in a species. • Compared DNA content among species
Linkage
- Linked traits are governed by genes on the same chromosome. - During gamete formation, alleles on different chromosomes of a homologous pair can recombine (crossing-over). - Frequency of crossing-over is a measurement of the distance between genes. • Genes are in a linear order on the chromosome. • Genetic distance (map units) is additive. • Unit of length in a gene map is a Morgan. • 1 cM corresponds to a 1% recombination frequency. • 1 cM is approximately 10^6 bp (1 Mb) in humans. - Frequency of crossing-over is variable between sexes and in the genome. • ~80% of genetic recombination takes place in no more than ~25% of our genome.
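The relationship between recombination frequency and map distance can be sketched in a few lines of Python. This is a minimal illustration, assuming the simple rule that 1% recombination equals 1 cM (which only holds for closely linked loci) and the rough human average of about 1 Mb per cM; the function names are hypothetical.

```python
def map_distance_cm(recombinant_offspring, total_offspring):
    """Map distance in centimorgans: 1 cM per 1% recombinant offspring."""
    return 100.0 * recombinant_offspring / total_offspring


def approx_physical_distance_bp(distance_cm, bp_per_cm=1_000_000):
    """Rough physical distance, assuming ~1 Mb per cM (a genome-wide average)."""
    return distance_cm * bp_per_cm


# Additivity for closely linked loci: A-B plus B-C approximates A-C.
ab = map_distance_cm(12, 200)   # 6.0 cM
bc = map_distance_cm(8, 200)    # 4.0 cM
ac_estimate = ab + bc           # 10.0 cM
```

Because crossover frequency varies between sexes and across the genome, the physical distance is only an order-of-magnitude estimate.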
A priori method of Gene Annotation
- Recognize sequence patterns within expressed genes • Initial (5') exon starts with a transcription start point, preceded by a core promoter site (i.e. TATA box), about 30 bp upstream. • Free of stop codons before first GT splice site. • Internal exons are free of inframe stop codons. • Final 3' exon starts immediately after an AG splice signal and ends with a stop codon, followed by a poly-A signal sequence.
Steps in a Microarray Expt
1) Synthesize probes - robotic systems apply thousands of individually synthesized probes in situ on the slide (one spot contains thousands of probes for one gene). 2) Acquire mRNA - oligo dT column extraction of pure mRNA from cells. 3) Label mRNA or cDNA - fluorescent dye (or biotinylated nucleotide). Control and experimental cDNA are labeled with different dyes: Cy3 (green) for the control, Cy5 (red) for the experimental sample. 4) Hybridization - both samples are hybridized simultaneously to the microarray. 5) Fluorescence analysis - relative intensity is determined using a confocal laser scanner. 6) Data analysis - ratios of red:green are next analyzed mathematically for relationships among the genes. Hierarchical clustering can be used.
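As a rough illustration of the data analysis step, the red:green ratio for one spot is commonly expressed on a log2 scale so that induction and repression are symmetric around zero. The sketch below assumes background-corrected intensities and a two-fold cutoff; both the cutoff and the function names are illustrative assumptions, not part of the protocol above.

```python
import math


def log2_ratio(cy5_red, cy3_green):
    """log2 of experimental (Cy5, red) over control (Cy3, green) intensity."""
    return math.log2(cy5_red / cy3_green)


def classify(cy5_red, cy3_green, fold=2.0):
    """Label a gene as induced, repressed, or unchanged at a given fold cutoff."""
    ratio = log2_ratio(cy5_red, cy3_green)
    cutoff = math.log2(fold)
    if ratio >= cutoff:
        return "induced"
    if ratio <= -cutoff:
        return "repressed"
    return "unchanged"


print(classify(800, 200))  # red is 4x green: induced
print(classify(100, 400))  # red is 1/4 of green: repressed
```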
Improvements on Sanger Method:
1. Two phase system replaced by the use of dideoxynucleotide as "chain terminators" in the primer extension reaction (at low concentration relative to the other dNTPs). 2. Development of fluorescent dyes to replace the radioactive labels on the newly synthesized DNA fragments, led to the semi-automation of gel electrophoresis and reading gels.
Limitations of Sanger Method
1. Using an oligonucleotide primer required knowing some DNA sequence located directly adjacent to the DNA to be sequenced. 2. The "random extension" of the primer did not necessarily generate an even distribution of fragments of all desired length.
Genetic mapping
Based on differences in recombination frequency between genetic loci
Physical mapping
Based on distances in base pairs between specific sequences found on the chromosome
Cytological mapping
Based on histologically stained banding patterns on the chromosome
Genomic Tests
Chromosomal Microarray (CMA), Exome Sequencing, Genome Sequencing, Multicolor FISH (M-FISH)/Spectral Karyotyping (SKY), Subtelomeric FISH Screen.
Basics of microarrays
DNA is attached to a solid support (glass, plastic, or nylon). RNA is labeled, usually indirectly as cDNA. The DNA bound to the solid support is the "probe" and is in excess. Labeled RNA or cDNA is the "target."
Exome sequencing
Exome sequencing - targeted sequencing of regions in DNA that code for parts of expressed proteins (exons). - Approximately 180,000 exons in human genome, 30 Mb or ~1% of genome. - Also called "targeted exome capture" • More economic way of doing personal genome sequencing or collecting information on population variability (SNPs).
Antisense RNA versus dsRNA
Fire and Mello (1998) injected dsRNA, which was 100 times more effective at gene silencing than the antisense RNA strand alone. They called the phenomenon dsRNA-dependent gene silencing, or RNA interference (RNAi).
GEO Databases
GEO Profiles - gene-level database that stores individual gene expression and molecular abundance profiles. Each profile is presented as a chart that displays the expression level of one gene across all Samples within a DataSet. Experimental context is provided in the bars along the bottom of the charts, making it possible to see at a glance whether a gene is differentially expressed across different experimental conditions. Profiles have various types of links, including internal links that connect genes that exhibit similar behaviour and external links to relevant records in other NCBI databases. To try it: go to NCBI, select GEO Profiles, and enter "CFTR".
Steps in Microarray Analysis
Heat Map (expression table or matrix) - represents the level of expression of many genes across a number of comparable samples from a DNA microarray. Gene vectors = rows; sample vectors = columns. Scatter plot - each point represents the expression value of a gene in two samples or experiments, e.g. a scatter plot of microarray data for healthy vs. obese patients, searching for differentially expressed genes.
DNA Barcoding
How it works: 1. Sample organism 2. Extract DNA 3. Amplify barcode DNA 4. Sequence DNA 5. Compare against sequence database
Photolithography - Affymetrix microarrays
Light-activated chemical reaction for the addition of bases to a growing oligonucleotide. Custom masks prevent light from reaching spots where bases are not wanted. Mirrors can also be used (NimbleGen™ uses this approach).
Linkage
Linkage: distribution of loci among chromosomes.
NChIP - Native ChIP
Maps the DNA targets of histone modifiers, with native chromatin used as the starting material. Chromatin is sheared by micrococcal nuclease digestion, which cuts the linker DNA and leaves nucleosomes intact. This yields DNA fragments of one nucleosome (200 bp) to five nucleosomes (1000 bp) in length.
Detection of genome wide methylation
MeDIP-seq - antibody to methylated cytosines. MBD-seq - binding to a recombinant methyl-binding domain. RRBS/MethylC-Seq - use bisulphite-converted DNA to calculate the ratio of methylated cytosines (read as C in sequence reads) to unmethylated cytosines (converted to T in sequence reads). Infinium array - compares hybridization to either methylated or unmethylated probes on an array to calculate an approximate % methylation at a given CpG.
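For the bisulphite-based methods, the methylation level at one cytosine is usually summarized as the fraction of reads that still show C. A minimal sketch, assuming read counts per site have already been tallied (the function name and the choice to return None for uncovered sites are illustrative):

```python
def methylation_level(c_reads, t_reads):
    """Fraction of reads at a cytosine that read C (methylated) rather than T."""
    total = c_reads + t_reads
    if total == 0:
        return None  # no coverage at this site, so no estimate
    return c_reads / total


print(methylation_level(30, 10))  # 0.75, i.e. about 75% methylated
```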
ChIP-on-chip or ChIP-chip
Microarrays contain a large number of known genomic sequences and allow a genome-wide view of DNA-protein interactions. Localize protein binding sites that may help identify functional elements in the genome. Localize the distribution of histones throughout the genome in specific cell types or stages. List all protein-DNA interactions in selected organisms under various physiological conditions.
COX1
Mitochondrially encoded cytochrome c oxidase 1: • 658 bp region (the Folmer region) of Mt-CO1 is used as a barcode. • Fast mutation rate so sequence differences between species can be recognized. • As of 2009, database of CO1 sequences included at least 620,000 specimens from over 58,000 species of animals.
Maps (genetic, physical, cytological, linkage)
Most powerful when genetic and physical mapping are combined
Capillary electrophoresis
Newer automated sequencers use very thin capillary tubes • Run all four fluorescently tagged reactions in same capillary • Can have 96 capillaries running at the same time
Affymetrix GeneChips
Oligonucleotides: usually 20-25 bases in length; 10-20 different oligonucleotides for each gene. Oligonucleotides for each gene are selected by a computer program to be the following: unique in the genome; nonoverlapping (prevents cross-hybridization); composition based on design rules; empirically derived to prevent hairpin structures.
ChIP
Protein and associated chromatin in a cell lysate are temporarily bonded (or cross-linked). DNA-protein complexes (chromatin-protein) are then sheared. DNA fragments associated with the protein(s) of interest are selectively immunoprecipitated. The associated DNA fragments are purified and their sequence is determined. These DNA sequences are supposed to be associated with the protein of interest in vivo.
Northern Blot
RNA samples are loaded onto an agarose gel. After electrophoretic separation of the RNA, the gel is placed in contact with a nylon or nitrocellulose membrane. Capillary action transfers RNA from the gel to the membrane, maintaining the separation. The membrane is processed, causing RNA to bind tightly to it. Hybridization of a labeled probe to the RNA follows. Probe: DNA that contains a radioactive isotope or an epitope. Labeled DNA is added to the membrane containing the separated RNA. Hybridization takes place, the probe finds its complementary RNA, and the membrane is washed.
ChIP-seq
Requires only 30 to 50 bp long sequence reads to map to the reference genome and identify binding sites. Uses NGS (e.g. the Illumina Genome Analyzer) to directly identify fragments enriched by ChIP. ChIP-seq can make use of any NGS technology, but Illumina has become the most popular sequencer in this field. Generates up to 230 million reads per lane at a cost of ~$1000 per lane, and supports multiplexed sequencing (multiple samples per lane).
Transposons
The only apparent purpose of SINEs and LINEs is to insert copies of themselves into other places in the genome.
SINEs and LINEs
SINEs (short interspersed transposable elements): - 1.5 million in genome - 200-300 bp in length - Alu most common (280 bp), ~300,000 copies/genome - must use the transposition machinery encoded by LINEs. LINEs (long interspersed transposable elements): - most common copy number = ~20,000 - encode their own transposition machinery - 850,000 in genome - 1-5 kb - 10s to 10,000s of copies/genome. Both: similar in relatives and unique to individuals; basis of DNA sequencing used for parentage; used to determine evolutionary relationships.
HapMap
Testing all of the 10 million common SNPs in a person's chromosomes would be extremely expensive. The development of the HapMap will enable geneticists to take advantage of how SNPs and other genetic variants are organized on chromosomes. Genetic variants that are near each other tend to be inherited together.
Linkage disequilibrium
The deviation of the genotype distribution in the population from the ultimate 1:2:1 ratio is called linkage disequilibrium. • Close linkage of 2 loci on a chromosome is a common source of long-term persistence of linkage disequilibrium. • Linkage disequilibrium: the distribution of allelic patterns in populations.
Scaffolding
The process through which the read pairing information is used to order and orient the contigs along a chromosome is called scaffolding.
Finishing
The ultimate goal of any sequencing project is to determine every single base-pair of the original set of chromosomes. Rarely is an assembly program able to reconstruct a single piece of DNA per chromosome, leading to gaps in the reconstruction of the genome. These gaps are filled in through directed sequencing experiments in a process called finishing or gap closure. At this stage in the sequencing project, additional laboratory experiments and extensive manual curation are performed to validate the correctness of the final assembly, leading to a high-quality reconstruction of the original genome. • Process of assembling raw sequence reads into accurate contiguous sequence - Required to achieve 1/10,000 accuracy • Manual process - Look at sequence reads at positions where programs can't tell which base is the correct one - make determination - Fill gaps by new sequencing experiments - Ensure adequate coverage • To fill gaps in sequence, design primers and sequence from primer • To ensure adequate coverage, find regions where there is not sufficient coverage and use specific primers for those areas
Steps in Microarray Analysis
Thresholds and "Cutting the Tree" By retracing the order in which genes were progressively joined into clusters and by knowing the correlation value of each step, you can map out which genes are related to each other closely and which genes are related only distantly. Setting a threshold value within the range of -1 and +1 is called "cutting the tree."
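The idea of joining the most similar genes step by step and then "cutting the tree" at a correlation threshold can be sketched as follows. This is a small illustration, assuming Pearson correlation as the similarity measure and average linkage between clusters; the function names are hypothetical, and a gene with a completely flat profile would need special handling because its correlation is undefined.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two expression profiles (range -1 to +1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_and_cut(profiles, threshold):
    """Join the most correlated clusters until none reach the threshold."""
    clusters = [[gene] for gene in profiles]

    def similarity(c1, c2):
        # Average linkage: mean correlation over all cross-cluster gene pairs.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: similarity(clusters[p[0]], clusters[p[1]]))
        if similarity(clusters[i], clusters[j]) < threshold:
            break  # this is where the tree is "cut"
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c) for c in clusters]


profiles = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8],
            "C": [4, 3, 2, 1], "D": [8, 6, 4, 1]}
print(cluster_and_cut(profiles, 0.8))  # two clusters: A with B, C with D
```

A lower threshold keeps merging further up the tree, so fewer and larger clusters remain; a higher one leaves more small clusters.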
XChIP- Cross-linked ChIP
Used to map the DNA targets of transcription factors: 1. Uses reversibly cross-linked chromatin as a starting material (formaldehyde or UV light). 2. Cross-linked chromatin is sheared by sonication, yielding fragments of 300-1000 bp. 3. Cell debris is cleared by sedimentation and protein-DNA complexes are immunoprecipitated with specific antibodies of interest. 4. These antibodies are coupled to beads. 5. Immunoprecipitated complexes (the bead-antibody-protein-target DNA sequence complexes) are collected. 6. The protein-DNA cross-link is reversed and proteins are removed by digestion with proteinase K. 7. DNA associated with the complex is purified and identified by PCR, microarrays (ChIP-on-chip), molecular cloning and sequencing, or direct high-throughput sequencing (ChIP-seq).
Polony Sequencing
Used to sequence a full genome in 2005. • George Church group at Harvard. • Inexpensive, accurate, high-throughput way to re-sequence genomes of interest by comparison to a reference genome. • In vitro shotgun genomic libraries are clonally amplified by emulsion PCR.
Orthologs and Paralogs
When comparing sequence from different genomes, must distinguish between two types of closely related sequences
Protein Signature
a protein category such as a domain or motif
Protein Domain
a region of a protein that can adopt a 3D structure (a fold). A family is a group of proteins that share a domain. Examples: zinc finger domain, immunoglobulin domain.
Protein Motif (or fingerprint)
a short, conserved region of a protein typically 10 to 20 contiguous amino acid residues
psiblast
allows the user to build a PSSM (position-specific scoring matrix) using the results of the first blastp run
Maximum parsimony
an optimal tree is one that postulates the fewest mutations.
Maximum likelihood
assigns quantitative probabilities to mutational events, rather than merely counting them.
Target labeling: fluorescent cDNA
cDNA made using reverse transcriptase Fluorescently labeled nucleotides added Labeled nucleotides incorporated into cDNA
Antisense strand
complementary strand which is used as the template to produce mRNA
deltablast
constructs a PSSM using the results of a Conserved Domain Database search and searches a sequence database
Cladistic method
deals explicitly with the patterns of ancestry implied by the possible trees relating a set of taxa. - Specialized to sequence data, starting from a multiple sequence alignments (e.g. CLUSTAL)
Micro RNAs (miRNAs)
derived from precursor RNAs encoded by genes of the organism in which the miRNA acts as a regulator.
De novo sequencing
determination of a full-genome sequence without a known reference sequence from an individual of the species that would allow the assembly step to be avoided.
Phenetic (clustering)
determine phylogenetic relationships with no reference to history. - Choose the two most closely related species and insert a node to represent their common ancestor. - Replace the two selected species by a set containing both, and then replace the distances from the pair to the others by the average of the distances of the two selected species to the others. - Repeat the process. - Process called: unweighted pair group method with arithmetic mean.
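The procedure above can be sketched in Python. This is a minimal illustration, assuming a symmetric distance matrix given as a list of lists; the function name and the nested-tuple tree representation are choices made for this example. The average is weighted by cluster size, which reduces to the plain average of the two selected species when both are single taxa.

```python
def upgma(names, matrix):
    """Build a tree as nested tuples by repeatedly joining the closest clusters."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    dist = {(i, j): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)       # the most closely related pair
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        del dist[(a, b)]
        for c in clusters:                   # distances from the new node to the rest
            d_ac = dist.pop((min(a, c), max(a, c)))
            d_bc = dist.pop((min(b, c), max(b, c)))
            dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree


names = ["A", "B", "C", "D"]
matrix = [[0, 2, 6, 10],
          [2, 0, 6, 10],
          [6, 6, 0, 10],
          [10, 10, 10, 0]]
print(upgma(names, matrix))  # ('D', ('C', ('A', 'B')))
```

Branch lengths are omitted to keep the sketch short; only the joining order is recorded.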
Dicer (RNase-III-like endoribonuclease)
digests the longer dsRNA or stem-loop structure formed by miRNA precursor into short segments (20-25 bp long).
Sense strand
encodes a gene
Paralogs
genes found in the same species that were created through gene duplication events
Orthologs
genes found in two species that had a common ancestor
MaturaseK (MatK)
highly conserved gene in plant systematics • Involved in Group II intron splicing. • 1500 bp in length, located in the intron of the trnK gene. • MatK contains high substitution rates within the species, making it a desirable bar code marker.
Small interfering RNAs (siRNAs)
made artificially or in vivo from dsRNA precursors.
blastn
nucleotide BLAST database: DNA query: DNA sequence
megablast
nucleotide BLAST; finds highly similar sequences very fast; use to identify a nucleotide sequence
Emulsion PCR
One bead, one read. DNA is fragmented and adapters are joined to either end of each fragment. The DNA is amplified in an emulsion PCR (which includes a 1 µm agarose bead carrying adaptors complementary to those on the fragmented DNA). PCR amplification yields up to 1 million identical fragments around one bead, and the beads are finally dropped into a PicoTiterPlate (PTP).
ORF
part of gene that actually encodes a protein. • Characterized by a Start Codon (ATG/AUG) followed by an open reading frame long enough to produce a protein before coming to a Stop Codon TAG, TAA, TGA/UAG, UAA, UGA. • Codon usage matches the frequency characteristic for the organism's coding regions. • One piece of evidence in gene prediction.
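A simple ORF scan along these lines might look like the sketch below. It assumes a DNA string, searches only the three forward reading frames, and uses an arbitrary minimum length; the function name and the `min_codons` parameter are illustrative. Real gene prediction would also scan the reverse strand and weigh codon usage.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, sequence) for each ATG...stop stretch, forward strand only."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                   # look for the next ORF
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14, 'ATGAAATTTTAG')]
```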
phiblast
performs the search but limits alignments to those that match the pattern query
PAGE
polyacrylamide gel electrophoresis, replaced by capillary electrophoresis
blastp
protein BLAST database: protein query: protein sequence; simply compares a protein query to a protein database
Haplotypes
regions of linked variants Genetic variants that are near each other tend to be inherited together. For example, all of the people who have an A rather than a G at a particular location in a chromosome can have identical genetic variants at other SNPs in the chromosomal region surrounding the A.
454 sequencing
the first cyclic sequencing • The nucleotide added in each synthesis cycle is detected from each feature, building up millions of sequences in parallel
Human Genome Project
to sequence the entire genome, not just transcribed genes or disease genes; to sequence the genome to a high level of accuracy, with less than one error in 10,000 bases (originally, this margin was set at less than one error in 100,000 bases); to develop genomic resources that would be useful for all genes (for example, collections of physical markers); and to develop economies of scale. This last goal has meant the concentration of sequencing in a few centers, where it is carried out on an industrial scale.
tblastx
translated BLAST database: 6-frame translated DNA query: 6-frame translated DNA
tblastn
translated BLAST database: 6-frame translated DNA query: protein sequence
blastx
translated BLAST database: protein query: 6-frame translated DNA
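The "6-frame translated DNA" used by the translated BLAST programs can be illustrated with a short sketch: three reading frames on the given strand and three on its reverse complement. The code assumes the standard genetic code and an unambiguous A/C/G/T sequence; the function names and frame labels are illustrative and are not BLAST's own implementation.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in T, C, A, G order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return translations for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
print(frames["+1"])  # MA*
print(frames["-1"])  # LGH
```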
Gene duplication and polyploidy
• A driving force in evolution? - Large-scale duplication of genes - Differentiation of function in duplicated genes over time • Evidence - Polyploidy • More than two copies of each chromosome - Endopolyploidy - Evidence of past genome duplications in diploid organisms
Microarray
• A microarray is a solid support (such as a membrane or glass microscope slide) on which DNA of known sequence is deposited in a grid-like array. • High-throughput analysis by running many hybridizations in parallel. • Can contain 400,000 probe oligomers on one slide, with 10,000 - 250,000 positions per cm². • The most common form of microarray is used to measure gene expression. RNA is isolated from matched samples of interest. • The RNA is typically converted to cDNA, labeled with fluorescence (or radioactivity), then hybridized to microarrays in order to measure the expression levels of thousands of genes.
Whole genome shotgun sequencing
• Alternative solution: fragment entire genome - Sequence each fragment - Assemble overlapping sequences to form contiguous sequence • Focus on principles and techniques of mapping and sequencing • Approach used by Craig Venter's Celera. • Sequence random pieces of DNA and put them in order. - Drosophila melanogaster: random pieces of 2, 10, and 150 kb. - Sequence of 500 bp at each end was determined (reads) - Computer assembled reads into maximal set of contiguous sequence (contig) - Fully assembled genome sequence - Golden Path. Limitations include: • Internal repetitive sequences create problems for assembly. • Highly skewed base compositions (~80 mol% AT in Plasmodium falciparum). • Gaps found in assembled sequence. • Sample from diploid organism can have allelic differences between homologous chromosomes. - Must map these fragments to the same location and not mismap them to different contigs based on sequence polymorphisms.
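The step in which a computer assembles overlapping reads into contigs can be illustrated with a toy greedy assembler. This is a sketch under strong assumptions: error-free reads, a single strand, and no repeats, which are exactly the conditions the limitations above say real genomes violate. The function names and the `min_len` parameter are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # no overlaps left: the remaining sequences are separate contigs
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"]))  # ['ATGGCGTACCTTA']
```

When no overlap of at least `min_len` remains, the function returns several contigs, which corresponds to the gaps mentioned above.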
Pyrosequencing
• Based on "sequencing by synthesis" principle. • Also called "light sequencing". • Relies on the detection of pyrophosphate (PPi) that is released on nucleotide addition to the 3' end of a primer. • The 454 pyrosequencing machine produces the longest reads of any of the next-generation sequencers (700 bp). • Used to sequence the genome of James Watson in 2007. • Emulsion PCR is used to amplify genomic fragments that are hybridized to DNA-capture beads. • Then beads with their attached polonies are arrayed on a picotiter plate (2 million wells of 28 µm in diameter - the diameter of one bead). • Each well has a single DNA-capture bead, along with smaller beads containing immobilized enzymes for pyrosequencing (light sequencing). • Sequential addition of each dNTP gives the sequence. • Apyrase enzyme is used to degrade dNTPs after the reaction is completed. • Sequence is read from the amount of light emitted as each dNTP is added.
Cycling sequencing
• Demand for low-cost sequencing (personal genomics and personalized medicine) has driven development of high-throughput sequencing. • Produces thousands or millions of sequences at once (parallelizing sequencing process). • Faster, cheaper. • The nucleotide added in each synthesis cycle is detected from each feature, building up millions of sequences in parallel.
Phred
• Developed by Phil Green (Wash Univ, then Univ. of Washington) for assembly of cosmids in large-scale cosmid shotgun sequencing in HGP. • Phred - basecalling software, determines a sequence of base calls from the processed DNA sequence tracing. - The phred software reads DNA sequencing trace files, calls bases, and assigns a quality value to each called base. - Phred can use the quality values to perform sequence trimming. - Quality values indicate the order of magnitude of sequence quality. • Phred works well with trace files from the following manufacturers' sequencing machines: Amersham Biosciences, Applied Biosystems, Beckman Instruments, and LI-COR Life Sciences. • Phred runs on most computers and operating systems including Apple Mac OS X, *BSD, Hewlett-Packard HP-UX, HP-Compaq Tru64, IBM AIX, Linux, Microsoft Windows, Silicon Graphics IRIX, and SUN Solaris.
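The quality value assigned to each base is conventionally defined on a logarithmic scale, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. That definition is not stated in the card above and is added here as background; the function names are illustrative.

```python
import math


def phred_quality(error_probability):
    """Quality value Q for a given probability that the base call is wrong."""
    return -10.0 * math.log10(error_probability)


def error_probability(quality):
    """Inverse mapping: probability of a wrong call for quality value Q."""
    return 10.0 ** (-quality / 10.0)


# Each step of 10 in Q is one order of magnitude in accuracy:
# Q20 is about 1 error in 100, Q40 about 1 error in 10,000.
for q in (10, 20, 30, 40):
    print(q, error_probability(q))
```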
Helicos sequencing
• First to sequence individual molecules instead of molecule ensembles created by an amplification process. - Identifies the exact sequence of a single piece of DNA. • Avoids PCR-induced bias and errors, simplifies data analysis and tolerates degraded samples. • Read lengths are about 32 nucleotides long. • Raw read error is 0.5%, but doing sequencing in parallel can deliver a finished read accuracy of 0.99%. • Why single strand sequencing? - Not affected by biases or errors introduced in a library preparation or amplification step. • Minimal amount of input DNA. • Allows identification of DNA modifications, commonly lost in the in vitro amplification process. • By the end of 2009, only four HeliScope machines had been installed worldwide.
Read (Single-end, paired-end, length) and coverage
• Fragment - small piece of DNA subject to an individual partial sequence determination or read. • Single-end read - sequence reported from only one end of a fragment. • Paired-end read - sequence is reported for both ends of a fragment (with a number of undetermined bases between the reads that is known only approximately). • Read length - number of bases from a single experiment on a single fragment. • Assembly - inference of the complete sequence from the data on individual fragments from a region (piecing overlaps). • Contig - partial assembly of data from overlapping fragments into a contiguous region of sequence.
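Coverage, named in the card title, is usually estimated as the total number of sequenced bases divided by the genome size. The sketch below uses that average-depth formula, which is a standard approximation rather than something defined above; the function names are hypothetical, and actual coverage varies from region to region.

```python
import math


def coverage(num_reads, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)


print(coverage(1_000_000, 100, 5_000_000))  # 20.0, i.e. 20x average coverage
print(reads_needed(30, 150, 3_000_000))     # 600000 reads for 30x of a 3 Mb genome
```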
Microarray (probe/target), Affymetrix GeneChips
• Global expression analysis: microarrays • RNA levels of every gene in the genome analyzed in parallel
Microarrays
• Hybridization based, high-throughput, with many hybridization experiments in parallel. • RNA-seq • SAGE (Serial Analysis of Gene Expression) • Sequence fragments of cDNAs • MPSS (Massively Parallel Signature Sequencing) • Combines hybridization and sequencing • Real-time PCR
EST sequencing
• Idea: sequence only "important" genes - Those genes expressed in a particular tissue • Sequence random cDNAs made from RNA extracted from tissue of interest • cDNAs in microtiter plates • Venter while at NIH • Make cDNA library • Select clones at random • Sequence in from one or both ends - One-pass sequencing • The resulting sequence = expressed sequence tag (EST) • Enough information to perform a homology search. • Use as probes to find genomic DNA. • Advantages - Relatively inexpensive - Certainty that sequence comes from transcribed gene - Information about tissue and developmental stage • Disadvantages - No regulatory information - Usually less than 60% of genes found in EST collections - Location of sequence in genome unknown until used as probe of genomic library.
Global expression analysis: Northern blot
• Limited by number of lanes in gel • No more than three probes can be used simultaneously.
Linkage Disequilibrium
• Linkage is common in the human population, particularly in genetically isolated sub-populations. • A group of alleles for neighboring genes on a segment of a chromosome are very often inherited together. • Such a combination of linked alleles is known as a haplotype. • When linked alleles are shared by members of a population, it is called linkage disequilibrium. • Linkage disequilibrium: the non-random association of alleles at linked loci. • A measure of the tendency of some alleles to be inherited together on haplotypes descended from ancestral chromosomes. • If these were the only two haplotypes in the population, then alleles G and A (C and T) are in perfect linkage disequilibrium. • If we genotype the first SNP, we know what the alleles are at the second SNP.
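The non-random association of alleles can be quantified with the standard coefficients D and r². These measures are not defined in the card above and are brought in here as common background; the function name and the example frequencies are illustrative.

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D, r_squared) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele a at SNP 1 and allele b at SNP 2
    p_a, p_b: frequencies of allele a and allele b on their own
    """
    d = p_ab - p_a * p_b                              # deviation from independence
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r_squared


# Only two haplotypes in the population, G-A (60%) and C-T (40%):
d, r2 = linkage_disequilibrium(p_ab=0.6, p_a=0.6, p_b=0.6)
# r2 is 1 (up to rounding), so genotyping the first SNP predicts the second.

# Alleles assorting independently give D = 0 and r2 = 0.
d0, r2_0 = linkage_disequilibrium(p_ab=0.25, p_a=0.5, p_b=0.5)
```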
Hierarchical sequencing
• Major problem in large-scale sequencing: - Current technologies can only sequence 600-800 bases at a time • One solution: make a physical map of overlapping DNA fragments - Determine sequence of each fragment - Then assemble to form contiguous sequence (contig) - Method used by NIH group sequencing human genome. • Cut DNA into fragments of about 150kb. • Clone them into BACs. • Identify a series of clones in the library that contains overlapping fragments (make a contig map): - Overlap of restriction fragment size patterns - Amplification of single-copy DNA between interspersed repeat elements and checking for similar size patterns of fragments - Map sequence-tagged sites (STSs) and looking for fragments sharing STSs.
Antigenic shift
• Mechanism of mutation caused by the recombination of the viral genome when a cell becomes simultaneously infected by two different strains of type A influenza. • Avian = A/H5N1 • Swine = A/H1N1 • Human = A/H1N1, H1N2, and H3N2
Ion torrent sequencing
• Rather than using an optical sensor to detect light emitted from fluorescent dyes, Ion Torrent uses a semiconductor sensor to detect pH changes during DNA synthesis. • An ion-sensitive field-effect transistor (ISFET) detects the pH change in the microwell, and an electrical signal from the transistor is directly interpreted by the base-calling software. • It is the fastest of the commercially available sequencers, completing up to 200 bases of sequence per microwell in a 2 hour run.
Verification
• Region verified for the following: - Coverage (how many times the same region was sequenced) - Sequence quality (ambiguity removed for all positions in the sequence). - Contiguity (one uninterrupted sequence) • Determine restriction-enzyme cleavage sites - Generate restriction map of sequenced region - Must agree with fingerprint generated of clone during mapping step
Exome Sequencing
• Selectively sequence the coding region of the gene rather than the whole genomic sequence. • Combines target enrichment (using PCR) with exon specific sequencing.
SOLiD sequencing
• Sequencing by Oligonucleotide Ligation and Detection • (Life Technologies - 2008, 3rd of the next-generation technologies). • Also called: 2 base encoding • Based on ligation sequencing, rather than sequencing by synthesis. • Applies a principle similar to pyrosequencing in that fragmented DNA is amplified on an agarose bead (emulsion PCR), but is based on ligation sequencing. • A library of DNA fragments is prepared from the sample to be sequenced, and is used to prepare clonal bead populations (emulsion PCR to form a polony). • Emulsion PCR takes place in microreactors containing all the necessary reagents for PCR. • The resulting PCR products attached to the beads are then covalently bound to a glass slide. • Difference from other next-generation sequencing reactions: • Sequence extension carried out by ligases, NOT by polymerases. • "Sequencing by ligation" • The sequencing step is to expose the samples to fluorescently labeled probes eight nucleotides long. • The first two positions include all possible dinucleotides. • The remaining six positions are 'wild cards'. • The 5' end of the probe bears one of the four fluorescent tags. • Note the degenerate nature of these interrogation probes.
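Since sixteen possible dinucleotides share only four fluorescent tags, each color stands for four different base pairs, and a read is a string of colors rather than bases. The sketch below uses the commonly described scheme in which the color equals the XOR of 2-bit base codes (A=0, C=1, G=2, T=3); that numbering is an assumption brought in for illustration and is not given in the card above.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_colors(dna):
    """One color (0-3) per adjacent pair of bases."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(dna, dna[1:])]


def decode_colors(first_base, colors):
    """Recover the bases from a known first base plus the color calls."""
    seq = [first_base]
    for color in colors:
        seq.append(CODE_BASE[BASE_CODE[seq[-1]] ^ color])
    return "".join(seq)


colors = encode_colors("ATGGC")
print(colors)                      # [3, 1, 0, 3]
print(decode_colors("A", colors))  # ATGGC
```

Because every base contributes to two consecutive colors, a true single-base variant changes two adjacent colors, whereas an isolated color change is more likely a measurement error.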
Genome Sequencing
• Sequencing of the whole genomic region of a gene, including 5'UTR, exons, introns, and 3'UTR. • Jan 2012 - Life Technologies introduced sequencer to decode a human genome in one day for $1000.00
Multicolor FISH (M-FISH)/Spectral Karyotyping™ (SKY™)
• Simultaneously visualize all the pairs of chromosomes in different colors using chromosome specific fluorescently labeled probes. • Used to identify structural aberrations. • FISH - fluorescence in situ hybridization
Microarray analysis (supervised versus unsupervised approach)
• Supervised approaches: analysis to determine genes that fit a predetermined pattern. • Unsupervised approach: analysis to characterize the components of a data set without the a priori input or knowledge of a training signal.
Chromosomal Microarray (CMA)
• Technique to detect genomic copy number variations • 5-10 kb of DNA sequences can be detected at a higher resolution level than traditional karyotypes or comparative genomic hybridization. • Molecular markers along the chromosome that are in areas of known clinical significance • Can detect deletions or duplications of 100 to 500 kb in the genome. • To determine the pathogenicity of a finding, the size of the deletion/duplication is determined, along with the particular genes in the area and whether the deletion/duplication is inherited from one of the parents.
Illumina (Solexa) sequencing
• Unlike pyrosequencing, the Illumina method uses dye-terminator bases and engineered polymerases. • Unlike pyrosequencing, the DNA chains are extended one nucleotide at a time and image acquisition can be performed at a delayed moment, allowing for very large arrays of DNA colonies to be captured by sequential images taken from a single camera. • DNA molecules attached to primers on a slide in a flow cell. • Amplified so that local clonal colonies are formed (bridge or cluster PCR -> 1000 amplicons). • One strand is removed to prevent it from hindering the extension reaction sterically or by complementary base pairing (cleaved at known sequence in adaptor anchor). • Four types of reversible terminator bases (fluorescent labeled) are added. • DNA extended one nucleotide at a time. • Camera records fluorescently labeled nucleotides, then terminal 3' blocker removed and process is repeated. • 1 billion base pairs per day per machine.
Subtelomeric FISH Screen
• Useful in detecting subtle or submicroscopic deletions at the terminal chromosome regions in patients with unexplained mental retardation and developmental delay. • May also detect duplications and balanced rearrangements involving these terminal chromosome regions.
Contig
• Contigs are groups of overlapping pieces of chromosomal DNA - Make contiguous clones • For sequencing one wants to create a "minimum tiling path" (smallest number of inserts that covers a region of the chromosome).
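Choosing a minimum tiling path can be sketched as a greedy interval-cover problem: at each step, take the clone that starts within the region covered so far and reaches furthest. The example assumes each clone's map position is already known as (name, start, end); the clone names, coordinates, and function name are hypothetical, and a real tiling path would also require enough overlap between neighbors to confirm their order.

```python
def minimum_tiling_path(clones, region_start, region_end):
    """Greedy selection of the fewest clones covering [region_start, region_end)."""
    path = []
    covered = region_start
    while covered < region_end:
        # Clones that begin at or before the covered point and extend past it.
        candidates = [c for c in clones if c[1] <= covered < c[2]]
        if not candidates:
            return None  # gap: no clone extends the covered region
        best = max(candidates, key=lambda c: c[2])
        path.append(best[0])
        covered = best[2]
    return path


clones = [("BAC1", 0, 150), ("BAC2", 100, 250), ("BAC3", 120, 300),
          ("BAC4", 280, 450), ("BAC5", 50, 200)]
print(minimum_tiling_path(clones, 0, 450))  # ['BAC1', 'BAC3', 'BAC4']
```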
Sanger sequencing
• Sequencing method which relies on synthesis of new DNA fragments using DNA polymerase. • Short oligonucleotide primer is hybridized to a ssDNA template. • 2 phases of original design: - Sequencing primer is extended using all four dNTPs, - Then partially extended using three dNTPs with one dropped out, causing synthesis to stop at the position of the dropped-out nucleotide.
Synteny
• Genes that are in the same relative position on two different chromosomes • Genetic and physical maps compared between species - Or between chromosomes of the same species • Closely related species generally have similar order of genes on chromosomes • Can be used to identify genes in one species based on map position in another • Genes found in similar places on chromosomes are indicated