GENOMICS BLOCK
reverse genetics
discovering gene function from a genetic sequence
1. Start with the genome sequence.
2. Identify in silico predicted genes (see later).
3. Go back to the organism and engineer a mutation or RNAi knockdown.
Reverse genetics is an approach to discovering the function of a gene by analyzing the phenotypic effects of specific gene sequences obtained by DNA sequencing. This investigative process proceeds in the opposite direction of so-called forward genetic screens of classical genetics. Simply put, while forward genetics seeks to find the genetic basis of a phenotype or trait, reverse genetics seeks to find what phenotypes arise as a result of particular genes. Automated DNA sequencing generates large volumes of genomic sequence data relatively rapidly. Many genetic sequences are discovered in advance of other, less easily obtained, biological information. Reverse genetics attempts to connect a given genetic sequence with specific effects on the organism.
"Regular" genetics begins, as we have seen, with a mutant phenotype, proceeds to show the existence of the relevant gene by progeny-ratio analysis, and finally clones and sequences the gene to determine its DNA and protein sequence. Reverse genetics, a newer approach made possible by recombinant DNA technology, works in the opposite direction: it starts from a protein or DNA for which there is no genetic information and then works backward to make a mutant gene, ending up with a mutant phenotype. Often the starting point of reverse genetics is a protein or a gene "in search of a function," or, looked at another way, a protein or a gene "in search of a phenotype." For example, a cloned wild-type gene detected by sequencing as an open reading frame (ORF) of unknown function can be subjected to some form of in vitro mutagenesis and reinserted into the organism to determine the phenotypic outcome. For a protein of unknown function, the amino acid sequence can be translated backward into a DNA sequence, which can then be synthesized in vitro. This sequence is then used as a probe to find the relevant gene, which is cloned and itself subjected to in vitro mutagenesis to determine the phenotypic effect. This approach will become even more significant as data accumulate in whole-genome-sequencing projects. Sequencing reveals numerous unknown ORFs, many of which have no resemblance to any known gene, and the function of these genes can be determined by reverse genetics. This type of analysis is underway in Saccharomyces cerevisiae; the complete sequencing of the genome of this organism revealed many unassigned ORFs, which are now being systematically mutated into null alleles to detect a phenotype that might provide some clue about function.
Important tools for reverse genetics are in vitro mutagenesis and gene disruption, also known as gene knockout. One approach is to insert a selectable marker into the middle of a cloned gene and then to use this construct to transform a wild-type recipient. Selecting for the marker yields some transformants in which the disrupted (mutated) gene has replaced the in situ wild-type allele.
Key methods: RNA interference (RNAi), targeted knockouts/knock-ins, and CRISPR genome editing. These methods have played an immense role in studying genomes - genomics.
What is a 'gene'?
functional segment of DNA occupying a fixed location on a chromosome; the 'controller' of a trait (e.g. the 'gene for blue eyes')
recombinational genetic mapping
Genes can be mapped by recombinational genetic mapping. If two genes are on different chromosomes they segregate independently during meiosis, which means they are not linked; if two genes are on the same chromosome they show linkage. Genetic mapping depends on the rate of recombination: genes that are very close to each other have a very low recombination rate and tend to be inherited together, whereas genes that are far apart have a high recombination rate and tend not to be inherited together. Recombination is essential for evolution and occurs more or less at random along the chromosome. Recombinational genetic mapping does not require the molecular sequence, but it does require a visible phenotype. 1 map unit (m.u.) = 1% recombination per meiosis. In the figure, because the distance between the fog-1 and glp-4 genes is large, the rate of recombination between them is high and the two alleles tend not to be inherited together, because crossing over during meiosis separates them onto different homologous chromosomes. If the recombination rate is 50% or higher, the two genes belong to different linkage groups - either different chromosomes, or the same chromosome but far apart.
Recombinational genetic mapping: m.u. = map unit; 1 m.u. is 1% recombination per meiosis. The frequency of recombination between mutants led to the generation of genetic linkage maps: if genes are linked, they tend to be inherited together. The recombination frequency between genes can be used to determine the order of genes along a chromosome and to estimate the relative distance between them. Crossing over happens at essentially random positions along the chromosome, so two genes that lie farther apart are more likely to undergo crossing over during meiosis and are less likely to be inherited together (high recombination rate). A worked example follows below.
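A minimal sketch (Python, with made-up progeny counts) of how a two-point cross is turned into a map distance - recombination frequency as a percentage equals map units, and frequencies near 50% cannot be distinguished from independent assortment:

```python
# Minimal sketch: estimating map distance from a two-point cross.
# The progeny counts below are hypothetical; 1 map unit (m.u.) = 1% recombinants per meiosis.

def map_distance(parental_count, recombinant_count):
    """Return the recombination frequency (%) between two markers."""
    total = parental_count + recombinant_count
    return 100.0 * recombinant_count / total

# e.g. 920 parental-type progeny and 80 recombinant-type progeny
rf = map_distance(parental_count=920, recombinant_count=80)
print(f"Recombination frequency = {rf:.1f}% -> {rf:.1f} m.u.")

# Frequencies at or above ~50% cannot distinguish unlinked loci from
# loci that are far apart on the same chromosome.
assert rf < 50, "markers appear unlinked (different linkage groups or far apart)"
```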
How did genes and gene families evolve and what is meant by protein domains?
if we run BLAST for the entire protein sequence in one go, we may get a relatively low overall similarity between the sequence and the matches
How to sequence a large genomic region
start by making a genomic or regional library, then shotgun sequence each clone • vectors used in the Human Genome Project: BACs/YACs/cosmids
forward genetics
1. Identify a mutant phenotype after random mutagenesis - gamma radiation or chemical mutagens that introduce random mutations throughout the genome.
2. Linkage-map the mutant gene via recombinational genetic mapping.
3. Rescue the mutant with a wild-type cosmid (or other clone).
4. Identify the sequence within the cosmid that is responsible for the rescue.
Many C. elegans genes were identified this way (~a thousand).
Forward genetics is the molecular genetics approach of determining the genetic basis responsible for a phenotype. This was initially done by using naturally occurring mutations or inducing mutants with radiation, chemicals, or insertional mutagenesis (e.g. transposable elements). Subsequent breeding takes place, mutant individuals are isolated, and then the gene is mapped. Forward genetics can be thought of as a counter to reverse genetics, which determines the function of a gene by analyzing the phenotypic effects of altered DNA sequences.[1] Mutant phenotypes are often observed long before there is any idea which gene is responsible, which can lead to genes being named after their mutant phenotype (e.g. the Drosophila rosy gene, which is named after the eye colour in mutants). You start with the genetics (a phenotype) and move towards the molecular sequence.
Illumina Next Generation Sequencing
1. Several million clusters per channel. 2. Each cluster can now generate >75-150 bp of sequence. 3. These short sequence reads are aligned in silico. 4. They are usually compiled onto an existing sequence "scaffold" (see the toy alignment sketch below). What can this be used for? Re-sequencing genomes - individual humans, cancers, etc. Deep sequencing of cDNA - test the actual sequence of all cDNAs and determine splice variations. Abundant mRNAs will be sequenced more often than less abundant ones. Make sure you understand this last point - next-gen sequencing is used in place of microarrays to measure the relative abundance of mRNAs.
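A toy sketch of steps 3-4, aligning short reads in silico onto an existing scaffold. The sequences are invented and only exact matches are reported; real aligners (e.g. BWA, Bowtie) use indexed, mismatch-tolerant algorithms:

```python
# Minimal sketch of "aligning short reads in silico" onto an existing scaffold.
# Real aligners (BWA, Bowtie) use indexed, mismatch-tolerant algorithms;
# this toy version only reports exact matches. Sequences are made up.

scaffold = "ACGTTGCATGCGATCGATCGTTACGGATCCAGT"
reads = ["GCATGCGATC", "CGTTACGGAT", "TTTTTTTTTT"]  # last read does not map

def align_exact(read, reference):
    """Return every 0-based position where the read matches the reference exactly."""
    hits, start = [], reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits

for read in reads:
    print(read, "->", align_exact(read, scaffold) or "unmapped")
```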
mapping in C. elegans
C. elegans is particularly amenable to transgenesis, which provides a relatively straightforward method of correlating a particular mutant phenotype with a region of nucleotide sequence. By means of cosmid/plasmid transformation rescue experiments (Fire 1986), the phenotype caused by a genetic mutation may be complemented, or "rescued," by the addition of an extrachromosomal copy of the wild-type gene. Initially, this was done by direct injection into hermaphrodites heterozygous for the allele to be tested. In a region of ∼200 kb immediately to the left of unc-22(IV), the direct injection of six cosmids was used to rescue six genes: let-56, let-653, let-92, par-5, dpy-20, and let-60. Once injected into the worm, it is thought that DNA molecules undergo repeated homologous recombination, forming large extrachromosomal arrays containing multiple copies of the injected molecules. These arrays are thought to be molecularly stable once formed, and a percentage become heritable and are transmitted from generation to generation in a non-Mendelian fashion. Heritable extrachromosomal arrays can be considered precisely characterized free duplications, and can be moved between strains in the same manner. Hence, a more flexible approach of introducing extrachromosomal DNA into C. elegans entails crossing a genetically marked array, previously constructed in another strain, into a mutant strain of interest. Although both procedures allow regions of DNA sequence to be correlated with mutant phenotypes, the latter method can be used to test the cloned DNA contained in an array against mutations in different strains, without having to repeat the injection procedure. Previous systematic studies in our laboratory and in others have indicated that an average of one essential gene per cosmid clone can be rescued. This rate of rescue does not necessarily reflect our ability to rescue essential genes, but rather reflects our current ability to saturate a particular region of sequence for essential gene mutations. This approach has the potential to produce high-resolution map alignment, and results in mutant phenotypes being identified for large numbers of predicted proteins. More than 1500 genetic loci have been identified in C. elegans since the scientific community first began studying the nematode. Although the consortium has now sequenced the majority of these genes, the actual physical position for most of these genes has yet to be identified. Whereas the relative order of genes along a chromosome is expected to be the same on both the physical and genetic maps, the ratio of physical distance to meiotic recombination distance varies depending on the position of the gene along a chromosome. For example, the central gene cluster regions of autosomal chromosomes are reduced recombinationally relative to regions in the chromosome arms. Thus, it is often difficult to predict with accuracy the physical position of a gene solely on the basis of its genetic map position. As the number of genes with both genetic map positions and positive physical clones increases, the resolution of the map alignment is improved. In this paper we describe a large-scale construction of transgenic strains and their usefulness for comprehensive rescue of mutant phenotypes.
mapping snps..........2
Clicking on the red letters of B0403:33022, we bring up an additional window that shows the actual sequences surrounding the SNP in black lettering (usually ~500 bp upstream and downstream) as well as the SNP itself in red lettering [C/T]. This designation indicates that N2 contains a C at this position whereas CB4856 contains a T. Also, if it is an RFLP-type SNP, the top of this page will show predicted digestion sites for the displayed DNA sequence from N2 and CB4856 (listed here as "HA" for Hawaiian), using one or more enzymes. Looking at this, we notice that in the CB4856 background the presence of the T results in the sequence AGATCT, which is the recognition site for the restriction enzyme BglII. This enzyme cuts once in this segment of the CB4856 sequence and not at all in N2. Thus if we were to amplify this region from N2 and CB4856 worms using PCR and cut the PCR product with BglII, CB4856 would produce a doublet of about 500 bp each, whereas N2 would run as a single band of 1,000 bp. The other enzymes listed as distinguishing this polymorphism (e.g., MnlI and MboI), although technically correct, are not of much practical use, as they cut many times in both N2 and CB4856 sequences. Therefore, discerning these two largely identical digestion patterns (using a standard agarose gel) would be difficult or impossible. Moving down to the unconfirmed SNP just below B0403, we find C36B7:21571. The presence of a C in N2, and an A in CB4856, leads to the creation of a new site for the enzyme ApoI (consensus RAATTY, where R is an A or G and Y is a C or T; for a complete listing of abbreviations, see the back of the NEB catalog). Here we see that ApoI cuts five times in strain CB4856 (59, 405, 500, 638, 648). Directly above this, we see that the N2 digest is listed as "none". Beware: this does not mean that ApoI cuts five times in CB4856 and not at all in N2! In fact, N2 cuts four times with ApoI (59, 405, 638, 648), just not at the middle position where the actual SNP is located (500). This is obviously misleading. By "none", they just mean that the polymorphism results in no new enzyme sites that specifically cut the N2 sequence. Another thing to be aware of is that for non-palindromic sites, it may be the bottom (non-scripted) strand of DNA that is relevant. Because many of the listed SNPs are not experimentally confirmed, the question arises: how many SNPs are actually real, and is it possible to intuitively distinguish the real ones from the false ones? (The false ones are simply due to errors in the single sequencing reads of CB4856.) For all non-confirmed SNPs, a probability index (Psnp) is given at the top of the page that contains the sequence information. For C36B7:21571, the Psnp is 0.9427, meaning that there is supposedly a 94% chance that the SNP is real based on the quality of the read. For a non-confirmed SNP, this is as good as it gets. Also, if you've ever stared at the electropherogram from a sequencing read, you'd know that it would be difficult under most circumstances (even for a computer) to mistake a C for an A. In contrast, it is our experience that SNPs with Psnp indices below 0.5 are invariably bogus. In addition, SNPs that result in single base-pair deletions or insertions within a run of repetitive nucleotides (e.g., A7 versus A8) are often suspect.
Although some of these may turn out to be real, common sense dictates that sequencing errors are more likely to occur when attempting to distinguish between these sorts of differences than when comparing sequences such as ATG and ACG. Thus, you will want to use some discretion in your true/false predictions beyond the Psnp index. Of course, you will always want to substantiate or disprove any unconfirmed SNP before attempting any significant mapping exercises, no matter what the probability index or your intuition tells you. As described below, most fine mapping will require that the investigator identify new SNPs that are not currently in the database. This is simply done by amplifying random intergenic sequences in the region of interest from CB4856. We usually amplify an ~1,600-bp region and use two internal sequencing primers that point inward. The sequences obtained can then be used in BLAST searches against the N2 sequence to identify polymorphisms. Potential differences are always further confirmed by looking at the electropherogram readouts. More often than not, one will find at least a single difference within a region of this size.
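A small sketch of the snip-SNP logic described above: given the amplicon sequence from each strain, predict whether a restriction site is present and what fragment sizes a digest would give. The BglII recognition site (AGATCT) is real; the toy amplicon sequences and SNP position are invented:

```python
# Sketch: predicting whether a SNP creates a restriction-fragment length
# polymorphism (snip SNP). Only the BglII site (AGATCT) is real; the toy
# "amplicon" sequences and the SNP position are invented for illustration.

BGLII_SITE = "AGATCT"

def cut_positions(seq, site):
    """0-based positions where the recognition site occurs."""
    return [i for i in range(len(seq) - len(site) + 1) if seq[i:i + len(site)] == site]

def fragment_sizes(seq, site):
    """Approximate fragment sizes after a complete digest (cutting at the site start)."""
    cuts = cut_positions(seq, site)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]

# Toy 40-bp "amplicons": the C->T change in CB4856 creates AGATCT.
n2     = "GGGGGGGGGGGGGGGGGGAGATCCGGGGGGGGGGGGGGGG"
cb4856 = "GGGGGGGGGGGGGGGGGGAGATCTGGGGGGGGGGGGGGGG"

print("N2 fragments:    ", fragment_sizes(n2, BGLII_SITE))      # uncut -> one band
print("CB4856 fragments:", fragment_sizes(cb4856, BGLII_SITE))  # cut -> two bands
```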
visualisation of the genome at a finer scale - FISH............2
Fluorescence in situ hybridization (FISH) uses DNA fragments incorporated with fluorophore-coupled nucleotides as probes to examine the presence or absence of complementary sequences in fixed cells or tissues under a fluorescent microscope. This hybridization-based macromolecule recognition tool was very effective in mapping genes and polymorphic loci onto metaphase chromosomes for constructing a physical map of the human genome. FISH technology offers three major advantages: high sensitivity and specificity in recognizing targeted DNA or RNA sequences, direct application to both metaphase chromosomes and interphase nuclei, and visualization of hybridization signals at the single-cell level. These advantages increased the analytic resolution from Giemsa bands to the gene level and enabled rapid detection of numerical and structural chromosomal abnormalities. The clinical application of FISH technology has upgraded classical cytogenetics to molecular cytogenetics. With the improvement in probe labeling efficiency and the introduction of super-resolution imaging systems, FISH has been renovated for research analysis of nuclear structures and gene functions. This review presents the recent progress in FISH technology and summarizes its diagnostic and research applications. (Figure legend) The principles of fluorescence in situ hybridization: (A) The basic elements are a DNA probe and a target sequence. (B) Before hybridization, the DNA probe is labelled indirectly with a hapten (left panel) or directly labelled via the incorporation of a fluorophore (right panel). (C) The labelled probe and the target DNA are denatured to yield single-stranded DNA. (D) They are then combined, which allows the annealing of complementary DNA sequences. (E) If the probe has been labelled indirectly, an extra step is required for visualization of the non-fluorescent hapten that uses an enzymatic or immunological detection system. Finally, the signals are evaluated by fluorescence microscopy. In biology, a probe is a single strand of DNA or RNA that is complementary to a nucleotide sequence of interest. RNA probes can be designed for any gene or any sequence within a gene for visualization of mRNA, lncRNA and miRNA in tissues and cells. FISH is used to examine the cell cycle, specifically interphase nuclei, for chromosomal abnormalities. FISH makes the analysis of large series of archival cases much easier: the chromosome of interest is pinpointed by creating a probe with an artificial chromosomal foundation that will attract similar chromosomes. A hybridization signal is seen for each probe when a nucleic acid abnormality is detected. Each probe for the detection of mRNA and lncRNA is composed of 20 oligonucleotide pairs, each pair covering a space of 40-50 bp. For miRNA detection, the probes use proprietary chemistry for specific detection of miRNA and cover the entire miRNA sequence. (Figure: urothelial cells marked with four different probes.) Probes are often derived from fragments of DNA that were isolated, purified, and amplified for use in the Human Genome Project. The size of the human genome is so large, compared to the length that could be sequenced directly, that it was necessary to divide the genome into fragments.
(In the eventual analysis, these fragments were put into order by digesting a copy of each fragment into still smaller fragments using sequence-specific endonucleases, measuring the size of each small fragment using size-exclusion chromatography, and using that information to determine where the large fragments overlapped one another.) To preserve the fragments with their individual DNA sequences, the fragments were added into a system of continually replicating bacterial populations. Clonal populations of bacteria, each population maintaining a single artificial chromosome, are stored in various laboratories around the world. The artificial chromosomes (BACs) can be grown, extracted, and labeled in any lab containing a library. Genomic libraries are often named after the institution in which they were developed; an example is the RPCI-11 library, which is named after the Roswell Park Cancer Institute in Buffalo, NY. These fragments are on the order of 100 thousand base pairs, and are the basis for most FISH probes.
• Still visualising the genome - but considerably finer-scale localisation • late 1980s onwards - fluorescence in situ hybridisation • hybridise a cloned section of DNA to a metaphase spread - probe labelled with fluorescent dye
Whole genome shotgun sequencing
Method of sequencing a genome in which sequenced fragments are assembled into the correct sequence in contigs by using only the overlaps in sequence. This strategy breaks the genome into fragments that are small enough to be sequenced, then reassembles them simply by looking for overlaps in the sequence of each fragment. It avoids the laborious process of making a physical map. However, it requires many more sequencing reactions than the clone-by-clone method, because, in the shotgun approach, there is no way to avoid sequencing redundant fragments. There is also a question of the feasibility of assembling complete chromosomes based simply on the sequence overlaps of many small fragments. This is particularly a problem when the size of the fragments is smaller than the length of a repetitive region of DNA. Nevertheless, this method has now been successfully demonstrated in the nearly complete sequencing of many large genomes (rice, human, and many others). It is the current standard methodology. However, shotgun assemblies are rarely able to complete entire genomes. The human genome, for example, relied on a combination of shotgun sequence and physical mapping to produce contiguous sequence for the length of each arm of each chromosome. Note that because of the highly repetitive nature of centromeric and telomeric DNA, sequencing projects rarely include these heterochromatic, gene poor regions. In genetics, shotgun sequencing is a method used for sequencing random DNA strands. It is named by analogy with the rapidly expanding, quasi-random firing pattern of a shotgun. The chain termination method of DNA sequencing ("Sanger sequencing") can only be used for short DNA strands of 100 to 1000 base pairs. Due to this size limit, longer sequences are subdivided into smaller fragments that can be sequenced separately, and these sequences are assembled to give the overall sequence. There are two principal methods for this fragmentation and sequencing process. Primer walking (or "chromosome walking") progresses through the entire strand piece by piece, whereas shotgun sequencing is a faster but more complex process that uses random fragments. In shotgun sequencing,[1][2] DNA is broken up randomly into numerous small segments, which are sequenced using the chain termination method to obtain reads. Multiple overlapping reads for the target DNA are obtained by performing several rounds of this fragmentation and sequencing. Computer programs then use the overlapping ends of different reads to assemble them into a continuous sequence.[1] Shotgun sequencing was one of the precursor technologies that was responsible for enabling full genome sequencing. To apply the strategy, a high-molecular-weight DNA strand is sheared into random fragments, size-selected (usually 2, 10, 50, and 150 kb), and cloned into an appropriate vector. The clones are then sequenced from both ends using the chain termination method yielding two short sequences. Each sequence is called an end-read or read 1 and read 2 and two reads from the same clone are referred to as mate pairs. Since the chain termination method usually can only produce reads between 500 and 1000 bases long, in all but the smallest clones, mate pairs will rarely overlap. 
Although shotgun sequencing can in theory be applied to a genome of any size, its direct application to the sequencing of large genomes (for instance, the human genome) was limited until the late 1990s, when technological advances made practical the handling of the vast quantities of complex data involved in the process. Historically, full-genome shotgun sequencing was believed to be limited by both the sheer size of large genomes and by the complexity added by the high percentage of repetitive DNA (greater than 50% for the human genome) present in large genomes. It was not widely accepted that a full-genome shotgun sequence of a large genome would provide reliable data. For these reasons, other strategies that lowered the computational load of sequence assembly had to be utilized before shotgun sequencing was performed. In hierarchical sequencing, also known as top-down sequencing, a low-resolution physical map of the genome is made prior to actual sequencing. From this map, a minimal number of fragments that cover the entire chromosome are selected for sequencing. In this way, the minimum amount of high-throughput sequencing and assembly is required. The amplified genome is first sheared into larger pieces (50-200 kb) and cloned into a bacterial host using BACs or P1-derived artificial chromosomes (PACs). Because multiple genome copies have been sheared at random, the fragments contained in these clones have different ends, and with enough coverage, finding a scaffold of BAC contigs that covers the entire genome is theoretically possible. This scaffold is called a tiling path.
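A toy illustration of the core shotgun idea - assembling reads into a contig using only sequence overlaps. The reads are invented and error-free; real assemblers use overlap or de Bruijn graphs plus coverage and quality information:

```python
# Toy illustration of shotgun assembly: merge reads purely by sequence overlap.
# Real assemblers use overlap or de Bruijn graphs plus quality/coverage data;
# the reads below are invented and error-free.

def overlap(a, b, min_len=4):
    """Length of the longest suffix of a that is a prefix of b (>= min_len, else 0)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=4):
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for a in reads:
            for b in reads:
                if a is not b:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, a, b)
        k, a, b = best
        if k == 0:            # no overlaps left: disjoint contigs remain
            break
        reads.remove(a); reads.remove(b)
        reads.append(a + b[k:])   # merge the best-overlapping pair
    return reads

fragments = ["ATGCGTAC", "GTACCATG", "CATGTTAG"]   # overlapping 8-mers
print(greedy_assemble(fragments))                  # -> ['ATGCGTACCATGTTAG']
```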
mapping snps
Mapping with SNPs has become a powerful complement (and in some cases outright alternative) to the standard genetic mapping procedures described above. In fact, the advent of SNP mapping has been nearly as significant for C. elegans forward genetics as RNAi has been for reverse genetics. With SNP mapping, basically all mutations are now theoretically clonable, something that wasn't true in the past. Moreover, SNP mapping can be routinely used to narrow down the known physical location of mutations to regions smaller than a single cosmid (~30,000 bp; ~6-7 genes). With genetic mapping, even in the best of circumstances, the implicated regions usually span 6-10 complete cosmids or more. In fact, SNP mapping can theoretically be used to narrow down the implicated region to a single gene, although this level of mapping is usually unnecessary and can become inordinately time consuming. Although several approaches for mapping using polymorphisms have been described, we will focus here on the use of the Hawaiian C. elegans isolate, CB4856. Because of geographical separation, several million years of evolutionary drift have led to a sizeable number of genetic differences (DNA polymorphisms) between the Hawaiian and English (N2) C. elegans populations. In fact, differences in the genomic sequences of CB4856 and N2 occur on average every 1,000 base pairs. The majority of these changes occur in non-coding or intergenic regions and probably have no functional consequence. Some polymorphisms, however, clearly affect protein activity or gene expression, as N2 and CB4856 differ notably in a number of respects including their mating behaviors and relative sensitivities to RNAi (also see below). The term SNP is a bit of a misnomer. Although many of the sequence variations between N2 and CB4856 are indeed single-nucleotide changes (for example from an A to a G), small deletions or insertions are also very common. What is experimentally most relevant, however, is whether or not these polymorphisms affect the recognition site for an endonuclease. SNPs that result in restriction-fragment length polymorphisms (RFLPs; also called snip SNPs) are easier to work with, as digestion by enzymes is much more rapid and inexpensive than sending off samples for sequencing. Also, it generally doesn't matter whether it's the N2 or CB4856 DNA that is cleavable (however, also see below), just so long as the digestion patterns of the two isolates are clearly distinguishable. The C. elegans SNP database contains only a partial listing of the many polymorphisms that undoubtedly exist between N2 and CB4856. Nevertheless, this resource provides a very useful (if incomplete) inventory of SNPs for the strains N2 and CB4856. The SNP database is organized according to the physical map by chromosomes, chromosomal subsegments, and cosmids. For example, at the top of sequence Segment 9 on Chromosome X, you will find the SNP B0403:33022 S=CT. This means the polymorphism is on cosmid B0403 at nucleotide position 33,022 and that the two strains differ in having either a C or T at this position. SNPs listed in red lettering have presumably been experimentally confirmed, whereas SNPs listed in white lettering are as yet unconfirmed. In fact, our lab has had at least one bad experience with a "confirmed" SNP, thus it is essential to make sure that any SNP you work with behaves as expected in your own hands.
Microarray slide hybridization using fluorescently labeled cDNA
Microarray hybridization is used to determine the amount and genomic origins of RNA molecules in an experimental sample. Unlabeled probe sequences for each gene or gene region are printed in an array on the surface of a slide, and fluorescently labeled cDNA derived from the RNA target is hybridized to it. This protocol describes a blocking and hybridization protocol for microarray slides. The blocking step is particular to the chemistry of "CodeLink" slides, but it serves to remind us that almost every kind of microarray has a treatment step that occurs after printing but before hybridization. We recommend making sure of the precise treatment necessary for the particular chemistry used in the slides to be hybridized because the attachment chemistries differ significantly. Hybridization is similar to northern or Southern blots, but on a much smaller scale. There are several advantages to cDNA microarrays. The two-color competitive hybridization can reliably measure the difference between two samples because variations in spot size or amount of cDNA probe on the array will not affect the signal ratio. The cDNA microarrays are relatively easy to produce. In fact, the arrayer can be easily built, and microarrays can be manufactured in university research labs. Also, cDNA microarrays are in general much cheaper compared with oligonucleotide arrays and are quite affordable to most research biologists. There are also some disadvantages for this system. One is that the production of the cDNA microarray requires the collection of a large set of sequenced clones. The clones, however, may be misidentified or contaminated. Second, genes with high sequence similarity may hybridize to the same clone and generate cross-hybridization. To avoid this problem, clones with 3′ end untranslated regions, which in general are much more divergent compared with the coding sequences, should be used in producing the microarrays. A microarray is a set of short Expressed Sequence Tags (ESTs) made from a cDNA library of a set of known (or partially known) gene loci. The ESTs are spotted onto a cover-slip-sized glass plate, shown here as an 8x12 array. In practice, microarrays of many thousand ESTs are possible. A complete set of mRNA transcripts (the transcriptome) is prepared from the tissue of an experimental treatment or condition, e.g. fish fed a high-protein diet, or an individual with breast cancer. Complementary DNA (cDNA) reverse transcripts are prepared and labelled with a [red] fluorescent dye. A control library is constructed from an untreated source, e.g. a standard fish diet, or non-cancerous breast tissue; this library is labelled with a different fluorescent [green] dye. The experimental and control libraries are hybridized to the microarray. A dual-channel laser excites the corresponding dye, and the fluorescence intensity indicates the degree of hybridization that has occurred. Relative gene expression is measured as the ratio of the two fluorescence wavelengths. Increased expression or "up-regulation" of genes in the experimental transcriptome relative to the control will be visualized as a "hotter" red "pseudo-colour," and decreased expression or "down-regulation" shows as a "cooler" green. Intensity of color is proportional to the expression differential. Unchanged, constitutive expression (1:1 ratio of experimental to control) shows as a neutral black.
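A minimal sketch of how two-colour microarray data are usually summarised - the log2 ratio of red (experimental) to green (control) intensity per spot. The intensities below are invented:

```python
# Sketch: two-colour microarray analysis. Each spot yields a red (experimental)
# and green (control) intensity; relative expression is commonly reported as a
# log2 ratio. Intensities below are invented.
import math

spots = {
    "geneA": (5200.0, 1300.0),   # (red, green): up-regulated ~4x
    "geneB": (800.0, 800.0),     # unchanged
    "geneC": (300.0, 2400.0),    # down-regulated ~8x
}

for gene, (red, green) in spots.items():
    ratio = math.log2(red / green)
    call = "up" if ratio > 1 else "down" if ratio < -1 else "unchanged"
    print(f"{gene}: log2(R/G) = {ratio:+.2f} ({call})")
```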
The Hawaiian strain of C. elegans
Mutant identification by combined SNP mapping/whole-genome sequencing and CloudMap data analysis. C. elegans strains: N2 (Bristol); CB4856 (Hawaii). All C. elegans genetics is done in the N2 background - the standard strain. Hermaphrodite sex - highly inbred, almost clonal. The CB4856 Hawaiian strain is highly polymorphic with respect to N2, with many bp changes throughout the genome. If you sequence the Hawaiian genome, SNPs distinguish all regions from N2. Whole-genome sequencing (WGS) is becoming a fast and cost-effective method to pinpoint molecular lesions in mutagenized genetic model systems, such as Caenorhabditis elegans. As mutagenized strains contain a significant mutational load, it is often still necessary to map mutations to a chromosomal interval to elucidate which of the WGS-identified sequence variants is the phenotype-causing one. We describe here our experience in setting up and testing a simple strategy that incorporates a rapid SNP-based mapping step into the WGS procedure. In this strategy, a mutant retrieved from a genetic screen is crossed with a polymorphic C. elegans strain, individual F2 progeny from this cross are selected for the mutant phenotype, and the progeny of these F2 animals are pooled and then whole-genome-sequenced. The density of polymorphic SNP markers is decreased in the region of the phenotype-causing sequence variant and therefore enables its identification in the WGS data. As a proof of principle, we use this strategy to identify the molecular lesion in a mutant strain that produces an excess of dopaminergic neurons. We find that the molecular lesion resides in the Pax-6/Eyeless ortholog vab-3. The strategy described here will further reduce the time between mutant isolation and identification of the molecular lesion. Whole-genome sequencing (WGS) is a new and powerful means to identify molecular lesions that result in specific mutant phenotypes. The WGS approach has been used in several studies in multiple model organisms, and our laboratory has successfully employed this strategy in the nematode C. elegans. Sequenced genomes of mutagenized strains contain a substantial number of variants (on average more than 300 variants per chromosome, about 30 of which are in protein-coding sequences), and a significant number of variants remain after outcrossing and/or may even be introduced by outcrossing. Identification of the phenotype-causing mutation among the entire complement of sequence variants can be significantly facilitated by whole-genome sequencing of two distinct alleles of the same locus. In such cases, little if any mapping is required since the phenotype-causing mutant locus will be one of the few, if not the only, loci found to be mutated in both WGS datasets. However, to avoid multiple WGS runs and/or if multiple alleles are not available, it may be desirable to first map the phenotype-causing variant to a chromosomal interval. The most commonly used mapping strategy in C. elegans employs single nucleotide polymorphism (SNP)-based mapping. In this strategy, a mutant strain (usually derived from N2 Bristol) is crossed with the polymorphic Hawaiian C. elegans isolate and the mutant F2 progeny of such a cross are analyzed for their distribution of SNP markers. Genomic regions close to the mutation of interest show a decreased incidence of Hawaiian SNPs while unlinked regions contain an even representation of Hawaiian vs. Bristol SNPs. However, this conventional SNP mapping strategy usually starts with a relatively small, arbitrarily chosen number of SNPs.
Through iterative SNP mapping to finer and finer regions, the gene can be fine-mapped, but this can be a relatively tedious and time-consuming process. We describe here the employment of a strategy - modeled on a similar strategy ("SHOREmap") previously employed in plants - that combines WGS with a very fine-grained SNP mapping strategy in a single step. This combination significantly reduces the time between mutant isolation and identification of its molecular identity.
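A sketch of the mapping-by-sequencing calculation behind this strategy: for pooled mutant F2 DNA, compute the Hawaiian (CB4856) allele frequency at each SNP and average it in windows along the chromosome; the window where the frequency drops towards zero contains the causal mutation. All positions and counts below are invented, and real analyses (e.g. CloudMap) work from aligned whole-genome reads:

```python
# Sketch of the mapping-by-sequencing idea: in pooled mutant F2 DNA, the
# fraction of Hawaiian (CB4856) alleles drops towards 0 near the causal
# mutation and stays near 0.5 at unlinked positions. SNP positions and read
# counts are invented.

# (position_in_Mb, hawaiian_read_count, total_read_count) for confirmed SNPs
snps = [(1, 9, 20), (3, 11, 22), (5, 6, 18), (7, 2, 20),
        (9, 1, 25), (11, 0, 23), (13, 3, 21), (15, 8, 19)]

WINDOW_MB = 4
bins = {}
for pos, hw, total in snps:
    bins.setdefault(pos // WINDOW_MB, []).append(hw / total)

for b in sorted(bins):
    freq = sum(bins[b]) / len(bins[b])
    print(f"{b*WINDOW_MB:>2}-{(b+1)*WINDOW_MB:>2} Mb: Hawaiian allele freq = {freq:.2f}")
# The window with the lowest Hawaiian allele frequency contains the mutation.
```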
visualisation of the genome at a finer scale - FISH
Refinements in cytogenetic techniques over the past 30 years have allowed the increasingly sensitive detection of chromosome abnormalities in haematological malignancies, with the advent of fluorescence in situ hybridization (FISH) techniques providing significant advances in both diagnosis and research of haematological malignancies and solid tumours1. Chromosome banding techniques (Giemsa staining) revolutionized cytogenetic analysis and have been pivotal in the understanding of genetic changes in both constitutional and acquired diseases (in particular, the knowledge of the contribution of specific chromosome abnormalities to leukaemia). However, the resolution of banding analysis is such that it can only detect rearrangements that involve >3 Mb of DNA1. Banding techniques are limited to mitotically active cells, with the additional problem of the difficulties involved in deciphering highly rearranged chromosomes using a monochrome banding pattern. The introduction of FISH in the late 1980s, as a technique that can readily detect trisomies and translocations in metaphase spreads and interphase nuclei using entire chromosome-specific DNA libraries, was heralded as a further revolution in cytogenetic analysis1,2. The high sensitivity and specificity of FISH and the speed with which the assays can be performed have made FISH a powerful technique with numerous applications, and it has gained general acceptance as a clinical laboratory tool3. 'Chromosome painting' - competitive hybridization using entire chromosome-specific libraries as probes and human genomic DNA as the competitor2 - was one of the first applications of FISH. It provided intense and specific fluorescent staining of human chromosomes in metaphase spreads and interphase nuclei, allowing the distinctive identification of chromosomes involved in complex rearrangements. The advent of the Human Genome Project has made available a repertoire of single-locus probes that have provided a significant boost to gene mapping strategies and led to the identification of the breakpoints of consistent translocations4,5 (the first specific translocation identified in human neoplasia was t(9;22)(q34;q11), resulting in the Philadelphia chromosome6,7) and the delineation of critical deleted regions associated with specific disease subtypes1,8. FISH is essentially a cytogenetic equivalent of Southern blot analysis, exploiting the ability of single-stranded DNA to anneal to complementary DNA. In the case of FISH, the target is the nuclear DNA of either interphase cells or metaphase chromosomes affixed to a microscope slide, although FISH can also be performed using bone marrow or peripheral blood smears, or fixed and sectioned tissue3. Once fixed to a microscope slide, the desired cells are hybridized to a nucleic acid probe. This anneals to its complementary sequence in the specimen DNA and is labelled with a reporter molecule, which is either an attached fluorochrome, enabling direct detection of the probe via a coloured signal at the hybridization site visualized by fluorescence microscopy, or a hapten that can be detected indirectly3,9. This second method relies on immunohistochemistry (IHC) for probe detection, which is based on the binding of antibodies to specific antigens; once antigen-antibody binding occurs, it is demonstrated with a coloured histochemical reaction visible by light microscopy, or with fluorochromes under ultraviolet light10. IHC is limited by the availability of antibodies.
For direct detection, FITC, Rhodamine, Texas Red, Cy2, Cy3, Cy5 and AMCA are the most frequently used reporter molecules. Biotin, Digoxigenin and Dinitrophenol are the reporter molecules typically used for indirect detection methods such as IHC9. Fig. 1 gives a diagrammatic overview of the FISH process.
Background to the Human Genome Project
The Human Genome Project arose from two key insights that emerged in the early 1980s: that the ability to take global views of genomes could greatly accelerate biomedical research, by allowing researchers to attack problems in a comprehensive and unbiased fashion; and that the creation of such global views would require a communal effort in infrastructure building, unlike anything previously attempted in biomedical research. Several key projects helped to crystallize these insights, including: (1) The sequencing of the bacterial viruses ΦX174 (refs 4, 5) and lambda (ref. 6), the animal virus SV40 (ref. 7) and the human mitochondrion (ref. 8) between 1977 and 1982. These projects proved the feasibility of assembling small sequence fragments into complete genomes, and showed the value of complete catalogues of genes and other functional elements. (2) The programme to create a human genetic map to make it possible to locate disease genes of unknown function based solely on their inheritance patterns, launched by Botstein and colleagues in 1980. (3) The programmes to create physical maps of clones covering the yeast (ref. 10) and worm (ref. 11) genomes to allow isolation of genes and regions based solely on their chromosomal position, launched by Olson and Sulston in the mid-1980s. (4) The development of random shotgun sequencing of complementary DNA fragments for high-throughput gene discovery by Schimmel (ref. 12) and Schimmel and Sutcliffe (ref. 13), later dubbed expressed sequence tags (ESTs) and pursued with automated sequencing by Venter and others. The idea of sequencing the entire human genome was first proposed in discussions at scientific meetings organized by the US Department of Energy and others from 1984 to 1986. A committee appointed by the US National Research Council endorsed the concept in its 1988 report (ref. 23), but recommended a broader programme, to include: the creation of genetic, physical and sequence maps of the human genome; parallel efforts in key model organisms such as bacteria, yeast, worms, flies and mice; the development of technology in support of these objectives; and research into the ethical, legal and social issues raised by human genome research. The programme was launched in the US as a joint effort of the Department of Energy and the National Institutes of Health. In other countries, the UK Medical Research Council and the Wellcome Trust supported genomic research in Britain; the Centre d'Etude du Polymorphisme Humain and the French Muscular Dystrophy Association launched mapping efforts in France; government agencies, including the Science and Technology Agency and the Ministry of Education, Science, Sports and Culture, supported genomic research efforts in Japan; and the European Community helped to launch several international efforts, notably the programme to sequence the yeast genome. By late 1990, the Human Genome Project had been launched, with the creation of genome centres in these countries. Additional participants subsequently joined the effort, notably in Germany and China. In addition, the Human Genome Organization (HUGO) was founded to provide a forum for international coordination of genomic research. Several books (refs 24-26) provide a more comprehensive discussion of the genesis of the Human Genome Project.
summary
The Human Genome is comprised of 23 pairs of chromosomes - 22 autosomes and 1 pair of sex chromosomes (X, Y) - containing 3 x 10^9 bp of DNA and about 100,000 genes, of which about 5,000 are disease genes. The smallest human chromosome: Y, 50 Mbp. The largest human chromosome: 1, 250 Mbp. Average-sized gene: 30 kbp, encoding a 1,000-amino-acid protein. Karyotype: analysis of chromosomes via microscope based on shape (size and banding pattern).
Mapping and sequencing the Human Genome. Mapping: dividing each chromosome into small segments, characterizing them, and arranging them sequentially (mapping) on the chromosome. A genome map describes the order of genes, other known DNA segments of no known functional protein, and the spacing between them on each chromosome.
a) Genetic Map: depicts the order in which genes are arranged along a chromosome. The determination of such an order (genetic map) is facilitated by known markers: genes or other DNA stretches. Distances between markers are measured in centimorgans (cM); 1 genetic cM is about 1 Mbp of physical distance.
b) Genetic Linkage Map: shows the relative location of a specific DNA marker along the chromosome. Markers must be polymorphic (variations in DNA sequence occurring once every 300-500 bp) to be useful in mapping. Most variations occur in introns, whereas in exons they could result in observable changes such as eye color, etc. The human genetic linkage map is constructed by observing how frequently two markers are inherited together. The closer the markers are to one another on the same chromosome, the more tightly linked they are and the more likely they will be passed together to the next generation, i.e. they will not be separated by recombination events. Hence the distance between two markers can be determined. This can also assist in locating a gene, especially a genetic disease gene, on a chromosome. Genetic maps assisted in the chromosomal location of several inherited diseases, such as sickle cell disease, cystic fibrosis, Tay-Sachs disease, fragile X syndrome, myotonic dystrophy and ataxia telangiectasia. Goal: a complete, detailed genetic map of 1 cM resolution.
c) Physical Map: shows the actual sites of genes on the genome. There are several physical maps with different degrees of resolution. The physical map of DNA, like a topographic map, is comprised of mapped landmarks, such as restriction enzyme sites and STSs (see below), providing reference points relative to which functional DNA sequences such as genes can be localized. i. The lowest-resolution physical map is the Cytogenetic Map, where chromosomal band patterns of stained chromosomes are viewed with a light microscope. ii. The cDNA Map shows the location of genes. iii. The highest-resolution map shows the complete DNA bp sequence.
Mapping Methods. a) Terminology: RFLP (restriction fragment length polymorphism): sequence variations in DNA sites that can be cleaved by restriction enzymes. STRP (short tandem repeat polymorphism): variable number of tandem repeat sequences, most commonly of 2 bp. The advantages of STRPs are: repeated up to thousands of times throughout the genome; even distribution throughout the genome; amplifiable by PCR; the number of repeats varies among individuals. In some cases, an excess of trinucleotide repeats can lead to inherited diseases such as fragile X syndrome, Huntington's disease and myotonic dystrophy.
What about sequencing RNA with Next Gen?
The number of sequence reads obtained from a given gene, relative to the number obtained from other genes, is a measure of that gene's expression level.
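A minimal counting sketch of this idea - reads mapped per gene, normalised to counts per million (CPM). The counts are invented; real RNA-seq analyses also account for gene length and use dedicated statistics (e.g. DESeq2, edgeR) for between-sample comparisons:

```python
# Sketch: using read counts per gene as a measure of expression. Counts are
# invented; gene names are borrowed from the notes purely as labels.

counts = {"act-1": 52000, "unc-22": 4100, "fog-1": 310}   # reads mapped per gene
library_size = sum(counts.values())

for gene, n in counts.items():
    cpm = n / library_size * 1_000_000    # counts per million mapped reads
    print(f"{gene}: {n} reads -> {cpm:,.0f} CPM")
```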
Background to the Human Genome Project.........2
Through 1995, work progressed rapidly on two fronts. The first was construction of genetic and physical maps of the human and mouse genomes (refs 27-31), providing key tools for identification of disease genes and anchoring points for genomic sequence. The second was sequencing of the yeast (ref. 32) and worm (ref. 33) genomes, as well as targeted regions of mammalian genomes. These projects showed that large-scale sequencing was feasible and developed the two-phase paradigm for genome sequencing. In the first, 'shotgun', phase, the genome is divided into appropriately sized segments and each segment is covered to a high degree of redundancy (typically, eight- to tenfold) through the sequencing of randomly selected subfragments. The second is a 'finishing' phase, in which sequence gaps are closed and remaining ambiguities are resolved through directed analysis. The results also showed that complete genomic sequence provided information about genes, regulatory regions and chromosome structure that was not readily obtainable from cDNA studies alone. In 1995, genome scientists considered a proposal (ref. 38) that would have involved producing a draft sequence of the human genome in a first phase and then returning to finish the sequence in a second phase. After vigorous debate, it was decided that such a plan was premature for several reasons. These included the need first to prove that high-quality, long-range finished sequence could be produced from most parts of the complex, repeat-rich human genome; the sense that many aspects of the sequencing process were still rapidly evolving; and the desirability of further decreasing costs. Instead, pilot projects were launched to demonstrate the feasibility of cost-effective, large-scale sequencing, with a target completion date of March 1999. The projects successfully produced finished sequence with 99.99% accuracy and no gaps (ref. 39). They also introduced bacterial artificial chromosomes (BACs; ref. 40), a new large-insert cloning system that proved to be more stable than the cosmids and yeast artificial chromosomes (YACs; ref. 41) that had been used previously. The pilot projects drove the maturation and convergence of sequencing strategies, while producing 15% of the human genome sequence. With successful completion of this phase, the human genome sequencing effort moved into full-scale production in March 1999. The idea of first producing a draft genome sequence was revived at this time, both because the ability to finish such a sequence was no longer in doubt and because there was great hunger in the scientific community for human sequence data. In addition, some scientists favoured prioritizing the production of a draft genome sequence over regional finished sequence because of concerns about commercial plans to generate proprietary databases of human sequence that might be subject to undesirable restrictions on use. The consortium focused on an initial goal of producing, in a first production phase lasting until June 2000, a draft genome sequence covering most of the genome. Such a draft genome sequence, although not completely finished, would rapidly allow investigators to begin to extract most of the information in the human sequence.
ENSEMBL
provides a genome browser that acts as a single point of access to annotated genomes for mainly vertebrate species (Figure 2). Information such as gene sequence, splice variants and further annotation can be retrieved at the genome, gene and protein level. This includes information on protein domains, genetic variation, homology, syntenic regions and regulatory elements. Coupled with analyses such as whole-genome alignments and the effects of sequence variation on proteins, this powerful tool aims to describe a gene or genomic region in detail. The developers have put a lot of effort into showing the ordinary scientist what the human genome looks like • could argue... it still looks 'busy' and is quite hard to use! • The browser enables a diagrammatic view showing the location, extent and direction of all genes.
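Ensembl annotation can also be queried programmatically. A hedged sketch using the public Ensembl REST service's gene-symbol lookup endpoint (rest.ensembl.org); response field names may vary between releases, so treat this as illustrative rather than definitive:

```python
# Sketch: programmatic access to Ensembl annotation via its public REST API.
# The lookup-by-symbol endpoint is documented at rest.ensembl.org; the exact
# fields returned can change between releases, hence the defensive .get() calls.
import json
from urllib.request import urlopen

url = "https://rest.ensembl.org/lookup/symbol/homo_sapiens/BRCA2?content-type=application/json"
with urlopen(url) as response:
    gene = json.load(response)

print(gene.get("id"), gene.get("display_name"))
print("location:", gene.get("seq_region_name"), gene.get("start"), "-", gene.get("end"))
```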
The need for assembly
Given that the length of a single sequencing read is somewhere between 45 bp and 700 bp, we face a problem in determining the sequence of longer fragments, such as the chromosomes of an entire human genome (3 x 10^9 bp). Obviously, we need to break the genome into smaller fragments. There are two different strategies for doing this: (1) clone-by-clone sequencing, which relies on the creation of a physical map first and then sequencing, and (2) whole-genome shotgun sequencing, which sequences first and does not require a physical map.
rapid mapping of mutations - RAD markers and genomic interval pull-down sequencing (GIPS)
Forward genetic screens provide a powerful approach for inferring gene function on the basis of the phenotypes associated with mutated genes. However, determining the causal mutation by traditional mapping and candidate gene sequencing is often the rate-limiting step, especially when analyzing many mutants. We report two genomic approaches for more rapidly determining the identity of the affected genes in Caenorhabditis elegans mutants. First, we report our use of restriction site-associated DNA (RAD) polymorphism markers for rapidly mapping mutations after chemical mutagenesis and mutant isolation. Second, we describe our use of genomic interval pull-down sequencing (GIPS) to selectively capture and sequence megabase-sized portions of a mutant genome. Together, these two methods provide a rapid and cost-effective approach for positional cloning of C. elegans mutant loci, and are also applicable to other genetic model systems. Determining mutant gene identity is a key step for understanding gene function in forward genetic screens following mutagenesis and phenotype-based mutant isolation. In some organisms such as fungi and bacteria, a recessive mutant allele can be complemented with a plasmid-borne wild-type gene to establish gene identification. In organisms that lack robust DNA transformation methods, mapping with visible or selected single nucleotide polymorphism (SNP) markers to progressively finer genomic intervals is the traditional route to ascertain identity of the mutant gene. Now whole-genome sequencing (WGS) methods can significantly reduce the time required to identify the causal mutation. For example, WGS can simply be used to determine all of the sequence alterations present in a mutant strain (Sarin et al. 2008; Smith et al. 2008; Srivatsan et al. 2008; Blumenstiel et al. 2009; Irvine et al. 2009). However, some mapping data are still required to differentiate the background mutational load from the causal mutation. More recently, WGS has been performed on outcrossed mutant progeny to combine mapping and sequencing for pinpointing the position of the causal mutation (Doitsidou et al. 2010; Zuryn et al. 2010). While resequencing a genome to identify mutant alleles is being used more frequently, in some cases it is more efficient to sequence only a portion of a genome. For example, sequencing of a single chromosome, a defined genomic interval, exonic sequences, or a single locus can be more cost effective when there is evidence that a mutation resides within a specific genome feature. There have been several throughput-enhancing advances in capturing targeted regions of a genome using DNA annealing since the first reported use of this methodology, whereby individual microarray spots were physically scraped from the substrate (Ksiazek et al. 2003; Rota et al. 2003; Wang et al. 2003). For example, genomic DNA can be annealed to microarrays printed with oligonucleotides covering the region to be targeted, washed, and then eluted for sequencing (Albert et al. 2007; Hodges et al. 2007; Okou et al. 2007). Alternatively, oligonucleotides can be used to capture homologous genomic DNA in solution (Gnirke et al. 2009). While these approaches are extremely high throughput, they also can be prohibitively expensive.
The Basic Local Alignment Search Tool (BLAST)
- A computer program designed to search for homologous sequences in databases; it finds regions of local similarity between sequences. - The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. - BLAST compares small sequential blocks - or WINDOWS - of sequence against massive databases. - It looks for regions of similarity and scores them. Small windows of comparison detect LOCAL regions of similarity. Output: % identity and % similarity (permits conservative substitutions of aa). Gives an overall score and probability of relatedness. If the entire protein sequence were compared in one go, you might get a relatively low overall similarity. How did genes and gene families evolve and what is meant by protein domains? We need to come back to this - remember the question! The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families.
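A toy sketch of the "small windows of comparison" idea - scoring % identity between every window of a query and a subject sequence and reporting the strong local matches. BLAST itself seeds on short words, extends hits and reports E-values, which this sketch does not reproduce; the sequences are invented:

```python
# Toy illustration of windowed local similarity. BLAST seeds on short words,
# extends hits and assigns E-values; this sketch only shows the window-scoring
# idea. Sequences are invented.

def window_identity(query, subject, window=8):
    hits = []
    for i in range(len(query) - window + 1):
        q = query[i:i + window]
        for j in range(len(subject) - window + 1):
            s = subject[j:j + window]
            ident = sum(a == b for a, b in zip(q, s)) / window
            if ident >= 0.75:                      # report strong local matches only
                hits.append((i, j, round(100 * ident, 1)))
    return hits

query   = "MKTAYIAKQR"
subject = "GGMKTAYLAKQRPP"
for q_pos, s_pos, pct in window_identity(query, subject):
    print(f"query[{q_pos}] vs subject[{s_pos}]: {pct}% identity")
```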
Hybridisation of fluorescent labelled cDNA to microarray
- Can test for genes up- or down-regulated under different conditions, e.g. using the same type of cells exposed to different conditions such as high temperature, and studying how high temperature affects the level of gene expression. - Cancer cells versus normal; cells with a drug versus without, etc. - By having multiple probes for each gene, differential splicing can be detected. Example: in each spot we add oligonucleotide probes for a particular splice pattern of the gene, then we determine whether the sample of cDNA hybridises to that splice pattern.
where did our genome come from?
- Common ancestor => common genome. - Each species' genome descended with modification from the genome of the ancestor. - Comparative genomics tells us about the state of the ancestor and the changes along each branch.
Computer based predictions
- GENEFINDER (C. elegans), BLAST (all genomes) and other computer programs. - BLAST can compare the translation of all 6 reading frames. - What counts as evidence that a prediction is correct? • Homology with genes in other organisms - homologues. • Known protein families. • Experimental evidence.
performing genome analysis using computer based predictions
- GENEFINDER (C. elegans), BLAST (all genomes) and other computer programs. - They are capable of identifying biases in the sequence: in C. elegans, non-coding DNA is AT-rich. Splice-site signals, initiator methionines, termination codons. Likely exons and probable/possible splice patterns. - BLAST: can compare the translation of all 6 reading frames - it shows homology with genes in other organisms - shows protein families - experimental evidence - calculates the statistical significance of the similarity between each matching window and the sequence of interest that we run BLAST on.
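A short sketch of what "the translation of all 6 reading frames" means in practice: translate a DNA fragment in three forward frames and three reverse-complement frames before searching for protein homology. This assumes Biopython is installed and uses an arbitrary made-up sequence; it is an illustration of the idea, not any particular gene-prediction program.

```python
# Generate all six reading-frame translations of a genomic fragment
# (the kind of input a BLASTX-style search compares against protein databases).
from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")  # arbitrary example sequence

frames = []
for strand_name, strand in (("+", dna), ("-", dna.reverse_complement())):
    for offset in range(3):
        # trim to a multiple of 3 so translate() sees only complete codons
        sub = strand[offset:]
        sub = sub[: len(sub) - (len(sub) % 3)]
        frames.append((strand_name, offset + 1, str(sub.translate())))

for strand_name, frame, protein in frames:
    print(f"frame {strand_name}{frame}: {protein}")
```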
• Target-'complete' human genome sequence • Strategy-two parallel projects emerged:
- Human Genome Project (HGP) • vs arch rival..... - Celera Genomics - Craig Venter • Human Genome Project (HGP) - Chose accurate but slow method, with lots of checkpoints - Hung the sequence on the BAC map (the BAC map was in turn the result of lots of previous mapping phases) - Sequence each BAC in a minimal tiling ('golden') path • done by generating sequence contigs from lots of smaller-scale shotgun clone sequences - Result - high-quality draft genome sequence - useful to all - Chose to place all sequences in public databases immediately • publicly funded; prevented patenting
Microarray - large-scale analysis of gene expression
- Microarray - each spot on the microarray contains thousands of copies of a target probe, normally synthetic oligonucleotides - maybe 2-5 for each gene tested. - Gene expression analyses study the occurrence or activity of the formation of a gene product from its coding gene. It is a sensitive indicator of biological activity wherein a changing gene expression pattern is reflected in a change of biological process. Gene expression profiling goes beyond the static information of the genome sequence into a dynamic functional view of an organism's biology and is a widely used approach in research, clinical and pharmaceutical settings to better understand individual genes, gene pathways, or greater gene activity profiles. Gene expression analysis can be achieved through a variety of means; real-time PCR has become the most widely used approach.

Scientists know that a mutation - or alteration - in a particular gene's DNA may contribute to a certain disease. However, it can be very difficult to develop a test to detect these mutations, because most large genes have many regions where mutations can occur. For example, researchers believe that mutations in the genes BRCA1 and BRCA2 cause as many as 60 percent of all cases of hereditary breast and ovarian cancers. But there is not one specific mutation responsible for all of these cases. Researchers have already discovered over 800 different mutations in BRCA1 alone. The DNA microarray is a tool used to determine whether the DNA from a particular individual contains a mutation in genes like BRCA1 and BRCA2. The chip consists of a small glass plate encased in plastic. Some companies manufacture microarrays using methods similar to those used to make computer microchips. On the surface, each chip contains thousands of short, synthetic, single-stranded DNA sequences, which together add up to the normal gene in question, and to variants (mutations) of that gene that have been found in the human population.

What is a DNA microarray used for? When they were first introduced, DNA microarrays were used only as a research tool. Scientists continue today to conduct large-scale population studies - for example, to determine how often individuals with a particular mutation actually develop breast cancer, or to identify the changes in gene sequences that are most often associated with particular diseases. This has become possible because, just as is the case for computer chips, very large numbers of 'features' can be put on microarray chips, representing a very large portion of the human genome. Microarrays can also be used to study the extent to which certain genes are turned on or off in cells and tissues. In this case, instead of isolating DNA from the samples, RNA (which is a transcript of the DNA) is isolated and measured. Today, DNA microarrays are used in clinical diagnostic tests for some diseases. Sometimes they are also used to determine which drugs might be best prescribed for particular individuals, because genes determine how our bodies handle the chemistry related to those drugs. With the advent of new DNA sequencing technologies, some of the tests for which microarrays were used in the past now use DNA sequencing instead. But microarray tests still tend to be less expensive than sequencing, so they may be used for very large studies, as well as for some clinical tests.

How does a DNA microarray work?
To determine whether an individual possesses a mutation for a particular disease, a scientist first obtains a sample of DNA from the patient's blood as well as a control sample - one that does not contain a mutation in the gene of interest. The researcher then denatures the DNA in the samples - a process that separates the two complementary strands of DNA into single-stranded molecules. The next step is to cut the long strands of DNA into smaller, more manageable fragments and then to label each fragment by attaching a fluorescent dye (there are other ways to do this, but this is one common method). The individual's DNA is labeled with green dye and the control - or normal - DNA is labeled with red dye. Both sets of labeled DNA are then inserted into the chip and allowed to hybridize - or bind - to the synthetic DNA on the chip. If the individual does not have a mutation for the gene, both the red and green samples will bind to the sequences on the chip that represent the sequence without the mutation (the "normal" sequence). If the individual does possess a mutation, the individual's DNA will not bind properly to the DNA sequences on the chip that represent the "normal" sequence but instead will bind to the sequence on the chip that represents the mutated DNA.
Malonyl CoA: acyl carrier protein transacylase.
- Parent domain (purple and pink in the figure) - both halves belong to the same gene family - Inserted ACP (acyl carrier protein)-binding domain (orange). - The acyl transferase domain is split in two by the insertion of the acyl-carrier-protein domain (the 'insert domain'). - This insertion results in a protein product with a completely different function: the function of the original acyl transferase is different from that of the acyl-carrier-protein transacylase formed after the insertion.
how did the Tissue Plasminogen Activator (TPA) gene occur since it contains genes from different families?
- a thrombolytic administered to some patients having a heart attack or stroke to dissolve damaging blood clots - Domain shuffling is illustrated by tissue plasminogen activator (TPA), a protein found in the blood of vertebrates and which is involved in the blood clotting response. The TPA gene has four exons, each coding for a different structural domain. The upstream exon codes for a 'finger' module that enables the TPA protein to bind to fibrin, a fibrous protein found in blood clots and which activates TPA. This exon appears to be derived from a second fibrin-binding protein, fibronectin, and is absent from the gene for a related protein, urokinase, which is not activated by fibrin. The second TPA exon specifies a growth-factor domain which has apparently been obtained from the gene for epidermal growth factor and which may enable TPA to stimulate cell proliferation. The last two exons code for 'kringle' structures which TPA uses to bind to fibrin clots; these kringle exons come from the plasminogen gene. - exon shuffling of the EGF exon, the fibronectin finger exon and the plasminogen kringle exon - exon duplication of the plasminogen kringle (PK) exon - Type I collagen and TPA provide elegant examples of gene evolution but, unfortunately, the clear links that they display between structural domains and exons are exceptional and are rarely seen with other genes. Many other genes appear to have evolved by duplication and shuffling of segments, but in these the structural domains are coded by segments of genes that do not coincide with individual exons or even groups of exons. Domain duplication and shuffling still occur, but presumably in a less precise manner and with many of the rearranged genes having no useful function. Despite being haphazard, the process clearly works, as indicated by, among other examples, the number of proteins that share the same DNA-binding motifs. Several of these motifs probably evolved de novo on more than one occasion, but it is clear that in many cases the nucleotide sequence coding for the motif has been transferred to a variety of different genes.
c.elegans
- C. elegans is so small that cDNA libraries can only be made from the whole, homogenised organism rather than from individual tissues. - EST sequence data from 50,000 cDNA clones identifies about 9,356 genes. - Some mRNAs exist at extremely low levels of abundance. - Some are expressed in one or a few cell types - or at a specific time in development, etc. - Low-abundance cDNAs may be impossible to clone randomly.
how to predict a gene?
- cDNA clones, expressed sequence tags (ESTs) - mRNA is copied to cDNA via reverse transcriptase.

Expressed sequence tags (ESTs) are randomly selected clones sequenced from cDNA libraries. Each cDNA library is constructed from total RNA or poly(A) RNA derived from a specific tissue or cell, and thus the library represents genes expressed in the original cellular population. A typical EST consists of 300-1000 base pairs (bp) of DNA and is often deposited in a database as a "single pass read" that is sufficiently long to establish the identity of the expressed gene. EST analysis has proved to be a rapid and efficient means of characterizing the massive sets of gene sequences that are expressed in a life-stage-specific manner in a wide variety of tissues and organisms; the approach was first applied to the screening of a human brain cDNA library (Adams et al., 1991). The ESTs derived from human brain were also able to provide a reference for EST analysis of other organisms. In the helminth field, ESTs have extensive application in the discovery of new genes and identification of novel vaccine candidates and drug targets.

In genetics, an expressed sequence tag (EST) is a short sub-sequence of a cDNA sequence. ESTs may be used to identify gene transcripts, and are instrumental in gene discovery and in gene-sequence determination. The identification of ESTs has proceeded rapidly, with approximately 74.2 million ESTs now available in public databases (e.g. GenBank, 1 January 2013, all species). An EST results from one-shot sequencing of a cloned cDNA. The cDNAs used for EST generation are typically individual clones from a cDNA library. The resulting sequence is a relatively low-quality fragment whose length is limited by current technology to approximately 500 to 800 nucleotides. Because these clones consist of DNA that is complementary to mRNA, the ESTs represent portions of expressed genes. They may be represented in databases as either cDNA/mRNA sequence or as the reverse complement of the mRNA, the template strand.

One can map ESTs to specific chromosome locations using physical mapping techniques, such as radiation hybrid mapping, Happy mapping, or FISH. Alternatively, if the genome of the organism that originated the EST has been sequenced, one can align the EST sequence to that genome using a computer. The current understanding of the human set of genes (as of 2006) includes the existence of thousands of genes based solely on EST evidence. In this respect, ESTs have become a tool to refine the predicted transcripts for those genes, which leads to the prediction of their protein products and ultimately of their function. Moreover, the situation in which those ESTs are obtained (tissue, organ, disease state - e.g. cancer) gives information on the conditions in which the corresponding gene is acting. ESTs contain enough information to permit the design of precise probes for DNA microarrays that then can be used to determine gene expression profiles. Some authors use the term "EST" to describe genes for which little or no further information exists besides the tag.

EST sequencing was carried out in parallel to genome sequencing. It is the simplest experimental evidence that a bit of genomic DNA contains a gene.
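A toy illustration (with invented sequences) of the point above that an EST can be aligned back to genomic DNA by computer: the spliced EST does not match the genome contiguously, but its exon blocks do, revealing the position of the intron. Real EST-to-genome alignment uses spliced aligners or BLAST/BLAT rather than exact string matching.

```python
# Toy EST-to-genome comparison: the EST (made from spliced mRNA) is absent as
# one block but its two exon-derived halves are found, separated by an intron.
genome = ("TTGACATGGCAGTTGACCAT"      # exon 1
          "GTAAGTTTTTCCCCCTTTAG"      # intron (note the GT...AG boundaries)
          "GATTGGCATCCGATTGATAA")     # exon 2

est = "ATGGCAGTTGACCATGATTGGCATCCGATTG"   # exon1 + exon2 joined, no intron

print("whole EST found in genome?", est in genome)        # False: intron missing
left, right = est[:15], est[15:]
print("exon 1 block at genome position", genome.find(left))
print("exon 2 block at genome position", genome.find(right))
```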
Reverse transcriptase PCR - very sensitive.
- cDNA is made from mRNA using reverse transcriptase. Amplify the cDNA by PCR - primers designed from predicted genes. Clone and analyse the products. Experimentally confirmed genes raised to >18,000. Full-length cDNA is valuable for confirming intron/exon structure. - This solves the problem of low-abundance mRNAs: - we make cDNA from mRNA - perform RT-PCR to amplify the cDNA (primers are designed from the predicted genes) - clone and analyse the EST sequence - full-length cDNA confirms the introns and exons
Sequence data from Random Primed cDNA - ESTs (or EST Tags)
- Exon skipping - the sequencing of ESTs uncovered frequent examples of differential splicing. - It is hard to obtain a single cDNA that covers a whole gene, so random priming is used to make several smaller cDNA clones from the gene; sequencing each cDNA clone (EST tag) lets us work out all the possible splicing patterns of the gene. - Sequencing cDNAs/ESTs tells us which parts of the chromosome are transcribed genes and which splicing pattern the transcripts fit. - It can reveal alternative 5' exons, alternative splicing that alters stop codons, and genes within genes.
reverse genetics
- From DNA sequence to phenotype - the starting point is a DNA sequence, not a randomly mutated gene defined by phenotype. - When we have a cDNA/EST we know the sequence is an expressed gene, but we have no knowledge of its function because the gene has never been mutated - so we perform reverse genetics: - knock down the gene via RNAi, which mimics a mutant - homologous gene knockout - CRISPR genome editing - we genetically modify the gene to create a mutant and so work out the gene's function.
how did genes and gene families evolve and what is meant by protein domains?
- some genes exist as families - specialisation of genes happens after gene duplication - the β-globin genes are responsible for carrying oxygen in the blood - similarities between genes, even within the same species, can be because they belong to the same gene family and are the result of multiple gene duplication events - BLAST is a great tool for comparative genomics - Gene duplication: new genes have also evolved through the duplication of whole genes and their subsequent divergence. This process creates multigene families, sets of genes that are similar in sequence but encode different products. For example, humans possess 13 different genes found on chromosomes 11 and 16 that encode globin-like molecules, which take part in oxygen transport. All of these genes have a similar structure, with three exons separated by two introns, and are assumed to have evolved through repeated duplication and divergence from a single globin gene in a distant ancestor. This ancestral gene is thought to have been most similar to the present-day myoglobin gene and first duplicated to produce an α/β-globin precursor gene and the myoglobin gene. The α/β-globin gene then underwent another duplication to give rise to a primordial α-globin gene and a primordial β-globin gene. Subsequent duplications led to multiple α-globin and β-globin genes. Similarly, vertebrates contain four clusters of Hox genes, each cluster comprising from 9 to 11 genes. Hox genes play an important role in development. Some gene families include genes that are arrayed in tandem on the same chromosome; others are dispersed among different chromosomes. Gene duplication is a common occurrence in eukaryotic genomes; for example, about 5% of the human genome consists of duplicated segments. Gene duplication provides a mechanism for the addition of new genes with novel functions; after a gene duplicates, there are two copies of the sequence, one of which is free to change and potentially take on a new function. The extra copy of the gene may, for example, become active at a different time in development or be expressed in a different tissue or even diverge and encode a protein having different amino acids. However, the most common fate of gene duplication is that one copy acquires a mutation that renders it nonfunctional, giving rise to a pseudogene. Pseudogenes are common in the genomes of complex eukaryotes; the human genome is estimated to contain as many as 20,000 pseudogenes.
Genome Size
- What does this list tell us? - It tells us that there is a core set of proteins involved in cell biology and metabolism - it indicates the evolution of developmental complexity - amplification of gene families of regulatory molecules (due to increased developmental complexity). This explains the increase in the number of genes in multicellular organisms compared with S. cerevisiae, but it doesn't explain the increase in the DNA content.
how to add function onto genomics?
1. Forward genetics - random mutagenesis of the organism, followed by positional cloning of genes defined by mutation. // You screen for mutants; mutations are defined by the phenotype of the mutated gene without knowing the actual sequence of the gene. You identify the relative location of genes through recombinational mapping, then perform positional cloning of the genes defined by mutation, using phenotypic-rescue attempts to identify the sequence of the mutated gene. 2. For humans - identifying the sequence of disease loci is also forward genetics. 3. Expressed Sequence Tags (ESTs). Evidence that a sequence of the genome is expressed, spliced and exported from the nucleus as mRNA: it is a gene. What is different about each? 1 - an experimental process performed on model organisms (C. elegans). 2 - a clinical process to identify the sequence of disease loci. 1 & 2 are forward genetics and we know their phenotypes. 3 - ESTs indicate that a gene is expressed, but we cannot obtain information about the function of the gene from ESTs alone.
Structure of a 'typical' eukaryotic gene
Although humans contain a thousand times more DNA than do bacteria, the best estimates are that humans have only about 20 times more genes than do the bacteria. This means that the vast majority of eukaryotic DNA is apparently nonfunctional. This seems like a contradiction. Why wouldn't more complicated organisms have more DNA? However, the DNA content of an organism doesn't correlate well with the complexity of an organism—the most DNA per cell occurs in a fly species. Other arguments suggest that a maximal number of genes in an organism may exist because too many genes means too many opportunities for mutations. Older estimates put the number of separate human mRNAs at about 100,000, implying about 100,000 expressed genes (the current count of protein-coding genes is far lower - see later); even that number is still lower than the capacity of the unique DNA fraction in an organism. These arguments lead to the conclusion that the vast majority of cellular DNA isn't functional.

Genes that are expressed usually have introns that interrupt the coding sequences. A typical eukaryotic gene, therefore, consists of a set of sequences that appear in mature mRNA (called exons) interrupted by introns. The regions between genes are likewise not expressed, but may help with chromatin assembly, contain promoters, and so forth. Intron sequences contain some common features. Most introns begin with the sequence GT (GU in RNA) and end with the sequence AG. Otherwise, very little similarity exists among them. Intron sequences may be large relative to coding sequences; in some genes, over 90 percent of the sequence between the 5′ and 3′ ends of the mRNA is introns. RNA polymerase transcribes intron sequences. This means that eukaryotic mRNA precursors must be processed to remove introns as well as to add the caps at the 5′ end and polyadenylic acid (poly A) sequences at the 3′ end.

Eukaryotic genes may be clustered (for example, genes for a metabolic pathway may occur on the same region of a chromosome) but are independently controlled. Operons or polycistronic mRNAs do not exist in eukaryotes. This contrasts with prokaryotic genes, where a single control gene often acts on a whole cluster (for example, lacI controls the synthesis of β‐galactosidase, permease, and acetylase). One well‐studied example of a clustered gene system is the mammalian globin genes. Globins are the protein components of hemoglobin. In mammals, specialized globins exist that are expressed in embryonic or fetal circulation. These have a higher oxygen affinity than adult hemoglobins and thus serve to "capture" oxygen at the placenta, moving it from the maternal circulation to that of the developing embryo or fetus. After birth, the familiar mature hemoglobin (which consists of two alpha and two beta subunits) replaces these globins. Two globin clusters exist in humans: the alpha cluster on chromosome 16, and the beta cluster on chromosome 11. These clusters, and the gene for the related protein myoglobin, probably arose by duplication of a primordial gene that encoded a single heme‐containing, oxygen‐binding protein. Within each cluster is a gene designated with the Greek letter Ψ. These are pseudogenes—DNA sequences related to a functional gene but containing one or more mutations so that it isn't expressed. The information problem of eukaryotic gene expression therefore consists of several components: gene recognition, gene transcription, and mRNA processing.
These problems have been approached biochemically by analyzing the enzyme systems involved in each step. Protein domains may be encoded by a single exon, but not necessarily. Domain - a region of a protein in which most of the 3D-structure bonds formed after folding are satisfied internally. New genes may emerge in several ways - whole-gene duplication; rearrangements involving exon 'shuffling'; de novo origin.
Proof of principle
As mentioned above, we modified MAQGene to provide an additional output file with the sequence pileup at all Hawaiian/Bristol polymorphic positions and the relative frequency of Hawaiian and Bristol sequences at these positions. We extracted from the list of SNPs those loci that had sufficient coverage and whose pileup showed a defined proportion of Hawaiian sequences (for the exact filtering criteria and data processing see Materials and Methods). Graphs showing the distribution of those SNPs on the 6 chromosomes were then generated. We expected the mutation-bearing region would be revealed by the lack of such Hawaiian SNPs.
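A hypothetical sketch of the analysis described above: bin the SNP pileup along a chromosome and compute the fraction of reads carrying the Hawaiian variant in each bin; the interval linked to the causal mutation is the one where the Hawaiian frequency drops towards zero. The input format and numbers are invented and do not reflect MAQGene's actual output files.

```python
# Bin Hawaiian/Bristol read counts at SNP loci along one chromosome and flag
# bins where the Hawaiian allele frequency collapses (the linked region).
from collections import defaultdict

# (position_in_bp, hawaiian_read_count, bristol_read_count) - invented data
snp_pileup = [
    (1_000_000, 9, 11), (3_000_000, 12, 10), (5_000_000, 8, 12),
    (7_000_000, 3, 18), (9_000_000, 0, 22), (11_000_000, 1, 20),   # linked region
    (13_000_000, 7, 13), (15_000_000, 10, 10),
]

bin_size = 4_000_000
bins = defaultdict(lambda: [0, 0])            # bin index -> [hawaiian, total]
for pos, haw, bri in snp_pileup:
    b = pos // bin_size
    bins[b][0] += haw
    bins[b][1] += haw + bri

for b in sorted(bins):
    haw, total = bins[b]
    freq = haw / total
    flag = "  <-- candidate interval" if freq < 0.15 else ""
    print(f"{b * bin_size / 1e6:5.1f}-{(b + 1) * bin_size / 1e6:5.1f} Mb: "
          f"Hawaiian allele frequency = {freq:.2f}{flag}")
```

Unlinked bins hover around 0.5, as expected for freely recombining regions, while the bin containing the mutation approaches zero.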
positional cloning
Can't make transgenic humans - but the same positional information is used to identify human disease genes. Can sequence exons from patients and non-affected individuals - look for possible CAUSATIVE mutations. Can delete genes in mice and observe the phenotype in them. The logic of genetic mapping, and of relating it to the physical map and sequence, is the same as for the worm.
How to add Function onto Genomics
Forward genetics - random mutagenesis of the organism, followed by positional cloning of genes defined by mutation. Model organisms, and humans for disease loci. Expressed Sequence Tags (ESTs) - evidence that a sequence of the genome is expressed, spliced and exported from the nucleus as mRNA: it is a gene. Reverse Genetics - mutating a DNA sequence and then looking for a PHENOTYPE. Comparative Genomics - protein sequence comparisons with computational biology. Gene expression - where, when and under what circumstances are genes expressed in an organism? Where in a cell is a protein located? What proteins does it interact with?
Past Paper Exam Questions
Here are a selection of questions that relate to Genomics lectures from Bailey/Johnstone (from previous years - curriculum has changed a bit over the years...) 1. In studying a new model organism, why should we want to sequence its genome? 2. What questions about the life and attributes of an organism can only be answered at the genome level or once its genome has been sequenced? 3. How are the constituents of a genome organised, and how do we know? 4. What kinds of markers have been developed for use in genetic and physical mapping of the human genome, and how are they used? 5. Explain the principles by which genomes can be mapped, and suggest ways in which different kinds of maps may be integrated. 6. With reference to the human and/or C. elegans genome projects, discuss the major strategies and methodologies involved in sequencing a genome. [N.B. NOT '...methods...' - see below] 7. Explain how the modern approach to genome sequencing differs from the approach used in the Human Genome Project. These won't come up again - or will they...? Certainly, it's possible there may be new ones this year and any Qs we set will be tailored to what we actually taught. The point is - you can't revise just from the lecture notes for most of these - you'll need to refer to textbooks and integrate from several parts of the block (and possibly even from several blocks across the course). N.B. we don't want you to list 'methods' e.g. how PCR works, or how various forms of sequencing work - we want 'methodology' (i.e. strategy) e.g. Map construction followed by targeted sequencing of Golden Path BACs vs whole genome shotgun approach
Genomics is about 'high-throughput' science
High-throughput sequencing (HTS) is a more recently developed alternative to microarrays. Although it is still more expensive than microarrays, it has several advantages, even for measuring factors that affect the regulation of gene expression. For example, HTS can be applied to non-model organisms, whereas a microarray is restricted to model organisms for which an array has already been designed, as mentioned above. Although RNA must be converted to DNA before HTS, HTS can directly count the number of DNA/RNA fragments and is therefore believed to be more quantitative than microarrays. If HTS is combined with the ChIP technology developed for ChIP-chip microarray experiments, the resulting ChIP-seq can be used to measure histone modifications and TFBSs. If HTS is combined with bisulfite treatment, it can also be used to identify DNA methylation. The disadvantage of short-read technology is that short reads must be mapped to a genome, which is not always available; if the genome sequence is missing, the genome must be assembled independently prior to mapping the short reads. • MJ Tetrad PCR machine • 4 x 384-well blocks • run ~8 times a day • Serious genomics labs will have 100s of these! - and robots to fill them. Capillary sequencers - each generates 3 x 96 x 500 bp per day... - many hundreds of these machines were used in the HGP at the Sanger Centre.
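A minimal sketch of why counting sequenced fragments makes HTS directly quantitative for gene expression: once reads have been mapped, expression is simply the number of reads assigned to each gene, often scaled per million mapped reads. The read-to-gene assignments below are invented; real pipelines use an aligner plus counting tools such as HTSeq or featureCounts.

```python
# Count mapped reads per gene as a crude expression estimate.
from collections import Counter

# gene each aligned read was assigned to (one entry per read) - invented
read_assignments = ["act-1", "act-1", "hsp-16", "act-1", "hsp-16",
                    "vit-2", "act-1", "hsp-16", "hsp-16", "act-1"]

counts = Counter(read_assignments)
total = sum(counts.values())
for gene, n in counts.most_common():
    print(f"{gene}: {n} reads ({1e6 * n / total:.0f} per million mapped reads)")
```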
How can the physical (clones & sequence) and recombinational genetic maps be aligned?
Identify the sequence of genes defined by mutation. - Generate a transgenic C. elegans by injecting DNA into the gonad of an adult hermaphrodite. - Gonad of C. elegans: the meiotic and mitotic nuclei of the germ-line stem cells sit in a syncytium (a cytoplasmic mass containing several nuclei, formed either by fusion of cells or by nuclear division without cell division). DNA injected into the gonads of adult hermaphrodites forms large heritable DNA molecules termed "free arrays". - The nuclei are bathed in a solution of DNA and can take up the DNA at high frequency - as a result, the eggs formed will contain the foreign DNA and transgenic offspring can be produced.
All new genome projects now done using Next Generation Sequencing (NGS)
Illumina - HiSeq / MiSeq; Life Technologies - Ion Torrent/PGM • Main characteristic of NGS - extreme throughput because massively parallel: >10,000 Mb (100,000,000 reads) per run • Approach - a modern kind of 'whole genome shotgun' sequencing: • fragment the genome randomly • size-select fragments and prepare a 'sequencing library' of small fragments with defined 'linker-tag' ends (no cloning involved, but some amplification) • sequence using an Illumina HiSeq or other very high-throughput machine • generates 100s of millions of 125-600 bp 'reads' (Illumina) • assemble reads into contigs by • de novo assembly (high-level bioinformatic algorithms - really hard!) • 'hanging' reads on the reference genome of a related organism (doable!)
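A toy sketch of the "assemble reads into contigs" step: greedily merge the pair of reads with the longest exact suffix/prefix overlap until no overlaps remain. This only shows the idea of overlap-based assembly on invented, error-free reads; real de novo assemblers use de Bruijn graph or string-graph methods and must handle sequencing errors, repeats and enormous read counts.

```python
# Toy greedy overlap assembler: repeatedly merge the best-overlapping pair.
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap_len(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:                    # no overlaps left: separate contigs remain
            break
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads

reads = ["ATGGCAGTT", "GCAGTTGACC", "TGACCATGAT"]
print(greedy_assemble(reads))   # ['ATGGCAGTTGACCATGAT']
```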
phenotypic rescue
Inject the cosmid into the mutant. Observe transgenic progeny for phenotypic rescue. Subclone individual genes from the cosmid. Observe transgenic progeny for phenotypic rescue - to identify which subclone from the cosmid is actually responsible for the phenotypic rescue.
Next Generation Sequencing - the benchtop 'genomics' revolution
NGS can be done on a large scale or a benchtop scale • Data ready in hours not days/weeks; cheap enough for one big (or several small) labs to have one of their own • Applications: • 'resequencing' (small genome / targeted / 'exome') for e.g. mutation detection in diseases or polymorphism screening/genotyping • Soon - whole genome sequencing for polymorphism genotyping AND rare variant detection for finding genes predisposing to complex diseases • 'RNAseq' - gene expression profiling - sequence mRNA pool • 'ChIPseq' - what genomic sequences does this transcription factor bind to? • Metagenomics - mixtures of organisms, see what's there
Key to creating accurate maps - Map Integration
Physical and genetic mapping data have become as important to network biology as they once were to the Human Genome Project. Integrating physical and genetic networks currently faces several challenges: increasing the coverage of each type of network; establishing methods to assemble individual interaction measurements into contiguous pathway models; and annotating these pathways with detailed functional information. A particular challenge involves reconciling the wide variety of interaction types that are currently available. For this purpose, recent studies have sought to classify genetic and physical interactions along several complementary dimensions, such as ordered versus unordered, alleviating versus aggravating, and first versus second degree.

The successful completion of the Human Genome Project depended crucially on the integration of genetic and physical maps. Genetic maps, also known as gene linkage maps, were constructed by measuring the meiotic recombination frequencies between different pairs of genetic markers. On the basis of many pairwise genetic distances, markers could be placed on a number line with short distances corresponding to low recombination frequencies. Conversely, physical maps were constructed by identifying the position of markers along the chromosome. Physical distances between markers were determined by techniques such as radiation hybrid mapping, fluorescence in situ hybridization (FISH) or, ultimately, automated DNA sequencing. Genome assembly involved a multi-step procedure in which DNA fragments were cloned, sequenced and, on the basis of the markers they were found to contain, ordered relative to each other and to the genetic map. Obtaining full coverage of the genome involved generating enough physical and genetic data so that the two maps could be reconciled. Following assembly, the physical and genetic maps were annotated and continuously updated with detailed information about functional elements. For the physical sequence map, the primary annotation task was the identification of genes; for the genetic map, it was linking genes or their surrogate genetic markers with diseases of interest.

Remarkably, the mapping of cellular regulatory and signalling networks is now proceeding in much the same way. As for genomics, large-scale genetic and physical interaction mapping projects release enormous amounts of raw data that must be filtered and interpreted biologically (BOX 1). Integration of these two types of maps is important because they provide views that are highly complementary with regard to cellular structure and function: physical interactions dictate the architecture of the cell in terms of how direct associations between molecules constitute protein complexes, signal transduction pathways and other cellular machinery. Genetic interactions define functional relationships between genes, giving insight into how this physical architecture translates into phenotype. A complete picture of the cell must necessarily integrate both aspects.
High-Throughput Gene Mapping in Caenorhabditis elegans
Positional cloning of mutations in model genetic systems is a powerful method for the identification of targets of medical and agricultural importance. To facilitate the high-throughput mapping of mutations in Caenorhabditis elegans, we have identified a further 9602 putative new single nucleotide polymorphisms (SNPs) between two C. elegans strains, Bristol N2 and the Hawaiian mapping strain CB4856, by sequencing inserts from a CB4856 genomic DNA library and using an informatics pipeline to compare sequences with the canonical N2 genomic sequence. When combined with data from other laboratories, our marker set of 17,189 SNPs provides even coverage of the complete worm genome. To date, we have confirmed >1099 evenly spaced SNPs (one every 91 ± 56 kb) across the six chromosomes and validated the utility of our SNP marker set and new fluorescence polarization-based genotyping methods for systematic and high-throughput identification of genes in C. elegans by cloning several proprietary genes. We illustrate our approach by recombination mapping and confirmation of the mutation in the cloned gene, dpy-18.

Forward genetic screens in model organisms remain a crucial tool for uncovering new biological information (Matthews and Kopczynski 2001; Sternberg 2001). These approaches require extensive recombination mapping of a mutation to discover the identity of a gene. Traditional methods in model systems have typically relied on the use of visible phenotypic markers for linkage mapping of mutations. However, single nucleotide polymorphism (SNP) markers are currently favored because of their relative abundance and because they can eliminate confounding interaction with the mutant phenotype, although in some cases outcrossing introduces genetic modifiers. To date, the only strategy for SNP-based cloning in the nematode Caenorhabditis elegans (C. elegans Sequencing Consortium 1998) is the snip-SNP approach (Wicks et al. 2001). Here we present an alternative tripartite approach for rapid SNP-based mapping in the worm. We first established a set of finely spaced genome-spanning SNP markers and then combined this resource with a tiered mapping strategy that progressively narrows the region containing the gene of interest. Finally, we used a high-throughput SNP assay that allowed reliable and rapid genotyping with low marker development costs. This strategy afforded rapid gene cloning in C. elegans and can be tailored for use in other model organisms with a sequenced genome.
making cDNA by random priming
Random hexamer primers are a mixture of oligonucleotides representing all possible sequences of that length. Random primers can be used to prime synthesis in oligo-labelling (similar to using hexamers) and in cDNA synthesis. RNA quality, priming strategy and enzyme efficiency are important parameters for obtaining a high yield of good-quality cDNA. Less studied is the impact of different primers in the reaction. Reverse transcription reactions can be primed using specific primers if relatively few mRNA species are targeted. This approach is not practical for whole-transcriptome analysis using microarrays, because it would require synthesis and mixing of thousands of specific primers. In those cases, the reverse transcription reaction is primed with oligo(dT), random hexamers or random nonamers. Oligo(dT) priming has the virtue of producing cDNA from the 3′ end of poly(A) mRNA, allowing total RNA to be used as a template. The drawback is that oligo(dT) priming often results in a 3′ bias compared with random priming. Poly(A)-selected RNA, such as an isolated mRNA fraction or amplified RNA (aRNA), is preferably reverse transcribed with random primers, because random priming is less likely to give a 3′-end bias in the resulting cDNA. The advantage of random priming is that the cDNA clones are not biased towards the 3' end of the gene.
genome sequence of c.elegans
Sequence of the entire genome. Sequence of cDNA clones. Approximately 19,500 PREDICTED protein-coding gene sequences. A large number of various kinds of functional RNAs - not discussed further.
SNP mapping c elegans
The overall strategy is schematically depicted in the figure. A mutant strain, for which no mapping information is available, is crossed with the polymorphic Hawaiian strain CB4856 and a number of F2 progeny that carry the mutant phenotype are singled onto fresh plates. Each of the isolated F2 animals represents a recombinant in which the Hawaiian chromosomes have recombined with wild-type, Bristol-derived chromosomes. The progeny of these singled F2 hermaphrodite animals (= F3 and F4 generation) are then pooled, DNA is prepared and the entire pool is subjected to WGS.

Due to meiotic recombination, in regions unlinked to the mutation the parental chromosomes will recombine in a largely non-biased manner. So as long as enough recombinants are pooled, unlinked SNP loci will appear in a roughly 50/50 ratio of Hawaiian vs. Bristol nucleotides in the sequence output pileup generated by the genome sequencer. In contrast, the closer a SNP locus is to the mutation, the more rare it is to find a recombination event between that SNP and the mutation. As a result, Hawaiian variants in the sequence pileup will be underrepresented in regions closer to the selected mutation. Finally, we expect to have only Bristol sequences in the sequence pileup very near the mutation. Identifying an extended region of pure Bristol sequence in such an approach is meaningful, since Hawaiian SNPs are uniformly distributed across all chromosomes with an average density of about 1 per 1,000 bp.

The WGS dataset is analyzed using a software tool that we recently developed, called MAQGene. We modified MAQGene to now perform two different functions: (i) it performs the conventional function of identifying homozygous variants of non-SNP loci between the mutant strain and the reference Bristol genome; (ii) it now also considers the ratios of Hawaiian vs. Bristol representation in the pileup at all known SNP loci in the genome (∼100,000 variants). Therefore, the same dataset originating from a single WGS run for a given mutant will not only reveal the SNP distribution but will also greatly improve the ability to identify the phenotype-causing mutation, which will be one of the few - if not the only - variants in an interval defined by the SNP mapping.

A key advantage of this strategy is that rather than examining an arbitrarily chosen, limited number of SNPs to assess the map position of a specific mutation (usually at most a few dozen), this strategy interrogates in a single step, at least in theory, all ∼100,000 SNPs that distinguish the Hawaiian C. elegans isolate from the Bristol isolate. Therefore, the WGS-SNP combination not only eliminates all technical aspects of SNP mapping (PCR, sequencing or restriction mapping of PCR amplicons), but also provides, in one step, a much finer-grained resolution. Moreover, it reduces the costs of mutant identification, since it employs a single whole-genome sequencing run for both mutant mapping as well as mutant identification.
Even more Genome Vital Statistics (cont.)....
The human genome is full of stuff that makes it difficult to understand: highly conserved sequence - 100 Mbp (3% of the genome); segmental duplications - 150 Mbp
cosmids and phenotypic rescue
A cosmid vector is a type of hybrid plasmid that contains a lambda phage cos sequence (cos site + plasmid). - We obtain a mutant for a particular gene (e.g. glp-4) and inject it with a clone that carries the wild-type version of the glp-4 gene. We observe whether the F1 and F2 progeny display the mutant or the wild-type phenotype. If they display the wild-type phenotype then phenotypic rescue has been accomplished and the clone did indeed contain the wild-type gene. This only works for recessive mutations.
c.elegans mutants
dpy-7: Short fat worm - exoskeletal defect. ced-4: Programmed cell death defective. unc-51: Paralysed - abnormal axons. dec-2: Long defecation cycle - genetically constipated. Once you have sequenced the genome, you need to perform genomics to interpret and make sense of it.
Positional cloning of genes defined by mutation.
The cosmids on the physical map contain the wild-type versions of the genes, but we don't know exactly which gene is in each clone/cosmid. - There are mutant strains of C. elegans made with 'deficiencies' (big chromosomal deletions that are used for genetic mapping). - If one chromosome carries the point mutation and the homologous chromosome carries a deficiency that removes the gene, both copies of the gene are effectively mutated and the animal shows the mutant phenotype; an animal that retains a wild-type copy shows the wild-type phenotype. This works for recessive mutations, i.e. when the wild-type phenotype is dominant. - Using molecular methods we can map the regions of the genome removed by a deficiency: do PCR on a deficiency homozygote and see which sequences give no amplified DNA. We use mutant strains of C. elegans with deficiencies and perform PCR on the homozygote to identify which cosmid sequences are lost in the deficiency (see the sketch below). - We then inject the mutant with the candidate wild-type cosmids and test the progeny for the wild-type phenotype. • The standard route to clone C. elegans genes defined by mutation. • The more genes cloned, the easier it becomes to clone others. • Greater positional alignment between the physical and linkage maps.
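A hypothetical sketch of the deficiency-mapping logic referred to above: markers derived from cosmids that fail to amplify from the deficiency strain lie within the deletion, so the mutated gene is expected among the cosmids covering that interval and those are the ones to test for phenotypic rescue. The cosmid names and PCR results below are invented.

```python
# Infer which cosmids fall inside a deficiency from PCR presence/absence.
pcr_results = {
    # cosmid-derived marker: did it amplify from the deficiency homozygote?
    "C05D2": True,
    "F26A1": True,
    "T10B9": False,   # no product -> sequence deleted in the deficiency
    "K08E3": False,   # no product -> sequence deleted in the deficiency
    "R05H6": True,
}

candidates = [cosmid for cosmid, amplified in pcr_results.items() if not amplified]
print("cosmids within the deficiency (test these for phenotypic rescue):")
print(", ".join(candidates))
```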
positional cloning of genes defined by mutation
We always first map within deficiencies. Mapping the first genes is the hardest because we have to inject mutants with many different cosmids until we observe a phenotypic rescue; only then can we connect the physical and genetic maps. Example: once we have cloned both unc-101 and lin-11 and identified which cosmids in the physical map give phenotypic rescue of those gene mutants, we can connect the physical and genetic maps and identify the molecular sequences of those genes. - We have already cloned the unc-101 and lin-11 genes and identified the positions of the cosmids in the physical map that give the phenotypic rescue. - The more genes we clone, the easier it is to clone other genes, and the greater the positional alignment between the physical and genetic maps. Example: to clone unc-75 and connect the physical and linkage maps we now have a smaller set of cosmids with which to attempt phenotypic rescue.
Functional Genomics
• (Genomics), Transcriptomics, Proteomics, Metabolomics • Genomics - see other slides • Transcriptomics • pattern of gene expression across tissues, developmental stages, physiological and environmental conditions • microarrays - interrogate a fixed set of genes/transcripts • RNA-seq - snapshot of ALL RNA products in the sample • Proteomics • analyse all (or a large subset of) proteins expressed across tissues, developmental stages, physiological and environmental conditions • much harder to do than transcriptomics • many more individual proteins exist than mRNAs - WHY? • Metabolomics • analyse all (or a large subset of) small molecules/metabolites present in tissues/body fluids, across developmental stages, physiological and environmental conditions
Even more Genome Vital Statistics (cont.)....
• 20,400 protein-coding genes (not the 100,000 initially predicted...!) • >24,000 other genes (gene product is a functional RNA) • Genetic variation - Around 1 in 1,000 bases differ between two random copies of the human genome (e.g. your maternal and paternal chromosomes i.e. 'heterozygosity') - (in chimps, it's ~1 in 100 bases) - Around 20,000,000 sites harbouring 'common' variation (polymorphisms; popn. freq. of 'minor allele' >~1%) across all human populations - > 600,000,000 sites harbouring rare (popn. freq. <1%) genetic variants • many of these possibly harmful/deleterious
Even more Genome Vital Statistics (cont.)....
• 20,400 protein-coding genes (not the 100,000 initially predicted...!) • >24,000 other genes (gene product is a functional RNA) • all except 1,000 or so of these have been discovered only very recently • >200,000 alternative mRNAs from protein-coding genes - alternative splicing of internal exons, alternative promoters, alternative last exons
After the genome.....
• After you get your 'complete' genome sequence, two kinds of biology become possible:- • Post-genome biology • all the normal kinds of biology, but boosted by the knowledge of the position, size, structure of each gene whose sequence or product you might be working on • Functional Genomics (Genomics, Transcriptomics, Proteomics etc.) • not just post-genome biology - highly specialised set of approaches • to do with the functioning of the genome, and its products, as a whole... • special feature of FG - work gets done on huge datasets - possibly all genes, or all transcripts, or all soluble proteins, or a whole chromosome, or all repeats,... or at least a large subset of these entities, involving analysis of patterns covering the behaviour of lots of things simultaneously • almost always involves complicated stats and bioinformatics to handle the data
Understanding predisposition to complex disorders - needs a map of all human genomic variation
• Hot topic - map of polymorphic sites in the human genome • Secondary goals - hotspots of mutation; hotspots of recombination • Main types of polymorphism of interest • SNVs - more than 20 million common, more than 300 million rare • CNVs (copy number variants) - more than 3.5 million common/rare? HAPMAP and 1000 Genomes Project - generated maps of all common varying sites - see http://www.1000genomes.org/ Nature 1st Oct. 2015 - involved fully sequencing the genomes of >2,500 individuals.
mapping with deficiencies
• An association or alignment between the physical and genetic maps. • In the first cases, map within deficiencies. • Test which cosmid sequences are lost in the deficiency. • Test cosmids for rescue of the mutant.

Four steps are involved in cloning a mutation-defined gene in higher organisms: 1. Identification of a cloned segment of DNA near the gene of interest. 2. Isolation of a contiguous stretch of DNA and construction of a physical map of DNA in the region. This region will contain the gene of interest as well as other neighboring genes. The position of the gene of interest within this cloned segment is not known. 3. Correlation of the physical map with the genetic map to localize the approximate position of the gene of interest along a cloned segment of DNA. 4. Detection of alterations in expression of transcripts or changes in DNA sequence between DNA from mutant and wild-type organisms to determine which region of the cloned DNA corresponds to the normal form of the mutation-defined gene.

The specific experimental techniques used in positional cloning may vary depending on the species and nature of the mutation. In this section, we describe some of the common techniques for carrying out each of the basic steps in this approach. Although the analysis described in the previous section may localize a mutation-defined gene of interest to a particular DNA region, it frequently cannot determine which of several neighboring genes corresponds to that specific gene. Here we describe strategies for pinpointing the precise position of the gene of interest within a larger region of cloned DNA. The objective of positional cloning is to locate a mutation-defined gene within cloned DNA so that it can be characterized at the molecular level. The first step in positional cloning involves identification of cloned DNA fragments near the gene of interest. The chromosomal location of cloned DNA fragments can be mapped to particular chromosomes by in situ hybridization or by using them as probes in Southern blot or PCR screening of mouse-human hybrid cells containing one or a few human chromosomes or portions of chromosomes. Regions of contiguous DNA can be isolated by chromosome walking starting near a gene of interest (see Figure 8-24) and by ordering of overlapping YACs (see Figure 8-25). Such DNAs can be used to construct physical maps of chromosomal regions of interest. Correlating the genetic and physical maps of a specific chromosomal region is a key step to the identification and isolation of a particular mutation-defined gene. Landmarks on the physical map that facilitate correlation with the genetic map include DNA polymorphisms and chromosomal abnormalities such as large deletions and translocations. Identification of the gene of interest within a candidate region typically requires comparison of DNA sequences between wild-type and mutant (or disease-affected) individuals.
Human Genome Project - sequencing phase
• Celera Genomics - Craig Venter - whole genome shotgun sequencing, followed by whole genome assembly - quick but much less accurate method. • Shotgun method generates millions of small, random fragments - use a supercomputer to create a de novo assembly. • This produced a race to patent the genome... - c.f. the hare and the tortoise... • Big fights between Craig Venter on one hand, and HGP Director Francis Collins and HGP spokesman John Sulston on the other. • Venter's method failed in practice - he used the public databases with their good BAC-based sequences to 'hang' his assembly on - the quality of the Celera genome sequence was not as good as that of the HGP sequence - few people bought the Celera genome assembly..... • Who has the last laugh?..... - all new model organism genomes over the last few years have been done using the whole genome shotgun approach, hanging the assembly on prior examples of reference genome sequences.....!! • Many new model organisms being sequenced have a related organism already sequenced.
Genome projects
• Completed genomes - the list now runs to >45,000 organisms - Viruses >33,000 completed incl. HIV, Polio, Ebola, Foot and Mouth - Organelles - mitochondrion, chloroplast - Microbes >200,000 completed (bacteria and archaea) incl. E. coli K12, Yersinia pestis, Chlamydia, E. coli O157 - Fungi - Saccharomyces, Schizosaccharomyces (yeast) - Protists - Plasmodium falciparum, Trypanosoma - Plants - Arabidopsis, Rice - Oryza sativa - Animals • Nematodes - Caenorhabditis elegans • Arthropods - Drosophila, Anopheles • Chordates - Ciona; amphioxus; Fugu - smallest vertebrate genome; Mus, chicken - Gallus, Cow, Pig - Homo sapiens sapiens, H. sapiens neanderthalensis, chimpanzee
Even more Genome Vital Statistics (cont.)....
• Genes in the human genome - characteristics of an 'average human gene' • proportion of genome occupied by genes 40% • exons as proportion of genome 3% • what's the rest of the genome made up of...?? - see later....
geneticists attitude to genes
• Imagine a gene you are interested in... More questions: - Where did it come from? i.e. when and how did it come into existence? - What's it related to (homologous to)... • in the same genome? - paralogues • in the genome of other organisms? - orthologues and paralogues. Orthologues - gene copies originating by speciation. Paralogues - gene copies originating by gene duplication » these other related genes constitute the 'gene family' - Genetic models of human disease • How do I know which gene in a mouse is the correct homologue (i.e. 'orthologue') of my disease gene (e.g. SMN1)? - Need to know this if I want to knock out the right gene....!! - Where are the regulatory regions influencing this gene's expression? • These are not always easy to find • Genomics approaches can reveal these regions and tell us how they work at the chromatin level - and at the same time reveal all the regions of our genome that regulate expression of all our genes
Genomic libraries - criteria for success
• Must represent the entire genome - exceptions will be 'unclonable bits' - nasty repeated regions • Must be redundant (to ensure coverage) - 6x coverage depth needed, i.e. each base in the genome represented in at least 6 clones • Must be tractable - easy to grow and won't throw out the insert - BACs - good - Cosmids - OK, but small - YACs - bad - often, clones are unstable and chuck bits out / rearrange chunks • Must be fit for purpose - for creating a library that can be arrayed on a grid for high-throughput clone mapping, the vector needs to be easy to grow and isolate DNA from: • BACs good • Cosmids OK but less good
What's in the genome that's not genes?
• Pseudogenes - defunct/old genes that are now non-functional • >12,000 pseudogenes exist - a mixture of processed (no introns) and unprocessed • (why are there so many processed pseudogenes?... hint: reverse transcription creates lots of new copies of genes) • 'Genomic repeats' • Tandem repeats • Satellites, microsatellites and minisatellites • Interspersed repeats • LINEs - Long INterspersed Nuclear Elements • L1/Kpn 17% of genome (0.6 x 10^6 copies) • SINEs - Short INterspersed Nuclear Elements • Alu repeats • Proportion of genome occupied by interspersed and tandem repeats?
Human Genome Project - sequencing phase
• Target - complete human genome sequence; draft sequence completed 2001: 90% coverage • YAC/BAC clone sequences, assembled into supercontigs - (check out the Feb. 15th 2001 issue of 'Nature') • 'Full' version completed 2003/2004 - 95% coverage • Genome sequence assembly was fed into specially designed front-end software for visualising and interrogating the genome sequence - 'genome browsers': - NCBI - http://www.ncbi.nlm.nih.gov/sites/genome - EBI - http://www.ensembl.org - Santa Cruz - http://genome.ucsc.edu - Nature 'Omics Gateway' - http://www.nature.com/omics/index.html » see the early papers - these describe the way it was done and the way it was interpreted • Human genome sequencing statistics - compiled by gangs of bioinformaticians and genome annotators - Genome Reference Consortium - building the most accurate sequence version - ENSEMBL, EBI, Cambridge - automated gene recognition - HAVANA - manual curation of gene annotations - GENCODE - combined manual and automated curation - The human 'reference genome' - identity of the base present at every position along every chromosome
More Genome Vital Statistics - the Human example
• The human genome is large, complex, and difficult to understand! • haploid size = 3,300,000,000 (3.3 x 10^9) bp = 3.3 Gbp = 3,300 Mbp • 0.34 nm per bp • haploid genome ≅ 1 m long • 23 chromosomes in the haploid genome (24 'kinds' of chromosomes: 1-22, X + Y) ⇒ each chromosome is 1-10 cm long. Packaging - how to get all that into one nucleus.... - one of the great awe-inspiring mysteries! Physical behaviour (shearing - shake and break...) • Euchromatin - 2,900 Mbp • Constitutive heterochromatin - >400 Mbp
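A quick check of the numbers in the bullets above, assuming ~0.34 nm of double helix per base pair:

```python
# Stretched length of one haploid human genome and of an average chromosome.
bp_per_haploid_genome = 3.3e9
metres_per_bp = 0.34e-9          # ~0.34 nm rise per base pair in B-DNA

genome_length_m = bp_per_haploid_genome * metres_per_bp
print(f"haploid genome length ~ {genome_length_m:.2f} m")            # ~1.1 m
print(f"average chromosome ~ {genome_length_m / 23 * 100:.1f} cm")   # ~5 cm
```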
Post-genome genomics and human biology.....
• Three main challenges (when it comes to understanding humans as organisms and improving our lives) for genomics to assist in grappling with (this logic will work for any organism...) : 1. Gene networks - how do all the genes work together? • what and where are all the genes - what do they do, how do they work • how do the genes relate to RNA and protein gene products? • can we predict the function of the genes from their sequence? • where are all the genomic regions that regulate or influence gene expression? • lots of different approaches - incl. 'epigenomics' • How does each gene relate to all the other genes (homologues = orthologues and paralogues)? • How did our genome evolve? Has it still been evolving till very recently? How did we become human? 2. Understanding predisposition to disease • Diagnosis of monogenic disorders and prediction of disease status • Understanding predisposition to complex disease (incl. cancer...) • Where is all the genomic variation and how does it relate to disease predisposition? • the holy grail.... - a map of functional polymorphisms that influence human traits and disorders, and even our evolution... 3. Engineering the genome • gene therapy - what do we have to take into account to make it work? - properties of different genomic regions for genome editing...
maps and markers
• To get a better handle on the genetic map of an organism's genome, phenotypic markers are not enough... • Genetic maps require genetic markers located at intervals along each chromosome and spanning the full length of all chromosomes • N.B. other kinds of map are possible using other kinds of marker
Next phase - sequencing your genome....
• Why do we want to sequence genomes? (rather than work on cDNA clones from individual genes...) - Predict gene positions - Know the gene sequence (not just the mRNA sequence) and predict gene functions - Analyse relationships between genes (gene families, coordinate regulation etc.) - Analyse all the variation between individuals • Human genome sequencing technology used in the HGP - Sanger method; the fluorescent sequencing version developed during the 1990s - based on separation of DNA fragments in capillaries, not gels • capillary sequencing • polymer-filled capillary, electrically conducting
How it all fits together - Mohair genes
• doing it the old-fashioned, but really accurate, way:
• clone the genome of several goats into BACs - >6x coverage
• assess overlaps between BACs using clone fingerprinting and clone-end walking
• establish a minimal tiling path (see the sketch below)
• shotgun sequence each BAC clone in the minimal tiling path
• use a supercomputer to fit the sequences together - constrain it to find overlaps between BAC clones known to overlap
• Or.... doing it the modern way
- Just fragment the genome, and use NGS to sequence every fragment, then assemble the sequences, hanging the assembly on other ruminant genome sequences
• join the sequence into 'contigs' of ~10 Mbp and publish in GenBank
• goat molecular geneticists can then (using benchtop NGS e.g. Ion Torrent / MiSeq):
- find polymorphic markers, create a linkage map of markers
- breed goat strains, follow inheritance of the desired character
- do linkage analysis of wool trait loci to find regions containing wool quality and robustness genes
- use the genome sequence and knowledge of where all the genes are to home straight in on likely genes
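The 'minimal tiling path' step is essentially an interval-cover problem: once the clones' approximate positions (or at least their overlaps) are known from fingerprinting, pick the fewest clones that still span the region. A hypothetical Python sketch of that selection step, with invented clone coordinates (a real pipeline works from overlap evidence rather than known coordinates):

    # Greedy selection of a minimal tiling path from mapped BAC clones.
    # Assumes each clone's (start, end) position is already known from
    # fingerprint/STS mapping; the coordinates below are invented.
    def minimal_tiling_path(clones, region_start, region_end):
        clones = sorted(clones, key=lambda c: c[0])     # sort by start
        path, cover_end, i = [], region_start, 0
        while cover_end < region_end:
            best = None
            # among clones starting within the covered region,
            # take the one reaching furthest to the right
            while i < len(clones) and clones[i][0] <= cover_end:
                if best is None or clones[i][1] > best[1]:
                    best = clones[i]
                i += 1
            if best is None or best[1] <= cover_end:
                raise ValueError("gap in clone coverage at %d" % cover_end)
            path.append(best)
            cover_end = best[1]
        return path

    bacs = [(0, 180), (150, 400), (120, 300), (380, 620), (590, 800)]
    print(minimal_tiling_path(bacs, 0, 800))
    # -> [(0, 180), (150, 400), (380, 620), (590, 800)]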
All new genome projects now done using Next Generation Sequencing
• generate many sequence 'reads'
• align the reads
• generate a 'consensus sequence' from the read contigs (a toy consensus-calling sketch is shown below)
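A toy Python illustration of the 'call a consensus' step, assuming the reads have already been aligned to common coordinates (real assemblers handle the alignment itself, sequencing errors and quality scores far more carefully):

    # Toy consensus calling: majority vote at each column of pre-aligned reads.
    from collections import Counter

    def consensus(aligned_reads):
        length = max(len(r) for r in aligned_reads)
        out = []
        for i in range(length):
            bases = [r[i] for r in aligned_reads if i < len(r) and r[i] != "-"]
            out.append(Counter(bases).most_common(1)[0][0] if bases else "N")
        return "".join(out)

    reads = ["ATGCCGTA--", "ATGCCGTAGC", "ATGACGTAGC"]   # '-' marks no coverage
    print(consensus(reads))   # ATGCCGTAGC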
Microsatellites - CNV
A microsatellite is a tract of repetitive DNA in which certain DNA motifs (ranging in length from one to six or more base pairs) are repeated, typically 5-50 times. Microsatellites occur at thousands of locations within an organism's genome. They have a higher mutation rate than other areas of DNA, leading to high genetic diversity. Microsatellites are often referred to as short tandem repeats (STRs) by forensic geneticists and in genetic genealogy, or as simple sequence repeats (SSRs) by plant geneticists. Microsatellites and their longer cousins, the minisatellites, are together classified as VNTR (variable number of tandem repeats) DNA. The name "satellite" DNA refers to the early observation that centrifugation of genomic DNA in a test tube separates a prominent layer of bulk DNA from accompanying "satellite" layers of repetitive DNA.

They are widely used for DNA profiling in cancer diagnosis, in kinship analysis (especially paternity testing) and in forensic identification. They are also used in genetic linkage analysis to locate a gene or a mutation responsible for a given trait or disease. Microsatellites are also used in population genetics to measure levels of relatedness between subspecies, groups and individuals.

Repetitive DNA is not easily analysed by next-generation DNA sequencing methods, which struggle with homopolymeric tracts. Therefore, microsatellites are normally analysed by conventional PCR amplification and amplicon size determination, sometimes followed by Sanger DNA sequencing. In forensics, the analysis is performed by extracting nuclear DNA from the cells of a sample of interest, then amplifying specific polymorphic regions of the extracted DNA by means of the polymerase chain reaction. Once these sequences have been amplified, they are resolved either through gel electrophoresis or capillary electrophoresis, which allows the analyst to determine how many repeats of the microsatellite sequence in question there are. If the DNA is resolved by gel electrophoresis, it can be visualized either by silver staining (low sensitivity, safe, inexpensive), by an intercalating dye such as ethidium bromide (fairly sensitive, moderate health risks, inexpensive), or, as most modern forensics labs do, with fluorescent dyes (highly sensitive, safe, expensive). Instruments built to resolve microsatellite fragments by capillary electrophoresis also use fluorescent dyes. Forensic profiles are stored in major databanks. The British database for microsatellite locus identification was originally based on the British SGM+ system using 10 loci and a sex marker. The Americans increased this number to 13 loci. The Australian database is called the NCIDD, and since 2013 it has been using 18 core markers for DNA profiling.

Microsatellites can be amplified for identification by the polymerase chain reaction (PCR) process, using the unique sequences of flanking regions as primers. DNA is repeatedly denatured at a high temperature to separate the double strand, then cooled to allow annealing of primers and the extension of nucleotide sequences through the microsatellite. This process results in production of enough DNA to be visible on agarose or polyacrylamide gels; only small amounts of DNA are needed for amplification because thermocycling in this way creates an exponential increase in the replicated segment.
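What STR genotyping boils down to, as a hypothetical Python sketch: amplify between the unique flanking sequences and count how many copies of the motif lie in the product. The flank and motif sequences below are invented; real assays measure the amplicon size by electrophoresis rather than reading the sequence directly.

    import re

    # Count tandem repeats of a motif between two (invented) primer-flank
    # sequences - roughly what sizing an STR amplicon tells you.
    def str_allele(sequence, left_flank, right_flank, motif):
        start = sequence.index(left_flank) + len(left_flank)
        end = sequence.index(right_flank, start)
        amplicon = sequence[start:end]
        match = re.match(r"(?:%s)+" % motif, amplicon)
        repeats = len(match.group(0)) // len(motif) if match else 0
        amplicon_size = len(left_flank) + (end - start) + len(right_flank)
        return repeats, amplicon_size

    allele = "GGTACC" + "CA" * 17 + "TTGGCC"
    print(str_allele(allele, "GGTACC", "TTGGCC", "CA"))   # (17, 46)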
Because PCR technology is so widely available, assays using primers that flank microsatellite loci are simple and quick to run, but developing correctly functioning primers is often a tedious and costly process. (A denser marker map, such as one built from SNPs, is potentially more informative than a microsatellite map.)
Where did our genome come from?.... 'Tree of Life'
All living organisms store genetic information using the same molecules - DNA and RNA. Written in the genetic code of these molecules is compelling evidence of the shared ancestry of all living things. Evolution of higher life forms requires the development of new genes to support different body plans and types of nutrition. Even so, complex organisms retain many genes that govern core metabolic functions carried over from their primitive past.

Genes are maintained over an organism's evolution; however, genes can also be exchanged or "stolen" from other organisms. Bacteria can exchange plasmids carrying antibiotic resistance genes through conjugation, and viruses can insert their genes into host cells. Some mammalian genes have also been adopted by viruses and later passed on to other mammalian hosts. Regardless of how an organism gets and retains a gene, regions essential for the correct function of the protein are always conserved. Some mutations can accumulate in non-essential regions; these mutations are an overall history of the evolutionary life of a gene.

• Each species' genome is descended with modification from the genome of its ancestor - all the way back to the beginning of life....
• Thus, the genomes of all organisms are 'related'
• Comparative genomics tells us about the state of the ancestor and the changes along each branch

We have learned from homologous sequence alignment that the information that can be gained by comparing two genomes together is largely dependent upon the phylogenetic distance between them. Phylogenetic distance is a measure of the degree of separation between two organisms or their genomes on an evolutionary scale, usually expressed as the number of accumulated sequence changes, number of years, or number of generations. The distances are often placed on phylogenetic trees, which show the deduced relationships among the organisms. The more distantly related two organisms are, the less sequence similarity or shared genomic features will be detected between them. Thus, only general insights about classes of shared genes can be gathered by genomic comparisons at very long phylogenetic distances (e.g., over one billion years since their separation). Over such very large distances, the order of genes and the signatures of sequences that regulate their transcription are rarely conserved. At closer phylogenetic distances (50-200 million years of divergence), both functional and non-functional DNA is found within the conserved segments. In these cases, the functional sequences will show signatures of selection by virtue of their sequences having changed less, or more slowly than, non-functional DNA. Moreover, beyond the ability to discriminate functional from non-functional DNA, comparative genomics is also contributing to the identification of general classes of important DNA elements, such as coding exons of genes, non-coding RNAs, and some gene regulatory sites. In contrast, very similar genomes separated by about 5 million years of evolution (such as human and chimpanzee) are particularly useful for finding the sequence differences that may account for subtle differences in biological form. These are sequence changes under directional selection, a process whereby natural selection favors a single phenotype and continuously shifts the allele frequency in one direction. Comparative genomics is thus a powerful and promising approach to biological discovery that becomes more and more informative as genomic sequence data accumulate.
The human karyotype - cytogenetics
Early cytological mapping efforts depended on examining chromosomes under the light microscope. All types of mapping involve measuring the positions of easily observed landmarks. Until recently, the only useful physical landmarks along human chromosomes have been cytogenetic bands. When cultured human cells are treated with suitable drugs during cell division, the chromosomes are easily viewed through the light microscope as wormlike shapes. Several staining procedures developed in the late 1960s and early 1970s imprint reproducible patterns of light and dark bands on chromosomes. The banding pattern is believed to reflect a periodicity in the spacing of certain types of DNA sequences along chromosomes. From a mapping standpoint, this banding is important in that it allows human chromosomes to be individually recognized by light microscopy and allows an average chromosome to be subdivided into 10 to 20 regions. Banding patterns provide the basis for a physical map of the chromosomes, often referred to as a cytogenetic map. In clinical genetics, examination of the banding patterns has led to diagnosis of such conditions as Down syndrome, a genetic disease usually caused by the presence of an extra copy of chromosome 21.

Since the late 1960s, it has been possible to assign many genes to locations on the cytogenetic map by the techniques of somatic cell genetics. In these techniques, rodent and human cells are fused to form hybrid cells that can be grown in culture. These cells generally lose all but one or a few human chromosomes, but different human chromosomes, or parts thereof, are retained in different cell lines. Chromosome banding is used to determine which portions of the human genome have been retained in particular cell lines. Consistent co-retention of a region of the genome and a human biochemical trait allows the genetic determinant of that trait to be assigned to a position on the cytogenetic map.

Banding patterns give us a good coordinate system. If you take blood cells, spread the blood out on a slide and then treat the material on the slide in such a way that the cell nucleus breaks open, then instead of being tightly compacted the chromatin just spreads out. Any cell at metaphase will spread its chromosomes, which stick down to the slide (a metaphase spread).

The analysis of metaphase chromosomes is one of the main tools of classical cytogenetics and cancer studies. Chromosomes are condensed (thickened) and highly coiled in metaphase, which makes them most suitable for visual analysis. Metaphase chromosomes make the classical picture of chromosomes (karyotype). For classical cytogenetic analyses, cells are grown in short-term culture and arrested in metaphase using a mitotic inhibitor. They are then used for slide preparation and banding (staining) of chromosomes, to be visualised under the microscope to study the structure and number of chromosomes (karyotype). Staining of the slides, often with Giemsa (G-banding) or quinacrine, produces a pattern of up to several hundred bands in total. Normal metaphase spreads are used in methods like FISH and as a hybridization matrix for comparative genomic hybridization (CGH) experiments. Malignant cells from solid tumors or leukemia samples can also be used for cytogenetic analysis to generate metaphase preparations.
Inspection of the stained metaphase chromosomes allows the determination of numerical and structural changes in the tumor cell genome, for example losses of chromosomal segments or translocations, which may lead to chimeric oncogenes, such as bcr-abl in chronic myelogenous leukemia. The bands don't tell us very much in themselves, but they give us a coordinate system: we can describe which bit of the chromosome we are looking at by reference to the banding pattern. It is a way of localising things in the genome, but not a high-resolution one. Because genes are GC-rich, most of them lie in the light (white) bands of the G-banding pattern, which tells us that genes may not be randomly distributed through the genome.
comparative genomes - genomic evolution
We share a common ancestor with the chimpanzee. If we line up the two sets of chromosomes, we can see a picture of the major changes that have happened to the genomes, either down the human lineage or down the chimpanzee lineage, since that common ancestor existed.
LECTURE 3 - next priority: start making the other maps needed to generate the genome sequence
maps of physically ordered markers all over the genome - several types of map possible
• random marker maps - STSs - RH mapping
• physical maps based on cloned bits of genome - YACs and BACs
• also realised that we'll never clone/identify the disease/trait genes themselves until we can relate genetic map position to something physical in the genome
• Requires integration of all the different types of map - integration can also help improve the accuracy of each kind of map
The first biochemical systems were centered on RNA
• Polymerization of the building blocks into biomolecules might have occurred in the oceans or could have been promoted by the repeated condensation and drying of droplets of water in clouds. Alternatively, polymerization might have taken place on solid surfaces, perhaps making use of monomers immobilized on clay particles, or in hydrothermal vents. The precise mechanism need not concern us; what is important is that it is possible to envisage purely geochemical processes that could lead to synthesis of polymeric biomolecules similar to the ones found in living systems. It is the next steps that we must worry about. We have to go from a random collection of biomolecules to an ordered assemblage that displays at least some of the biochemical properties that we associate with life. These steps have never been reproduced experimentally and our ideas are therefore based mainly on speculation tempered by a certain amount of computer simulation. One problem is that the speculations are unconstrained because the global ocean could have contained as many as 10^10 biomolecules per liter and we can allow a billion years for the necessary events to take place. This means that even the most improbable scenarios cannot be dismissed out of hand and a way through the resulting maze has been difficult to find.

• Progress was initially stalled by the apparent requirement that polynucleotides and polypeptides must work in harness in order to produce a self-reproducing biochemical system. This is because proteins are required to catalyze biochemical reactions but cannot carry out their own self-replication. Polynucleotides can specify the synthesis of proteins and self-replicate, but it was thought that they could do neither without the aid of proteins. It appeared that the biochemical system would have to spring fully formed from the random collection of biomolecules because any intermediate stage could not be perpetuated. The major breakthrough came in the mid-1980s when it was discovered that RNA can have catalytic activity. Those ribozymes that are known today carry out three types of biochemical reaction:
- Self-cleavage, as displayed by the self-splicing Group I, II and III introns and by some virus genomes.
- Cleavage of other RNAs.
- Synthesis of peptide bonds, by the rRNA component of the ribosome.

• In the test tube, synthetic RNA molecules have been shown to carry out other biologically relevant reactions such as synthesis of ribonucleotides, synthesis and copying of RNA molecules, and transfer of an RNA-bound amino acid to a second amino acid forming a dipeptide, in a manner analogous to the role of tRNA in protein synthesis. The discovery of these catalytic properties solved the polynucleotide-polypeptide dilemma by showing that the first biochemical systems could have been centered entirely on RNA.

• Ideas about the RNA world have taken shape in recent years. We now envisage that RNA molecules initially replicated in a slow and haphazard fashion simply by acting as templates for binding of complementary nucleotides which polymerized spontaneously. This process would have been very inaccurate, so a variety of RNA sequences would have been generated, eventually leading to one or more with nascent ribozyme properties that were able to direct their own, more accurate self-replication. It is possible that a form of natural selection operated so that the most efficient replicating systems began to predominate, as has been shown to occur in experimental systems.
A greater accuracy in replication would have enabled RNAs to increase in length without losing their sequence specificity, providing the potential for more sophisticated catalytic properties, possibly culminating in structures as complex as present-day Group I introns and ribosomal RNAs. To call these RNAs 'genomes' is a little fanciful, but the term protogenome has attractions as a descriptor for molecules that are self-replicating and able to direct simple biochemical reactions. These reactions might have included energy metabolism, based, as today, on the release of free energy by hydrolysis of the phosphate-phosphate bonds in the ribonucleotides ATP and GTP, and the reactions might have become compartmentalized within lipid membranes, forming the first cell-like structures. There are difficulties in envisaging how long-chain unbranched lipids could form by chemical or ribozyme-catalyzed reactions, but once present in sufficient quantities they would have assembled spontaneously into membranes, possibly encapsulating one or more protogenomes and providing the RNAs with an enclosed environment in which more controlled biochemical reactions could be carried out. Before the evolution of RNA polymerases, ribonucleotides that became associated with an RNA template would have had to polymerize spontaneously. This process would have been inaccurate and many RNA sequences would have been generated.
polymorphisms
• Polymorphisms - source of genetic variation between individuals - enable construction of 'genetic maps' by tracing inheritance through families
- many other types of polymorphism also known
- effect of this variation at the sequence level?
• neutral - most
• selective - small proportion (but still many?)
Maps and linkage
• ILJ and ST have gone over the idea of linking phenotypic traits to chromosomal positions - linkage analysis and the construction of linkage maps- enable us to home in on disease/trait genes even if we know (almost) nothing about the genome- based purely on recombination rates between phenotypes. Physical Maps Describe Chromosomal DNA Molecules, Whereas Genetic Linkage Maps Describe Patterns of Inheritance. Physical maps specify the distances between landmarks along a chromosome. Ideally, the distances are measured in nucleotides, so that the map provides a direct description of a chromosomal DNA molecule. The most important landmarks in physical mapping are the cleavage sites of restriction enzymes. The maps can be calibrated in nucleotides by measuring the sizes of the DNA fragments produced when a chromosomal DNA molecule is cleaved with a restriction enzyme. Restriction mapping has not yet been extended to DNA molecules as large as human chromosomes. Physical maps of human chromosomes are now based largely on the banding patterns along chromosomes as observed in the light microscope. One can only estimate the number of nucleotides represented by a given interval on the map; furthermore, the amount of DNA present in different bands of the same size may not be constant since there are likely to be regional variations in the extent to which chromosomes condense during cell division. Nonetheless, cytogenetic maps are considered to be physical maps because they are based on measurements of actual distance. In contrast, genetic linkage maps describe the arrangement of genes and DNA markers on the basis of the pattern of their inheritance. Genes that tend to be inherited together (i.e., linked) are close together on such maps, and those inherited independently of one another are distant. Genes from different chromosomes are inherited independently and thus are always unlinked. Genes on the same chromosome can be tightly or loosely linked or unlinked, as reflected in the probability that they will be separated from one another during sperm or egg production. The genes can be separated if the chromosome breaks and exchanges parts with the other member of the chromosome pair, a process know as crossing over or genetic exchange. The farther apart two genes are on the chromosome, the more frequently such an exchange will occur between them. Exchange is a complex genetic process that accompanies the formation of sperm cells in the male and egg cells in the female. Unlike other cells, which contain two copies of each chromosome (except for the special case of the X and Y chromosomes in males), sperm and egg cells contain only a single copy of each chromosome. A particular sperm or egg cell, however, does not simply receive a precise copy of one of the two parental versions of each chromosome: Instead, each sperm or egg receives a unique composite of the two versions, produced by the series of cutting and splicing events that constitute genetic exchange. Indeed, the great variety of individual chromosomes that can be produced by exchange and independent assortment is responsible for much of the genetic individuality of different humans. The order of genes on a chromosome measured by linkage maps is the same as the order in physical maps, but there is no constant scale factor that relates physical and genetic distances. This variation in scale exists because the process of exchange does not occur equally at all places along a chromosome. 
Nor does exchange take place at the same rate in the two sexes; hence, as maps become more accurate, there will have to be separate genetic linkage maps for males and females. Because they describe the arrangement of genes at the most fundamental level, physical maps are gaining in importance relative to genetic linkage maps in most areas of biological research. They can never displace genetic linkage maps, however, which are distinctive in their ability to map traits that can be recognized only in whole organisms. Disease genes are particularly important illustrations of this point. Huntington's disease and cystic fibrosis, for example, have catastrophic effects on patients, but cannot be recognized in the types of cultured cells that are suitable for genetic studies. Only by studying the patterns in which these diseases are inherited in affected families has it been possible to localize the defective genes on chromosome maps. Because of the unique ability of genetic linkage mapping to define and localize disease genes, increasing the number of genetic markers available for this type of mapping should receive major emphasis in any overall program to map the human genome. A type of physical map that provides information on the approximate location of expressed genes is a complementary DNA (cDNA) map. A gene that is expressed will produce messenger RNA (mRNA) molecules in those cells in which the gene is active. The physical mapping of expressed genes (exons) is possible by using the DNA prepared from messenger RNA in the process called reverse transcription (in which an enzyme synthesizes a complementary strand of DNA by copying an RNA molecule that serves as a template). The availability of cDNAs permits the localization of genes of unknown function, including genes that are expressed only in differentiated tissues, such as the brain, and at particular stages of development and differentiation. Because they are expressed, they are likely to be the biologically most interesting part of the genome and therefore can usefully be the focus for early sequencing. In addition, knowledge of their map locations provides a set of likely candidate genes to test once the approximate location of a gene that is altered in a particular disorder has been mapped by genetic linkage techniques. To this point, about 4,100 expressed gene loci have been identified by all methods. Identification of the rest of the 50,000 to 100,000 genes in the haploid genome will come eventually with complete sequencing, but can be greatly facilitated in the immediate future by the cDNA map. This map contains information of great biological and medical significance simply because it represents the expressed portion of the genome.
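As a small worked example of the recombination-based distances discussed above, here is a Python sketch that turns invented test-cross progeny counts into a recombination fraction and an approximate map distance:

    # Estimate recombination frequency between two linked markers from
    # (invented) test-cross progeny counts, and convert to map distance.
    parental = {"AB": 412, "ab": 396}        # non-recombinant classes
    recombinant = {"Ab": 47, "aB": 45}       # recombinant classes

    total = sum(parental.values()) + sum(recombinant.values())
    rf = sum(recombinant.values()) / total   # recombination fraction
    print(f"RF = {rf:.3f}  ->  ~{rf*100:.1f} cM (map units)")
    # RF ~ 0.102 -> ~10.2 cM; for small distances 1% recombination ~ 1 cM,
    # and in humans 1 cM corresponds very roughly to 1 Mb of DNA.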
•Traditionally,there were two main strategies for sequencing a genome, and two sets of resources that were needed
- 1) Clone / physical map-guided
• build 'physical' maps from real, very large, bits of DNA - a large clone library
• then fragment each large clone, make a library of the tiny fragments by cloning into a small clone vector, and sequence everything.... - reassemble the sequence of each large clone
- 2) Whole genome shotgun (WGS) approach
• fragment the WHOLE genome in one go, [skipping the large clone library], make a library of the tiny genomic fragments by cloning straight into a small clone vector, and sequence everything.... - assemble the whole genome sequence in a one-er
genomes
- Early genomes made of RNA (maybe; other theories now competing)
• RNA world - no cells (in the modern sense), just RNA, starting with 1 gene
• at some point, an RNA gene came into being that had ribonucleotide polymerase activity - acted as the catalyst for its own copying...?
• Later on - translation - encoded info for production of proteins - involves nucleic acids 'coding for' ('encoding' is the proper term) proteins
- Later emergence of DNA as the info store - genome stability - less labile
- Modern functions of ribonucleic acids
• coding - proteins via mRNA
• catalytic - ribozymes
• structural (& catalytic...) - rRNA, tRNA, 7SL RNA, snoRNAs etc.
• regulatory - miRNAs, lncRNAs, circRNAs
first goal
-> create a high-quality linkage map of the human genome
1. markers based solely on PCR's ability to amplify a specific part of the genome: if you have any bit of sequence from the genome (cloned, along with many other clones), you can read enough sequence from that clone to design 2 PCR primers facing towards each other, and then use PCR to amplify the bit of DNA between them
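A hypothetical 'in silico PCR' sketch in Python of what such a marker (an STS) amounts to: two primers that face each other and define a single amplifiable product. The template and primer sequences are invented, and real primer design also has to worry about melting temperature, uniqueness in the genome, and so on.

    # Toy 'in silico PCR': find the product defined by a forward primer and a
    # reverse primer (the reverse primer is given 5'->3' on the opposite strand).
    def revcomp(seq):
        return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

    def in_silico_pcr(template, fwd_primer, rev_primer):
        start = template.find(fwd_primer)
        end = template.find(revcomp(rev_primer), start)
        if start == -1 or end == -1:
            return None                  # no product - not a usable STS here
        return template[start:end + len(rev_primer)]

    template = "TTGACCTAGGCAATCGGATCCTTACGGAATTCCGTAAGGCATGCAAGTACGT"
    print(in_silico_pcr(template, "GACCTAGG", "ACTTGCATG"))
    # prints the 46 bp product spanning the two primer sites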
Genome
A genome is the complete set of genetic information in an organism. It provides all of the information the organism requires to function. In living organisms, the genome is stored in long molecules of DNA called chromosomes. Small sections of DNA, called genes, code for the RNA and protein molecules required by the organism. In eukaryotes, each cell's genome is contained within a membrane-bound structure called the nucleus. Prokaryotes, which contain no inner membranes, store their genome in a region of the cytoplasm called the nucleoid. The full range of RNA molecules expressed by a genome is known as its transcriptome, and the full assortment of proteins produced by the genome is called its proteome. There are 23 pairs of chromosomes in the human genome. Between 1990 and 2003, all twenty-three pairs were fully sequenced through an international research undertaking known as the Human Genome Project. The study and analysis of genomes is called genomics.
Two scenarios for the evolution of the first coding RNA
A ribozyme could have evolved to have a dual catalytic and coding function (A), or a ribozyme could have synthesized a coding molecule (B). In both examples, the amino acids are shown attaching to the coding molecule via small adaptor RNAs, the presumed progenitors of today's tRNAs.
Whole genome shotgun sequencing...........2
A BAC contig that covers the entire genomic area of interest makes up the tiling path. Once a tiling path has been found, the BACs that form this path are sheared at random into smaller fragments and can be sequenced using the shotgun method on a smaller scale. Although the full sequences of the BAC contigs are not known, their orientations relative to one another are known. There are several methods for deducing this order and selecting the BACs that make up a tiling path. The general strategy involves identifying the positions of the clones relative to one another and then selecting the least number of clones required to form a contiguous scaffold that covers the entire area of interest. The order of the clones is deduced by determining the way in which they overlap. Overlapping clones can be identified in several ways. A small radioactively or chemically labeled probe containing a sequence-tagged site (STS) can be hybridized onto a microarray upon which the clones are printed. In this way, all the clones that contain a particular sequence in the genome are identified. The end of one of these clones can then be sequenced to yield a new probe and the process repeated, in a method called chromosome walking.

Alternatively, the BAC library can be restriction-digested. Two clones that have several fragment sizes in common are inferred to overlap because they contain multiple similarly spaced restriction sites in common. This method of genomic mapping is called restriction fingerprinting because it identifies a set of restriction sites contained in each clone. Once the overlap between the clones has been found and their order relative to the genome is known, a scaffold of a minimal subset of these contigs that covers the entire genome is shotgun-sequenced.

Because it involves first creating a low-resolution map of the genome, hierarchical shotgun sequencing is slower than whole-genome shotgun sequencing, but relies less heavily on computer algorithms. The process of extensive BAC library creation and tiling path selection, however, makes hierarchical shotgun sequencing slow and labor-intensive. Now that the technology is available and the reliability of the data has been demonstrated, the speed and cost efficiency of whole-genome shotgun sequencing have made it the primary method for genome sequencing.

In whole-genome shotgun sequencing, the entire genome is sheared randomly into small fragments (appropriately sized for sequencing) and then reassembled. In hierarchical shotgun sequencing, the genome is first broken into larger segments; after the order of these segments is deduced, they are further sheared into fragments appropriately sized for sequencing.

Newer sequencing technologies: the classical shotgun sequencing was based on the Sanger sequencing method, which was the most advanced technique for sequencing genomes from about 1995-2005. The shotgun strategy is still applied today, however, using other sequencing technologies, such as short-read sequencing and long-read sequencing. Short-read or "next-gen" sequencing produces shorter reads (anywhere from 25-500 bp) but many hundreds of thousands or millions of reads in a relatively short time (on the order of a day). This results in high coverage, but the assembly process is much more computationally intensive. These technologies are vastly superior to Sanger sequencing due to the high volume of data and the relatively short time it takes to sequence a whole genome.
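The restriction-fingerprinting logic in miniature, as a Python sketch with invented fragment sizes (a real pipeline scores matches statistically and allows for sizing error):

    # Infer overlaps between clones from shared restriction-fragment sizes
    # ('fingerprints'). Fragment sizes in kb are invented.
    fingerprints = {
        "BAC_A": {4.2, 7.9, 11.5, 3.1, 6.6},
        "BAC_B": {7.9, 11.5, 6.6, 9.4, 2.2, 5.5},
        "BAC_C": {9.4, 2.2, 5.5, 8.8, 1.7},
        "BAC_D": {5.5, 8.8, 1.7, 12.3, 0.9},
    }

    MIN_SHARED = 3   # require several bands in common before calling an overlap
    clones = sorted(fingerprints)
    for i, a in enumerate(clones):
        for b in clones[i + 1:]:
            shared = fingerprints[a] & fingerprints[b]
            if len(shared) >= MIN_SHARED:
                print(f"{a} overlaps {b}: shared bands {sorted(shared)}")
    # -> A-B, B-C and C-D overlap: a candidate contig A-B-C-D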
Genetic Markers
(Figure: chicks pictured on top of a genetic map of the chicken. The chicken genome has 39 pairs of chromosomes, whereas the human genome contains 23 pairs.)

A genetic marker is a gene or DNA sequence with a known location on a chromosome that can be used to identify individuals or species. It can be described as a variation (which may arise due to mutation or alteration in the genomic loci) that can be observed. A genetic marker may be a short DNA sequence, such as a sequence surrounding a single base-pair change (single nucleotide polymorphism, SNP), or a long one, like minisatellites.

For many years, gene mapping was limited to identifying organisms by traditional phenotype markers. This included genes that encoded easily observable characteristics such as blood types or seed shapes. The insufficient number of these types of characteristics in several organisms limited the mapping efforts that could be done. This prompted the development of gene markers which could identify genetic characteristics that are not readily observable in organisms (such as protein variation).

• SNPs - a subset of SNPs are RFLPs
• Microsatellites - a type of CNV
• Other CNVs - could include Alu repeat insertions
coding sequences
DNA sequences that carry the instructions to make proteins are coding sequences. The proportion of the genome occupied by coding sequences varies widely. A larger genome does not necessarily contain more genes, and the proportion of non-repetitive DNA decreases with increasing genome size in complex eukaryotes. Simple eukaryotes, such as C. elegans and the fruit fly, have more non-repetitive DNA than repetitive DNA, while the genomes of more complex eukaryotes tend to be composed largely of repetitive DNA. In some plants and amphibians, the proportion of repetitive DNA is more than 80%. Similarly, only about 2% of the human genome codes for proteins.
What are the advantages of clone-by-clone sequencing?
Every fragment of DNA is taken from a known region of the genome, so it is relatively easy to determine where there are any gaps in the sequence. Assembly is more reliable because a genome map is followed, so the scientists know where the larger fragments are in relation to each other. As each fragment is distinct, many people can work on the genome at one time.
RFLPs Are Useful for Interrelating Physical and Genetic Linkage Maps
Genetic linkage mapping allows those genes with no known cellular or molecular effects to be located on the human genome. On the other hand, physical maps describe the DNA molecules present in chromosomes. RFLP markers can easily be localized on either type of map. Not only can RFLP markers be placed on the genetic linkage map in family studies, but also, because the probes that are used to recognize RFLPs are themselves DNA molecules, their positions on a physical map can be determined in a variety of straightforward ways. Exact alignment between the genetic linkage and physical maps of the human genome at a large number of sites is therefore possible. This will greatly facilitate finding the actual DNA sequences that correspond to a gene once such a gene is localized on the genetic linkage map. In addition, making maps continuous across entire chromosomes will be easier by genetic linkage mapping, whereas maps of higher resolution (finer than a million nucleotides) will be easier to achieve by physical mapping. The more points at which the two maps can be exactly aligned, the greater the opportunity to take advantage of this complementarity, which will help solve the connectivity problem that arises when making maps of high resolution. RFLP mapping provides a powerful, comprehensive approach to the study of inherited diseases. Ideally, the centerpiece of this approach would be a reference RFLP map, at 1 cM resolution, determined from normal families. Once completed, the project of constructing such a map would provide human geneticists with a permanent archive of several thousand DNA probes that would detect polymorphisms throughout the genome at an average spacing of 1 million nucleotides. To apply this resource to the study of a particular inherited disease, an investigator would test DNA samples from families afflicted by a particular inherited disease with a uniformly spaced subset of perhaps 5 percent of these probes. Once rough linkage was tentatively detected, typically with a recombination frequency of 10 percent between the mutant gene that caused the disease and the polymorphism that was detected by the probe, the linkage could be rapidly confirmed and the position of the disease gene refined by follow-up analyses conducted with more closely spaced probes, selected to cover the region of interest thoroughly. Because the same RFLP polymorphisms are not segregating in all families, more sites are required than might seem necessary. For this reason more reference pedigrees are needed. In addition, research in highly polymorphic sites and ways of detecting them should be encouraged. At present, genetic linkage mapping with RFLPs is often begun with essentially random probe collections; once weak linkage is detected, the refinement of the position of the disease gene is extremely laborious since new sets of probes must be developed. Nonetheless, when major resources are directed to the study of particular diseases—such as cystic fibrosis and Huntington's disease—progress can be impressive. Only a few years ago, nothing was known about the position in the genome of the gene responsible for either of these diseases, and no compelling evidence existed that either was caused by mutations in the same gene in different afflicted families. Now, as a result of the RFLP approach, both genes have been mapped with great precision and shown to have a common genetic basis in most or all cases. 
Equally important, the RFLP approach, because of its ability to interrelate genetic linkage and physical mapping, has laid the groundwork for locating and analyzing the actual DNA sequences responsible for the diseases by coupled strategies of physical mapping and cloning, starting with the DNA clones used to probe for the linked RFLPs. Generalization of this strategy to the large variety of known inherited disorders could be expected to advance our understanding of basic human biology as well as to direct improvements in the diagnosis and treatment of many diseases. The reference RFLP map for the human—and its associated collection of well-tested DNA probes—would dramatically improve the efficiency of this research, allow the study of diseases in smaller family groups, and improve the practicality of studying diseases that are caused by alterations in more than one gene. The study of multigenic disorders could ultimately revolutionize medicine, since there are likely to be multigenic genetic predispositions to such common disorders as cancer, heart disease, and schizophrenia.
The first DNA genomes
How did the RNA world develop into the DNA world? The first major change was probably the development of protein enzymes, which supplemented, and eventually replaced, most of the catalytic activities of ribozymes. There are several unanswered questions relating to this stage of biochemical evolution, including the reason why the transition from RNA to protein occurred in the first place. Originally, it was assumed that the 20 amino acids in polypeptides provided proteins with greater chemical variability than the four ribonucleotides in RNA, enabling protein enzymes to catalyze a broader range of biochemical reactions, but this explanation has become less attractive as more and more ribozyme-catalyzed reactions have been demonstrated in the test tube. A more recent suggestion is that protein catalysis is more efficient because of the inherent flexibility of folded polypeptides compared with the greater rigidity of base-paired RNAs. Alternatively, enclosure of RNA protogenomes within membrane vesicles could have prompted the evolution of the first proteins, because RNA molecules are hydrophilic and must be given a hydrophobic coat, for instance by attachment to peptide molecules, before being able to pass through or become integrated into a membrane. The transition to protein catalysis demanded a radical shift in the function of the RNA protogenomes. Rather than being directly responsible for the biochemical reactions occurring in the early cell-like structures, the protogenomes became coding molecules whose main function was to specify the construction of the catalytic proteins. Whether the ribozymes themselves became coding molecules, or coding molecules were synthesized by the ribozymes is not known, although the most persuasive theories about the origins of translation and the genetic code suggest that the latter alternative is more likely to be correct. Whatever the mechanism, the result was the paradoxical situation whereby the RNA protogenomes had abandoned their roles as enzymes, which they were good at, and taken on a coding function for which they were less well suited because of the relative instability of the RNA phosphodiester bond, resulting from the indirect effect of the 2′-OH group. A transfer of the coding function to the more stable DNA seems almost inevitable and would not have been difficult to achieve, reduction of ribonucleotides giving deoxyribonucleotides which could then be polymerized into copies of the RNA protogenomes by a reverse-transcriptase-catalyzed reaction. The replacement of uracil with its methylated derivative thymine probably conferred even more stability on the DNA polynucleotide, and the adoption of double-stranded DNA as the coding molecule was almost certainly prompted by the possibility of repairing DNA damage by copying the partner strand According to this scenario, the first DNA genomes comprised many separate molecules, each specifying a single protein and each therefore equivalent to a single gene. The linking together of these genes into the first chromosomes, which could have occurred either before or after the transition to DNA, would have improved the efficiency of gene distribution during cell division, as it is easier to organize the equal distribution of a few large chromosomes than many separate genes. As with most stages in early genome evolution, several different mechanisms by which genes might have become linked have been proposed
summary.......2
III. STS (sequence-tagged site): a short, unique sequence of DNA that can be amplified by PCR. STSs are ideal landmarks during map construction (easily detectable by PCR). The most critical aspect of an STS description is the DNA sequence of the 2 primers.
IV. Contig (contiguous = sharing an edge or touching, with overlapping regions of a genome): an organized set of DNA clones that collectively provide redundant cloned coverage of a region too long to clone in one piece.
V. YAC (yeast artificial chromosome): a cloning system for DNA segments of up to 1 Mb.
VI. Cosmid: 20-40 kb DNA fragment. Both YACs and cosmids are used for cloning, enabling the use of these clones for sequencing (fragmenting) by several groups.
VII. FISH (fluorescence in situ hybridization): a physical mapping technique employing fluorescein-labeled DNA probes that can detect segments of the human genome by DNA-DNA hybridization on samples of condensed chromosomes of lysed metaphase cells.
VIII. Centimorgan (cM): a unit of measure of recombination frequency. 1 cM represents a 1% chance that a marker at one genetic locus will be separated from a marker at a second locus as a consequence of crossing over in a single generation. In humans, 1 cM is about 1 Mbp.

b) Methods
I. Macrorestriction maps (top-down mapping): fragmenting chromosomes with a rare-cutting restriction enzyme into large pieces, which are then ordered, subdivided and mapped. This results in more continuity and fewer gaps than the contig method, but it has a lower map resolution.
II. Contig map (bottom-up mapping): cutting a chromosome into small pieces, each cloned and ordered, forming contiguous DNA blocks.
III. Positional cloning: the markers are used for the gene hunt. Once the gene is located, physical maps are used to obtain flanking DNA segments for further detailed study (mostly pertaining to regulation of gene function).
IV. STS-content mapping: provides the means to establish the overlaps between each clone and its nearest neighbors. If 2 clones share even a single STS, they can reliably be assumed to overlap. Using the YAC system and STS-content mapping, physical maps of human chromosomes 21 and Y, and a large part of X, have been published. (YACs give much larger segments (contigs) than those observed with cosmid clones.) The problem with YACs (and more so with cosmids) is the orientation of the contig obtained. This is overcome by the FISH method, where a fluorescence-labeled probe that binds to a chromosome is visualized through the light microscope.
V. Radiation hybrid mapping: involves fragmentation of chromosomes in cultured cells with high doses of X-rays, followed by incorporation of the fragments into stable cell lines.
VI. PCR markers: based on short, repetitive DNA sequences widely distributed in the human genome, such as (CA)n (n = number of repetitions of the dinucleotide CA). n is highly variable in different regions of the human genome. The difference in n results in different copy lengths, detectable by electrophoresis.
VII. Positional cloning: identifying a gene inflicting a disease (inheritable disease).
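STS-content mapping (item IV above) reduces to a simple rule: clones sharing an STS are assumed to overlap, so shared STSs chain clones into contigs. A Python sketch with invented clone/STS data:

    # STS-content mapping sketch: clones sharing an STS are assumed to overlap,
    # so contigs are the connected components of the 'shares an STS' graph.
    yac_sts = {
        "YAC1": {"sts3", "sts5"},
        "YAC2": {"sts5", "sts8", "sts2"},
        "YAC3": {"sts2", "sts7"},
        "YAC4": {"sts9"},            # shares nothing - belongs to another contig
    }

    contigs, unplaced = [], set(yac_sts)
    while unplaced:
        group = {unplaced.pop()}
        grew = True
        while grew:
            grew = False
            for clone in list(unplaced):
                if any(yac_sts[clone] & yac_sts[g] for g in group):
                    group.add(clone)
                    unplaced.discard(clone)
                    grew = True
        contigs.append(sorted(group))
    print(contigs)   # [['YAC1', 'YAC2', 'YAC3'], ['YAC4']] (contig order may vary)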
What are the disadvantages of clone-by-clone sequencing?
Making clones and generating genome maps takes a long time. Clone-by-clone sequencing is generally more expensive than other sequencing methods. Some parts of the chromosomes, such as the centromeres, are difficult to clone because they contain long repetitive sections, which makes them difficult to cut and clone into BACs. As a result, these regions cannot easily be sequenced using clone-by-clone methods.
Mutation Rates
Mutation rates differ between species and even between different regions of the genome of a single species. Spontaneous mutations occur frequently and can cause various changes in the genome. Mutations can result in the addition or deletion of one or more nucleotide bases. An insertion or deletion that is not a multiple of three bases causes a frameshift mutation, in which the entire downstream code is read in the wrong frame, often resulting in a non-functional protein (see the small illustration below). A mutation in a promoter region, an enhancer region or a region coding for transcription factors can also result in a loss of function, or in upregulation or downregulation of transcription of that gene. Mutations occur constantly in an organism's genome and can have a negative effect, a positive effect or no effect at all.
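A small Python illustration of why an indel that is not a multiple of three is so damaging: every codon downstream of the change is read differently. The sequence is a toy example and only the codons actually used are included in the lookup table.

    # Illustrate a frameshift: deleting one base changes every downstream codon.
    CODON_TABLE = {
        "ATG": "M", "AAA": "K", "GAA": "E", "TTC": "F", "GGC": "G",
        "CTG": "L", "AAG": "K", "TGA": "*",
        "AAT": "N", "TCG": "S", "GCC": "A",   # codons created by the frameshift
    }

    def translate(dna):
        protein = []
        for i in range(0, len(dna) - 2, 3):
            aa = CODON_TABLE.get(dna[i:i+3], "?")
            protein.append(aa)
            if aa == "*":          # stop codon ends translation
                break
        return "".join(protein)

    wild_type = "ATGAAAGAATTCGGCCTGAAGTGA"
    mutant = wild_type[:6] + wild_type[7:]     # delete one base (-1 frameshift)

    print(translate(wild_type))   # MKEFGLK*
    print(translate(mutant))      # MKNSA*  (different residues, premature stop)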
Noncoding sequences
Noncoding sequences include introns, sequences for non-coding RNAs, regulatory regions, and repetitive DNA. Noncoding sequences make up 98% of the human genome. There are two categories of repetitive DNA in the genome: tandem repeats and interspersed repeats
Pseudogenes
Often a result of spontaneous mutation, pseudogenes are dysfunctional genes derived from previously functional gene relatives. There are many mechanisms by which a functional gene can become a pseudogene, including the deletion or insertion of one or multiple nucleotides. This can result in a shift of reading frame, causing the gene to no longer code for the expected protein, in a premature stop codon, or in a mutation in the promoter region. Often-cited examples of pseudogenes within the human genome include the once-functional olfactory gene families. Over time, many olfactory genes in the human genome became pseudogenes and were no longer able to produce functional proteins, explaining the poor sense of smell humans possess in comparison to their mammalian relatives.
Tandem repeats
Short, non-coding sequences that are repeated head-to-tail are called tandem repeats. Microsatellites consist of 2-5 basepair repeat units, while minisatellite repeat units are 30-35 bp. Tandem repeats make up about 4% of the human genome and 9% of the fruit fly genome. Tandem repeats can be functional. For example, telomeres are composed of the tandem repeat TTAGGG in mammals, and they play an important role in protecting the ends of the chromosome.

In other cases, expansions in the number of tandem repeats in exons or introns can cause disease. For example, the human gene huntingtin typically contains 6-29 tandem repeats of the nucleotides CAG (encoding a polyglutamine tract). An expansion to over 36 repeats results in Huntington's disease, a neurodegenerative disease. Twenty human disorders are known to result from similar tandem repeat expansions in various genes. The mechanism by which proteins with expanded polyglutamine tracts cause death of neurons is not fully understood. One possibility is that the proteins fail to fold properly and avoid degradation, instead accumulating in aggregates that also sequester important transcription factors, thereby altering gene expression. Tandem repeats are usually caused by slippage during replication, unequal crossing-over and gene conversion.
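The huntingtin example can be made concrete with a short Python sketch that counts the CAG run and applies the simplified >36-repeat threshold mentioned above; the flanking sequences are invented and real diagnostic cut-offs are more nuanced.

    import re

    # Count the CAG repeat tract in a (toy) huntingtin-like sequence.
    def cag_repeats(seq):
        runs = re.findall(r"(?:CAG)+", seq)
        return max((len(r) // 3 for r in runs), default=0)

    normal   = "ATGGCG" + "CAG" * 21 + "CCGCCA"
    expanded = "ATGGCG" + "CAG" * 42 + "CCGCCA"

    for name, seq in (("normal", normal), ("expanded", expanded)):
        n = cag_repeats(seq)
        status = "expanded (Huntington's range)" if n > 36 else "within normal range"
        print(f"{name}: {n} CAG repeats - {status}")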
the two genomes
The nuclear genome comprises approximately 3,200,000,000 nucleotides of DNA, divided into 24 linear molecules, the shortest 50,000,000 nucleotides in length and the longest 260,000,000 nucleotides, each contained in a different chromosome. These 24 chromosomes consist of 22 autosomes and the two sex chromosomes, X and Y. The mitochondrial genome is a circular DNA molecule of 16,569 nucleotides, multiple copies of which are located in the energy-generating organelles called mitochondria. Each of the approximately 10^13 cells in the adult human body has its own copy or copies of the genome, the only exceptions being those few cell types, such as red blood cells, that lack a nucleus in their fully differentiated state. The vast majority of cells are diploid and so have two copies of each autosome, plus two sex chromosomes, XX for females or XY for males - 46 chromosomes in all. These are called somatic cells, in contrast to sex cells or gametes, which are haploid and have just 23 chromosomes, comprising one of each autosome and one sex chromosome. Both types of cell have about 8000 copies of the mitochondrial genome, 10 or so in each mitochondrion.
Why is Genome Sequencing Important?
To obtain a 'blueprint' - DNA carries all the instructions needed for cell development and function
DNA underlies almost every aspect of human health, both in function and dysfunction
To study gene expression in a specific tissue, organ or tumor
To study human variation
To study how humans relate to other organisms
To find correlations between genome information and the development of cancer, susceptibility to certain diseases, and drug metabolism (pharmacogenomics)
Outlook: Personalized Genomics (George Church, Harvard)
Transposable Elements
Transposable elements are regions of DNA that can be inserted into the genetic code through one of two mechanisms. These mechanisms work similarly to "cut-and-paste" and "copy-and-paste" functionalities in word processing programs. The "cut-and-paste" mechanism works by excising DNA from one place in the genome and inserting itself into another location in the code. The "copy-and-paste" mechanism works by making a genetic copy or copies of a specific region of DNA and inserting these copies elsewhere in the code. The most common transposable element in the human genome is the Alu sequence, which is present in the genome over one million times.
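The 'cut-and-paste' versus 'copy-and-paste' distinction can be written down directly as string operations on a toy genome (positions and sequences invented; real transposition also involves specific enzymes and target-site duplications):

    # Toy genome-as-string illustration of the two transposition mechanisms.
    def cut_and_paste(genome, te, new_pos):
        """Excise the element, then insert it elsewhere (DNA transposon style)."""
        excised = genome.replace(te, "", 1)                  # cut
        return excised[:new_pos] + te + excised[new_pos:]    # paste

    def copy_and_paste(genome, te, new_pos):
        """The original copy stays put; a new copy is inserted (retro-style)."""
        return genome[:new_pos] + te + genome[new_pos:]

    TE = "ALUALUALU"                       # stand-in for a transposable element
    genome = "aaaaaa" + TE + "cccccccccc"

    print(cut_and_paste(genome, TE, 3))    # element has moved; genome same length
    print(copy_and_paste(genome, TE, 20))  # two copies now; genome has grown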
Mapping the genome - the early years.
• Before the idea of a 'genome project' came along, all progress was slow and done one locus/gene at a time by lots of individual research groups
• Strategies from yesteryear for mapping genes & loci:
- Analysis of chromosomal deletions
- Genetic maps based on 'linkage' between phenotypic traits (e.g. blood groups) and using a very few DNA-based genetic markers
- FISH - fluorescent in situ hybridisation
- Somatic cell hybrids
Features of genomes
• Features of genomes
- Information content - coded recipe book - the genome as replicator
- Genes - cooperating 'units' of replicator function
- Genomic DNA may comprise >1 individual DNA molecule
• Chromosome - no. per genome varies from 1 to dozens - each is a single (double-stranded) molecule of DNA - chromosomes as linear 'arrays' of genes
• Mitochondrial DNA, plastid DNA - endosymbiont relicts
• Plasmids and other independently replicating mobile elements
- Flavours of DNA sequence
• Single-copy DNA - unique sequence - includes most genes
• Repetitive DNA
- selfish, 'parasitic' elements - 'genomic repeats' » these have an effect on the structure and evolution of the genome
- structural elements » centromeres, matrix attachment regions (MARs)
replicators - Dawkins selfish gene idea
- Molecules (or other entities) that get themselves copied.... - The important aspect of this is the information content • a genome is akin to a coded recipe book
first priority-make a better linkage map
- first generation linkage map pre-HGP was sparse and made from difficult-to-type RFLP markers - next generation linkage map constructed by Généthon and CEPH using microsatellite markers
How can a molecule get itself copied?
In molecular biology, DNA replication is the biological process of producing two identical replicas of DNA from one original DNA molecule. DNA replication occurs in all living organisms, acting as the most essential part of biological inheritance. The cell possesses the distinctive property of division, which makes replication of DNA essential. DNA is made up of a double helix of two complementary strands. During replication, these strands are separated. Each strand of the original DNA molecule then serves as a template for the production of its counterpart, a process referred to as semiconservative replication. As a result of semi-conservative replication, the new helix will be composed of an original DNA strand as well as a newly synthesized strand. Cellular proofreading and error-checking mechanisms ensure near-perfect fidelity for DNA replication.

In a cell, DNA replication begins at specific locations, or origins of replication, in the genome. Unwinding of DNA at the origin and synthesis of new strands, accommodated by an enzyme known as helicase, results in replication forks growing bi-directionally from the origin. A number of proteins are associated with the replication fork to help in the initiation and continuation of DNA synthesis. Most prominently, DNA polymerase synthesizes the new strands by adding nucleotides that complement each (template) strand. DNA replication occurs during the S-stage of interphase.

DNA replication (DNA amplification) can also be performed in vitro (artificially, outside a cell). DNA polymerases isolated from cells and artificial DNA primers can be used to start DNA synthesis at known sequences in a template DNA molecule. Polymerase chain reaction (PCR), ligase chain reaction (LCR), and transcription-mediated amplification (TMA) are examples.

• catalyse its own synthesis
• molecular 'tool-use' - it could co-opt another molecule that happens to 'know' how to replicate it
• instruct the synthesis of a catalyst for its own synthesis - 'coding' information
- implies the existence of a 'product' of the code » the product is related to the replicator, but is not itself a replicator
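A minimal Python sketch of two ideas from this section: semiconservative copying (each parental strand templates its complement) and the exponential gain in copy number that in vitro thermocycling (PCR) provides. The sequence is invented, and the bottom strand is written 3'->5' so that it pairs position-by-position with the top strand.

    # Semiconservative copying as string complementation, plus PCR-style
    # exponential growth in copy number.
    COMP = str.maketrans("ACGT", "TGCA")

    def replicate(duplex):
        """Each parental strand templates a new complementary strand."""
        top, bottom = duplex
        return (top, top.translate(COMP)), (bottom, bottom.translate(COMP))

    top = "ATGGCCATTGTAATGGGCCGC"
    duplex = (top, top.translate(COMP))
    daughter1, daughter2 = replicate(duplex)
    print(daughter1)   # each daughter duplex keeps one parental strand
    print(daughter2)

    for cycle in (10, 20, 30):
        print(f"after {cycle} PCR cycles: up to {2**cycle:,} copies per starting molecule")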
THE EMBARKMENT OF THE HUMAN GENOME PROJECT:
The Human Genome Project (HGP) originated at the DOE meeting in Alta, Utah, in December 1984, where the possible use of DNA analysis in detecting mutations among atomic bomb survivors was contemplated. This was followed by the drive to sequence the entire human genome, advocated by several scientists, including Robert Sinsheimer (1985, then chancellor of the University of California, Santa Cruz), Charles DeLisi of the DOE, who described the framework (The Human Genome Project, Am Scientist 76, 488-493, 1988), and Renato Dulbecco (1986, president of the Salk Institute). A National Research Council (NRC) committee was asked, in September 1986, to determine whether the Human Genome Project (i.e., sequencing the human genome) should be advanced. In February 1988 the committee recommended the implementation of the HGP, which would include, in addition to human genome mapping, the mapping of model organisms, at a budget of $200 m per year for 15 years, in which NIH would have a central role. Another committee, appointed by the US Congress' Office of Technology Assessment (OTA), released a report in April 1988 supporting the recommendation of the NRC committee. In 1988 Congress appropriated $17.3 m to NIH and $11.8 m to DOE for genome research. In March 1988, James Wyngaarden, then NIH director, announced the creation of an NIH Office for Human Genome Research. James Watson was appointed in October 1989 to head the office, which became the National Center for Human Genome Research (NCHGR), serving as its director until April 1992; Michael Gottesman served as acting director until Francis Collins became the second and present director. NIH and DOE, working as partners in managing the HGP, presented to Congress a 5-year program in early 1990 with 8 major goals, including: develop maps of human chromosomes; improve technology for DNA sequencing; map and sequence the DNA of selected model organisms.
The Nucleosome: The Unit of Chromatin
The basic repeating structural (and functional) unit of chromatin is the nucleosome, which contains eight histone proteins and about 146 base pairs of DNA. The observation by electron microscopists that chromatin appeared similar to beads on a string provided an early clue that nucleosomes exist. Another clue came from chemically cross-linking (i.e., joining) histones in chromatin. This experiment demonstrated that H2A, H2B, H3, and H4 form a discrete protein octamer, which is fully consistent with the presence of a repeating histone-containing unit in the chromatin fiber. Today, researchers know that nucleosomes are structured as follows: Two each of the histones H2A, H2B, H3, and H4 come together to form a histone octamer, which binds and wraps approximately 1.7 turns of DNA, or about 146 base pairs. The addition of one H1 protein wraps another 20 base pairs, resulting in two full turns around the octamer, and forming a structure called a chromatosome. The resulting 166 base pairs is not very long, considering that each chromosome contains over 100 million base pairs of DNA on average. Therefore, every chromosome contains hundreds of thousands of nucleosomes, and these nucleosomes are joined by the DNA that runs between them (an average of about 20 base pairs). This joining DNA is referred to as linker DNA. Each chromosome is thus a long chain of nucleosomes, which gives the appearance of a string of beads when viewed using an electron microscope. The amount of DNA per nucleosome was determined by treating chromatin with an enzyme that cuts DNA (such enzymes are called DNases). One such enzyme, micrococcal nuclease (MNase), has the important property of preferentially cutting the linker DNA between nucleosomes well before it cuts the DNA that is wrapped around octamers. By regulating the amount of cutting that occurs after application of MNase, it is possible to stop the reaction before every linker DNA has been cleaved. At this point, the treated chromatin will consist of mononucleosomes, dinucleosomes (connected by linker DNA), trinucleosomes, and so forth. If DNA from MNase-treated chromatin is then separated on a gel, a number of bands will appear, each having a length that is a multiple of mononucleosomal DNA. The simplest explanation for this observation is that chromatin possesses a fundamental repeating structure. When this was considered together with data from electron microscopy and chemical cross-linking of histones, the "subunit theory" of chromatin was adopted. The subunits were later named nucleosomes and were eventually crystallized; the crystal structure shows the DNA double helix wrapped around the core histones H3, H4, H2A, and H2B. Note that only eukaryotes (i.e., organisms with a nucleus and nuclear envelope) have nucleosomes. Prokaryotes, such as bacteria, do not.
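A quick back-of-the-envelope check on the "hundreds of thousands of nucleosomes per chromosome" statement, using only the numbers quoted above (~146 bp wrapped plus ~20 bp of linker, a chromosome of ~100 million bp); this is an added illustration, not from the source text.

```python
# Rough estimate of nucleosomes per chromosome, using the figures quoted above:
# ~146 bp wrapped around each histone octamer plus ~20 bp of linker DNA.

WRAPPED_BP = 146
LINKER_BP = 20
REPEAT_BP = WRAPPED_BP + LINKER_BP  # ~166 bp per nucleosome repeat

chromosome_bp = 100_000_000  # an "average" chromosome of ~100 Mb
nucleosomes = chromosome_bp // REPEAT_BP

print(f"~{nucleosomes:,} nucleosomes")  # ~602,409 - hundreds of thousands, as stated
```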
Transposable elements
Transposable elements (TEs) are sequences of DNA with a defined structure that are able to change their location in the genome. TEs are categorized as either class I TEs, which replicate by a copy-and-paste mechanism, or class II TEs, which can be excised from the genome and inserted at a new location. The movement of TEs is a driving force of genome evolution in eukaryotes because their insertion can disrupt gene functions, homologous recombination between TEs can produce duplications, and TEs can shuffle exons and regulatory sequences to new locations.
first goal of human genome project
- October 1, 1993, to September 30, 1998 (FY 1994-98): Mapping and Sequencing the Human Genome
Genetic Mapping
• Complete the 2- to 5-cM map by 1995. (Goals for map resolution remain unchanged.)
• Develop technology for rapid genotyping.
• Develop markers that are easier to use.
• Develop new mapping technologies.
Physical Mapping
• Complete a sequence tagged site (STS) map of the human genome at a resolution of 100 kb. (Goals for map resolution remain unchanged.)
DNA Sequencing
• Develop efficient approaches to sequencing one- to several-megabase regions of DNA of high biological interest.
• Develop technology for high-throughput sequencing, focusing on systems integration of all steps from template preparation to data analysis.
• Build up a sequencing capacity to allow sequencing at a collective rate of 50 Mb per year by the end of the period. This rate should result in an aggregate of 80 Mb of DNA sequence completed by the end of FY 1998.
Gene Identification
• Develop efficient methods for identifying genes and for placement of known genes on physical maps or sequenced DNA.
Technology Development
• Substantially expand support of innovative technological developments as well as improvements in current technology for DNA sequencing and for meeting the needs of the Human Genome Project as a whole.
Model Organisms
• Finish an STS map of the mouse genome at a 300-kb resolution.
• Finish the sequence of the Escherichia coli and Saccharomyces cerevisiae genomes by 1998 or earlier.
• Continue sequencing the Caenorhabditis elegans and Drosophila melanogaster genomes with the aim of bringing C. elegans to near completion by 1998.
• Sequence selected segments of mouse DNA side by side with corresponding human DNA in areas of high biological interest.
How do we 'anchor' ('hang') a YAC/BAC map on a map of microsatellite markers (in latter stages of HGP with help from STS markers)?
- in the Figure below, microsatellite marker D1S2473 is on BAC 12H2 - D1S1617 is on BAC 613A2, which does not overlap with 12H2 - ...but several STS markers are present on 12H2 and 1068B5, or on 1068B5 and 678K16, or on 678K16 and 613A2 - hence this set of markers establishes the order and overlap of these 4 BAC clones (see the sketch below)
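A hedged sketch of the anchoring logic: if we know which markers sit on which clones, shared markers imply overlap and let us chain the BACs into order. D1S2473 and D1S1617 are the microsatellites named above; the STS names stsA-stsC are hypothetical placeholders, since the slide does not name them.

```python
# Toy illustration of ordering BACs by shared markers (marker-content mapping).
# stsA, stsB, stsC are hypothetical STS names; D1S2473 and D1S1617 come from the figure above.

bac_markers = {
    "12H2":   {"D1S2473", "stsA"},
    "1068B5": {"stsA", "stsB"},
    "678K16": {"stsB", "stsC"},
    "613A2":  {"stsC", "D1S1617"},
}

def overlaps(a: str, b: str) -> bool:
    """Two clones are inferred to overlap if they share at least one marker."""
    return bool(bac_markers[a] & bac_markers[b])

# Chain the clones: start from the clone carrying D1S2473 and greedily follow overlaps.
order = ["12H2"]
remaining = set(bac_markers) - {"12H2"}
while remaining:
    nxt = next(c for c in remaining if overlaps(order[-1], c))
    order.append(nxt)
    remaining.remove(nxt)

print(" -> ".join(order))  # 12H2 -> 1068B5 -> 678K16 -> 613A2
```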
next priority - start making the other maps needed to generate the genome sequence
- maps of physically ordered markers all over the genome - several types of map possible:
• random marker maps - STSs - RH mapping
• physical maps based on cloned bits of genome - YACs and BACs
• also realised that we'll never clone/identify the disease/trait genes themselves until we can relate genetic map position to something physical in the genome
• requires integration of all the different types of map - integration can also help improve the accuracy of each kind of map
Exon Shuffling
Exon shuffling is a mechanism by which new genes are created. This can occur when two or more exons from different genes are combined together or when exons are duplicated. Exon shuffling results in new genes by altering the current intron-exon structure. This can occur by any of the following processes: transposon-mediated shuffling, sexual recombination or illegitimate recombination. Exon shuffling may introduce new genes into the genome that can be either selected against and deleted or selectively favored and conserved.
Microsatellites - CNV..........2
If searching for microsatellite markers in specific regions of a genome, for example within a particular intron, primers can be designed manually. This involves searching the genomic DNA sequence for microsatellite repeats, which can be done by eye or by using automated tools such as RepeatMasker. Once the potentially useful microsatellites are determined, the flanking sequences can be used to design oligonucleotide primers which will amplify the specific microsatellite repeat in a PCR reaction. Random microsatellite primers can be developed by cloning random segments of DNA from the focal species. These random segments are inserted into a plasmid or bacteriophage vector, which is in turn implanted into Escherichia coli bacteria. Colonies are then developed, and screened with fluorescently-labelled oligonucleotide sequences that will hybridize to a microsatellite repeat, if present on the DNA segment. If positive clones can be obtained from this procedure, the DNA is sequenced and PCR primers are chosen from sequences flanking such regions to define a specific locus. This process involves significant trial and error on the part of researchers, as microsatellite repeat sequences must be predicted and primers that are randomly isolated may not display significant polymorphism. Microsatellite loci are widely distributed throughout the genome and can be isolated from semi-degraded DNA of older specimens, as all that is needed is a suitable substrate for amplification through PCR. More recent techniques involve using oligonucleotide sequences consisting of repeats complementary to repeats in the microsatellite to "enrich" the DNA extracted (microsatellite enrichment). The oligonucleotide probe hybridizes with the repeat in the microsatellite, and the probe/microsatellite complex is then pulled out of solution. The enriched DNA is then cloned as normal, but the proportion of successes will now be much higher, drastically reducing the time required to develop the regions for use. However, which probes to use can be a trial-and-error process in itself. Compared with SNPs, microsatellites have more alleles per locus; consequently, it is easier to detect genotyping errors in microsatellites and fewer microsatellite markers can provide the same information. On the other hand, SNPs are far more common than microsatellites, which means that a SNP map can be far denser.
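The repeat-searching step described above ("by eye or by using automated tools") is easy to sketch in code; the example below is a simplified stand-in for a tool like RepeatMasker, scanning a toy sequence for (CA)n dinucleotide repeats and reporting the flanking sequence from which locus-specific primers could be designed.

```python
import re

# Find runs of a dinucleotide repeat (e.g. (CA)n with n >= 6) and report flanking sequence
# that could be used to design locus-specific PCR primers. Toy sequence, illustrative only.

sequence = "GATTACCGTTAGGCTTA" + "CA" * 12 + "GGATCCAATTCGGTAC"

for match in re.finditer(r"(?:CA){6,}", sequence):
    start, end = match.span()
    left_flank = sequence[max(0, start - 15):start]
    right_flank = sequence[end:end + 15]
    print(f"(CA) repeat x{(end - start) // 2} at {start}-{end}")
    print(f"  left flank : {left_flank}")   # candidate forward-primer region
    print(f"  right flank: {right_flank}")  # candidate reverse-primer region
```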
understanding a genome
until very recently, the first step in understanding a genome was always to find out where things are in the genome - for this, we need to make maps. Genome mapping is used to identify and record the location of genes and the distances between genes on a chromosome. Genome mapping provided a critical starting point for the Human Genome Project. A genome map highlights the key 'landmarks' in an organism's genome. A bit like how the London tube map shows the different stops on a tube line to help you get around the city, a genome map helps scientists to navigate their way around the genome. The landmarks on a genome map may include short DNA sequences, regulatory sites that turn genes on and off, or the genes themselves. Genome mapping provided the basis for whole genome sequencing and the Human Genome Project. Sequenced DNA fragments can be aligned to the genome map to aid with the assembly of the genome. Over time, as scientists learn more about a particular genome, its map becomes more accurate and detailed. A genome map is not a final product, but work in progress. Need a coordinate system for describing location
Fibre FISH
- Used in fine mapping of closely spaced genes - Multiple probes, different dyes - Can deduce order of genes. In an alternative technique to interphase or metaphase preparations, fiber FISH, interphase chromosomes are attached to a slide in such a way that they are stretched out in a straight line, rather than being tightly coiled, as in conventional FISH, or adopting a chromosome territory conformation, as in interphase FISH. This is accomplished by applying mechanical shear along the length of the slide, either to cells that have been fixed to the slide and then lysed, or to a solution of purified DNA. A technique known as chromosome combing is increasingly used for this purpose. The extended conformation of the chromosomes allows dramatically higher resolution - even down to a few kilobases. The preparation of fiber FISH samples, although conceptually simple, is a rather skilled art, and only specialized laboratories use the technique routinely. FISH and 'chromosome painting' help identify deletions/translocations.
Evolution of the Human Genome Project
Completed in 2003, the Human Genome Project (HGP) was a 13-year project coordinated by the U.S. Department of Energy (DOE) and the National Institutes of Health. During the early years of the HGP, the Wellcome Trust (U.K.) became a major partner; additional contributions came from Japan, France, Germany, China, and others. Project goals were to identify all the approximately 20,500 genes in human DNA, determine the sequences of the 3 billion chemical base pairs that make up human DNA, store this information in databases, improve tools for data analysis, transfer related technologies to the private sector, and address the ethical, legal, and social issues (ELSI) that may arise from the project. Though the HGP is finished, analyses of the data will continue for many years. • Human Genome Org. (HUGO) 1989 • Human Genome Project (HGP) 1990/1992 - first Director - James Watson (he didn't last long in post....!)
Genomes: the First 10 Billion Years
Cosmologists believe that the universe began some 14 billion years ago with the gigantic 'primordial fireball' called the Big Bang. Mathematical models suggest that after about 4 billion years galaxies began to fragment from the clouds of gas emitted by the Big Bang, and that within our own galaxy the solar nebula condensed to form the Sun and its planets about 4.6 billion years ago. The early Earth was covered with water and it was in this vast planetary ocean that the first biochemical systems appeared, cellular life being well established by the time land masses began to appear, some 3.5 billion years ago. But cellular life was a relatively late stage in biochemical evolution, being preceded by self-replicating polynucleotides that were the progenitors of the first genomes. We must begin our study of genome evolution with these precellular systems.
chromosomes
In the nucleus of each cell, the DNA molecule is packaged into thread-like structures called chromosomes. Each chromosome is made up of DNA tightly coiled many times around proteins called histones that support its structure. Chromosomes are not visible in the cell's nucleus—not even under a microscope—when the cell is not dividing. However, the DNA that makes up chromosomes becomes more tightly packed during cell division and is then visible under a microscope. Most of what researchers know about chromosomes was learned by observing chromosomes during cell division. Each chromosome has a constriction point called the centromere, which divides the chromosome into two sections, or "arms." The short arm of the chromosome is labeled the "p arm." The long arm of the chromosome is labeled the "q arm." The location of the centromere on each chromosome gives the chromosome its characteristic shape, and can be used to help describe the location of specific genes.
Ligase ribozyme
The RNA Ligase ribozyme was the first of several types of synthetic ribozymes produced by in vitro evolution and selection techniques. They are an important class of ribozymes because they catalyze the assembly of RNA fragments into phosphodiester RNA polymers, a reaction required of all extant nucleic acid polymerases and thought to be required for any self-replicating molecule. Ideas that the origin of life may have involved the first self-replicating molecules being ribozymes are called RNA World hypotheses. Ligase ribozymes may have been part of such a pre-biotic RNA world. In order to copy RNA, fragments or monomers (individual building blocks) that have 5′-triphosphates must be ligated together. This is true for modern (protein-based) polymerases, and is also the most likely mechanism by which a ribozyme self-replicase in an RNA world might function. Yet no one has found a natural ribozyme that can perform this reaction
Accumulating Changes Over Time
The evolution of the genome is characterized by the accumulation of changes. The analysis of genomes and their changes in sequence or size over time involves various fields. Various mechanisms have contributed to genome evolution, including gene and genome duplications, polyploidy, mutation rates, transposable elements, pseudogenes, exon shuffling, and genomic reduction and gene loss. The concepts of gene and whole-genome duplication are discussed as their own independent concepts; thus, the focus here will be on the other mechanisms.
genomics
• Questions: Genomics - what's it for....? • How can the wool yield and quality be improved, while also maintaining breeding success? • How can we locate and identify the genes influencing these traits most efficiently? • Can we modify the genome to achieve improvements that selective breeding has failed to deliver? • What do we need to know in order to provide the answers...? - How is the goat genome organised? Chromos & gene distribution - How do its genes relate to those in other organisms?• does it have more keratin genes, or just control them differently in the production of wool...? • What are the rules governing how expression of these genes is regulated? - How does the goat genome 'work'? - How do the wool genes vary? Are they under natural selection? - What other gene products do the wool gene products interact with?
STS (sequence tagged sites)
...unique DNA 'markers' used to follow a physically-linked gene of interest (e.g. a linked disease gene). STS markers are short sequences of genomic DNA that can be uniquely amplified by the polymerase chain reaction (PCR) using a pair of primers. Because each is unique, STSs are often used in linkage and radiation hybrid mapping techniques. STSs serve as landmarks on the physical map of the human genome. A sequence-tagged site (or STS) is a short (200 to 500 base pair) DNA sequence that has a single occurrence in the genome and whose location and base sequence are known. STSs can be easily detected by the polymerase chain reaction (PCR) using specific primers. For this reason they are useful for constructing genetic and physical maps from sequence data reported from many different laboratories. They serve as landmarks on the developing physical map of a genome. When STS loci contain genetic polymorphisms (e.g. simple sequence length polymorphisms, SSLPs, single nucleotide polymorphisms), they become valuable genetic markers, i.e. loci which can be used to distinguish individuals. They are used in shotgun sequencing, specifically to aid sequence assembly. STSs are very helpful for detecting microdeletions in some genes. For example, some STSs can be used in screening by PCR to detect microdeletions in Azoospermia (AZF) genes in infertile men. STS markers are used to designate random 'loci' all over the genome [Locus - any arbitrary site, or a specified site, in the genome] Where do STSs come from....? - you need a piece of DNA to sequence first.
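Because an STS is defined operationally by a primer pair that amplifies one and only one site, a candidate can be checked in silico by counting where the primers land in a sequence. The sketch below does this for a toy sequence; the primer sequences are invented for illustration and are not real STS primers.

```python
# In-silico check that a primer pair defines a sequence tagged site:
# the forward primer and the reverse-complemented reverse primer should each
# occur exactly once, in the right orientation and within PCR-able distance.

def revcomp(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_all(genome: str, probe: str) -> list:
    return [i for i in range(len(genome) - len(probe) + 1) if genome[i:i + len(probe)] == probe]

genome = "TTGACCGTAAGCTTACCGGGTTTAACCGTAGCATCGGATTTCAGGCATAA"
forward = "GACCGTAAGC"              # hypothetical forward primer
reverse = "TTATGCCTGA"              # hypothetical reverse primer (reads off the other strand)

f_hits = find_all(genome, forward)
r_hits = find_all(genome, revcomp(reverse))

if len(f_hits) == 1 and len(r_hits) == 1 and f_hits[0] < r_hits[0]:
    product_len = r_hits[0] + len(reverse) - f_hits[0]
    print(f"Unique STS, expected product ~{product_len} bp")
else:
    print("Not a unique STS in this sequence")
```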
summary
1. Define a genome map. A genome map is the graphical description of the location of genes and DNA markers on the genome of an organism.
2. What are the different types of genome mapping methods?
• Genetic mapping - mapping using genes - mapping using molecular markers
• Physical mapping - restriction mapping - cytogenetic mapping - STS content mapping - radiation hybrid mapping
3. How does a genetic map differ from a physical map? A genetic map is constructed using recombination frequency calculated from the progenies and is an indirect method of locating the positions of genes or DNA markers; the unit of measurement is the cM. Physical mapping locates the position of DNA sequences directly on the chromosome or a large DNA fragment; the unit of measurement is the base pair.
4. What are the markers for genetic mapping and physical mapping? Markers for genetic mapping: genes with a visible phenotype; molecular markers such as RFLP, AFLP and SSR. Markers for physical mapping: restriction enzyme sites; expressed sequence tags; cDNA clones; any unique DNA sequence.
5. Clone contig mapping is more useful than other physical mapping methods for genome sequencing. Justify the statement. Clone contigs are prepared from BAC clone libraries. The BAC clones can then be used both for physical mapping and for sequencing.
6. High density mapping is helpful in genome sequence finishing. Validate the statement. Eukaryotic genomes are large and contain a sizable portion of repetitive sequences. Genomic DNA is fragmented into smaller pieces before sequencing; clone assembly produces contigs but not the entire reconstructed genome, i.e. it has many gaps. To close the gaps, it is necessary to have high density maps.
7. The STS map is one of the highest density physical maps. How is this possible? Thousands of STS markers are available to map onto the BAC library clones using highly sensitive clone fingerprinting techniques. Therefore, it is possible to construct a map with one marker in every 100 kbp of the genome.
8. Restriction mapping is not a good physical mapping technique for larger genome mapping. Why? When the size of the genome is too large, the number of fragments produced by restriction digestion increases, and the restricted fragments separated by gel electrophoresis cannot be viewed as discrete bands.
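Question 3 above notes that a genetic map is an indirect estimate based on recombination frequency in progeny; as a minimal worked example (toy progeny counts, not real data), the conversion to map distance looks like this.

```python
# Question 3 above: genetic map positions are indirect estimates from recombination
# frequency in progeny. Minimal sketch with invented offspring counts.

parental = 460        # offspring with parental marker combinations
recombinant = 40      # offspring with recombined marker combinations

recombination_frequency = recombinant / (parental + recombinant)
map_distance_cM = recombination_frequency * 100   # 1 cM corresponds to 1% recombination

print(f"RF = {recombination_frequency:.2%}, distance ~ {map_distance_cM:.1f} cM")
```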
dna histones chromosomes
Chromosomal DNA is packaged inside microscopic nuclei with the help of histones. These are positively-charged proteins that strongly adhere to negatively-charged DNA and form complexes called nucleosomes. Each nucleosome is composed of DNA wound 1.65 times around eight histone proteins. Nucleosomes fold up to form a 30-nanometer chromatin fiber, which forms loops averaging 300 nanometers in length. The 300 nm fibers are compressed and folded to produce a 250 nm-wide fiber, which is tightly coiled into the chromatid of a chromosome.
replicators
Dawkins proposes the idea of the "replicator":[12] "It is finally time to return to the problem with which we started, to the tension between individual organism and gene as rival candidates for the central role in natural selection...One way of sorting this whole matter out is to use the terms 'replicator' and 'vehicle'. The fundamental units of natural selection, the basic things that survive or fail to survive, that form lineages of identical copies with occasional random mutations, are called replicators. DNA molecules are replicators. They generally, for reasons that we shall come to, gang together into large communal survival machines or 'vehicles'. The original replicator (Dawkins' Replicator) was the initial molecule which first managed to reproduce itself and thus gained an advantage over other molecules within the primordial soup. As replicating molecules became more complex, Dawkins postulates, the replicators became the genes within organisms, with each organism's body serving the purpose of a 'survival machine' for its genes. Dawkins writes that gene combinations which help an organism to survive and reproduce tend to also improve the gene's own chances of being replicated, and, as a result, "successful" genes frequently provide a benefit to the organism. An example of this might be a gene that protects the organism against a disease. This helps the gene spread, and also helps the organism.
Genome evolution: what processes lead to change?
Gene and whole-genome duplications have contributed to genome evolution. Mutations are constantly occurring in an organism's genome and can have a negative effect, a positive effect or no effect at all; in every case they result in changes to the genome. Transposable elements are regions of DNA that can be inserted elsewhere in the genetic code and will cause changes within the genome. Pseudogenes are dysfunctional genes derived from previously functional gene relatives; a gene becomes a pseudogene through deletion or insertion of one or multiple nucleotides. Exon shuffling occurs when two or more exons from different genes are combined together or when exons are duplicated, and will result in new genes. Species can also exhibit genome reduction when subsets of their genes are not needed anymore.
'Mapping 101' - Characterising Gene Content of Chromosomes
Somatic cell hybrids - 1970s-mid 1990s. Somatic cell hybrids are formed through fusion of different somatic cells of the same or different species. Somatic cell hybrids contain the nucleus of both cells and in addition all cytoplasmic organelles from both parents, in contrast to the generative hybrids where generally mitochondria and plastids are not transmitted through the male. - Grow human and mouse cells with PEG or with Sendai virus - Cells fuse, one cell loses chromosomes at random - Grow up each cell into a 'clonal' cell line - Each cell line retains a random subset of human chromos - Hybrid Panel - used to map genes, biochemical traits and markers to particular chromosomes. The technique of somatic cell hybridization is extensively used in human genome mapping, but it can in principle be used in many different animal systems. The procedure uses cells growing in culture. A virus called the Sendai virus has a useful property that makes the mapping technique possible. Each Sendai virus has several points of attachment, so it can simultaneously attach to two different cells if they happen to be close together. However, a virus is very small in comparison with a cell, so the two cells to which the virus is attached are held very close together indeed. In fact, the membranes of the two cells may fuse together and the two cells become one - a binucleate heterokaryon. If suspensions of human and mouse cells are mixed together in the presence of Sendai virus that has been inactivated by ultraviolet light, the virus can mediate fusion of the cells from the different species. When the cells have fused, the nuclei subsequently fuse to form a uni-nucleate cell line composed of both human and mouse chromosome sets. Because the mouse and human chromosomes are recognizably different in number and shape, the two sets in the hybrid cells can be readily distinguished. However, in the course of subsequent cell divisions, for unknown reasons the human chromosomes are gradually eliminated from the hybrid at random. Perhaps this process is analogous to haploidization in the fungus Aspergillus.
The origins of genomes
The first oceans are thought to have had a similar salt composition to those of today but the Earth's atmosphere, and hence the dissolved gases in the oceans, was very different. The oxygen content of the atmosphere remained very low until photosynthesis evolved, and to begin with the most abundant gases were probably methane and ammonia. Experiments attempting to recreate the conditions in the ancient atmosphere have shown that electrical discharges in a methane-ammonia mixture result in chemical synthesis of a range of amino acids, including alanine, glycine, valine and several of the others found in proteins. Hydrogen cyanide and formaldehyde are also formed, these participating in additional reactions to give other amino acids, as well as purines, pyrimidines and, in less abundance, sugars. At least some of the building blocks of biomolecules could therefore have accumulated in the ancient chemosphere
integration of maps required
• also realised that we'll never clone/identify the disease/trait genes themselves until we can relate genetic map position to something physical in the genome
• requires integration of all the different types of map - integration can also help improve the accuracy of each kind of map
Next job-build 'physical' maps from real bits of DNA
- only physical maps can tell you how far apart disease loci, markers etc. are in terms of DNA sequence - where to start....?? - this was the clever bit - we already know the order of a whole bunch of markers, so 'hang' the physical map on the linkage map! (what do we mean by 'hang'? - see later slide) • for a physical map, you need CLONES made from bits of the genome - GENOMIC LIBRARIES • use of genomic libraries - the genomic clones can be the substrate for efforts to sequence the whole genome • this is how the public effort on the Human Genome Project was done. Physical mapping gives an estimation of the (physical) distance between specific known DNA sequences on a chromosome. The distance between these known DNA sequences on a chromosome is expressed as the number of base pairs between them. There are several different techniques used for physical mapping. These include: restriction mapping (fingerprint mapping and optical mapping), fluorescent in situ hybridisation (FISH) mapping, and sequence tagged site (STS) mapping. A physical map is a representation of a genome, composed of cloned fragments of DNA. The map is therefore made from physical entities (pieces of DNA) rather than abstract concepts such as the linkage frequencies and genes that make up a genetic map. It is usually possible to correlate genetic and physical maps, for example by identifying the clone that contains a particular molecular marker. The connection between physical and genetic maps allows the genes underlying particular mutations to be identified through a process called map-based cloning. To create a physical map, large fragments of the genome are cloned into plasmid vectors, or into larger vectors called bacterial artificial chromosomes (BACs). BACs can contain approximately 100 kb fragments. The set of BACs produced in a cloning reaction will be redundant, meaning that different clones will contain DNA from the same part of the genome. Because of this redundancy, it is useful to select the minimum set of clones that represents the entire genome, and to order these clones with respect to the sequence of the original chromosome. Note that this is all to be done without knowing the complete sequence of each BAC. Making a physical map may therefore rely on techniques related to Southern blotting: DNA from the ends of one BAC is used as a probe to find clones that contain the same sequence. These clones are then assumed to overlap each other. A set of overlapping clones is called a contig.
What is genomics
- the study of a genome, at the level of the genome (or at least very large chunks of it) - includes sequencing a genome - involves large-scale experiments - 'big science' - involves creating resources to store or look at lots of pieces of the genome at once, sometimes using very big pieces - allows us to draw conclusions about genome function and functioning by looking at the overall pattern we see that would not be visible by studying genes etc. individually - 'emergent properties' • However... the term 'genomics' is now being used to mean: - merely doing lots of DNA sequencing (NGS or 3rd gen. sequencing) - doing a search for something across the whole genome - e.g. GWAS - any large-scale experiment or analysis involving nucleic acids. Genomics is an interdisciplinary field of biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes. In contrast to genetics, which refers to the study of individual genes and their roles in inheritance, genomics aims at the collective characterization and quantification of all of an organism's genes, their interrelations and influence on the organism. Genes may direct the production of proteins with the assistance of enzymes and messenger molecules. In turn, proteins make up body structures such as organs and tissues as well as control chemical reactions and carry signals between cells. Genomics also involves the sequencing and analysis of genomes through uses of high-throughput DNA sequencing and bioinformatics to assemble and analyze the function and structure of entire genomes. Advances in genomics have triggered a revolution in discovery-based research and systems biology to facilitate understanding of even the most complex biological systems such as the brain. The field also includes studies of intragenomic (within the genome) phenomena such as epistasis (effect of one gene on another), pleiotropy (one gene affecting more than one trait), heterosis (hybrid vigour), and other interactions between loci and alleles within the genome.
Clone-by-clone sequencing
A method of genome sequencing in which a physical map is constructed first, followed by sequencing of fragments and identifying overlap regions. Physical mapping of cloned sequences was once considered a pre-requisite for genome sequencing. The process would begin by breaking the genome into BAC-sized pieces, arranging these BACs into a map, then breaking each BAC up into a series of smaller clones, which were usually then also mapped. Eventually, a minimum set of smaller clones would be identified, each of which was small enough to be sequenced. Because the order of clones relative to the complete chromosome was known prior to sequencing, the resulting sequence information could be easily assembled into one complete chromosome at the end of the project. Clone-by-clone sequencing therefore minimizes the number of sequencing reactions that must be performed, and makes sequence assembly straightforward and reliable. However, a drawback of this strategy is the tedious process of building a physical map prior to any sequencing. During clone-by-clone sequencing, a map of each chromosome of the genome is made before the DNA is split up into fragments ready for sequencing. In clone-by-clone sequencing the genome is broken up into large chunks, 150 kilobases long (150,000 base pairs). The location of these chunks on the chromosomes is recorded (mapped) to help with assembling them in order after sequencing. The chunks are then inserted into Bacterial Artificial Chromosomes (BACs) and put inside bacterial cells to grow. The chunks of DNA are copied each time the bacteria divide to produce lots of identical copies. The DNA in the individual bacterial clones is then broken down into even smaller, overlapping fragments. Each fragment is 500 base pairs long so that they are a more manageable size for sequencing. These fragments are put into a vector that has a known DNA sequence. The DNA fragments are then sequenced, starting with the known sequence of the vector and extending out into the unknown sequence of the DNA. Following sequencing, the small fragments of DNA are pieced together by identifying areas of overlap to reform the large chunks that were originally inserted into the BACs. This 'assembly' is carried out by computers which spot areas of overlap and piece the DNA sequence together. Then, by following the map constructed at the beginning, the large chunks can be assembled back into the chromosomes as part of the complete genome sequence. The clone-by-clone approach was used during the 1980s and 1990s to sequence the genomes of the nematode worm, C. elegans, and the yeast, S. cerevisiae. Clone-by-clone sequencing was the preferred method during the Human Genome Project, which published its draft sequence in 2001 and was completed in 2003.
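The "spot areas of overlap and piece the DNA sequence together" step can be illustrated with a deliberately simplified sketch that merges two reads by their longest exact suffix-prefix overlap; real assemblers must handle sequencing errors, repeats and thousands of reads, so this is illustrative only.

```python
# Toy version of overlap-based assembly: merge two sequenced fragments by the longest
# exact overlap between the end of one and the start of the next.

def merge_by_overlap(left: str, right: str, min_overlap: int = 5) -> str:
    """Join two fragments using the longest suffix(left)/prefix(right) match."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    raise ValueError("no sufficient overlap found")

read_1 = "ATTGCCGGAACCTTAGGC"
read_2 = "ACCTTAGGCTTGACCA"   # starts where read_1 ends (9 bp overlap)

contig = merge_by_overlap(read_1, read_2)
print(contig)  # ATTGCCGGAACCTTAGGCTTGACCA
```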
SNPs
A single-nucleotide polymorphism (SNP; /snɪp/; plural /snɪps/) is a substitution of a single nucleotide at a specific position in the genome, that is present in a sufficiently large fraction of the population (e.g. 1% or more). For example, at a specific base position in the human genome, the C nucleotide may appear in most individuals, but in a minority of individuals, the position is occupied by an A. This means that there is a SNP at this specific position, and the two possible nucleotide variations - C or A - are said to be the alleles for this specific position. SNPs pinpoint differences in our susceptibility to a wide range of diseases (e.g. sickle-cell anemia, β-thalassemia and cystic fibrosis result from SNPs). The severity of illness and the way the body responds to treatments are also manifestations of genetic variations. For example, a single-base mutation in the APOE (apolipoprotein E) gene is associated with a lower risk for Alzheimer's disease. A single-nucleotide variant (SNV) is a variation in a single nucleotide without any limitations of frequency and may arise in somatic cells. A somatic single-nucleotide variation (e.g., caused by cancer) may also be called a single-nucleotide alteration To genotype an indiv. at a SNP locus, you design an assay using an STS approach combined with some sort of allele-specific detection technology - see later slide
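Operationally, a SNP is a position where aligned sequences from different individuals carry different bases at an appreciable frequency; the toy sketch below (invented sequences) makes that definition concrete by tallying alleles at each aligned position.

```python
# Toy illustration of what a SNP is operationally: a position where aligned sequences
# from different individuals carry different bases, with the minor allele reasonably common.

from collections import Counter

aligned = [          # one short aligned region per individual (invented data)
    "ACGTTCGA",
    "ACGTTCGA",
    "ACATTCGA",      # third position differs: G vs A
    "ACGTTCGA",
    "ACATTCGA",
]

for pos in range(len(aligned[0])):
    counts = Counter(seq[pos] for seq in aligned)
    if len(counts) > 1:
        total = sum(counts.values())
        alleles = ", ".join(f"{base}={n/total:.0%}" for base, n in counts.most_common())
        print(f"SNP at position {pos}: {alleles}")
# -> SNP at position 2: G=60%, A=40%
```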
eukaryotic genomes
Eukaryotic genomes are composed of one or more linear DNA chromosomes. The number of chromosomes varies widely, from Jack jumper ants and an asexual nematode, which each have only one pair, to a fern species that has 720 pairs. A typical human cell has two copies of each of 22 autosomes, one inherited from each parent, plus two sex chromosomes, making it diploid. Gametes, such as ova, sperm, spores, and pollen, are haploid, meaning they carry only one copy of each chromosome. In addition to the chromosomes in the nucleus, organelles such as the chloroplasts and mitochondria have their own DNA. Mitochondria are sometimes said to have their own genome, often referred to as the "mitochondrial genome". The DNA found within the chloroplast may be referred to as the "plastome". Like the bacteria they originated from, mitochondria and chloroplasts have a circular chromosome. Unlike prokaryotes, eukaryotes have exon-intron organization of protein-coding genes and variable amounts of repetitive DNA. In mammals and plants, the majority of the genome is composed of repetitive DNA.
Polymorphic markers and genotyping for linkage analysis
Genetic linkage analysis is a powerful tool to detect the chromosomal location of disease genes. It is based on the observation that genes that reside physically close on a chromosome remain linked during meiosis. For most neurologic diseases for which the underlying biochemical defect was not known, the identification of the chromosomal location of the disease gene was the first step in its eventual isolation. By now, genes that have been isolated in this way include examples from all types of neurologic diseases, from neurodegenerative diseases such as Alzheimer, Parkinson, or ataxias, to diseases of ion channels leading to periodic paralysis or hemiplegic migraine, to tumor syndromes such as neurofibromatosis types 1 and 2. With the advent of new genetic markers and automated genotyping, genetic mapping can be conducted extremely rapidly. Genetic linkage maps have been generated for the human genome and for model organisms and have provided the basis for the construction of physical maps that permit the rapid mapping of disease traits. As soon as a chromosomal location for a disease phenotype has been established, genetic linkage analysis helps determine whether the disease phenotype is only caused by mutation in a single gene or mutations in other genes can give rise to an identical or similar phenotype. Often it is found that similar phenotypes can be caused by mutations in very different genes. Good examples are the autosomal dominant spinocerebellar ataxias, which are caused by mutations in different genes but have very similar phenotypes. In addition to providing novel, genotype-based classifications of neurologic diseases, genetic linkage analysis can aid in diagnosis. However, in contrast to direct mutational analysis such as detection of an expanded CAG repeat in the Huntingtin gene, diagnosis using flanking markers requires the analysis of several family members. SNPs - Association studies can determine whether a genetic variant is associated with a disease or trait. A tag SNP is a representative single-nucleotide polymorphism in a region of the genome with high linkage disequilibrium (the non-random association of alleles at two or more loci). Tag SNPs are useful in whole-genome SNP association studies, in which hundreds of thousands of SNPs across the entire genome are genotyped. Haplotype mapping: sets of alleles or DNA sequences can be clustered so that a single SNP can identify many linked SNPs. Linkage disequilibrium (LD), a term used in population genetics, indicates non-random association of alleles at two or more loci, not necessarily on the same chromosome. It refers to the phenomenon that SNP alleles or DNA sequences that are close together in the genome tend to be inherited together. LD is affected by two parameters: 1) the distance between the SNPs [the larger the distance, the lower the LD]; 2) the recombination rate [the lower the recombination rate, the higher the LD].
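To make linkage disequilibrium concrete, the sketch below computes the standard LD statistics D and r² from haplotype frequencies at two biallelic loci; the frequencies are invented for illustration, but the formulas are the usual population-genetics definitions.

```python
# Linkage disequilibrium between two biallelic loci from haplotype frequencies.
# D = p(AB) - p(A)*p(B); r^2 = D^2 / (p(A) p(a) p(B) p(b)). Invented example frequencies.

hap_freq = {"AB": 0.50, "Ab": 0.10, "aB": 0.10, "ab": 0.30}

pA = hap_freq["AB"] + hap_freq["Ab"]     # frequency of allele A at locus 1
pB = hap_freq["AB"] + hap_freq["aB"]     # frequency of allele B at locus 2

D = hap_freq["AB"] - pA * pB
r_squared = D**2 / (pA * (1 - pA) * pB * (1 - pB))

print(f"D = {D:.3f}, r^2 = {r_squared:.3f}")
# D = 0.140, r^2 = 0.340 -> alleles A and B are non-randomly associated (in LD)
```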
How unique is life?
If the experimental simulations and computer models are correct then it is likely that the initial stages in biochemical evolution occurred many times in parallel in the oceans or atmosphere of the early Earth. It is therefore quite possible that 'life' arose on more than one occasion, even though all present-day organisms appear to derive from a single origin. This single origin is indicated by the remarkable similarity between the basic molecular biological and biochemical mechanisms in bacterial, archaeal and eukaryotic cells. To take just one example, there is no obvious biological or chemical reason why any particular triplet of nucleotides should code for any particular amino acid, but the genetic code, although not universal, is virtually the same in all organisms that have been studied. If these organisms derived from more than one origin then we would anticipate two or more very different codes. If multiple origins are possible, but modern life is derived from just one, then at what stage did this particular biochemical system begin to predominate? The question cannot be answered precisely, but the most likely scenario is that the predominant system was the first to develop the means to synthesize protein enzymes and therefore probably also the first to adopt a DNA genome. The greater catalytic potential and more accurate replication conferred by protein enzymes and DNA genomes would have given these cells a significant advantage compared with those still containing RNA protogenomes. The DNA-RNA-protein cells would have multiplied more rapidly, enabling them to out-compete the RNA cells for nutrients which, before long, would have included the RNA cells themselves. Are life forms based on informational molecules other than DNA and RNA possible? Orgel (2000) has reviewed the possibility that RNA was preceded by some other informational molecule at the very earliest period of biochemical evolution and concluded that a pyranosyl version of RNA, in which the sugar takes on a slightly different structure, might be a better choice than normal RNA for an early protogenome because the base-paired molecules that it forms are more stable. The same is true of peptide nucleic acid (PNA), a polynucleotide analog in which the sugar-phosphate backbone is replaced by amide bonds. PNAs have been synthesized in the test tube and have been shown to form base pairs with normal polynucleotides. However, there are no indications that either pyranosyl RNA or PNA were more likely than RNA to have evolved in the prebiotic soup.
selfish genes
In describing genes as being "selfish", Dawkins states unequivocally that he does not intend to imply that they are driven by any motives or will, but merely that their effects can be metaphorically and pedagogically described as if they were. His contention is that the genes that are passed on are the ones whose evolutionary consequences serve their own implicit interest (to continue the anthropomorphism) in being replicated, not necessarily those of the organism. In later work, Dawkins brings evolutionary "selfishness" down to creation of a widely proliferated extended phenotype. For some, the metaphor of "selfishness" is entirely clear, while to others it is confusing, misleading, or simply silly to ascribe mental attributes to something that is mindless. For example, Andrew Brown has written: ""Selfish", when applied to genes, doesn't mean "selfish" at all. It means, instead, an extremely important quality for which there is no good word in the English language: "the quality of being copied by a Darwinian selection process." This is a complicated mouthful. There ought to be a better, shorter word - but "selfish" isn't it." Donald Symons also finds it inappropriate to use anthropomorphism in conveying scientific meaning in general, and particularly for the present instance. He writes in The Evolution of Human Sexuality (1979): "In summary, the rhetoric of The Selfish Gene exactly reverses the real situation: through [the use of] metaphor genes are endowed with properties only sentient beings can possess, such as selfishness, while sentient beings are stripped of these properties and called machines...The anthropomorphism of genes...obscures the deepest mystery in the life sciences: the origin and nature of mind."
Types of marker - signposts for genomic locations ( EST)
In genetics, an expressed sequence tag (EST) is a short sub-sequence of a cDNA sequence. ESTs may be used to identify gene transcripts, and are instrumental in gene discovery and in gene-sequence determination. The identification of ESTs has proceeded rapidly, with approximately 74.2 million ESTs now available in public databases. An EST results from one-shot sequencing of a cloned cDNA. The cDNAs used for EST generation are typically individual clones from a cDNA library. The resulting sequence is a relatively low-quality fragment whose length is limited by current technology to approximately 500 to 800 nucleotides. Because these clones consist of DNA that is complementary to mRNA, the ESTs represent portions of expressed genes. They may be represented in databases as either cDNA/mRNA sequence or as the reverse complement of the mRNA, the template strand. One can map ESTs to specific chromosome locations using physical mapping techniques, such as radiation hybrid mapping, Happy mapping, or FISH. Alternatively, if the genome of the organism that originated the EST has been sequenced, one can align the EST sequence to that genome using a computer. The current understanding of the human set of genes (as of 2006) includes the existence of thousands of genes based solely on EST evidence. In this respect, ESTs have become a tool to refine the predicted transcripts for those genes, which leads to the prediction of their protein products and ultimately of their function. Moreover, the situation in which those ESTs are obtained (tissue, organ, disease state - e.g. cancer) gives information on the conditions in which the corresponding gene is acting. ESTs contain enough information to permit the design of precise probes for DNA microarrays that then can be used to determine gene expression profiles. Some authors use the term "EST" to describe genes for which little or no further information exists besides the tag. ESTs can be used as STS markers, but their primers are specific for an expressed gene.
chromosome mapping
In the preceding subsection, we discussed the use of human-rodent cell hybrids to assign genes to chromosomes. This technique can be extended to obtain mapping data. One extension of the hybrid cell technique is called chromosome-mediated gene transfer. First, samples of individual human chromosomes are isolated by fluorescence-activated chromosome sorting (FACS). In this procedure metaphase chromosomes are stained with two dyes, one of which binds to AT-rich regions, and the other to GC-rich regions. Cells are disrupted to liberate whole chromosomes into liquid suspension. This suspension is converted into a spray in which the concentration of chromosomes is such that each spray droplet contains one chromosome. The spray passes through laser beams tuned to excite the fluorescence. Each chromosome produces its own characteristic fluorescence signal, which is recognized electronically, and two deflector plates direct the droplets containing the specific chromosome needed into a collection tube. Then a sample of one specific chromosome under study is added to rodent cells. The human chromosomes are engulfed by the rodent cells and whole chromosomes or fragments become incorporated into the rodent nucleus. Correlations are then made between the human fragments present and human markers. The closer two human markers are on a chromosome, the more often they are transferred together. - first - work out which chromos have been retained in each line in the panel (using existing markers of known localisation) - then - amplify each marker/gene of interest using PCR with each cell line in the panel as template - see which ones give a PCR product (a minimal example of this scoring logic is sketched below) • STS markers - this principle is used whenever trying to assemble a set of genomic resources and characterise their content
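A minimal sketch of the scoring logic in the bullets above: record which human chromosomes each hybrid cell line retained and whether the marker's PCR gave a product, then assign the marker to the chromosome whose retention pattern matches the PCR pattern. The panel data are invented for illustration.

```python
# Assign a marker to a chromosome by matching its PCR pattern across a hybrid panel
# to the chromosome-retention pattern of each cell line. Panel data are invented.

panel = {                                   # cell line -> human chromosomes retained
    "lineA": {1, 4, 17},
    "lineB": {2, 17, 21},
    "lineC": {4, 9},
    "lineD": {17, 21},
}
pcr_positive = {"lineA": True, "lineB": True, "lineC": False, "lineD": True}

candidates = [
    chrom
    for chrom in set().union(*panel.values())
    if all((chrom in retained) == pcr_positive[line] for line, retained in panel.items())
]
print(candidates)  # [17] -> the marker maps to chromosome 17
```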
Genome Reduction and Gene Loss
Many species exhibit genome reduction when subsets of their genes are not needed anymore. This typically happens when organisms adapt to a parasitic lifestyle, e.g. when their nutrients are supplied by a host. As a consequence, they lose the genes needed to produce these nutrients. In many cases, there are both free-living and parasitic species that can be compared and their lost genes identified. Good examples are the genomes of Mycobacterium tuberculosis and Mycobacterium leprae, the latter of which has a dramatically reduced genome. Another beautiful example is provided by endosymbiont species. For instance, Polynucleobacter necessarius was first described as a cytoplasmic endosymbiont of the ciliate Euplotes aediculatus. The latter species dies soon after being cured of the endosymbiont. In the few cases in which P. necessarius is not present, a different and rarer bacterium apparently supplies the same function. No attempt to grow symbiotic P. necessarius outside their hosts has yet been successful, strongly suggesting that the relationship is obligate for both partners. Yet, closely related free-living relatives of P. necessarius have been identified. The endosymbionts have a significantly reduced genome when compared to their free-living relatives (1.56 Mbp vs. 2.16 Mbp).
SNP arrays
A SNP array is a type of DNA microarray which is used to detect polymorphisms within a population. A single nucleotide polymorphism (SNP), a variation at a single site in DNA, is the most frequent type of variation in the genome. Around 335 million SNPs have been identified in the human genome, 15 million of which are present at frequencies of 1% or higher across different populations worldwide. The basic principles of the SNP array are the same as for the DNA microarray: the convergence of DNA hybridization, fluorescence microscopy, and solid-surface DNA capture. The three mandatory components of SNP arrays are: an array containing immobilized allele-specific oligonucleotide (ASO) probes; fragmented nucleic acid sequences of the target, labelled with fluorescent dyes; and a detection system that records and interprets the hybridization signal. The ASO probes are often chosen based on sequencing of a representative panel of individuals: positions found to vary in the panel at a specified frequency are used as the basis for probes. SNP chips are generally described by the number of SNP positions they assay. Two probes must be used for each SNP position to detect both alleles; if only one probe were used, experimental failure would be indistinguishable from homozygosity of the non-probed allele.
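The "two probes per SNP" point lends itself to a tiny example: a genotype can be called by comparing the fluorescence from the two allele-specific probes. The intensities and the fixed threshold below are invented; real SNP-chip software clusters many samples rather than using a hard cut-off.

```python
# Naive genotype calling from the two allele-specific probe intensities at one SNP.
# Real SNP-chip software clusters many samples; the fixed threshold here is purely illustrative.

def call_genotype(intensity_a: float, intensity_b: float, ratio_cutoff: float = 0.2) -> str:
    total = intensity_a + intensity_b
    if total == 0:
        return "no call"                     # experimental failure, not homozygosity
    fraction_a = intensity_a / total
    if fraction_a > 1 - ratio_cutoff:
        return "AA"
    if fraction_a < ratio_cutoff:
        return "BB"
    return "AB"

for a, b in [(950.0, 40.0), (30.0, 880.0), (510.0, 470.0)]:
    print(call_genotype(a, b))   # AA, BB, AB
```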
Repetitive sequences
Our knowledge of the organization of eukaryote genomes has dramatically increased thanks to ever-growing genome sequences. A well-known feature of eukaryote genomes is that they consist of a substantial proportion of repetitive sequences, occurring hundreds or thousands of times. According to their evolutionary origins and genomic distribution, repetitive DNA sequences can be broadly classified into three types: tandem repeats, interspersed repeats, and long terminal repeats (LTRs). Tandem repeats, such as microsatellites, minisatellites and satellites, are characterized by two or more contiguous repetitions of short fragments. Interspersed repeats mainly include short and long interspersed elements; both of them, together with LTRs, are evolutionarily derived from transposable elements. The evolutionary dynamics, diversity pattern, and biological function of repetitive sequences in eukaryote genomes have been intensively reviewed elsewhere.
physical maps
Physical Maps A physical map provides detail of the actual physical distance between genetic markers, as well as the number of nucleotides. There are three methods used to create a physical map: cytogenetic mapping, radiation hybrid mapping, and sequence mapping. Cytogenetic mapping uses information obtained by microscopic analysis of stained sections of the chromosome. It is possible to determine the approximate distance between genetic markers using cytogenetic mapping, but not the exact distance (number of base pairs). Radiation hybrid mapping uses radiation, such as x-rays, to break the DNA into fragments. The amount of radiation can be adjusted to create smaller or larger fragments. This technique overcomes the limitation of genetic mapping and is not affected by increased or decreased recombination frequency. Sequence mapping resulted from DNA sequencing technology that allowed for the creation of detailed physical maps with distances measured in terms of the number of base pairs. The creation of genomic libraries and complementary DNA (cDNA) libraries (collections of cloned sequences or all DNA from a genome) has sped up the process of physical mapping. A genetic site used to generate a physical map with sequencing technology (a sequence-tagged site, or STS) is a unique sequence in the genome with a known exact chromosomal location. An expressed sequence tag (EST) and a simple sequence length polymorphism (SSLP) are common STSs. An EST is a short STS that is identified with cDNA libraries, while SSLPs are obtained from known genetic markers and provide a link between genetic maps and physical maps.
radiation cell hybrids
Radiation cell hybrids are typically constructed using cells from two different species. Cells from the organism whose genome is to be mapped (donor) are irradiated with a lethal dose and then usually fused with rodent (recipient) cells. The irradiated chromosomes break at random sites and, after cell fusion with the recipient cells, the donor chromosome fragments are incorporated into the recipient chromosomes. Consequently each hybrid cell line derived from a single cell contains different parts of the donor's chromosomes, which were incorporated at random. Radiation hybrid mapping is based on this artificially induced random breaking of the genomic DNA into smaller fragments. The original order of these fragments relative to each other is determined by ascertaining which specific DNA sequences are found in the same clones, which means that they segregate together because of their close physical proximity in the genome. For detailed mapping, fewer than 100 hybrid cell lines are necessary. For example, irradiated canine cells were fused with recipient hamster cells, and 88 cell lines were selected. To map the canine genome, DNA from each cell line is tested for the presence or absence of unique canine markers, like STSs. If two markers are originally located closely on a chromosome, a break between the markers is unlikely, and, therefore, they will mostly be found together in the same cell line. In contrast, if they are farther apart or even on different chromosomes, the separation of the two markers into different cell lines is likely. Hence, the actual distance between two markers on a chromosome is proportional to the probability of the markers being separated and found in different cell lines. Analysis of hundreds to thousands of markers allows for the determination of the order and distance between markers. Higher resolution RH maps can be achieved by increasing the intensity of the initial radiation of the donor cells, leading to increased chromosomal breaks and smaller average fragment sizes. The probability of separation between closely located markers increases, thereby permitting the ordering of more markers.
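The scoring described above can be sketched directly: for each pair of markers, the fraction of hybrid cell lines retaining one marker but not the other estimates how often radiation broke the chromosome between them, and hence their relative distance. The panel data below are invented for illustration.

```python
# Estimate relative distance between markers from a radiation hybrid panel:
# markers separated in many cell lines are far apart; co-retained markers are close.
# Presence/absence data below are invented for illustration.

panel = {                      # cell line -> markers retained (PCR-positive)
    "rh01": {"stsA", "stsB"},
    "rh02": {"stsA", "stsB", "stsC"},
    "rh03": {"stsC"},
    "rh04": {"stsA", "stsB"},
    "rh05": {"stsB", "stsC"},
    "rh06": set(),
}

def breakage_frequency(m1: str, m2: str) -> float:
    """Fraction of cell lines retaining exactly one of the two markers."""
    separated = sum((m1 in kept) != (m2 in kept) for kept in panel.values())
    return separated / len(panel)

for pair in [("stsA", "stsB"), ("stsB", "stsC"), ("stsA", "stsC")]:
    print(pair, f"{breakage_frequency(*pair):.2f}")
# stsA-stsB are rarely separated (close together); stsA-stsC are often separated (far apart)
```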
Retrotransposons
Retrotransposons can be transcribed into RNA, which is then duplicated at another site in the genome. Retrotransposons can be divided into long terminal repeat (LTR) and non-long terminal repeat (non-LTR) elements. LTR retrotransposons are derived from ancient retroviral infections, so they encode proteins related to retroviral proteins including gag (structural proteins of the virus), pol (reverse transcriptase and integrase), pro (protease), and in some cases env (envelope) genes. These genes are flanked by long repeats at both the 5' and 3' ends. It has been reported that LTR elements constitute the largest fraction of most plant genomes and might account for the huge variation in genome size. Non-LTR retrotransposons are classified as long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), and Penelope-like elements (PLEs). In Dictyostelium discoideum, there are also DIRS-like elements belonging to the non-LTRs. Non-LTRs are widely spread in eukaryotic genomes. Long interspersed elements (LINEs) encode genes for reverse transcriptase and endonuclease, making them autonomous transposable elements. The human genome has around 500,000 LINEs, making up around 17% of the genome. Short interspersed elements (SINEs) are usually less than 500 base pairs and are non-autonomous, so they rely on the proteins encoded by LINEs for transposition. The Alu element is the most common SINE found in primates. It is about 350 base pairs long and occupies about 11% of the human genome, with around 1,500,000 copies.
SNP genotyping
SNP genotyping is the measurement of genetic variation at single nucleotide polymorphisms (SNPs) between members of a species. It is a form of genotyping, which is the measurement of more general genetic variation. SNPs are one of the most common types of genetic variation. A SNP is a single base pair mutation at a specific locus, usually consisting of two alleles (where the rare allele frequency is > 1%). SNPs are found to be involved in the etiology of many human diseases and are becoming of particular interest in pharmacogenetics. Because SNPs are conserved during evolution, they have been proposed as markers for use in quantitative trait locus (QTL) analysis and in association studies in place of microsatellites. The use of SNPs is being extended in the HapMap project, which aims to provide the minimal set of SNPs needed to genotype the human genome. SNPs can also provide a genetic fingerprint for use in identity testing.[1] The increase in interest in SNPs has been reflected in the rapid development of a diverse range of SNP genotyping methods:
• allele-specific PCR or DNA (mini)sequencing - SNaPshot method
• TaqMan genotyping - ABI technology that uses extra, fluorescently labelled, allele-specific oligos in the PCR
• SNP arrays - 'SNP chips' - can now type >1 million SNPs for one individual in one experiment
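As a small illustration of what genotyping data look like once collected, the sketch below tallies hypothetical genotype calls at a single biallelic SNP and computes allele frequencies and the minor allele frequency; the sample data and allele labels are invented.

```python
# Minimal sketch: summarising genotype calls at one hypothetical SNP
# (alleles 'A' and 'G'), as might come from a TaqMan or SNP-array assay,
# and computing the minor allele frequency (MAF). Data are invented.
from collections import Counter

genotypes = ["AA", "AG", "AA", "GG", "AG", "AA", "AG", "AA", "AA", "AG"]

allele_counts = Counter()
for gt in genotypes:
    allele_counts.update(gt)          # count both alleles of each diploid call

total_alleles = sum(allele_counts.values())
freqs = {allele: n / total_alleles for allele, n in allele_counts.items()}
maf = min(freqs.values())

print("genotype counts:", Counter(genotypes))
print("allele frequencies:", freqs)
# A variant is conventionally called a polymorphism when the rarer allele
# exceeds ~1% frequency in the population.
print(f"minor allele frequency = {maf:.2f}")
```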
STS (sequence tagged sites)
Sequence-tagged site (STS) mapping
This technique maps the positions of short DNA sequences (200-500 base pairs in length) that are easily recognisable and occur only once in the genome. These short DNA sequences are called sequence-tagged sites (STSs). To map a set of STSs, a collection of overlapping DNA fragments from a single chromosome or from the entire genome is required. To do this, the genome is first broken up into fragments. The fragments are then replicated up to 10 times in bacterial cells to create a library of DNA clones. The polymerase chain reaction (PCR) is then used to determine which fragments contain STSs: primers are designed to bind either side of the STS to ensure that only that part of the DNA is copied. If two DNA fragments are found to contain the same STS, they must represent overlapping parts of the genome. If one DNA fragment contains two different STSs, those two STSs must be near each other in the genome.
STSs are short, non-repetitive DNA segments that are located at unique sites in the genome and can be easily amplified by the polymerase chain reaction (PCR). Common sources of STSs include expressed sequence tags (ESTs), microsatellites (discussed later) and known genomic sequences that have been deposited in databanks. ESTs are short sequences obtained by converting mRNA into complementary DNA (cDNA). They are unique and valuable sequences because they represent parts of the genes expressed in the cells or tissue used for the mRNA extraction. To construct a genome map using STSs, different DNA resources, sometimes called a mapping reagent, can be used. The most common resources are radiation hybrid panels or clone libraries, both of which can be constructed from either the whole genome or a single chromosome.
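The overlap logic can be made concrete with a short sketch: given the STS content of each clone (as would be scored by PCR), clones sharing an STS are inferred to overlap, and STSs found in the same clone are inferred to lie close together. The clone and STS names below are hypothetical.

```python
# Minimal sketch: inferring clone overlaps from STS content. In practice each
# entry would come from scoring a PCR assay (STS present/absent) on DNA from
# each clone in the library; the data here are invented.
from itertools import combinations

sts_content = {
    "cloneA": {"STS1", "STS2"},
    "cloneB": {"STS2", "STS3"},
    "cloneC": {"STS3", "STS4"},
    "cloneD": {"STS6"},
}

# If two clones share at least one STS, they must overlap in the genome.
overlaps = [
    (c1, c2, sorted(s1 & s2))
    for (c1, s1), (c2, s2) in combinations(sts_content.items(), 2)
    if s1 & s2
]
for c1, c2, shared in overlaps:
    print(f"{c1} overlaps {c2} via {shared}")

# Conversely, STSs found together in one clone (e.g. STS1 and STS2 in cloneA)
# must lie near each other in the genome.
```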
Dawkins selfish gene theory
The Selfish Gene is a 1976 book on evolution by the biologist Richard Dawkins, in which the author builds upon the principal theory of George C. Williams's Adaptation and Natural Selection (1966). Dawkins uses the term "selfish gene" as a way of expressing the gene-centred view of evolution (as opposed to views focused on the organism and the group), popularising ideas developed during the 1960s by W. D. Hamilton and others. From the gene-centred view, it follows that the more two individuals are genetically related, the more sense (at the level of the genes) it makes for them to behave selflessly towards each other. A lineage is expected to evolve to maximise its inclusive fitness, the number of copies of its genes passed on globally (rather than by a particular individual). As a result, populations will tend towards an evolutionarily stable strategy. The book also introduces the term meme for a unit of human cultural evolution analogous to the gene, suggesting that such "selfish" replication may also model human culture, in a different sense. Memetics has become the subject of many studies since the publication of the book. In raising awareness of Hamilton's ideas, as well as making its own valuable contributions to the field, the book has also stimulated research on human inclusive fitness. In the foreword to the book's 30th-anniversary edition, Dawkins said he "can readily see that [the book's title] might give an inadequate impression of its contents" and in retrospect thinks he should have taken Tom Maschler's advice and called the book The Immortal Gene. In July 2017, a poll to celebrate the 30th anniversary of the Royal Society science book prize listed The Selfish Gene as the most influential science book of all time.
Chromatin Is Coiled into Higher-Order Structures
The packaging of DNA into nucleosomes shortens the fiber length about sevenfold. In other words, a piece of DNA that is 1 meter long will become a "string-of-beads" chromatin fiber just 14 centimeters (about 6 inches) long. Despite this shortening, a half-foot of chromatin is still much too long to fit into the nucleus, which is typically only 10 to 20 microns in diameter. Therefore, chromatin is further coiled into an even shorter, thicker fiber, termed the "30-nanometer fiber" because it is approximately 30 nanometers in diameter. Over the years, there has been a great deal of speculation concerning the manner in which nucleosomes are folded into 30-nanometer fibers. Part of the problem lies in the fact that electron microscopy is perhaps the best way to visualize packaging, but individual nucleosomes are hard to discern after the fiber has formed. It also makes a difference whether observations are made using isolated chromatin fibers or chromatin within whole nuclei. Thus, the 30-nanometer fiber may be highly irregular and not quite the uniform structure depicted in instructive drawings such as figure 1. Interestingly, histone H1 is very important in stabilizing chromatin higher-order structures, and 30-nanometer fibers form most readily when H1 is present. Processes such as transcription and replication require the two strands of DNA to come apart temporarily, allowing polymerases access to the DNA template. However, the presence of nucleosomes and the folding of chromatin into 30-nanometer fibers pose barriers to the enzymes that unwind and copy DNA. It is therefore important for cells to have means of opening up chromatin fibers and/or removing histones transiently to permit transcription and replication to proceed. Generally speaking, there are two major mechanisms by which chromatin is made more accessible: (1) histones can be enzymatically modified by the addition of acetyl, methyl, or phosphate groups; and (2) histones can be displaced by chromatin remodeling complexes, thereby exposing underlying DNA sequences to polymerases and other enzymes. It is important to remember that these processes are reversible, so modified or remodeled chromatin can be returned to its compact state after transcription and/or replication are complete.
Genes vs organisms
There are other times when the implicit interests of the vehicle and replicator are in conflict, such as the genes behind certain male spiders' instinctive mating behaviour, which increase the organism's inclusive fitness by allowing it to reproduce, but shorten its life by exposing it to the risk of being eaten by the cannibalistic female. Another example is the existence of segregation distorter genes that are detrimental to their host, but nonetheless propagate themselves at its expense. Likewise, the persistence of junk DNA that [Dawkins believed at that time] provides no benefit to its host can be explained on the basis that it is not subject to selection. These unselected for but transmitted DNA variations connect the individual genetically to its parents, but confer no survival benefit. These examples might suggest that there is a power struggle between genes and their interactor. In fact, the claim is that there isn't much of a struggle because the genes usually win without a fight. However, the claim is made that if the organism becomes intelligent enough to understand its own interests, as distinct from those of its genes, there can be true conflict. An example of such a conflict might be a person using birth control to prevent fertilisation, thereby inhibiting the replication of his or her genes. But this action might not be a conflict of the 'self-interest' of the organism with his or her genes, since a person using birth control might also be enhancing the survival chances of their genes by limiting family size to conform with available resources, thus avoiding extinction as predicted under the Malthusian model of population growth.
RFLP
A variation in the length of the restriction fragments produced by a given restriction enzyme in a sample of DNA. Such variation is used in forensic investigations and to map hereditary disease. In molecular biology, restriction fragment length polymorphism (RFLP) is a technique that exploits variations in homologous DNA sequences, known as polymorphisms, in order to distinguish individuals, populations, or species, or to pinpoint the locations of genes within a sequence. The term may refer to a polymorphism itself, as detected through the differing locations of restriction enzyme sites, or to a related laboratory technique by which such differences can be illustrated. In RFLP analysis, a DNA sample is digested into fragments by one or more restriction enzymes, and the resulting restriction fragments are then separated by gel electrophoresis according to their size. Although now largely obsolete due to the emergence of inexpensive DNA sequencing technologies, RFLP analysis was the first DNA profiling technique inexpensive enough to see widespread application. RFLP analysis was an important early tool in genome mapping, localization of genes for genetic disorders, determination of risk for disease, and paternity testing.
The basic technique for the detection of RFLPs involves fragmenting a sample of DNA with a restriction enzyme, which can selectively cleave a DNA molecule wherever a short, specific sequence is recognized, in a process known as a restriction digest. The DNA fragments produced by the digest are then separated by length using agarose gel electrophoresis and transferred to a membrane via the Southern blot procedure. Hybridization of the membrane to a labeled DNA probe then determines the length of the fragments that are complementary to the probe. A restriction fragment length polymorphism is said to occur when the length of a detected fragment varies between individuals, indicating non-identical sequences at the restriction sites. Each fragment length is considered an allele, whether or not it actually contains a coding region, and can be used in subsequent genetic analysis.
The technique for RFLP analysis is, however, slow and cumbersome. It requires a large amount of sample DNA, and the combined process of probe labeling, DNA fragmentation, electrophoresis, blotting, hybridization, washing, and autoradiography can take up to a month to complete. A limited version of the RFLP method that used oligonucleotide probes was reported in 1985. The results of the Human Genome Project have largely replaced the need for RFLP mapping, and the identification of many single-nucleotide polymorphisms (SNPs) in that project (as well as the direct identification of many disease genes and mutations) has replaced the need for RFLP disease linkage analysis (see SNP genotyping). The analysis of VNTR alleles continues, but is now usually performed by polymerase chain reaction (PCR) methods. For example, the standard protocols for DNA fingerprinting involve PCR analysis of panels of more than a dozen VNTRs.
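The following sketch shows, with an invented pair of alleles and the EcoRI recognition sequence, how a single base change at a restriction site alters the set of fragment lengths, which is the polymorphism that the Southern blot would reveal. The cut-position handling is simplified for illustration.

```python
# Minimal sketch: why a single-base change can alter restriction fragment
# lengths. Sequences are invented; the enzyme site is EcoRI (GAATTC).
import re

ECORI = "GAATTC"          # EcoRI recognition sequence (cuts between G and A)

def fragment_lengths(seq, site=ECORI):
    """Return fragment lengths after cutting seq at every occurrence of site.
    The cut position is simplified to one base into the recognition site."""
    cut_positions = [m.start() + 1 for m in re.finditer(site, seq)]
    boundaries = [0] + cut_positions + [len(seq)]
    return [b - a for a, b in zip(boundaries, boundaries[1:])]

allele_1 = "AAAA" + "GAATTC" + "CCCCCCCC" + "GAATTC" + "TTTT"   # two sites
allele_2 = "AAAA" + "GAATTC" + "CCCCCCCC" + "GACTTC" + "TTTT"   # SNP destroys the 2nd site

print("allele 1 fragments:", fragment_lengths(allele_1))   # three fragments
print("allele 2 fragments:", fragment_lengths(allele_2))   # two fragments -> an RFLP
```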
1st step in understanding a genome - make maps
• A primary goal of the Human Genome Project is to make a series of descriptive diagrams (maps) of each human chromosome at increasingly finer resolutions. Mapping involves (1) dividing the chromosomes into smaller fragments that can be propagated and characterized and (2) ordering (mapping) them to correspond to their respective locations on the chromosomes. After mapping is completed, the next step is to determine the base sequence of each of the ordered DNA fragments. The ultimate goal of genome research is to find all the genes in the DNA sequence and to develop tools for using this information in the study of human biology and medicine. Improving the instrumentation and techniques required for mapping and sequencing (a major focus of the genome project) will increase efficiency and cost-effectiveness. Goals include automating methods and optimizing techniques to extract the maximum useful information from maps and sequences. A genome map describes the order of genes or other markers and the spacing between them on each chromosome. Human genome maps are constructed on several different scales or levels of resolution. At the coarsest resolution are genetic linkage maps, which depict the relative chromosomal locations of DNA markers (genes and other identifiable DNA sequences) by their patterns of inheritance. Physical maps describe the chemical characteristics of the DNA molecule itself. Geneticists have already charted the approximate positions of over 2,300 genes, and a start has been made in establishing high-resolution maps of the genome. More precise maps are needed to organize systematic sequencing efforts and to plan new research directions.
• To make a map, we need a set of reference points and signposts - 'markers' - anything you can use to recognise where you are in the genome.
 - to study a genome at a fine level, many thousands of markers are needed
 - these markers will be 'invisible', but their positions are detectable and definable by experiment
 - once you've sequenced the whole genome, you can express position in terms of distance in base pairs along the chromosome, but markers are still going to be useful
Clones - uses in physical mapping
• Genomic libraries - whole genome represented in library
• Shotgun libraries - smaller chunks of genome cloned as large numbers of random smaller fragments
• Large insert:
 - YAC - 100 kbp-1 Mbp - yeast artificial chromosome
 - BAC - 100-150 kbp - bacterial artificial chromosome
• Medium insert:
 - Phage λ - 15 kbp
 - Cosmid - 40 kbp - cross between phage and plasmid
• Small insert:
 - Plasmid - 1-2 kbp
 - M13 phage - 0.5-1 kbp
• Large inserts good for creating libraries to minimise the number of clones (see the worked example below)
• Small inserts good for sequencing very large numbers of clones to piece together the sequence bit by bit
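The "minimise the number of clones" point can be quantified with the standard library-coverage estimate N = ln(1 - P) / ln(1 - f), where f is the insert size divided by the genome size and P is the desired probability that any given locus is represented. The sketch below applies it to round, illustrative insert sizes and an approximately 3 Gb genome; the exact figures are assumptions chosen only to show the trend.

```python
# Minimal sketch: how many clones does a genomic library need?
# Uses the standard coverage estimate N = ln(1 - P) / ln(1 - f), with
# f = insert size / genome size. Sizes below are round illustrative figures.
import math

GENOME_SIZE = 3.0e9          # roughly the human genome, in base pairs
P = 0.99                     # want a 99% chance of covering any given locus

vectors = {
    "plasmid (2 kb)": 2_000,
    "cosmid (40 kb)": 40_000,
    "BAC (150 kb)": 150_000,
    "YAC (1 Mb)": 1_000_000,
}

for name, insert in vectors.items():
    f = insert / GENOME_SIZE
    n_clones = math.log(1 - P) / math.log(1 - f)
    print(f"{name:16s} -> ~{n_clones:,.0f} clones for {P:.0%} coverage")
```

Larger inserts shrink the library by orders of magnitude, which is why YACs and BACs were favoured for whole-genome physical maps, while small-insert clones suit bulk sequencing.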
Using gridded clone libraries to identify genes
• Gridded libraries - cosmid library arrayed onto nylon by spotting DNA or bacterial clones
• Hyb. filters with known probe (e.g. cDNA for gene of interest) - reveals which clones include the probe sequence
• Use info to work out which clones in the library:
 - contain the gene of interest
 - overlap with each other
• Make 'contigs' of overlapping clones (see the sketch below)
• Construct map from contigs
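A minimal sketch of the contig-building step: clones hit by a common probe are assumed to overlap, and chains of such overlaps are grouped into contigs with a simple union-find. Clone and probe names are invented.

```python
# Minimal sketch: grouping clones into contigs from hybridisation results on
# a gridded library. Clones hit by the same probe are assumed to overlap;
# chains of overlaps define a contig. All names are hypothetical.
from collections import defaultdict

probe_hits = {
    "probe1": {"cos01", "cos02"},
    "probe2": {"cos02", "cos03"},
    "probe3": {"cos07"},            # an unlinked clone forms its own contig
}

parent = {}                          # simple union-find over clone names

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

for clones in probe_hits.values():
    clones = sorted(clones)
    for c in clones:
        find(c)                      # register every clone, even singletons
    for other in clones[1:]:
        union(clones[0], other)      # link clones sharing this probe

contigs = defaultdict(set)
for clone in parent:
    contigs[find(clone)].add(clone)

for members in contigs.values():
    print("contig:", sorted(members))
```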
Building maps of the genome - how it's done
• Historical perspective - earlier approaches are simpler to understand and build up your understanding gradually
• Basic modern workhorse technology - PCR - the polymerase chain reaction
 - used to amplify small segments of the genome for further analysis
The objectives of the Human Genome Project are to create high-resolution genetic and physical maps, and ultimately to determine the complete nucleotide sequence of the human genome. The result of this initiative will be to localize the estimated 50,000-100,000 human genes, and to acquire information that will enable a better understanding of the relationship between genome structure and function. To achieve these goals, new methodologies that provide more rapid, efficient, and cost-effective means of genomic analysis will be required. From both conceptual and practical perspectives, the polymerase chain reaction (PCR) represents a fundamental technology for genome mapping and sequencing. The availability of PCR has allowed definition of a technically credible form that the final composite map of the human genome will take, as described in the sequence-tagged site proposal. Moreover, applications of PCR have provided efficient approaches for identifying, isolating, mapping, and sequencing DNA, many of which are amenable to automation. The versatility and power provided by PCR have encouraged its involvement in almost every aspect of human genome research, with new applications being developed on a continual basis; indeed, the potential utility of PCR in human genome mapping and sequencing appears to be limited only by the creativity of its users. Although human genome research has been conducted for many years, a formal targeted program to map and sequence the human genome has only recently been initiated (10). The application of PCR has been central in formulating both the conceptual and practical approaches on which this project is based.
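As a toy illustration of the PCR logic used throughout mapping (for example, scoring an STS as present or absent), the sketch below performs an "in silico PCR": it locates a forward primer and the reverse complement of a reverse primer in a template and reports the expected product size. The sequences and primers are invented, and real assay design would also consider melting temperature, mispriming and so on.

```python
# Minimal sketch: the logic of a PCR assay in silico. Given a template and a
# primer pair, find where the primers would anneal and report the expected
# product size. All sequences are invented.

def revcomp(seq):
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def pcr_product_size(template, fwd_primer, rev_primer):
    """Return the product length if both primers find a site in the expected
    orientation, else None (no amplification)."""
    start = template.find(fwd_primer)
    rev_site = template.find(revcomp(rev_primer))   # reverse primer anneals to the other strand
    if start == -1 or rev_site == -1 or rev_site < start:
        return None
    return (rev_site + len(rev_primer)) - start

template = "TTGACCTAGCTAGGCTTACGATCGATCGGCTAGCTTAGGCCATGCA"
fwd = "GACCTAGC"
rev = "GCCTAAGC"   # its reverse complement (GCTTAGGC) occurs near the 3' end of the template

print("expected product size:", pcr_product_size(template, fwd, rev), "bp")
```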
Evolution of the Human Genome Project
• Phase 2 target - human physical clone-based map - completed 1995
 - YAC map anchored ('hung') on map of microsatellite markers (CEPH)
 - STS map by radiation hybrid mapping (Whitehead)
 - integrated YAC + STS map (Whitehead)
 - BAC map based on STS map (several centres)
• BAC/PAC 'contig' on chr.1q with integrated map of gene locations
• YAC contig created by 'chromosome walking' using 'clone-end' STS markers
Primer walking (or directed sequencing) is a sequencing method of choice for DNA fragments between 1.3 and 7 kilobases. Such fragments are too long to be sequenced in a single read using the chain-termination method, so the method works by dividing the long sequence into several consecutive short ones. The DNA of interest may be a plasmid insert, a PCR product or a fragment representing a gap when sequencing a genome. The term "primer walking" is used where the main aim is to sequence the genome; the term "chromosome walking" is used instead when the sequence is known but there is no clone of a gene. For example, the gene for a disease may be located near a specific marker, such as an RFLP, on the sequence. The fragment is first sequenced as if it were a shorter fragment: sequencing is performed from each end using either universal primers or specifically designed ones, which should identify the first 1,000 or so bases. To completely sequence the region of interest, new primers (complementary to the final 20 bases of the known sequence) are then designed and synthesised to obtain contiguous sequence information.
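The walking loop itself can be sketched in a few lines. The simulation below pretends that each sequencing read reveals the next stretch of bases downstream of a primer designed on the end of the known sequence; the read length, primer length and template are toy values, and each loop iteration stands in for a new wet-lab sequencing reaction.

```python
# Minimal sketch: simulated primer walking along a known template. Toy sizes
# are used (real reads are several hundred bases, primers ~20 bases).
import random

def primer_walk(template, read_length=70, primer_length=20):
    """Reveal the template step by step: each 'read' returns read_length new
    bases; the last primer_length bases of known sequence seed the next read."""
    known = template[:read_length]                 # first read from a universal primer
    while len(known) < len(template):
        primer = known[-primer_length:]            # design a primer on the end of the known sequence
        pos = template.find(primer, len(known) - primer_length)
        new_read = template[pos:pos + primer_length + read_length]
        known += new_read[primer_length:]          # append only the newly revealed bases
    return known

random.seed(0)
template = "".join(random.choice("ACGT") for _ in range(300))
assert primer_walk(template) == template
print("walked", len(template), "bases in steps of 70")
```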
deletion mapping
• Q. - how do we localise genes, or the sites of disease mutations and predisposing variants for traits, in the genome?
• First job - visualise the genome, see what can be seen...
• First technology available - late 1950s onwards - cytogenetics
• gene mapping - adenylate kinase gene mapped to chr.9 by deletion/dosage observation
• disease genes associated with chromosomal deletions and translocations
• example - Duchenne muscular dystrophy mapped to chr. Xp21 from observations of X-autosome translocations
Deletion mapping is a technique used to find the mutation sites within a gene. The principle of deletion mapping involves crossing a strain which has a point mutation in a gene with multiple strains, each of which carries a deletion in a different region of the same gene. Wherever recombination between the two strains can produce a wild-type (+) gene (regardless of frequency), the point mutation cannot lie within the region of the deletion. If recombination cannot produce any wild-type genes, then it is reasonable to conclude that the point mutation and the deletion lie within the same stretch of DNA.
• Deletion series leads to localisation of a gene - if you have several different patients, each with a different deletion, the minimum region of overlap of the deletions defines the region containing the gene of interest (the sketch below illustrates the same interval logic).
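The interval logic behind deletion mapping can be sketched as follows: the mutation (or gene) must lie inside every deletion that fails to yield wild-type recombinants (or inside every patient deletion) and outside every deletion that does yield them. The strain names, coordinates and results below are invented, and the exclusion step is deliberately simplified.

```python
# Minimal sketch of deletion-mapping interval logic. Coordinates and strain
# names are invented. For each deletion strain we record whether crossing it
# to the point mutant yielded any wild-type recombinants.

deletions = {
    # strain: ((deleted interval start, end), gave_wildtype_recombinants)
    "del1": ((0, 40), True),     # recombinants recovered -> mutation not in 0-40
    "del2": ((30, 70), False),   # no recombinants -> mutation lies within 30-70
    "del3": ((50, 90), False),   # no recombinants -> mutation lies within 50-90
}

lo, hi = 0, 100                  # boundaries of the gene (arbitrary units)

# The mutation must lie inside every deletion that gave no wild-type recombinants.
for (start, end), recombinants in deletions.values():
    if not recombinants:
        lo, hi = max(lo, start), min(hi, end)

# Simplified exclusion: trim the candidate interval only where an excluded
# deletion overlaps one of its edges.
for (start, end), recombinants in deletions.values():
    if recombinants:
        if start <= lo <= end:
            lo = end
        if start <= hi <= end:
            hi = start

print(f"point mutation localised to the interval ({lo}, {hi})")
```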
Next phase of genome mapping
• The problem - thousands of Mendelian and complex disorders and traits recognised
 - the need to map the genes for these traits/disorders by linkage analysis
 - early 1980s - lack of a really good, dense genetic map for mapping
• Rationale behind the 'genome project' idea
 - once disease genes are mapped by linkage, how do we identify the disease gene itself?
 - the process needs to be quick and cheap
 - but at that time there was a lack of really good maps of the genome showing the location and identity of all genes
 - some microorganism genomes were already being analysed; there was already interest in the genomes of C. elegans, Drosophila etc.
• The Human Genome Project (HGP) idea was formalised by the U.S. D.O.E. in 1987
What questions can we use large-scale genomic sequence data to answer?
• discover new genes - 'gene prediction' - add to 'known' genes to define the complete set of all genes in the genome (a simple ORF-finding sketch follows this list)
• genome annotation - identify position, structure and function of all genes
• classify all genes/DNA sequences by their evolutionary relationships - gene families
• understand evolution of genes and genomes
• how are two organisms related? - analyse large-scale sequence from the genome to find out (not just individual genes, rDNA, mtDNA etc.)
• locate and characterise all varying sites (polymorphisms and rare variants/mutations) in the genome
 - use these 'genomic resources' to discover the location and identity of genes/variants predisposing to traits/disorders
• understand structure and functioning of the genome - incl. chromatin-level gene expression regulation
• Answer 1 - making best use of resources to achieve our ends
• Answer 2 - understanding genomes is a necessary part of understanding life and how it works
• Answer 3 - understanding genomes can help us to understand and deal with the human condition - e.g. genome analysis and 'genomic diseases'...
• Some diseases can be caused by regions of the genome misbehaving during replication and meiosis
 - Why does this happen? Are some parts of the genome particularly prone to such rearrangements? If so, why?
 - example - SMA (spinal muscular atrophy)
  » chr.5 contains an inverted duplication containing several genes, including the SMN genes (survival motor neuron)
  » patients make too little SMN1 protein - often due to a deleted gene
  » occurs due to illegitimate recombination between repeats
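As a sketch of the simplest form of gene prediction mentioned in the first point above, the code below scans the forward strand of an invented sequence for open reading frames (ATG...stop) above a length threshold. Real gene prediction also uses the reverse strand, splice models, codon bias and homology evidence.

```python
# Minimal sketch of naive gene prediction: report open reading frames (ORFs)
# longer than a threshold in the three forward reading frames. The sequence
# and threshold are invented for illustration.

def find_orfs(seq, min_codons=5):
    """Yield (start, end, n_codons) for ATG...stop ORFs in the three
    forward reading frames."""
    stops = {"TAA", "TAG", "TGA"}
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                          # first ATG opens a candidate ORF
            elif codon in stops and start is not None:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    yield (start, i + 3, n_codons)
                start = None                       # reset after the stop codon

seq = "CCATGGCTGCTAAGGCTTTTGCTTAAGGATGAAACCC"
for start, end, n in find_orfs(seq, min_codons=3):
    print(f"ORF at {start}-{end}: {n} codons")
```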
Marker maps - Radiation hybrids
• late 1980s onwards - useful way to generate STS maps
• 'zap' human cells with a controlled dose of radiation - fragments chromosomes at random in different cells
• fuse cells with mouse cells - result is a kind of somatic cell hybrid
• 'clone' the resulting hybrids - create a panel of RH cell lines (e.g. 96 hybrid lines)
• assay each line for the presence of new STS markers by PCR
• markers close to each other on a particular chromosome tend to stay on one fragment more often than widely spaced markers or those on different chromosomes
 - use stats to calculate the probability of co-occurrence of markers across all the panel clones (a bit like doing linkage analysis)
 - create a map of 'framework' markers and place new markers on this map
This is a type of physical mapping technique. It generates a high-resolution map of the genome. Rodent cells have the ability to maintain chromosomes derived from other species as part of their chromosomes. It is possible to induce chromosomal breakage by exposing the cells to X-rays; the breakage depends on the intensity and the duration of the irradiation. The irradiated cells with broken chromosomes are fused with rodent cells and grown on HAT medium, which allows only hybrid cells to grow. This gives a panel of radiation hybrids. The landmarks that can be mapped on the radiation hybrids include STSs, ESTs, etc. Even non-polymorphic markers can be mapped using the radiation hybrid mapping technique. The presence or absence of markers in a particular radiation hybrid is identified by PCR amplification. The linkage between two markers is determined by calculating the breakage percentage based on PCR amplification. When two markers are located close together, they will be retained together during irradiation; otherwise they will get separated. Breakage depends on the distance between markers. A particular radiation hybrid will show PCR amplification for two markers if they are located close to each other. The LOD score is calculated, and the order of and distance between the markers are mapped. This technique is useful when STS content mapping using YAC clones is not feasible, owing to poor maintenance of certain parts of the chromosome, such as GC-rich regions and the terminal parts of chromosomes.
Steps in RH mapping:
1. Irradiation of human fibroblast cells.
2. Fusion of irradiated cells with rodent cells.
3. Selection of fused cells on HAT medium.
4. PCR amplification with specific primers using hybrid chromosomes.
5. Construction of the RH map.
When the Human Genome Sequencing Project started in 1990, scientists opted for a map-based genome sequencing strategy. Therefore, one of the objectives of the Human Genome Project was to develop high-density mapping methods. Targets were set for genetic mapping and physical mapping, and many new mapping methods such as optical mapping, fibre FISH and radiation hybrid mapping were developed.