Human Molecular Genetics - MODULE 2 - Next generation sequencing and human molecular genetics


Genotyping - microsatellites

- microsatellite markers on a GeneScan gel - fluorescent PCR products - multiplex loading - polyacrylamide sequencing gel

1977

1. Di-deoxy (chain-termination) sequencing developed - Fred Sanger; sequencing efficiency = 150
2. Chemical degradation sequencing developed - Walter Gilbert; sequencing efficiency = 1,500

Gene structure - influence on mutation screening design

Essential to have this picture in mind when considering where variation is located, and its possible functional effect

GWAS results - chronotype

Are there genes for chronotype? Do a GWAS for chronotype - oh yes, there are! UK Biobank results have come out recently (see Manhattan plot below), along with results for lots of other biorhythm and sleep-related trait phenotypes. Red: variants enriched in people who wake up early. The association test used was chi-squared.

Highlights of Mate Pair Sequencing

- High genomic diversity: an efficient protocol enables the highest genomic diversity of any next-generation platform.
- User-friendly workflow: a simple workflow with limited hands-on time and multiple stopping points.
- Low DNA input requirements: requires as little as 1 μg of starting material.

Genetic architecture of complex traits/disorders

In Parkinson disease, the genetic explanations are a mix of rare, high-penetrance familial forms, moderate-frequency/moderate-risk variants, and high-frequency/low-risk variants.

Linkage and haplotypes in recessive disorders

Look at haplotypes from BOTH parents. Look for the region that is non-recombinant with the disease locus in the haplotype coming from each of them, to define the linked / critical region.

physical sequence capture approach

Sequence capture technology allows targeted enrichment of specific regions of a genome, such as an entire exome. This, in concert with NGS, provides an efficient strategy for high-throughput screening of regions of interest, facilitating the identification and characterisation of physiologically relevant variants. As such, sequence capture is a cost-effective alternative to whole genome sequencing.

Targeted sequence capture coupled with NGS constitutes an efficient alternative approach to the exploration of genetic diversity in a very large number of genomic regions and specimens. The use of sequence capture techniques provides evolutionary biologists with easy access to nucleotide diversity for addressing various research questions. Furthermore, these techniques provide high sequence coverage for a small set of target sequences, making it possible to multiplex several samples, thereby reducing the cost of large-scale applications, e.g. for population genetics studies. Sequence capture techniques require access to a reference genome, but provide highly reproducible SNPs and markers with greater transferability across species than other pangenomic marker systems (e.g., RADseq or GBS) (Harvey et al., 2016). Intra- and interspecific reproducibility is a prerequisite for comparative studies across populations or related species, even if sampling and molecular analysis are performed at different times. For example, George et al. (2011) developed a genomic capture approach in humans that successfully captured about 96% of coding sequences in monkeys. Similarly, in gymnosperms, a common capture design established for spruce and lodgepole pine (Suren et al., 2016) successfully captured more than 50% of the targeted bases with a coverage of at least 10×.

For hybridization enrichment, the entire genome is sheared into small fragments, which are subsequently ligated to sequencer-specific adapter DNA molecules. Biotinylated oligomers designed to be complementary to the region of interest are incubated with the previously generated sequencing library. Captured molecules from the region of interest are pulled down using streptavidin-coated magnetic beads; the DNA molecules are then eluted and are ready for sequencing.

Hybrid capture-based enrichment methods use sequence-specific capture probes that are complementary to specific regions of interest in the genome. The probes are solution-based, biotinylated oligonucleotide sequences designed to hybridize to and capture the regions intended in the design. Capture probes are significantly longer than PCR primers and can therefore tolerate several mismatches in the probe binding site without interfering with hybridization to the target region. This circumvents issues of allele dropout, which can be observed in amplification-based assays. Because probes generally hybridize to target regions contained within much larger fragments of DNA, the regions flanking the target are also isolated and sequenced. Compared to amplicon-based assays, hybrid capture-based assays enable the interrogation of neighboring regions that may not be easily captured with specific probes. However, hybrid capture-based assays can also isolate neighboring regions that are not of interest, thereby reducing overall coverage in the regions of interest if the off-target sequencing is not appropriately balanced.

Also, in cases with rearrangements, isolated neighboring regions may come from genomic areas far from the intended or predicted targets. Fragment sizes obtained by shearing and other fragmentation approaches have a large influence on the outcome of the assays. Shorter fragments will be captured with higher specificity than longer fragments, as they contain a lower proportion of off-target sequence; on the other hand, longer reads are expected to map to the reference sequence with less ambiguity than shorter reads.

lod score changes with different linkage situations

The point is that any two markers have a fixed genetic spacing - there is a particular θ which is the correct θ for those two loci (they are some number of recombinations apart) - but you can calculate the lod score, otherwise known as Z, at any theoretical value of θ. The point of the graph is that it plots the lod score you would get under different hypotheses for θ; the null hypothesis is θ = 0.5, i.e. 50% recombination under Mendel's second law.
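A minimal sketch of the two-point lod score calculation described above, for a hypothetical fully informative family (the counts and θ values are illustrative):

```python
import math

def lod(theta, n_recomb, n_nonrecomb):
    """Z(theta) = log10[ L(theta) / L(0.5) ]: evidence for linkage at a
    given theta vs. free recombination (theta = 0.5, Mendel's second law)."""
    linked = theta**n_recomb * (1 - theta)**n_nonrecomb
    unlinked = 0.5 ** (n_recomb + n_nonrecomb)
    return math.log10(linked / unlinked)

# e.g. 1 recombinant out of 10 informative meioses:
for theta in (0.05, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"theta = {theta:.2f}  Z = {lod(theta, 1, 9):+.2f}")
# Z peaks near theta = 0.1 (the observed recombination fraction) and is 0 at
# theta = 0.5. With zero recombinants, each informative meiosis adds
# log10(2) ~ 0.3 to Z at theta = 0 - which is why family size matters so much.
```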

1951

Base ratios determined - Erwin Chargaff. The proportion of G was equal to the proportion of C, and the proportion of A was equal to the proportion of T, in any given genome, although the relative ratio of (G+C) to (A+T) varies between genomes. Within a genome, G and C are equally frequent, and A and T are equally frequent - and we now understand the basis for that association: complementary base pairing in the double helix, deduced by Watson and Crick from the X-ray crystallographic data provided by Rosalind Franklin.

variable expressivity

Individuals with the same genotype have related phenotypes that vary in intensity - the propensity of a genotype to be associated with different phenotypes (e.g. symptoms) in different individuals. • The term 'pleiotropy' is also used if severity is very variable. • Both are influenced by modifier genes, epistasis, genetic background and environment.

looking everywhere strategy

Need a high-resolution genetic map of each chromosome - a linkage map; this was the first objective of the Human Genome Project (HGP) - using highly informative markers: microsatellites.
A linkage map is defined by:
- the identity of the markers • marker names are often anonymous, e.g. 'D12S352'
- the order of the markers
- the spacing of the markers • optimal spacing for a genome scan is 5 cM • requires approx. 450 markers across the genome • map distance is measured in cM (centiMorgans) - 1 cM ≈ θ = 0.01 (recombination fraction of 1%) ≈ 1 Mbp (in humans)
Even better linkage maps are now at our disposal - SNP maps • >10 million common SNPs (>700 million SNPs in total) already identified - sets of >1,000,000 have been put on one SNP-chip - we can genotype all of these in one individual in one experiment!
Genetic distance is all about recombinations; a genetic map is a list of markers with their positions and genetic distances (cM). You cannot detect recombination unless both loci you are looking at are heterozygous in the parent - only then can you follow which combination of alleles was passed down to the offspring. SNPs have only 2 alleles, so a parent is less likely to be heterozygous; microsatellite markers may have up to 12 alleles, so it is more likely that an individual - and therefore a parent - is heterozygous, and that you can follow the alleles down to the next generation. Microsatellites are therefore more informative. A good marker is informative - high heterozygosity (see the sketch below).
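A minimal sketch of why multi-allelic microsatellites are more informative than bi-allelic SNPs: expected heterozygosity under Hardy-Weinberg is H = 1 - Σ p_i², computed here for assumed, illustrative allele frequencies:

```python
def heterozygosity(freqs):
    # Expected heterozygosity under Hardy-Weinberg: H = 1 - sum(p_i^2)
    return 1 - sum(p * p for p in freqs)

snp = [0.7, 0.3]                                          # bi-allelic SNP
microsat = [0.2, 0.2, 0.15, 0.15, 0.1, 0.1, 0.05, 0.05]   # 8-allele microsatellite

print(f"SNP:            H = {heterozygosity(snp):.2f}")       # 0.42
print(f"Microsatellite: H = {heterozygosity(microsat):.2f}")  # 0.85
```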

Mode of inheritance Pattern of transmission of trait/disorder

Penetrance: an all-or-nothing state, defined as the probability that:
• a particular genotype will exhibit a particular phenotype
• an individual with a disease-causing allele will show the associated symptoms
• if they do not, they are said to be non-penetrant
• if a disorder exhibits reduced penetrance, some individuals carrying the mutant allele don't exhibit the affected phenotype
• remember - for dominant alleles, penetrance is measured in heterozygotes & homozygotes; for recessive alleles, penetrance is meaningful only in homozygotes
Penetrance 0 = the variant has nothing to do with the disease - no impact on disease whatsoever.

1965

Yeast tRNA sequenced - Robert Holley; sequencing efficiency (bp person⁻¹ year⁻¹) = 1

variant filtering

• Has this variant been observed before?
• Is it a common polymorphism? Is it a rare variant? Is it unique?
• If it has been seen before, in whom?
• Cross-reference with databases such as dbSNP and the Genome Aggregation Database (gnomAD).
• Is this variant predicted to be pathogenic? Does it change the amino acid sequence?
• Use algorithms such as the Variant Effect Predictor (VEP).
(A sketch of such a filtering pass follows.)
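A minimal sketch of the filtering logic above, assuming hypothetical variant records already annotated with a gnomAD allele frequency and a VEP-style consequence term (the record fields and the 1% cutoff are illustrative, not a real API):

```python
# Hypothetical, pre-annotated variant records
variants = [
    {"id": "rs123",            "gnomad_af": 0.31,  "consequence": "synonymous_variant"},
    {"id": "chr2:g.166870C>T", "gnomad_af": 0.0,   "consequence": "stop_gained"},
    {"id": "rs456",            "gnomad_af": 0.004, "consequence": "missense_variant"},
]

MAX_AF = 0.01  # assumed cutoff: discard common polymorphisms
DAMAGING = {"stop_gained", "frameshift_variant", "missense_variant",
            "splice_donor_variant", "splice_acceptor_variant"}

candidates = [v for v in variants
              if v["gnomad_af"] <= MAX_AF and v["consequence"] in DAMAGING]
print(candidates)  # keeps the novel stop_gained and the rare missense variant
```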

Mutation detection - screening strategy

'Positional cloning' strategy - map (i.e. locate on a genetic map) your disease gene(s), then prioritise/rank 'positional candidate' genes for mutation screening • how? - functional considerations - then screen the positional candidate genes for mutations. Strategy for mutation screening: guess how the candidate gene might be affected (use the mode of inheritance), and make sure you can distinguish causative variants from neutral polymorphisms. You can narrow down a critical region (thanks to the HGP) and work out which genes it contains, because the genetic map has been integrated with the physical map - we know where every marker is, including the genetic markers, on a map that also contains the gene locations. If unlucky...

Mendel's Laws

1) The Law of Segregation: each inherited trait is defined by a gene pair. Parental genes are randomly separated into the sex cells, so that sex cells contain only one gene of the pair. Offspring therefore inherit one genetic allele from each parent when sex cells unite in fertilization. (Random segregation / 1 locus.)
2) The Law of Independent Assortment: genes for different traits are sorted separately from one another, so that the inheritance of one trait is not dependent on the inheritance of another. (2 loci; no linkage, 50% recombination.)
3) The Law of Dominance: an organism with alternate forms of a gene will express the form that is dominant.

sequence capture

1. The entire genome is sheared into smaller fragments.
2. Sequence-specific capture probes are made that are complementary to specific regions of interest in the genome.
3. Non-specific, unbound molecules are washed away; fragments complementary to a probe sequence on a bead or chip hybridise, and the enriched DNA is eluted for NGS.

same thing applies to variants

With 2 heterozygous variants but short reads and nothing spanning the region between the two of them, phase is unknown; with paired reads, the 2 variants can be shown to be on opposite chromosomes - phase has been established between close variants.

Genotyping SNPs more generally

All SNPs - can use allele-specific amplification or allele-specific detection • amplify each allele separately using allele-specific primer sets; or • label each allelic product differently and detect the colours, e.g. the TaqMan assay - a general PCR primer pair + two allele-specific probes labelled with different fluorophores, with a quencher at the 3' end to stop fluorescence in the absence of amplification • moderate throughput: 96- or 384-well plate assay • high-throughput genotyping - SNP arrays, 'SNP-chips' • we can now type >1 million SNPs for one individual in one experiment • ...or we can just look at the genome sequence of several individuals...

Amplicon Sequencing

Amplicon sequencing is a highly targeted approach that enables researchers to analyze genetic variation in specific genomic regions. The ultra-deep sequencing of PCR products (amplicons) allows efficient variant identification and characterization. This method uses oligonucleotide probes designed to target and capture regions of interest, followed by next-generation sequencing (NGS). Amplicon sequencing is useful for the discovery of rare somatic mutations in complex samples (such as tumors mixed with germline DNA). Another common application is sequencing the bacterial 16S rRNA gene across multiple species, a widely used method for phylogeny and taxonomy studies, particularly in diverse metagenomics samples.

Amplicon sequence variant (ASV) is a term used to refer to single DNA sequences recovered from a high-throughput marker gene analysis. These amplicon reads are created following the removal of erroneous sequences generated during PCR and sequencing. This allows ASVs to distinguish sequence variation down to a single nucleotide change. ASVs are used to classify groups of species based on DNA sequences, to find biological and environmental variation, and to determine ecological patterns.

For many years, the standard unit for marker gene analysis was the operational taxonomic unit (OTU), generated by clustering sequences based on a shared similarity threshold. These traditional units were created either by clustering based on similarities between sequencing reads (de-novo OTUs) or by clustering against reference databases to define and label an OTU (closed-reference OTUs). Instead of using exact sequence variants (single nucleotide changes), OTUs are distinguished by a less fixed dissimilarity threshold, most commonly 3% - i.e. these units have to share 97% of the DNA sequence. ASV methods, on the other hand, can resolve sequence differences by as little as a single nucleotide change, which allows the method to avoid similarity-based operational clustering units altogether. ASVs therefore provide a more precise measurement of sequence variation, since the method uses DNA differences instead of user-created OTU differences. ASVs are also referred to as exact sequence variants (ESVs), zero-radius OTUs (zOTUs), sub-OTUs (sOTUs), haplotypes, or oligotypes.

association analysis in polygenic/MF traits and disorders

An association study is an attempt to find new statistical relationships between different events, or to verify already known ones. The actual causes of these relationships are often beyond the knowledge or the experimental facilities of a researcher. However, once one has collected the statistics of occurrence of combinations of different observed outcomes, a conclusion can be made regarding the significance (assessed from the probability of randomly obtaining the observed result) and intensity of these relationships.

The association between a certain polymorphic genome region and a phenotypic trait is analyzed by comparing the distributions of its alleles and genotypes in representative samples of individuals, which are formed with respect to the presence/absence of this trait and need to match in terms of sex, age, and ethnicity. The allelic variants under analysis can be localized in any DNA region, including the coding sequences (exons), introns, and promoter regions of genes, where transcriptional regulatory regions are frequently located, as well as other DNA regions. In exon analysis, not only the nonsynonymous substitutions that change the amino acid sequence of the encoded protein are of interest, but also the synonymous substitutions, since they can affect mRNA structure and stability, as well as translation kinetics, due to the use of different isoacceptor tRNAs. However, it should be remembered that, in addition to a direct relation between the investigated locus and the hereditary trait, the association may be based on linkage disequilibrium between the marker locus and the true disease locus, if these loci are located sufficiently close to one another.

The aim of association studies is to link phenotypic traits that are significant for medicine with characteristics such as allelic variation in the genome, epigenetic modifications, effects of environmental factors, lifestyle, etc. The phenotypic traits of significance for personalized medicine typically include the onset of a disease, its course (clinical presentation, extent of injury to the systems of the organism, etc.), or the efficacy of therapy with a certain drug (the area of interest of pharmacogenomics). In this review, we will focus on the association between individual traits and the carriage of allelic variants of the genome. Identification of these associations enables one to assess the risk of disease development (susceptibility), predict the character of its course, and give preference to certain methods of prevention, diagnosis and therapy based on the features of the individual genome.

The analysis of associations between polygenic diseases and the combined occurrence of alleles of different genes remains a relatively poorly developed research area. This can mainly be attributed to the fact that any increase in the number of genes being analyzed results in exponential growth in the number of combinations of their allelic variants, which makes analysis using conventional exhaustive search techniques almost infeasible.

sequencing whole genome

Particularly when looking for disease-causing mutations: the majority (but not all) of disease-causing mutations are located in exons, so we get a greater mutations-per-base-sequenced return if we sequence exonic regions - sequencing a fraction of a larger genome.

Multifactorial Disorders; Continuous or Discontinuous

Autosomal or sex-linked single gene conditions generally produce distinct phenotypes, said to be discontinuous: the individual either has the trait or does not. However, multifactorial traits may be discontinuous or continuous. Continuous traits exhibit a normal distribution in the population and display a gradient of phenotypes, while discontinuous traits fall into discrete categories and are either present or absent in individuals. Interestingly, many disorders arising from discontinuous variation show complex phenotypes also resembling continuous variation [10]. This occurs because continuous variation underlies the increased susceptibility to a disease. According to this theory, a disease develops once a distinct liability threshold is reached, and the severity of the disease phenotype increases the further liability exceeds the threshold; conversely, disease will not develop in an individual who does not reach the liability threshold. Therefore, with an individual either having the disease or not, the disease shows discontinuous variation. An example of how the liability threshold works can be seen in individuals with cleft lip and palate, a birth defect in which an infant is born with unfused lip and palate tissues. An individual with cleft lip and palate can have unaffected parents who do not seem to have a family history of the disorder.

Haplotypes and recombination - mapping familial disease/trait genes

Can I see bits of chromosome that seem to be shared by the affected people but not by the unaffected people in our family? Everything basically boils down to that. Linkage analysis is simply a way of quantifying the degree to which the disease locus is either recombinant or non-recombinant with markers at different places along the chromosomes you are looking at. Because we can't see recombinations directly, we use genotyping at markers (the basis of genotyping was covered earlier). Meiotic recombination generates the mosaic/patchwork. To find disease/trait genes, we need to detect where these recombinations have taken place - problem: we can't see them. A genetic marker enables us to: a) detect the presence of a fragment or region of the genome (a locus) in a DNA sample from an individual; b) genotype an individual at a polymorphic locus, i.e. identify which alleles at a polymorphic site are carried by an individual.

1870

DNA discovered - Friedrich Miescher, after Mendel. (The nature of genetic inheritance Mendel had discovered wasn't yet widely recognised, and there was no idea of the physical nature of DNA.) It later became apparent that chromosomes were involved in carrying the genetic material and that DNA was somehow involved in that, but it was largely assumed that DNA was merely a carrier for the protein, which was thought to be the genetic material.

Examples of Multifactorial Inheritance Disorders

Discontinuous multifactorial disorders:
- Insulin-dependent (Type 1) Diabetes Mellitus (IDDM)
- Hypertension / Obesity - above a threshold, co-morbidity risk increases
- Ischaemic Heart Disease (IHD) / Stroke
- Non-Insulin-dependent (Type 2) Diabetes Mellitus (NIDDM)
- Cleft Lip ± Cleft Palate
- Neural Tube Defects (Spina Bifida)
- Atopy (Asthma, Eczema etc.)
- Alzheimer disease (AD), Parkinson disease (PD)
- Schizophrenia, Bipolar disorder (Manic Depression)
- [Cancer - a special case...]

multifactorial inheritance

Disorders and traits influenced by genes and environment • polygenic influences rather than a single gene - in a single-gene disorder, one mutation in one gene is the whole explanation for why the person is affected (if it is highly penetrant); here, by contrast, you can carry risk variants in lots of different genes at the same time, and together they help explain why you are affected • exhibit non-Mendelian inheritance patterns. Disorders and dichotomous traits • discrete states - all-or-nothing. Quantitative traits - continuous distribution in the population (may look like a normal distribution) • normal curve, mean, variance (narrower = lower variance). Genetic predisposition, risk, liability • genetic determinism, predestiny, genetics and lifestyle.

Linkage Mapping - Genome-wide Scan for Linkage

First step in assessing linkage:
- count offspring consistent with each phase; hypothesise the true phase and count recombinants under that hypothesis
- maximum recombination rate = 50%, but you can get more apparent recombinants
- to detect recombinants, the parent must be heterozygous at both loci
- definitive phase assignment is only possible if grandparental genotypes are known and all genotypes are informative
- if loci are linked, they must be on the same chromosome and θ < 0.5

Polygenicity - how do many genes at the same time influence someone's liability to develop a disorder?

Fisher proposed the quantitative inheritance model (resolving Mendel vs Galton). He realised that quantitative traits (a 'normal' distribution, with characteristic mean and variance) are 'built up' by the integration of the small effects of many genes - polygenic inheritance. [Note - the normal distribution for a complex trait is also due to the combination of polygenic and environmental influences.] (See the simulation sketch below.)
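A minimal simulation sketch of Fisher's model: summing many small, independent bi-allelic gene effects (plus an environmental term) yields an approximately normal trait distribution. All numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_individuals, n_loci = 10_000, 100

# each locus contributes 0, 1 or 2 copies of a '+' allele (frequency 0.5)
genotypes = rng.binomial(2, 0.5, size=(n_individuals, n_loci))
trait = genotypes.sum(axis=1) + rng.normal(0, 2, n_individuals)  # genes + environment

# mean ~ 100; sd ~ sqrt(100 * 0.5 + 4) ~ 7.3 - close to a smooth normal curve
print(f"mean = {trait.mean():.1f}, sd = {trait.std():.1f}")
```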

Continuous distributions - Discontinuous traits

Fisher's model explained how 'normal trait distribution' could result from influence of multiple independent genes. How can this be generalised to include discontinuous traits and disorders? Sewall Wright, Falconer - showed how a discontinuous ('all-or-none') trait/disorder could be predicted from a threshold effect acting on a continuous distribution • a set of factors (genes and environment) determines each individual's 'liability' to develop a particular trait. Liability varies quantitatively.

Mate Pair Library Preparation Process

Following DNA fragmentation, the DNA fragments are end-repaired with labeled dNTPs. The DNA fragments are circularized, and non-circularized DNA is removed by digestion. Circular DNA is fragmented, and the labeled fragments (corresponding to the ends of the original DNA ligated together) are affinity-purified. Purified fragments are end-repaired and ligated to Illumina paired-end sequencing adapters. Additional sequences complementary to the flow cell oligonucleotides are added to the adapter sequence with tailed PCR primers. The final prepared libraries consist of short fragments made up of two DNA segments that were originally separated by several kilobases. These libraries are ready for paired-end cluster generation, followed by sequencing utilizing an Illumina next-generation sequencing (NGS) system.

Confirming presence and identity of mutations picked up in screening

The gold standard (until ~2015) was Sanger sequencing. • Candidate genes - or several dozen? You can rank them; once you've decided on a gene, you decide how to screen it, e.g. screening SCN1B for mutations in JME (epilepsy). • Amplify the relevant parts of the gene by (high-fidelity) PCR. • Think about where you need to design the PCR primers for each amplicon so that all relevant bases are interrogated in the sequence you generate. Primers flank each exon of the gene and point towards each other; with 4 pairs of primers, the figure shows the PCR products made with each primer pair from 4 different individuals - take those and screen for mutations. The strategy is basically a set of PCR products that take you to the mutation. Where would you design the primers? In the introns - primers can't sit in the exons, because you might miss something: you cannot sequence under a primer, any mutation within a primer site is invisible in a Sanger trace, and the first ~20 bases of a trace are rubbish. So put the primers in the flanking introns, not into the exon. If we want to detect splice-site mutations, we have to push the primers even slightly further into the introns, so that we sequence through the splice sites, which lie at the very ends of the introns. Amplicon sequencing: an amplicon is the notional sequence amplified by a particular pair of primers (some papers use the word for the primers themselves, but that's not entirely right).

Haplotype analysis

Haplotype analysis allows recombination boundaries to be visualised, enabling delineation of the 'critical region' and the 'risk haplotype'. On this diagram of Ménière disease genotypes on chr.14, the haplotype indicated in magenta is the 'risk haplotype'. The critical region is therefore defined by recombinations in II.5, III.7, III.1 and III.5 (the latter two are unaffected => only valid if these unaffected individuals are NOT non-penetrant!).

Locating and identifying genes for single gene disorders

If you have a clue as to where your disease gene is, use the clue to direct your strategy. The candidate gene strategy might work: if it looks like a collagen disorder, go and look at the collagen genes - that is how it worked in EB (epidermolysis bullosa), for example. Other clues - prior evidence from linkage. You're interested in a familial disorder segregating in Mendelian fashion • you want to know the causative gene • what are you going to do next? • Assemble a set of families - the largest set possible (why the largest possible? - statistical power) - obtain DNA from family members • then pick your mapping strategy. Candidate gene strategy • picking candidate genes to screen for mutations may be a dodgy approach - a big waste of time if you are wrong... - you need a less biased mapping strategy. Deletion of the elastin gene on chr.7 (red signal absent) in a patient with Williams-Beuren syndrome. • Clues to tell us where to start looking - what might these be? - sex-linkage; prior linkage to normal traits (e.g. myotonic dystrophy [DM] and the 'secretor' blood type locus on chr.19); translocations; large cytogenetic deletions/duplications. If no clues - where should we look? EVERYWHERE! - the 'positional cloning' strategy - a 'genome scan'.

Paired-end sequencing

In "short-read" sequencing, intact genomic DNA is sheared into several million short DNA fragments called "reads". Individual reads can be paired together to create paired-end reads, which offers some benefits for downstream bioinformatics data analysis algorithms. The structure of a paired-end read is described here There is a unique adapter sequence on both ends of the paired-end read, labeled "Read 1 Adapter" and "Read 2 Adapter". "Read 1", often called the "forward read", extends from the "Read 1 Adapter" in the 5′ - 3′ direction towards "Read 2" along the forward DNA strand. "Read 2", often called the "reverse read", extends from the "Read 2 Adapter" in the 5′ - 3′ direction towards "Read 1" along the reverse DNA strand. There is an arbitrary DNA sequence inserted between "Read 1" and "Read 2", which we'll call the "Inner sequence". The length of this sequence is measured as the "Inner distance". By definition, the "Insert" is the concatenation of "Read 1", the "Inner distance" sequence and "Read 2". And the length of the "Insert" is the "Insert size". A single "Fragment" includes the "Read 1 Adapter", "Read 1", "Inner sequence", "Read 2" and "Read 2 Adapter". And the length of this "Fragment" is just the "Fragment length". Paired-end reading improves the ability to identify the relative positions of various reads in the genome, making it much more effective than single-end reading in resolving structural rearrangements such as gene insertions, deletions, or inversions. It can also improve the assembly of repetitive regions. This degree of accuracy may not be required for all experiments, however, and paired-end reads are more expensive and time-consuming to perform than single-end reads

association analysis

In association analysis, both the predicted (dependent) and predicting (independent) traits are categories that divide the sample into two classes (e.g., "affected" and "healthy", or "carrier" and "noncarrier"). It is convenient to present the intersections of the classes as a 2×2 (contingency) table. Its values are used to characterize the strength of the association (OR) and its significance (p-value). The p-value is calculated using Fisher's exact test, which was proposed in 1922 and is still widely applied [6]. If a trait is represented by more than two classes that can be ranked (e.g., using a disease severity scale assigned by the medical community), 2×n-field contingency tables (where n is the number of gradations of the trait) are compiled, and the Goodman-Kruskal gamma test is used to assess the strength and significance of the association [7]. If ranking makes no sense, either the Freeman-Halton test, which extends Fisher's test to more than two categories [8], or the χ2 test [9] can be used. (Check saved bookmark for association studies; a sketch of the 2×2 case follows.)
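A minimal sketch of the 2×2 case: Fisher's exact test and the odds ratio on illustrative (made-up) allele counts, with the χ2 test as the large-sample alternative:

```python
from scipy.stats import fisher_exact, chi2_contingency

#        risk allele, other allele
table = [[60, 40],   # cases
         [35, 65]]   # controls

odds_ratio, p_fisher = fisher_exact(table)
chi2, p_chi2, dof, _expected = chi2_contingency(table)

print(f"OR = {odds_ratio:.2f}")  # (60*65)/(40*35) = 2.79
print(f"Fisher p = {p_fisher:.4f}, chi-squared p = {p_chi2:.4f}")
```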

Linkage Mapping - Genome-wide Scan for Linkage

The lod score is inversely related to the number of recombinations observed between disease and marker. Lod scores from each family at each marker locus are added together to create an overall lod score - the lod score tells us the strength of evidence for linkage to a particular locus. In essence, what we're trying to do is find parts of the genome where alleles at genetic markers get transmitted (see Mendel's 1st law) along with the disease allele to offspring - i.e. they 'cosegregate' more often than expected - any such region of the genome must be linked to the disease locus (if there's sufficient evidence for cosegregation) • i.e. we're looking for markers that break Mendel's 2nd law: independent assortment of alleles at different loci. So what we're effectively doing is trying to find out where the recombinations have happened on the genetic map, and inferring which bit of chromosome the disease locus is on by establishing which markers on either side are non-recombinant with the mutation and which are recombinant. The region between two recombinant markers is non-recombinant with the mutation - and is where the disease gene must be.

1980

M13 cloning and shotgun sequencing - Joachim Messing; sequencing efficiency = 25,000

What does a mutation look like in the sequencing trace

MECP2 screening in female Rett syndrome patients (RTT is X-linked dominant). THINK - why do we see two peaks at the position of the mutation in the case of the two point mutations? What would the trace look like if it contained the causative mutation for an autosomal recessive disorder? If we see evidence for a variant, how do we know whether it's causative?

Types of genetic marker - microsatellites

Microsatellite = short tandem repeat (STR) - a tandemly repeated nucleotide sequence - the repeating unit is 1-12 nucleotides long; the repeat tract is smallish but can vary in length - e.g. (CA)n, n = 8-20

Polymorphisms - common and rare

More than 700 million sites in the genome vary - about 20% of the genome sequence. Are they all common (i.e. with a common 'minor allele')? For a disease-causing allele, the minor allele is going to be very rare, but for polymorphisms the minor allele might be very common. Are they all rare? What's the distribution of minor allele frequencies in the population? Single-nucleotide changes can be common (SNPs) or rare (SNVs) in the population; if common, they're called SNPs. CNVs tend to be rare • a few are common, e.g. the 1 Mbp deletion polymorphism on chr.1.

characteristics

Multifactorial disorders exhibit a combination of distinct characteristics that clearly differentiate them from Mendelian inheritance. The risk of multifactorial diseases may be increased by environmental influences. The disease is not sex-limited, but it occurs more frequently in one sex than the other. The disease occurs more commonly in a distinct ethnic group (i.e., Africans, Asians, Caucasians etc.). The diseases may have more in common than generally recognized, since similar risk factors are associated with multiple diseases. The recurrence risk of such disorders is greater among relatives of an affected individual than in the general population; additionally, the risk is higher in first-degree relatives of an affected individual than in distant relatives. Multifactorial disorders also show increased concordance for disease in monozygotic twins as compared to dizygotic twins or full siblings.

DNA and mutation - consequences

Mutations may have a variety of consequences - two types of effect: (1) effect on the functioning of the gene product • almost all mutations with an effect must act via gene function = the immediate molecular phenotype; (2) effect on downstream phenotypes. Mutation effects don't have to match between the product-function level and the downstream phenotype (effect on gene product vs. effect on phenotype) - a small change in protein sequence or expression level can lead to fatal disease, while a deletion or nonsense mutation (a null mutation) may have no effect on phenotype at all: there are redundant genes where a null allele has no phenotypic effect whatsoever, and even NMD (nonsense-mediated decay) can have NO phenotypic consequences.

2008

NGS - Balasubramanian, Klenerman, Church, Rothberg

What is Paired-End Sequencing?

Paired-end sequencing allows users to sequence both ends of a fragment and generate high-quality, alignable sequence data. Paired-end sequencing facilitates detection of genomic rearrangements and repetitive sequence elements, as well as gene fusions and novel transcripts. In addition to producing twice the number of reads for the same time and effort in library preparation, sequences aligned as read pairs enable more accurate read alignment and the ability to detect insertion-deletion (indel) variants, which is not possible with single-read data.1 All Illumina next-generation sequencing (NGS) systems are capable of paired-end sequencing. It enables both ends of the DNA fragment to be sequenced. Because the distance between each paired read is known, alignment algorithms can use this information to map reads over repetitive regions more precisely. This results in much better alignment of reads, especially across areas of the genome that are difficult to sequence (repetitive regions). - Currently, fragment length (insert size) can range from 200 bp to 10,000 bp. - Paired-end sequencing is helpful for assembly and for locating repeats. It can also detect rearrangements, including insertions and deletions (indels) and inversions. - As paired-end reads are more likely to align to a reference, the quality of the entire data set improves.

Genetic architecture of complex traits and disorders - oligogenic vs polygenic models

Some MF disorders (and some traits) are 'oligogenic': a few genes of major effect contribute the majority of the heritability (many other genes may each have a small effect) • 15 major genes for eye colour - one gene explains lots of variation • two dozen (or so) major genes for Type 1 Diabetes (insulin-dependent). Many MF disorders/traits may be closer to the classical 'polygenic' model - many genes harbouring variants segregating in the population, each variant of very small effect • >1000 significant loci for height; >600 for adiposity/obesity; >200 for type 2 diabetes, schizophrenia, autism, bipolar disorder, Alzheimer disease, PD etc. • polygenic influences seem to be the norm - it sort of works like this for e.g. schizophrenia, autism, bipolar disorder, depression, neuroticism, Alzheimer and Parkinson diseases etc. - but see later slide. Note that 'major genes' in both oligogenic and polygenic traits/disorders are genes whose effect you can detect, not necessarily genes of big effect - FTO, the locus with the largest effect size on adiposity, explains only 0.5% of the phenotypic variance in Europeans.

case control studies

Case-control studies are the more common type of association study. The sample is divided into two subgroups: individuals who possess and those who do not possess a target trait at the time of the study (e.g., affected and healthy individuals). The presence of indicator traits that possibly affect the emergence of the disease is assessed in each group. Nothing is known about individuals who died before the launch of the study; thus, the higher the disease mortality, the less accurate the estimation of the level of association in terms of RR. The odds ratio (OR) is typically used as the criterion for the degree of difference between carriers and noncarriers of an indicator trait in case-control studies [1]. If the absolute risk of the disease in noncarriers is low, the OR and RR values are close; the higher the risk, the larger the difference between OR and RR. OR is always higher than RR. The results obtained using the case-control method can be distorted by ethnic heterogeneity of the groups being compared, or by environmental factors that have not been taken into account [2]. Family-based methods (e.g., comparison of affected and healthy brothers and sisters [3]) are less susceptible to distortion; however, their requirements for the input data (pairs of affected and healthy immediate relatives, preferably siblings) limit their applicability for obtaining reliable dependences. The transmission disequilibrium test (TDT) [4] imposes less strict requirements on the input sample. TDT is based on analysis of the transmission of a marker allele from heterozygous healthy parents to an affected child; the data obtained are compared with those expected under Mendelian inheritance, and in the case of disequilibrium in the transmission of an allele, association between the allele and the disease is inferred. The AFBAC (affected family-based control) method is another family-based method of association analysis, in which the control group consists of the combination of parental alleles that were not inherited by the affected child (one allele from each parent) [5].

liability threshold model

The liability-threshold model is a threshold model of categorical (usually binary) outcomes in which a large number of variables are summed to yield an overall 'liability' score; the observed outcome is determined by whether the latent score is smaller or larger than the threshold. The liability-threshold model is frequently employed in medicine and genetics to model risk factors contributing to disease. In a genetic context, the variables are all the genes and the different environmental conditions that protect against or increase the risk of a disease, and the threshold z is the biological limit past which disease develops. The threshold can be estimated from the population prevalence of the disease (which is usually low); because the threshold is defined relative to the population and environment, the liability score is generally treated as a N(0, 1) normally distributed random variable. (A numerical sketch follows.)

Early genetic models were developed to deal with very rare genetic diseases by treating them as Mendelian diseases caused by 1 or 2 genes: the presence or absence of the gene corresponds to the presence or absence of the disease, and the occurrence of the disease follows predictable patterns within families. Continuous traits like height or intelligence could be modeled as normal distributions, influenced by a large number of genes, with heritability and the effects of selection easily analyzed. Some diseases, like alcoholism, epilepsy, or schizophrenia, cannot be Mendelian diseases because they are common; do not appear in Mendelian ratios; respond slowly to selection against them; and often occur in families with no prior history of the disease - yet relatives and adoptees of someone with the disease are far more likely (though not certain) to develop it, indicating a strong genetic component. The liability-threshold model was developed to deal with these non-Mendelian binary cases; it proposes that there is a continuous, normally-distributed trait expressing risk, polygenically influenced by many genes, such that all individuals above a certain value develop the disease and all below it do not.

The first threshold models in genetics were introduced by Sewall Wright, examining the propensity of guinea pig strains to have an extra hind toe - a phenomenon that could not be explained as a dominant or recessive gene, or as continuous "blending" inheritance.[7][8] The modern liability-threshold model was introduced into human research by the geneticist Douglas Scott Falconer in his textbook[9] and two papers.[10][11] Falconer had been asked about the topic of modeling 'threshold characters' by Cyril Clarke, who had diabetes.[12] An early application of liability-threshold models was to schizophrenia by Irving Gottesman & James Shields, who found substantial heritability and little shared-environment influence,[13] undermining the "cold mother" theory of schizophrenia.
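A minimal numerical sketch of the model above: liability is taken as N(0, 1), the threshold z is set from an assumed population prevalence, and a relative's risk is read off from an assumed right-shift of their liability curve (both numbers are illustrative):

```python
from scipy.stats import norm

prevalence = 0.01             # assumed: 1% of the population affected
z = norm.ppf(1 - prevalence)  # threshold on the N(0,1) liability scale
print(f"threshold z = {z:.2f} SD above the mean")  # ~2.33

# relatives of affected individuals share risk factors, so their liability
# curve is shifted right; assume a +1 SD mean shift for illustration:
risk_relatives = 1 - norm.cdf(z - 1.0)
print(f"risk in relatives = {risk_relatives:.1%}")  # ~9.2% vs the 1% baseline
```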

risk factors

The risk for multifactorial disorders is mainly determined by universal risk factors. Risk factors are divided into three categories: genetic, environmental and complex (for example, overweight). Genetic risk factors are associated with permanent changes in the base pair sequence of the human genome. In the last decade, many studies have generated data regarding the genetic basis of multifactorial diseases. Various polymorphisms have been shown to be associated with more than one disease; examples include polymorphisms in the TNF-a, TGF-b and ACE genes.[5][6][7] Environmental risk factors vary from life events to medical interventions. The quick change in the patterns of morbidity, within one or two generations, clearly demonstrates the significance of environmental factors in the development and reduction of multifactorial disorders.[8] Environmental risk factors include changes in lifestyle (diet, physical activity, stress management) and medical interventions (surgery, drugs). Many risk factors originate from interactions between genetic and environmental factors and are referred to as complex risk factors; examples include epigenetic changes, body weight and plasma cortisol level.

The two major types of association studies

The two major types of association studies (namely, cohort studies and case-control studies) differ in terms of the time sequences in which data is collected; therefore, they also differ in terms of the parameters that can be assessed based on monitoring. In cohort studies, a selected group of individuals is divided into two subgroups; individuals who have and those who do not have a certain indicator trait (e.g., subgroups of carriers and noncarriers of a certain genotype; smoker and nonsmoker subgroups). These subgroups are monitored during a certain time interval for the development of a trait that is of interest in terms of its prediction (the target trait); e.g., a disorder. This approach enables one to numerically assess the intensity of the contribution of an indicator trait to the development of the target trait via the ratio of probabilities of disease occurrence in the carriers and noncarriers of an indicator trait. This value is assessed using the relative risk (RR).
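A minimal sketch of the RR computation for a cohort divided into carriers and noncarriers, using illustrative counts; the OR for the same table is shown for comparison (note OR > RR, as stated in the case-control section above):

```python
# Illustrative cohort counts
affected_carriers, total_carriers = 30, 200
affected_noncarriers, total_noncarriers = 10, 200

risk_carriers = affected_carriers / total_carriers           # 0.15
risk_noncarriers = affected_noncarriers / total_noncarriers  # 0.05
rr = risk_carriers / risk_noncarriers
print(f"RR = {rr:.2f}")   # 3.00

# odds ratio for the same data - always further from 1 than RR
odds = lambda a, n: a / (n - a)
or_ = odds(affected_carriers, total_carriers) / odds(affected_noncarriers, total_noncarriers)
print(f"OR = {or_:.2f}")  # 3.35
```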

Linkage Mapping - Multipoint Mapping

There's extra information hidden in a set of marker genotypes, of use in mapping disease genes: haplotype information allows us to map disease genes relative to recombination points within the linkage map. Multipoint mapping in a genome scan - assess linkage between the disease locus and each map position along a whole chromosome, taking information from all markers into account in every calculation. Multipoint linkage mapping on chr.3 from a Waardenburg syndrome genome scan • you can do multipoint 'exclusion mapping' based on the lod score (the stretch where it dives very negative). The x-axis is not θ but genetic distance (cM) along one chromosome: you can calculate the lod score at every position in the map, even between markers. Multipoint mapping is more powerful because it uses more of the information coming from recombination along the whole chromosome - it looks at the whole haplotype in every offspring, for the whole chromosome. You can do exclusion mapping based on lod scores - the region where the lod score dives down very negative - or based on haplotypes, finding the points of every recombination with the disease, as in the Rett syndrome example. θ cannot go above 0.5; the null hypothesis is θ = 0.5. Genetic distance can be more than 50 cM, because there are double recombinants that hide the evidence for recombination - they have happened, but it looks as if they haven't (another reason why θ never goes above 0.5). The recombination frequency between two genes is equal to the proportion of offspring in which a recombination event occurred between the two genes during meiosis. The recombination frequency between two genes cannot be greater than 50% because random assortment of genes generates 50% recombination (non-linked genes produce 1:1 parental to non-parental; thus the recombination frequency would be non-parental/total = 1/(1+1) = 50%).

threshold model

Thus, many different genes acting together and in various combinations, with or without environmental factors, comprise the liability, or predisposition, to a complex trait. Although the phenotype is qualitative (i.e. affected or unaffected), liability to the disease is measured on a quantitative scale (figure 4). The proportion of affected relatives will be highest among severely affected persons, since their liability is further beyond the threshold than that of mildly affected persons. In line with this, the risk is higher for closely related family members and increases with the number of cases in the family. A multifactorial threshold model describes the situation in which a genetically predisposed individual is affected when exceeding a threshold of genetic and/or environmental factors (Falconer 1965). The lower, right-shifted curve illustrates an increased liability compared to the top curve. Average liability is increased for relatives of affected people - the curve shifts to the right - because they share risk factors: genes and environment. Relative risk (RR) is proportional to the ratio of the areas under the curve to the right of the threshold value for each type of relative pair. Note - the curve for sibs is magnified vertically: the number of individuals on the y-axis is determined by the area under the curve in the top graph.

long insert paired end reads // mate pair sequencing

To simplify, you can distinguish two kinds of reads for paired-end sequencing: short-insert paired-end reads (SIPERs) and long-insert paired-end reads (LIPERs); the latter are also called mate pairs. The first difference between the two variants is - surprise - the length of the insert: SIPERs are 200-800 bp long, LIPERs can be longer. Definitely more interesting is the difference in the way the two are created. Making SIPERs is not very spectacular: after fragmenting genomic DNA, you isolate fragments of your desired length (200-800 bp) and ligate adapters to them (Fig. 1). If you want to use longer inserts to cover a larger distance between the reads, you have a problem: it is not feasible to use insert sizes over 1 kb. Luckily, this is not the end of the story - there is a nice trick used for mate pair sequencing, shown in Fig. 1. First, DNA is fragmented and fragments of the desired length (2-5 kb) are isolated. The ends of the DNA fragments are then biotinylated (biotin is added). The biotinylated ends lead to circularisation of the fragments (bringing the two distal ends together). The DNA ring is then crushed into smaller fragments (400-600 bp). Biotinylated fragments are enriched via the biotin tag: those end fragments are isolated using streptavidin-coated beads, and adapters are ligated. They are then ready for cluster generation and sequencing. The trick here is that the produced fragment (400-600 bp) contains the two ends of the original long fragment (2-5 kb) and can now be sequenced; after sequencing, you therefore get information about the original fragment. The fragment is now short enough, but when we get its sequences, we know that the two reads were distantly located in the original genome - and from the size of the fragments generated, we know how far apart they were. So although we've only got two little pieces of sequence, we know that they lay a defined physical distance apart, and that can then be used to resolve more complex structural rearrangements. Any one of these small fragments on its own can't tell us where exactly it belongs; paired-end/mate-pair reads allow bigger-scale reassemblies.

Multifactorial inheritance

What is meant by a gene 'for' a trait or disorder?
Multifactorial - multiple causal inputs, i.e. genes + environment.
How do we know that an MF trait/disorder has a genetic component?
How can we measure the strength of genetic influences on a trait?
How does polygenic inheritance work - how do many genes at the same time influence liability to develop a disorder/trait?
How do genes and environment interact?
Mendelian vs. non-Mendelian inheritance.
Predisposition, risk, liability, predestiny, lifestyle.
Discrete states vs continuous distribution.
How can we locate and identify 'predisposing' genes?

Genetic architecture of complex traits/disorders

Where are the predisposing alleles in the population? Classification of variants:
- Private, very rare - large(?) effect
- Clan/lineage - rare, moderate effect
- Rare SNP - lowish effect, but detectable
- Common SNP, ancient mutation - pretty small effect; many will be undetectable even with a huge sample size

multifactorial inheritance

A pattern of inheritance in which a trait is influenced by both genes and environmental factors. Multifactorial diseases are not confined to any specific pattern of single-gene inheritance and are likely to be associated with the effects of multiple genes together with the effects of environmental factors. The terms 'multifactorial' and 'polygenic' are used as synonyms, and are commonly used to describe the architecture of the disease-causing genetic component. Although multifactorial diseases often cluster in families, they do not show any distinct pattern of inheritance. It is difficult to study and treat multifactorial diseases because the specific factors associated with them have not yet been identified. The multifactorial threshold model assumes that gene defects for multifactorial traits are normally distributed within populations. Firstly, different populations might have different thresholds; this is the case where occurrence of a particular disease differs between males and females (e.g. pyloric stenosis) - the distribution of susceptibility is the same, but the threshold differs. Secondly, the threshold may be the same but the distributions of susceptibility may differ; this explains the underlying risks present in first-degree relatives of affected individuals.

factors affecting success in genetic mapping studies

Detecting linkage: family size and the number of families are very important - every informative offspring of a heterozygous parent adds ~0.3 (log10 2) to the lod score, so the lod score climbs quickly as families get bigger. Reduced penetrance = individuals who carry the disease allele but are unaffected, which messes up your inferences about which markers are recombinant with the disease: if you don't know which individuals are non-penetrant, you cannot assign phase and can't work out for certain whether they are recombinant, so that information disappears from the lod score calculation. Lower penetrance = harder to reach the lod score threshold.

if looking for a disease causing mutation

Expect it to result in a change in the sequence of the gene that you would expect to be deleterious - so if it's a synonymous mutation that doesn't change the coding potential of the gene, you can reject it as probably not causing the disease. For a recessive disorder, expect 2 mutations in the same gene, or compound heterozygotes. Sanger sequencing confirmed the presence of DHODH mutations in three additional families with Miller syndrome. Allelic heterogeneity is the phenomenon in which different mutations at the same locus lead to the same or very similar phenotypes. These allelic variations can arise as a result of natural selection processes, exogenous mutagens, genetic drift, or genetic migration. Many of these mutations take the form of single nucleotide polymorphisms, in which a single nucleotide base is altered compared to a consensus sequence. They can also exist as copy number variants (CNVs), in which the number of copies of a gene or DNA sequence differs from that of the population.[1] Mutated alleles expressing allelic heterogeneity can be classified as adaptive or disadaptive. These mutations can occur in germ-line cells, somatic cells, or in the mitochondria; mutations in germ-line cells can be inherited, as can mitochondrial allelic mutations, which are inherited maternally. Typically, in the human genome, a small number of allele variants account for ~75% of the mutations found at a particular locus within a population; other variants are considered rare or exclusive to a single pedigree. Online Mendelian Inheritance in Man has a record of over 1000 genes and their associated allelic variants; these genes display allelic heterogeneity at their loci and are responsible for distinct disease phenotypes. Some of these diseases include alkaptonuria, albinism, achondroplasia, and phenylketonuria.[2][3] For example, β-thalassemia may be caused by several different mutations in the β-globin gene. Allelic heterogeneity should not be confused with locus heterogeneity, in which a mutation at a different gene causes a similar phenotype, nor with phenotypic heterogeneity, in which a mutation within the same gene causes a different phenotype. Other major diseases displaying allelic heterogeneity are the allelic mutations in the dystrophin gene that cause Duchenne dystrophy and the mutations in the CFTR gene known to cause cystic fibrosis.

1995

first bacterial genome - J.Craig Venter efficiency = 200,000

heterozygous variant

For a heterozygous variant, on average half the reads contain the reference allele and half contain the variant allele. If you sequence to only 4× depth, you might by chance see 4 reads of one allele and none of the other; at 30×, you expect to see both alleles a number of times. (See the sketch below.)
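A minimal sketch of why depth matters here: each read samples one of the two alleles with probability 0.5, so the chance that all N reads show the same single allele is 2 × 0.5^N:

```python
for depth in (4, 10, 30):
    p_one_allele_only = 2 * 0.5**depth
    print(f"depth {depth:2d}x: P(all reads show one allele) = {p_one_allele_only:.2g}")
# 4x: 0.12   10x: 0.002   30x: 1.9e-09
```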

Continuous multifactorial or polygenic traits

Height, weight (BMI), adiposity, limb length, muscularity; performance • running speed; physiological parameters • blood pressure • response to training; skin, hair, eye colour. Eye colour - 3 alleles of OCA2 explain ~70% of the variance in iris colour, plus other genes. Hair colour - several common alleles of MC1R contribute the majority of red-hair predisposition, but other genes contribute. Fingertip ridge count; personality traits - extroversion, novelty-seeking, neuroticism; intelligence/IQ/brain function modules.

Disease gene mapping - Haplotype analysis

how do we know whether these genotypes have been resolved into haplotypes? Convention: a genotype is written with a comma between the two alleles; if you leave out the comma you are implying that the alleles are phased, i.e. that these are haplotypes. The lod score formalises this inference and quantifies linkage as a number.

2001

human genome sequenced - Sulston, Venter, Collins efficiency = 50,000,000

multiple copy sequence

if we see a region with a higher read depth than average, that suggests the sequence has somehow been increased in copy number (multiple copy sequence). What we don't know from this analysis is whether the extra reads are actually present at the same genomic position or whether the sequence is duplicated somewhere else in the genome; simply re-aligning the fragments to the reference genome won't resolve that question. It tells us there is increased copy number, but not whether it is a tandem duplication here or an insertion somewhere else. That information can only be derived from reads that span the flanking regions: reads running from the duplicated region into the flanking DNA generate new, unique junction sequences if there has been a tandem duplication here, or an insertion elsewhere in the genome.

example

if you add a second locus, there are more distinct values of the phenotype; the more loci, the closer you get to a smooth curve (an infinite number of genes gives a continuous distribution; see the simulation sketched below). Francis Galton, a cousin of Charles Darwin, was one of the first scientists to study multifactorial traits. Galton's major focus was the inheritance of traits, and he observed 'blending' characters: the average contribution of each ancestor to the total heritage of the offspring, now known as continuous variation. When a trait exhibiting continuous variation (e.g. human height) is plotted, the majority of the population is distributed around the mean. Galton's work contrasts with that of Gregor Mendel, who studied 'non-blending' traits and kept them in distinct categories; traits exhibiting discontinuous variation occur in two or more distinct forms in a population, as Mendel found with petal colour.
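A toy illustration of the same point (locus number, allele frequencies and sample size are arbitrary choices, not from the notes): summing independent allele 'doses' across more loci yields more phenotype classes and an increasingly smooth distribution:

```python
import random
from collections import Counter

random.seed(0)

def phenotype(n_loci):
    # each biallelic locus contributes 0, 1 or 2 trait-increasing alleles
    return sum(random.randint(0, 1) + random.randint(0, 1) for _ in range(n_loci))

for n_loci in (1, 2, 10):
    counts = Counter(phenotype(n_loci) for _ in range(10_000))
    print(f"{n_loci} locus/loci -> {len(counts)} distinct phenotype values observed")
# 1 locus gives 3 classes, 2 loci give 5, 10 loci give up to 21; binned finely
# enough, this approaches Galton's continuous, mean-centred distribution
```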

what's the distribution of minor allele frequencies in the population?

ignore the blue and orange bars: the green bar (exome sequencing project) is very high for variant sites with a low minor allele frequency, i.e. rare variants (below 5%); close to 100% of variants fall into that category, and few fall into the 5-50% categories. Most variants are rare. Why is the distribution that way, with most variants very rare and few very common? Natural selection plays a part (though it is not the biggest factor): common variants (SNPs) cannot have been exposed to strong negative selection, which is why they are still here, whereas rare variants can be deleterious or neutral with little selection yet acting on them. So why are most variants today rare? Bear in mind the intermingling of populations since the last bottleneck: very recently, the human species exploded in population size. For most of our evolutionary history there were about 10,000 breeding humans in any one generation; the bottlenecks were when we dropped to fewer than that, and all humans since descend from them. Bottlenecks remove existing rare and common variation, which is why humans are less variable than chimpanzees (we went through more bottlenecks). But the excess of rare variants is not due to the bottlenecks themselves; it is what happened after the last one: the human population has gone from thousands of individuals to billions, and every new mutation since then starts out rare. The more people you sample, the more variant sites you will see: you are sampling the variation space.

read depth

read depth gives insights into larger variants: for a heterozygous deletion in a region we would expect roughly half the normal read depth; for a homozygous deletion, no reads mapping to that region at all. So the absence of alignment is also a clue that those sequences don't exist in the particular individual from whom we obtained the DNA fragments. (A simple depth-ratio classifier is sketched below.)
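A minimal sketch of a depth-ratio classifier (the thresholds are assumed round numbers, not calibrated values):

```python
def classify_copy_number(region_depth, genome_mean_depth):
    """Classify a region by its depth relative to the genome-wide mean."""
    ratio = region_depth / genome_mean_depth
    if ratio < 0.25:
        return "homozygous deletion (0 copies): no reads map here"
    if ratio < 0.75:
        return "heterozygous deletion (1 copy): about half the expected depth"
    if ratio < 1.25:
        return "normal (2 copies)"
    # extra depth shows increased copy number, but not WHERE the copies sit
    return "increased copy number (tandem duplication or insertion elsewhere)"

mean_depth = 30
for observed in (1, 16, 29, 47):
    print(observed, "->", classify_copy_number(observed, mean_depth))
```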

1970

lambda cohesive ends sequenced - Ray Wu sequencing efficiency = 15 first demonstration that it was possible to determine a DNA sequence: the naturally single-stranded cohesive ends were read using enzymatic methods, making it possible to sequence DNA at ~15 bases.

detection of variants - SNPs

align the reads and compare to the reference genome: where the base (yellow) differs from the reference, the genotype at that position is homozygous for the variant, assuming a high read depth such as 50x, at which we can be confident in the assignment. At intermediate read depth, say the position is covered 5 or 6 times and all reads show one allele, we would be reasonably confident the site is homozygous but not certain: if the person were truly heterozygous, the chance of drawing all six reads from one allele is essentially (1/2)^6, which is a low number, but remember that across a whole genome we are looking at millions of variants, so even a low per-site probability adds up. That is why we need a read depth of at least ~30x across the entire genome to be confident of accurate genotypes at all polymorphisms, i.e. to be sure we can truly distinguish heterozygotes from homozygotes. (See the arithmetic sketched below.)
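The arithmetic behind that depth recommendation, sketched with an assumed round figure of ~3 million heterozygous sites per genome:

```python
def p_all_reads_one_allele(depth):
    # at a true het site, P(all reads show the same allele) = 2 * 0.5**depth
    return 0.5 ** (depth - 1)

N_HET_SITES = 3_000_000   # assumed round number for illustration

for depth in (6, 30):
    p = p_all_reads_one_allele(depth)
    print(f"depth {depth}: P(het miscalled as hom) = {p:.2e}, "
          f"expected miscalls genome-wide = {p * N_HET_SITES:,.3f}")
# depth 6 leaves on the order of 10^5 miscalled sites; depth 30 leaves ~0
```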

Mutation detection

mutation screening methodologies - the ideal approach will be cheap, efficient/high-throughput, and sensitive. e.g. • CSCE (conformation-sensitive capillary electrophoresis) - relatively cheap and very efficient; until about 3 years ago this was the workhorse for high-throughput screening.

Mutations may have a variety of consequences on (overall) phenotype = size of effect on phenotype

• no effect on phenotype - neutral variants = most mutations are in this category (zero penetrance)
• a v. small effect on phenotype - can be beneficial or detrimental = these may have evolutionary consequences (lactase variant example)
• a moderate effect on phenotype = these account for many familial traits and mild disorders; occasionally of evolutionary significance
• a large effect on phenotype = these account for severe disorders and lethality

1919

nucleotide structure determined - Phoebus Levene

if you sequence one person's genome, what's the distribution like?

orange - rare variants (low minor allele frequency, MAF); yellow - uncommon variants (1-5% MAF); the majority of variants in one genome are common SNPs (5% MAF or more). A variant is anything different from the reference genome sequence, but in the population most variants are rare! If you sample one person's DNA you will not find many rare variants in it; you have to sample lots of people before you find the rare variants present in the population.

1986

partial automation of DNA sequencing - Leroy Hood efficiency = 50,000

NGS

NGS potentially offers a shortcut that obviates all of the mapping and linkage analysis discussed previously: we can now jump much more quickly to identifying disease-causing mutations by obtaining whole-genome or whole-exome sequences from a small number of individuals. The first example of this technology being applied to identify the cause of a Mendelian, inherited single-gene disorder was Miller syndrome, where exome sequencing identified the disease-causing gene without any of the traditional mapping and linkage analysis that had been the standard route until these technologies were implemented. The study: • sequenced four affected individuals in three independent kindreds (families) • sequenced coding regions to a mean coverage of 40x • called 97% of variants • filtered against public SNP databases and eight HapMap exomes • identified a single candidate gene, DHODH. Typically, if you sequence an exome and compare it to the reference genome, you end up with about 70,000-80,000 differences. Compare those variants to databases of known polymorphisms: for a dominant disorder, anything present in the general population can likely be excluded as the causative mutation; for a recessive disease, you might expect to see the variant occasionally in the general population in heterozygous form, but if it occurs in homozygous form you can reject it as the likely causative mutation. And if the disease is very rare, you probably would not expect to see any copies of the variant even in heterozygous form in the general population, unless the database has a very large sample size, which many databases now do. (A hedged sketch of this filtering logic follows.)
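A hedged sketch of that filtering logic; the data structures, variant IDs and the second gene name are invented for illustration, and this is not the study's actual pipeline:

```python
def filter_candidates(patient_variants, population_db, inheritance):
    """patient_variants: {gene: [(variant_id, 'het' or 'hom'), ...]}
    population_db: {variant_id: set of genotypes seen in the population}"""
    candidates = {}
    for gene, calls in patient_variants.items():
        hits, surviving = 0, []
        for var_id, genotype in calls:
            seen = population_db.get(var_id, set())
            if inheritance == "dominant" and seen:
                continue          # present in the general population: exclude
            if inheritance == "recessive" and "hom" in seen:
                continue          # seen homozygous in healthy people: exclude
            surviving.append(var_id)
            hits += 2 if genotype == "hom" else 1
        # recessive model needs two hits in one gene (hom or compound het)
        if hits >= (2 if inheritance == "recessive" else 1):
            candidates[gene] = surviving
    return candidates

patient = {"DHODH": [("v1", "het"), ("v2", "het")],   # compound het survives
           "GENE2": [("v3", "hom")]}                  # hypothetical gene
db = {"v3": {"het", "hom"}}                           # v3 seen hom in population
print(filter_candidates(patient, db, "recessive"))    # {'DHODH': ['v1', 'v2']}
```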

which sites in the genome do we never see a mutation or variant for?

sites where mutations are lethal in early development: anything that stops gametes fertilising, or that derails development so early that a baby is never born, means a sample will never be examined. Also repeat-rich regions ('nasty repeats') that are hard to assay.

ways to address that issue - paired end sequencing

still using Illumina short-read technology, fragments can be read from both ends to increase the effective read length: you get a forward read of one strand of the DNA molecule as well as a reverse read of the other strand. For instance, with 150 bp reads and a very small fragment, the forward and reverse reads cover the same bases, giving increased confidence that the sequence of that fragment is correct (removing sequencing errors). If larger fragments are sequenced, so the insert being read is slightly longer, there is an overlap in the middle but also unique sequence on the forward and reverse reads, so overall you get a sequence essentially the size of the two reads added together, which increases the effective read length. Larger fragments still (e.g. 2 kb) can be sequenced from only their ends (150 bp each side). We can make libraries using either short or long fragments; if the fragments are size-purified before sequencing, we know how far apart the two reads should be. That gives us a way of potentially spanning larger distances and getting more physical information about the relative location of sequences. (See the sketch below.)
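A minimal sketch of how paired reads carry distance information, assuming a size-selected ~2 kb library (the size and tolerance values are illustrative):

```python
LIBRARY_SIZE = 2_000   # expected fragment length after size purification
TOLERANCE = 500        # assumed acceptable spread around the library size

def implied_insert_size(fwd_start, rev_end):
    # reference positions of the pair's outermost bases imply fragment length
    return rev_end - fwd_start

def pair_is_concordant(fwd_start, rev_end):
    return abs(implied_insert_size(fwd_start, rev_end) - LIBRARY_SIZE) <= TOLERANCE

print(implied_insert_size(10_000, 12_050))     # 2050: matches the library
print(pair_is_concordant(10_000, 12_050))      # True
print(pair_is_concordant(10_000, 19_000))      # False: pair maps too far apart,
                                               # hinting at a structural variant
```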

1953

structure of DNA determined - Watson, Crick, Franklin, Wilkins; X-ray crystallography; hydrogen bonding holds the paired bases together

assembly and repeats

we can also have regions of the genome with complicated repeat structures: unique sequences (blue, yellow, green) separated by a repeated sequence (red) that exists in 3 copies scattered throughout the genome. Using short reads we can reassemble these by aligning across the coloured junctions; that is one way the data could be assembled, but another arrangement would be equally parsimonious, so we cannot distinguish between the two possibilities using short-read data alone. The longer the repeated sequence, the more difficult it is to resolve using short reads. (The toy example below makes the ambiguity concrete.)
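A toy demonstration with made-up sequences: two different orderings of the unique blocks around a repeat present in 3 copies generate exactly the same set of short reads, so short reads alone cannot choose between the assemblies:

```python
def all_reads(genome, k):
    # every possible read (substring) of length k, as a sorted multiset
    return sorted(genome[i:i + k] for i in range(len(genome) - k + 1))

A, B, C, D = "ATGGCATTCAGA", "CCTAGGTTACAT", "GATTACAGATTC", "TGCACCGGTAAT"
R = "CGCGCGCGCGCGCGC"          # 15 bp repeat, longer than a read

genome1 = A + R + B + R + C + R + D
genome2 = A + R + C + R + B + R + D   # B and C swapped between repeat copies

k = 10                                # reads shorter than the repeat
print(all_reads(genome1, k) == all_reads(genome2, k))   # True: indistinguishable
```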

association

we can ask this question in two ways. For discontinuous phenotypes: ask whether a gene variant is found more often in cases with the disorder/trait than expected from population frequencies, i.e. test f(X in cases) > f(X in controls) in a case-control study using a statistical test. All we have to do then is pick a genetic marker and test it for association with the phenotype. (A minimal worked test is sketched below.)
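A minimal worked case-control test on a 2x2 allele-count table; the counts are invented for illustration:

```python
from scipy.stats import chi2_contingency

#                  allele X  other alleles
case_counts     = [180, 220]      # 400 case chromosomes
control_counts  = [120, 280]      # 400 control chromosomes

chi2, p, dof, expected = chi2_contingency([case_counts, control_counts])
f_cases, f_controls = 180 / 400, 120 / 400
print(f"f(X|cases) = {f_cases:.2f} vs f(X|controls) = {f_controls:.2f}; "
      f"chi2 = {chi2:.1f}, p = {p:.1e}")
# a small p-value means allele X is associated with the phenotype
```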

shorter reads

we might have two variants relatively close together, within a hundred bases of each other in the same gene, potentially even the same exon. If reads are very short, no read will span both variants (which they won't if the variants are ~100 bp apart and the reads are shorter than that). Then although we can determine the genotypes, we cannot determine the haplotypes: we know the individual is heterozygous at both sites, but we don't know whether the two variant alleles are on the same chromosome, with the other chromosome carrying both reference alleles, or whether the variants are on opposite chromosomes. Essentially, short reads give us the genotypes but not the phase between two physically close variants. (A sketch of read-backed phasing follows.)
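A sketch of read-backed phasing with toy reads: only reads covering BOTH heterozygous positions reveal which alleles sit together on one chromosome. The positions and sequences are invented:

```python
def phase_two_sites(reads, pos1, pos2):
    """reads: list of (start, sequence); returns allele pairings seen in cis."""
    pairings = set()
    for start, seq in reads:
        if start <= pos1 and pos2 < start + len(seq):   # read spans both sites
            pairings.add((seq[pos1 - start], seq[pos2 - start]))
    return pairings

reads = [
    (3, "ATAAAAAGAA"),   # spans both: T at pos 4 with G at pos 10 (cis)
    (2, "AAAAAAAAAA"),   # spans both: reference A at both sites
    (0, "AAAATAAAAA"),   # covers only pos 4, contributes no phase information
]
print(phase_two_sites(reads, 4, 10))   # {('T', 'G'), ('A', 'A')}
```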

germline variant, somatic variant or sequencing error?

we might see a variant in only one read (or a couple of reads) against a background of reads matching the reference. What we don't know is whether that variant is heterozygous: at 30-40x we would not expect that scenario for a true heterozygote, we would expect to see the variant multiple times. Potential explanations: it is a sequencing error (possible, since there is always a background frequency of sequencing errors), or it is a somatic variant: very likely it was not inherited but occurred during the lifetime of the individual, so it is not present 50:50 in the genome but at some lower frequency, carried by only a subset of cells. (See the sketch after this card.)
Next-generation sequencing (NGS) read length refers to the number of base pairs (bp) sequenced from a DNA fragment. After sequencing, the regions of overlap between reads are used to assemble and align the reads to a reference genome, reconstructing the full DNA sequence. Sequencing read lengths correspond directly to the sequencing reagents used on an NGS instrument: more chemistry cycles generate longer reads. Choosing the right sequencing read length depends on your sample type, application, and coverage requirements. Because long reads allow for more sequence overlap, they are useful for de novo assembly and for resolving repetitive areas of the genome with greater confidence. For other applications, such as expression profiling or counting studies, shorter reads are sufficient and more cost-effective.
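A hedged sketch of asking whether a low-fraction variant could be mere sequencing error; the per-base error rate is an assumed, typical short-read figure:

```python
from scipy.stats import binomtest

depth, variant_reads = 40, 4     # variant seen in 4 of 40 reads (10%)
ERROR_RATE = 0.001               # assumed per-base sequencing error rate

# H0: all 4 variant-supporting reads are sequencing errors
p_error = binomtest(variant_reads, depth, ERROR_RATE,
                    alternative="greater").pvalue
print(f"P(>=4 error reads if errors only) = {p_error:.1e}")
# a germline het would sit near 50% of reads, far above 10%; if error is also
# rejected, a subclonal somatic variant is the remaining explanation
```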

Genotyping SNPs - RFLPs and PCR-RFLPs

works only if the SNP alters a restriction site.... PCR-RFLP: PCR amplification followed by restriction digestion. RFLPs the 'old way': restriction digestion followed by Southern hybridisation.

Genetic disorders and traits- features of inheritance and examples

• 3 basic 'types' of genetics: • single gene disorders and traits • polygenic disorders and traits • multifactorial disorders and traits • environment also plays a role

Linkage Mapping - Genome-wide Scan for Linkage

• At the end of this process, we're left with... a linked marker, or flanking markers defined by obligate recombinations
• What's next - can we do more to refine the position of the disease gene? Carry out 'fine mapping' by saturating the region: choose all available markers from the densest linkage maps (>10,000 microsatellite markers available; or choose from >20 million common SNPs); choose markers within or flanking known genes; improve the 'exclusion map' by defining ever-closer recombination events with new markers
• OUTCOME - a 'critical region' of 0.5-5 cM (if you're lucky!) flanked by recombinant markers defines the region within which the disease gene must lie, and narrows down the gene set
The critical region is the stretch of chromosome in the genetic map within which the disease gene must lie, because the disease is non-recombinant with every marker in that region. The first marker on the left that is recombinant cannot be next to the disease gene, and everything to its left is excluded; likewise, the first marker on the right that is recombinant with the disease cannot be where the disease locus is, and everything to its right is excluded. The fully accurate statement is that the disease-causing mutation must lie between the named marker on the left and the named marker on the right: not at either marker, but anywhere between them.

Linkage Mapping - Genome-wide Scan for Linkage

• Genome scan involves: genotyping the marker set in affected and unaffected individuals in each family (why bother typing the unaffecteds...?), then calculating linkage between the disease locus and the markers
• linkage analysis - start with two-point linkage calculations: given a genetic map, ask whether the disease locus is linked to each marker in turn, one at a time; this is the first step in linkage analysis
- disease vs each marker in turn: are they linked? You work out the recombination fraction θ and use it to generate the 'lod score', the logarithm of the odds in favour of linkage
• the lod score is inversely related to the number of recombinations observed between disease and marker, and increases with sample size: the more recombinants there are between disease locus and marker, the lower the lod score
- higher lod score - few recombinants - tighter linkage
- lower lod score - many recombinants - less linkage
- significant evidence for linkage: lod > 3.0
- more meiotic transmissions = more data: each additional informative offspring adds ~0.3 to the potential lod score (see the sketch below)
• a lod score of 0 means 50% recombination: no evidence for or against linkage (the maximised lod score cannot be negative, since θ = 0.5 always gives 0)
- the null hypothesis is Mendel's 2nd law: random assortment of alleles at two loci, i.e. no linkage
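A minimal two-point lod score sketch for fully informative meioses, using Z(θ) = log10[ θ^R (1-θ)^(N-R) / 0.5^N ] with N scored meioses and R recombinants:

```python
from math import log10

def lod(n_meioses, n_recomb, theta):
    # likelihood of the data under linkage at recombination fraction theta,
    # versus free recombination (theta = 0.5, Mendel's 2nd law)
    linked = theta ** n_recomb * (1 - theta) ** (n_meioses - n_recomb)
    free = 0.5 ** n_meioses
    return log10(linked / free)

print(round(lod(10, 0, 0.0), 2))   # 3.01: ten non-recombinants reach lod > 3,
                                   # each adding log10(2) ~ 0.3
print(round(lod(10, 5, 0.5), 2))   # 0.0: 50% recombination, no evidence either way
```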

Mutation detection by CSGE

• Heteroduplex analysis • example using DNA from a heterozygous individual: mispaired 'bubbles' in a heteroduplex retard migration in the gel/capillary. Take the PCR product and denature the strands, then let them reanneal in a mixture where both alleles (mutant and wild-type) are present. On reannealing you get homoduplexes where the strands have paired properly (A-T, G-C), but also heteroduplexes (A opposite C, G opposite T) containing a mispaired 'bubble'; run these on a gel. If the electrophoresis trace shows extra peaks suddenly appearing, that indicates a mutation is present. Nowadays this is done by NGS!

next generation sequencing technologies

• Illumina - read/fragment length small (~150 bp); challenging for sequencing large DNA variants (e.g. repetitive DNA elements)
• Roche 454 and ABI SOLiD - competed with each other but are falling away; Illumina is now the primary platform for huge large-scale sequencing
• Complete Genomics (Beijing) - own sequencing platform
• Ion Torrent - cheap, small machines, but not as good
• PacBio and Oxford Nanopore - the primary long-read technologies; longer sequencing reads, though not yet capable of sequencing DNA as cheaply

Prioritising genes for mutation screening expts.

• Imagine you are interested in a particular genomic region, e.g. a critical region from a linkage study • how do you find out what genes are present in that region? There might be dozens of genes. Once you've fine-mapped and pulled in the edges of the critical region, that's as far as you'll get: no marker in between is recombinant, so you can't use linkage or recombination to get any closer to the disease gene; everything else has to be done at the molecular level, assaying to find the mutation • this used to be a big job • nowadays, use HGP data and genome browsers....

questions

• Once genetic mapping has shown us where it is, how can we identify the causative gene for a single gene disorder? • Prioritising genes within a candidate region • Mutation detection

Types of genetic marker - and interpreting genotyping data...

• SNP (single nucleotide polymorphism) some SNPs alter a restriction site, so are also examples of... RFLP (restriction fragment length polymorphism)

Genetic architecture of complex traits/disorders

• each gene that varies contributes a certain amount to the 'genetic variance' in a trait between individuals
• some variants/genes have larger effects than others for each trait/disorder
• the effect size of each variant in each gene, and the allele frequencies of those variants, determine the 'genetic architecture' - i.e. how much each gene contributes to the trait/disease
Each row represents the genetic variants that contribute to risk in one individual. The top individual has lots of variants that each contribute a little risk plus one variant with a moderate effect size that contributes more risk (the bigger arrow); together, the many tiny effects and the moderate effect push them over the threshold, which is why they are affected. The third individual has mostly tiny effects plus a moderate-effect gene and a big-effect gene; that is what pushed them over the threshold into the disease state. Everyone has a mixture of variants that contribute tiny, moderate or big effects on risk for any one disease, and it is that effect-size distribution that we call the genetic architecture. For a single-gene disorder, one gene contributes all the risk and the risk from all other variants in the genome is zero. With a modifier gene, one gene contributes a lot of risk, another gene contributes a bit, and together they explain everything while everything else explains nothing. For a polygenic trait, all effects are tiny and there are no big-effect genes. These are the extremes of genetic architecture, framed in terms of penetrance: higher penetrance = larger effect size = more of the phenotype explained by that gene; low-penetrance variants each explain very little but together add up to a lot.
• N.B. 'Effect size' is a GENETIC explanation and relates ONLY to % phenotypic variance explained - it has nothing to do with how much change there is in the function of the gene product (see earlier slide...)

Genetic architecture of complex traits/disorders

• each gene that varies contributes a certain amount to the 'genetic variance' in a trait between individuals • some variants/genes have larger effects than others for each trait/disorder • effect size for each variant in each gene, and allele frequencies of those variants, determine how much each gene contributes to trait/disease in the population (% variance explained and/or proportion of heritability explained) • all MF traits/disorders therefore have a 'genetic architecture' - see e.g. Todd 2010 T1D review • any genomic location harbouring a polymorphism that influences a quantitative trait is known as a QTL - a quantitative trait locus

Genetic architecture - where are all the functional variants?

• effect size for each gene, and allele frequencies, determine how much each gene contributes to trait/disease (% variance explained)
• high frequency alleles tend to have small effects, and low frequency alleles tend to have bigger effects (or is that just detection bias...??)
• what about the 'corners' of this graph? very few common variants with large effects exist; LOTS of rare variants with small effects probably exist, but such effects are hard to detect
• this governs what we've found so far in terms of genetic architecture for complex disorders/traits
• Height - 700 common variants explain 20% of the heritability (Wood et al. 2014); evidence for causal variants with tiny effect sizes but common frequencies (high MAF) comes from the GWAS approach, a statistical test of association; linkage can't be used to detect them

next generation sequencing challenges -

• reads are short (35 to 150 bp) • but there are millions of them. How to deal with short reads? Align them to each other (assemble) if it is a new genome or new species with no reference; for human genetics a reference sequence already exists, so sequence the short fragments, align them against the reference sequence, then infer the individual's sequence and compare it to the reference genome. An important consideration: what is the read depth at a given region? Read depth is the number of times any one base has been read in different sequence fragments (the total number of times it has been read), and it is important because it indicates the reliability of the sequencing: if a base were sequenced only once, you could not tell whether the individual is heterozygous at variant positions. Depth establishes the accuracy of any one sequence you obtain for an individual. (The standard coverage calculation is sketched below.)
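The standard back-of-envelope coverage calculation (Lander-Waterman), with illustrative numbers for a ~30x human genome:

```python
from math import exp

def mean_depth(n_reads, read_len, genome_size):
    # expected depth C = N * L / G
    return n_reads * read_len / genome_size

C = mean_depth(n_reads=600_000_000, read_len=150, genome_size=3_000_000_000)
# under a Poisson model, a given base is left with zero reads with prob e^-C
print(f"mean depth = {C:.0f}x, P(base never sequenced) = {exp(-C):.1e}")
```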

• where are the polymorphic sites in the genome...?

• see the 1000GP variant distribution track in Ensembl - a snapshot of the Ensembl gene view for a particular gene. VARIANTS ARE EVERYWHERE!!! (though somewhat clustered)

Genetic architecture of complex traits/disorders

• some variants/genes have larger effects than others for each trait/disorder
• effect size for each variant in each gene, and allele frequencies of those variants, determine how much each gene contributes to trait/disease in the population (% variance explained and/or proportion of heritability explained); high effect size alleles contribute more than low effect size alleles
• effect size relates to penetrance; high frequency causative alleles contribute more than low frequency alleles
• e.g. distribution of effect sizes for the more strongly genetic cancers - see Fig.: the common variants explain more cancer than the rare variants

