Genomics Year 2
Where do new genes come from?
-gene duplication: a duplicate is usually only kept if the increased gene dosage is beneficial or if further mutations give one copy a new function -gene fusion -gene fission. New genes are generated while others are lost, so the overall amount of information in the genome remains stable. A comparison of 12 Drosophila genomes showed that 40% of gene families vary in size, with 17 duplicated genes reaching fixation per million years. Gene turnover is therefore high, but gene number stays roughly stable because genes are lost as well as gained.
Why sequence genomes?
-help understanding of biochemical activities of a cell and the ways in which they are controlled (particularly by regulatory regions) -identify and characterise important inheritable disease genes or useful functional genes -geographical origin can be calculated for disease tracking and prevention -understand relationships between organisms and how they evolve.
Use of Y chromosome to study human migrations
-largest segment of non-recombining DNA in genome -Passed only through the paternal line.
Hierarchical shotgun sequencing
-library of DNA segments created using bacterial artificial chromosomes (BACs, plasmid-based vectors used to clone the DNA) -all BACs screened for markers and classified by location -set of minimally overlapping BACs selected for sequencing -Sanger sequencing -fragments assembled by their overlapping segments
What caused high mortality of the Black Death?
-malnutrition and host susceptibility -may have been pneumonic plague, as person-to-person transmission had a 90-95% fatality rate -no treatments available -other pathogens/co-infections
Use of mitochondrial DNA to study human migrations
-mitochondrial DNA only passed from mother to offspring
Structure of a chromosome
2 coils of DNA wrapped around histone octamer proteins (makes up nucleosomes) and gathered onto a protein scaffold. 2 sister chromatids attached by a centromere.
Issues with cDNA and ESTs
-5' ends in particular can be erroneous -mRNA from genes expressed at a very low level is hard to clone -mRNA from very large genes can be impossible to clone -high-quality full-length cDNA libraries are therefore crucial
Human variation
99.9% identical, 1 base different every 1kb.
Do mRNA and protein abundance correlate?
Little relationship between the two, due to regulation of translation and protein degradation. However, the ratio between mRNA and protein per gene is highly conserved across tissues. If the mRNA-to-protein ratio is known for a given tissue, it is possible to predict proteome profiles for other tissues from mRNA alone.
Manual vs automated Sanger sequencing
Manual: 4 reaction vessels, one per ddNTP, radiolabelled (with 33P or 35S). A lower ddNTP concentration gives longer reads and a higher concentration gives shorter reads. 400-600bp sequences. Needs four lanes for gel electrophoresis, which is then autoradiographed. Automated: each ddNTP has a different fluorescent tag, so only one reaction vessel and a single lane are required. A laser scans the DNA for the characteristic wavelengths as the gel is run. 500-1200bp sequences.
Paired-end sequencing
Sequence generated from both ends of a DNA clone; provides evidence of physical linkage of the two paired sequences. Forward and reverse primer on either side of the DNA fragment
post-translational modification
Shows extent of protein diversity, abundance and properties. -addition of chemical groups -amino acid modification -cleavage -addition of polypeptides
Barriers to recombination
-adaptive: selection against hybrids due to less successful mating, predators -mechanistic barriers e.g chromosomal incompatibility so sterile offspring -ecological barriers: consequence of physical separation of populations.
What is the FASTQ File format?
1. @read ID/Name 2. DNA sequence 3. +sign 4. Quality scores
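The four-line layout above can be read with a few lines of Python. This is a minimal sketch, not a full parser: the names `FastqRecord` and `parse_fastq` are hypothetical, it assumes every record occupies exactly four lines, and it assumes the quality line uses the common Phred+33 ASCII encoding (the notes do not say which encoding is meant).

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    qualities: list[int]  # one quality score per base


def parse_fastq(text: str, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text laid out as four lines per read."""
    lines = [line.strip() for line in text.strip().splitlines()]
    for i in range(0, len(lines), 4):
        # Raises ValueError if the final record has fewer than four lines.
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Assumed Phred+33: score = ASCII code of the character minus 33.
        yield FastqRecord(header[1:], seq, [ord(c) - offset for c in qual])


example = "@read1\nGATTACA\n+\nIIII##!\n"
for record in parse_fastq(example):
    print(record.read_id, record.sequence, record.qualities)
```

In practice an established library would normally be used instead of a hand-written parser; the sketch only shows how the four lines relate to each other.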
genetic mechanisms of bacterial adaptation
1. DNA replication errors causing point mutations, rearrangements or deletions. In coding regions these can produce a novel protein with a new function. Indels (insertions or deletions) can cause loss of function and are associated with reductive evolution. Mutations in promoter (intergenic) regions can change gene expression by activation or inactivation. 2. horizontal gene transfer, where genetic material is acquired from an external source and incorporated into the genome by recombination
Theoretical approaches to explain patterns of molecular evolution in bacteria
1. Neutral diversification: most of the genetic variation can be explained by genetic drift. Completely neutral: lineages colonise all hosts and expand to occupy niches. Neutral with transmission barriers: strains colonise hosts differentially, giving rise to multi-host or single-host lineages. 2. Ecotypes: highlights selection for adapted lineages in a given environment. An adapted trait is acquired, causing a selective sweep.
How do you construct a DNA library?
1. input DNA 2. Fragment DNA 3. Ligate Adapter 4. Amplify with PCR 5. Bioanalyse
What is the genome made of?
-protein-coding genes: 1.5% -introns: 25.9% -a large percentage of the rest is repetitive sequence
Measuring microbial diversity
16S rRNA (binds to the Shine-Dalgarno sequence) of the small 30S subunit of the prokaryotic ribosome. Slow rates of evolution in this region mean it is used to reconstruct phylogenies.
Repetitive elements in the genome
2 main classes: 1. Short tandem repeats: unrelated to transposable elements. At the subtelomeres (mini satellite 9-64bp) and pericentromeres (satellite 5-171bp). Microsatellites 1-13bp are dispersed across chromosome. 2. Interspersed repeats: include transposable elements and related sequences
Yersinia pestis (plague): paleomicrobiology
3 major human pandemics, each with a different biovar (biogroup): Justinian (6th-8th century) = Antiqua; Black Death (14th-19th century) = Medievalis, reaching the UK in 1348 and killing 50% of the population; Modern (19th century onwards) = Orientalis. Bubonic: acquired by humans via flea bites, spreads in the lymphatic system. Overwhelming infection leads to multi-organ failure. Key plasmid is pPCP1, which has a high copy number. Pneumonic: enters through the mouth, nose or eyes and causes pneumonia in the lungs, almost 100% mortality. Both extremely contagious. Y. pestis was sequenced from skeletons dated 1348-49 using NGS. The Black Death genome was found to be almost identical to that of Y. pestis causing bubonic plague today, so the perceived increased virulence during the Black Death may not have been due to the bacterial genotype.
retrogene
A DNA gene copied back from RNA by reverse transcription
linkage disequilibrium (LD)
A non-random association of genes/markers that are inherited together as a unit because of their close proximity on the chromosome, in violation of Mendelian laws. LD is the connection between nearby markers in populations.
Epistasis
A type of gene interaction in which one gene alters the phenotypic effects of another gene that is independently inherited.
Microbiomes
Animals comprise their own cells plus resident microbiota (microbial cells estimated at about 10 times the number of human cells). The microbiota has critical functions in human physiology, e.g. immune system development and metabolic activities.
Mosaic Genes
DNA ancestry in two or more different populations.
Introns and their function
Elements which regulate gene transcription, regulate alternative splicing.
Metagenomics: reference based and reference free
Reference based: fragment DNA and map reads to a reference genome. Coverage can be used to estimate relative abundance. Reference free: shotgun sequencing followed by assembly of contigs, then binning based on the presence of known genes (only needed once); contigs with the same proportions are likely to come from the same species. It is hard to discriminate strains, however.
Which was the first complete genome sequence of a free living organism?
Haemophilus influenzae, 1995
haplotype blocks
Large segments or blocks of DNA containing numerous SNPs in linkage disequilibrium (inherited together as a set)
What is homoplasy?
a character shared by a set of species but not present in their common ancestor
What is comparative genomics?
addressing biological questions through the comparison of genomes from different species allows us to: -study evolutionary changes within a defined taxa and the rates of change -which genes have played role in evolution of specific traits, such as life history -identify previously non-annotated genes and exons and assign function -identify functional areas of non-coding DNA (highly conserved areas likely to have function) -identify genes important for key variations in traits -identify genes underlying disease
Coverage
After sequencing, there will be many overlapping sequencing reads that cover the genome multiple times. A genome sequenced with enough reads to cover it X times has a coverage of X. AKA read depth. Also used to mean the number of reads that cover an individual base, e.g. if three reads overlap at a base, that base has 3x coverage.
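Both meanings of coverage can be worked through numerically. The sketch below assumes all reads have the same length and that read start positions are already known (for example from mapping); the function names are hypothetical.

```python
def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Genome-wide coverage: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def per_base_depth(read_starts: list[int], read_length: int,
                   genome_size: int) -> list[int]:
    """Depth at each position: how many reads overlap that base."""
    depth = [0] * genome_size
    for start in read_starts:
        # Reads running past the end of the genome are clipped.
        for pos in range(start, min(start + read_length, genome_size)):
            depth[pos] += 1
    return depth


# Toy genome of 10 bases covered by three 4-base reads.
depth = per_base_depth([0, 2, 3], read_length=4, genome_size=10)
print(depth)                    # position 3 is overlapped by all three reads
print(mean_coverage(3, 4, 10))  # equals sum(depth) / genome size here
```

The per-base depth is uneven even though the mean is a single number, which is why both quantities get reported.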
Obesity, Type II diabetes and the human microbiome
altered gut microbiota compared to 'healthy' controls. signs of mucosal inflammation increased gut permeability- endotoxaemia.
Genome Wide Association Studies (GWAS)
an observational study of a genome-wide set of genetic variants in different individuals to see if any variant is associated with a trait. Usually 500k-4m SNPs. Variants associated with the disease, or within the same haplotype (close together on chromosome) as a variant associated with a disease, will be found at a higher frequency in cases than controls GWAS have the potential to allow clinical advances such as prevention, therapeutic targets and also develop personalised medicine and help diagnostics, prognosis.
Single molecule sequencing
approach to determining the sequence of a polymer using only a single molecule. Using Oxford nanopore device.
Difference between assembly and mapping
assembly uses raw read data to reconstruct the genome de novo. Mapping aligns reads to a reference genome, then variants between reads and reference are called at each base. paired reverse read aligned approximately insert (fragment) size downstream from forward.
Enteropathogenic
bacterium causing intestinal tract infections.
Bottom-up and top-down approaches to microbiology research
bottom-up- start with DNA sequence and test the effect on the phenotype Top-down: starts with the phenotype and associates it with particular genomic elements.
S. aureus
causes variety of diseases in poultry and other avian species. -plasmid mediated
What causes phenotype variation?
changes in protein products and gene regulation caused by: -mutations in DNA -environmental factors -epigenetics
Gene interaction networks
co-expression networks -compare the expression of every possible gene pair in as many different conditions or time points -quantify similarity in expression using correlation coefficient -set threshold to consider any two genes co expressed or negatively co expressed.
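The three steps above (compare every gene pair, quantify similarity with a correlation coefficient, apply a threshold) can be sketched as follows. The gene names, expression values and the 0.9 threshold are made up for illustration, and Pearson correlation is assumed as the coefficient; the notes do not specify which one is used.

```python
from itertools import combinations
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length expression profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_edges(expression: dict[str, list[float]],
                       threshold: float = 0.9) -> list[tuple[str, str, str]]:
    """Return one labelled edge per gene pair that passes the threshold."""
    edges = []
    for gene1, gene2 in combinations(sorted(expression), 2):
        r = pearson(expression[gene1], expression[gene2])
        if r >= threshold:
            edges.append((gene1, gene2, "co-expressed"))
        elif r <= -threshold:
            edges.append((gene1, gene2, "negatively co-expressed"))
    return edges


# Hypothetical expression levels across four conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 3.0, 2.0, 3.0],
}
for edge in coexpression_edges(expression):
    print(edge)
```

With real data the number of pairs grows quadratically with the number of genes, so the choice of threshold largely determines how dense the network is.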
Metagenome
collection of sequences taken directly from the environment
pan genome
core genome: genes shared by all strains Accessory genes: shared by some but not all strain specific genes closed genome=no more novel genes above a certain number of genome sequences. The more strains sequenced, the less open the pan-genome will become.
Cag PAI
Cytotoxin-associated gene A (cagA) is on the 40kb cag PAI, which encodes a type 4 secretion system used to 'inject' CagA into a target host cell. CagA then localises to the inner surface of the cell membrane, becomes phosphorylated and binds the SHP-2 tyrosine phosphatase, activating it and changing the host cell to the 'hummingbird phenotype', which is more motile. This phenotype may participate in many aspects of cancer, such as metastasis.
coalescent theory
developed to study gene-genealogical relationships by tracing the ancestry of gene copies in populations. Different parts of the genome will coalesce (join relative's genome) at different points in time. Due to stochastic processes or selective pressures acting on maintaining or reducing variation at particular loci
Helicobacter pylori
discovered bacterium in 1982 from patients with upper gastrointestinal disease, results in peptic ulcers. acquired from oral ingestion of bacterium. Transmitted within families in early childhood. prevalence highest in developing countries 80% in middle aged adults.
Identifying adaptation by measuring rates of amino acid replacement
Compare the frequency of substitutions at synonymous sites, dS (presumed neutral), with that at non-synonymous sites, dN (result in amino acid replacement and may be subject to selection). dN/dS <1 is associated with negative or purifying selection, suppressing protein changes. It dominates over long evolutionary timescales as it removes deleterious mutations. dN/dS >1 is associated with positive selection, promoting changes in the protein sequence, and may be associated with antimicrobial resistance. Limitations: selection operates on other features such as GC skew and codon usage, which may not affect dN/dS. Complex adaptations involving multiple genes might not be detectable by analysing dN/dS. Frameshifts of start codons can lead to non-synonymous SNPs being interpreted as synonymous, leading to inaccurate dN/dS.
How can we tell if genomic islands are adaptive and confer virulence?
functional homology: present in pathogenic strains but absent in closely-related species. known as pathogenicity islands PAIs.
BMP4 gene
gene involved in the increased depth and breadth of finches' beaks.
What is the C21orf34 gene
lincRNA gene with 15 exons, producing at least 19 alternative splicing variants. Encodes a number of microRNAs. -some articles link it to leukaemia, others to ADHD, others to risk of obesity.
What are pathogenicity islands?
major virulence factors on large segments on chromosomal or plasmid DNA. -often differ from core GC content-> lateral transfer from HGT. -mosaic structure from multiple events -flanked by small direct repeats-> recombination
MRSA
methicillin-resistant Staphylococcus aureus
Strand sequencing
One protein unzips the DNA helix into two strands. A second protein creates a pore in the membrane, where an adaptor molecule resides. Ions flowing through the pore create a current. Each base blocks the flow to a different degree, altering the current. The adaptor holds bases long enough for them to be identified electronically.
Fisher's exact test
$p = \dfrac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,N!}$, where a, b, c and d are the four cell counts of the 2x2 table and N = a + b + c + d.
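The formula gives the probability of one particular 2x2 table with fixed row and column totals. The sketch below computes it using binomial coefficients, which is algebraically the same expression, and then sums over all tables that are at most as probable as the observed one to get a two-sided p-value. That two-sided rule is a common convention assumed here, not something stated in the notes, and the function names are hypothetical.

```python
from math import comb


def table_probability(a: int, b: int, c: int, d: int) -> float:
    """Hypergeometric probability of the table [[a, b], [c, d]].

    Equal to (a+b)!(c+d)!(a+c)!(b+d)! / (a! b! c! d! N!).
    """
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Sum the probabilities of all tables no more likely than the observed one."""
    row1, col1, n = a + b, a + c, a + b + c + d
    observed = table_probability(a, b, c, d)
    total = 0.0
    # x is the top-left cell; the margins fix the other three cells.
    for x in range(max(0, row1 + col1 - n), min(row1, col1) + 1):
        p = table_probability(x, row1 - x, col1 - x, n - row1 - col1 + x)
        if p <= observed * (1 + 1e-9):  # small tolerance for float ties
            total += p
    return total


# Hypothetical counts: allele present/absent in cases vs controls.
print(table_probability(1, 9, 11, 3))
print(fisher_exact_two_sided(1, 9, 11, 3))
```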
Prokaryotic vs eukaryotic genomes
prokaryote: genome not contained in nucleus, non-condensed chromatin, single circular chromosomes, single replication origin eukaryotes: Nuclear membrane, linear chromosomes, condensed chromatin, multiple replication origins, single gene transcription.
What is second generation sequencing (AKA NGS)
short read sequencing: DNA sequences with length 100-350 bp. HiSeq technology= 100-250bp MiSeq= 200-350 bp
monomorphic
showing little genetic variation e.g TB, plague.
Population stratification
subpopulations exhibit allelic variation because of ancestry. can cause false positives if SNP differences in case and controls. control this by testing SNPs for general elevation in chi^2 between case and controls.
Evolution of gene expression
-activators are proteins that bind to enhancers speeding the rate of transcription -repressors bind to silencing regions and slow transcription -coactivators are adapter molecules and integrate signals from the activators -basal transcription factors: position RNA polymerase at start of transcription and initiate the transcription process. -changes in these can alter expression. Also changes in the promoter, enhancer, silencing regions -also changes in chromatin structure
Did Europeans interbreed with Neanderthals?
-archaeological data places the two species in the same geographical area -estimated 1-4% contribution from Neanderthals for non-African populations
How has the rate of WGS data changed since 1990?
-becoming cheaper -storage increasing tenfold
The human genome
3.2 Gb, 20,000-22,000 genes. The gene count was originally expected to be higher, to account for the difference in complexity from bacteria, which have for example 5000 genes. 23 chromosomes (haploid set), genes contain introns, no operons (clusters of genes controlled by a single promoter).
Gene annotation
Analysis of genomic sequences to identify protein-coding genes and determine the function of their products. Now done in an automated manner: the DNA sequence is translated into amino acids and compared to databases of genes with known function. Ab initio: identify ORFs from sequence data, e.g. by locating start and stop codons, then compare the translated ORF sequence to an existing database, e.g. with BLAST. Homology: transfer annotation from one genome to another, or look for homology between a set of known genes and genome sequences. Can be used to identify the pan-genome or to study the function of genes that have acquired mutations in a population.
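The ab initio step of locating start and stop codons can be illustrated with a very small ORF scanner. This is a simplified sketch under stated assumptions: it scans only the three forward reading frames, treats ATG as the only start codon, and uses a hypothetical `min_codons` cut-off; real gene finders also scan the reverse strand and use statistical models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int, int]]:
    """Return (start, end, frame) for each ORF on the forward strand.

    start is the index of the A in ATG, end is one past the stop codon,
    and min_codons counts the start and stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open an ORF at the first in-frame ATG
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next start
    return orfs


sequence = "CCATGAAATGGTAACC"
for start, end, frame in find_orfs(sequence):
    print(frame, sequence[start:end])
```

Each reported ORF would then be translated and searched against a protein database, as the card describes.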
Comparative genomics
Computer-aided comparison of DNA sequences between different organisms to reveal genes with related functions. -reads mapped to reference
Third generation sequencing
Long read sequencing. Assembly can be performed on long reads to produce finished genomes. Currently error prone, so long reads are combined with second-generation data in a hybrid assembly: the base-calling accuracy of second-generation reads is coupled with the scaffolding power of long reads to resolve genomic features that are unresolvable by short reads alone. Long reads span the entire length of repeat (low complexity) regions and also help to resolve intermittent identical repeats.
Single molecule real time sequencing (SMRT): third-generation sequencing
Main advantages: library preparation not necessary. Sequencing reagents not needed. Oxford Nanopore MinION: -directly detects DNA template composition -DNA molecule crosses the pore due to the action of a secondary motor protein -shifts in the voltage (potential) on each side of the pore are characteristic of each DNA sequence. 10-100kb. Pacific Biosciences: -DNA polymerase fixed at the bottom of a well -single DNA molecule template -zero-mode waveguide (light focussed into the well) -phosphate-labelled nucleotides are incorporated, giving off a detectable fluorescent signal. Up to 30kb.
Ecological mechanisms for pathogenicity
Pathogenic clones: pathogenic clones proliferate while other strains do not. True opportunistic pathogenicity: multiple genetically divergent clones proliferate in blood and disease determinants are equally distributed among the genomes of isolates. Divided genome: HGT spreads pathogenicity determinants into multiple backgrounds, allowing divergent clones to colonise blood successfully.
What is the genome of an organism?
The entire genetic material of an organism (including all of the chromosomes and the mitochondria).
Mitochondrial Eve
The first female ancestor shared by all living humans, who was identified by analysis of mitochondrial DNA. probably lived about 130-200,000 years ago. doesn't mean she was the first human, means that no mitochondria from other lineages are represented in any humans living today. Studies show all men today can trace ancestry back to single man about 120,000 years ago. This is interpreted that there were higher levels of sexual selection on males than females. Fewer males contributed to the next generations, indicating polygynous sexual reproduction. If monogamous, males and females contribute equally to the next generation so the Y chromosome and mt trees would coalesce at the same time.
Read quality and how assigned
The rate of error in identifying the bp. becomes poorer: -over length of read -over sequencing run -in the reverse read Important to: -remove adaptor sequences -Trim poor quality base -exclude poor quality reads
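The three clean-up steps listed above can be sketched in a few functions. Assumptions, none of which come from the notes: qualities are Phred scores (Q = -10 log10 of the error probability), the adapter sequence is known and appears intact in the read, and the thresholds (20 for quality, 4 for minimum length) are arbitrary illustration values.

```python
def error_probability(phred_score: float) -> float:
    """Convert a Phred score to the probability that the base call is wrong."""
    return 10 ** (-phred_score / 10)


def remove_adapter(sequence: str, qualities: list[int],
                   adapter: str) -> tuple[str, list[int]]:
    """Cut the read at the first exact occurrence of the adapter, if any."""
    index = sequence.find(adapter)
    if index == -1:
        return sequence, qualities
    return sequence[:index], qualities[:index]


def trim_low_quality_tail(sequence: str, qualities: list[int],
                          min_quality: int = 20) -> tuple[str, list[int]]:
    """Drop bases from the 3' end while they fall below min_quality."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


def keep_read(qualities: list[int], min_mean_quality: float = 20.0,
              min_length: int = 4) -> bool:
    """Exclude reads that are too short or too poor on average."""
    if len(qualities) < min_length:
        return False
    return sum(qualities) / len(qualities) >= min_mean_quality


# Hypothetical read: 8 genomic bases followed by a 7-base adapter.
seq = "ACGTACGTAGATCGG"
quals = [38, 37, 36, 35, 34, 30, 12, 8, 5, 5, 5, 5, 5, 5, 5]
seq, quals = remove_adapter(seq, quals, adapter="AGATCGG")
seq, quals = trim_low_quality_tail(seq, quals)
print(seq, quals, keep_read(quals))
```

Trimming from the 3' end reflects the point above that quality becomes poorer over the length of the read.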
Campylobacter
Gram-negative bacteria, S-shaped and motile. C. jejuni and C. coli are the leading cause of bacterial diarrhoeal illness in the US and UK. Costs the UK NHS £500 million pa. Transmitted through contaminated food, untreated water, unpasteurised milk and contact with infected animals. Person-to-person by the faecal-oral route. Symptoms include diarrhoea, fever, nausea and abdominal pain. There are several potential sources; genotyping and a Bayesian clustering algorithm showed that chicken is a major source of human infection.
What is the random shotgun method?
-whole genome shredded into small fragments of set length -each fragment is then sequenced from both ends and the two resulting sequences are called mate pairs -sequences are then assembled into contigs, which are used to build scaffolds.
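Assembly of overlapping reads into contigs can be illustrated with a greedy overlap merge. This is a toy sketch, not how production assemblers work: it assumes error-free reads from one strand, ignores mate-pair information, and repeatedly joins the pair of sequences with the longest suffix-prefix overlap. The function names and the `min_length` parameter are hypothetical.

```python
def overlap(a: str, b: str, min_length: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads: list[str], min_length: int = 3) -> list[str]:
    """Merge reads into contigs, always taking the longest overlap first."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_length)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
print(greedy_assemble(reads))
```

Repeats longer than the reads break this approach, which is one reason mate pairs and scaffolds are needed in practice.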
Genome annotation
~ 50% genes in a genome are of unknown function -annotation gives function to the gene: identify non-coding regions, coding regions and attach biological information to this. -allows automated homology searching to find related proteins -Conserved domains are databases containing proteins of the same function. -some are general function, some specific, some unknown but gives clues if similar.
Repetitive elements
Constitute more than 50% of the human genome and over 90% in lilies. Explains the orders-of-magnitude variation in genome size. 1. Transposon-derived repetitive elements. LINE: transpose autonomously via an RNA intermediate, converted to DNA using reverse transcriptase (RT). SINE: non-autonomous, use RT. Retrovirus-like LTR elements: use RT. DNA transposons: move via a DNA intermediate using the transposase enzyme. 2. Pseudogenes: inactive copies of genes. 3. Simple sequence repeats: Microsatellites: 1-13nt repeats, mutate faster than the surrounding sequence, repeated 10-100 times at one locus. Propagate due to DNA strand slippage. The number of repeats at a locus is hypervariable between individuals: useful for forensics and DNA fingerprinting. Minisatellites: 10-100bp, often GC rich, concentrated at the sub-telomeric regions in humans. 4. Segmental duplications: up to 300kb. 5. Non-interspersed repeats, e.g. ribosomal gene clusters and heterochromatin: heterochromatin is concentrated at the centromeres and telomeres of chromosomes to aid spindle attachment and movement.
Hierarchical vs shotgun sequencing
Hierarchical: develops a low-resolution physical alignment so that sequence can be obtained in large, ordered pieces. Shotgun: whole genome fragmented into very small pieces; these small fragments are sequenced and computer algorithms guide the assembly.
ChIP sequencing
Chromatin immunoprecipitation (ChIP) followed by sequencing of the recovered DNA on NGS platforms. -protein of interest (POI) cross-linked with DNA in cell culture -lyse cells, shear chromatin into fragments -add specific antibody for the protein -select POI, reverse crosslinks -purify DNA, amplify and label -sequence (ChIP-seq) or hybridise to a microarray (ChIP-chip)
shotgun sequencing
sequences small pieces of genomes which are assembled by a computer
Serial Analysis of Gene Expression (SAGE)
Sequencing technique for determining the quantities of different RNAs in a mixture. mRNA is extracted and reverse transcriptase is used to copy it into stable cDNA. The cDNA is then digested by restriction enzymes, producing tags. The tags are concatenated (linked together) and sequenced to find the frequency of each tag.
SNPs and types
Single nucleotide polymorphism. -synonymous (silent): the changed codon still codes for the same aa -non-synonymous: the base change gives a codon that codes for a different aa -in a regulatory region
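Whether a coding SNP is synonymous can be checked by translating the codon before and after the change. The sketch below builds the standard genetic code from its usual compact string form and classifies a single substitution. It assumes the sequence starts in frame at position 0 and uses 0-based positions; `classify_snp` is a hypothetical helper name.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_snp(coding_sequence: str, position: int,
                 new_base: str) -> tuple[str, str, str]:
    """Return (kind, reference amino acid, alternative amino acid)."""
    codon_start = position - position % 3
    ref_codon = coding_sequence[codon_start:codon_start + 3].upper()
    offset = position - codon_start
    alt_codon = ref_codon[:offset] + new_base.upper() + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return kind, ref_aa, alt_aa


cds = "ATGGCTGAA"  # codons ATG GCT GAA
print(classify_snp(cds, 5, "C"))  # third base of GCT changed to C
print(classify_snp(cds, 6, "A"))  # first base of GAA changed to A
```

A "*" in the output marks a stop codon, so a change to "*" is a non-synonymous SNP that truncates the protein.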
Transcriptomics
study of the complete set of RNAs produced in cell or sample from the genes of an organism -better understand gene function by studying genes up or down regulated when the gene is missing -identify genes related to conditions by looking for genes up or down regulated -reconstruct splicing variants and expression pattern -build co-expression networks.
Metagenomics
the study of genetic material recovered directly from environmental samples. Total genome content of particular niche or habitat.
What is epigenetics?
the study of reversible inheritable changes in organisms caused by modification of gene expression rather than alteration of the genetic code itself. -histone modification change the chromatin structure and affect DNA accessibility to regulatory proteins such as TFs -DNA methylation affects the binding of regulatory proteins to DNA
introgression
the transfer of genetic information from one species to another as a result of hybridization (sexual transfer of DNA) between them and repeated backcrossing.
horizontal gene transfer (HGT)
Transfer of DNA from a donor cell to a recipient cell. Transformation: recipient bacterium takes up extracellular DNA from the donor. Conjugation: temporary direct contact leads to transfer of plasmid DNA. Transduction: donor cell DNA is transmitted by a bacteriophage and integrated into the host's genome. Leads to replacement of homologous DNA with sequence from another lineage, or to insertion or deletion of non-homologous DNA. This can result in loss of function, but can also introduce or duplicate genes, giving new function and promoting host adaptation.
genotyping
-extract, amplify and fragment DNA -microarray or sequence and then establish the genotype
What is the "Out of Africa" hypothesis?
Homo sapiens emerged in Africa and dispersed less than 150,000 years ago, spreading to Near east, Europe, Asia and replacing other hominids like Homo erectus. -nuclear, mitochondrial and Y chromosome based methods support -Africa has the highest genetic diversity -Non-African populations have a subset of the variants present in African populations
What were the main findings from the Human genome Project and gene studies?
-fewer genes than expected ~25,000, since lowered to just over 20k -many genes derived from HGT from bacteria or transposons (moveable DNA element) -large variations in gene size -uneven gene distribution: gene-rich and gene-poor regions. in general, expressed genes are clustered in regions of higher GC content. -much of the genome may contain functional regulatory regions
How do you label cDNA for hybridisation?
-Direct: incorporate fluorescent dyes during reverse transcription of mRNA to cDNA -Amplified RNA: T7 RNA polymerase in cloning vector primes the amplified mRNA with biotinylated or aminoallyl nucleotides -Indirect: 3-DNA attached to oligo(T) primer primes cDNA synthesis. Use an anti-3DNA antibody to immunolabel the bound probe
Which elements are associated with H. pylori virulence?
-cag PAI
Why are epigenetic changes important?
-development of multicellular organisms, maintaining differentiated states in cells -imprinting of genes -environment-organism interactions, which are sometimes passed on to offspring -pathogenesis of disease: epigenetic changes can lead to cancer, e.g. if a tumour suppressor gene becomes methylated at its promoter, it will be silenced and uncontrolled growth occurs. Areas that are usually hypermethylated but become hypomethylated cause genomic instability and contribute to tumourigenesis. If as few as three genes are turned off, a normal cell can be converted into a cancer cell.
Limitations on interpretation of GWAS results
-most SNPs linked to a condition have low predictive power, each usually explaining less than 1% of the variance -statistical power is low when testing millions of SNPs -all identified SNPs for a disease together only explain a fraction of the heritable component -for most SNPs there is no known mechanism linking the SNP to the condition, and it is often not known which gene is affected by the SNP. Associated SNPs may be driving unrelated covarying traits, e.g. ageing. -GWAS relies on known variants, missing rare variants which may have a larger effect on the phenotype (particularly in disease) -by sequencing more genomes, more SNPs and polymorphic indels can be identified, allowing larger-scale GWAS.
Issues with Sanger sequencing
-DNA must be cloned in E. coli vectors: this may not work for some sequences, as they may encode toxic proteins or the replication machinery may not tolerate them -DNA polymerase may struggle to copy some sequences -DNA polymerase has an error rate of 1 base per 10,000 -manual editing sometimes required.
Alternatives to GWAS
-scans for copy number variation -exome sequencing, less expensive and can identify rare variants even in small studies -whole genome sequencing: allows rare variants in non-coding regions. Much more expensive.
Single cell sequencing
-several techniques; Multiple Displacement Amplification is favoured. Single cells are isolated by laser capture microdissection, FACS or microfluidics. DNA is then extracted, amplified with PCR or Illumina bridge amplification, then sequenced.
Evidence for recent selective sweeps in modern populations
-study SNP frequencies: areas of low heterozygosity can be interpreted as evidence of a recent sweep that increased the frequency of that allele.
Manhattan plots
type of scatter plot, where data points of higher magnitude could indicate SNPs involved in trait.
Sanger sequencing method 1977 (chain termination method)
1. denature template (often a plasmid) to single strands 2. anneal universal primer to the template. It is universal because it binds to the plasmid sequence adjacent to the cloned DNA 3. set up 4 reaction mixtures, each containing template, primer, DNA polymerase, free dNTPs (radiolabelled) and one type of modified nucleotide (dideoxyNTP: ddCTP, ddGTP, ddATP or ddTTP). A ddNTP has no 3' hydroxyl group, so the chain terminates wherever it is incorporated 4. denature the DNA again to single strands, then separate the fragments by size using acrylamide gel electrophoresis (DNA is negatively charged, so small fragments migrate faster towards the anode). Gives read lengths of ~1kb
Potential sources of bias for GWAS
1. multiple testing: the p value is the probability of seeing a result at least this extreme if the null hypothesis is true, so it sets the chance of a spurious (fake) result being taken as significant. Thresholds of p=0.05 or 0.01 are often used, and the null hypothesis is rejected when the p value falls below the threshold. In GWAS many thousands of SNPs are tested, so using p=0.05 means many SNPs are identified as associated when they are not (false positives). Can use the Benjamini-Hochberg correction. 2. sample size: the larger the sample size, the higher the probability of identifying small associations. Small samples may miss important associations; can use a TWO STAGE STUDY. A first scan finds areas of interest that may not reach statistical significance. These areas are then genotyped or sequenced independently to confirm associations. 3. population stratification: systematic difference in allele frequencies between subpopulations in a population due to different ancestry, as a result of diverging in different geographical locations. For example, if a disease is more common in one population, several neutral SNPs could be significantly associated with the phenotype when they are really just markers of the origin of the individual. 4. hard-to-measure phenotypes, such as mental health conditions, can lead to badly matched cases and controls and therefore spurious associations. 5. some SNPs may not be called in particular individuals when different technologies are used. This is tackled by merging datasets using genotype imputation, which predicts genotypes not directly assayed in a sample, based on a SNP-dense haplotype map or whole genome sequences.
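The Benjamini-Hochberg correction mentioned under multiple testing can be sketched directly. The procedure ranks the p values, compares the k-th smallest against (k / m) x alpha for m tests, and declares significant every test up to the largest rank that passes. The p values below are invented for illustration and `benjamini_hochberg` is a hypothetical function name.

```python
def benjamini_hochberg(p_values: list[float],
                       alpha: float = 0.05) -> list[bool]:
    """Return, for each input p value, whether it is declared significant."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by rank
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff_rank = rank  # keep the largest rank that passes
    significant = [False] * m
    for i in order[:cutoff_rank]:
        significant[i] = True
    return significant


# Invented p values for eight SNPs.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
uncorrected = [p < 0.05 for p in p_values]
corrected = benjamini_hochberg(p_values)
print(sum(uncorrected), "SNPs pass p < 0.05 without correction")
print(sum(corrected), "SNPs pass after Benjamini-Hochberg")
```

The correction controls the expected proportion of false positives among the reported SNPs rather than the chance of any single false positive, which makes it less strict than a Bonferroni-style threshold.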
How are proteomes analysed?
1. prepare the sample, purify the proteins and remove other biomolecules. Add salt and other chemicals. 2. Separation of proteins within the mixture -2D gel: separate proteins by charge and molecular weight, stain and visualise -Liquid chromatography 3. characterisation of individual proteins. Cut out spots of interest and analyse using mass spec.
genome-wide association studies
A large-scale analysis of the genomes of many people having a certain phenotype or disease, with the aim of finding genetic markers that correlate with that phenotype or disease.
Compare different NGS techniques
All: -sample prep: DNA input from library, amplified and fragmented or with adaptors ligated -cluster generation (oligo bound to beads) -sequencing -data analysis. Sequencing by synthesis: Illumina MiSeq: dNTP bases have a fluorescent tag and terminator group. The fluorescence is recorded when the base is incorporated, then the tag and terminator are removed so the next base can be added. Bases are added using DNA polymerase; reads of 100-150bp. Pyrosequencing: Roche 454. Pyrophosphate is released upon addition of a dNTP base by DNA polymerase. This undergoes chemical reactions and luciferase releases light. Up to 1kb reads. High reagent cost, high error rate. Sequencing by ligation: Applied Biosystems SOLiD: uses 16 8-mer oligonucleotide probes. Bind primer to adaptor sequence. Anneal probe with DNA ligase. Detects fluorescence of the joining probes. Very short reads of 35-75bp. Ion semiconductor sequencing (Ion Torrent Proton): a semiconductor transistor measures changes in pH. As DNA polymerase incorporates nucleotide bases, a proton is released, which decreases the pH, and this is detected. -cost and time efficient -200bp reads.
Human Genome Project
An international collaborative effort to map and sequence the DNA of the entire haploid human genome within 15 years (by 2005). First vertebrate and largest genome sequenced to that date. Public International Human Genome Sequencing Consortium (IHGSC): $3 billion project founded in 1990 by the US government. Its principles were that collaborators were welcome and the sequence data were made freely available within 24 hours ('Bermuda' principle, data released ahead of publication). HOW? 1. Hierarchical approach: physical map establishing the distances between markers along the genome. 2. Shotgun approach, completed in 2001. 3. Finishing phase: PCR used to amplify unknown segments, which were then sequenced. For large gaps >20kb, BAC libraries were screened to identify segments containing the edges of the gap, which were then sequenced using shotgun sequencing. Celera Genomics (Craig Venter) announced a private project at 1/10th of the cost, completed at the same time as the public one. It used a random shotgun sequencing method and had the goal of sequencing the human genome in just 3 years. It used the public effort's physical map. 100,000 Genomes Project: UK government sequencing whole genomes from NHS patients, focusing on rare diseases, cancer and infections. UK10K project: 4000 people studied for many diseases and traits over many years, and 6000 people studied because they have extreme obesity, neurodevelopmental disease or other disease. Costs have been dramatically reduced and processes are much faster: genome sequencing currently takes about 2 days and costs <£1000.
Using basic local alignment search tool BLAST to classify gene function
BLASTn- compares a nucleotide sequence to a nucleotide database. BLASTp- compares an amino acid sequence to a protein sequence database. BLASTx- translates a nucleotide sequence in all six possible reading frames and scans these against a protein database.
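A rough sketch of the six-frame translation step that BLASTx performs before searching a protein database, assuming the standard genetic code. This is not BLAST's own code; the function names and the short test sequence are illustrative only.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; leftover bases at the end are ignored."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Three frames on the forward strand, then three on the reverse strand."""
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


frames = six_frame_translation("ATGAAATAG")
print(frames)
```

Each of the six protein strings would then be compared against the protein database in the same way as a BLASTp query.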
DNA analysis from ancient samples: paleomicrobiology
Can't culture the microbes, so their genomes are studied instead. DNA is usually degraded and pathogen DNA is a tiny proportion of the total. 1. Molars are favoured as they are large and often intact; enamel is used for radio dating. 2. Pulp is extracted from the tooth, as pathogen DNA from systemic infections can be recovered from it. 3. DNA is extracted from the pulp. Can also use microscopy for hairs and textiles. Can culture if specimens were frozen. Preventing contamination from environmental flora is crucial: wearing gloves, scraping the external surface, cleansing and UV radiation. Suicide PCR: use primers only once.
Hypermutators
Emerge when there is a strong selective pressure such as antibiotics or change in host niche.
Molecular clock
DNA and protein sequences evolve at a rate that is relatively constant over time and among different organisms. Can be calibrated using temporal information from aDNA to date samples.
Hierarchical sequencing
DNA libraries: DNA is fragmented using restriction enzymes or sonication; fragments of 50-200kb are cloned into BAC or P1 vectors and propagated in E. coli. Hybridization: all of the (BAC) clones in a library that carry a particular sequence can be identified rapidly by hybridizing a small radioactively labeled probe containing the sequence to a filter on which an array of ~10,000 clones is printed. The probe hybridises only to BACs that contain this sequence. Fingerprinting: assemble contigs by comparing and aligning clones according to their restriction digest profiles. Fragments are separated by electrophoresis and the information is converted to a profile by computer. Shared bands suggest shared sequence. The fingerprints are converted into a BAC alignment, which is used to choose which BACs to sequence. End-sequencing: fills in the gaps after fingerprinting, e.g. by sequencing both ends of the collection of BAC clones. The minimum number of clones that form a contig covering the entire chromosome comprises the tiling path that is used for sequencing.
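A minimal sketch of choosing a tiling path, assuming the start and end position of each BAC along the chromosome is already known from fingerprinting and end-sequencing. It uses a simple greedy cover; the clone names and coordinates are hypothetical, and real pipelines would also require a minimum overlap between neighbouring clones.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Greedy cover: at each step pick the clone that overlaps the current
    covered position and extends furthest to the right.

    clones is a list of (name, start, end) tuples.
    """
    path = []
    covered_to = region_start
    while covered_to < region_end:
        # Clones that start at or before the covered position and reach beyond it.
        candidates = [c for c in clones if c[1] <= covered_to < c[2]]
        if not candidates:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        best = max(candidates, key=lambda c: c[2])
        path.append(best[0])
        covered_to = best[2]
    return path


# Hypothetical BAC positions in kb (illustrative only).
bacs = [
    ("BAC1", 0, 150),
    ("BAC2", 50, 200),
    ("BAC3", 120, 300),
    ("BAC4", 180, 320),
    ("BAC5", 290, 450),
]
tiling_path = minimal_tiling_path(bacs, 0, 450)
print(tiling_path)
```

Here three of the five clones are enough to cover the region, so only those would be sent for shotgun sequencing.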
Transposons and classes
DNA sequence that can change its position within a genome, sometimes creating or reversing mutations and altering the cell's genetic identity and genome size DNA transposon: single ORF encoding transposase Retrotransposons: LTR, non-LTR autonomous and non-LTR non-autonomous Why do we have so many? -selection not strong enough to get rid of them -functional: increase expression during cell stress -chance functionality: incorporated into regulatory regions, providing TF binding sites, or into exons.
Assembly methods
De Bruijn graph (dBG): k-mers (AKA n-mers) are identified from all reads and a graph of k-mers is built. The best path through the graph is chosen. Small k-mer: less information, better coverage. Large k-mer: more information, more breaks in contigs. Overlap-layout-consensus (OLC): overlaps between reads are identified, an overlap graph is built, and non-branching stretches are identified. Issues include being slow for large datasets and complicated consensus calling for multi-copy genes. Pros include using all the information from each read (not split into k-mers). In both methods, repeat regions are unresolvable, so the output is a collection of contiguous sequences: CONTIGS.
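A toy sketch of the de Bruijn approach, assuming error-free reads and no repeats. Nodes are (k-1)-mers and each k-mer contributes one edge; the reads and function names are invented for illustration and real assemblers do far more (error correction, bubble removal, coverage checks).

```python
from collections import defaultdict


def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_graph(reads, k):
    """Each k-mer adds an edge from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Extend a contig while the current node has exactly one outgoing edge.

    Simplified: only out-degree is checked, and the walk stops on a repeat visit.
    """
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        nxt = next(iter(graph[node]))
        if nxt in seen:
            break
        seen.add(nxt)
        contig += nxt[-1]
        node = nxt
    return contig


# Three hypothetical overlapping reads.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, k=4)
contig = walk_unbranched(graph, "ATG")
print(contig)
```

A branch in the graph (a node with two outgoing edges, as a repeat would create) is where this walk stops, which is why assemblies end up as separate contigs.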
Exome sequencing and Miller syndrome
Exome sequencing sequences only the coding regions of the genome. DNA is sheared and sections containing exons are captured using probes (the exons can be hybridised to a microarray). Next generation sequencing then allows changes in DNA to be identified beyond the genotyping of a set of known SNPs. Allows extremely rare variants to be identified. In the Miller syndrome study, candidate genes were filtered against public SNP databases. A single candidate gene, DHODH, was identified, which encodes a key enzyme in the pyrimidine de novo biosynthesis pathway. Sanger sequencing confirmed the presence of DHODH mutations in three additional families with Miller syndrome.
Microarray
High throughput, high input RNA amount required, low labour intensity, reference transcripts needed for probes. Therefore no new genes or alternative splicing variants can be identified.
Finding genes via in silico methods
GeneFinder Grail for microbial. Genie, Genescan for eukaryotic.
What is GWAS?
Genome-wide association study: an examination of many common genetic variants (SNPs) in different individuals to see if any variant is associated with a phenotypic trait or disease. Cases and controls are compared to see differences (globally) and look for trait association. Has been able to obtain clear results for single-locus diseases. -guides experimental research in certain diseases by pinpointing genes -has aided in the development of some drugs
Histone modification
H2A, H2B, H3, H4 can be modified by the following: -acetylation: transfer of an acetyl group from acetyl coenzyme A to lysine at the N-terminal tail neutralises the positive charge. Loosens chromatin packaging and activates transcription. Histone deacetylases remove the acetyl group, re-establish the charge and repress transcription. -histone methylation: histone methyl transferase methylates specific amino acids such as lys 4 & 9 in H3. Can silence nearby genes (e.g. methylation of lys 9), although some marks, such as methylation of lys 4, are associated with active genes. -phosphorylation: adds a phosphoryl group. Has been associated with increased expression of proliferation-associated genes, but can also silence. Phosphorylation increases during cell stress and DNA damage. It is thought that histone modification enzymes remain attached to the DNA strands during replication, acting as epigenetic marks; histones are then recruited and modified. PcG (polycomb group) proteins have been shown to repress expression. TrxG (trithorax group) proteins have been shown to promote expression.
Pros and cons of microarray and oligonucleotide array
Pros of oligo arrays: higher densities of genes; conditions are controlled precisely so the data are comparable; can distinguish closely related sequences. Cons: lots of sequence information required from the organism, costly, and issues arise if the sequences used to design the probes get revised. Pros of cDNA microarray: cheap, no sequenced genome required, reduced effect of sequence polymorphisms on hybridisation. Cons of cDNA microarray: -not all mRNAs represented, 5-15k max probes -cross hybridisation of similar sequences, so hard to distinguish them -high background -construction of cDNA libraries from a variety of tissues and conditions required. Pros of RNAseq: -sensitive to genes expressed at a very high or low level -lower technical variation -> more reproducible -don't need a sequenced genome -provides detail on transcribed regions. Cons: -costly -loads of data (computing power and storage) -more complicated analysis
Evaluate Read mapping and its applications
Pros: -rapid, accurate, high confidence, comparable and reproducible Cons: -requires high quality genome sequence -can't be used to identify genes not present in reference -not reliable for large genomic events such as translocations -repeat regions are problematic. applications: -produce phylogenetic trees from closely related species or isolates from one species -identify presence or absence of set of genes such as antibiotic resistance -estimate coverage of assembly and use to improve confidence in assembly -compare to a number of references to rapidly find the most closely related. -call multi-locus sequence type genes
Transcriptomic methodologies
Relative expression levels, not absolute, due to the fact that post-transcriptional modification can give rise to multiple protein forms from a single mRNA transcript. cDNA microarray: -cDNA clones amplified with PCR and purified -robotic printing of DNA onto slide -labelled cDNA hybridises to DNA on microarray -wash off unbound -lasers detect the amount of fluorescence. Oligonucleotide arrays: 1. Gene chips, e.g. Affymetrix: -22 probes per transcript -25-mer oligonucleotide probes synthesised and applied to chip -labelled cDNA or RNA hybridised to chip -laser detects fluorescence. 2. Illumina HT-12 beadchips: -50-mer oligonucleotide probes coat beads -cDNAs hybridise with the beads -intensity of fluorescence measured using the optic fibres that the beads sit on -each bead has a unique DNA barcode. RNA-seq: high-throughput cDNA sequencing using NGS. Sequences every RNA molecule; expression is based on the number of times a transcript is sequenced. -cDNA fragments created from polyA RNAs -adaptors added to fragments, NGS obtains reads -align reads with reference -estimate no. of reads per kb of predicted exon per million total reads = RPKM.
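The RPKM definition at the end of the card can be written out as a short calculation. This is a sketch with invented read counts; the function and argument names are illustrative.

```python
def rpkm(reads_mapped_to_gene, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    kilobases = exon_length_bp / 1_000
    millions_of_reads = total_mapped_reads / 1_000_000
    return reads_mapped_to_gene / (kilobases * millions_of_reads)


# Hypothetical gene: 500 reads over 2000 bp of exon, in a run of 10 million reads.
value = rpkm(500, 2_000, 10_000_000)
print(value)
```

Dividing by exon length stops long genes looking more highly expressed just because they yield more fragments, and dividing by total reads makes runs of different depth comparable.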
454 sequencing 2005 NGS
Roche 454 pyrosequencing: 800,000 reads of 450-100bp each. As polymerase adds a nucleotide, PPi (pyrophosphate) is released and converted to ATP by sulfurylase. Luciferase then uses the ATP to emit light, with the intensity depending on how many nucleotides were added. Illumina Solexa: 60 million reads of 100bp each.
Comparing sequences for gene annotation
S score = similarity score: base match = +3, base mismatch = -1, gap penalty = -5. An E value is also given; if it is < 10^-5 the alignment is likely to be significant, as the E value is the number of alignments with a similarity score at least as high as the one in question that would be expected by chance. It is important to know that alignment and comparison alone are error-prone when it comes to elucidating function. For example, lactate dehydrogenase is an enzyme, but a near-identical protein serves as a structural lens protein in some vertebrates.
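A small worked example of the S score, using the match, mismatch and gap values from the card. It assumes the two sequences are already aligned, with "-" marking a gap; the sequences themselves are made up.

```python
def alignment_score(seq_a, seq_b, match=3, mismatch=-1, gap=-5):
    """Score two already-aligned sequences column by column."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must be the same length")
    score = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            score += gap        # gap in either sequence
        elif a == b:
            score += match      # identical bases
        else:
            score += mismatch   # different bases
    return score


# 5 matches (+15), 1 mismatch (-1), 1 gap (-5) gives a score of 9.
s = alignment_score("ATGC-TA", "ATGAGTA")
print(s)
```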
DNA methylation
The addition of methyl groups to bases of DNA after DNA synthesis; may serve as a long-term control of gene expression. Added by methyl transferase proteins to the cytosines of CpG dinucleotides. Methylated CpG islands close to promoter regions cause repression of expression. The methylation is copied to the new strands of DNA during DNA replication.
What is proteomics?
The study of protein sets produced by a cell or sample of cells at a particular time on a large scale to obtain a global integrated view of disease. Proteins are what make the phenotypic trait.
Gene ontology categories
biological process: e.g. DNA repair or signal transduction molecular function: e.g. transport or catalysis cellular components: whether they are stand-alone or a component e.g. of a ribosome.
Distinguish between biological replicate and technical replicate
biological: same test on multiple samples of the same material e.g. cell type, tissue etc technical replicate: test the same sample multiple times (testing the protocol variability itself)
expressed sequence tag (EST)
cDNA clones are selected randomly from a library and sequenced to identify actively expressed genes. Usually 100-800 nucleotides long.
Relative expression of mRNA in transcriptomics comparative techniques
Reference: cDNA samples v cDNA standard reference. Loop: each sample compared with every other sample. Split plot: test the expression levels under different conditions e.g. temperature, drug treatment, time.
G-value paradox
Lack of correlation between the number of protein-coding genes and the biological complexity of an organism in eukaryotes. The related C-value paradox is the lack of correlation between genome size and complexity: in prokaryotes, genome size relates to gene number and complexity, but in eukaryotes size is largely determined by the amount of repetitive DNA. What increases complexity? -complex regulatory mechanisms -multiple replication origins -changes in gene content -protein domains and interactions -more distinct transcripts via alternative splicing (multiple distinct protein products produced from a single gene). This goes against the one gene one protein hypothesis.
Comparing the expression of two samples statistically
log2 ratio: +1 = double the expression, -1 = half the expression. In the heat map, columns = samples and rows = genes; genes are clustered into functionally homologous groups. A difference is significant if its p value is below the chosen threshold. Yellow/green = negative ratios, blue/red = positive ratios.
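A short sketch of the log2 ratio for one gene in two samples. The expression values are invented, and which sample goes on top of the ratio is an assumption (here treated over control).

```python
import math


def log2_ratio(treated, control):
    """log2 of the expression ratio: +1 is a doubling, -1 is a halving."""
    return math.log2(treated / control)


doubled = log2_ratio(200, 100)    # expression doubled
halved = log2_ratio(50, 100)      # expression halved
unchanged = log2_ratio(100, 100)  # no change
print(doubled, halved, unchanged)
```

Using log2 makes up- and down-regulation symmetric around zero, which is why a doubling and a halving appear as +1 and -1 rather than 2 and 0.5.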
RNA- seq
mRNA is extracted, fragmented and copied into ds-cDNA using reverse transcriptase. The cDNA is sequenced via high-throughput short-read sequencing and the reads are aligned to a reference genome, which shows which genomic regions were being transcribed. Data can be used to annotate where expressed genes are, their relative expression levels and any alternative splice variants. -lower input RNA than microarray, higher labour intensity -higher sensitivity and can detect SNPs and splice variants -allows identification of new genes and novel splicing events. If used on total RNA it can help study non-coding RNAs too.
cDNA synthesis
Making a complementary DNA sequence of your RNA with reverse transcriptase. An oligo(dT) primer binds to the poly(A) tail of the mRNA; reverse transcriptase and dNTPs are added and the complementary strand is produced. NaOH is added to remove the RNA strand, leaving single-stranded cDNA. The 3' end curls round and binds to itself, forming a hairpin loop. This acts as a primer for DNA polymerase, which uses dNTPs to synthesise a strand complementary to the ssDNA. S1 nuclease cleaves the single-stranded loop to leave double-stranded DNA.
Metagenomics for disease outbreak investigation
Phase 1. Draft genome of the outbreak strain obtained: DNA is extracted from samples, fragmented and sequenced. Human DNA is removed and microbial DNA is assembled into a collection of environmental gene tags (EGTs). Phase 2. The depth of coverage of the outbreak strain genome is determined in each sample. Phase 3. Sample sequences are compared with known bacteria to identify pathogens other than the outbreak strain. E.g. Shiga-toxin genes were detected in 27/40 STEC-positive (Shiga-toxigenic Escherichia coli) samples, and in phase 3 sequences from Clostridium difficile, Campylobacter jejuni, Salmonella enterica and Campylobacter concisus were recovered.
Evaluate assembly sequencing and its applications
pros: reference free, so novel sequences can be constructed and new gene/sequence identified -can be used to identify variants from large genomic events cons: computationally expensive and time consuming even for small genomes. struggles to resolve repetitive regions such as gene duplications Applications: -assemble novel genome of organism and compare to related genomes -identify large and small variants -identify pan genome of a sample -identify core or accessory genes in a sample -map reads to a genome and assemble the remainder to identify novel genes.
How do you verify array and RNAseq expression data?
Quantitative RT-PCR -reverse transcription (RT) to synthesise cDNA, which is then amplified by PCR -SYBR Green I only fluoresces when bound to dsDNA, so fluorescence increases as product accumulates
NGS: Illumina solexa sequencing
Sample prep: tagmentation: fragmentation and addition of 5' and 3' adaptors. The adaptors added to the fragments are complementary to the flow cell oligos and also contain sequencing primer binding sites. Adaptor-ligated fragments are PCR amplified and gel purified. Cluster generation: there are two types of oligo on the flow cell. The adaptor on the fragment hybridises to one type of oligo, polymerase makes a complementary strand, the double strand is denatured and the original strand is washed off. The new strand then bridges over and the adaptor at its other end binds to the other type of oligo. The strand is copied using polymerase, and repeating this amplifies it clonally (bridge amplification). Reverse strands are washed off and 3' ends are blocked. Sequencing: fluorescently labelled nucleotides are added complementary to the fragment, each emitting a characteristic wavelength of light when incorporated. Each nucleotide carries a terminator group that blocks further elongation, so one base is read per cycle. To start the next cycle, the fluorescent signal and terminator are removed. This creates many reads. Data analysis: computers detect each base type, generating sequenced reads; these are aligned against a reference genome and analysed for variants.