Advanced Genetics & Genomics
What are the key questions (or limitations) for metabarcoding experiments?
- How many reads are needed for statistical power?
- Is there any overestimation due to sequencing error?
- Is there an overrepresentation of species that have multiple copies of the 16S rRNA gene?
- How accurate is the taxonomic classification?
- How do you account for rRNA genes not all amplifying equally well with the same "universal" primers?
Describe the methods underlying metabarcoding and the goals / purpose.
- Metabarcoding: sequencing PCR products of a marker region (like 16S) from a given sample/ecosystem
- Q: Who's there? Try to determine the organisms present in that sample/ecosystem
- Identification of taxa (phylum, genus, species)
- Estimation of species abundance
- The aim is often to identify different species and compare different communities (without reconstructing their entire genomes)
Describe the methods underlying metagenomics and the goals / purpose.
- Metagenomics: sequencing all the DNA in a given sample/ecosystem, with no PCR amplification
- DNA is usually very hard to annotate
- Q: Who are they AND what are they doing?
- Functional annotation
- Estimation of gene/pathway abundance
Why do you NOT always have to sequence an entire metagenome?
- expensive - isn't needed to answer question - low-abundance
Why do different loci have the same demographic histories? In other words, what causes high LD?
- strong positive selection (e.g. lactase persistence)
- small population size (genetic drift)
- lack of recombination
- population admixture (migration)
D' is one metric that can be used to quantify linkage disequilibrium. What are the minimum and maximum values that D' can take? If you find a maximum value of D' between two loci, what does this imply about linkage disequilibrium for those loci?
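0 and 1 (for the absolute value |D'|; the signed D' runs from -1 to 1). A maximum value of 1 means complete LD: at least one of the four possible haplotypes is not observed, so there is no evidence of recombination between the loci and at least one allele at each locus always occurs with the same allele at the other locus. Being able to associate what's at one locus with what's at the other 100% of the time, for every allele, additionally requires r^2 = 1.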
In a case-control association study you do TWO chi-square tests: 1) a check that the locus is in HWE, and 2) a test for an association. Note the difference in how the expected values are calculated.
1) HWE check, expected values: Use the totals for each genotype (CC/CT/TT), regardless of case or control, to calculate the allele frequencies p (for C) and q (for T). Then use p^2 + 2pq + q^2 = 1 to find the expected genotype frequencies. *MAKE SURE TO MULTIPLY BY THE TOTAL NUMBER OF INDIVIDUALS FOR THE FINAL EXPECTED COUNTS (unless frequencies are asked for).
2) Association test, expected values: For each cell, multiply the row total by the column total for that cell and divide by the total number of individuals to get the expected number of individuals. The expected values can be decimals.
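The two expected-value calculations above can be sketched in Python as follows. This is a minimal illustration, not a full analysis: the genotype counts for the HWE check (400 CC, 400 CT, 200 TT) are hypothetical, the 2x2 table reuses the AA or Aa / aa example that appears later in these notes, and the function names are made up for this sketch.

```python
def hwe_expected(n_cc, n_ct, n_tt):
    """Expected genotype counts (CC, CT, TT) under Hardy-Weinberg equilibrium."""
    n = n_cc + n_ct + n_tt
    p = (2 * n_cc + n_ct) / (2 * n)  # frequency of allele C
    q = 1 - p                        # frequency of allele T
    # p^2, 2pq and q^2 are frequencies; multiply by n to get counts
    return [p * p * n, 2 * p * q * n, q * q * n]


def association_expected(table):
    """Expected counts per cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# 1) HWE check with hypothetical genotype totals (cases + controls combined)
observed_genotypes = [400, 400, 200]  # CC, CT, TT
expected_genotypes = hwe_expected(*observed_genotypes)
hwe_chi2 = chi_square(observed_genotypes, expected_genotypes)

# 2) Association test; rows = genotype group, columns = (control, case)
table = [[380, 520],  # AA or Aa
         [20, 80]]    # aa
expected_table = association_expected(table)
assoc_chi2 = chi_square(
    [cell for row in table for cell in row],
    [cell for row in expected_table for cell in row],
)

print(expected_genotypes, round(hwe_chi2, 2))
print(expected_table, round(assoc_chi2, 2))
```

Both statistics here would be compared against a chi-square distribution with 1 degree of freedom (critical value 3.84 at the 0.05 level).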
List two evolutionary forces that can cause LD and for each, describe why these processes cause LD.
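1. Bottlenecks (sampling): If a small group of individuals splits off, they carry only a fraction of the original alleles and haplotypes. By chance some combinations of alleles are over-represented, which disrupts the normal allele frequencies. (Gene flow between two or more genetically distinct populations, i.e. admixture, also creates LD.)
2. Selection: There will be selection for one allele over another, since the phenotype for that genotype confers greater survival. As the favoured allele rises in frequency, the alleles linked to it on the same haplotype rise with it.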
What are the 6 main project goals of the Human Genome Project (HGP)?
1. To identify all of the genes in human DNA (approximately 30,000 was the estimate at the time)
2. To determine the sequences of the 3 billion chemical base pairs that make up human DNA
3. To store this information in databases
4. To improve tools for data analysis
5. To transfer related technologies to the private sector
6. To address the ethical, legal, and social issues (ELSI) that may arise from the project
Name five potential benefits that arose from the outcomes of the project.
1. Use of microbial genomics research for safe, efficient ENVIRONMENTAL REMEDIATION
2. Pharmacogenomics ("custom drugs")
3. Gene therapy and control systems for drugs
4. Establishing paternity and other family relationships
5. Disease-, insect-, and drought-resistant crops (food security)
6. More nutritious produce (e.g. GMO rice developed for populations in Asia with nutrient deficiencies)
7. Studying the migration of different population groups based on maternal genetic inheritance
Despite the overwhelming positive reception of the project, there were also a number of concerns that arose from the project. Name three such concerns.
1. Psychological impact and stigmatization due to an individual's genetic differences.
2. Fairness in the use of genetic information by insurers, employers, courts, schools, adoption agencies, and the military, among others.
3. Privacy and confidentiality of genetic information, for example the use of distant relatives' DNA to identify suspects in criminal cases without their consent.
How many common SNPs are there in the human population (within an order of magnitude)?
4 - 5 million common SNPs (order of 10^6). *A SNP is common if its minor allele is present at a frequency of at least 5% in the population.
Explain the PCA plot in detail, including what the axes represent, and how the SNP data were used to generate such a plot.
Principal component analysis (PCA) is a DIMENSION REDUCTION method that allows for identifying and adjusting for ancestry differences among individuals. When applied to genotype data (SNPs), PCA calculates principal components (PCs) that explain the differences among the sampled individuals in the genetic data. The x and y axes are the top PCs, which explain the most genetic variation among individuals (PC1 explains the most variation, PC2 the second largest amount). Individuals with similar values for a particular top PC will have similar ancestry along that axis. However, this method ignores biology and doesn't really have a model for how the world works. Canonical example: Western European geography reconstructed on a PCA plot. To generate the plot, each genotype is coded as 0, 1, or 2. The genotypes are averaged at each locus, and each locus gets a SCORE! The loci that count the most are those with the most difference between individuals. Essentially you're trying to find the places in the genome that differ the most among individuals.
Give an example of a human trait for which GWAS has led to the identification of disease causing loci.
Age-Related Macular Degeneration (AMD) - a case-control association study found a SNP in the complement factor H gene to be more common in AMD cases than in controls. - GG at position 3
Explain population stratification in GWAS and Q-Q plots.
Besides correcting for multiple testing in GWAS, you need to diagnose POPULATION STRATIFICATION, and you would use a Q-Q plot to do that. You plot the observed against the expected chi-square statistics, and you expect the observed to match the expected. If you have population stratification you see a steady deviation from the expected line, because strong population structure means DIFFERENCES IN ALLELE FREQUENCIES between subpopulations. Refer back to the Manhattan plot for the towers. Stratification arises when the sample contains two or more populations.
Explain the colorectal cancer (CRC) study that Sebastian conducted to infer the subtype of CRC or health state based on characterizing the microbiome. How was this done?
CRC is known to be influenced by gut microbiota. He wanted to determine if different subtypes of CRC have different microbiomes. If so, you could perhaps infer the subtype of CRC of future patients from their microbiome, as a non-invasive clinical test. After classifying the CRC samples into the 4 different subtypes (using RNA-seq), he used 16S metabarcoding with qPCR to determine the abundance of bacterial species in each sample belonging to each CRC subtype. He was able to see which species were enriched or depleted in each subtype. If certain subtypes have different relative abundances of certain taxa, then perhaps that can be used to associate microbes with different CRC subtypes.
D, D' and r are used in quantifying linkage disequilibrium. What do each of the three mean?
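D: the non-random association between alleles A and B, i.e. the observed AB haplotype frequency minus the frequency expected under independence (D = p_AB - p_A p_B)
D': D normalized by its maximum possible value given the allele frequencies (D' = D / D_max)
r: the correlation between the alleles at the two loci (D divided by the square root of p_A (1 - p_A) p_B (1 - p_B)); usually reported as r^2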
Many cancers have mutations in genes that are involved in DNA damage repair (e.g. BRCA1). Why?
DNA repair genes code for proteins whose normal function is to correct errors that arise when cells duplicate their DNA prior to cell division. Mutations in these genes allow cancer cells to avoid the normal DNA damage response, which involves repairing these DNA mutations or lesions. As a result, mutations can ACCUMULATE. If the rate of DNA damage exceeds the capacity of the cell to repair it, the accumulation of errors can overwhelm the cell and result in cancer.
Most cancers are genetic diseases that occur through the accumulation of mutations in somatic cells. However, some cancers tend to manifest during childhood, while other manifest during late adulthood. Why does this occur if cells in most tissues have similar mutation rates? Give an example of two different cancers that occur at different ages, and explain why this occurs.
Different cancers follow different pathways, and the pathway determines the type and number of mutations required. Where a mutation is inherited, onset is usually early, especially if the mutated genes are critical in development. For example, a child who inherits one mutated RB1 allele needs only one further hit to develop retinoblastoma. Late-adulthood onset is usually due to the accumulation of random somatic mutations over time; for example, sporadic colorectal cancer requires several mutations that take decades to accumulate.
What is the issue with finding driver mutations? What types of molecular effects do driver mutations have?
The issue: driver mutations are hard to distinguish from passenger mutations, so it is hard to tell which mutation is the driver and which is a passenger. Molecular effects of driver mutations:
- cell cycle disruption
- DNA repair
- longevity (telomere length)
Many phenotypes may be caused by very rare SNPs (rather than common SNPs). If this is true, what approach should we take to identify associations between rare SNPs and phenotypes? Why must we take this approach?
Genome sequencing. Rare SNPs are not going to appear in a GWAS, since GWAS only looks at common variants, which have relatively small effects. Variants that cause large effects are expected to be rare. Therefore full sequencing is required to find the rare variants before making any associations.
Would strong, recent selection at a specific locus in a population lead to high or low levels of linkage disequilibrium at that locus?
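High levels of linkage disequilibrium. Since you're selecting for one allele, its frequency increases rapidly relative to the other, and the alleles linked to it on the same haplotype increase with it before recombination has had time to break up the association.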
What are histone modifications? What effects do they have?
Histones are the proteins that DNA is wound around. Histone modifications are chemical changes to the histone proteins (methylation, acetylation, phosphorylation etc.) that affect how tightly or loosely the DNA is wound around them (chromatin compaction). The effects of chromatin compaction include increased or decreased gene expression levels, since easy accessibility of the DNA is crucial for the transcription machinery to bind and transcribe genes. For example, with repressive histone methylation marks the DNA is tightly wound and the chromatin is in a condensed state (heterochromatin), while acetylation generally loosens chromatin; the effect of methylation depends on which residue is modified.
In a region of the human genome with high linkage disequilibrium, would you use more or fewer tag SNPs to infer haplotypes, compared to an area with low linkage disequilibrium? Explain what tag SNPs are, and why you would use more or fewer.
In a region of high LD you would require fewer tag SNPs to infer haplotypes. A tag SNP is a representative SNP that is genotyped to stand in for the other SNPs in its haplotype block. Where LD is high, you have greater confidence that the allele seen at one locus tells you which allele is at another locus, so fewer tag SNPs are sufficient to capture most of the haplotype structure.
What is represented by a Manhattan plot? Why use the -log?
It displays on the Y-axis the -log10(p-value) for each SNP, with the SNPs arranged by genomic position along the X-axis. P-values are usually very small numbers, so for better visualisation we take the log; logs are used in a number of different plots to better visualise very small or very big numbers. We take the negative of the log because the log of a number smaller than 1 is negative, so this makes the values positive and puts the most significant SNPs highest on the plot.
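A small sketch of the transformation described above, assuming base-10 logs. The three p-values are hypothetical, and `manhattan_y` is a name made up for this example; 5e-8 is included because it is a commonly used genome-wide significance threshold.

```python
import math


def manhattan_y(p_value):
    """Height of a SNP on a Manhattan plot: -log10 of its p-value."""
    return -math.log10(p_value)


# Hypothetical p-values for three SNPs: smaller p-value -> taller point
for p in (0.5, 1e-3, 5e-8):
    print(p, round(manhattan_y(p), 2))
```

A p-value of 0.001 plots at a height of 3 and a p-value of 5e-8 at about 7.3, so the tiny numbers that matter most become the tallest points.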
What is ChIP-seq?
It identifies the locations in the genome bound by a protein of interest.
1. Crosslink ("glue") DNA-bound proteins (even the ones you don't care about) to the DNA
2. Fragment (cut) the DNA
3. Target the protein of interest, with its bound DNA, using a specific antibody
4. Isolate the antibody-bound complexes
5. Reverse the crosslinks ("unglue") and wash away the proteins
6. Sequence the DNA and map the reads to the genome to see where most reads map
Describe what a tag-SNP is and why they are used in genotyping.
It is a representative SNP (single nucleotide polymorphism) in a region of the genome with high linkage disequilibrium (LD) that represents a haplotype block. Because the other SNPs in the block are in high LD with it, genotyping the tag SNP allows genetic variation to be identified and associated with phenotypes without genotyping every SNP in that chromosomal region.
What is ascertainment bias (in the context of SNP analyses and GWAS)?
It refers to the bias introduced in population genetics studies when SNP data from only one population (typically Europeans) are used to study variation in other, non-European populations. SNP diversity in one population does not reflect that in another population, and good tag SNPs in one population may not be good tag SNPs in another.
What is the Human Genome Project?
It was an international, collaborative research effort to determine the sequence of the human genome and to identify the genes that it contains.
If alleles are not at the expected frequencies that is known as what?
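LINKAGE DISEQUILIBRIUM (combinations of alleles at different loci are not at the frequencies expected from the individual allele frequencies)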
In a single sentence, define linkage disequilibrium (LD).
Linkage disequilibrium refers to the non-random association of alleles at two or more loci in a general population.
Describe the goals of the human HapMap Project and the data that resulted from this research.
Main goal: to describe and catalogue common patterns of human genetic variation (SNPs) that are involved in human health and disease. Cataloguing these variations in different populations can help minimise the ascertainment bias on genotyping chips. Genomes from a broad range of populations were surveyed with similar sample sizes. The resulting data include an admixture plot showing the genetic makeup of individuals from different ethnic populations.
Do we use all the SNPs from a sequencing project?
No. Not all Human HapMap SNPs are needed, since we have LD. We can use tag SNPs within haplotype blocks.
What is the main idea with cancer genetics?
Numbers of mutations differ drastically between cancers due to some having inherent genomic instability as well as time to occurrence. Every tumour is different. Cancer is a genetic disease that results from an accumulation of somatic mutations.
Distinguish between an odds ratio, a likelihood ratio, relative risk, and explained variance. Use the values below as an example.
Genotype - Control - Cases
AA or Aa - 380 - 520
aa - 20 - 80
Odds ratio: a measure of the strength of the effect - by how much does it increase your odds? e.g. (80/20) / (520/380) = 2.9. Increased odds of 2.9 if you have an aa genotype. ODDS RATIOS DO NOT NEED TO BE CALCULATED ON THE EXAM!
Likelihood ratio: the likelihood of the disease (phenotype) if you have a certain genotype RELATIVE to the likelihood if you don't have that genotype in that sample. LIKELIHOODS DO NOT GO OVER ONE, BUT LIKELIHOOD RATIOS DO! e.g. (80/(80+20)) / (520/(520+380)) = 1.38. Increased likelihood of 1.38 if you have an aa genotype.
Relative risk: the probability you will develop the disease (phenotype) if you have a certain genotype RELATIVE to the probability you will develop the disease if you don't have that genotype in the population. Based on population prevalence.
Absolute risk: the probability you will develop the disease (phenotype) if you have a certain genotype.
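The two worked calculations above can be checked with a few lines of Python. This is only a sketch using the counts from the example table; the function names are invented for illustration, and "likelihood ratio" is used in the sense these notes define it (the fraction of cases among aa individuals relative to the fraction among AA or Aa individuals in the sample).

```python
# Counts from the example table: genotype -> (controls, cases)
counts = {"AA or Aa": (380, 520), "aa": (20, 80)}


def odds_ratio(counts, genotype="aa", reference="AA or Aa"):
    """Odds of being a case with the genotype, relative to the reference."""
    ctrl_g, case_g = counts[genotype]
    ctrl_r, case_r = counts[reference]
    return (case_g / ctrl_g) / (case_r / ctrl_r)


def likelihood_ratio(counts, genotype="aa", reference="AA or Aa"):
    """Fraction of cases with the genotype, relative to the reference."""
    ctrl_g, case_g = counts[genotype]
    ctrl_r, case_r = counts[reference]
    return (case_g / (case_g + ctrl_g)) / (case_r / (case_r + ctrl_r))


print(round(odds_ratio(counts), 2))        # about 2.92
print(round(likelihood_ratio(counts), 2))  # about 1.38
```

Note that the odds ratio divides cases by controls, while the likelihood ratio divides cases by the total for that genotype, which is why the two numbers differ.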
Explain the polygenic risk score method.
Polygenic risk scores take into account a whole bunch of SNPs to quantify the increased probability that you're going to have a disease. For example, if you're considering 300 SNPs, you look at the genotype at every SNP (the number of risk alleles: 0, 1 or 2) and then you add up the genotypes, usually weighting each SNP by its effect size from GWAS. You then rank individuals according to where they fall on the polygenic risk score. Essentially you're relying on a whole bunch of places in the genome to figure out what your phenotype will be.
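A minimal sketch of the adding-up step, assuming genotypes are coded as risk-allele counts (0, 1 or 2). The individuals, genotypes and per-SNP weights are all hypothetical, and weighting by effect size is an assumption of this sketch; with no weights the function simply adds up the genotypes.

```python
def polygenic_risk_score(genotypes, weights=None):
    """Sum risk-allele counts (0, 1 or 2) across SNPs, optionally weighted."""
    if weights is None:
        weights = [1.0] * len(genotypes)  # unweighted: just add up the genotypes
    if len(weights) != len(genotypes):
        raise ValueError("need one weight per SNP")
    return sum(g * w for g, w in zip(genotypes, weights))


# Hypothetical data: 3 individuals genotyped at 4 SNPs
individuals = {
    "ind1": [0, 1, 2, 1],
    "ind2": [2, 2, 1, 2],
    "ind3": [0, 0, 1, 0],
}
weights = [0.10, 0.25, 0.05, 0.40]  # hypothetical per-SNP effect sizes

scores = {name: polygenic_risk_score(g, weights) for name, g in individuals.items()}
ranking = sorted(scores, key=scores.get, reverse=True)  # highest score first
print(ranking)
```

Ranking the individuals by score is what places each person on the polygenic risk distribution.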
Briefly describe population structure (stratification).
Population stratification is a form of confounding in genetic association studies, caused by genetic differences between cases and controls that are unrelated to disease and instead come from sampling them from populations of different ancestries. It is a problem for association studies, such as case-control studies, because an association could be found due to the underlying structure of the population and not a disease-associated locus. Systematic ancestry differences (or bias) between cases and controls must therefore be accounted for in GWAS.
What plot would you use to diagnose population stratification (also known as population structure)? What does a plot with little population structure and high population structure look like?
Q-Q plot. Little pop. structure: most points are on the line, and the points that are off the line are dispersed rather than following a trend. High pop. structure: a lot of points deviate from the line in a very correlated manner. The Q-Q plot is used to assess the number and magnitude of observed associations between genotyped SNPs and the disease under study, compared to the association statistics expected under the null hypothesis of no association.
Explain reduced representation bisulfite sequencing (RRBS) and describe how it is used to infer DNA methylation patterns at promoter regions.
Reduced representation bisulphite sequencing (RRBS) is an efficient and high-throughput technique for analysing the genome-wide DNA methylation profile at single-nucleotide resolution. It combines restriction enzymes and bisulphite sequencing to enrich for areas of the genome with high CpG content. Purified gDNA is first DIGESTED with methylation-insensitive RESTRICTION ENZYMES. The REs produce sticky ends that are repaired at the 3' end, and then METHYLATED SEQUENCE ADAPTERS are ligated. The fragments are run on a gel for separation and fragments of the desired length are purified from the gel; fragments of around 40-220 bp are representative of promoter regions. The selected fragments are BISULPHITE CONVERTED, meaning bisulphite turns unmethylated cytosines into uracils while methylated cytosines remain unaffected. The bisulphite-converted DNA is PCR amplified (uracils are read as thymines), purified, sequenced and mapped to the genome. Cytosines that now read as T were unmethylated and cytosines that still read as C were methylated, which shows which promoter regions are unmethylated and therefore likely to be expressed.
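The logic of bisulphite conversion and methylation calling can be sketched as a toy simulation. This assumes a single strand, complete conversion, no sequencing errors and no real C-to-T variants; the sequence, the methylated position and the function names are hypothetical, and real pipelines use dedicated bisulphite-aware aligners instead.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate the read after bisulphite conversion and PCR.

    Unmethylated C becomes U, which is read as T after PCR;
    methylated C is protected and stays C.
    """
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, read):
    """Compare a converted read to the reference at every reference C.

    Returns {position: True if methylated, False if unmethylated}.
    """
    calls = {}
    for i, (ref, obs) in enumerate(zip(reference, read)):
        if ref == "C" and obs in ("C", "T"):
            calls[i] = (obs == "C")
    return calls


reference = "ACGTCGCA"                       # hypothetical promoter fragment
read = bisulfite_convert(reference, {1})     # only the C at position 1 is methylated
print(read)
print(call_methylation(reference, read))
```

Only the cytosine that stays a C after conversion is called methylated; the ones that turn into T are called unmethylated.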
List two types of mutations that have been established as driver mutations in cancers besides missense or nonsense mutations (i.e. amino acid changes).
Synonymous changes that alter splicing, and frameshift mutations.
Explain why the 16S rRNA gene is often used for amplicon sequencing in metabarcoding of bacteria.
The 16S rRNA gene is ubiquitous in prokaryotes since it codes for the RNA component of the small ribosomal subunit. Since the ribosome is required for translation, the gene is highly conserved and does not evolve as quickly as other regions in the prokaryotic genome. The 16S rRNA gene also provides enough phylogenetic information to identify an isolate at least down to the genus level.
BeadArray Chips.
The Illumina BeadArrays are optimized to give the most information with the fewest SNPs (700K). BUT there is ascertainment bias: we are only looking at SNPs we are already aware of, and those are generally biased towards European SNPs. The BeadArray has beads, each covered with oligo primers. Only one base is extended, with a fluorescently labelled nucleotide.
Explain the association results for the Crohn's disease risk loci.
The OR for each SNP is plotted, showing each SNP's effect on the odds of Crohn's disease. The study that first finds an association tends to find an effect that is exaggerated in that unique sample ('winner's curse'), and all sorts of things differ between a replication study and the original study. Therefore, when you do replication studies the effects tend to be smaller and the p-values larger. IMPORTANTLY, you should do replication studies once you find an association.
How is the oxford nanopore used to detect DNA methylation directly?
Oxford Nanopore sequencing makes it possible to detect the methylation states of bases in DNA directly from the reads. Computational methods such as DeepSignal, a deep learning method, PREDICT the DNA methylation state from Nanopore sequencing reads with a trained model.
Why do SNP arrays have severe ascertainment bias?
The arrays are enriched with SNPs from mainly European populations. You can't infer variation in one population to be the same in another population.
Ancestry can also be inferred using mitochondrial DNA (mtDNA) or Y chromosome DNA. What is the fundamental difference between using mtDNA or Y chromosome DNA and using autosomal DNA? Explain, in detail, how using autosomal DNA can lead to very different conclusions compared to mtDNA or Y chromosome DNA.
The fundamental difference is that mtDNA and the Y chromosome do not recombine like autosomal DNA does. Our mtDNA is passed down from our mother, and a boy can only get his Y chromosome from his father, but autosomal DNA recombines every generation and is inherited from both parents. The mtDNA or Y chromosome DNA can really only tell you haplotype information unique to a single maternal or paternal lineage, and a shared common ancestor for individuals in admixed populations. Autosomal DNA tends to provide more comprehensive information on individual ancestry because it represents a greater proportion of genome history, with genetic markers inherited from both parents. But this too is complicated, since not all autosomal DNA is passed on, so you still only get a snapshot of your ancestral profile.
Many ancestry analyses assume a model of population admixture. What is meant by the term admixture?
The interbreeding between populations that are genetically differentiated results in the introduction of new genetic lineages into a population. Essentially it involves a population with mixed ancestry.
Describe, briefly, what is meant by the power of a study (in the specific context of GWAS).
The likelihood of correctly identifying a difference between cases and controls (or 2 groups) when a difference TRULY exists.
Explain what is meant by the 'power' of a study.
The likelihood of correctly identifying a true effect. In the case of GWAS, probability of identifying a difference between 2 groups in a study when a difference truly exists
What is a p-value?
The probability, if the null hypothesis is true, of seeing a test statistic equal to or more extreme than the one observed.
Explain the Illumina BeadArray.
These are microarrays with beads sitting in etched micro-wells. Each bead targets a specific locus in the genome and is covered with hundreds of thousands of copies of a specific oligonucleotide. As DNA fragments pass over the bead chip, the oligonucleotide probes bind to complementary sequences in the sample. Allele specificity is then conferred by single base extension that incorporates labelled nucleotides. The nucleotide labels emit a fluorescent signal that shows which allele is present. It utilises TAG SNPS to genotype.
What evidence, if any, is there in the above plot that some Australian individuals (small dots and stars labelled with three-letter codes) have or have not interbred with individuals from other populations? Explain what is plotted, and the evidence for or against this hypothesis.
There is evidence that the Australian individuals have interbred with individuals from other populations, based on how dispersed they are on the plot. Their genetic dissimilarities cause them to be spread toward the New Guinea population, and all the way to the European population (distance-wise in the plot), which is a sign of admixture. The fact that individuals in the Australian population do not cluster tightly like individuals in the European population do indicates that there is some admixture at play within that population.
Why can't we always predict someone's phenotype given genotype?
There's missing heritability! This can be due to INCOMPLETE PENETRANCE. Incomplete penetrance may be due to the environment you are in (e.g. healthy nutrition) or the environment your genes are in (epigenetics). Age, gender etc. also change penetrance. *Know a few specific examples of the combinations of mutations required to exhibit a phenotype, e.g. Alzheimer's.
You genotype SNPs at two loci (locus 1 and locus 2) in a population of 1000 individuals. The polymorphism at locus 1 is A / G. The polymorphism at locus 2 is T / C. Of the 2000 chromosomes in total that you genotype you find the following genotypes:
# of chromosomes - SNP 1 - SNP 2
a. 1200 - A - T
b. 50 - A - C
c. 650 - G - C
d. 100 - G - T
Where the SNP 1 column indicates the SNP genotype at locus 1 and the SNP 2 column indicates the SNP genotype at locus 2. Are these two loci in linkage disequilibrium? What is the evidence that they are or are not? You do not have to do any calculations.
These two loci are in LD. If you know there's an A at locus 1, you also know there's a T at locus 2 most of the time (1200 of the 1250 A chromosomes), and a G at locus 1 usually goes with a C at locus 2 (650 of the 750 G chromosomes). When you can infer what's at one locus based on what's at another locus, that is known as linkage disequilibrium.
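Although the question does not require calculations, the haplotype counts above can be turned into the usual LD statistics with a short sketch. It assumes two biallelic loci that are both polymorphic, uses the standard definitions of D, D' and r^2, and the function and argument names are made up for this example.

```python
def ld_stats(n_at, n_ac, n_gt, n_gc):
    """Return (D, D', r^2) for two biallelic loci from haplotype counts."""
    n = n_at + n_ac + n_gt + n_gc
    p_a = (n_at + n_ac) / n   # frequency of A at locus 1
    p_t = (n_at + n_gt) / n   # frequency of T at locus 2
    d = n_at / n - p_a * p_t  # observed minus expected A-T haplotype frequency
    # D_max is the largest |D| possible given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_t), (1 - p_a) * p_t)
    else:
        d_max = min(p_a * p_t, (1 - p_a) * (1 - p_t))
    d_prime = d / d_max if d_max > 0 else 0.0
    r2 = d ** 2 / (p_a * (1 - p_a) * p_t * (1 - p_t))
    return d, d_prime, r2


d, d_prime, r2 = ld_stats(n_at=1200, n_ac=50, n_gt=100, n_gc=650)
print(round(d, 3), round(d_prime, 2), round(r2, 2))
```

For these counts D is about 0.19, D' about 0.89 and r^2 about 0.70, all well above the value of 0 expected if the two loci were independent.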
How are tagSNPs chosen?
They are chosen based on calculating LD between SNPs. The tagSNP is the most representative of a haplotype block.
Do older populations have more or less linkage disequilibrium?
They have less LD, since there has been more time for recombination to break down associations between loci. It would be hard to associate what is at one locus with what is at another locus when there may be lots of haplotype diversity at this point.
Who was involved in the Human Genome Project?
Two main groups involved: 1. International consortium - many countries involved 2. Celera - Craig Venter's private company
What type of functional assays can we do to distinguish driver mutations? What type of functional assays can we do to determine the EFFECTS these mutations have?
Ways to identify driver mutations:
GENE EXPRESSION ARRAYS
- Look at the expression of the gene of interest
- Look at the expression of genes that the gene of interest controls (if it is a regulator, or TIF)
- e.g. WEE1 overexpression in almost all retinoblastomas
EXOME SEQUENCING
- Exome sequencing on a bunch of cancers to see which genes tend to be mutated in different cancers. If a gene is mutated in a good proportion of them, you can hypothesise that it is a driver, and then go on to determine its effects.
- e.g. SPOP in prostate cancer
WHOLE GENOME SEQUENCING
PROTEIN ASSAYS
Ways to determine effects:
CELLS
- Cells can be manipulated to carry a mutation, and you can compare the wild-type to the mutant to see how they differ in proliferation.
- Change the genotypes of cells to see how much more likely they are to metastasise
"Mitochondrial Eve" and "Y chromosome Adam" are the most recent common ancestors for all human mitochondria and Y chromosomes, respectively. Did these two individuals live at the same time or in the same place? Explain your answer.
We all inherit mtDNA that derives from a single woman living about 200,000 years ago, and men inherit a Y chromosome that derives from a man living about 250,000 years ago. They did not have to live at the same time or in the same place. Since Y-Adam and M-Eve refer to the common ancestors of DNA sequences and not so much to a pair of humans, it is possible that M-Eve and Y-Adam lived in different millennia. If the population that went on to form modern humans was made up of a more diverse group of men than women, then Y-Adam would be significantly further back in the evolutionary tree than M-Eve.
How do we determine which genes or mutations are highly expressed in a cancer patient? How do driver mutations fit in?
We look at how gene expression differs between different cancers. You can look at expression levels of different genes on a heat-map. Look at which genes are expressed at high levels for designating a prognosis. Driver mutations cause these over expressions! They are responsible for the proliferation and growth of cancer cells. Cancer cells accumulate both passenger and driver mutations.
What is population stratification? Why is it bad for GWAS?
Population stratification is when a sample contains populations that differ in allele frequencies. Pooling them pushes genotype frequencies outside of HW, and WE WANT HW for GWAS. *You can't use a locus outside of HW to make genotype-phenotype associations in GWAS.
Describe the problem of multiple testing in GWAS.
When we do GWAS, we're testing 700,000 SNPs. This is a problem of multiple testing: when there is no association, p-values are uniformly distributed between 0 and 1, so some of the SNPs will be falsely significant by chance. For example, if you do 10,000 tests with no true associations, you expect about 500 of them to be significant at a 0.05 significance level. You need to correct the p-value threshold, either by Bonferroni correction or by controlling the FDR.
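The arithmetic above and the two corrections can be sketched as follows. The Bonferroni threshold is simply alpha divided by the number of tests; for FDR, the Benjamini-Hochberg procedure is used here as one common choice, which is an assumption of this sketch since the notes do not name a specific FDR method. The five p-values are hypothetical.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of significant tests when no true association exists."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the test passes BH at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * fdr:
            cutoff_rank = rank  # largest rank that meets its threshold
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[i] = True
    return significant


print(expected_false_positives(10_000))  # about 500 false positives by chance
print(bonferroni_threshold(700_000))     # about 7.1e-08 for a 700K SNP array
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]))
```

Bonferroni is the stricter of the two: it controls the chance of even one false positive, while FDR control limits the expected proportion of false positives among the SNPs called significant.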
Where on a Manhattan plot might you find a causal variant?
You look at "towers": within a tower there will be a cluster of SNPs, one of which may be the causal SNP. Regions with many highly associated SNPs in linkage disequilibrium appear as "skyscrapers" along the plot. The association may be indirect, where the significant SNP on the chip is only linked to the causal SNP, or direct, where the significant SNP on the chip is actually the causal SNP.
What do you expect the distribution of p-values for these 1,000,000 association tests will look like if there is no genetic basis for this trait?
You would expect a UNIFORM distribution of p-values if there was no genetic basis for this trait. In other words, the same frequency of p-values in every equal-width interval between 0 and 1.
Would you expect more SNP diversity in a sample from an African population or a non-African population? Justify your answer.
You would expect more SNP diversity in an African population. African populations are older, so they have had much longer to accumulate mutations and for recombination to shuffle them, whereas non-African populations descend from a smaller group that carried only a subset of that variation out of Africa.
One way to classify human cancers is based on exome sequencing. This requires sequencing 4-5 Gbp of DNA, as opposed to the 90 Gbp of DNA required to sequence a whole human genome. i) What is exome sequencing? Explain the method in detail. ii) If a human genome is only 3 Gbp long, why does "sequencing a human genome" require producing "90 Gbp" of DNA sequence?
i) Exome sequencing involves sequencing only the exons that code for protein (~2% of the genome). It potentially enables us to quickly find the functional genetic variants in genetic diseases. You first capture the exons in a DNA sample, using a microarray with oligonucleotide probes that hybridize to just the exons. After you capture all the exons on the array, you can sequence them with any NGS technology.
ii) It is to ensure around 30X sequencing coverage (30 x 3 Gbp = 90 Gbp), i.e. each base is read about 30 times on average, which is needed to trust a variant call.
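The coverage arithmetic in part ii) can be written out as a short sketch. The 3 Gbp genome size and 30X coverage come from the question; the 150 bp read length is an assumed value for illustration only, and the function names are hypothetical.

```python
import math


def bases_required(genome_size_bp, coverage):
    """Total bases to sequence so each position is covered `coverage` times on average."""
    return genome_size_bp * coverage


def reads_required(total_bases, read_length_bp):
    """Number of reads of a given length needed to produce `total_bases`."""
    return math.ceil(total_bases / read_length_bp)


total = bases_required(3_000_000_000, 30)  # 3 Gbp genome at 30X
print(total)                               # 90 Gbp of sequence
print(reads_required(total, 150))          # assuming 150 bp reads
```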
GWAS aim to identify common human genetic variants (usually SNPs) that have relatively small effects on human traits. For human GWAS published in the last five years: i) approximately how many SNPs are typed per individual? ii) Approximately how many individuals are required to identify "small effect" variants? iii) What effect sizes (in terms of odds ratio) do such "small effect" variants have?
i) approximately 700K ii) iii) around 1.2 - 2 odds ratio
Jane the research scientist collected mud from five different cow paddocks in different regions of New Zealand. After she returned to her laboratory, she extracted DNA from the mud samples and performed 16S metabarcoding to understand which species of microbes were present in each of the different cow paddocks. (A) Discuss whether the method she used (16S metabarcoding) is an appropriate method to answer the scientific question. (B) Devise an alternative method to answer the same scientific question, and describe the method step-by-step. (C) Discuss hurdles that Jane the research scientist might encounter in your alternative method.
(A) 16S metabarcoding is commonly used for taxonomic classification of prokaryotes because the 16S gene (ribosome) is ubiquitous in prokaryotes. Amplification of the 16S gene produces PCR products that can be sequenced to determine which organisms are present in a sample or ecosystem. Therefore, since Jane was aiming to understand which species of microbes were present in each paddock, 16S metabarcoding was an appropriate method.
(B) Jane could also perform a metagenomic analysis to identify the species (and characterize their functions). Rather than just sequencing 16S amplicons (metabarcoding), she can sequence all the DNA present in a mixed community sample. The workflow would be: extract the DNA from the microbes in each paddock sample, then fragment the DNA by mechanical or enzymatic methods. A library is then prepared from the DNA fragments and sequenced with any sequencing technology of choice (e.g. Oxford Nanopore, Illumina, PacBio). You then conduct bioinformatic quality control on the generated sequence reads by eliminating duplicates and low-quality bases. To infer the composition of the microbial community and identify the different taxa, you either compare your reads to reference sequences or assemble your reads de novo (OTU picking).
(C) Metagenome analysis involves a greater number of sequence reads to sift through than metabarcoding, because you are trying to reconstruct the genomes of many species in a sample. Therefore, the reads generated from sequencing the whole sample (rather than just 16S amplicons) are usually very hard to annotate, and the analysis can be highly sensitive to errors along the workflow, such as in DNA prep.
What is the main idea in cancer evolution?
- "multiple-hit" model of cancer, where multiple mutations are usually required to turn a normal cell into a fully cancerous cell. - a model of tumours in which they are driven by a succession of mutations which gives them a selective advantage. Each mutation gives the cell increased growth rate or ability to survive in a particular environment, and then there are more cells of that clone that can accumulate additional mutations - There is a tumour "micro-environment" - different cell types can be adapted to different environments within the tumour
There's a shift from doing GWAS due to a number of reasons, and more towards whole genome sequencing. What are some reasons GWAS is deemed useless by others?
- the assumption that genetics plays an important role in the risk of common diseases is flawed - GWAS hits don't explain enough of the genetic variation in the population - it doesn't deliver meaningful, biologically relevant knowledge or results of clinical utility - results can be SPURIOUS
List three novel hypotheses about human population history that have relied on information obtained through sequencing the genomes of archaic humans.
1) DENISOVANS BRED WITH NEANDERTHALS The genome of a girl who was a first-generation hybrid, with a Denisovan father and a Neanderthal mother, showed that she carried DNA from both groups of hominins. This discovery also suggested that Denisovan and Neanderthal interbreeding was more common than was thought, despite them inhabiting different geographical regions. 2) OCEANIANS AND DENISOVANS After surveying archaic genomic sequences in a worldwide sample of modern humans (mainly Melanesians), reconstruction of genetic history suggests that Neanderthals bred with modern humans multiple times, but Denisovans only once, in ancestors of modern-day Melanesians. 3) EARLY MODERN HUMANS IN EUROPE AND NEANDERTHALS Mixture between modern humans and Neanderthals was not limited to the first ancestors of present-day people to leave Africa. DNA extracted from an early modern human skull found in Romania, dated to around 40,000 years old, suggested that introgression between modern humans and Neanderthals also occurred in Europe, about 4-6 generations before this individual was alive. His genome was interspersed with Neanderthal-like DNA segments.
Frequently, GWAS cannot explain all of the expected heritability in phenotypic traits. This may be for true biological reasons, or to errors in the sampling or analyses. Name and describe three specific biological mechanisms that affect penetrance, and for each, give a real-world example of a disease or phenotype
1) Epistatic interactions: Digenic/oligogenic mutations, e.g. Bardet-Biedl syndrome (oligogenic, 3). Clinical penetrance can vary when mutations in multiple genes are required for the disease to manifest. In some families Bardet-Biedl syndrome shows triallelic inheritance: three mutant alleles across two loci are needed to acquire the syndrome. 2) Environmental factors such as age or gender, e.g. Tourette's syndrome (gender). Environmental factors can improve or exacerbate the impact of heritable genetic variants, acting as modifiers of disease penetrance. Allelic variation influences the Tourette's phenotype in a sex-specific fashion: certain variants associated with Tourette's are more common in males than females, which helps explain why TS is much more common in males. 3) Epigenetic influences, e.g. childhood leukaemia. Epigenetic influences such as heritable DNA methylation and histone modifications can affect penetrance without changing the DNA sequence. Childhood leukaemia was seen to be discordant in twins due to differing BRCA1 methylation patterns, explaining the differing disease pattern.
Discuss three mechanisms by which alleles change in frequency in populations.
1) gene flow - migration of individuals adds or removes alleles - transfer of genetic variation from one population to another 2) strong selection - causes beneficial traits to increase in frequency in a population - e.g. lactase persistence 3) bottlenecks or founder effects - bottleneck: the size of a population is dramatically reduced in some random event, like a natural disaster; this changes the frequency of alleles - founder effect: loss of genetic variation that occurs when a new population is established by a very small number of individuals from a larger population
Main ideas from cancer unit.
1. Different incidences of cancer depending on why or how those cancers occur. Early onset of cancers - The mutations responsible for these are inherited or occur early in development. PRE-EXISTING MUTATIONS. 2-hit hypothesis, e.g. Retinoblastoma. Later onset of cancers - Increased time of exposure to environmental factors leading to cancer, e.g. Carcinoma. During a stage of growth, incidence may be high, but once growth ceases so does the cancer. Cancer has something to do with growth, e.g. Osteosarcoma. 2. Different cancers have different pathways, e.g. CRC typically takes ONE pathway. The pathway the cancer takes determines both the TYPE of mutations and HOW MANY mutations.
What are the six assumptions of Hardy Weinberg Equilibrium?
1. Diploid 2. Non-overlapping generations 3. Sexual reproduction 4. Random mating 5. Infinite population size 6. No selection
Describe, in detail, three ways that you could increase the power of this specific study.
1. Sample size - larger sample = more power 2. Effect size (difference between two groups) - larger effects = more power - measured by odds ratio, likelihood ratio, relative risk 3. Statistical tests - a less stringent significance threshold (higher alpha) = more power, at the cost of more false positives
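A rough numeric sketch of these three levers, assuming a two-sided two-proportion z-test with equal group sizes and a normal approximation. The function name, allele frequencies and sample sizes are illustrative assumptions, not values from the study in question.

```python
from statistics import NormalDist

def approx_power(p_case, p_control, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test (normal approximation)."""
    norm = NormalDist()
    # Standard error of the difference between the two observed proportions
    se = (p_case * (1 - p_case) / n_per_group
          + p_control * (1 - p_control) / n_per_group) ** 0.5
    z_crit = norm.inv_cdf(1 - alpha / 2)
    z_effect = abs(p_case - p_control) / se
    # Probability of exceeding the critical value in the direction of the true effect
    return norm.cdf(z_effect - z_crit)

baseline = approx_power(0.30, 0.25, 500)               # hypothetical starting design
bigger_sample = approx_power(0.30, 0.25, 2000)         # lever 1: more individuals
bigger_effect = approx_power(0.35, 0.25, 500)          # lever 2: larger difference
looser_alpha = approx_power(0.30, 0.25, 500, alpha=0.10)  # lever 3: higher alpha

for name, value in [("baseline", baseline), ("bigger sample", bigger_sample),
                    ("bigger effect", bigger_effect), ("looser alpha", looser_alpha)]:
    print(f"{name}: power = {value:.2f}")
```

Each of the three modified designs should report higher power than the baseline; only the third does so by accepting more false positives.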
In three basic steps, how is a GWAS performed? What's the point of GWAS?
1. Select two populations (thousands of individuals in each) 2. Genotype across the genome 3. Look for SNPs that are more common in one population than the other You're looking for SNPs enriched in your cases vs. your controls.
Give five examples of what we have learned in terms of our genome sequence from the outcomes of the project.
1. The human genome contains 3 billion nucleotide bases. 2. The average gene consists of 3000 bases, but sizes vary greatly. 3. Less than 2% of the genome encodes for the production of proteins. 4. The total number of genes is estimated at 21,000, much lower than previous estimates. 5. The functions are unknown for over 20% of discovered genes. 6. The gene-dense centers are predominantly composed of G and C, and the gene-poor deserts are rich in A and T.
Admixture models.
Admixture models: - Rely on a model of biology. - You specify the number of ancestral populations in your model (NOT SPECIFYING AN AFRICAN OR EUROPEAN GENOTYPE ETC. see note below) that led to individuals in current time. These individuals have a proportion of their genome belonging to different ancestral populations. - Explicit insight into how admixture has occurred by modelling it. Parts of the genome are assigned to different ancestral populations. You don't need a whole bunch of ancestral populations (k parameters). No matter what, the more parameters you add the better the data will fit. NOTE: We emphasise or assign genotypes to different ancestries. We assign genotype X as a European genotype etc. etc.
In June 2013, the American Supreme Court ruled that human genes cannot be patented in the U.S. because DNA is a "product of nature". Why was this ruling such a big deal at the time?
Celera, a private organization launched by Craig Venter, had already filed patent applications for genes following their discoveries. By the time the case was ruled against patenting, there were around 4,300 human genes already patented. The Court decided that because nothing new is created when discovering a gene, there is no intellectual property to protect, so patents cannot be granted.
How is chromatin immunoprecipitation (ChIP) used to identify locations in the genome with a particular histone modification?
ChIP is used to detect which regions of DNA a certain protein is bound to. - You cross-link the DNA to the proteins already bound to it. - Fragment the DNA into pieces. - Immunoprecipitate using an antibody to your protein of interest, here an antibody specific to the histone modification, to "pull down" the modified histones with their DNA. Wash away anything not bound to the antibody. - Reverse the cross-links between protein and DNA, and isolate the DNA - PCR to see if your DNA region of interest is there - For a genome-wide view, you can sequence the DNA (ChIP-seq) or hybridize it to a microarray (ChIP-chip).
Given our current knowledge, describe what the epigenome is. Contrast the epigenome to the genome of an organism.
Chemical modifications to DNA and histone proteins form a complex regulatory network that modifies chromatin structure and genome function without changing the DNA sequence. These changes in the epigenome occur across the genome and are potentially heritable.
What is the point in finding driver mutations? Explain in detail a couple examples.
Driver mutations can be a good target for drugs. BREAST CANCER - Herceptin - HER2 is a receptor tyrosine kinase - HER2 is over-expressed in 15-30% of breast cancers - It dimerizes to stimulate growth by upregulating proliferation and blocking apoptosis - Herceptin, a humanized antibody, inhibits HER2. The HER2 protein was injected into mice, which developed an antibody against it. The HER2-targeting regions of the mouse antibody were added to a human antibody to prevent a human immune response against it: a "humanized antibody". CHRONIC MYELOID LEUKEMIA - Gleevec - bone marrow cancer where abnormal white blood cells are abundant - caused by the Philadelphia translocation between chromosomes 9 and 22 - this creates the constitutively active fusion gene BCR-ABL (ABL is a tyrosine kinase) - Gleevec is a tyrosine kinase inhibitor that blocks it
Describe in detail the following general models of human genomic and phenotypic variation that explain for the "missing heritability," and for each, name one trait that you expect (or know) follows that model. Choose two from this list to describe: Infinitesimal model Rare allele model Interaction model (Broad sense heritability G x E model) CDCV model
INFINITESIMAL MODEL - A very large number of common alleles of very small effect are present, but each explains a very small percentage of the population variance. - Every gene contributes to every trait, but with effect sizes so small that it would take samples greater than the population size of the species to detect them at the significance thresholds. e.g. Height, BMI RARE ALLELE MODEL: - There is a very small number of RARE large-effect variants, not detected by standard GWAS. - They're so rare we won't be able to detect any association. - PRO: Strongest evidence comes from evolutionary theory, where disease-promoting variants should not be common since they are deleterious to fitness. e.g. Schizophrenia INTERACTION MODEL (Broad sense heritability G x E model) - There is some combination of genotypic, environmental and epigenetic interactions that contributes to the expression of the trait. Twin discordance for certain traits like obesity. - Con: Genetic and environmental effects are conflated because it is hard to find a homogeneous study population. - Epigenetic and epistatic effects. e.g. Obesity CDCV MODEL: - The model that complex disease is largely attributable to a moderate number of common variants, each of which explains several per cent of the risk in a population. - As assumed in GWAS, a small number of common alleles explain a great deal of the variation. AGAINST: this model conflicts with the evolutionary expectation that deleterious or disease-causing alleles shouldn't be common in the population, because natural selection acts against them. e.g. Alzheimer's
Given the following table of genotypes for cases and controls, calculate whether there is an association between an individual's genotype at this locus and their propensity for the disease (i.e. their likelihood of being in the case category). Assume that the population is in Hardy-Weinberg equilibrium. If you need it, there is a chi-square table below. Genotype - Controls - Case TT - 250 - 130 TA - 330 - 140 AA - 100 - 50
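Relative risk: (50/150) / (270/850) = 1.049 ≈ 1.05 - The proportion of AA individuals who are cases (50 of 150) divided by the proportion of TT or TA individuals who are cases (270 of 850). Having the AA genotype is associated with about a 5% higher chance of being in the case category. To test whether this is significant, run a chi-square test of independence on the full 3 x 2 table: expected count = (row total x column total) / 1000, giving chi-square ≈ 2.03 with 2 degrees of freedom. This is below the 5.99 critical value at alpha = 0.05, so there is no significant association.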
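The association test itself can be sketched in a few lines. This is a minimal chi-square test of independence on the genotype table from the question, written in plain Python; the function name is an assumption, and 5.991 is the standard critical value for 2 degrees of freedom at alpha = 0.05.

```python
# Observed counts from the question: genotype -> (controls, cases)
observed = {"TT": (250, 130), "TA": (330, 140), "AA": (100, 50)}

def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for a genotype x status table."""
    rows = list(table.values())
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(col_totals)
    chi2 = 0.0
    for row in rows:
        row_total = sum(row)
        for obs, col_total in zip(row, col_totals):
            # Expected count under no association: row total x column total / N
            expected = row_total * col_total / grand_total
            chi2 += (obs - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return chi2, df

chi2, df = chi_square_independence(observed)
CRITICAL_5PCT_DF2 = 5.991  # chi-square critical value, df = 2, alpha = 0.05
significant = chi2 > CRITICAL_5PCT_DF2
print(f"chi2 = {chi2:.2f}, df = {df}, significant at 0.05: {significant}")
```

Note that this is the association test only; the separate Hardy-Weinberg check uses expected counts derived from allele frequencies instead of row and column totals.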
To ensure that your results are not artifacts you first make sure that the loci you find with "statistically significant" association are not just false positives due to multiple testing of a large number of different SNPs. Describe one method that can be used to correct for multiple testing when performing a GWAS, and list one advantage and one disadvantage of that method.
One method of correcting for multiple testing is the Bonferroni correction. It involves dividing your significance threshold (e.g. alpha = 0.05) by the number of tests being done (alpha/n), and only calling a SNP significant if its p-value falls below that corrected threshold. An advantage is that it strictly limits the chance of even one false positive across all tests, leaving you with almost only true effects. A disadvantage is that it is very stringent, so it also leaves out true positives that do not make the cut-off. The power of this test is lower than that of, say, the FDR (false discovery rate).
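As a small worked example of the alpha/n rule, the sketch below assumes a hypothetical GWAS with one million SNPs and a handful of made-up p-values; none of these numbers come from a real study.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

n_snps = 1_000_000                      # hypothetical number of SNPs tested
threshold = bonferroni_threshold(0.05, n_snps)

p_values = [3e-9, 4e-8, 2e-6, 0.01]     # made-up p-values for four SNPs
significant = [p for p in p_values if p < threshold]

print(f"corrected threshold: {threshold:.1e}")
print(f"SNPs passing: {len(significant)} of {len(p_values)}")
```

The SNP with p = 2e-6 would look convincing in a single test but fails the corrected cut-off, which illustrates the stringency described above.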
Most alleles in the human population are in Hardy-Weinberg equilibrium within populations. Describe and explain a situation in which you might expect two alleles to not be in Hardy Weinberg equilibrium (this can be a real or artificial situation).
Strong selection: if the phenotype of one genotype confers better survival, the frequency of one allele (p) increases relative to the other (q), and genotype frequencies depart from Hardy-Weinberg proportions. Gene flow: if individuals leave or join the population, the alleles present in the gene pool change, which also shifts the frequencies away from equilibrium.
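Whatever the cause, a departure shows up as a mismatch between the observed genotype counts and the p², 2pq, q² expectations. The sketch below checks this with a chi-square statistic (1 degree of freedom, critical value 3.841 at alpha = 0.05); the genotype counts are invented for illustration.

```python
def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square statistic comparing genotype counts to Hardy-Weinberg expectations."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # frequency of allele A, estimated from the sample
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, expected

# Hypothetical population in equilibrium (p = 0.6)
in_hwe, _ = hwe_chi_square(360, 480, 160)
# Hypothetical population with the same allele frequency but too few heterozygotes
out_of_hwe, _ = hwe_chi_square(460, 280, 260)

CRITICAL_5PCT_DF1 = 3.841  # chi-square critical value, df = 1, alpha = 0.05
print(f"in HWE: chi2 = {in_hwe:.2f}")
print(f"out of HWE: chi2 = {out_of_hwe:.2f}, departs: {out_of_hwe > CRITICAL_5PCT_DF1}")
```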
The following table shows the observed number of SNP genotypes in cases (individuals who drink more than four cups of coffee a day) and controls (individuals who drink one or fewer cups of coffee a day). Calculate the expected number of cases and controls of each genotype. Genotype - Controls - Cases TT - 300 - 180 AT - 180 - 60 AA - 150 - 30
TT - CONTROL: 336 TT - CASES: 144 AT - CONTROL: 168 AT - CASES: 72 AA - CONTROL: 126 AA - CASES: 54 Cross product of TOTALS/ Total # of individuals.
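The same cross-product rule can be written out as a short sketch, using the coffee table from the question; the function name is an assumption.

```python
# Observed counts from the question: genotype -> (controls, cases)
observed = {"TT": (300, 180), "AT": (180, 60), "AA": (150, 30)}

def expected_counts(table):
    """Expected count per cell = row total x column total / grand total."""
    col_totals = [sum(col) for col in zip(*table.values())]
    grand_total = sum(col_totals)
    return {
        genotype: tuple(sum(row) * col_total / grand_total for col_total in col_totals)
        for genotype, row in table.items()
    }

expected = expected_counts(observed)
for genotype, (controls, cases) in expected.items():
    print(f"{genotype}: controls = {controls:.0f}, cases = {cases:.0f}")
```

Here the column totals are 630 controls and 270 cases out of 900 individuals, so for example the expected TT controls are 480 x 630 / 900 = 336.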
What's the story of tagSNPs?
TagSNPs are SNPs diagnostic of haplotype blocks. A haplotype block is a group of SNPs that are in high linkage disequilibrium. We want to minimize the number of SNPs we need to genotype by finding the largest haplotype blocks (so we aren't genotyping more than we need to). 2nd problem with ascertainment bias: Different populations have different levels of linkage disequilibrium at different places in the genome, AKA different haplotype blocks. Therefore, the tagSNPs are different.
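How well one SNP tags another is usually judged with LD statistics such as D' and r². A minimal sketch, assuming two biallelic, polymorphic loci and known haplotype and allele frequencies (the frequencies below are made up):

```python
def ld_stats(p_AB, p_A, p_B):
    """Return (D, D', r^2) for two biallelic loci.

    p_AB is the frequency of the haplotype carrying allele A and allele B;
    p_A and p_B are the allele frequencies (both must be strictly between 0 and 1).
    """
    D = p_AB - p_A * p_B
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(D) / d_max if d_max > 0 else 0.0
    r2 = D * D / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return D, d_prime, r2

# Perfect tag: allele A is always found with allele B
_, d_prime_tag, r2_tag = ld_stats(p_AB=0.30, p_A=0.30, p_B=0.30)
# No LD: haplotype frequency equals the product of the allele frequencies
_, d_prime_none, r2_none = ld_stats(p_AB=0.09, p_A=0.30, p_B=0.30)

print(f"perfect tag: D' = {d_prime_tag:.2f}, r2 = {r2_tag:.2f}")
print(f"no LD:       D' = {d_prime_none:.2f}, r2 = {r2_none:.2f}")
```

A SNP with r² close to 1 against its neighbours is a good tagSNP for that block; because haplotype frequencies differ between populations, the same pair of SNPs can give different values elsewhere.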
Describe the main differences between taxonomy dependent and taxonomy independent analysis of metabarcoding data.
Taxonomy dependent: - Align all sequences against a reference database to assign taxonomy. You can only discover what is known. Taxonomy independent: - Cluster sequences into OTUs based on a specified sequence similarity threshold (e.g. 97% similarity). Hybrid approach: align all the sequences you can to a database, and for those that don't align, cluster based on sequence similarity. Essentially, taxonomy dependent requires a reference database while taxonomy independent does not (it clusters based on sequence similarity).
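The taxonomy-independent idea can be sketched as greedy clustering at a 97% identity threshold. This is a deliberately simplified toy: it assumes equal-length, already aligned reads and uses plain position-by-position identity, whereas real OTU-picking tools perform proper alignment. The function names and reads are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-aligned sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_otus(sequences, threshold=0.97):
    """Put each read in the first OTU whose centroid it matches, else start a new OTU."""
    centroids, otus = [], []
    for seq in sequences:
        for i, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                otus[i].append(seq)
                break
        else:
            # No existing centroid is similar enough: this read founds a new OTU
            centroids.append(seq)
            otus.append([seq])
    return otus

base = "ACGT" * 25                      # toy 100 bp read
one_mismatch = "T" + base[1:]           # 99% identical to base
divergent = "T" * 10 + base[10:]        # 92% identical to base

otus = greedy_otus([base, one_mismatch, divergent])
print(f"{len(otus)} OTUs with sizes {[len(o) for o in otus]}")
```

No reference database is consulted at any point, which is why this approach can group organisms that have never been described.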
You would like to complete a genome wide association study (GWAS) for leg length in yellow fever mosquito, Aedes aegypti. However, you do not have an Illumina BeadArray available for this species, nor do you have a list of SNPs. Describe how you could generate a set of common SNPs for this species. Specify two precautions you would take to ensure that this is a quality set of SNPs (for example, to minimise ascertainment bias)
The ideal method would be whole genome sequencing of many mosquitoes to generate a set of common SNPs. You would then conduct quality control on your chosen SNPs, for example by choosing those that are diagnostic of haplotype blocks in high linkage disequilibrium (tagSNPs). To minimise ascertainment bias, sequence individuals from several populations rather than just one. PERHAPS: You could also use a reference set of SNPs from a closely related species. After determining the species most closely related to the yellow fever mosquito, you could use the SNP data on hand for that species to find common SNPs (reference SNPs). Account for differences in population structure in the models, or only account for haplotype data.
There are two islands, Blue and Grey island. On each island, there is a population of weta. Twenty years ago, a pathogenic virus arrived on both islands, causing significant weta mortality on both islands. Five years later, a mutation providing complete resistance to the virus swept to fixation on Blue Island. Would you now expect to find more linkage disequilibrium in the population of weta on Blue Island (where weta are now resistant to the virus) or Grey Island (where weta are still susceptible to the virus). Explain your answer.
The resistance mutation underwent strong positive selection on Blue Island. Strong positive selection quickly increases the frequency of an advantageous allele, with the result that linked loci remain in unusually strong LD with that allele. Therefore, you would expect to find more LD in the weta population on Blue Island than on Grey Island, and around the fixed resistance allele you will see a substantial reduction of heterozygosity at closely linked neutral loci.
Explain the purpose of using a false discovery rate (FDR)
Using FDR as a correction for multiple testing controls the expected proportion of false positives (Type I errors) among the results you call significant, rather than trying to exclude every false positive, in order to capture more of the true effects. This test has more power than the Bonferroni correction since it finds MORE true effects. You accept that some of your hits will be wrong: at an FDR of 0.05, about 5% of the SNPs you call significant are expected to be false positives. False discovery rate = false (+) / (false (+) + true (+)). Which correction to use depends on the question. Case 1: you care about the identity of each SNP? Case 2: you care about effects on the overall phenotype? How much do you care about true positives vs. false positives?
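One widely used way to control the FDR is the Benjamini-Hochberg step-up procedure. The notes do not name a specific procedure, so treating it as the method here is an assumption, and the p-values below are made up. The sketch also compares the result with a Bonferroni cut-off on the same data.

```python
def false_discovery_rate(false_pos, true_pos):
    """FDR = false positives / (false positives + true positives)."""
    return false_pos / (false_pos + true_pos)

def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of tests declared significant at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its own threshold (rank/m) * q
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

bh_hits = benjamini_hochberg(p_values, q=0.05)
bonferroni_hits = [i for i, p in enumerate(p_values) if p < 0.05 / len(p_values)]

print(f"Benjamini-Hochberg hits: {bh_hits}")
print(f"Bonferroni hits:         {bonferroni_hits}")
```

On this toy data the FDR procedure keeps two tests where Bonferroni keeps one, which is the gain in power described above.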
Having successfully made a new BeadArray for Aedes aegypti (which has three chromosomes), you perform your GWAS and find three genomic regions that have strong, statistically significant associations with leg length. You decide to visualize these results using a Manhattan plot. Draw a schematic of a Manhattan plot, labeling the x-axis and y-axis. Include the scale on both axes (i.e. place numbers and labels on the axes). Below the plot, describe in detail what the y-axis represents.
Y-axis: -log10(P-value) X-axis: Genomic position, ordered by chromosome number (1 to 3) The Y-axis represents the -log10 of the p-value from the test done at each SNP for an association with the phenotype (e.g. a significant difference in frequency between cases and controls). So each point is a SNP. The p-values are logged because some of them are so small that for graphing purposes it's much easier to plot the log, and the sign is flipped because the log of a really small number is negative, so to put it back into positive space it is multiplied by -1. The most significant SNPs therefore appear as the tallest points.
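A quick numeric sketch of the y-axis transform, with made-up p-values, shows why smaller p-values plot higher:

```python
import math

def neg_log10(p_value):
    """Height of a SNP on a Manhattan plot."""
    return -math.log10(p_value)

# Hypothetical p-values, from unremarkable to very strongly associated
p_values = [0.5, 0.05, 1e-4, 1e-8]
heights = [neg_log10(p) for p in p_values]

for p, h in zip(p_values, heights):
    print(f"p = {p:g} -> -log10(p) = {h:.2f}")
```

Each tenfold drop in p-value raises the point by exactly one unit on the y-axis.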
Describe one method to determine not just which populations you have ancestors from, but when those ancestors joined your family tree.
You could use an admixture model by assuming a set number of ancestral populations (although this introduces bias) and estimating the genome-wide ancestry proportions of every individual. You can see what fraction of an individual's genome belongs to which ancestral population. This method uses a model of the biology to infer admixture. In addition, because of ongoing recombination within admixed populations, the lengths of ancestry-specific DNA segments are expected to be inversely related to the time since admixture: older admixture leaves shorter segments. Therefore, it is possible to estimate the timing of admixture events by inferring the lengths of ancestry-specific DNA segments.
What is the workflow of metabarcoding?
You would first COLLECT a representative sample and, while avoiding any contamination, EXTRACT the DNA from the sample. Then you would PCR AMPLIFY a barcode of choice (e.g. the 16S gene in prokaryotes). After amplification, you would purify your PCR products and prepare them for sequencing. The sequencing reads produced should all be around the same length since we amplified the same barcode. We would quality control the reads and then begin assigning taxonomy using a reference database, or grouping sequences by sequence similarity into OTUs (operational taxonomic units).
Many phenotypes may be caused by very rare SNPs (rather than common SNPs). If this is true, what approach should we take to identify associations between rare SNPs and phenotypes? Why must we take this approach?
You would move towards whole genome sequencing to identify rare variants, since other variants not covered by exome sequencing or SNP arrays can be detected. You could couple WGS with linkage analysis to look at mutations within a haplotype block that have been associated with a tagSNP in a GWAS. There is potential to find rare mutations that have a non-random association with the tagSNP.
You are selecting SNPs to use on a new Illumina SNP Chip. In areas of the genome with low linkage disequilibrium, would you use more tag SNPs or fewer per megabase of genome? In your answer, explain what a tag SNP is, and why you would use more or fewer.
You would use more tag SNPs per megabase of the genome. Tag SNPs essentially "tag" a haplotype in a region of the genome. If a region has low LD, then compared to a region with high LD you cannot reliably infer what is at one locus from what is at another locus. This diversity of haplotypes must be accounted for, and therefore more tag SNPs are required. In a haplotype block of high LD, there is little haplotype diversity, and so a few tag SNPs are sufficient to account for it. TagSNP = A readily measured SNP that is in strong linkage disequilibrium with multiple other SNPs so that it can serve as a proxy for these SNPs on large-scale genotyping platforms.
