Genomics Unit 3

How can we recognize when the odds ratio is significantly different from 1?

(to make sure an odds ratio greater than 1 didn't occur by chance)
1. perform a chi-square statistical test
2. p-value: the lower the p-value, the less likely the result occurred by chance
-p-value < 0.05 or 0.01 is treated as significant
-in gene association studies, we want p-values that are extremely low
-alleles of a SNP can be associated with a binary state (whether you have the disease or not)
--calculations show whether they are associated or not
-can also test associations between SNP alleles and quantitative traits
-can see if SNPs are associated with the expression of our gene of interest (expression is a quantitative trait and varies from person to person)
-these SNP alleles are not necessarily causing the phenotype --> they are associated with particular levels of gene expression; they may be traveling with other genes that actually regulate the gene expression
-ex. looking at the association between the genotype of a SNP and height
-can see that a certain allele (the G allele) is associated with height (phenotype)
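As a rough illustration of the two steps above, the sketch below computes an odds ratio and a chi-square test for a 2x2 table of allele counts. The counts and the function names are made up for illustration; the p-value uses the standard identity for a chi-square statistic with one degree of freedom, and no continuity correction is applied.

```python
import math

def odds_ratio(case_with, case_without, control_with, control_without):
    """Odds of carrying the allele in cases divided by the odds in controls."""
    return (case_with * control_without) / (case_without * control_with)

def chi_square_2x2(case_with, case_without, control_with, control_without):
    """Pearson chi-square statistic and p-value for a 2x2 table (1 degree of freedom)."""
    a, b, c, d = case_with, case_without, control_with, control_without
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: 60 of 100 cases and 40 of 100 controls carry the G allele.
ratio = odds_ratio(60, 40, 40, 60)
chi2, p_value = chi_square_2x2(60, 40, 40, 60)
print(f"odds ratio = {ratio:.2f}, chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

With these made-up counts the odds ratio is 2.25 and the chi-square statistic is 8.0, which gives a p-value below 0.01.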

Summary of all Three Off-target Discovery Assays/Techniques

(typically performed as these three steps)
1. Computational Assay:
-guide RNA is a 20-nucleotide sequence; we only want modification at this site
-the genome can contain sequences that differ from the guide RNA sequence = that is what a mismatch means
-setting up the computational experiment: asking the program for all the sites in the genome that look like the guide RNA
--mismatches are differences in the sequence from the guide RNA
--just asking what other sites in the genome resemble this site but have a few mismatches --> sites that Cas9 could still recognize and hybridize to (so each would be an off-target editing site)
--Cas9 is less likely to bind a site the more mismatches it has, but we want to check more mismatches for more important sites (allowing more mismatches in exons)
--the ultimate goal is to put Cas9 in a patient's body for treatment, so need to make sure it doesn't harm any other gene functions, and especially that it doesn't affect other exons
--*more mismatches get you more potential off-target editing sites (giving more chances to find an off-target editing site); used for looking for off-target editing sites in exons*
2. GUIDE-Seq:
-wherever the oligonucleotide is, that is where the potential off-target region is located; can design a primer from that site into the genomic DNA
-looking at where in the genome the tag (the short unique DNA sequence, the oligo) inserted itself
-you know the oligo sequence, so you can design a primer from it
-primers are used to sequence the DNA next to the oligonucleotide --> sequencing allows us to confirm whether or not the mismatch sites from the computational assay are found in the genome
-positive control: did the oligonucleotide end up in the intended spot?
-now we are confirming what we got from the computational step: seeing whether the oligonucleotide ended up in a site predicted by the computational step
3. Last assay: SITE-Seq:
-uses the Cas9 enzyme, guide RNA and purified genomic DNA
-get a whole list of sites
-same idea as GUIDE-Seq: add adapters where Cas9 cut, then sequence the DNA to figure out where it cut
-same as GUIDE-Seq, but using naked DNA instead of cells
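The computational step can be pictured as a sliding-window search that counts mismatches against the guide sequence. The sketch below is a toy version under loud assumptions: it scans one strand only, ignores the PAM requirement, and uses an 8-base guide and a made-up genome string instead of a real 20-base guide. It is not how Cas-OFFinder or any specific tool is implemented.

```python
def count_mismatches(site, guide):
    """Number of positions where a candidate site differs from the guide."""
    return sum(1 for s, g in zip(site, guide) if s != g)

def find_candidate_sites(genome, guide, max_mismatches):
    """Return (position, site, mismatches) for every window within the mismatch limit."""
    hits = []
    for start in range(len(genome) - len(guide) + 1):
        site = genome[start:start + len(guide)]
        mismatches = count_mismatches(site, guide)
        if mismatches <= max_mismatches:
            hits.append((start, site, mismatches))
    return hits

# Toy data (hypothetical): one perfect on-target site and one site with 1 mismatch.
toy_genome = "GGACGTACGTGGACCTACGTGG"
toy_guide = "ACGTACGT"
for position, site, mismatches in find_candidate_sites(toy_genome, toy_guide, 1):
    print(position, site, mismatches)
```

Raising `max_mismatches` returns more candidate sites, which matches the note that allowing more mismatches yields more potential off-target sites to check.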

ancient DNA vs. modern DNA

Generating and analyzing ancient DNA data is not the same as for modern DNA.
Modern DNA:
• Long molecules
• Repair processes prevent "lesions" in the sequence
• Present in high quantities ("high copy number")
Ancient DNA:
• Short, fragmented molecules (<<<100 bp)
• Exhibits "characteristic" damage patterns
• Present in low quantities ("low copy number")

Pacific Biosciences Long-Read Detection Limits of Deletions

- Characterization of potential DNA insertions and deletions >100 bp after genome editing
- Individual NA20383 selected as genomic DNA (gDNA) standard because it harbors a well-characterized pathological 966 bp deletion
- Titration of individual NA20383 against individual NA24385 allows calculation of detection limit

Long-Range PCR and Long-Read Sequencing with Pacific Biosciences

- Characterization of potential DNA insertions and deletions >100 bp after genome editing. - Susceptible to amplicon size PCR enrichment bias.

CRISPR/Cas9 Off-Target Editing gRNA Qualification

- Off-target editing process consists of two phases:
1. Discovery phase: use unbiased genome-wide sequencing techniques to identify locations in the genome where potential off-target editing can occur
-can't just use whole genome sequencing to determine how specific your CRISPR/Cas9 guide RNA sequence is
-that 20-base sequence determines the entire off-target editing landscape, because each 20-base sequence is unique and CRISPR only has activity at regions homologous to the guide RNA
-always homology directed
-only doing assays
2. Validation phase (concludes the process)
-in the discovery phase, you don't edit a cell; you just run assays genome-wide and sequence them deeply with next-generation sequencing to identify locations in the genome where potential off-target editing can occur
-end up with a list of these locations --> validation phase
-take the list of potential off-target coordinates and validate whether off-target editing occurred at these loci in cells that have been treated with the drug
-very few of these validate (the goal is none)
Overall:
- The 20-base CRISPR/Cas9 gRNA sequence determines its off-target editing landscape
- Two-phase strategy for comprehensive off-target characterization:
1. On-target activity screen
2. Off-Target Discovery (complementary assays):
-Computational prediction
-GUIDE-Seq (cell-based assay)
-SITE-Seq (biochemical assay)
3. Off-Target Validation: targeted off-target sequencing in primary cells
Overall target editing validation:
1. have guide RNA to target the mutated gene
2. validate any off-target effects
-do this with GUIDE-Seq and PCR, with Cell-Free SITE-Seq (happens quickly, using naked DNA with Cas9 enzyme, which gives another list), or with Cas-OFFinder
3. ideally, these lists/techniques show no off-target effects

Cell-Free SITE-Seq Off-Target Editing Discovery

- now starting with naked DNA (instead of cells)
--meaning it is deproteinated and there are none of the enzymatic substrate restrictions for CRISPR/Cas9 that you would see in a living cell. Every cell has its own nuclear organization and genomic structure, so there are cell-type-specific limitations associated with the cell-based assays
--in the biochemical assay, the DNA is the same in every cell and you have removed all of the epigenome (so no other information about where that DNA came from)
- Species genome-wide off-target discovery technology
--applicable to every cell type in the human body (unlike the cell-based assay, which is specific to the type of cell the CRISPR/Cas9 is introduced into)
- Deproteinated genomic DNA input
- Permissive enzymatic CRISPR/Cas9 cutting conditions lead to high sensitivity and low specificity
steps:
1. High molecular weight genomic DNA and CRISPR/Cas9 digestion (saturate the editing - 4 hours)
2. Cuts make the DNA ends susceptible to adapter ligation for library construction (put adapters onto the genome)
-these adapters preferentially get ligated to CRISPR/Cas9 cuts because CRISPR/Cas9 makes blunt cuts that are phosphorylated, which allows this ligation to happen
-need adapters ligated on two sides
-in the cell-based assay, one primer was in the donor oligo sequence. In this case we ligate an adapter for the first primer, enrich for it with biotin, then add the adapter for the second primer, and then sequence all the DNA (completely unbiased, not a primer-specific sequence that enhances more)
-sequencing reads are mapped to the genome

DNase I footprinting

-*more accurate assay (than the previous one)
-DNase I normally cleaves every phosphodiester bond, but it can't cut where DNA is bound by a TF
1. start with end-labeled DNA fragments
2. a. have a control where you don't add nuclear extract or protein of interest
b. add nuclear extract/protein of interest or purified TF directly
3. light DNase I treatment
-get DNA fragments of different sizes --> run samples with and without TF on the same gel and compare sizes to each other
--> blank spaces are fragments that were not cut by DNase I; on the gel, these fragments are longer and are the ones bound by the TF (because they weren't cut by DNase I)
--> the footprint is the "blank space" in the gel
4. can take the DNA fragments and sequence them to figure out which fragments were protected by the TF from DNase I activity

Case Study

--> Used to understand the function of a TF

Human genetics

--> personalized medicine -heterozygosities -human history -monogenic vs. polygenic traits

Part III of Human Genetics

--> personalized medicine -polygenic traits/diseases -How do we determine if a phenotype is polygenic and how many genes are associated with that phenotype? --> association testing

Basic helix-loop-helix motif

-1 alpha-helix will be the recognition helix (goes into the major groove of the DNA molecule and interacts with specific nucleotides in the DNA sequence)
-proteins with this domain dimerize --> meaning they form a dimer
-dimer: 2 individual TFs coming together (work together and bind DNA)
-homodimer - 2 of the same TF
-heterodimer - 2 different bHLH TFs

linkage

-2 alleles are linked to each other during a cross
-linkage tends to be long range (1-2 Mbp): genes can be far apart from each other and stay linked when we're performing a cross
-this is because in just a few crosses/generations, there are only 1-2 recombinations per chromosome per meiosis

Helix-turn-helix DNA binding domain

-2 alpha-helices separated by a Beta-turn -2nd Helix is the one that interacts with DNA sequence -important to note that in structure of DNA, the major groove is where the TF interacts with -specific interaction in 2nd Helix -whole part of it is important, but only part of it is interacting with DNA directly

What is ancient DNA?

-Ancient DNA is the study of DNA collected from biological specimens that lived anywhere from hundreds to hundreds of thousands of years ago -When we carry out ancient DNA analysis, we move from 'bones to base pairs'

Developing CRISPR/Cas9

-CRISPR/Cas9 through LNP (RNA): In vivo CRISPR is the therapy --genetic diseases -CRISPR/Cas9 through the cells: Ex vivo CRISPR creates the therapy --immuno-oncology --autoimmune diseases

Representative Comparison of Off-Target Editing Loci Discovered

-Cas-OFFinder allowed up to 3 mismatches in non-coding DNA and up to 4 mismatches in protein coding DNA -SITE-Seq discovered more loci than GUIDE-Seq --SITE-Seq is our best assay, the most sensitive (but least specific, but that is ok because we have techniques and technologies to validate potential off-target sites.) -Validation of all loci discovered by these complementary technologies can provide a comprehensive approach to off-target editing characterization

rhAMPSeq Indel Detection Limit

-Genome in a bottle are the best characterized human genomes by NIST -62 naturally occurring indels from 1-20 bp and amongst a variety of DNA sequence contexts were curated from individuals NA12878 and NA24385 -100ng genomic DNA input and 1,000 sequencing reads can achieve >90% indel sensitivity down to 0.2% DNA editing

Guest Speaker: Genome Editing with CRISPR-Cas9

-Genomic characterization of potential unintended genome editing with CRISPR/Cas9
-Example patient: Bill, living with transthyretin amyloidosis --> company wanted to use Cas9 to permanently remove the transthyretin gene
Conditions that make a gene mutation/disease compatible with CRISPR/Cas9 therapeutics:
-only for when the human disease is caused by a genetic mutation
-monogenic trait/disease --> can engineer a guide RNA
-gain-of-function gene (if loss of function, it is much harder to engineer a replacement for it)

Why study ancient DNA?

-Humans have not descended in simple ways from populations that lived in the same locations in the distant past -Human history is characterized by people moving from place to place and mixing with each other -Ancient DNA is a powerful way to look at our ancestors at known times and places

Characterization of Potential DNA Structural Variants

-Illumina short-read NGS technology characterization of DNA structural variants -Pacific Biosciences long-read sequencing characterization of long-range PCR amplicons -Pinpoint DNA FISH direct visualization technology

Recombination in humans

-Recombination or meiotic crossover is fairly sparse in humans -meaning, 1-2 recombination events per parent during meiosis

monogenic traits

-Shaped by variations in a *single* gene -"mendelian"

Direct Visualization of Structural Variants with Pinpoint DNA FISH

-Standard karyotyping is not applicable to post-mitotic tissues
-Pinpoint DNA FISH is used to characterize allelic inter-chromosomal translocation
-KromaTiD's Pinpoint DNA FISH technology uses three-dimensional imaging to detect DNA structural variation

FoxD3 TF

-TF we are studying
-how techniques can be used in zebrafish to understand the function of the TF FoxD3

Insertion-deletion (indel)

-again due to replication error -2 alleles -300,000 of these sites in our genome -can have 5 new mutations per person

Genome wide association study Data

-all human chromosomes -Manhattan Plot: used to visualize this data

Most of this variation is functional

-all these heterozygosity sites show observation that there is one heterozygosity every thousand base pairs in the human genome -most of this variation is functional, meaning that it is affecting gene expression (could lie in regions of genome with "junk DNA", but a lot of it is not) -100 genes with loss-of-function mutations -1,000 of genes with missense alleles -can also have these changes in DNA occur in parts of the regulatory sequences (even if mutation is not in an exon, can have mutations be in regulatory segments) => can affect gene regulation --almost all genes vary in expression across people and this correlates with variation observed in our genomes

Single nucleotide polymorphism

-arise from errors in DNA replication (such as presence of mutagen) -2 alleles: ex. at a site person could have C/T, T/T, or C/C in their genome (depends on what you inherited from your parents) -3 million sites per person (SNPs are common) -happen spontaneously, so can get 50 new mutations per person (meaning these would arise in the formation of sperm and egg that were fertilized to create an organism, can have 50 new point mutations that would be SNP in every generation)

Pedigree diagram analysis

-autosomal recessive inheritance pattern: trait/disease phenotype is not present in every generation (ex. shows in three siblings but not in their parents), and it can come up in the next generation from parents that did not have the trait/phenotype
-autosomal dominant: disease/trait phenotype is present in all the generations; need just one copy to show the phenotype
-X-linked recessive: mutation in a gene on the X chromosome; may not see the phenotype in every generation; especially affects males because they only have one X chromosome (ex. females are carriers and produce male offspring that show the phenotype); usually only shows in males
-X-linked dominant (less common): phenotype shows up in every generation because it is a dominant mutation. If you have an affected male, his daughters will all be affected because they inherit the X chromosome with the mutation from their father, whereas his sons inherit the Y chromosome from their father and a normal X chromosome from their mother
--daughter gets it from an affected father
-sex-specific inheritance patterns point you toward a mutation on the X chromosome
-mitochondrial inheritance: mutations in the mitochondrial genome can cause a disease state; only inherited from the mother because the mother contributes the egg and the egg has the mitochondria in it (sperm does not carry mitochondria with it)
--an affected female's offspring will all be affected because all of them inherited the mitochondrial genome from her
-Y-linked: mutation on the Y chromosome; only males carry a Y chromosome, so only *males are affected
--acts in a dominant fashion because only one Y chromosome is present, so if the mutation is on that chromosome, you would see the disease phenotype

Pedigree of 3 generations

-children can inherit a crossover chromosome
-granddaughter shares parts with her grandfather
-when comparing siblings to each other, they can be identical across stretches of the genome when both inherited the same segments of DNA from the grandparents
-in other cases, children could be very different from each other
-really depends on what recombination events occurred
-can see what segments of DNA you share with siblings, parents, or grandparents
-recombination happens infrequently, so can see/compare these segments

Zinc finger domain

-commonly present -10% of mammalian genes encode for zinc finger TFs (a lot!) -only alpha helix directly contacts DNA -Beta sheet also important because makes non-specific interactions with phosphate backbone of DNA -relationship in how the zinc fingers are being held actually dictates the general structure -multiple zinc fingers in one TF -variations of zinc fingers: --multicysteine zinc fingers --treble clef zinc finger etc.

Improving human health:

-determine features of the human genome that associate with a particular disease --> compare genomes of patient with a condition to healthy counterparts -identify changes in DNA sequences associated with cancer -determine how genetic predisposition can be affected by lifestyle changes -genetic counseling for couples who want to have a baby -choose drugs or treatment based on a patient's genome sequence (pharmacogenomics)

microsatellite

-due to replication slippage (because microsatellites are repeats - during replication, polymerase can get confused and add or delete some of these repeats) -many alleles (depends on what you inherited for how many repeats) -30,000 sites in the genome -can have 50 new mutations in these sites

Forkhead domain interacting with DNA

-find this information from the PROSITE database --> tells us the highly conserved/notable domains; from this program we learn that the protein has a Forkhead domain
-Forkhead domain in the FOXD3 TF (which we can google)
-purpose is to bind DNA
-4 alpha-helices
-2 beta-sheets
-2 wing domains

What is HTT?

-gene is expressed in a number of tissues; protein is present in many cells
-transport of other proteins
-structural functions? (this is not very well known)
-excess glutamines cause clumping with other proteins --> causes neurons to die

How GWAS is helpful for understanding these diseases and polygenic disease

-genes with associated variants/SNPs tend to concentrate in specific biological pathways
-leading us to understand what is going wrong
-most proteins are not working in isolation but in complexes
-GWAS analysis can show you that the genes most associated with a disease like schizophrenia are actually all genes that work in the same biological pathway or protein complex
ex. Schizophrenia: common variants in schizophrenia
-L-type calcium channel present in neurons in your brain: made up of different proteins (protein complex)
--variations meaning lower expression or point mutations in genes associated with these complexes
--affect expression levels for the formation of these complexes
-alterations to the L-type calcium channel complex in patients with schizophrenia --> points us to ways to relieve this
-powerful way of understanding a disease whose molecular basis we don't know; GWAS can point us in that direction
-can figure out how to use model organisms: see if a mouse has this type of calcium channel, understand what that channel is doing in the brain of a mouse, and therefore what it is doing in the brain of a human

Cell-Based GUIDE-Seq Off-Target Editing Discovery

-genome-wide and unbiased
-cell based: assay occurs in cells
-need to start with cells and introduce CRISPR/Cas9 as the ribonucleoprotein complex, AND include the *dsODN tag*: an oligonucleotide => double-stranded DNA sequence about 40-50 base pairs long that is not a naturally occurring sequence in the human genome (unique sequence)
-including this donor oligo allows for additional misrepair, which includes integration of the oligo. Now we have an anchor from which we can use PCR primers to enrich, unbiased and genome-wide, in NGS libraries for sequencing
-Allows off-target integration of the donor oligo over the period of genome editing
-*oligo sequence*: simply for identifying the location of the off-target sequence
-Limited by available primary cell types and confounded by tissue-specific genomic organization
Biology of it:
1. donor oligo gets incorporated in the cells, then harvest the cells (have whole genome)
-some loci have this unique donor oligo inserted in them
2. take genomic DNA and make it smaller because short-read sequencing is used
3. (unique step for NGS library) once you put the adapters on, meaning end repair and adapter ligation, you have a library
-went from purified genomic DNA to a library because now all the molecules in that genomic DNA are uniquely adapted with oligonucleotides
-this is part of preparing the genome for analysis
-stretch of unique base pairs: NNNN
-P5, Index 2, etc. on genome
-gives PCR space with the double-stranded oligo tag
-need one primer in the oligo and one primer in the genome
-for amplification: only DNA fragments with both the oligo and the adapter can get amplified and sequenced
-take all the reads and map them to the genome; that is how potential off-target sites are identified in living cells (where are the reads (black genomic regions))
-look for the accumulation of these reads and annotate these sites as potential off-target sites (now have computational prediction sites and cell-based prediction sites)

Results of protein binding microarray

-get back a list of DNA sequences
-results show: spot on the chip, and how well our TF bound
-so can see where there was good binding and at what spot/nucleotide/DNA position
-gives the consensus binding site
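One simple way to picture going from a list of well-bound sequences to a consensus binding site is to take the most common base at each position. The sketch below assumes the sequences are already aligned and of equal length; the example sequences are made up and are not real FoxD3 or microarray data.

```python
from collections import Counter

def consensus_site(sequences):
    """Most common base at each position of equal-length, pre-aligned sequences."""
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    # zip(*sequences) yields one column of bases per position.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*sequences))

# Hypothetical sequences from strongly bound spots on the chip.
bound_sequences = ["TGTTTAC", "TGTTTAC", "TGTTTGC", "TATTTAC"]
print(consensus_site(bound_sequences))
```

Real analyses usually weight each sequence by how strongly it was bound rather than counting every sequence equally.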

Types of Domains:

-helix-turn-helix DNA binding domain
-winged HTH domain
-POU domain
-zinc finger domain
-basic helix-loop-helix motif

Heterozygosity

-humans have a fair amount of variation in their genomes; heterozygosity refers to the density of heterozygous sites --> in humans, 1 base per kilobase
-heterozygous sites are the parts of your genome that differ from another human's genome

Can also see morphology of zebrafish

-in addition to determining which genes are regulated by FOXD3, we can also look at the zebra fish animals themselves in which we down-regulated or inactivated FOXD3 and look for any obvious morphological changes -morphological changes: ex. Bones - forming bones -bone structure has fewer bones with FOXD3 mutation

What's happening in HTT gene

-in excerpt of the HTT gene sequence: -there is repetition of CAG (in the first exon) --> encodes for glutamine --protein gives polyglutamine tract -example of microsatellite that's present within a gene sequence -during DNA replication, there can be replication slippage where DNA polymerase gets confused and can insert or delete extra of these sequences into the genomic sequence while it's replicating DNA -disease ultimately is caused by an expansion of an intragenic microsatellite -in patients with Huntington's, there is an expansion of the CAG repeat length -6-35 repeats in genome - healthy -36-39 - sometimes get HD -40-100 - almost always get HD -disease onset is later in life, the more repeats, the earlier in life have HD
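The repeat-length ranges above can be turned into a small lookup. The sketch below counts the longest uninterrupted run of CAG in a sequence and maps it onto the ranges listed in this card; the function names are made up, and treating counts below 6 as healthy and counts above 100 as "almost always" is an assumption that simply extends the listed ranges.

```python
import re

def longest_cag_run(sequence):
    """Length, in repeats, of the longest uninterrupted run of CAG."""
    runs = re.findall(r"(?:CAG)+", sequence.upper())
    return max((len(run) // 3 for run in runs), default=0)

def classify_repeat_count(repeats):
    """Map a CAG repeat count onto the ranges given in these notes."""
    if repeats <= 35:
        return "healthy range"
    if repeats <= 39:
        return "sometimes get HD"
    return "almost always get HD"

# Hypothetical exon 1 excerpt with 20 CAG repeats.
excerpt = "ATG" + "CAG" * 20 + "CCG"
count = longest_cag_run(excerpt)
print(count, classify_repeat_count(count))
```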

Contact of TFs with DNA

-interaction is specific: will bind the same site in DNA (consensus binding site) -consensus binding site - usually 8-10 nucleotides -interaction is non-covalent -in the major groove, hydrogen bonds form between the nucleotide bases and the R groups of amino acids in the recognition structure -not opening up DNA - closed state -interactions in the minor groove - hydrophobic interactions are more important -on the surface of the helix - electrostatic -DNA has a negative charge and amino acids like lysine and arginine that are positively charged => can help interact with DNA - b/c of attraction (these are more non-specific interactions)

Distribution of GWAS variants

-interestingly, most often SNPs associated are found in/arise from the noncoding regions of the genome -most often, common variant effects arise from the noncoding regions of the genome --frequently in enhancer sequences (DNA sequences that are present upstream or downstream of genes where a transcription factor can bind, and from that they will affect the expression of one or however many genes are in the area) -these variations in our genome result in changes in levels of gene expression (rather than creating nonsense mutation) -we each are carrying different variations which then affects the expression levels of different genes that then affects how much protein we have in a complex (that could affect cognition) --variation in the enhancer sequences that regulate the expression of a gene(s) and making of certain proteins -variants in genome result in changes in levels of gene expression

POU domain

-is usually present in TFs that also have HTH domains --> these 2 motifs (DNA binding domains) work together to bind DNA in different regions -bind to two different spots in DNA through two different domains --> makes the interaction more stable

What about polygenic diseases?

-many diseases are polygenic as we've learned from GWAS analysis -ex. chances of developing coronary artery disease? -(1) GWAS analysis performed on a large population of people: this is going to tell you all the variants associated with this disease (time consuming, but has to be done first for a particular disease, that's why these studies are so important) -(2) now we can calculate "polygenic risk scores" (PRS or PGRS) -count all the risk or protective alleles that a person has inherited at all loci, weighted by their effect -ex. 30 variants associated with the disease and one inherited 15 of them, that would translate to a particular risk score. if certain variants are more associated with the disease, could affect your score -graph: can graph polygenic scores: - x-axis shows percentile of polygenic score - as you have more of the variants associated with the disease, - y-axis: prevalence of the disease in the patient - the more of the risk-variants you're carrying, the more likely you are to develop the disease *polygenic risk scores work well to predict the onset of disease (will be more closely monitored - but also ethical concerns) -important to note that this information is trained on the results of genome-wide association studies
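The "count the alleles, weighted by their effect" idea can be written as a weighted sum. In the sketch below, the SNP names, effect weights and genotypes are all hypothetical; in practice the weights would come from the GWAS for the disease. A dosage is the number of copies (0, 1 or 2) of the scored allele a person carries, and protective alleles get negative weights.

```python
def polygenic_risk_score(dosages, weights):
    """Sum of (copies of the scored allele) x (effect weight) over all scored SNPs."""
    return sum(weight * dosages.get(snp, 0) for snp, weight in weights.items())

# Hypothetical effect weights: positive = risk allele, negative = protective allele.
effect_weights = {"snp_1": 0.2, "snp_2": -0.1, "snp_3": 0.5}

# Hypothetical person: 2 copies at snp_1, 1 copy at snp_2, 0 copies at snp_3.
person = {"snp_1": 2, "snp_2": 1, "snp_3": 0}

print(round(polygenic_risk_score(person, effect_weights), 3))
```

A raw score like this only becomes interpretable once it is ranked against the scores of a reference population, which is what the percentile on the x-axis of the graph represents.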

Human history shapes the variation in our genomes

-migration out of Africa resulted in genetic bottlenecks -100,000 years ago -genetic separation has not been that long and we as a species have very similar genomes

3 million heterozygous sites - where do you think they come from?

-modern, arose recently or ancient? -50 new mutations per generation -2 unrelated people, they would share 50% of their heterozygous sites

Next: inactivate FOXD3 - tells us the function of FOXD3

-now we know where to focus interest (neural crest cells)
-next is to get rid of FOXD3 to determine what happens in the neural crest cells in the absence of FOXD3 --> tells us its function
-a "knockout"
-compare FOXD3 -/- (zebrafish knockout) to wild-type FOXD3 +/+
-perform RNA-seq on wild-type neural crest cells and FOXD3 mutant neural crest cells --> we know FOXD3 is a transcription factor that regulates downstream gene expression, so we want to understand which genes it normally turns on or off in wild type (those genes should be dysregulated in FOXD3 mutants, and RNA-seq should pick that up)
-could do single-cell RNA-seq on FOXD3 mutant cells (but costly because you are doing RNA-seq on every cell that comes out of the experiment)
-isolate neural crest cells and do a bulk RNA-seq experiment on those cells altogether; we will still be comparing our mutant FOXD3 neural crest cells to the wild-type neural crest cells

Huntington's disease

-patients have jerky movements due to the death of specific neurons in the brain -exists in multiple generations, not inherited in sex-specific ways => autosomal dominant

ChIP seq results (image)

-peaks are FoxD3 binding sites -can determine which genes are near FoxD3 binding sites --> list of genes that are likely to be regulated by FOXD3
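Turning peaks into a list of candidate target genes can be sketched as a nearest-gene lookup. The example below is a simplification under stated assumptions: a single chromosome, each gene represented only by a start coordinate, and a made-up distance cutoff. Gene names, coordinates and peak positions are hypothetical.

```python
def nearest_gene(peak_position, gene_starts):
    """Name of the gene whose start coordinate is closest to the peak."""
    return min(gene_starts, key=lambda name: abs(gene_starts[name] - peak_position))

def genes_near_peaks(peak_positions, gene_starts, max_distance):
    """Group peaks by nearest gene, keeping only peaks within max_distance of it."""
    candidates = {}
    for peak in peak_positions:
        gene = nearest_gene(peak, gene_starts)
        if abs(gene_starts[gene] - peak) <= max_distance:
            candidates.setdefault(gene, []).append(peak)
    return candidates

# Hypothetical gene start coordinates and peak positions on one chromosome.
gene_starts = {"geneA": 1000, "geneB": 50000, "geneC": 120000}
peaks = [900, 51000, 90000]
print(genes_near_peaks(peaks, gene_starts, max_distance=5000))
```

Genes on such a list are only likely targets; comparing expression in wild-type and mutant cells is what shows whether they are actually regulated.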

Broad goal in human genetics: Which alleles and how many underlie a phenotype?

-phenotype can be an easily observable trait
-in an ideal setting, take a large population of people who all show the phenotype of interest and see which alleles are associated with, for example, being tall vs. short (variants present in all people who are tall, for example)
-*whole genome sequencing of all participants --> fairly expensive: $1,000 for whole genome sequencing of one person
-*whole exome sequencing --> $250/genome --> specifically sequencing exons, which are a small percentage of the whole genome (variations in regulatory sequences will not be picked up; only looking at exon sequences, so there is no information about regulatory or intergenic sequences)
-*SNP analysis (most often) --> use special DNA chips where the DNA sequences on the chip are designed to detect all the different variations of SNPs found in the genome. Done at the level of all the SNPs in a particular person's genome, and these chips are fairly cheap
--> can see SNP variations
--> < $50/genome
--> can do this on 1,000s of people
--detecting the SNP profile of a particular individual. SNPs are single nucleotide variations/polymorphisms in a person's genome that may be used as markers for, or may be the cause of, a change that produces a particular phenotype
-why SNP analysis is so powerful:
--> *get information for 500,000 SNPs in the genome
--> can show haplotypes

Recombination happens infrequently

-recombination occurs infrequently and at certain hotspots -because of this, there are stretches of genomic sequences that are always inherited together (shown in previous slide with pedigree with 3 generations)

Example of these traits

-same disease in humans can have monogenic and polygenic causes -Alzheimer's disease: --early onset of the disease is caused by one gene --late onset of the disease is caused by variations in many genes -Strong cue that the trait is monogenic is if it segregates in simple, predictable ways

polygenic traits

-shaped by variation in *many* genes
-biological processes are complex
-in populations, there are variations in genes

ATTR

-shows how effective the drug is
*sustained > 95% serum TTR protein reduction after a single dose in NHPs
-x-axis: time (weeks)
-y-axis: measure of TTR (transthyretin) protein from mass spectrometry
-dramatic reduction in protein expression (> 95%) that is stable over the course of the year
-therefore, it is a permanent genetic alteration/change and a cure for the disease, instead of a drug that modulates the disease symptoms
-therapeutically relevant serum TTR knockdown

Types of heterozygosity

-single nucleotide polymorphism (SNPs) -microsatellites -insertion-deletion (indel) -structural variation

*FOXD3 is expressed specifically in neural crest cells

-single-cell RNA seq results from database to see where FOXD3 is expressed: --each little dot represents an individual cell and have been clustered with each other based on how similar the transcriptional profiles are of these cells --after this clustering happens in an unbiased way, we can go back and identify what these clusters are based on some of the genes that are expressed since maybe we already knew where some of these genes are expressed before doing the experiment -next going to ask where FOXD3 is expressed in all these single cells shown -the expression level is represented by brightness -positive for FOXD3 are in neural progenitors and neural crests -progenitors: divide to give rise to... --muscle --neurons + glia --cartilage + bone

haplotypes

-some SNPs/alleles of genes are tightly linked + inherited together

ChIP-seq

-stands for chromatin immunoprecipitation followed by sequencing -technique to figure out where in the genome a TF binds, and to determine this inside of a cell 1. treat cells with formaldehyde -"fixative" --> fixes the TF onto the DNA by cross-linking -manipulate the structure 2. sonicate the DNA into smaller fragments 3. immunoprecipitation - use an antibody to "fish" out our TF of interest (at the end of this step, we have isolated and purified the TF of interest, with its bound DNA, by using the antibody that will pull it out) 4. heat the sample to release the DNA fragments from the TF, which results in just a sample of DNA fragments that were bound to our TF 5. perform next-generation sequencing (NGS) to identify the DNA fragments that came out of this --> once we get those reads back, we can map our reads onto the reference genome (see what parts of the genome these sequences correspond to, which tells us where in the genome this particular TF has bound) --> if we get back sequencing reads that are, for example, upstream of Gene X, that tells us our TF must have bound there --> can find where the TF binds in the genome!

GWAS analysis for Schizophrenia

-tells you that there are many genes/many variations associated with this disease: they found that there are 128 loci/SNPs (skyscraper peaks) that are highly associated with schizophrenia -learning from these studies that many human diseases are highly polygenic --> all these SNPs are contributing to the phenotype (at least in combination, and there are many different ways to have the disease/phenotype) -*most human traits and diseases are highly polygenic* -hard to tell in these studies which variants/SNPs are the functional ones, but can see where they are located in the chromosome, or study more carefully in individual patients whether they affect the expression of certain genes -brings our attention to certain parts of the genome (but doesn't explain why they have the disease)

Result of FOXD3 mutant cells (knockout)

-the RNA-seq data: shows a plot of genes that don't change versus genes that are up- or down-regulated in our mutant cells as compared to wild type -the amount by which expression changes is shown -generates a big list of genes whose expression changes to varying degrees (can have both up- and down-regulated genes) -can see that one of the genes that is down-regulated is FOXD3, which makes sense because we inactivated it and created a knockout (shouldn't be expressed - should not be able to detect it) -*ultimately shows genes whose expression changes in FOXD3 mutant cells -we also performed ChIP-seq to see where in the genome FOXD3 binds, and which genes are located near those binding peaks -now can compare our lists: the list of genes whose expression changes in FOXD3 mutants, and the list of genes that are near FOXD3 binding peaks from the ChIP-seq experiment => can put together a list of genes regulated by FOXD3
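The list comparison described above can be sketched as a set intersection. This is only an illustration: the gene names below are hypothetical placeholders, not results from the actual experiment.

```python
# Hypothetical gene names, for illustration only (not real experimental results).
rna_seq_changed = {"geneA", "geneB", "geneC", "geneD"}  # expression changes in FOXD3 mutants
chip_seq_nearby = {"geneB", "geneD", "geneE"}           # located near FOXD3 binding peaks

# Genes found in both lists are candidates for regulation by FOXD3.
candidate_targets = sorted(rna_seq_changed & chip_seq_nearby)
print(candidate_targets)
```

Genes that appear in only one list are not ruled out, but the overlap is the most direct starting point for follow-up.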

Ancient DNA: The early days

-the early days of ancient DNA: the first full ancient human genome was only published in 2010 -DNA was extracted from the individual's hair -early days of ancient DNA: 1. Single genome: from one individual 2. ~4,000 BP (before present): relatively recent in the past, not super ancient 3. Permafrost region (samples we were studying - that is where DNA preserves the best)

In vivo: CRISPR is the therapy

-the way they deliver CRISPR/Cas9 into the body is through lipid nanoparticles - LNP Delivery (lipid nanoparticle system) --> this system is meant for intravenous delivery (in contrast to delivery into the muscle, as in vaccination) -these lipid nanoparticles are absorbed at a high rate in the liver -fortunate for them because transthyretin, the gene they are trying to eliminate, is only expressed in the liver -by infusing this LNP, they are able to deliver a Cas9 mRNA sequence and a guide RNA (gRNA) sequence -once absorbed inside the cell, these components can escape into the cytosol. Cas9 mRNA can then be translated by a ribosome into a protein, and the protein can find the specific guide RNA sequence. Once the Cas9 protein assembles with the guide RNA sequence (they have high affinity for each other) it becomes an active ribonucleoprotein. Once this happens, it can translocate to the nucleus of the cell and search the whole genome for its 20 base pair target sequence. When it finds it, it can cut. -the major key is that the entire landscape of editing is determined by the 20 base pair targeting sequence in the guide RNA -want to make sure that this targeting sequence is only editing at specific locations of the genome (usually just one target location) -reduce unintended editing events -supports the potential safety of the drug overall: -systemic non-viral delivery of CRISPR/Cas9 provides transient expression -gRNA (guide RNA) determines CRISPR/Cas9 genome editing target -Genome editing results in DNA base pair insertions and deletions, collectively called indels *LNP (Lipid Nanoparticle) Delivery system: gRNA Determines Genetic Target --20 base targeting sequence! -want to reduce unintended editing

Graph - ancient mutations common today

-these variants - 10 million --now carrying different version of variants --these explain most of our modern heterozygosity -filtered by natural selection -*recent mutations are rare -limited to families or regions -*50 new mutations x7 billion people --> this is not filtered by natural selection

How can we detect TF binding to DNA?

-this assay takes advantage of a technique we talked about: gel electrophoresis -run DNA fragments on a gel and separate them by size: can determine if a DNA fragment is bound to a TF because a piece of DNA bound by a TF will not be able to move through the gel as easily and will be retarded in the gel => Gel retardation/mobility shift assay -have DNA cut up by restriction enzyme (shorter fragments of DNA) in one well as our control -cut the same sample of DNA with the same restriction enzyme to make the same fragments, but now incubate the DNA fragments with either purified TF or nuclear extract (usually has TFs present in it) -DNA with TF will not travel as far in the gel (will look like a larger fragment - "retarded band") -easy way to detect if a DNA sequence of interest is bound by a protein (shows generally that a protein is binding the DNA)

Structural variation

-types of structural variation: genomic segments can be deleted out, can have insertion of another genomic segment, inversion of genomic segments, tandem duplication (can make up macrosatellite), dispersed duplication, and these events can lead to copy-number variant -due to double stranded breaks or aberrant meiotic crossover -1,000s of sites with these large-scale variations (pretty rare) -0-1 new mutations (occur during meiosis)

Protein binding microarray

-will tell you exactly the DNA sequence that is being bound by the TF -use DNA chip: consists of double-stranded DNA sequences that include all possible 10-nucleotide sequences (this is because TF binding sites are between 8-10 nucleotides) --> every unique 10-nucleotide sequence is represented -shows which DNA sequence was bound 1. prepare chip 2. take the transcription factor we are interested in and add it to our chip (tagged so we can follow it) -one chip --> the presence of DNA -second chip --> the presence of our TF -seeing the similarity between them (the chips) --> to see what the TF bound to (the 10-nucleotide DNA sequence)
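To get a feel for "all possible 10 nucleotide sequences", here is a minimal sketch that enumerates k-mers over the four bases. The helper name `all_kmers` is made up for this example, and the sketch ignores how a real chip lays out or combines sequences.

```python
from itertools import product

def all_kmers(k):
    """Yield every DNA sequence of length k over the alphabet A, C, G, T."""
    for bases in product("ACGT", repeat=k):
        yield "".join(bases)

# Small case that is easy to check by eye: 4^2 = 16 dinucleotides.
print(list(all_kmers(2)))

# The number of distinct 10-mers is 4^10.
print(4 ** 10)
```

The count grows as 4^k, which is why a short length like 10 is still practical to cover on a chip.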

Manhattan Plot

-x-axis: all human chromosomes -y-axis: strength of the association (it is -log10(p-value), where the p-value indicates the likelihood of the association occurring by chance) -each dot represents a SNP - you consider each peak as being driven by one particular SNP -lupus: autoimmune disease - example Manhattan plot -started with patients who have lupus as well as those who do not -looking to see which SNPs are associated with lupus -low p-values show up high on the Manhattan plot -only interested in the ones that are very high up -and now can see where the SNPs are located in the genome and what genes are nearby, or whether they are inside a gene -for lupus - a few genes are associated with the disease (one of them is a gene involved in immunity) --> contributing to the disease -doing SNP-typing (SNPs that differ between people) and doing it across participants with and without the phenotype being examined (ex. lupus) -can visualize this SNP-typing on a Manhattan plot --shows physical layout on the chromosomes --ones with a high y-value are strongly associated with the phenotype -doesn't identify the causal mutation, just shows where it might lie -ones with low association --> found in both controls and phenotype (ex. lupus) participants -when more than one locus/gene is associated --> polygenic
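A small sketch of the y-axis transform, assuming base-10 logs (the usual convention for Manhattan plots). The p-values used here are made-up examples.

```python
import math

def manhattan_y(p_value):
    """Convert a p-value to the Manhattan plot y-value, -log10(p)."""
    return -math.log10(p_value)

# Made-up p-values: smaller p-values give taller points on the plot.
for p in (0.05, 1e-4, 5e-8):
    print(p, round(manhattan_y(p), 2))
```

Because of the minus sign, the SNPs with the strongest association are the ones that end up at the top of the plot.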

And requires us to...

...critically consider the damage we are doing to skeletal repositories -- Who are the gatekeepers of skeletal collections, and what does responsible sampling look like? ...recognize the thoughts/opinions/desires of all stakeholders --Not everyone wants their ancestors studied. ...become increasingly aware of potential consequences of our work --Could this affect claims to land, undermine feelings of belonging, etc.?

The current state of ancient DNA research allows us to...

..explore increasingly specific questions about population history --Sometimes we do 'reshape' stories about the past, sometimes we don't ...add (necessary!) nuance to broad ideas like "migration" --Multiple waves --Alteration, but not replacement ...increase our confidence in the interpretations we make using our data --More data allows us to revisit old interpretations

Process:

1. Apply FoxD3 to a protein-binding microarray (PBM) -result: the DNA sequence it binds; then where in the genome? 2. ChIP-seq -zebrafish genome -antibody that binds FoxD3

Ancient DNA analysis faces unique challenges

1. Degradation of DNA: DNA is a fragile molecule that decays over time -use chemical/computational methods to correct for this -*Terminal miscoding lesions - "ancient DNA damage pattern" --> *ends of short reads have the damage patterns 2. Contamination -Microbial/fungal -Modern DNA -pie chart of a DNA extract: only a fraction of it is human DNA -and of that human DNA, only a fraction is uncontaminated (endogenous - meaning it comes from the organism we are looking to study) --PCR will preferentially amplify modern DNA

Conclusions

1. Selection of guide RNAs (gRNAs) for therapeutic gene editing with CRISPR/Cas9 requires in-depth analysis of unintended off-target editing and DNA structural variants 2. Comprehensive off-target characterization consists of discovery and validation phases - Off-target editing discovery using a biochemical approach has proven superior to the widely used cell-based experimental technology - Off-target editing validation of potential loci with targeted sequencing is done in primary cells representative of the intended target tissue 3. Complementary technologies characterize potential DNA structural variants: short-read next generation sequencing (NGS), long-read NGS and direct visualization of the genome

Trends in Ancient DNA Research

1. Tens to hundreds of individuals analyzed (instead of a single genome) 2. New periods of time (sometimes more ancient, sometimes more recent) 3. New world regions (not just permafrost/Eurasia)

Questions about it:

1. What DNA sequence does FoxD3 bind? 2. Where in the genome does FoxD3 bind?

Personalized medicine

= taking an individual's genome sequence and using it to improve or inform medical treatment --> Human Genome Sequencing Project - long-term goal: improve human health

GWAS (genome wide association study)

= genome wide association study -collected genomic DNA from research participants (blood sample) -population of 1000s --> case (1000) vs. control (1000) 1. quantitative trait like height 2. "type" SNPs using a SNP array (for thousands of SNPs, shows what allele is there for each SNP) 1. A 1. T 2. G 2. C 3. C 3. C -getting a long list of what the SNP profile is for each person -directly identifying 500,000 SNPs in this analysis 3. impute untyped SNPs (because fragments of our DNA always travel with other fragments of DNA when they're not separated by a recombination hotspot - getting even more information because certain SNPs are associated with other SNPs) 4. test each SNP for association to the phenotype (what is the relationship between the presence of a particular SNP and the phenotype you are interested in) -lot of computation to figure this out *need a large sample size (bc you're looking at a large population), and need a small p-value (want a high association): want p-values below 5*10^-8 (equivalent to 1 in 20 million) - only SNPs that are strongly associated with the phenotype of interest are SNPs that you want to follow up on (a much stricter threshold than normal for an experiment with p-values)
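One common way to arrive at 5*10^-8 is a Bonferroni-style correction: divide the usual 0.05 threshold by roughly one million independent tests. That rationale and the test count are assumptions added here, not stated in the notes; the sketch just checks the arithmetic.

```python
# Assumption (not from the notes): about 1,000,000 independent tests across the genome.
alpha = 0.05
n_independent_tests = 1_000_000

genome_wide_threshold = alpha / n_independent_tests
print(genome_wide_threshold)

# The same threshold expressed as "1 in N".
one_in_n = round(1 / genome_wide_threshold)
print(one_in_n)
```

Testing hundreds of thousands of SNPs at the ordinary 0.05 level would produce thousands of chance hits, which is why the cutoff has to be this strict.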

linkage disequilibrium

= occurrence of some combinations of alleles more or less often than expected by chance -short range (2-200 kb) - because there have been 1000s of meioses since our common ancestor -SNPs that are always traveling together over the course of human history because they are really close together in the genome and there is no recombination hotspot in between them => would say those two SNPs are in linkage disequilibrium because they are always associated with each other -when we do a SNP analysis of someone's genome and get a readout of 500,000 SNPs, we can impute that certain SNPs are always associated with other SNPs because they are in linkage disequilibrium and are inherited together --> from this can assume presence of thousands of other SNPs -one SNP always associated with another SNP because they are in linkage disequilibrium -can assume presence of thousands of other SNPs because alleles travel in haplotype blocks
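"More or less often than expected by chance" can be put into numbers with the standard D and r^2 measures, which are not named in the notes. A minimal sketch, using hypothetical haplotype counts for two SNPs with alleles A/a and B/b:

```python
# Hypothetical haplotype counts (illustration only).
hap_counts = {"AB": 40, "Ab": 10, "aB": 10, "ab": 40}
total = sum(hap_counts.values())

p_AB = hap_counts["AB"] / total                     # observed frequency of the A-B haplotype
p_A = (hap_counts["AB"] + hap_counts["Ab"]) / total # frequency of allele A
p_B = (hap_counts["AB"] + hap_counts["aB"]) / total # frequency of allele B

# D: observed haplotype frequency minus the frequency expected if A and B were independent.
D = p_AB - p_A * p_B

# r^2: D squared, scaled by the allele frequencies, ranging from 0 (no LD) to 1 (complete LD).
r_squared = D ** 2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
print(D, r_squared)
```

A value of D near 0 means the two SNPs behave as if independent; a high r^2 is what makes it reasonable to infer one SNP from the other.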

polygenic traits definition

= traits that are shaped simultaneously by many different genes

FOXD3 (knowledge from experiment)

= transcription factor -determined it has a winged helix forkhead domain (from the PROSITE database) - look at its structure and find the alpha helices, etc. -PBM analysis - look at exactly what DNA sequences it binds -ChIP-seq - to determine where in the genome FOXD3 binds -where it is expressed - it's expressed in neural crest cells -generated FOXD3 mutant cells - from that did RNA-seq experiments to see which genes are up-regulated or down-regulated in the absence of FOXD3 (regulates gene expression) -morphological analysis: found that FOXD3 mutants were missing glia and facial bone structures

Winged HTH domain

=> 3 alpha-helices + Beta-sheet -one of alpha helices is directly interacting with DNA -other parts set up structure and interaction with DNA

Paleogenomic analysis

=> Paleogenomic analysis can provide unique insight into past peoples -Many people say we are in the midst of the "Paleogenomics Revolution"

What are transcription factors?

=> proteins that recognize and bind specific DNA sequences -key feature of TFs is the DNA binding domain

CRISPR/Cas9 Genome Modification - Different Approaches

Approaches to using CRISPR/Cas9: 1. knockout: inactivation/deletion of disease-causing DNA sequence -the approach for the transthyretin gene is to knock out the gene -error in repair leads to a nucleotide insertion or deletion that results in a frameshift (when the insertion or deletion length is not divisible by 3, the protein becomes non-functional, typically because of a premature stop codon) -leads to reduction in protein, but also results in reduction of mRNA transcripts (less mRNA means less translation of the protein) 2. repair: correction of "misspelled" disease-driving DNA sequence -much more difficult to use CRISPR to repair something -much easier to break things than to create new functions (need evolution and generations to do this) 3. insert: insert new DNA sequence to manufacture therapeutic protein -also complicated to insert a whole fresh transgene at a specific site with CRISPR
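The "not divisible by 3" rule for frameshifts can be sketched as a one-line check. The function name and the sign convention (positive for insertions, negative for deletions) are choices made for this example.

```python
def causes_frameshift(indel_length):
    """Return True if an indel of this many base pairs shifts the reading frame.

    Convention for this sketch: positive = insertion, negative = deletion.
    """
    return abs(indel_length) % 3 != 0

for length in (1, -2, 3, -6):
    print(length, causes_frameshift(length))
```

Indels whose length is a multiple of 3 add or remove whole codons, so the downstream reading frame is preserved.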

Unique Identifier Tagmentation (UnIT) DNA SV Characterization

Characterization of a variety of potential DNA structural variants: • On-target editing • Inversion • Duplication • Inter-chromosomal translocation

Calculating allele frequency in a population

Ex. -CC n = 35 -TC n = 50 -TT n = 15 allele C = (2*35) +50 = 120 allele T = (2*15) + 50 = 80 allele C frequency = 120/[(35+50+15)*2] = 0.6 = 60% (major allele, meaning > 50%) -denominator: total # of chromosomes in a population allele T frequency = 80/200 = 0.4 = 40% (minor alleles) *ratios should add up to 1!*
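The same calculation as a short sketch, using the genotype counts from the example above (CC = 35, TC = 50, TT = 15). Variable names are made up for this illustration.

```python
# Genotype counts from the worked example above.
genotype_counts = {"CC": 35, "TC": 50, "TT": 15}

# Each person carries two copies, so the denominator is 2 x number of people.
total_chromosomes = 2 * sum(genotype_counts.values())

# Homozygotes contribute two copies of an allele, heterozygotes contribute one.
count_C = 2 * genotype_counts["CC"] + genotype_counts["TC"]
count_T = 2 * genotype_counts["TT"] + genotype_counts["TC"]

freq_C = count_C / total_chromosomes
freq_T = count_T / total_chromosomes
print(freq_C, freq_T)
```

Checking that the two frequencies sum to 1 is a quick way to catch a counting mistake.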

Example of association testing

Ex.: variation for allele: T or C association testing - a categorical trait: disease or not for T allele: Calculating odds ratio: -take ratio of T allele in the population (cases/controls) divided by the ratio of the C allele in the population (cases/controls) = 2.67 (this is the odds ratio) --ratio for T allele and C allele = cases/controls how to interpret this odds ratio: - = 1.0 : same odds for both alleles - > 1 : increased frequency in cases, making it a "risk allele" (doesn't prove it "causes" the disease, just means it is associated with the disease) - < 1 : reduced frequency in cases, making it a "protective allele" (associated with lower odds of the disease phenotype) ex. 2.67 is greater than 1, so T is a "risk allele"
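The notes give the resulting odds ratio (2.67) but not the underlying counts, so the allele counts below are hypothetical numbers chosen only to reproduce that value.

```python
# Hypothetical allele counts (not from the notes), chosen to give an odds ratio near 2.67.
t_cases, t_controls = 80, 30
c_cases, c_controls = 120, 120

# Ratio of cases to controls for each allele, then the ratio of those ratios.
t_ratio = t_cases / t_controls
c_ratio = c_cases / c_controls
odds_ratio = t_ratio / c_ratio
print(round(odds_ratio, 2))

if odds_ratio > 1:
    interpretation = "risk allele"
elif odds_ratio < 1:
    interpretation = "protective allele"
else:
    interpretation = "same odds for both alleles"
print(interpretation)
```

An odds ratio above 1 only shows association; a statistical test is still needed to check that it is not due to chance.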

haplotype

Group of alleles in an organism that is inherited together

Illumina Short-Read NGS Detection Limits of DNA Structural Variants

Inter-chromosomal Translocation vs. Chromosomal Inversion • Inter-chromosomal translocation and chromosomal inversion detection is highly sensitive

Two Classes of Potential Unintended Genome Editing with CRISPR/Cas9

Off-target DNA Editing -means that not only did CRISPR introduce an indel or frameshift at the transthyretin gene, but it also made a genetic alteration by introducing an insertion or deletion at another genomic location that has homology to the guide RNA sequence --meaning it has complementarity to the sequence, although not perfect. We know that CRISPR can still cut DNA with up to 6 base pair mismatches out of a 20 base pair targeting sequence -using genomics, need to determine where all those potential sites with up to 6 base pair mismatches could occur in the genome • This may occur at loci in the human genome with DNA sequence homology to the gRNA target sequence - imperfect pairing of gRNA and target DNA *ultimately using genomics and then targeted sequencing to discover and then validate potential off-target editing --> so that we can select the most specific guide RNA sequences (20 base pair sequence) for safe and effective human therapeutics DNA Structural Variants • This may occur as a natural consequence of double-strand DNA break repair -potential for misrepair of double-stranded DNA breaks • Imperfect DNA repair includes: 1. inter-chromosomal translocations 2. DNA inversions 3. DNA duplications 4. large deletions

Isogenic

Organisms with the same or highly similar genomic sequences -organisms used in the lab have very similar genomes

Off-Target Editing Validation with DNA Sequencing

PCR reactions: 1. targeted rhAmp PCR 1 -Activation of rhAmp primers by RNase H2 cleavage -Amplification 2. indexing PCR 2 -Amplification with indexing primers - rhAmpSeq is used for validation of potential off-target DNA editing (indels) loci discovered by Cas-OFFinder, SITE-Seq and GUIDE-Seq - Multiplex primers allow the enrichment of > 1,000 loci in a single PCR reaction

Computational Prediction of Off-Target Editing

Potential mismatched/homologous off-target editing sequences that are not perfectly complementary to the target sequence: -Mismatches: instead of having an A, there is a T (base-pair mismatch) -Bulges: occur in the DNA or RNA. means there is a gap (base-pair bulge), a deletion in the sequence (another type of potential off-target editing sequence) -Non-canonical PAM (more rare) - CRISPR/Cas9 can cleave DNA with up to 6 mismatches to the guide RNA (gRNA) - Computational prediction is log-linear in number of off-target sites and mismatches allowed (a log-linear relationship between the number of potential off-target sites and the number of base pair mismatches allowed) - *to keep off-target editing prediction manageable: Cas-OFFinder (Bae et al., 2014) allows up to 3 mismatches in non-coding DNA and up to 4 mismatches in protein-coding DNA (exons), which makes prediction tractable --allow more mismatches for exons because this is the region of the genome where we really don't want to have off-target editing (it is going to have functional consequences)
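The idea of searching for sites that resemble the guide within a mismatch limit can be sketched with a sliding window. This is a toy illustration, not how Cas-OFFinder is implemented: it ignores the PAM, bulges, and the reverse strand, and the guide and "genome" sequences are made up.

```python
def count_mismatches(site, guide):
    """Number of positions where two equal-length sequences differ."""
    return sum(a != b for a, b in zip(site, guide))

def find_candidate_sites(genome, guide, max_mismatches):
    """Return (position, sequence, mismatches) for each window within the mismatch limit."""
    k = len(guide)
    hits = []
    for i in range(len(genome) - k + 1):
        window = genome[i:i + k]
        mismatches = count_mismatches(window, guide)
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits

# Made-up 20-nt guide and a toy "genome" containing the exact target
# plus one site that differs from the guide at 2 positions.
guide = "ACGTTGCATGCCAGTACGAT"
near_match = "ACATTGCATGTCAGTACGAT"
toy_genome = "TTTT" + guide + "GGGG" + near_match + "TTTT"

print(find_candidate_sites(toy_genome, guide, max_mismatches=0))
print(find_candidate_sites(toy_genome, guide, max_mismatches=3))
```

Raising `max_mismatches` can only add hits, never remove them, which matches the point that allowing more mismatches yields more potential off-target sites.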

categorical vs. quantitative traits

categorical traits: -ex. blood type --defined by 3 different alleles (A, B, and O alleles) --but each person carries 2 of the 3 alleles (and so has a particular blood type) --because there are 3 alleles, there are nine possible maternal/paternal pairings (6 distinct genotypes), so people fall into distinct categories -these types of traits are likely to be monogenic: the phenotype/trait we see in the population is defined by variations in a particular gene quantitative traits (most of our phenotypes/traits): -ex. height --doesn't fall into categories, it is a smooth distribution in our population -this is because it is a polygenic trait: many genes shape height

gene association

figure out whether one or more genotypes within a population co-occur with a phenotype more often than would be expected by chance -used to figure out what are all the genes that determine height, for example, and which ones you have to have in combination -a lot of the alleles we carry in our population are old alleles; it's not new alleles, but rather the combination of alleles, that matters -common alleles (ones that are present in many people) are shuffled into a novel combination in each person -so our goal is to understand what combination of alleles causes a particular disease --difficult because it is hard to even show that a single gene is causing a phenotype

How HD was tested/discovered

multiple families --> analyze their genomes to figure out what markers segregate with the disease phenotype --> narrowed down to chromosome 4 --> candidate gene on chromosome 4 was identified --> amplify and sequence to see if there are any mutations --> Huntingtin/HTT

association testing

simple example of association testing: -looking at 2 alleles and a categorical trait: either have the disease or not -cases: people with the disease -controls: people without the disease -interested in one SNP, and the variation is that you can either have a C or T at that site --genotype can be CT, CC, or TT --is one allele more common in people with the disease? (want to calculate frequency of alleles) to determine if one allele is more represented than the other: -calculate the *odds ratio*: ratio of phenotypic proportions among carriers of two alleles --ends up being a measurement of effect size for binary/categorical phenotypes

Where and when is FOXD3 expressed in zebrafish?

two ways to do this: -tag FOXD3 with a fluorescent protein so you can make transgenic animals and then image those animals and look to see where there is expression of the fluorescent protein, like GFP for example (most likely the preferred method) -another method: perform a single-cell RNA-seq. experiment - take entire organism and sequence all the cells that are found there one by one and see which cells express FOXD3 (expensive way)

