MCB 104 - Genomics
coding sequences are highly conserved
1 - genome conservation 2 - exon conservation
how are plasmids transferred into bacterial cells?
1) by transformation 2) electroporation
find all of the genes involved in body fat regulation
1) start with library of 16,757 dsRNA-expressing E. coli -> 2.1) 1.85% (305) genes - reduced fat and no loss of viability 2.2) 0.7% (112) genes - increased fat and no loss of viability 2.3) 1.6% (261) genes - reduced fat and lethality/sterility -> difficult to assess whether these genes play specific roles in fat metabolism
considerations for an appropriate cloning strategy
1) what is known about the gene to be cloned - determine whether a nucleic acid or antibody probe can be used; chromosome walking 2) the size and nature of the gene - large genes require vectors capable of handling large inserts; small genes can be cloned in plasmids 3) the ultimate purpose for cloning the gene - whether for sequencing or expression and in what host cell; will determine the choice of cloning vector
RNAi immediately links phenotype to gene target
1) wide range of pathways and biochemical activities targeted 2) some of the affected genes are homologs of known fat/lipid metabolism genes in mammals 3) new findings ▫ beta-oxidation machinery required for fat storage ▫ targeting a regulator of sterol metabolism (SREBP) reduces fat storage ▫ disrupting HNF4a (a nuclear hormone receptor) increases fat storage
GWAS success
Complement Factor H association reveals a new biological basis for AMD and suggests specific treatments ▫ the AMD risk allele has an allele frequency of ~0.2 in humans worldwide
complementation testing - Drosophila eye color mutations
Drosophila eye color mutations produce a variety of phenotypes - do these phenotypes result from allelic mutations or from mutations in different genes?
why linkage disequilibrium is important for mapping in humans
In humans - greatly facilitates GWAS: ▫ because the human genome travels around in the population in chunks, haplotypes where mutations covary, then genotyping for GWAS only needs to "tag" or genotype each haplotype, not every base pair ▫ once a significant SNP is found with GWAS, the causative mutation must be on the haplotype that SNP is on - LD puts a 5' and 3' boundary on where the disease-causing mutation must be ▫ no need for genotyping multiple SNPs within a single haplotype because they all will associate with each other due to LD ▫ tagSNP - a representative SNP in a region of the genome with high LD that represents a group of SNPs called a haplotype
SSR and SSLP
SSR - simple sequence repeat = SSLP - simple sequence length polymorphism = VNTRs - variable number tandem repeats ex. microsatellite - site with highly variable number of short sequence repeats, usually 2-4 nt CACACACA... especially variable in populations
saturation mutagenesis
a genetic screen is saturated when you stop finding new loci (genes), but rather just more alleles (mutants) of the same loci; the genome is finite
clone
a heterologous piece of DNA carried within a plasmid
complementation test
a test for determining whether two mutations are in different genes (they complement) or the same gene (they do not complement) ▫ reveals whether two mutations are in a single gene or in different genes ▫ complementation group is synonymous with a gene ▫ complementation - the production of a wild-type phenotype when two different recessive mutations are combined in a diploid
genome-wide association study (GWAS)
a test of the association between markers (single nucleotide polymorphisms or SNPs) across the genome and disease or quantitative trait phenotype, usually involving hundreds of thousands of SNPs spread throughout the genome
DNA microarray
allele-specific oligonucleotides - short oligos that hybridize with alleles distinguished by a single base difference ▫ how does a microarray tell you what particular allele a person has? ▫ a detector records the fluorescence emitted by each area after shining a light on it ▫ the color and intensity of the fluorescence depend on the alleles and number of copies of the allele ▫ complementary base pairing allows for this to work - complementary DNA can hybridize
autosomal alleles have a more complicated history
because they are not sex-restricted and because of recombination, but provide richer information - can use autosomal allele frequencies to identify where a person came from
odds ratio
below 1 - protective effect; above 1 - hazardous effect; exactly 1 - no association
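A minimal sketch of how an odds ratio is computed from a 2x2 case/control table. The function name and all counts are hypothetical and chosen only for illustration.

```python
def odds_ratio(case_with, case_without, control_with, control_without):
    """Odds ratio from a 2x2 table of counts.

    case_with / case_without: cases carrying / not carrying the allele
    control_with / control_without: the same counts for controls
    """
    odds_in_cases = case_with / case_without
    odds_in_controls = control_with / control_without
    return odds_in_cases / odds_in_controls


# hypothetical counts, not taken from any real study
enriched_in_cases = odds_ratio(60, 40, 30, 70)   # allele more common in cases
depleted_in_cases = odds_ratio(20, 80, 40, 60)   # allele less common in cases

print("OR (enriched):", round(enriched_in_cases, 2))
print("OR (depleted):", round(depleted_in_cases, 2))
```

An odds ratio above 1 means the allele is more common among cases than controls; below 1 means it is less common among cases.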
mitochondrial eve
best current guess - 200,000 years ago in sub-Saharan Africa - most diversity there ▫ there is also a Y chromosomal Adam who lived at roughly the same time
what method was used to trace human migrations?
by locating different mitochondrial lineages in different world regions, we can find similarities in haplotypes that contain the same mutations
any region of the genome...
can be genotyped ▫ for every genotyped region, F2s fall into discrete genotypic categories (ex. AA, Aa, aA, aa) ▫ genotyped markers that are linked are inherited together (the more similar the inheritance pattern, the closer the linkage) ▫ chromosome regions can be molecularly genotyped so their segregation can be followed in crosses and pedigrees ▫ any phenotype can be measured in each F2 individual ▫ do these genotypic categories differ by phenotype?
how do we link phenotype to genotype?
challenge - most DNA polymorphisms don't affect phenotype ▫ less than 2% of the human genome consists of codons within genes ▫ even when they occur, many mutations of codons are silent - they don't change the amino acid ▫ if a particular mutation is not silent and has deleterious effects, natural selection could often lead to its disappearance from the human population the proposal: to identify genetic linkage between the disease gene and a chromosome you need ▫ a family (and their DNA) where some people have the disease and some don't ▫ the ability to track, in that family, which child got which chromosome from which parents
quantitative trait loci (QTL)
chromosome regions containing a gene or genes that influence a quantitative trait
RNA interference (RNAi) - doing without mutations
classical mutagenesis: ▫ diversity of mutations - point mutations, deletions, inversions, etc. ▫ heritable, stable, and quantitative ▫ saturating the genome requires hitting multiple genes repeatedly RNAi screen ▫ high throughput ▫ reasonably equivalent disruption of each locus ▫ phenotype instantly linked to disrupted gene ▫ not heritable ▫ doesn't generate full depletion of target RNA ▫ can lead to spurious disruption of off-target genes
CRISPR
clustered regularly interspaced short palindromic repeats - repeats in bacterial genomes ▫ CRISPR is the only genome engineering technology that uses RNA (rather than protein) to direct nuclease activity to the desired location in a genome; RNA is both cheaper and faster to synthesize than protein ▫ spacers were found to be homologous to sequences from bacteriophage ▫ are CRISPR elements part of a phage defense system? ▫ new method based on bacterial adaptive immunity: adaptation (CRISPR locus) -> crRNA biogenesis (CRISPR transcript -> crRNAs) -> invader silencing (effector crRNP)
recombinant DNA can be propagated in bacteria
creating recombinant plasmids - not always perfect ▫ N - plasmid closes back up ▫ N - gene goes in backwards ▫ Y - gene goes in forwards
Hardy-Weinberg equilibrium
describes a population in which the allele and genotype frequencies do not change from generation to generation; the state of an ideal population that obeys the assumptions of the HW law ▫ HW describes an idealized population ▫ comparing allele frequencies in a real population to HW helps us understand how organisms evolve ▫ genotype frequency - proportion of total individuals in a population that are of a particular genotype ▫ allele frequency - the proportion of all copies of a gene in a population that are of a given allele type
linkage disequilibrium
deviation in the frequency of haplotypes in a population from the frequency expected if the alleles at different loci are associated at random; the nonrandom association between two loci ▫ observed haplotypes fit expected frequencies based on allele frequencies vs. observed haplotypes do not fit expected frequencies
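One common way to put a number on this deviation is the coefficient D = f(AB) - f(A) * f(B) for two biallelic loci. The sketch below assumes that definition; the haplotype frequencies are hypothetical.

```python
def linkage_disequilibrium(hap_freqs):
    """D = f(AB) - f(A) * f(B) for two biallelic loci.

    hap_freqs: dict of haplotype frequencies keyed 'AB', 'Ab', 'aB', 'ab'
    (frequencies are assumed to sum to 1).
    """
    freq_A = hap_freqs["AB"] + hap_freqs["Ab"]
    freq_B = hap_freqs["AB"] + hap_freqs["aB"]
    return hap_freqs["AB"] - freq_A * freq_B


# hypothetical populations
random_association = {"AB": 0.25, "Ab": 0.25, "aB": 0.25, "ab": 0.25}
strong_association = {"AB": 0.5, "Ab": 0.0, "aB": 0.0, "ab": 0.5}

print(linkage_disequilibrium(random_association))  # observed fits expected
print(linkage_disequilibrium(strong_association))  # observed does not fit expected
```

D is zero when observed haplotype frequencies match the product of the allele frequencies and nonzero when the loci are nonrandomly associated.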
end reads from multiple inserts can be sequenced in parallel
end read - a short stretch of sequence read from one end of a clone; length of read is constrained by sequencing technology, not by the length of the clone
a trait controlled by two loci
experiment - how is a continuous trait, such as kernel color in wheat, inherited? ▫ methods - cross wheat having white kernels and wheat having purple kernels; intercross the F1 to produce F2 (P generation -> F1 generation -> F2 generation); break into simple crosses and combine results ▫ conclusion - kernel color in wheat is inherited according to Mendel's principles acting on alleles at two loci
lactase regulation in primates
fetus low expression -> infant high expression -> adult low expression - lactase non-persistence -> adult high expression - lactase persistence
forward genetic screens vs CRISPR
forward genetic screens will always be better than "reverse genetics"-based strategies at searching for genes influencing a phenotype in a totally unbiased manner
TALEN proteins
fused to nuclease (TALEN) can be easily designed for specific cleavage of DNA sequence ▫ how TALEN differentiates - DNA molecules have different shaped grooves running along their edges, and it is these groove shapes (rather than the sequences themselves) that the TALEN proteins use to discriminate among sequences
allele frequencies of adults in population predict genotypes of offspring
in a large population of randomly breeding individuals with no new mutations, no migrations, and no differences in fitness based on genotype: p^2 + 2pq + q^2 = 1
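A short sketch of the arithmetic: allele frequencies are counted from adult genotypes, then the expected offspring genotype frequencies follow from p^2, 2pq, and q^2. Function names and genotype counts are hypothetical.

```python
def allele_frequency(n_AA, n_Aa, n_aa):
    """Frequency p of allele A from genotype counts (each individual carries 2 copies)."""
    total_copies = 2 * (n_AA + n_Aa + n_aa)
    return (2 * n_AA + n_Aa) / total_copies


def expected_genotypes(p):
    """Expected genotype frequencies under Hardy-Weinberg assumptions."""
    q = 1 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


# hypothetical adult population: 640 AA, 320 Aa, 40 aa
p = allele_frequency(640, 320, 40)
offspring = expected_genotypes(p)
print(p, offspring)
```

The three expected frequencies always sum to 1, which is the identity p^2 + 2pq + q^2 = 1.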
Why is the elimination of a fully recessive deleterious allele by natural selection difficult in a large population vs a small population?
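in small populations - genetic drift due to random sampling of finite gamete pools can alter the frequency of an allele rapidly until it eventually becomes lost or fixed (loss is more likely if the allele is deleterious) ▫ in large populations - drift is weak and mating is closer to random; a rare recessive allele is carried mostly by heterozygotes, where selection cannot act on it, and only the rare homozygotes (frequency q^2) are selected against, so the allele frequency barely changes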
reverse genetic screen
known gene -> phenotype resulting from its alteration; used to find out how a known gene affects a phenotype
bicoid mutants
missing the anterior region
a change in the DNA sequence does not need to change the phenotype of the organism for us to be able to track it
molecular markers: ▫ segregate according to Mendelian rules ▫ can be used to search for linkage with human disease ▫ once linked to disease or trait, marker can be used as starting point to find linked relevant DNA sequence changes for disease/trait
inducing mutations
mutant - an organism or cell carrying a mutation ▫ can induce mutations and cross the mutants to look for mutant phenotypes
homology map for a 100 kb region of the human genome
orthologous genes are genes in two different species that arose from the same gene in the species' common ancestor
two oligonucleotide primers (16-26 nt) are needed for PCR reactions
region between the two primers will be synthesized ▫ one primer is complementary to one strand of DNA at one end of the target region ▫ the other primer is complementary to the other strand of DNA at the other end of the target region
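The primer geometry can be sketched in a few lines. This is a toy model assuming exact matches only; the template and primers are hypothetical and much shorter than real 16-26 nt primers so the example stays readable.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def pcr_product(template, forward_primer, reverse_primer):
    """Return the top-strand sequence amplified between two primers, or None.

    forward_primer has the same sequence as the top strand at one end of the
    target (it anneals to the bottom strand). reverse_primer anneals to the
    top strand at the other end, so its reverse complement appears in the
    top strand.
    """
    start = template.find(forward_primer)
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse_primer)
    end = template.find(reverse_site, start + len(forward_primer))
    if end == -1:
        return None
    return template[start:end + len(reverse_site)]


# hypothetical template (top strand, 5' -> 3') and primers
template = "GGATCCAAGTCTTTTTTGACCATGAATTCC"
product = pcr_product(template, "AAGTC", "CATGGTC")
print(product, len(product))
```

The product spans from the first base of the forward primer to the last base of the reverse primer's binding site, primers included.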
Xanthomonas
secretes DNA binding molecules (TAL effectors, TALEs) with a modular "core" that mediates specific DNA binding in the host nucleus; TALENs are engineered fusions of these proteins to a nuclease
SNP
single nucleotide polymorphism AA(G/A)GCTCAT ^^ polymorphic site - some chromosomes in population have G, others have A ▫ can be genotyped with many different molecular methods ▫ by chance, some SNPs eliminate or create a restriction enzyme cut site - PCR analysis of restriction site-altering SNPs ▫ most SNPs don't alter restriction sites - DNA microarrays or whole genome sequencing can be used for simultaneous detection of millions of SNPs ▫ most common genetic variation among people ▫ represents a difference in a single nucleotide - once every 300 nt ▫ useful - act as markers for researchers to locate genes associated with disease ▫ typically bear no effect on health/development
association mapping
test large numbers of present-day individuals for genetic variants to find those variants that correlate statistically with differences in phenotype
to link genotype and phenotype for any trait...
you need to collect genotypes and phenotypes from a population (e.g. an F2 genetic cross for QTL mapping, or cases and controls for GWAS)
genome
▫ humans are diploid ▫ 23 pairs (so 46 total) of chromosomes, which are long, linear chains of DNA ▫ range in size from 50,000,000 to 250,000,000 basepairs
applications of recombinant DNA technology
▫ used to diagnose and screen for genetic diseases ▫ gene therapy ▫ used to make pharmaceutical products - recombinant insulin and clotting factors
cloning inserts into vectors produces recombinant DNA
▫ vector and foreign DNA are cut with the same restriction enzyme, then ligated together with DNA ligase - simplest method of cloning ▫ disadvantage - matching restriction sites may not be available; ligating also produces undesirable products (vector ligating to itself without foreign DNA insert)
linkage mapping
▫ works well for simple (single gene) traits like cystic fibrosis and Huntington disease ▫ works less well for common/complex traits like heart disease and cancer
two ways to change a genome
1) "natural" genome alterations - meiosis I and meiosis II 2) genome altering "technologies" ▫ mutagens (X-rays, EMS, etc.) - these cause random changes ▫ genome engineering technologies (RNAi, CRISPR, etc.) - these cause directed changes
DNA sequence coverage of the first rough draft of the human genome
1) 2001 ▫ draft sequence - 93% of genome ▫ error rate of 1/10,000 2) 2003 ▫ accurate sequence ▫ 97% of genome 3) 2006 ▫ finished sequence ▫ 99% of genome with 99.99% accuracy
RNAi - linking it back to mammals
1) RNAi highlights shared ancestry of C. elegans and mammalian fat storage regulation - disrupting homologous genes in worms and mice reveals comparable phenotypes 2) 50% of C. elegans fat regulatory genes identified in screen have mammalian homologues not previously implicated in fat storage - new pathways to target for drug discovery? 3) next steps - pedigree analysis in human obesity and quantitative trait studies in rodents
cloning vectors - characteristics
1) an origin of DNA replication so they can be maintained in a cell 2) a gene, such as antibiotic resistance, to select for cells that carry the vector 3) a unique restriction site or series of sites to cut and ligate a foreign DNA molecule
types of quantitative traits
1) continuous traits ▫ vary continuously ▫ ex. human height, blood pressure, carrying capacity 2) meristic traits ▫ measured in whole numbers ▫ ex. animal litter size, number of vertebrae, wing feather number 3) threshold traits ▫ measured by presence or absence ▫ ex. susceptibility to disease, malaria resistance
detection of SSR polymorphisms by PCR and gel electrophoresis
1) determine sequences flanking microsatellites 2) amplify alleles by PCR 3) analyze PCR products by gel electrophoresis
RNAi
1) dsRNA or shRNA 2) dicer -> siRNA duplex 3) Ago -> 4) formation of RISC -> 5) siRNA/mRNA-complex -> 6) sliced mRNA (silencing) ▫ the endogenous roles of RNAi can be exploited to disrupt the function of specific genes across the genome ▫ key concept - RNAi results in gene "knock-down" not "knock out" ▫ reverse genetics
two major categories of genes that will fail to be detected by a genetic screen (forward or reverse)
1) genes that cause lethality 2) genes that are "redundant" - genes that perform nearly identical biological functions but are located at different genomic loci
forces that disrupt HW/changes in allele frequencies/evolution
1) mutation 2) drift 3) selection 4) migration 5) non-random mating
two methods for linking genetic variation with phenotypic variation
1) quantitative trait loci (QTL) mapping 2) genome-wide association studies (GWAS)
common disease/common variant CD/CV hypothesis
common disorders are likely influenced by genetic variation that is also common in the population ▫ many common diseases are caused by common alleles that individually have little effect but in concert confer a high risk ▫ how do we find the variants that matter most?
Y chromosome migration map
different mitochondrial lineages are found in different parts of the world - can use to reconstruct human migrations
double-strand breaks facilitate gene targeting by homologous recombination
homologous recombination: ▫ the exchange of DNA strands of similar or identical nucleotide sequence ▫ can be used to direct error-free repair of double-strand DNA breaks ▫ gene knockouts, deletions, and point mutations are readily made ▫ gene tags can be inserted where needed
genetic cross/pedigree
in a genetic cross or pedigree, you can: ▫ study the segregation of a phenotype ▫ study the segregation of any molecular genotype ▫ map genetic distance with molecular markers ▫ studying both genotypes and phenotypes from the same samples lets you look for linkage of molecular genotypes and the phenotype of interest
body fat regulation in mice and worms
known regulators 1) mice ▫ tubby - conserved protein expressed in the hypothalamus ▫ HTR2C - serotonin receptor 2) worms ▫ tub-1 - Tubby homolog ▫ tph-1 - serotonin biosynthetic enzyme ▫ daf-2 - insulin signaling ▫ can worms tell us something new about fat regulation in mammals?
whole-genome shotgun sequencing
make three genomic libraries ▫ plasmid library - 2 kb inserts ▫ plasmid library - 10 kb inserts ▫ BAC library - 200 kb inserts ▫ obtain 1000 bp sequence reads from ends of each clone ▫ computational assembly of sequences into chromosomes you can make a genomic library with a specific size insert (ex. by running fragmented genomic DNA on gel and cutting out specific bands such as 2 kb, 10 kb, 200 kb)
genomic position of trait/disease associated SNPs
of 465 unique SNP associations: ▫ 43% intergenic ▫ 45% intronic ▫ 9% nonsynonymous coding ▫ 2% in 5' or 3' UTR ▫ 2% synonymous coding ▫ most variation is due to regulatory variants
paired-end reads can be used to join two sequence contigs
paired-end reads - sequences of both ends of a piece of DNA that can connect two contigs (a stretch of contiguous sequences assembled from the sequences of multiple overlapping reads) into a scaffold (represents a large stretch of sequence built up from multiple contigs; may contain gaps)
phenotype vs genotype
phenotype - any observable feature of a living thing genotype - the information contained in the lowest level of this hierarchy, the genome living things are hierarchies of populations of other living things, nested in each other
forward genetic screens
phenotype resulting from alteration -> discover gene underlying phenotype (used to find genes affecting any biological process) 1) site directed mutagenesis: ▫ systematic induction of mutations to observe a phenotypic outcome ▫ allows observation of which alleles on certain genes are responsible for phenotypic traits 2) balancer chromosomes: ▫ modified chromosomes used for genetically screening a population of organisms to select for heterozygotes ▫ keep homozygous lethal or sterile mutations from being lost from a population ▫ prevent multiple alleles on the same chromosome from being separated by meiotic recombination
We decide to expose a population of Drosophila to radiation. One larva is blue and another is red. When both of these individuals become adults, we will look at the phenotypes of their offspring. Assuming both of these mutations are recessive, if all offspring are WT, what can we say about complementation of these phenotypes?
the mutations are located on different genes
genome annotation
the process of attaching biological information to genome sequences (ex. determining which subset of the genome sequence is transcribed) genomic sequences - an average of the sequence of multiple physical copies of DNA; these alone don't tell you anything about the function of the DNA sequence - although it's a great start
PCR detection of the sickle cell-causing SNP
the sickle-cell mutation eliminates an MstII restriction site ▫ PCR of the region containing the SNP produces a 500 bp fragment from both alleles (normal and sickle cell) ▫ digestion of the PCR product with MstII produces two smaller fragments from the normal allele, but doesn't affect the sickle-cell allele
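The digest logic can be sketched as a pattern search. MstII recognizes CCTNAGG and cuts after the second C; the flanking sequences below are hypothetical filler, not the real beta-globin sequence, and the toy amplicon is far shorter than 500 bp.

```python
import re

# MstII recognition site: CC^TNAGG (N = any base)
MSTII_SITE = re.compile(r"CCT[ACGT]AGG")


def digest(seq):
    """Cut seq at every MstII site and return the fragment lengths in order."""
    cut_positions = [m.start() + 2 for m in MSTII_SITE.finditer(seq)]
    bounds = [0] + cut_positions + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


# hypothetical flanking sequence standing in for the rest of the PCR product
left_flank = "A" * 20
right_flank = "T" * 30
normal_allele = left_flank + "CCTGAGG" + right_flank   # site intact
sickle_allele = left_flank + "CCTGTGG" + right_flank   # A -> T change destroys the site

print("normal:", digest(normal_allele))   # two smaller fragments
print("sickle:", digest(sickle_allele))   # one uncut fragment
```

On a gel, a homozygous normal sample would show two small bands, a homozygous sickle-cell sample one full-length band, and a heterozygote all three.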
population genetics
the study of genetic variation and how it changes in time and space; provides building blocks for massive-scale evolution; theoretical approach to understand how genotypes cause phenotype
If two species share a common ancestor about 1 million years in the past...
then all of the orthologous DNA sequences shared by these species also share a common ancestor about 1 million years in the past
gel electrophoresis
▫ GE acts as a molecular sieve ▫ gel - aqueous matrix ▫ DNA molecules are loaded into a slot at one end ▫ when an electric field is applied, the negatively charged DNA migrates towards the positive electrode ▫ shorter DNA molecules are less hindered by the gel and migrate faster than longer DNA molecules
genome engineering using programmable nucleases
▫ Zinc Finger Nucleases - ZFNs ▫ TALENs ▫ CRISPR/Cas9 all have in common that a programmable, DNA sequence specific binding domain is coupled to an endonuclease activity (DNA nuclease activity)
drift
▫ founder effect - 210 affected Huntington's patients in Dutch South Africans ("Afrikaners") from more than 50 families (80% of Afrikaner patients surveyed during 1980 survey) were found to be ancestrally related through a common progenitor in the 17th century ▫ small vs large populations
Zinc finger DNA binding motif
▫ highly specific and well characterized DNA binding ▫ modular protein domain can be fused to other domains ▫ Zinc fingers fused to endonuclease (ZFN) to provide precise DNA cutting and DNA insertion
haplotypes
▫ linkage between 2 loci on same chromosome ▫ haploid genotypes - a combination of alleles at multiple loci on the same chromosomal homolog ▫ "share a haplotype"
population genetics - the mathematics that unifies all studies of living things
▫ mathematics - starts with definitions and makes deductions using only the rules of algebra and set theory ▫ living things - its purpose is to describe using math the hierarchical structure of living things and make predictions about how one level of the hierarchy causes observable features of levels higher up
diverse organisms display RNAi
▫ model animals - Drosophila, C. elegans, mouse, etc. ▫ non-model animals - cnidaria, beetles, crickets, crustaceans ▫ Tetrahymena ▫ Dictyostelium ▫ Plants - Arabidopsis, maize ▫ Fungi - Neurospora
quantitative genetics
▫ most traits in nature are quantitative ▫ most traits are influenced by genetics AND environment environmental variance + genetic variance = phenotypic variance
ligation + recombinant plasmids - EcoRI
▫ need to use enzyme to cut recognition site above and insert the target gene of interest ▫ EcoRI has made cuts at both sites in the gene and the plasmid below ▫ addition of DNA ligase + complementary base pairing = recombinant plasmid
haplotypes*
▫ new alleles/mutations occur on a single chromosome ▫ when they are young, and at low frequency in population, tend to still be found in original context (i. e. they are strongly linked to loci on that chromosome) ▫ but if they reach high frequency, recombination will have scrambled this background except in the immediate vicinity of the mutation ▫ so, for a given allele, haplotype size related to age
idealized genome assembly
▫ number of contigs should equal number of chromosomes ▫ every nucleotide should be accounted for - no gaps ▫ no errors in sequence
population genetics - reproduction
▫ population loses all of its hierarchical structure ▫ new population is made by sampling alleles with replacement from previous generation to populate genotypes in the new generation
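Sampling alleles with replacement is easy to simulate. The sketch below is a minimal Wright-Fisher-style model with one locus, two alleles, and no selection, mutation, or migration; population sizes, the starting frequency, and the seed are hypothetical.

```python
import random


def next_generation(p, n_individuals, rng):
    """Draw 2N allele copies with replacement; return the new frequency of allele A."""
    copies = 2 * n_individuals
    count_A = sum(1 for _ in range(copies) if rng.random() < p)
    return count_A / copies


def simulate(p0, n_individuals, generations, seed=0):
    """Track the frequency of allele A across generations of random sampling."""
    rng = random.Random(seed)
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_generation(freqs[-1], n_individuals, rng))
    return freqs


small = simulate(0.5, n_individuals=10, generations=50)
large = simulate(0.5, n_individuals=5000, generations=50)
print("small population, final frequency:", small[-1])
print("large population, final frequency:", large[-1])
```

With small N the frequency wanders widely and can hit 0 (lost) or 1 (fixed), after which it stays there; with large N it stays close to its starting value over the same number of generations.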
human populations are unstable
▫ population size growing ▫ populations dividing ▫ populations migrating ▫ environment changing
starting assumption of mutagenesis
▫ the genome must be saturated with mutations ▫ this works best if your mutagenesis strategy results in an even distribution of mutations across the genome ▫ if your mutagenesis strategy is non-random, many more individuals will need to be screened and some genes or genetic elements might not be mutated
20,687 genes x 6.3 transcripts/gene
▫ we used manual and automated annotation to produce a comprehensive catalogue of human protein-coding and non-coding RNAs as well as pseudogenes ▫ referred to as the GENCODE reference gene set ▫ includes 20,687 protein-coding genes with, on average, 6.3 alternatively spliced transcripts (3.9 different protein-coding transcripts) per locus ▫ in total, GENCODE-annotated exons of protein-coding genes cover 2.94% of the genome or 1.22% for protein-coding exons
gel electrophoresis distinguishes DNA fragments according to size
▫ with linear DNA fragments, migration distance through gel depends on size ▫ after electrophoresis, visualize DNA fragments by staining gel with fluorescent dye, and photograph gel under UV light ▫ determine size of unknown fragments by comparison to migration of DNA markers of known size
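Sizing an unknown band against markers can be sketched as interpolation. This assumes the common approximation that log10(fragment size) falls roughly linearly with migration distance between neighboring markers; the ladder values and distances are hypothetical.

```python
import math


def estimate_size(distance, ladder):
    """Estimate fragment size (bp) from migration distance.

    ladder: list of (size_bp, distance) pairs for marker bands.
    Interpolates log10(size) linearly between the two flanking markers.
    """
    points = sorted(ladder, key=lambda marker: marker[1])
    for (size1, dist1), (size2, dist2) in zip(points, points[1:]):
        if dist1 <= distance <= dist2:
            fraction = (distance - dist1) / (dist2 - dist1)
            log_size = math.log10(size1) + fraction * (math.log10(size2) - math.log10(size1))
            return 10 ** log_size
    raise ValueError("distance is outside the range covered by the ladder")


# hypothetical ladder: (size in bp, distance migrated in mm)
ladder = [(1000, 10.0), (500, 18.0), (100, 30.0)]
print(round(estimate_size(24.0, ladder)))
```

Shorter fragments travel farther, so a band that runs between the 500 bp and 100 bp markers gets a size between those two values.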