BIS 183 Midterm 1
Pro/Con of Microarray
PRO: 1. can use a small amount of amplified material 2. well-established algorithms 3. easy to use CON: 1. requires a sequenced genome 2. coverage is not comprehensive (because you're selecting for unique genes) 3. intensity results are relative (a spot can only be so "bright"; saturation means you cannot explore the full dynamic range) 4. depends on hybridization
PRO/CON of Illumina
PRO: in 3 days you can get 100bp reads (17Gb SR, 68Gb PE) on HiSeq Con: A-T miscalls are very common
Differences between Illumina and PacBio
PacBio: 1. has a special (pausing) polymerase 2. requires no terminator because it is real time 3. requires a zero-mode waveguide 4. reads are significantly longer 5. higher error rate (random) 6. fluorophore linked to the phosphate Whereas Illumina: 1. normal polymerase 2. requires multiple wash steps/factors 3. uses a flow cell 4. requires amplification 5. low error rate (non-random, A-T) 6. fluorophore linked to the base
Reason for Normalization
To control variation within and between samples so that you can properly perform statistics
Orphan node
nodes not connected by any edges. I would assume they are often discarded
Metric to determine coverage in genomes
- Lander/Waterman equation - tells you genome coverage, but does not work for transcripts since they don't cover the whole genome - coverage = (read length * number of reads) / genome length
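A minimal sketch of the coverage formula above. The function names and the example numbers are made up for illustration; the uncovered-fraction estimate assumes reads land uniformly at random (the usual Poisson assumption behind Lander/Waterman).

```python
import math

def coverage(read_length_bp, num_reads, genome_length_bp):
    """Average coverage c = (read length * number of reads) / genome length."""
    return read_length_bp * num_reads / genome_length_bp

def fraction_uncovered(c):
    """Expected fraction of bases with zero reads, assuming random read placement."""
    return math.exp(-c)

# Hypothetical run: 100 bp reads, 36 million reads, 120 Mb genome
c = coverage(100, 36_000_000, 120_000_000)
print(c)                      # 30.0
print(fraction_uncovered(c))  # very close to 0 at 30x
```

The same arithmetic is not meaningful for a transcriptome because the "genome length" term has no fixed equivalent: transcripts are present at very different abundances.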
How would you create a randomized network?
- keep the same number of nodes, but randomly switch the edges n times; repeat this to find the average for a random network.
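One way to "switch the edges" is sketched below. It assumes a directed edge list and uses degree-preserving swaps (a->b, c->d becomes a->d, c->b), which is a common choice but not necessarily the one used in lecture; `randomize_edges` and the example network are hypothetical.

```python
import random

def randomize_edges(edges, n_swaps, seed=0):
    """Shuffle a directed network while keeping every node's in- and out-degree."""
    rng = random.Random(seed)
    edges = list(edges)
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new1, new2 = (a, d), (c, b)
        # Skip swaps that would create a self-loop or a duplicate edge
        if a == d or c == b or new1 in present or new2 in present:
            continue
        present -= {edges[i], edges[j]}
        present |= {new1, new2}
        edges[i], edges[j] = new1, new2
    return edges

network = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"), ("A", "C")]
print(randomize_edges(network, n_swaps=100, seed=1))
```

Running this with many different seeds gives a set of random networks; counting a pattern in each and averaging gives the baseline to compare the real network against.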
Dicer
- processes double-stranded RNA into short dsRNA molecules - piRNA doesn't use Dicer
Argonaute
- unwinds short dsRNA and helps it find its target
Steps to PacBio:
1. synthesize cDNA 2. add cDNA to zero-mode waveguide 3. add phospho-linked dNTPs 4. add polymerase 5. shine laser to excite the fluorophore and detect the emitted light 6. fluorophore is released when the phosphodiester bond is formed - results are read as emitted intensity over time
Multiplexing
when you use a different barcode for EACH LIBRARY, so multiple libraries can be pooled and sequenced in a single run, which saves money
What did the deep sequencing of Arabidopsis reveal about small RNAs contd
- most small RNAs were found at centromeres and transposons/retroelements - microRNAs are transcribed from specific loci (why there are huge peaks), whereas the siRNAs have broader representation - (rdr) looking at two different mutations of RNA-dependent RNA polymerases (what makes siRNA) - (dcl) Dicer-like mutants - allows you to ask: which small RNAs does the enzyme process? Is there developmental specificity? - notice that rdr2 and dcl3 have greatly reduced levels of siRNA (because the mutants cannot make and process them) - rdr6 likely plays a role in regulation of 21nt RNAs in seedlings (difference in blue line vs wt), showing that in seedlings both rdr6 and RNA pol 2 are needed
Biological replicate
- need at least three - independent samples of the same condition taken from different individuals (i.e. the same tissue from three different plants or people), so that biological variation between samples can be measured
Tiling Microarray
- partially overlapping probes - very helpful with ChIP - still requires a genome sequence though - pro: can give comprehensive coverage of the genome - con: issue with redundancy because cDNA could bind to multiple places and you can't decipher which signal belongs to which (will show expression for protein #1 when it really should all be #2), which makes data analysis harder
Network Motif
- patterns of interconnections that recur in many different parts of a network at frequencies higher than in randomized networks - i.e. the consensus sequence of systems theory - three motifs were found: 1. Feedforward loop 2. SIM 3. DOR
Ping-pong cycle
- piRNAs are made and transported out of the nucleus - processed (cut) and loaded onto PIWI, which binds to the target mRNA and cuts it. The next Argonaute (AGO3) now separates the strand and the piRNA is once again transferred to a PIWI - goes against the conserved catalytic core (because no Dicer)
Reading a Heat Map
- rows= different transcripts - columns= different samples - allows you to visualize the relative transcript abundance (so middle colors are the average expression) ex allows you to conclude that 1. some transcripts have low/high expression vs other samples 2. some groups of transcripts have low abundances 3. some groups show differential gene expression between stages
What did the deep sequencing of Arabidopsis reveal about small RNAs
- sequenced different types of siRNA at 3 different developmental points of Arabidopsis and in mutants - revealed there are different proportions of siRNAs during development (blue at Col0 at inflorescence vs seedling), meaning there is developmental regulation of siRNA - the highest proportion of siRNA was found at the inflorescence (just before reproduction) stage of the plant - most siRNAs are found at inflorescence, and there are equal proportions of miRNAs & siRNAs in seedlings - plant had tissue specificity - realize the different colors on the graph represent different nucleotide lengths: blue= 21nt (miRNA), green= 22nt, pink= 23nt, purple= 24nt (siRNA) - Col0= wt, the rest are mutants
Similarity between Agglomerative and Partitioning methods
- since they both have randomized initialization steps, you can run the clustering multiple times to get a consensus result - their goal is to maximize within-cluster similarity
Microarrays
- substrate: glass, plastic, chip - probe: ssDNA printed on the substrate and arranged in a grid-like manner; can also be cDNA/synthetic oligo (must be the coding strand of transcripts so that it is complementary to the cDNA) - target: labelled ssDNA from a biological sample used to query the probe (does the target sequence hybridize to the probe?) - measurement: intensity of spot fluorescence, or the difference in intensity between a match and a mismatch
Possible complication of systems biology
- the ability of systems to evolve. - nodes and especially edges may change quite frequently. - proteins have different functions, friend groups collapse, etc
Moore's law
- describes how quickly technology advances: the number of transistors on microprocessors doubles every two years - sequencing technology has been steadily outpacing it
Variation in genome size is caused by
- the differing amounts of non-coding DNA - genomes have about 20-30K genes and an exon space of about 30-60Mb, meaning that the vast majority of the space is non-coding
Systems Theory
- the organization of variables -reduces a biological system down to the basic building blocks to study it and reveal the emergent properties - very common for population ecology but can be useful for genomics as well 1. define all components of the system -1a. Define all nodes (variables) -1b. Define all edges (interactions between nodes)(usually with mathematical descriptions) 2. Systematically change and monitor components of the system 3. Reconcile the observed responses with those predicted by model 4. Design/Perform new perturbation experiments to distinguish between multiple or competing model hypotheses
PIWI
- type of Argonaute protein - these proteins were immunopurified - the experiment was set up so they could pull out just the PIWI Argonaute from a cell mixture, with the target crosslinked to the Argonaute - reversed the crosslink to sequence just the parts that were linked to it - made piRNA mutants to show it actually reduced expression - PIWI cleaves antisense molecules
Affymetrix GeneChips
- uses short oligos (25-mers on GeneChips) - has 10-20 different oligos for each gene, which must be 1. unique to the genome 2. overlapping - you have a mix of perfect match and mismatch probes - from this you can infer gene expression by analyzing (average perfect-match intensity / average mismatch intensity) - able to represent 77% of human genome and 75% of Arabidopsis
Technical replicate
- when you run the same sample multiple times - maybe using different machines/techniques - important for minimizing variance due to machine error
RADseq
- Restriction site Associated DNA tags - good tool for SNP discovery and genome sequencing in non-model organisms lacking complete reference genomes
Transcriptome
- The complement of all transcripts produced within an organism, organ, tissue or cell type (transcriptomics is the study of it) - very variable in time/space - Regulated deposition
experimental tractability
- concept that this isn't the best way, but experimentally it is the only/simplest one - i.e. even though proteins would be more useful to study, without amplification techniques it is almost impossible, so we use transcriptomics
Transcriptomics
1. isolate mRNA from different tissues (as proxy for proteins) 2. annotate it (define start/stops, align to reference genomes to infer functions etc) - previously every cDNA had to be cloned and Sanger sequenced just to get ESTs - now we just sequence the whole thing and get contigs (where 1 contig= 1 transcript)
Airport Network system example
1. node: SFO, LAX, JFK, planes 2. edges: flights (which connect each node) - you begin to see hubs and the most important nodes by asking "what happens if you remove LAX?": how many flights are affected if you remove LAX vs Stockton airport
Facebook Network system example
1. node: friends, photos, likes, status 2. edges: connected as friends (person-person) or connected as photo (tagged photo) - you can start to look at who is most popular - suddenly the edges' directionality becomes very important: Mark may be friended more often, while April is the one who friends people back.
Regulatory Network System example
1. node: repressors, activators, enhancers (other cis elements), RNA pol II etc 2. edges: an activator/repressor binds a cis element and activates or represses transcription by recruiting or not recruiting RNA pol II.
Other ways siRNA can influence expression
1. post-transcriptional silencing 2. mRNA cleavage 3. translational inhibition (can prevent the ribosome from binding) 4. affect chromatin
How can you make small RNA libraries?
1. purify 21-24nt RNA by size selection on a gel 2. since there's no 5' cap you can use an adapter that recognizes the open 5' PO4 (so you're only getting RNA that's been cleaved) 3. convert it into cDNA (using hexamers) 4. make library 5. sequence 6. assemble
Issues with RNA profiling of a non-sequenced genome
1. redundancy ( gene family similarity) may cause you to over/under estimate expression 2. picking a correct reference genome 3. alternatively spliced variants 4. not every transcript is expressed at the same time or place
Illumina unique needs:
1. reversibly blocked 3' terminator - there is an azide (N3) group blocking the 3' OH, so the next nt cannot be added until the N3 is removed - there is also a fluorophore attached to each nt which must be cleaved as well (between the residual linker on the base and the fluorophore)
Typical Model organisms need:
1. small/numerous 2. serves as reference 3. easily transformable 4. well annotated genome 5. quick reproduction cycles
PacBio unique needs:
1. special polymerase - the fluorophore is attached to the phosphate so it is automatically cleaved when the nt is incorporated (far fewer steps without the termination) 2. phospho-linked dNTPs 3. zero-mode waveguide
Steps to Illumina
1. fragmentation (by sonication, fragmentase, transposase etc) 2. add adapter sequences ('SBS') 3. size fractionate (~200bp) 4. enrich for the sequence size you want 5. PCR 6. select size 7. ligate on P5 and P7 8. add barcodes if multiplexing 9. wash over flow cell 9a. extension and remove template 10. bridge amplification into clusters 11. degrade reverse strand 12. add reversibly terminated, fluorescently labelled nucleotides and polymerase (1 base additions) 13. wash away unincorporated nt and take photo 14. cleave fluorophore and release the N3 15. repeat steps 12-14 - if paired end you will start at 10 again but degrade the opposite strand.
Average length of a transcript
~1 kb
Rough number of genomes sequenced as of now:
Archaeal: 10K Bacterial: 55K Eukaryotic: 12K vs in 2013 Archaeal: 181 Bacterial: 3K Eukaryotic: 183
Metric to determine coverage in transcripts
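N50 - a length-weighted median of transcript (contig) lengths, not the mean - sort the contigs from longest to shortest; the N50 is the length at which the running total first reaches 1/2 of all assembled bases, so half of the assembly is in contigs at least as long as the N50.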
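A small sketch of how N50 is usually computed in assembly statistics (sort longest to shortest and stop when the running total reaches half of all bases). The function name and the toy lengths are made up; note that the result differs from both the mean and the plain median of the lengths.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no lengths given")

contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]   # toy lengths, total = 54
print(n50(contigs))                      # 8  (10 + 9 + 8 = 27 = half of 54)
print(sum(contigs) / len(contigs))       # 6.0, the mean, for comparison
```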
Are miRNAs transcribed by one gene?
No, - there can be multiple loci which code for the same miRNA, meaning there are physically different miRNA genes - a miRNA sequence can be found in multiple primary transcripts - remember that transcriptomics will still report these sequences because the transcript still has a poly A tail, it's just not getting translated
One cluster represents
One original cDNA fragment (the cluster is many identical copies of it made by bridge amplification)
PRO/Con of PacBio:
Pro: - longer reads (~10000bp) - real time sequencing - good for heterochromatic regions that are harder to seq - does not require amplification (less error and bias) Con: - costly - HIGH error rate 12-18%. it is random error though unlike Illumina
RNAi
RNA interference - post-transcriptional mRNA destruction - two types: 1. miRNA 2. siRNA - "detect (foreign nucleic acids) and destroy them" - has a conserved catalytic core - think of the examples of 1. tobacco virus 2. blue petunia and 3. unc C. elegans
Box plots
Way to visualize the data as quartiles - bottom of the box is the 25th percentile, the middle line is the 50th (median), and the top of the box is the 75th - the dots are the outliers - want the medians of the samples to be about the same
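A quick sketch of the numbers behind one box in a box plot. The 1.5 x IQR rule for calling outliers is the common plotting convention and is an assumption here, as are the function name and the toy data.

```python
import statistics

def box_plot_summary(values):
    """Return (Q1, median, Q3, outliers) for one sample."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # assumed outlier fences
    outliers = [v for v in values if v < low or v > high]
    return q1, median, q3, outliers

# Toy expression values with one extreme point
print(box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 30]))  # (3.0, 5.0, 7.0, [30])
```

Comparing the medians returned for each sample is the check described above: after normalization they should sit at about the same level.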
1. Feedforward loop
X and Y regulate Z but X also regulates Y
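A sketch of how the feedforward pattern could be searched for in a directed edge list (X -> Y, Y -> Z and X -> Z all present). The function name and the example edges are hypothetical.

```python
def feedforward_loops(edges):
    """Return all (X, Y, Z) with edges X->Y, Y->Z and X->Z."""
    targets = {}
    for src, dst in set(edges):
        targets.setdefault(src, set()).add(dst)
    loops = []
    for x, x_targets in targets.items():
        for y in x_targets:
            for z in targets.get(y, ()):
                # Z must also be a direct target of X; all three nodes distinct
                if z in x_targets and len({x, y, z}) == 3:
                    loops.append((x, y, z))
    return sorted(loops)

regulation = [("X", "Y"), ("X", "Z"), ("Y", "Z"), ("Z", "W")]
print(feedforward_loops(regulation))  # [('X', 'Y', 'Z')]
```

Counting these in the real network and in randomized versions of it is what decides whether the pattern qualifies as a motif.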
2. Single input module (SIM)
X regulates many genes and itself - benefit: you can make huge amounts of one protein at once. ie fight/flight hormones
Are siRNAs transcribed by one gene?
Yes, 1 siRNA is made from 1 locus - also they just target one gene (whereas miRNA can target more broadly) - but more commonly siRNA come from outside sources like viruses or scientists
Do correlations in gene expression provide any info about the biology?
Yes, you can INFER function based on expression similarity ("guilt by association") - sparked the gene ontology movement - however, remember that you can only infer function; in order to DETERMINE function you need genetics
A model is a...
a) pen and paper diagram expressing experimental observations b) computed demonstration of experimental observations c) computed demonstration of inferred observations d) a simulation of transcription factors that bind to promoters e) all of the above
The increase in throughput of next generation sequencing is due to...
a. improvement in sequencing chemistry b. whole genome amplification kits d. improvement in data handling and advanced bioinformatics e. replacing gel and column purification kits with magnetic beads
C-value paradox
as the complexity of organisms increases, genome size does not necessarily increase with it
Why is Pearson distance better?
because TFs can vary by 1. binding affinity for the target site 2. strength of interaction between the activation site and RNA pol II - so a second gene may be less highly expressed due to a mutation in the binding site, but it will be expressed in similar conditions, so it will have the same profile shape and only the magnitude is affected
Q value
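the false discovery rate attached to a result: the expected proportion of false positives among all the tests called significant at that threshold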
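One common way to get such values is the Benjamini-Hochberg adjustment, sketched below. Whether the course computed Q values this way is an assumption (other estimators exist); the function name and the p-values are made up.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease with rank
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(bh_adjust([0.01, 0.04, 0.03, 0.005]))  # approximately [0.02, 0.04, 0.04, 0.02]
```

Calling every gene with an adjusted value below 0.05 significant then means roughly 5% of that list is expected to be false positives.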
Agglomerative clustering method
ex. of hierarchical - start with each gene as its own cluster and successively join the closest clusters until all the genes are in one super cluster - similar to building phylogenies
Partitioning clustering method
ex. k means - subdivide data into a pre-determined number of subsets without any implied hierarchical relationship.
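A bare-bones k-means sketch to show what "pre-determined number of subsets" means in practice. It assumes squared Euclidean distance and random starting centers; the function name, parameters and toy points are hypothetical, and real analyses would use a library implementation.

```python
import random

def kmeans(points, k, n_iter=20, seed=0):
    """Split points (tuples of numbers) into k clusters; returns a list of clusters."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # randomized initialization
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return clusters

genes = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
print(kmeans(genes, k=2))
```

Because the start is random, different seeds can give different answers on messier data, which is why the clustering is usually repeated several times.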
EST
expressed seq tag. - unique seq of cDNA that can be used to identify it
Normalization between samples
for between arrays - must center data - have to normalize variation between experiments - often use box/whisker plots - often you lose the outliers
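A tiny sketch of what "centering" can look like: subtract each array's median so every array ends up with a median of 0. It assumes log-scale intensities and median centering specifically; other schemes exist, and the function name and values are made up.

```python
import statistics

def median_center(log_values):
    """Shift one array's log intensities so that their median is 0."""
    med = statistics.median(log_values)
    return [v - med for v in log_values]

array_1 = [1.0, 2.0, 6.0]
array_2 = [4.0, 5.0, 9.0]      # same pattern, brighter overall
print(median_center(array_1))  # [-1.0, 0.0, 4.0]
print(median_center(array_2))  # [-1.0, 0.0, 4.0]
```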
Unsupervised analysis
meaning that there is no specification of desired patterns - you really let the data tell you - ex of clustering
miRNA
micro RNA - precisely target endogenous genes - derive from actual genes 1. miRNA gene is transcribed by RNA pol II holoenzyme 2. this long primary transcript will form a stem-loop structure (meaning the 5' and 3' ends are complementary) - there must be one mismatched base pair in the stem though ~transferred to cytoplasm~ 3. Dicer will recognize the dsRNA by the mismatch bulge and cleave it into 21-24nt lengths 5. the product will bind to RISC (AGO1), which unwinds the duplex to give an ssRNA that is complementary to part of an mRNA 6. the RISC complex will recruit other factors to help dice the complementary mRNA
3. Dense overlapping regulons
multiple X factors regulating many overlapping Y factors - benefit: there is a lot of redundancy so mutations do not carry as large of an effect. ex could be good in developmental periods.
FPKM
- Fragments Per Kilobase of transcript per Million mapped reads - you also have to consider transcript length when you look at coverage - if you just look at raw read counts you may conclude that the two transcripts with the most reads are overexpressed, but read count also depends on transcript length (a longer transcript yields more fragments at the same expression level) and on sequencing depth; FPKM corrects for both so that transcripts and samples can be compared.
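The definition written out as arithmetic: fragments divided by transcript length in kb, divided by millions of mapped fragments. The function name and the example numbers are made up for illustration.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)

# Two hypothetical transcripts with the SAME raw count in a 10-million-fragment library
short_tx = fpkm(500, transcript_length_bp=500, total_mapped_fragments=10_000_000)
long_tx = fpkm(500, transcript_length_bp=5000, total_mapped_fragments=10_000_000)
print(short_tx, long_tx)
```

Equal raw counts here do not mean equal expression: the 500 bp transcript comes out ten times higher per kilobase than the 5 kb one.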
DNA reminders
- RNA = 2' OH, unlike DNA which is deoxyribose - A&G = purine = larger - T&C = pyrimidine - mRNA has the same sequence as the coding strand = sense (with U in place of T), which complements the template strand (antisense) - the formation of the phosphodiester linkage releases a pyrophosphate (what 454 used as marker) - transcript processing includes: intron splicing, polyadenylation, 7-methylG 5' cap
Gene Ontology
- a directed acyclic graph that groups genes either as 1. cellular component (chloro, ribosome) 2. molecular function (enzyme, signaling) 3. biological process ( reproduction, root development etc)
In-degree
- a network metric which describes the number of edges that go into a node - how many PEOPLE know you?
Out-degree
- a network metric which describes the number of edges that go out of a node - how many people do YOU know?
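A short sketch that counts in-degree and out-degree from a directed edge list, using friend requests as the example (an edge A -> B means A friended B). The function name and the extra names besides Mark and April are made up.

```python
from collections import Counter

def degrees(edges):
    """Return (in_degree, out_degree) counters for a directed edge list."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    return in_deg, out_deg

requests = [("April", "Mark"), ("Bo", "Mark"), ("Cy", "Mark"),
            ("April", "Bo"), ("April", "Cy")]
in_deg, out_deg = degrees(requests)
print(in_deg["Mark"], out_deg["Mark"])    # 3 0  -> many people know Mark
print(in_deg["April"], out_deg["April"])  # 0 3  -> April knows many people
print(in_deg.most_common(1))              # [('Mark', 3)] -> candidate hub by in-degree
```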
Euclidean Distance
- clusters based on expression level ( so it will cluster by vertical axis)
Pearson Distance
- clusters based on expression patterns (so it will cluster based on shape)
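A sketch contrasting the two distances on toy expression profiles. Pearson distance is taken here as 1 minus the Pearson correlation, a common convention that is assumed rather than taken from the notes; function names and values are made up.

```python
import math

def euclidean(x, y):
    """Straight-line distance: sensitive to expression level."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - Pearson correlation: sensitive to profile shape only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)   # undefined for a completely flat profile

gene_a = [1, 2, 3, 2, 1]
gene_b = [10, 20, 30, 20, 10]   # same shape as gene_a, 10x the magnitude
gene_c = [3, 2, 1, 2, 3]        # similar magnitude to gene_a, opposite shape

print(euclidean(gene_a, gene_b), euclidean(gene_a, gene_c))
print(pearson_distance(gene_a, gene_b), pearson_distance(gene_a, gene_c))
```

Euclidean distance puts gene_a next to gene_c (similar level), while Pearson distance puts gene_a next to gene_b (same shape), which is the behaviour wanted for co-regulated genes.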
Multiple Hypothesis Testing
- considering a set of statistical inferences simultaneously - with 1 test at a 5% significance level, there is only a 5% chance of being wrong if you reject a true null. However, if you have 100 tests, it is much more likely that at least one of your rejections is incorrect. - this is why you calculate your Q value
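The "much more likely" can be put into numbers with a one-line calculation. This sketch assumes the tests are independent and that every null hypothesis is actually true; the function name is made up.

```python
def prob_at_least_one_false_positive(alpha, n_tests):
    """Chance of one or more false rejections across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

print(prob_at_least_one_false_positive(0.05, 1))    # about 0.05
print(prob_at_least_one_false_positive(0.05, 100))  # about 0.994
```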
piRNA
- doesn't use Dicer - longer small RNAs (24-29nt) - made from ssRNA by the PIWI class of Argonaute proteins - the majority they found came from unannotated transcripts that were transposons (i.e. these are in charge of reducing transposon/retroelement expression) - NOT found in plants (only mammals, flies, and worms) - generated by the ping-pong cycle
Motivation for the Agave paper
- for biofuel reasons we want plants that can sequester carbon more efficiently - agave fixes a large amount of carbon and also grows in non-farmable soil - has CAM (crassulacean acid metabolism) - what they did was sequence the whole transcriptome (3 different tissues from 2 different species at multiple developmental stages) - mainly used Illumina but used PacBio for better contigs - got 35K genes (more like 30K after accounting for alternative splicing)
RNAseq
- goal is to annotate cDNA by seeing how much and where it maps to genome mRNA > cDNA > RNAseq lib> SSR > contigs > find ORFs > compare to other genomes
Clustering
- grouping data into clusters whose members are more similar to each other than to anything outside the group - defined by a distance metric (measure of similarity) 1. Pearson 2. Euclidean - two different clustering methods 1. Agglomerative 2. Partitioning - implies you have common regulation if you're clustering by expression level (genes in a group often share biological processes, and this can help to infer function) - individual clusters can be analyzed to compare their properties (like biological processes/cis element identification etc) - also a little scary because it is largely unsupervised data analysis with many methods to choose from.
Systems Biology is often used for
- identifying points of critical failure - what is the most connected node and what will happen if it disappears - this is what hackers use, counterterrorism methods, etc
Establishing miRNAs
- if a miRNA cleaves its mRNA target you get two fragments, one with the 5' cap and a 3' OH, another with a 5' P and the 3' poly A tail - so you sequence transcripts with a polyA tail and look for ones with a 5' phosphate - align them back to the genome - look upstream from the transcript to see if there is a predicted miRNA sequence there (does it come from a stem-loop? does it complement our mRNA? is the stem-loop expressed in the same tissue where the target is found?)
Evolutionary Advances of siRNAs
- in both plants and animals, gametophytes have companion cells (in close proximity) - their job is to generate siRNAs which silence transposons in the real gametophyte (trying to make sure the genes that get passed on are prime) - method to ENSURE no activation of transposons takes place.
siRNA
- in nucleus - transcribed by RNA pol IV (if transcribed at all and not artificial) - comes from either: viruses, scientist injections, or antisense RNA produced in the cell that pairs back with its sense strand 1. a long antisense RNA is produced (asRNA with no poly A tail) ~into cytoplasm~ 3. Dicer binds to the long dsRNA molecule and cleaves it into smaller dsRNA 4. RISC binds the dsRNA, finds the mRNA and cleaves it
Is it worth sequencing the miRNAs for each organism?
- it is important to understand how post-transcriptional regulation influences phenotype and disease - important for capturing an organism's immune response (viral silencing responses)
Two best clustering methods
- k means and self organizing map
Supervised analysis
- meaning you have pre-determined patterns you want to identify - ex. doing all possible comparisons through ANOVA; this really is not necessary though, and is just exhaustive
miRNA as disease classifier
- miRNA can predict whether humans will have certain cancers (bc miRNAs needed to silence them) - they use a sweep of 22 miRNAs as a biomarker test
miRNA in animals
- miRNAs can be transcribed from multiple loci like before - however, in plants the miRNA is 21nt and perfectly matches the target - in animals they are 22-24nt long and only match their target through a seed (i.e. not the whole length, more like 7-8nt) - so in animals you get a far larger number of possible matches (more cleavage experiments needed)
PE runs
paired end runs - fragments are usually 300bp long, and you sequence 100-150bp in from each side - after the forward strand is sequenced with the forward index, wash away the forward strand and sequence the reverse index and reverse strand = you get more contiguous sequences
Emergent Properties
properties which you can only identify after looking at multiple node interactions; they cannot be deciphered from single nodes (single genes) - ex. looking at one animal's behavior you can't tell what it's doing, but when you look at the community level it makes more sense (lion, wildebeest)
Normalization within a sample
reasons for within a microarray: 1. unequal amounts of cDNA (inaccurately bright) 2. differences in hybridization (some sequences just work better/stronger) 3. differences in scanning (camera dead spots) reasons for within an mRNA-seq: 1. transcript coverage (analyze FPKM) 2. threshold for a "present" gene (how many reads before you say it's actually expressed; usually 3-5) 3. rRNA contamination (there is a lot of rRNA, and it greatly affects the ability to detect mRNA)
SR runs
single read runs - 1 direction only - 50 pictures= 50 length reads -100 pictures= 100 len reads, etc.
Hub
type of node which has the most connectivity - can be defined as a disproportionately high number of either incoming or outgoing edges - a hub is a single node but there can be multiple hubs - i.e. which node is most popular
Why isn't Illumina enough right now?
we need 1. longer reads 2. real time seq (quicker) 3. to finish incomplete genomes (this may be due to modifications like methylation/compaction) aka POLISHING/finishing the genomes
Expression value
will be a relative fluorescence value - ex. 3-fold higher fluorescence than average = why it's given in ratios
miRNA discovery
you need to 1. find loci with a stem-loop 2. sequence the small RNAs 3. look at evolutionary history (ex. 102 miRNA families, only 8 in mosses) - i.e. if it's definitely there in another organism, follow it up 4. where are they expressed? 5. how much are they expressed? - can say whether there is developmental regulation 6. does the miRNA regulate a target? (evidence it can cleave a transcript = genetics)
Common statistical tests that may be performed
1. T test 2. ANOVA (better for looking at differences between multiple samples) - harder for RNAseq because both require normally distributed data, whereas RNAseq counts follow a negative binomial distribution
Steps to Microarray
1. Target Labeling - use RT to make cDNA and add fluorescently labeled dNTPs, which are incorporated into the cDNA 2. Scan - the laser beam will excite each spot of DNA and detect the amount of fluorescence emitted 3. Analysis - results are given as ratios (relative to each other) - issue with extremes because a spot can only be so bright/dim
Pros/Cons of having a fully sequenced reference genome vs a fully sequenced reference transcript
1. The genome is like a "one stop shop": initially it's more work, but then you're done, whereas with transcripts you keep having to go back at different spaces/times 2. a transcript reference can tell you which genes are protein coding and their expression, whereas a genome can't tell you expression 3. genomes can tell you physically where the genes are located, which the transcript reference cannot 4. a transcript reference may be less useful because it's not fully continuous due to alternative splicing 5. sequencing the transcriptome of a mutant vs wt is sufficient to determine function, and you can't do that with a genome
When making a biological question:
1. be specific about what developmental stage/cell type/mutant you are investigating; these conditions can change what you will be able to see 2. Choose an appropriate expression pattern platform (consider the money, time, coverage, annotation - which reference, alignment - how restrictive) 3. Look at levels of replication 4. Normalization methods 5. Statistics on results - what is significant?
