cGenomics_ lec4b_Sequencing_technology
Amplicon sequencing
> Amplicon sequencing is where you use PCR to amplify a small region of interest and then sequence it. It is an approach used in metagenomics, where you might want to identify all the bacteria in a particular population. > How? Take all the genomes from the bacteria in a sample, amplify just the 16S rRNA gene using primers specific to that 16S region, sequence all the amplicons from that population, and then align them to the different genomes to determine which species were in that population. > Why 16S rRNA? The 16S ribosomal RNA gene is useful because it has regions of high conservation, where primers can bind to amplify the 16S gene from a wide variety of species, but it also has regions that are variable enough to tell which species a given 16S amplicon came from.
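The amplification step can be pictured as an "in silico PCR". The following is a minimal sketch, assuming exact primer matches and toy sequences invented for illustration (the names `extract_amplicon`, `species_a` and `species_b` are hypothetical, and real 16S primers are not used here):

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(seq))


def extract_amplicon(genome, fwd_primer, rev_primer):
    """Return the region bounded by the two primer sites, or None if either is missing.

    The reverse primer anneals to the opposite strand, so on the top strand
    we search for its reverse complement.
    """
    start = genome.find(fwd_primer)
    if start == -1:
        return None
    rev_site = reverse_complement(rev_primer)
    end = genome.find(rev_site, start + len(fwd_primer))
    if end == -1:
        return None
    return genome[start:end + len(rev_site)]


# Toy genomes: identical (conserved) primer sites, different (variable) middle.
fwd, rev = "ACGTAC", "AGTCAG"
species_a = "TTTT" + "ACGTAC" + "GGGAAA" + "CTGACT" + "CCCC"
species_b = "GGGG" + "ACGTAC" + "TTTCCC" + "CTGACT" + "AAAA"

amplicon_a = extract_amplicon(species_a, fwd, rev)
amplicon_b = extract_amplicon(species_b, fwd, rev)
```

Both toy genomes are amplified by the same primer pair, yet the two amplicons differ in the middle, which is the property that makes the 16S gene useful for telling species apart.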
Illumina sequencing youtube video transcript: much better. Read this https://www.youtube.com/watch?v=womKfikWlxM. Just clustering and bridge amplification picture
> Clustering is a process wherein each fragment molecule is isothermally amplified. The flow cell is a glass slide with lanes. Each lane is a channel coated with a lawn composed of two types of oligos. > Hybridization is enabled by the first of the two types of oligos on the surface. This oligo is complementary to the adapter region on one of the fragment strands. > A polymerase creates a complement of the hybridised fragment. The double stranded molecule is denatured and the original template is washed away. > The strands are clonally amplified through bridge amplification. In this process the strand folds over, and the adapter region hybridises to the second type of oligo on the flow cell. Polymerases generate the complementary strand, forming a double stranded bridge. This bridge is denatured, resulting in two single stranded copies of the molecule that are tethered to the flow cell. The process is then repeated over and over and occurs simultaneously for millions of clusters, resulting in clonal amplification of all the fragments. > After bridge amplification, the reverse strands are cleaved and washed off, leaving only the forward strands. The three prime ends are blocked to prevent unwanted priming.
What do you mean by the term "READ" that is frequently used in DNA seq?
> Read: the results of a sequencing method that you get is called a read.
Define scaffold and contig, consensus region
A chromosome is a very long molecule of DNA, and it is very hard to study all at once. So researchers break it into smaller pieces, sequence each of those pieces individually, and then attempt to put them together to reconstruct the original chromosome sequence. > Contig: A contig (from contiguous) is a set of overlapping DNA segments that together represent a consensus region of DNA. Or: a contig is a contiguous length of genomic sequence in which the order of bases is known to a high confidence level. > Consensus sequence: In molecular biology and bioinformatics, the consensus sequence (or canonical sequence) is the calculated order of the most frequent residues, either nucleotide or amino acid, found at each position in a sequence alignment. > Scaffold: Scaffolds are composed of contigs and gaps. So, in genomic mapping, a scaffold represents a series of contigs that are in the right order but not necessarily connected in one continuous stretch of sequence.
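The "most frequent residue at each position" definition of a consensus sequence can be sketched in a few lines. This is an illustration only, assuming the reads are already aligned and padded with "-" where they do not cover a column (the function name `consensus` and the toy reads are made up):

```python
from collections import Counter


def consensus(aligned_reads):
    """Return the most frequent base in each alignment column.

    Gap characters ("-") are ignored; a column covered by no read is
    reported as "N".
    """
    length = len(aligned_reads[0])
    result = []
    for i in range(length):
        column = [read[i] for read in aligned_reads if read[i] != "-"]
        if column:
            result.append(Counter(column).most_common(1)[0][0])
        else:
            result.append("N")
    return "".join(result)


# Three overlapping reads; the third carries one error (A instead of T at column 3).
reads = [
    "ACGTAC----",
    "--GTACGG--",
    "--GAACGGTT",
]
contig = consensus(reads)  # -> "ACGTACGGTT"
```

The single error in the third read is outvoted by the other two reads, which is why overlapping coverage gives a contig whose bases are known with higher confidence than any one read.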
copy number variation
A copy number variation (CNV) is when the number of copies of a particular gene varies from one individual to the next. Following the completion of the Human Genome Project, it became apparent that the genome experiences gains and losses of genetic material. The extent to which copy number variation contributes to human disease is not yet known. It has long been recognized that some cancers are associated with elevated copy numbers of particular genes. Copy number variation is a type of structural variation where you have a stretch of DNA which is duplicated in some people, and sometimes even triplicated or quadruplicated. And so when you look at that chromosomal region, you will see a variation in the number of copies in normal people. > Sometimes those copy number variants include genes, maybe several genes, which may mean that this person has four copies of that gene instead of the usual two, and somebody else has three, and somebody else has five. Interesting, we didn't really expect to see so much of that. > It's now turning out to be pretty common, and in some instances, if those genes are involved in functions that are sensitive to the dosage, you might then see a consequence in terms of a disease risk.
Sequencing technology > pros and cons of sequencing choice > general principles > hybrid approaches
Basically, we're going to cover a few different sequencing approaches and a little bit of the evolution of the technology. We're going to give an overview of some of the key sequencing technologies, but there are too many to cover them all. So we will do Sanger sequencing, microarrays, Illumina sequencing, and the two companies working in long read sequencing. > We will also discuss a little bit about the pros and cons of sequencing choice for different application areas. It's a dynamic environment and the technology is changing constantly. The main players that you have now may not be the main players that are around in five years' time. > So we're going to look at the general principles and not go into the chemistry in any great detail. It is tempting to think that you'll only be using modern data at this point, since Illumina is the major player in short read sequencing. However, hybrid approaches using two types of sequencing technology to tackle a particular biological question are growing in popularity. And there's also a massive amount of historical data in public databases from older technologies that you may find useful for different approaches.
Paired-end sequencing
Bio-star: Paired-end alignment typically means keeping track of and reporting the alignment of both reads in a read pair. Each read is aligned separately, and the information on both reads is combined and reported in the same alignment line. So, you are aligning both the forward and reverse strand. Hence it is called paired end. ---------------- We also have another approach to this technology called paired end sequencing. As I mentioned with the adapters in the earlier slide, that particular adapter was designed to read from both ends of the insert > and that allows us to do a few tricks to help us get over large repetitive regions in gene and genomic alignment. So, although we can only get 150 to 200 bases of sequence from short read sequencing, we can overcome some of these problems. Because we know what the insert size is (we've done the size fragmentation), we know what read 1 is, and we can get read 2 as well. And then when we align that to the genome, it can help us bridge repeat areas that otherwise would be very confusing to genome alignment. So the video that I put in the week 2 module for you to watch about Illumina sequencing actually shows the paired end sequencing methodology. So have a look at that. https://www.youtube.com/watch?v=fCd6B5HRaZ8
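How the two reads relate to one insert can be sketched as follows. This is a toy illustration, assuming an error-free fragment and a fixed read length (the function name `paired_end_reads` and the example fragment are hypothetical):

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(seq))


def paired_end_reads(fragment, read_len):
    """Simulate reading an insert from both ends.

    Read 1 comes from the start of the top strand; read 2 comes from the
    other end and is read off the opposite strand, so it is the reverse
    complement of the last read_len bases. The inner distance is the
    unsequenced stretch between the two reads.
    """
    read1 = fragment[:read_len]
    read2 = reverse_complement(fragment[-read_len:])
    inner_distance = len(fragment) - 2 * read_len
    return read1, read2, inner_distance


fragment = "AACCGGTTAC"
read1, read2, inner = paired_end_reads(fragment, 3)
```

Because the insert size is roughly known, an aligner can expect read 2 to land a predictable distance from read 1, which is the information that helps bridge a repeat sitting between them.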
Repetitive regions: Why are long reads better suited to repetitive regions than short reads?
In genome assembly, repetitive regions are very problematic, and short read sequencing has a lot of trouble resolving them. 1. > That is because if a repetitive region covers an area of the genome that is longer than the read length (fig: see the short read in purple), the aligner cannot tell where the read belongs, because the repeat occurs five times. This short read could align here, or here, or here, or here. > But a long read can bridge across that repetitive region and allow us to determine not only that there is a repetitive region there, but also how big it is and how many repeats it contains. 2. > Interspersed repeats cause a similar problem. A short read that matches the repeat cannot be assigned to this copy or that copy, because they're identical. > On the other hand, a long read can span a series of interspersed repeats and determine their precise locations. So long read sequencing is much better suited to repetitive regions.
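The ambiguity can be made concrete by counting the places where a read matches a reference exactly. This is a toy sketch, assuming exact matching on an invented reference with a five-copy tandem repeat (`match_positions` is a hypothetical helper, not an aligner):

```python
def match_positions(reference, read):
    """Return every start position where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


# Unique flank + repeat unit "CAG" five times + unique flank.
reference = "GATTACA" + "CAG" * 5 + "TTGACC"

short_read = "CAGCAG"                       # lies entirely inside the repeat
long_read = "ACA" + "CAG" * 5 + "TTG"       # spans the repeat and both flanks

short_hits = match_positions(reference, short_read)
long_hits = match_positions(reference, long_read)
```

The short read fits at four different offsets within the repeat, while the long read has exactly one placement and also reveals that the repeat has five copies.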
Pac Bio sequencing: script https://www.youtube.com/watch?v=_lD8JyAbwEo
Introducing the PacBio sequencing system, powered by single molecule, real time or SMRT sequencing technology. Here's how SMRT sequencing works. > First, from any sample type, ranging from viruses to vertebrates, DNA or RNA is isolated. > Next, a SMRTbell® library is created by ligating adapters to double stranded DNA, creating a circular template. > Primer and polymerase are added to the library, which is placed on the instrument for sequencing. > At the core of SMRT sequencing is the SMRT® Cell, which contains millions of tiny wells called Zero-Mode Waveguides, or ZMWs. > A single molecule of DNA is immobilised in each ZMW, and as the polymerase incorporates labelled nucleotides, light is emitted. > With this approach, nucleotide incorporation is measured in real time. > With the sequencing system, you can optimise your results with two sequencing modes: - use Circular Consensus Sequencing (CCS) mode to produce highly accurate long reads, known as HiFi reads (>99% accuracy), or - use the Continuous Long Read (CLR) sequencing mode (half of reads >50 kb) to generate the longest possible reads.
Types of Conserved Sequences. What is the significance of conserved sequences? https://www.commonlounge.com/discussion/cbec99e22825470d9318333139ddb8c2
In evolutionary biology and genetics, conserved sequences refer to identical or similar sequences of DNA, RNA or amino acids (proteins) that occur in different or the same species over generations. These sequences show very minimal changes in their composition, or sometimes no changes at all, over generations. Conserved sequences can be categorized into two major categories: 1. orthologous and 2. paralogous. > A conserved sequence is called orthologous when identical sequences are found across species > and it is called paralogous when identical sequences are found within the same genome over generations. Biological Significance: Conserved sequences found in different genomes can be either coding sequences or non coding sequences. As coding sequences, amino acids and nucleic acids are often conserved to retain the structure and function of a certain protein. These sequences undergo minimal changes. When changes happen, they usually replace an amino acid or nucleic acid with one which is biochemically similar. Similarly, other mRNA related nucleic acid sequences are often conserved. Non coding sequences, like ribosome binding sites, transcription factor binding sites, etc., are also conserved sequences. Computational Significance: Conserved sequences help us find homology (similarity) among different organisms and species. Phylogenetic relationships and trees can be developed, and ancestry can be inferred, using the data on conserved sequences. A common example is the conserved sequence "16S rRNA", which is used to reconstruct phylogenetic relationships among various bacterial phyla. Conserved sequences can also be used to mark the origin of genetic disorders and mutations. By comparing genomes which have a certain conserved sequence in common, we can easily identify anomalies, if any exist. > Finding Conserved Sequences with K-mers: In this section, we will see how, given a section of a single DNA, we can find short conserved sequences.
The conserved sequences we are looking for are called regulatory motifs. Regulatory motifs are short DNA segments (say 15-30 nucleic acids) which control the expression of genes, i.e. how many times a gene is transcribed, and hence how much of the corresponding protein is produced. > K-mers are substrings of length k that are found in the input string. In case of computational genomics, the input string represents a sequence of amino acids or nucleic acids. For example 5-mers refer to substrings of length 5, and 7-mers refer to substrings of length 7.
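Enumerating k-mers is a sliding window over the string. The following is a minimal sketch, assuming a plain DNA string as input (the function name `count_kmers` and the toy sequence are made up for illustration):

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every substring of length k in the sequence.

    A sequence of length n contains n - k + 1 k-mers (overlapping windows).
    """
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


sequence = "ATGCATGCA"
kmer_counts = count_kmers(sequence, 3)
most_common_kmer, most_common_count = kmer_counts.most_common(1)[0]
```

K-mers that occur more often than expected by chance are candidates for conserved motifs; for realistic motif lengths (15-30 bases) the same function applies with a larger `k`.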
Microarray signals
So on the left, we have some raw data that was retrieved from an Affymetrix microarray * and this was converted into a data file which just recorded image intensity at each x y coordinate, * and that was correlated with which gene that probe is related to. > the Illumina bead chip was similar, though we had two colours.
Illumina workflow
So the beginning of an Illumina workflow always starts with double stranded DNA that might be from an organism or a mixture of organisms. > If you're interested in gene expression and you're looking at messenger RNA, then you first have to convert that sample into DNA to make it double stranded. Illumina can only sequence DNA. > First of all, it's broken up into little pieces, called fragmentation and size selection is done to get a range of sizes that will work well for sequencing. The adapters are important, and we can see here that the adapters are illustrated in these two different colours. And they are added to both ends of the DNA fragment. Sometimes an amplification step is included, but not always. https://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf
Advantage of long reads
So the big advantage of long reads is that you can do a lot of things that you couldn't normally do with short reads. > Long read: These two histograms show a standard read length distribution for Oxford Nanopore. We can see that most of the reads are between two and five kilobases in length, but they readily get reads up to about 20 kb. There is also a methodology that uses a different chemistry approach in nanopore to get extra long reads. So it makes some things feasible. > No library prep: The technology can also read the sequence of native DNA after it's extracted without any processing. So no library prep is needed. > Repeat regions: The length of these reads can span repeat regions, which can be very useful for genome assembly. > Computation needs: Also, because there are a lot fewer reads per region, your computation needs are sometimes not as great. > Portability: Nanopore tech is also very portable. It has been used on the International Space Station and in Antarctica, and can be used in the field for outbreaks such as Ebola and Zika, where some of the other more demanding technologies are not installed.
Pacific Biosciences ( PacBio)
So the other technology that we want to mention is PacBio. PacBio is short for Pacific Biosciences. > They add adapters to the ends of your target DNA of interest to form what they call a SMRTbell. > I think the name is meant as the opposite of a dumbbell. So they have this continuous bit of DNA that can be sequenced by a polymerase. That SMRTbell is flowed into a flow cell (continued on the next slide).
Long read sequencing: Long read technology
So the two long read technologies are PacBio (Pacific Biosciences) and Oxford Nanopore. They use different methods: > PacBio passes the DNA through a well with a polymerase at the bottom, and it records the signal from each nucleotide as it is added. > In Oxford Nanopore, the DNA passes through a molecular pore, a protein pore in an artificial membrane. The instrument detects changes in ion current as the DNA strand moves through the membrane and predicts which sequence was passing through at the time each current change was detected.
Sequencing diagram
So this, in one picture, gives you another representation of it. > We have the (template) DNA fragments that have the adapters added to them; they're then attached to the flow cell. > They bend over and bind to a second primer, and that forms a bridge. That bridge is amplified using a PCR reaction, which we call bridge amplification, and then the bridge is dissociated so that the primers can anneal to the single stranded DNA, like here, > and then we go through single cycles of polymerase synthesis. And we can see here, if we look at one particular cluster, here with this C on it, the first rounds of synthesis give us (actually it goes in this other direction) T, A, G, C, because this would be the first nucleotide added, then the second, then the third, then the fourth. So the sequence coming off from the synthesis would be T, A, G, C, and then it would continue.
First Generation Sequencing: Sanger sequencing / chain termination seq: > Farjana, give me the principle of Sanger sequencing in one line. > What is the max and most common length of the DNA that you can sequence with it? > What are the pros and cons of Sanger sequencing? - What is the level of accuracy? - Is it expensive? - Is it time consuming?
The first generation sequencing was known as the chain termination method. It was developed in 1977 by Fred Sanger, who won the Nobel Prize in Chemistry for it in 1980. > It can determine DNA sequences up to roughly 1000 bases long, although 800 is a more realistic upper limit and 600 is a common sequence length, with very high accuracy of 99.99%. > It's relatively expensive because of the time required to do it and the reagents. It's quite slow, and it requires a significant amount of input DNA compared to current technologies.
DNA sequencing: How scientists have overcome challenges in dna seq
The main problem is that DNA is so small that we need to read it by proxy. Scientists purified and tweaked enzymes isolated from different microorganisms and used them in a test tube whose environment mimics the cell environment. They also employed some clever chemistry, such as tagging the DNA strand with a fluorescent label, measuring H ion flow to detect disruption in current flow, and synthesising oligos (adapters, linkers and primers) in the lab. ---- So in DNA sequencing, the molecule is too small for us to read the individual bases. On the right is a picture of actual DNA captured by a transmission electron microscope. You can see the helical structure there. It's quite elegant. > So a strategy for reading it by proxy is required. > So we need ways of using signals to detect the sequences. > Molecular biologists have, over the decades, adapted the cell's own machinery to engineer different methods to investigate DNA: * such as purifying and engineering enzymes such as Taq polymerase. Taq is a thermostable polymerase which can operate at high temperature. That is why it works in the polymerase chain reaction. Most other polymerases, such as those from eukaryotic cells, would denature at high temperature, so you couldn't do the cycling process. So that's an engineered method. * Also, we can mimic the cellular environment in a test tube to get enzymes like ligase, polymerase and reverse transcriptase to work quite happily. > We're also using very clever chemistry and physics * to radioactively or fluorescently label different nucleotides, * for measuring changes in hydrogen ion flow to detect minute changes in current, such as nanopore, which we'll talk about later, * and also for synthesising oligonucleotides on the bench, so that we can make lots of different oligos for adapters and primers. > So, with all these other tools, we also need very clever research approaches, using clever design and analysis, to ask difficult questions of our biological environment.
Microarrays: what is the purpose of microarrays? Are they a DNA seq tech?
The next major leap in technology, which was not technically a sequencing technology but was able to detect sequence variants, was called chips or microarrays. > In chips, short DNA probes (rem: probes ≈ oligos) are attached to the array. The probes were carefully designed so that they were complementary to a target DNA sequence that you were interested in, and they were laid on the array like grass. > The DNA of interest was fragmented and labelled, then washed over the surface of the chip and allowed to hybridise to the probes on the chip. If the labelled DNA was complementary, it stuck. > The nonspecific DNA that was not complementary was washed off. > The chip was then laser scanned to determine the intensity at particular coordinates, and the x y coordinates of the fluorescent signal were detected.
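The hybridise, wash and scan steps can be mimicked with string matching. This is a deliberately simplified sketch, assuming perfect complementarity is required for a fragment to stick and using invented probe names and sequences (`hybridise`, `geneA` and `geneB` are hypothetical):

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(seq))


def hybridise(probes, labelled_fragments):
    """Return a signal per probe: the number of labelled fragments that stick.

    A fragment sticks to a probe if it contains the probe's reverse
    complement; everything else is treated as washed off.
    """
    signal = {}
    for name, probe in probes.items():
        target = reverse_complement(probe)
        signal[name] = sum(target in fragment for fragment in labelled_fragments)
    return signal


probes = {"geneA": "ACGTTG", "geneB": "TTTCCC"}
fragments = ["GGCAACGTAA", "TCAACGTC", "AAAAAAAA"]
signal = hybridise(probes, fragments)
```

In this toy run two fragments stick to the `geneA` probe and none to `geneB`, which corresponds to a bright spot and a dark spot at the two probes' x y coordinates.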
Sequencing applications for COVID-19: Farjana, how has DNA sequencing helped in the fight against COVID-19?
Knowing the sequence of bases in the genome of the COVID-19 virus has helped us to 1. identify that it is a new strain of coronavirus 2. Now we know the seq, so let's develop a diagnostic test based on that. 3. Let's develop an mRNA vaccine 4. see what mutations the virus has gained and how they affect vaccine potency 5. track how the virus is travelling around the globe 6. .... ------ Lecture: So it's quite clear from the last year that sequencing has a lot of different applications. Even within one problem, the COVID-19 pandemic, sequencing has been involved in a number of different aspects. > It was used to identify the fact that there was a novel SARS coronavirus in 2019. > Sequencing was also used to develop rapid diagnostic tests and in messenger RNA vaccine development. > It's also been used to detect mutations in new strains and to track global transmission rates. > It can identify and track viral mutations that can affect vaccine potency, and it can also look at respiratory co-infections with microbes and see whether or not there are antimicrobial resistance genes within those populations. > So that's just one case where sequencing has been involved in lots of different aims, and different technologies have been used in different areas of that. > COVID-19 is actually an unusual case as well, because the virus has an RNA genome, not a DNA genome. https://www.illumina.com/areas-of-interest/microbiology/infectious-diseases/coronavirus-sequencing.html
DNA Chip or microarray: What is a DNA microarray or chip made of? How do scientists use microarrays to identify mutations? see the pic
A large number of short single stranded DNA probes or oligos attached to a surface is called a chip. A DNA chip or microarray is a tool used to detect mutations in a particular DNA seq. A DNA chip is made of a small glass plate encased in plastic. DNA microarrays can take 20,000 or more different DNA sequences attached in microscopic spots to a glass slide. The different DNA sequences are oligonucleotides of about 20 bases in length. The oligonucleotides represent tiny but unique regions of genes in the genome. > In chips, short DNA probes (rem: probes ≈ oligos) are attached to the array. The probes were carefully designed so that they were complementary to a target DNA sequence that you were interested in, and they were laid on the array like grass. --- https://www.genome.gov/about-genomics/fact-sheets/DNA-Microarray-Technology The DNA microarray chip consists of a small glass plate encased in plastic. On the surface, each chip contains thousands of short, synthetic, single-stranded DNA sequences, which together add up to the normal gene in question, and to variants (mutations) of that gene that have been found in the human population. https://www.youtube.com/watch?v=VNsThMNjKhM
Commercial sequencing technologies
Sequencing technology is a very diverse and dynamic domain. > Three of the technologies in the slides, SOLiD, Ion Torrent and Roche 454, are no longer available, but they were pretty good when they were around. > Illumina has dominated the market for short read sequencing, particularly in Australia; you can get up to 350 bases. > They have fairly high yield and very good quality, and the cost per base is low. > PacBio and Oxford Nanopore are long read sequencing, which I haven't gone into yet, but we'll be covering later. > They can do much longer reads, but the quality is very poor. The yield is okay. The cost per base also varies. > BGI is the Beijing Genomics Institute. They have an institute called MGI which is currently active in Australia. > They have a similar chemistry to Illumina, which is called ligation by synthesis, I believe, and which basically uses ligase instead of polymerase. > They can get single end reads of up to 400 bases, and with paired end sequencing they can get about 200. > So the technology is quite comparable to Illumina, and a recent study on single cell RNA seq actually said that the quality was pretty comparable.
Define NGS/ or MPS or HTS: > Why do this sequencing > How done>
The term "Massively Parallel Sequencing" is used to describe the method of high-throughput DNA sequencing to determine the entire genomic sequence of a person or organism. This method processes millions of reads, or DNA sequences, in parallel instead of processing single amplicons that generate a consensus sequence. The result is a higher resolution of every sample. This process gives MPS vastly more resolution than traditional Sanger-based capillary electrophoresis (CE) technologies. The impact of this is huge for the forensic community: scientists can extract meaningful information from full or partial DNA samples even when databases do not confirm a match.
Bridge amplification
Then those DNA fragments that have the adapters on them are flowed onto the flow cell and allowed to bind to the oligos that are already fixed to the flow cell as a lawn of oligos. > So the complementary sequences at the ends of the adapters on your DNA targets of interest will bind to the flow cell and hold the fragments in place. > We then need to amplify each strand where it is bound in place on the flow cell. That is because one strand by itself is not going to give enough signal for us to detect it. > So we do an in situ amplification, which is based on the PCR cycle. That polymerase chain reaction works in situ on the flow cell inside the Illumina array > and is performed through a number of cycles in something called bridge amplification. So that's what we call the PCR cycle on the right: bridge amplification, which produces multiple copies of the same template and forms a little cluster. > And it's the image that we get from that cluster that's going to tell us what the sequence is at each cycle of the sequencing reaction. We are doing something called sequencing by synthesis, which I'll talk about in a minute.
adapters
Lawn of adapters in flow cell: Adapters are oligonucleotides that are synthesised to perform a specific job. > The flow cells in an Illumina array have a lawn of oligonucleotides bound to them, just like the Affymetrix microarrays. Those are used to catch, bind and hold on to the DNA fragments that you want to sequence. A library fragment carries 5 types of sequence element: > Flow cell binding sequence, navy blue: Platform-specific sequences for library binding to the instrument. At the end of each DNA fragment we need a short sequence that is complementary to the lawn of oligos on the flow cell. That's what the P5 and P7 sequences at the ends are. They attach to the complementary oligos that are bound to the lawn of the flow cell. > Sequencing primer sites, SP1/SP2, ash color: Binding sites for general sequencing primers. * This adapter is designed to sequence from both ends. * So if sequencing primer one sequences in one direction, then sequencing primer two sequences in the other direction. > Sample indexes, light blue and camel color: Short sequences specific to a given sample library. * So we can add little barcodes. Barcodes are normally about 8 to 10 bases long. * These indexes allow you to multiplex your samples. You can run more than one sample in the same flow cell and then demultiplex them bioinformatically afterwards to identify which read comes from which input sample. That's what these little codes are for. > Molecular index/barcode, green color: Short sequence used to uniquely tag each molecule in a given sample library. * Sometimes you can also use a unique molecular identifier. Those are often used in single cell RNA sequencing to mark single molecules before PCR amplification. * It helps identify errors that are produced by the PCR amplification process rather than being a real sequence difference. > Insert, black color: Target DNA or RNA fragments from a given sample library.
* And here in black we can see the DNA insert site, our target of interest, that we actually want to sequence. So these adapters can be quite complicated. It can be useful to know that, because when you get your sequence, sometimes a little bit of adapter sequence is left over in there and you need to remove it. https://sg.idtdna.com/pages/products/next-generation-sequencing/adapters
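Demultiplexing by sample index and removing leftover adapter sequence can both be sketched with simple string operations. This is an illustration under simplifying assumptions: exact barcode matches only, the index delivered separately from the insert read, and made-up 8-base barcodes and a made-up adapter string (all names and sequences here are hypothetical, not real kit sequences):

```python
def demultiplex(reads, barcode_to_sample):
    """Group insert sequences by sample using each read's index sequence.

    reads is a list of (index_sequence, insert_sequence) pairs. Reads whose
    index matches no known barcode go to "undetermined".
    """
    by_sample = {sample: [] for sample in barcode_to_sample.values()}
    by_sample["undetermined"] = []
    for index_seq, insert in reads:
        sample = barcode_to_sample.get(index_seq, "undetermined")
        by_sample[sample].append(insert)
    return by_sample


def trim_adapter(read, adapter):
    """Cut the read at the first occurrence of the adapter, if present."""
    position = read.find(adapter)
    return read if position == -1 else read[:position]


barcodes = {"ACGTACGT": "sample_1", "TTGGCCAA": "sample_2"}
reads = [
    ("ACGTACGT", "GATTACAGATTACA"),
    ("TTGGCCAA", "CCCGGGTTTAAA"),
    ("ACGTACGT", "TTTTCCCCGGGG"),
    ("GGGGGGGG", "ACACACACACAC"),   # index not in the sample sheet
]
grouped = demultiplex(reads, barcodes)

adapter = "AGATCGGA"                 # placeholder adapter sequence
trimmed = trim_adapter("ACGTACGT" + adapter, adapter)
```

Production tools typically tolerate a mismatch or two in the barcode and handle partial adapters at the read end; the exact-match version above only shows the idea.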
chain termination sequencing: What is the main difference between the sequencing methods developed by Maxam-Gilbert and Sanger?
Maxam-Gilbert used a radioactive atom to tag ddNTPs. So, in the radiograph of the gel, they had 4 lanes, one for each type of nucleotide: one lane for G, another for T, and so on. They are then read from bottom to top. But in Sanger seq, they used 4 different coloured fluorophores to tag the ddNTPs. So they were able to run the 4 different reactions in a single lane, and then used gel electrophoresis and a laser beam to detect the fluorescence emitted by the ddNTPs. That makes it faster than Maxam-Gilbert. You can also read the gel from bottom to top based on the colour. ------ Lecture: This slide illustrates the basic protocol. But I've also provided a video that I found on YouTube that explains it quite nicely. https://www.youtube.com/watch?v=FhlKYsc_9_A So on the far left, you can see a radiograph of a radioactively labelled sequencing gel, which is the Maxam-Gilbert sequencing method rather than the Sanger sequencing method, which has fluorophore labels. > So one of the big differences between the Sanger method and Maxam-Gilbert was that they (Sanger) used four different fluorophore labels rather than a single label. > So they (Sanger) were able to pool their four different reactions into one lane, as you can see in the 2nd image, and that lane was then read from the gel. > We can see how to read gels by looking at the radiograph (Maxam-Gilbert) on the far left: you read from the bottom up, and the first lane is the G reaction. That is the chain termination at all the guanosines in the sequence or template. > The second lane is the chain termination for adenosine, the third for thymine and the fourth for cytosine. > So by reading from the bottom up, we can read the sequence as G, G, A, G, T, G, A, T, T, T, C, C, C, T, A. That's how we read the gel. And the term "read" has actually been kept, even now, with the modern sequencing technologies. > Read: the results of a sequencing method that you get is called a read.
> So, after the reaction was developed to contain the 4 different reactions in one single lane (Sanger) in gels, they developed an automated method using migration through capillary gel electrophoresis, which required far less DNA and was much faster because the reading was automated by laser. That was the dominant sequencing method used from about 1978 to 2006. It's still occasionally used today in clinical environments where they need very, very accurate results, for example in diagnostic testing. But it's not fast, and it's not cheap.
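Reading a sequencing gel "from the bottom up" amounts to sorting the chain-terminated fragments by length, since the shortest fragments migrate farthest. The following is a toy sketch, assuming each band is recorded as a (fragment length, base of the lane or dye it appears in) pair; `read_gel` and the band list are invented for illustration:

```python
def read_gel(bands):
    """Read a sequence from chain-termination bands.

    bands is a list of (fragment_length, terminating_base) pairs. The
    shortest fragment sits at the bottom of the gel, so sorting by length
    gives the bases in the order they were synthesised.
    """
    return "".join(base for _, base in sorted(bands))


# Bands in arbitrary order; lengths 1..5 terminate in G, G, A, G, T.
bands = [(3, "A"), (1, "G"), (4, "G"), (2, "G"), (5, "T")]
sequence = read_gel(bands)  # -> "GGAGT"
```

The result matches the first five bases read off the gel in the lecture example (G, G, A, G, T). The same logic applies whether the base comes from the lane position (four-lane radiograph) or from the dye colour (single-lane fluorescent run).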
What do you mean by metagenomics?
Metagenomics is the study of genetic material recovered directly from environmental samples. The broad field may also be referred to as environmental genomics, ecogenomics or community genomics. # Purpose of metagenomics: Metagenomics is a molecular tool used to analyse DNA acquired from environmental samples, in order to study the community of microorganisms present, without the necessity of obtaining pure cultures.
Sequencing strategy
So as I mentioned, sequencing technology is changing a lot with time. we can see that since NGS sequencing became very popular, the cost per gigabase has gone down , whilst the Gigabase output per week has gone up. And this is showing how different sequencing technologies come onto the market. And they're just much more high throughput
What is the objective
So finally, I just want to reiterate that your choice of sequencing technology is going to be dependent on your biological question, more than anything else. Some of the main things that you should be thinking about are: 1. Time and money. > How much time do you have to complete this experiment? > How much money do you have to generate the data? > And not just to generate it, but to analyse it, store it and transfer it? 2. Also, you need to think about sample quality. > What is the quality of the sample going in? Is it degraded? Some technologies can't handle degraded samples very well at all. > Is there not much of it? Some technologies require a lot more sample input than others. 3. And whether or not the technology is going to be available to you where and when you need to do the sequencing. So, if you know about location, then long read sequencing is going to be more feasible, but you'll have to consider whether or not the error rate is going to be appropriate. 4. Whether you're doing RNA analysis or DNA analysis, and whether there is a reference genome for your species of interest, is also going to have a big impact on your analysis pipeline and thus on your sequencing choice.
Example objective: Farjana, Why would you like to find out the sequence of bases in a DNA sequence?
There are two main reasons why you might want to sequence a DNA: 1. When you want to find out: which part of the whole DNA is behind a problematic phenotype of the cell? > Is there a mutation in that DNA? Where is the mutation that is causing the trouble? > What epigenetic modification is suppressing the good gene? > Detecting genes that are highly expressed during stress. > GWAS: which variable DNA is causing the disease? > Want to know the sequence of all bacteria in a species to find out which strain is carrying a mutation? 2. To compare DNA sequences to find out: > If two persons are related (mt genome). > If the DNA sequence from a crime scene matches the criminal. > If this bacterial DNA sequence is new. > Also, to recreate facial features from a DNA sequence by comparing with DNA sequences stored in a database. ---------------------- Lecture So here are some example objectives: > You might have a particular project that wants to ask what part of the DNA sequence in a particular cell is contributing to a particular phenotype. * Or it might be what mutations in the DNA are affecting cancer growth, * which epigenetic effects are suppressing gene expression, * which genes are highly expressed when cells are under stress, * or, in a genome-wide association study (GWAS), how variable DNA regions correlate with phenotype or disease. * You could also look at which strains of bacteria are carrying antibiotic resistance. And that might be in a mixed population of lots of different species of bacteria. > There are approaches for comparing DNA with other sequences where you can ask questions like: * Is this a new species? * Is the sample from a missing person or suspect? * Is this animal related to another animal? * And you might be looking at the mitochondrial genome for relatedness questions as well. * Some of the more recent developments are trying to recreate facial features based on DNA sequence using all the phenotype data that we have in the databases.
>Two of those examples with underscores are actually hyperlinked to examples I found on the internet. : *is this a new species? https://advances.sciencemag.org/content/6/9/eaax5751 * facial feature: https://sapac.illumina.com/content/dam/illumina-marketing/documents/icommunity/shriver-pennstate-interview-forensics-1470-2015-001.pdf
Technical errors
There are a number of different types of technical errors that can arise with sequencing by synthesis. The reason for the decreasing sequence quality lies in the sequencing technology of Illumina. Illumina relies on the sequencing by synthesis procedure. During each cycle of the process the sequencer washes chemicals that include variants of all four nucleotides over the flow cell (which has different clusters, with identical DNA fragments within each cluster). The nucleotides have a blocker (terminator cap) so that only 1 base gets added to each molecule of DNA at a time. After the detection of the coupled fluorescence signal the blocker can be removed and the cycle can start again. This way, the DNA fragments in each cluster get sequenced synchronously by emitting specific fluorescence signals. 1. Phasing: The main reason for the decreasing sequence quality is the so-called phasing. Phasing means that the blocker of a nucleotide is not correctly removed after signal detection. In the next cycle no new nucleotide can bind to this DNA fragment and the old nucleotide is detected one more time, whereby the fluorescence signal of this old nucleotide (probably) differs from the synchronous signal of the other nucleotides (Fig. 2). From now on this DNA fragment will be 1 cycle behind the rest (out of phase), polluting the light signal that the sequencer's camera has to read. > A similar effect occurs if a nucleotide has a defective terminator cap (prephasing). In this case two nucleotides can bind in one cycle, whereby the fragment will be 1 cycle ahead of the rest. These errors occur with a low probability. But over time (with increasing read length) they add up and pollute the light signal more and more. The signal gets more and more asynchronous. And since the light signal is used to calculate quality scores, the asynchronous signal results in a decreasing sequence quality score. 2. 
Cross talk: So if those clusters are too close together, the light coming from one cluster will bleed into the signal from another. > And the image analysis is not so sure about whether that was a red spot or a green spot, because of bleeding coming in from other clusters. > So in order to get the most bang for your buck, you want to have as many clusters in your flow cell as possible, without having them overlap so much that they impede each other's signal. > So this one is under-clustered. The optimal clustering is in the middle here. And then on the right, we've got over-clustering, and there is a lot of poor quality signal coming out of that one, because the spots are too close together and the signal bleeds between them. 3. High G+C DNA: Also, GC-rich regions of DNA are very difficult to sequence. And that's because they have a strong secondary structure. If you look at the chemistry of the base pairing between G and C, they actually have three hydrogen bonds, whereas A and T only have two hydrogen bonds. > So that may not sound like much. But if you've got a lot of GCs in a particular region, the strength of all those extra hydrogen bonds over that long region makes that DNA hybridization much stronger overall. https://www.ecseq.com/support/ngs/why-does-the-sequence-quality-decrease-over-the-read-in-illumina
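To see why small phasing and prephasing errors "add up" over the read, here is a minimal Python sketch. It is a deliberately simplified model, not how a real base caller works: it assumes every strand independently falls behind or jumps ahead with a fixed probability per cycle and never gets back in step. The function name and the example probabilities are made up for illustration.

```python
def in_phase_fraction(cycles, p_phasing, p_prephasing):
    """Fraction of strands in a cluster still in step after `cycles` cycles.

    Simplifying assumption: a strand that slips once (behind or ahead)
    stays out of phase for the rest of the run.
    """
    p_stay = 1.0 - p_phasing - p_prephasing
    return p_stay ** cycles


# Hypothetical per-cycle error rates: 0.2% phasing, 0.1% prephasing.
for n in (1, 50, 150, 300):
    print(n, round(in_phase_fraction(n, 0.002, 0.001), 3))
```

Even with a per-cycle error well below 1%, the in-phase fraction shrinks geometrically with read length, which matches the drop in quality scores towards the end of a read.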
Microarrays diagram: What are the two major microarrays you can name? What is the difference between the two?
These were the two major microarrays that were around. 1. Affymetrix microarrays: https://www.youtube.com/watch?v=MuN54ecfHPw 2. Illumina BeadChips. They're not so common anymore, but the first was Affymetrix microarrays. Affymetrix microarray: * They had 25-mer oligos synthesised directly onto the matrix using photolithography. * DNAs labelled with a single fluorophore would attach to those probes bound to the matrix. * The non-bound ones would wash off and then the signal could be detected. > The Illumina BeadChips were similar, except they attached their probes to beads first. * They included a 29-mer address as well as a 50 base pair probe to target the sequence of interest. * They also used two-colour detection. * So they had a slightly different detection algorithm: you could mix two different samples on one array to detect differences in intensity.
2 sequencing modes for PacBio
They actually have two sequencing modes, one of which improves the error rate. 1. Continuous long read 2. Circular consensus sequence > Continuous long read: reads the template once as one long read, which can contain errors. > Circular consensus sequence: takes smaller insert sizes, but continuously sequences around and around the circle to sequence the template DNA multiple times. A consensus sequence is then generated, identifying which of the sequence differences are random errors due to inaccuracy of the polymerase. > Accuracy: the per-base accuracy of the circular consensus sequence is greater than 99%, whereas the regular continuous long read sequencing is only about 88% accurate. > Read lengths: the read lengths vary; as I said, you need to have a smaller insert size for the circular consensus sequencing, and the yield is therefore not as great. > Run time: the run times are pretty similar. > Cost: the cost of circular consensus sequencing is a little bit more.
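The idea behind the circular consensus mode, reading the same insert several times and letting the passes outvote random errors, can be illustrated with a simple majority vote. This is only a sketch under a strong assumption: the passes are already aligned and have equal length, so errors are substitutions only. The function name and the example passes are invented for illustration.

```python
from collections import Counter


def consensus(passes):
    """Majority-vote consensus over equal-length passes of the same insert."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("this sketch assumes all passes are aligned to equal length")
    # Take the most common base in each column across all passes.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))


# Five hypothetical passes around the same circular template; two contain a random error.
passes = ["ACGTTGCA", "ACGATGCA", "ACGTTGCA", "ACCTTGCA", "ACGTTGCA"]
print(consensus(passes))  # ACGTTGCA
```

Because the errors are random, they land at different positions in different passes, so each one is outvoted by the correct base from the other passes.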
SNP detection using arrays: Explain the process of SNP detection using Affymetrix arrays and the Illumina BeadChip
They were able to detect sequence variants using a special approach. For the Affymetrix array: - The probes for both alleles (mutated and normal) were synthesised and bound to the array. - If the wrong allele tried to bind to the hybridization probe on the array, the mismatch impeded the binding of the labelled sample, and that resulted in a dimmer signal. * So a bright signal was a perfect match for a particular SNP of interest. * And a dim signal was a mismatch. > On the Illumina BeadChip: - Genotype is determined using two-colour detection. One colour is used for each allele: mutated and normal. - The probe attached to the bead hybridizes to the region next to the SNV. - Then single base extension with one of 2 labels identifies the SNV base in the sample. - If it was guanosine, the signal is green, and for thymine the signal is red here. - So they could tell what the opposite base was from the signal that was emitted from the particular bead. https://pubmed.ncbi.nlm.nih.gov/19570852/
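The two-colour readout on the bead chip can be turned into a genotype call by comparing the two intensities. The sketch below uses the colours from the note above (green for G, red for T); the function name, the cut-off values in `het_band` and the example intensities are hypothetical, and real genotype-calling software uses a more involved clustering approach.

```python
def call_genotype(green, red, het_band=(0.3, 0.7)):
    """Call a genotype from two-colour intensities measured at one bead type.

    green -> G allele, red -> T allele (colours as in the note above).
    `het_band` is a hypothetical pair of cut-offs on the green fraction.
    """
    total = green + red
    if total <= 0:
        return "no call"
    green_fraction = green / total
    low, high = het_band
    if green_fraction >= high:
        return "GG"
    if green_fraction <= low:
        return "TT"
    return "GT"


print(call_genotype(950, 40))   # GG
print(call_genotype(480, 510))  # GT
print(call_genotype(30, 900))   # TT
```

Roughly equal green and red intensity means both alleles are present, so the sample is called heterozygous.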
Whole exome sequencing
Whole exome sequencing is an approach that is used to look at large regions of the genome, and the most interesting regions of the genome are the expressed regions. So that includes all the coding genes, as well as a lot of non-coding regions that are expressed. > The exome is composed of all of the exons within the genome, the sequences which, when transcribed, remain within the mature RNA after introns are removed by RNA splicing. > This includes untranslated regions of mRNA, and coding sequence (or CDS). > So there are ways of doing this using a hybridization capture methodology. > All the expressed (exome) regions of the genome for a human, for example, have been identified, and oligos corresponding to those regions have been synthesised and bound to a matrix. > This is a hybridization array matrix. So all the oligos bound to this matrix will recognise expressed regions of the genome only. > To use this, you take your DNA of interest, you fragment it and denature it, so you have small single-stranded fragments of the genomic DNA, and you flow that over the array. And only those regions of the DNA that are complementary to the exome regions that are bound to the array will stick to it. > You wash off all the stuff that doesn't stick, and then you elute the fragments that were bound to your regions of interest into another sample, and then you sequence that by taking it through library preparation and the sequencer. > It's a lot cheaper than sequencing the whole genome, because only a small portion of the human genome is expressed. The arrays are known to capture about 99% of the exome. > It does have some disadvantages: it's a pre-designed capture array, so you're only going to be able to sequence things that were designed to be captured by the array, it can't detect large rearrangements, and it can't detect variants in the regions that are not expressed.
So the exome is only going to allow you to do, genome wide SNP analysis in regions of the exome, not the whole genome. By SarahKusala - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=9642877
Transcriptomics - transcript analysis. Why is long read sequencing advantageous for transcript level analysis?
However, if you're interested in transcript analysis, that is, the different transcript variants that come from a single gene, then long read sequencing can give you a great advantage because you can sequence the whole transcript. > And you don't have to work out whether a particular exon came from one splice variant or another. > So long read sequencing is an advantage in transcript level analysis, but is not really beneficial for gene level analysis.
mapping
https://www.genome.gov/genetics-glossary/Mapping The process of establishing the spatial relationship of landmarks within DNA is called mapping. Such landmarks can be genes or regions that vary among individuals. Mapping can involve simply ordering such landmarks, or in some cases precisely determining the spacing between them. Cytogenetic maps give the order of stained chromosome bands. Genetic maps represent the positions of polymorphisms--regions where the DNA sequence differs among individuals. Physical maps depict the actual physical locations of landmarks along a stretch of DNA. Mapping is the process of making a representative diagram cataloging the genes and other features of a chromosome and showing their relative locations. Cytogenetic maps are made using photomicrographs of chromosomes stained to reveal structural variations. Genetic maps use the idea of linkage to estimate the relative locations of genes. Physical maps, made using recombinant DNA (rDNA) technology, show the actual physical locations of landmarks along a chromosome.
Illumina sequencing by synthesis: youtube video . Vicky
https://www.youtube.com/watch?v=fCd6B5HRaZ8 > The Illumina sequencing workflow is composed of four basic steps: sample prep, cluster generation, sequencing, and data analysis. > There are a number of different ways to prepare samples. All preparation methods add adapters to the ends of the DNA fragments. Through reduced-cycle amplification additional motifs are introduced, such as the sequencing primer binding site, indices and regions complementary to the flow cell oligos. > Clustering is a process where each fragment molecule is isothermally amplified. > The flow cell is a glass slide with lanes. Each lane is a channel coated with a lawn composed of two types of oligos. > Hybridization is enabled by the first of the two types of oligos on the surface. This oligo is complementary to the adapter region on one of the fragment strands. A polymerase creates a complement of the hybridised fragment. The double-stranded molecule is denatured and the original template is washed away. > The strands are clonally amplified through bridge amplification. In this process, the strand folds over and the adapter region hybridises to the second type of oligo on the flow cell. Polymerases generate the complementary strand, forming a double-stranded bridge. This bridge is denatured, resulting in two single-stranded copies of the molecule that are tethered to the flow cell. The process is then repeated over and over and occurs simultaneously for millions of clusters, resulting in clonal amplification of all the fragments. > After bridge amplification, the reverse strands are cleaved and washed off, leaving only the forward strands. The three prime ends are blocked to prevent unwanted priming. > Sequencing begins with the extension of the first sequencing primer to produce the first read. With each cycle, four fluorescently tagged nucleotides compete for addition to the growing chain. Only one is incorporated, based on the sequence of the template.
After the addition of each nucleotide, the clusters are excited by a light source, and a characteristic fluorescent signal is emitted. This proprietary process is called sequencing by synthesis. The number of cycles determines the length of the read. > The emission wavelength, along with the signal intensity, determines the base call. > For a given cluster, all identical strands are read simultaneously. Hundreds of millions of clusters are sequenced in a massively parallel process. This image represents a small fraction of the flow cell. After the completion of the first read, the read product is washed away. > In this step, the index one read primer is introduced and hybridised to the template. The read is generated, similar to the first read. After completion of the index read, the read product is washed off, and the three prime ends of the template are deprotected. The template now folds over and binds the second oligo on the flow cell. Index two is read in the same manner as index one. Polymerases extend the second flow cell oligo, forming a double-stranded bridge. This double-stranded DNA is then linearized, and the three prime ends are blocked. The original forward strand is cleaved off and washed away, leaving only the reverse strand. Read two begins with the introduction of the read two sequencing primer. As with read one, the sequencing steps are repeated until the desired read length is achieved. The read two product is then washed away. > This entire process generates millions of reads, representing all the fragments. Sequences from pooled sample libraries are separated based on the unique indices introduced during the sample preparation. For each sample, reads with similar stretches of base calls are locally clustered. Forward and reverse reads are paired, creating contiguous sequences. These contiguous sequences are aligned back to the reference genome for variant identification. The paired end information is used to resolve ambiguous alignments.
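The step where sequences from pooled libraries are separated by their unique indices can be sketched as a simple lookup. This is an illustration only: the function name `demultiplex`, the index sequences and the sample names are all made up, and the sketch requires an exact index match, whereas real demultiplexing tools usually tolerate a mismatch or two.

```python
from collections import defaultdict


def demultiplex(reads, index_to_sample):
    """Group reads by sample using the index sequence read alongside each insert."""
    by_sample = defaultdict(list)
    for index_seq, insert_seq in reads:
        # Reads whose index is not in the sample sheet go to an "undetermined" bin.
        sample = index_to_sample.get(index_seq, "undetermined")
        by_sample[sample].append(insert_seq)
    return dict(by_sample)


# Hypothetical sample sheet and (index read, insert read) pairs.
index_to_sample = {"ACGT": "sample_1", "TGCA": "sample_2"}
reads = [("ACGT", "GGATTC"), ("TGCA", "CCTAGA"), ("ACGT", "TTGACA"), ("AAAA", "GGGGGG")]
result = demultiplex(reads, index_to_sample)
print(result)
```

This is why the index reads are generated separately from read one and read two: they carry no biological information, only the label that says which library a cluster came from.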
Sanger sequencing/chain termination sequencing youtube video transcript
My understanding: 4 stages. 1. PCR amplification of the sample DNA fragment. So, now you have lots of DNA fragments. 2. Add these DNA fragments, the primers, lots of the dNTPs and a small amount of chain-terminating ddNTPs to the reaction mixture. > These ddNTPs lack the -OH (hydroxyl) group at the 3' position that a dNTP has. This hydroxyl group is important for adding a new nucleotide. So, chain elongation can't happen, resulting in chain termination. > As DNA polymerase adds each nucleotide to the primer, a random event determines the length of the DNA sequence. Will a dNTP bind and let the DNA sequence grow? Or will a ddNTP bind and terminate chain elongation? Also, these ddNTPs are fluorescently tagged. So, when they are hit with a laser beam, they emit a characteristic coloured light, and this light is recorded by the computer to determine the identity of the nucleotide at a particular position. 3. Capillary gel electrophoresis: so, we are getting DNA fragments of different sizes. Use gel electrophoresis to separate the DNA fragments based on their length. The shorter fragments will travel faster and will be at the bottom of the gel. So, fragments are sorted based on their length. 4. Detection: scan the gel with a device that hits the DNA fragments with a laser beam, and the different ddNTP tags emit characteristic light. This light is detected by the fluorescence detector, and the computer reports the DNA sequence. ---- https://www.youtube.com/watch?v=FhlKYsc_9_A The most common method of DNA sequencing involves a modified DNA synthesis reaction. The reaction mixture contains many copies of the segment of DNA to be sequenced, as well as reaction primers, and plenty of the four types of deoxynucleotides, or dNTPs, to be used as building blocks in DNA synthesis. This sequencing method also requires so-called chain-terminating dideoxynucleotides, or ddNTPs, in the DNA polymerase reaction. Whereas dNTPs have a hydroxyl group at the 3' position of the sugar, ddNTPs lack the hydroxyl group.
The hydroxyl group is required for chain elongation, so ddNTPs are considered chain-terminating nucleotides. > Four different types of ddNTPs are added in small amounts to the reaction, with each carrying a different coloured fluorescent tag. > As DNA polymerase adds nucleotides to the growing strand, a random choice determines each step. Will the dNTP bind, letting the strand continue to grow, or will a ddNTP bind and terminate chain elongation? > As the process continues, many incomplete DNA strands form with a ddNTP as the last nucleotide incorporated. > The size of each fragment is determined by its terminal dideoxynucleotide, marked by a specific colour of fluorescence. A relatively long strand that fluoresces blue identifies the position of a cytosine-bearing nucleotide on the template strand near the five prime end. > Each length and colour identify the position and type of nucleotide in the template strand. > The four newly synthesised strands shown here would be only a small portion of the strands synthesised in the entire reaction mixture. The last incorporated nucleotides in these fragments provide the complementary sequence to the template DNA strand. > This complementary sequence is GAA CAG TTG CAT CAG, and therefore, the template strand consists of the sequence CTG ATG CAA CTG TTC. > How do investigators put the fragments in the correct order? By using gel electrophoresis. [* Large and bulky fragments are at the top and small and light fragments at the bottom. * The template strand is read 3' to 5' by the polymerase. * The complementary sequence is read from the light bottom to the heavy top: 5'-GAA CAG TTG CAT CAG-3'. * Now reverse it: GAC TAC GTT GAC AAG. * Take the complement: CTG ATG CAA CTG TTC. This is the template strand written 5' to 3': 5'-CTG ATG CAA CTG TTC-3'.] Through gel electrophoresis, the mixture of fragments can be ordered according to size, with the smallest fragments migrating the fastest in the gel.
An automated device scans the gel. A laser beam strikes the fragments in the gel, causing the tags on the DNA to fluoresce their unique colours, and a computer reads the series of fluorescent colours and determines the corresponding sequence of the DNA.
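The "reverse it, then take the complement" step used to get from the newly synthesised strand back to the template can be checked with a short Python sketch. The function name is my own; the sequence is the one from the video transcript.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the opposite strand of `seq`, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


# Sequence read off the gel (newly synthesised strand, 5' to 3').
synthesised = "GAACAGTTGCATCAG"
print(reverse_complement(synthesised))  # CTGATGCAACTGTTC
```

The result is the template strand in 5' to 3' orientation, matching CTG ATG CAA CTG TTC in the transcript.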
Transcriptomics (gene level)
Sequencing is also used a lot for transcriptomics and has been for some time. Short read sequencing is pretty good for gene level transcriptomics. > Bulk RNA-seq is where you extract RNA from a chunk of tissue. > Single cell RNA-seq is a fairly new technology, but it's being taken up fast by a lot of researchers. And that's where you're looking at the genes that are expressed in a particular cell. > miRNA-seq: a variant of transcriptomics that looks at microRNAs. * miRNAs (microRNAs) are short non-coding RNAs that regulate gene expression post-transcriptionally. They generally bind to the 3'-UTR (untranslated region) of their target mRNAs and repress protein production by destabilizing the mRNA and by translational silencing. > Short read sequencing is pretty good for all of these technologies.
Technology: zero-mode waveguide (ZMW)
So here we have a well that's called a zero-mode waveguide (ZMW). > The DNA SMRTbell, with a polymerase stuck on it, is flowed into the flow cell and sticks to the bottom of the well. > And there is light just at the bottom of that well. As the polymerase adds fluorescent nucleotides one by one, the light detection system at the bottom measures it and determines which nucleotide, whether A, G, C or T, was added at that time. So it records the fluorescent signal. > And this is what the raw output looks like here; that is converted into a read set, giving you the reads as output.
Hybrid capture method script
Today we will be discussing hybrid capture technology. Hybrid capture is not a sequencing technique; however, it is frequently mentioned in the same book chapters as sequencing since it arose during that wave of molecular diagnostics. It involves detection of a specific known DNA sequence by hybridising that piece of DNA to an RNA molecule. You'll find that hybrid capture is reminiscent of traditional EIA, or sandwich assays, used in chemistry and immunology. This test is most widely used for the detection of high risk HPV types in cervical Pap tests. A CMV version of the test is also available. The test has slightly less analytical sensitivity than PCR for HPV, but it is technically easier to do, and is cheaper. For hybrid capture to work, we need to know the exact sequence of the target DNA. This is why the test gained popularity for HPV testing: we know the sequences for certain subtypes of high risk HPV. The target DNA is hybridised to a complementary RNA probe. This is our hybrid. In our testing device, we have a solid phase component, that is, a solid plate in our test tube that is coated with antibodies that will bind to the DNA-RNA hybrid molecules. Our DNA-RNA hybrid binds to the solid phase, then another antibody that is coated with alkaline phosphatase binds and sandwiches our hybrid from the other side. Finally, the development agent is added to the mix, and a chemiluminescence signal is detected. https://www.youtube.com/watch?v=SBWXsUo-w8M
Epigenetics and epitranscriptomics. Advantage of long read sequencing in epigenetics and epitranscriptomics.
> So epigenetics and epitranscriptomics look at changes that don't affect the actual sequence of the DNA or the RNA. These are chemical changes to the nucleotides that don't change the identity of the nucleotides. So, the cytosine is still a cytosine, but in DNA, you can get a methylation, and it is now called a 5-methylcytosine. > That subtle change in the cytosine can affect gene expression from the DNA. > RNA has also been shown to have lots of modifications: over 100 different modifications have been identified so far for RNA, and one of the most frequent is N6-methyladenosine (m6A). This adenosine modification in RNA is thought to be involved in regulating messenger RNA transport and regulation of translation. > These modifications can't be detected with short read technology, because short read technology strips all those modifications off. For RNA there is a further problem: short read platforms can only sequence double-stranded DNA, so RNA needs to be converted to DNA first, and you lose those modifications. So they're not captured. > So, if you're interested in epigenetics and epitranscriptomics, Nanopore and PacBio long read sequencing are able to detect those modifications if you sequence the native sample, rather than doing a PCR amplification or anything similar. > It is possible, using some proxy techniques in Illumina, to look at methylation. > But I think it is not going to be possible to look at epitranscriptomics with short read sequencing methodologies. > So there are approaches to do methylation studies in DNA using Illumina. But it's a proxy. It's not a direct sequencing method.
Objective
> So it's good to think about the overall objective. Unless you're actually a chemist, the production of good quality sequences from a biological sample is not a useful output on its own. > Generating the sequence is just a means to an end. > The technology choice is going to depend on the approach you have chosen to answer the biological question, and other factors that we'll talk about in a little bit
Sequencing
> So once we've got that cluster synthesised in place on the flow cell, we use a sequencing cycle: > First of all, we add fluorescently labelled As, Gs, Cs and Ts, each of a different colour, and that, in conjunction with the primer and the polymerase, adds a single base after the end of the primer to start synthesis of the complementary strand to the template of interest. > However, the bases have modifications on them, so only one base can be added. At that point, an image is taken of the flow cell and it records what the colour is at that particular cluster site. * In this illustration: adenosine is red, thymine is green, guanosine is yellow and cytosine is blue. > Then, an enzymatic reaction is done to cleave the part of the nucleotide that's preventing the DNA polymerase from adding a second nucleotide. > And the cycle is repeated. The As, Gs, Cs and Ts are flowed onto the flow cell again, one more nucleotide is added, an image is taken and recorded, and then the terminating part of the nucleotide is enzymatically removed. > So we can add one more, a third cycle. Each time we do the cycle, the machine takes an image, records what the colour is and determines the identity of the nucleotide that was added at that point by the polymerase. So that's what we mean by sequencing by synthesis. At every step that a nucleotide is added, it measures what that sequence change was. http://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf
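The per-cycle logic described above (take an image, look at the colour at the cluster site, translate it to a base) can be sketched as follows. The colour-to-base mapping is the one from the illustration in the note; the function name, the intensity values and the "brightest channel wins" rule are simplifying assumptions for the example, not the instrument's actual base-calling algorithm.

```python
# Colour coding taken from the illustration in the note above.
COLOUR_TO_BASE = {"red": "A", "green": "T", "yellow": "G", "blue": "C"}


def call_read(cycle_images):
    """Call one base per cycle for a single cluster.

    `cycle_images` is a list with one dict per cycle, mapping each colour
    to the (hypothetical) intensity measured at the cluster site.
    """
    read = []
    for intensities in cycle_images:
        brightest = max(intensities, key=intensities.get)
        read.append(COLOUR_TO_BASE[brightest])
    return "".join(read)


# Four made-up cycles for one cluster.
cycles = [
    {"red": 0.90, "green": 0.10, "yellow": 0.05, "blue": 0.02},
    {"red": 0.08, "green": 0.80, "yellow": 0.06, "blue": 0.04},
    {"red": 0.05, "green": 0.12, "yellow": 0.10, "blue": 0.70},
    {"red": 0.07, "green": 0.03, "yellow": 0.85, "blue": 0.09},
]
print(call_read(cycles))  # ATCG
```

The number of entries in `cycles` is the read length, which is why the number of cycles run on the machine determines how long each read is.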
Illumina sequencing youtube video transcript: much better. Read this https://www.youtube.com/watch?v=womKfikWlxM. Just sample preparation picture
> Sample preparation begins with extracted and purified DNA. > The first step in Nextera sample preparation is tagmentation. During tagmentation, transposons simultaneously fragment and tag the input DNA with adapters. Once the adapters have been ligated, reduced-cycle amplification adds additional motifs, such as the sequencing primer binding sites, indices and regions that are complementary to the flow cell oligos.
Illumina sequencing youtube video transcript: much better. Read this https://www.youtube.com/watch?v=womKfikWlxM. Just sequencing picture
> Sequencing begins with the extension of the first sequencing primer to produce the first read. With each cycle, four fluorescently tagged nucleotides compete for addition to the growing chain. Only one is incorporated, based on the sequence of the template. > After the addition of each nucleotide, the clusters are excited by a light source and a characteristic fluorescent signal is emitted. This proprietary process is called sequencing by synthesis. The number of cycles determines the length of the read. The emission wavelength, along with the signal intensity, determines the base call. > For a given cluster, all identical strands are read simultaneously. Hundreds of millions of clusters are sequenced in a massively parallel process. > This image represents a small fraction of the flow cell. > After the completion of the first read, the read product is washed away. > In this step, the index one read primer is introduced and hybridised to the template. The read is generated, similar to the first read. After completion of the index read, the read product is washed off, and the three prime end of the template is deprotected. The template now folds over and binds the second oligo on the flow cell. > Index two is read in the same manner as index one. The index two read product is washed off at the completion of this step. Polymerases extend the second flow cell oligo, forming a double-stranded bridge. This double-stranded DNA is then linearized and the three prime ends are blocked. The original forward strand is cleaved off and washed away, leaving the reverse strand. > Read two begins with the introduction of the read two sequencing primer. As with read one, the sequencing steps are repeated until the desired read length is achieved. The read two product is washed away.
Uses of microarrays: When are microarrays useful?
>microarrays were very popular, * they were able to detect the presence or absence of genes in a population, * the amount and type of mRNA that was expressed in a cell, *the presence or absence of small mutations in the sample of DNA and detect the copy number variation. > They had significant limitations though, *they were designed and targeted, so you could only detect what the chip was designed for. *And they couldn't detect sequence which didn't match to any of the probes.
What are exomes? What is the difference between whole genome and exome? What is the relationship between exome and exon?
Exome: The part of the genome that consists of exons is called the exome. What is the difference between Exome Sequencing and Whole Genome Sequencing? Whole Genome Sequencing sequences the complete DNA of an organism. ... The exome makes up only 1.5% of the whole human genome, however ALL protein coding genes are found in the exome. Relationship: The exome is the part of the genome composed of exons, the sequences which, when transcribed, remain within the mature RNA after introns are removed by RNA splicing and contribute to the final protein product encoded by that gene.
What are these microRNAs?
A microRNA (abbreviated miRNA) is a small single-stranded non-coding RNA molecule (containing about 22 nucleotides) found in plants, animals and some viruses, that functions in RNA silencing and post-transcriptional regulation of gene expression.
Initial cost to get started
Another important thing to think about when you're choosing a sequencing platform, especially if you haven't done any sequencing before, is the cost to get set up. > The Illumina machines are benchtop machines; they cost between 20,000 and a million to buy the machine and get started. > The PacBio machines are in a similar category, at 500,000, but the machine is about the size of a fridge. > Then we have the Oxford Nanopore MinION, which is a tiny little thing about the size of a USB drive, and they have a new device in development which fits on your cell phone. So they're far more mobile. The cost of setting up nanopore, 1000 US dollars, actually includes reagents as well as the MinION flow cells. So you can get set up and started quite quickly.
Illumina beadchip youtube script https://www.youtube.com/watch?v=lVG04dAAyvY&t=98s
By understanding genetic variations across multiple individuals, we gain valuable insights that can improve our understanding of human health and disease. > With the Infinium assay from Illumina, you can measure these variations across a large number of individuals, giving you a deeper understanding of how they impact different traits and diseases. > An individual Infinium assay genotypes a locus using a two colour readout, one colour for each allele. > The secret is in the combination of the Infinium assay with Illumina BeadChip technology. > On the surface of each BeadChip, hundreds of thousands to millions of genotypes for a single individual can be assayed at once. Tiny silica beads are housed in carefully etched microwells. The beads are coated with multiple copies of an oligonucleotide probe that targets a specific locus in the genome. > As DNA fragments pass over the BeadChip, each probe binds to a complementary sequence in the sample DNA, stopping one base before the locus of interest. > Allele specificity is conferred by a single base extension that incorporates one of four labelled nucleotides. > Natural competition among the four bases minimises bias, allowing the polymerase to extend the probe with the correct base, matching the target DNA. > Once laser excited, the nucleotide label emits a signal that is detected by an Illumina scanner. Intensity values for each colour convey information about the allelic ratio at a given locus. > For example, if the colour representing A (adenine) is approximately as intense as the colour representing G (guanine), the interpretation is that the genotype at that locus is AG. > The data can be analysed using Illumina's GenomeStudio software to call genotypes and evaluate copy number variation across the genome. When the assay data from a number of individuals are plotted, distinct patterns emerge. > Samples that have identical genotypes at an assayed locus exhibit similar signal profiles.
For diploid organisms, biallelic loci are expected to exhibit three clusters. This proven chemistry produces unrivalled accuracy, superior call rates, and the most consistent reproducibility.
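To make the two colour readout concrete, here is a rough Python sketch of calling a genotype from the two intensities at one locus. The function name and the cut-off values (0.2 and 0.8) are hypothetical choices for illustration, not how Illumina's software actually clusters samples. The three possible results correspond to the three clusters expected at a biallelic locus.

```python
def call_genotype(intensity_a, intensity_b, low=0.2, high=0.8):
    """Call AA, AB or BB from the two colour intensities at one locus.

    low and high are hypothetical cut-offs on the fraction of signal
    that comes from the B allele colour.
    """
    total = intensity_a + intensity_b
    if total == 0:
        return "no call"  # no signal in either colour
    b_fraction = intensity_b / total
    if b_fraction < low:
        return "AA"  # almost all signal from the A colour
    if b_fraction > high:
        return "BB"  # almost all signal from the B colour
    return "AB"      # both colours about equally intense

# Made-up intensity pairs for three samples at the same locus.
for a, b in [(1000, 50), (500, 480), (40, 900)]:
    print(a, b, call_genotype(a, b))
```

The middle sample has roughly equal intensities, so it is called heterozygous, which matches the AG example in the notes.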
DNA microarray youtube script. https://www.youtube.com/watch?v=HIf7ulJldXs How can you use DNA microarray technology to compare gene expression in cancer and normal cells?
DNA microarray technology allows the expression of tens of thousands of genes to be analysed simultaneously. This is useful in cancer research, since some genes are expressed at higher levels in cancer cells, while others are expressed at lower levels. Consider an experiment that compares the pattern of gene expression in cancer cells and normal cells. > When a gene is expressed, it is transcribed into mRNA. In this technology, the mRNA from the two cell types is isolated and converted into complementary strands of DNA, called cDNA, by reverse transcription. > The process incorporates a fluorescent dye into the cDNAs so that they can be identified later by illumination. The RNA is destroyed, leaving just the cDNA. DNA microarrays are used to compare the two cDNA samples. > DNA microarrays can have 20,000 or more different DNA sequences attached in microscopic spots to a glass slide. The different DNA sequences are oligonucleotides of about 20 bases in length. The oligonucleotides represent tiny but unique regions of genes in the genome. > The cDNAs are added to the microarrays. cDNAs that are complementary to oligonucleotides on the microarray will bind, or hybridise, with the oligonucleotides and thereby stick to that location on the slide. Unbound cDNAs are washed away. > The arrays are then analysed using a high resolution laser scanner, and the relative extent of transcription of each gene is indicated by the intensity of fluorescence at the appropriate spot on the array.
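A quick Python sketch of how the spot intensities could be compared between the two samples. It assumes each spot gives one fluorescence intensity for the cancer cDNA and one for the normal cDNA, and it uses a log2 ratio, which is a common convention rather than something stated in the video. The gene names and numbers are made up.

```python
import math

def log2_ratio(cancer_intensity, normal_intensity):
    """Positive: higher in cancer. Negative: lower in cancer. Zero: no change."""
    return math.log2(cancer_intensity / normal_intensity)

# Hypothetical spot intensities: gene -> (cancer, normal)
spots = {
    "gene_1": (8000.0, 1000.0),
    "gene_2": (500.0, 2000.0),
    "gene_3": (1500.0, 1500.0),
}

for gene, (cancer, normal) in spots.items():
    ratio = log2_ratio(cancer, normal)
    if ratio > 0:
        label = "higher in cancer"
    elif ratio < 0:
        label = "lower in cancer"
    else:
        label = "unchanged"
    print(f"{gene}: log2 ratio {ratio:+.1f} ({label})")
```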
Illumina sequencing youtube video transcript: much better. Read this https://www.youtube.com/watch?v=womKfikWlxM. Just data analysis
Data analysis: This entire process generates billions of reads, representing all the fragments. Sequences from pooled sample libraries are separated, based on the unique indices introduced during the sample preparation. > For each sample, reads with similar stretches of base calls are locally clustered. > Forward and reverse reads are paired, creating contiguous sequences. > These contiguous sequences are aligned back to the reference genome for variant identification. > The paired end information is used to resolve ambiguous alignments.
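The first step above, separating pooled reads by their index, can be sketched in Python like this. The index sequences, sample names and the "undetermined" bucket are assumptions for illustration; real demultiplexing software also tolerates mismatches in the index.

```python
from collections import defaultdict

def demultiplex(reads, index_to_sample):
    """Group (index_sequence, read_sequence) pairs by sample name."""
    by_sample = defaultdict(list)
    for index_seq, read_seq in reads:
        # Reads whose index is not in the sample sheet go to "undetermined".
        sample = index_to_sample.get(index_seq, "undetermined")
        by_sample[sample].append(read_seq)
    return dict(by_sample)

# Hypothetical sample sheet and pooled reads.
sample_sheet = {"ATCACG": "sample_1", "CGATGT": "sample_2"}
pooled_reads = [
    ("ATCACG", "ACGTTGCA"),
    ("CGATGT", "GGCATTAC"),
    ("ATCACG", "TTGACCGA"),
    ("NNNNNN", "ACACACAC"),
]
print(demultiplex(pooled_reads, sample_sheet))
```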
Oxford Nanopore script: https://www.youtube.com/watch?v=1_mER5qmaVk
Nanopore sequencing enables users to produce ultra long DNA or RNA reads in real time. > The technology works by passing a current across an electrically resistant membrane, into which protein nanopores are embedded. > As DNA or RNA moves through the nanopores, its component nucleotides cause characteristic disruptions in the current, which can be analysed to determine the sequence of bases. > The MinION is a palm sized sequencing device containing a consumable flow cell to which the sample is added. > The flow cell contains a sensor that detects the characteristic signal of a particular nucleotide as the molecule is analysed. > Each MinION flow cell has 512 nanopore channels available to be sequencing simultaneously. Users can achieve as much as 30 gigabases of sequence data per flow cell, making it ideal for any application from microbiology and metagenomics, through RNA and cDNA analysis, up to whole genome sequencing. > MinION is easy to use and can be connected to a laptop, making it suitable for use in the lab or out in the field. Library preparation is a straightforward process which applies across all devices and can be performed manually with Oxford Nanopore kits or with an automated library preparation device called VolTRAX. > MinKNOW, the MinION control software, can also be run off a laptop, allowing you to gain real time insight into the progress of your samples and your sequencing run. > MinKNOW performs local base calling to produce fastq files that can then be taken on for analysis, either through the EPI2ME (pronounced "epitome") workflows provided by Oxford Nanopore Technologies or through individuals' own analytical pipelines. > The long reads produced using a MinION provide a more unambiguous approach to mapping a DNA or RNA sequence, enabling much simpler assembly.
> With long reads you can build a clearer picture when it comes to more complex regions, such as repeats and structural variation, while sequencing directly enables the detection of base modification and methylation, without additional preparation steps.
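Since the base called output is a fastq file, here is a minimal Python sketch of reading one and collecting the read lengths. It assumes the simple layout of four lines per record (header, sequence, "+" line, quality) and uses a tiny made-up example held in memory instead of a real file.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, not needed here
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'

# Made-up two-read example standing in for a real fastq file.
example = io.StringIO("@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nGGCC\n+\nIIII\n")
lengths = [len(seq) for _, seq, _ in read_fastq(example)]
print(lengths)  # [8, 4]
```

With a real file you would pass an open file handle to `read_fastq` instead of the `io.StringIO` object.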
Sequencing Approaches: Genome assembly hybrid approach
Now it is important to talk a little bit about sequencing approaches. > In genome assembly, hybrid approaches are becoming more popular. Long read sequencing is still not as available to most researchers as short read sequencing, so it's still not as common as just using short reads to do genome assembly, but the benefits are becoming very apparent. > Genomes have repetitive regions and long chromosomes. So being able to generate a scaffold based on long read sequencing, even if of poor sequence quality, and then using short read sequencing on that scaffold to fix the errors, is proving to be a very good way of doing de novo assembly. > So that is becoming more popular, and it's one approach where two different sequencing technologies are used together. So, genome assembly hybrid approach: - Use long reads to build a scaffold of the whole genome with long contigs. - Then use short reads to correct the errors.
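A toy Python sketch of the second step, correcting a long read scaffold with short reads. It assumes the short reads have already been placed on the scaffold (the alignment itself is not shown) and simply takes a majority vote at each position. Real polishing tools are much more sophisticated; the sequences here are made up.

```python
from collections import Counter

def polish(draft, aligned_short_reads):
    """Correct a draft sequence by majority vote of the short reads covering it.

    aligned_short_reads: list of (start_position, read_sequence) pairs,
    assumed to be already aligned to the draft without gaps.
    """
    pileup = [Counter() for _ in draft]
    for start, read in aligned_short_reads:
        for offset, base in enumerate(read):
            position = start + offset
            if 0 <= position < len(draft):
                pileup[position][base] += 1
    polished = []
    for draft_base, counts in zip(draft, pileup):
        # Keep the draft base where no short read covers the position.
        polished.append(counts.most_common(1)[0][0] if counts else draft_base)
    return "".join(polished)

draft = "ACGTTCGA"  # long read scaffold with one error at position 4
short_reads = [(0, "ACGTA"), (2, "GTACG"), (3, "TACGA")]
print(polish(draft, short_reads))  # ACGTACGA
```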
Next generation sequencing (NGS): Illumina sequencing by synthesis: NGS basic sequencing pipeline: > What are the other names for NGS? Why are you doing NGS? Is it long or short read sequencing? - What is the name of the technology, and what method does it use? - What are the 5 steps in NGS?
Now, we're going to move on to next generation sequencing. This is also sometimes called massively parallel sequencing, or high throughput sequencing. And we're going to focus on one technology: Illumina, which uses a method known as sequencing by synthesis. > Next generation sequencing is sometimes called second generation sequencing, because we now have third generation sequencing. So we've gone back and now we call next generation second generation. The basic next generation sequencing pipeline includes: > You start with your DNA sample, which is extracted and quality controlled. Then you produce what's called a library; this is the library prep stage. * After the library has been prepped, it's ready for sequencing. The sequencing output is in fastq file format, which we'll talk about later. And then that has to go through quality control and analysis. > The library prep step is something I'll talk about a little bit now. It includes a number of steps that you may not normally know about. * First of all, the DNA has to be fragmented, or separated into standardised sizes, to make it easy to put the DNA on the flow cell and to ligate the adapters to it. > In order for the adapters to be ligated after fragmentation, the DNA has to have its ends polished and repaired, so that the chemistry at the ends of the DNA is the right chemistry for the ligase to attach the adapters. > Then we have the adapter ligation. > Then size selection is often done again, to control the sizes of DNA fragments that can flow onto the chips, the lanes, the flow cells. > And then sometimes there's an amplification step, if you started with small quantities of sample.
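The order of the library prep steps can be sketched as a toy Python pipeline. This only mimics the order of the steps on plain strings: fragment, ligate adapters, then size select. End repair and amplification have no string equivalent here and are left out, fragmentation is done at fixed positions for simplicity, and the adapter sequences are hypothetical placeholders.

```python
ADAPTER_LEFT = "AAAA"   # hypothetical placeholder adapter sequences
ADAPTER_RIGHT = "TTTT"

def fragment(dna, size):
    """Cut the DNA into consecutive pieces of at most `size` bases."""
    return [dna[i:i + size] for i in range(0, len(dna), size)]

def ligate_adapters(fragments):
    """Attach an adapter to both ends of every fragment."""
    return [ADAPTER_LEFT + piece + ADAPTER_RIGHT for piece in fragments]

def size_select(molecules, min_len, max_len):
    """Keep only molecules whose length falls inside the chosen window."""
    return [m for m in molecules if min_len <= len(m) <= max_len]

def prepare_library(dna, fragment_size, min_len, max_len):
    pieces = fragment(dna, fragment_size)
    ligated = ligate_adapters(pieces)
    return size_select(ligated, min_len, max_len)

library = prepare_library("ACGTACGTAC", fragment_size=4, min_len=12, max_len=12)
print(library)  # ['AAAAACGTTTTT', 'AAAAACGTTTTT']
```

The short leftover piece ("AC") is removed by size selection, which is the point of that step: only fragments of a controlled size go onto the flow cell.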
copy number variation part importance. script https://www.youtube.com/watch?v=xu7EZtyq3A8
Our genes control all aspects of how we develop, including organs such as our brain. > One important genetic test when investigating a child who is not developing as expected is a chromosome microarray test. > This test examines the chromosomes to find out if there are pieces missing or gained. Any piece of a chromosome that is missing or gained is called a copy number variant, known as a CNV. > We all have CNVs, most of which are harmless. They are part of the natural process of evolution. However, if a CNV removes or adds an important gene, it can cause problems. > Genetic test results are not always black and white. What are the rules of thumb when making sense of the test? > Our interpretation of the results depends on which part of the chromosome is involved, and the size of the chunk that's deleted or doubled up. - If a CNV is small and doesn't contain an important gene, it may be harmless. - When a CNV is inherited from a healthy parent, it is likely to be harmless. - But when a CNV has arisen new in a child, it may be harmful. - CNVs where a large chunk of genetic material is lost tend to be more harmful than CNVs where there is a gain in genetic material. > Each case has to be assessed individually. So each genetic test result showing a CNV requires careful consideration when we're trying to determine the cause of a child's problem. Sometimes the CNV explains the diagnosis, sometimes it doesn't.
Flow cells scripts https://www.youtube.com/watch?v=pfZp5Vgsbw0
Patterned flow cells are an innovative new sequencing technology that dramatically increases data output and throughput. Let's take a closer look at the technology. > In the flow cell, billions of nanowells arranged in a defined array allow precise control of cluster size and spacing, enabling accurate resolution of high density flow cells. > The benefits are: higher data output, more reads and faster run times. Now you can process more samples in less time. > The flow cell is created by patterning billions of nanoscale structures into the glass substrate. > After flow cell assembly, the DNA primers are deposited exclusively into the nanowells. > During cluster generation, a new exclusion amplification method ensures that only a single DNA template is able to bind and form a cluster within a single well. > As the DNA template binds to the seeding primer, it immediately and rapidly amplifies. This rapid amplification prevents other templates from binding and forming a polyclonal cluster, thus ensuring that a monoclonal cluster is formed in each nanowell. This results in a high percentage of wells that are occupied by clusters originating from a single template. > Once the exclusion amplification cluster generation process is complete, the flow cell is ready for sequencing.
Compare long and short read technology
So when we compare long and short read technology, some of the key points that we discussed are read length, cost and error rate. 1. The read length of short read sequencing varies between about 100 and 250 base pairs, a little more if you're only looking at single end reads. > Long read sequencing can easily get over five kilobases, and a lot more with nanopore. 2. The cost per gigabase is currently very comparable, as the cost of both technologies has gone down quite rapidly. 3. Processing time is also fairly comparable, although single molecule real time sequencing technologies (SMRT, PacBio) can generate data a lot quicker. There is just less processing for the data, although the actual sequencing methodologies are fairly similar. 4. The error rates are one of the areas of big difference. > Short read sequencing like Illumina has an error rate of less than 0.1%, while long read sequencing can have an error rate between 1% and 15%, which can be a significant hurdle for some applications. 5. Also, because short read sequencing has been around for so long, the software environment and the tools are very well developed. There are a lot of different tools to choose from, and they have gone through lots of iterations of development over the decades. > The software tools for long read sequencing are much less developed, and there's a lot more to do there. But there are a lot of very active groups in the area and they are progressing very fast.
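To see what those error rates mean per read, here is a small Python calculation. It just multiplies read length by error rate, assuming errors are spread evenly along the read (a simplification). The read lengths of 150 bp and 5,000 bp are picked from the ranges given above.

```python
def expected_errors(read_length, error_rate):
    """Expected number of wrong bases in one read, assuming a uniform error rate."""
    return read_length * error_rate

# Short read: 150 bp at 0.1% error (the upper bound quoted for Illumina).
print(f"short read: {expected_errors(150, 0.001):.2f} errors")
# Long read: 5,000 bp at 1% and at 15% error.
print(f"long read, 1%: {expected_errors(5000, 0.01):.0f} errors")
print(f"long read, 15%: {expected_errors(5000, 0.15):.0f} errors")
```

So a short read usually contains no errors at all, while a single long read can carry tens to hundreds, which is why error correction matters so much for long read data.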
Sequencing strategy: Farjana, say you have to do DNA sequencing. You need to think about money, time and technology availability. How do you decide what sequencing you need to do? What is the sequencing strategy? What factors will affect your sequencing strategy?
Some factors will affect my sequencing strategy: Whole genome sequencing: If I don't have a reference genome, I need to find a cost effective way to sequence the entire genome. For humans and model organisms a reference already exists, so in that case short read sequencing works very well. If a reference genome does not exist, whole genome sequencing is fine for a virus, because the viral genome is small, but it will not be practical for plants and animals, because they have large genomes. Enrichment by exome sequencing: Sequence only the exome, the expressed regions of the DNA, which makes up only 1.5% of the whole genome, so it is cost effective. DNA amplicon sequencing: When you just need to know a specific DNA sequence, use PCR to amplify a gene of interest. Example: in metagenomics, just amplify the 16S rRNA gene to identify bacterial species in a sample. Also used in diagnostics. Transcriptomics: This is getting the sequence by proxy, because you extract the RNA you want and use the reverse transcriptase enzyme to synthesise cDNA, then determine the DNA sequence. Sub tech: single cell RNA seq and miRNA seq. ----- So, some of these aspects are going to affect your sequencing strategy. 1. Whole genome: if you're doing a whole genome sequence: a) Genomes vary dramatically in size. > So short read sequencing would be fine for a viral genome. > But it wouldn't be fine for most plant genomes, which are massive in size. The human genome is about 3 billion base pairs. It's not the largest genome by a long shot, but it's pretty big. b) If your reference genome does not exist, you're going to have to think about a sequencing approach that gives you a very cost effective way of generating a new genome. > For most human studies and other model organisms, we have very good reference genomes now, so we don't need to worry about that. > Short read sequencing works pretty well for a lot of approaches when you have a reference genome.
2. You could also think about enrichment to cut down the amount of sequencing that you do. > Exome sequencing can do that: it means only sequencing the expressed regions of the genome. 3. Other approaches involve DNA amplicon sequencing, > so using PCR to amplify only genes of interest. > This is one approach for metagenomics, where you amplify the 16S ribosomal RNA gene to identify different bacterial species within a mixed population. > Amplicon sequencing is also often used for diagnostic testing in a clinical environment. 4. Transcriptomics is looking at gene expression. So it's measuring RNA, but most technologies can't measure or sequence RNA directly, so you have to convert it into DNA. You're actually sequencing a proxy, which tells you what the RNA level was. > MicroRNA and single cell RNA sequencing are also variants of transcriptomics, which require special analysis pipelines and special choices about which RNA sequencing protocol you use.
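A rough Python calculation of how much enrichment cuts down the sequencing, using the numbers from these notes (human genome about 3 billion bp, exome about 1.5% of it). The coverage depth of 30 is a hypothetical value chosen for illustration, and the calculation ignores capture efficiency and off-target reads.

```python
HUMAN_GENOME_BP = 3_000_000_000   # about 3 billion base pairs
EXOME_PER_THOUSAND = 15           # exome is about 1.5% of the genome

def bases_needed(target_bp, depth):
    """Total bases to sequence to cover target_bp at the given average depth."""
    return target_bp * depth

depth = 30  # hypothetical average coverage depth
exome_bp = HUMAN_GENOME_BP * EXOME_PER_THOUSAND // 1000

whole_genome_bases = bases_needed(HUMAN_GENOME_BP, depth)
exome_bases = bases_needed(exome_bp, depth)

print(f"whole genome: {whole_genome_bases:,} bases")  # 90,000,000,000
print(f"exome only:   {exome_bases:,} bases")         # 1,350,000,000
```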
SNP genotyping technologies. Why is SNP genotyping important? What is meant by SNP genotyping? script https://www.youtube.com/watch?v=plWYBLy9OaM
Why is SNP genotyping important? SNP is the abbreviation of single nucleotide polymorphism. It represents variation in a single nucleotide that occurs at a specific position in the genome. > SNPs are the major contributors to DNA variation among individuals, at a frequency of around one in 1000 bp. > SNPs have been suggested to be responsible for phenotypic differences, to affect the development and progression of diseases, and to determine the response to drug treatment and environmental stress. > SNPs can serve as ideal molecular markers for identifying genes associated with important biological characters and diseases. > Therefore, SNP profiling is considered of great importance in selective breeding, agricultural production and productivity, personalised medicine and drug treatment. What is meant by SNP genotyping? > SNP genotyping refers to the determination of SNP loci on a whole genome scale or within genomic regions of interest. > The major applications of SNP genotyping are in disease treatment and pharmacogenomic studies. > SNP genotyping can be divided into two categories: whole genome association (WGA) and fine mapping. What are the common platforms for SNP genotyping? > The common platforms for SNP genotyping include SNP microarrays, TaqMan SNP genotyping, MassARRAY SNP genotyping and NGS. With the rapid development of SNP genotyping technologies, a large number of public databases have been established for collecting identified SNPs, such as dbSNP and Ensembl. > The commercially available arrays for SNP genotyping are primarily the Affymetrix and Illumina array platforms. Affymetrix and Illumina genotyping solutions provide trusted performance with respect to precision medicine, agricultural genotyping, forensics, population scale studies, and genome wide association studies. Commercially available high density SNP chips cover most of the human genome, and plants and animals of economic significance.
Alternatively, semi custom and custom chips allow researchers to choose the markers they want. > The principles of both platforms are outlined in this figure. In the Affymetrix assays, there are probes for both alleles, and the DNA binds to both probes regardless of the allele it carries, but binding to the mismatched probe is impeded. The impeded binding manifests itself as a dimmer signal. > In Illumina BeadChip assays, a bead carries a sequence complementary to the sequence adjacent to the SNP locus. It binds to the DNA and exhibits different colours depending on the allele.
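To show what a SNP looks like at the sequence level, here is a minimal Python sketch that lists single base differences between a reference and a sample sequence. It assumes the two sequences are already aligned and the same length, and it reports 0-based positions. This is only a way to picture a SNP, not how the array platforms above work.

```python
def find_snps(reference, sample):
    """Return (position, reference_base, sample_base) for each single base difference.

    Assumes both sequences are already aligned and of equal length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]

print(find_snps("ACGTACGT", "ACGTGCGT"))  # [(4, 'A', 'G')]
```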