Lecture 4/18: Genomics and personalized medicine
Data analysis and work-flow for analyzing the RNA sequence results: An RNA seq workflow for ________ of differentially expressed genes:
identification
New sequencing = ____ the speed ( whole genome in days) + reduced cost
increased
Variants: SVs ( structural variants: large ______ and deletions)
insertions
Variants: Indels (small _______ or deletions) slightly _____ mutations than SNVs
insertions, bigger
All systems require a _____, that may involve ligation and custom adaptors
library
6. Amplifying is important: works best if you have ___ of copies of DNA, can filter out errors, get error free results
lots
16. Massively _____ sequencing: sequencing all groups of DNA that have been _____ multiple times:
parallel, amplified; amplification is needed to get accurate results and to filter out errors that could occur during sequencing
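One way to picture how many amplified copies help filter out sequencing errors is a per-position majority vote across reads of the same fragment. This is only an illustrative sketch: the function name `consensus` and the example reads are hypothetical, and real base callers work on fluorescence signals rather than finished strings.

```python
from collections import Counter

def consensus(reads):
    """Return the most common base at each position across equal-length reads."""
    if not reads:
        return ""
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("all reads must have the same length")
    # zip(*reads) walks the reads column by column (one column per position).
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))

# Five copies of the same fragment; two copies each carry one random error.
reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAC", "ACGTAA"]
print(consensus(reads))  # ACGTAC
```

With only one copy, an error is indistinguishable from a real variant; with several copies, isolated errors are outvoted.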
They found that resistant tumors display
a transcriptional signature in which certain genes are upregulated; certain genes are differentially regulated
Statistical differences: usually every sequencing study is done at least in
triplicate (check picture); statistics programs such as R and Bioconductor are used
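Triplicates matter because a statistical test needs an estimate of the variation within each condition. The sketch below uses Python's standard library instead of R or Bioconductor, and computes a log2 fold change and a Welch t statistic for one gene. The expression values and function names are made up for illustration; dedicated RNA-seq packages use more elaborate models than this.

```python
import math
import statistics

def log2_fold_change(group_a, group_b):
    """log2 of mean(group_b) / mean(group_a)."""
    return math.log2(statistics.mean(group_b) / statistics.mean(group_a))

def welch_t(group_a, group_b):
    """Welch t statistic: difference of means scaled by the within-group variation."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    return (mean_b - mean_a) / math.sqrt(var_a / len(group_a) + var_b / len(group_b))

# Hypothetical expression values for one gene, three replicates per condition.
control = [10, 12, 11]
treated = [20, 22, 21]
print(round(log2_fold_change(control, treated), 3))
print(round(welch_t(control, treated), 3))
```

With a single replicate per condition the variance cannot be computed at all, which is one reason studies are run at least in triplicate.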
Some types of NGS sequencing: WGS: sequences the _____ genome
entire (WGS = whole genome sequencing)
3. To each tube individually add either ____, _____, _____, or _____ each fluorescently labeled
ddTTP, ddATP, ddGTP, or ddCTP
Gene sets can be
pathways (based on databases such as Reactome, KEGG, WikiPathways, Ingenuity Pathway Analysis (IPA)); genomic location; transcription factor targets. Software is available for many of these tests
7. _ strand is washed off
1
Human genome: 22 pairs are autosomes, _ pair is sex chromosomes
1
Nucleotide in DNA: A:T, C:G Humans differ from each other by _% of the sequence
1
Technology behind the human genome project: The fragments were broken up further, into fragments about _______ long
1000 bp long
21. Reads obtained from the sequencing machine: these DNA sequences are not arranged in any particular order. List of ___ sequences, not in _____; we don't know what genes they refer to. Assembly can be done _____ a reference _____ by looking for overlapping sequences
150-nucleotide, order, without, genome
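Putting unordered reads back together without a reference genome by looking for overlaps can be sketched with a small greedy merge. This is a toy illustration under strong assumptions (error-free reads, no repeats, hypothetical function names); real assemblers are far more involved.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps; leave the pieces separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Reads arrive in no particular order.
unordered = ["TACGTTAGC", "ATGGCGT", "GCGTACG"]
print(greedy_assemble(unordered))  # ['ATGGCGTACGTTAGC']
```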
Era of human genome: don't memorize dates
1866 Mendel's discovery of genes; 1871 Nucleic acids discovered; 1951 First protein sequence; 1953 Structure of DNA; 1960s Elucidation of genetic code; 1977 Advent of DNA sequencing; 1975-1979 First human gene is isolated; 1986 DNA sequencing is automated; 1992 EST sequencing (expressed sequence tag); 1995 First whole genome sequenced (Haemophilus influenzae); 1999 First human chromosome sequenced (22); 2000 Draft of human genome completed; 2001 Publications of draft sequences of the human genome; 2003 Completion of human genome sequence; 2005 Elucidation of genes on X chromosome
RNA sequencing: compare gene expression between _ conditions (normal cell vs. tumor cell; tumor cell vs. tumor cell being treated)
2
Basic structure of DNA: _ sugar phosphate backbones, _____ helix, base pairing
2, double
Basic structure of DNA: _ ______ base pairs in human genome
3 billion
The human genome project: project goals: determine the sequences of over ____________ that make up the human DNA
3 billion base pairs
Steps of Sanger sequencing method 1. take DNA sequence and divide into _ tubes
4
4. In total , >_____ genes were found to be differentially expressed between 2 cell lines
5000; these included the ER and PR genes and associated genes
Summary of some studies that have recently been carried out in cancer genomics, using both DNA and RNA sequencing: Anti PD1 therapy in metastatic melanoma
Anti-PD-1 antibody provides clinical benefits for many melanoma patients, but other patients are resistant, and it is not clear why
Tumors from responding patients are enriched in mutations in the DNA repair gene _______
BRCA2
13-15. ______ amplification
Bridge
ChIP:
Chromatin Immunoprecipitation
4. Test for differential expression
Cuffdiff
5. Statistical analysis: R: the R project for statistical computing R packages
CummeRbund (visualization and analysis). How to decide which results are significant and which are not (too much information is usually given)? Statistical tests: interpretation of the results from an RNA-seq experiment is complicated; how do you decide if differences in expression are significant; how do you decide which genes are relevant to your system?
10. Another round of ____ synthesis
DNA
Bowtie is fine for ___ sequencing
DNA
FastQC is used to analyze the quality of
DNA or RNA sequencing results
6. DNA amplification occurs, in the presence of ____ _________
DNA polymerase
1. Assess data quantity and quality
FastQC
Summary of analysis methods we have discussed so far A workflow from reads to differentially expressed genes: 1. Assess data quantity and quality
FastQC
The sequence you get from the machine is in _____ format
FASTQ
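A FASTQ record takes four lines per read: an `@` header, the sequence, a `+` separator, and one quality character per base. The sketch below parses such text and decodes the quality characters, assuming the common Phred+33 encoding; the function name and the example record are hypothetical.

```python
def parse_fastq(text):
    """Yield (read_id, sequence, phred_scores) from FASTQ text with 4 lines per read."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = (line.strip() for line in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Assumption: Phred+33 encoding, so '!' is quality 0 and 'I' is quality 40.
        scores = [ord(ch) - 33 for ch in qual]
        yield header[1:], seq, scores

example = """@read1
ACGT
+
II5!
"""
records = list(parse_fastq(example))
print(records)  # [('read1', 'ACGT', [40, 40, 20, 0])]
```

Tools such as FastQC summarize these per-base scores across all reads to show where quality drops along the read.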
18. _____________ labeled nucleotides incorporated
Fluorescently
Gene sets:
Gene set enrichment analysis of pathways. Reactome looks at different pathways and finds genes that fit into certain pathways; it takes a big set of genes and finds ~10 that fit into a certain pathway
Gene ontology
Genes are categorized by function, or by association with a specific term. Ex: all transcription factors, all transcription factor targets, or all protein kinases
3. ____ flow cell
Glass
differential expression analysis of RNA sequencing can tell you:
How many copies of RNA are in one cell type compared with another
This signature is referred to as
IPRES (innate anti-PD-1 resistance)
Several platforms for NGS
Illumina MiSeq, Illumina HiSeq, 454 Sequencer, SOLiD system, Ion Proton system
Technology behind the human genome project: Each fragment was inserted into a ________ ________ ________
bacterial artificial chromosome
Expression analysis: count the number of fragments overlapping with all ______ _____ of a gene
annotated exons
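Counting the fragments that overlap a gene's annotated exons can be sketched as an interval-overlap check. The coordinates, gene names, and function names below are invented for illustration, and intervals are assumed to be half-open (start included, end excluded) on a single chromosome.

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end

def count_fragments(fragments, exons_by_gene):
    """Count, per gene, the fragments overlapping any annotated exon of that gene."""
    counts = {gene: 0 for gene in exons_by_gene}
    for f_start, f_end in fragments:
        for gene, exons in exons_by_gene.items():
            # any() ensures a fragment spanning two exons is counted only once.
            if any(intervals_overlap(f_start, f_end, e_start, e_end)
                   for e_start, e_end in exons):
                counts[gene] += 1
    return counts

# Hypothetical annotation and mapped fragments.
exons_by_gene = {"geneA": [(100, 200), (300, 400)], "geneB": [(1000, 1200)]}
fragments = [(150, 250), (190, 310), (500, 600), (1100, 1150), (390, 450)]
print(count_fragments(fragments, exons_by_gene))  # {'geneA': 3, 'geneB': 1}
```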
RNA-sequencing 1. Isolate ____ from cells 2. ______ transcribe to cDNA 3. _______ the DNA 4. Add ________ 5. Carry out sequencing 6. Illumina sequencing method 7. Get sequencing output
mRNA, Reverse, Fragment, adapters (look at quantity)
Chromosomes in condensed state during _ phase
M
2. Map reads to a reference genome ---> Tophat or Bowtie
Mapping reads to a reference genome: there is the issue of introns and exons when sequencing RNA rather than DNA. Introns are spliced out of the RNA, while genomic DNA has introns + exons
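The intron problem can be shown with a toy example: a read that spans an exon-exon junction in the mRNA does not occur as one contiguous piece in the genomic DNA, but it can be placed if the aligner is allowed to split it across a gap. The sequences and the function `spliced_match` are hypothetical, and this brute-force search is not how TopHat works internally; it only illustrates what "splice aware" means.

```python
def spliced_match(read, genome, min_anchor=3):
    """Try to place a read as two pieces separated by a gap (a putative intron).

    Returns (gap_start, gap_end) in genome coordinates, or None if no split fits.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = genome.find(left)
        while left_pos != -1:
            gap_start = left_pos + split
            # Require at least one skipped base between the two pieces.
            right_pos = genome.find(right, gap_start + 1)
            if right_pos != -1:
                return gap_start, right_pos
            left_pos = genome.find(left, left_pos + 1)
    return None

exon1, intron, exon2 = "ATGGCCAAC", "GTAAGTTTTCAG", "CTGATTTAA"
genome = exon1 + intron + exon2   # genomic DNA keeps the intron
mrna = exon1 + exon2              # the intron is spliced out of the RNA
read = mrna[5:13]                 # a read spanning the exon-exon junction

print(read in genome)               # False
print(spliced_match(read, genome))  # (9, 21)
print(genome[9:21] == intron)       # True
```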
4. The ddNTPs have no __ group and thus are considered _______ nucleotides (no further nucleotides will be incorporated once they are added)
OH, termination
MAPK targeted therapy has a similar signature
Possible future directions: determine whether attenuating the biological processes that underlie IPRES would improve the PD-1 response.
5. Statistical analysis-
R project: analyze differences, focusing only on the important differences
NGS can sequence both DNA and ____
RNA
RNA sequencing: comparing conditions with ____ sequencing
RNA
Types of NGS sequencing: RNA-Seq: sequencing of
RNA
Which sequencing is the best way to assess differential expression of genes?
RNA sequencing
Check picture
green area: good quality; indicates shorter sequence has better quality (160-169)
Human genome project used _____ sequencing method
Sanger; it was time consuming and cost $1 per base, totaling 3.8 B
Human genome project: Started in 1990, the U.S. Human Genome Project resulted in competition and a race to finish the sequencing of
Started in 1990, the U.S. Human Genome Project was a multi-center effort coordinated by the U.S. Department of Energy and the National Institutes of Health. A parallel project was carried out by Celera Corporation, a private company. This resulted in competition and a race to finish the sequencing of the entire genome. The project was originally planned to last 15 years, but rapid technological advances accelerated its completion to 2003, ahead of schedule.
T:F/Sequencing facilities are found all over the world
T
Variants: Chromosomal rearrangements
bigger mutations
2. Map reads to reference genome
Tophat (RNA) or Bowtie (DNA)
Software used in RNA sequence analysis
Tuxedo software suite
FASTQ format: sequence _ quality
+ (the file stores each read's sequence plus its quality scores; FastQC is the software that assesses sequence quality)
8. Hybridization to the other primer (________), _ shaped form
adapter, U
2. Short sequences (______) are attached to the ___ of all small sequences
adapters, ends; the sequences of the adapters are already known
Technology behind the human genome project: The bacteria are grown and _________
amplified
1. FastQC (_____ of sequence quality)
analysis
FastQC
analysis of sequence quality; when you get sequences back, make sure the quality is good
17. Sequencing primer is ______, massively parallel sequence
annealed
Sanger Sequencing Method (-___ termination method)
chain; less efficient and costly, so only used for smaller sequencing projects
Next generation sequencing is faster and _____
cheaper
Strict p value for cutoff, fold change cutoff:
In this volcano plot (check picture), every dot represents one gene on the list. The log fold change is plotted on the horizontal axis; the p value after testing is on the vertical axis. Here a p value cutoff of 5% (red line) is assigned. A problem with this arbitrary cutoff is that important genes might be missed.
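The cutoff logic behind a volcano plot can be sketched as follows. The gene names, values, and the function `volcano_points` are made up for illustration; the vertical coordinate is computed as -log10(p), which is the usual convention for such plots, and the fold-change cutoff of 1 (a two-fold change) is an assumption.

```python
import math

def volcano_points(results, p_cutoff=0.05, lfc_cutoff=1.0):
    """Return (gene, x, y, label) with x = log2 fold change and y = -log10(p)."""
    points = []
    for gene, log2_fc, p_value in results:
        if p_value < p_cutoff and log2_fc >= lfc_cutoff:
            label = "up"
        elif p_value < p_cutoff and log2_fc <= -lfc_cutoff:
            label = "down"
        else:
            label = "not significant"
        points.append((gene, log2_fc, -math.log10(p_value), label))
    return points

# Hypothetical (gene, log2 fold change, p value) results.
results = [
    ("G1", 2.5, 0.001),
    ("G2", -1.8, 0.01),
    ("G3", 0.3, 0.0001),  # very significant but tiny change
    ("G4", 3.0, 0.2),     # big change but not significant
    ("G5", 0.9, 0.04),    # just under the fold-change cutoff
]
for point in volcano_points(results):
    print(point)
```

G5 shows the problem with strict cutoffs: it passes the p value test and nearly doubles in expression, yet it is discarded.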
groups are called
clusters
Types of NGS sequencing: WES: Sequences only the _____ regions of the genome
coding
R: The R project for statistical _______ R packages such as
computing CummeRbund, visualization and analysis
Sequences are amplified on a solid surface with _______ attached linkers that hybridize to the _______ adapters, producing clusters of DNA
covalently, library
3. Transcript reconstruction and count the number of reads per gene
Cufflinks
Technology behind the human genome project: In the human genome project the DNA was ___ into overlapping fragments of _____ bp long
cut, 150K
Steps of RNA seq workflow: 1. Assess ____ quantity and quality of reads
data
The human genome project: goals: develop tools for _____
data analysis
Sanger sequence relies on ______ nucleotides (termination nucleotides) get banding pattern, can determine _______ DNA sequence
dideoxy, original
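The chain-termination idea can be simulated in a few lines: each ddNTP tube yields fragments ending wherever that base occurs, and reading the gel from the shortest to the longest fragment gives the sequence. This is a simplified sketch with hypothetical function names; it tracks the newly synthesized strand, which is complementary to the template.

```python
def sanger_fragments(new_strand):
    """Map each ddNTP tube (A, C, G, T) to the fragment lengths it would produce."""
    tubes = {"A": [], "C": [], "G": [], "T": []}
    for length, base in enumerate(new_strand, start=1):
        # A ddNTP of this base can terminate synthesis at this position.
        tubes[base].append(length)
    return tubes

def read_gel(tubes):
    """Read bands from the shortest to the longest fragment to recover the sequence."""
    base_at_length = {length: base for base, lengths in tubes.items() for length in lengths}
    return "".join(base_at_length[length] for length in sorted(base_at_length))

tubes = sanger_fragments("GATTACA")
print(tubes)            # {'A': [2, 5, 7], 'C': [6], 'G': [1], 'T': [3, 4]}
print(read_gel(tubes))  # GATTACA
```

Because every fragment length appears in exactly one tube, the gel must resolve fragments that differ by a single nucleotide.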
breast cancer has _________ subcategories
different; some are + and others are -; know the differences (check picture)
Cuffdiff
differential expression analysis
4. Test for differential expression --> Cuffdiff
differential expression analysis; the output lists gene ID, gene name, location, and the differential expression results for the 2 cell lines
How is RNA sequencing useful: Using RNA sequencing one can analyze _______ _____ _______ when comparing 2 different conditions
differential gene expressions
Gene set enrichment analysis
do genes fall into specific categories or sets
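One common way to ask whether differentially expressed genes fall into a set more often than chance is a hypergeometric (over-representation) test. The sketch below uses only the standard library; the numbers are invented, and this simple test is a swapped-in illustration, not necessarily the method used by any particular enrichment tool.

```python
from math import comb

def enrichment_p(total_genes, set_size, de_genes, overlap):
    """P(at least `overlap` of the DE genes fall in the set by chance)."""
    upper = min(set_size, de_genes)
    favorable = sum(
        comb(set_size, i) * comb(total_genes - set_size, de_genes - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / comb(total_genes, de_genes)

# Hypothetical numbers: 20000 genes, a pathway of 100 genes,
# 500 differentially expressed genes, 10 of which are in the pathway.
p = enrichment_p(total_genes=20000, set_size=100, de_genes=500, overlap=10)
print(p)
```

By chance alone about 2.5 of the 500 genes would be expected in this pathway (500 x 100 / 20000), so finding 10 gives a small p value.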
The human genome project: goals: address _____, ______, and _____ issues
ethical, legal, and social
4. Test for differential _____
expression
2. Bowtie
fast short-read alignment
20. Everything on the _____ cell gets sequenced. Shows _____ color, a pattern of fluorescently colored nucleotides. Massively parallel sequencing
flow, fluorescent
The human genome project: project goals: identify all ____ in the human DNA
genes
Upregulation genes:
mesenchymal transition, cell adhesion, ECM remodeling, angiogenesis, wound healing
Methyl-Seq: sequencing of ______ DNA
methylated
3. Once the alignment is done you will still need to assign gene _____
names
New sequencing methods are called ____ ________ _________ (NGS)
next generation sequencing
2. In each tube: DNA, a primer , all of the ________ (dTTP, dATP, dGTP, dCTP), DNA polymerase
nucleotides
Expression analysis: add up _ of fragments
number; often expressed as FPKM (fragments per kilobase of transcript per million fragments mapped)
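FPKM normalizes a raw fragment count by gene length (in kilobases) and by sequencing depth (in millions of mapped fragments). A minimal sketch of the arithmetic, with a hypothetical function name and made-up numbers:

```python
def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    kilobases = gene_length_bp / 1_000
    millions_mapped = total_mapped_fragments / 1_000_000
    return fragment_count / (kilobases * millions_mapped)

# 500 fragments on a 2000 bp gene, with 10 million fragments mapped in total.
print(fpkm(500, 2_000, 10_000_000))  # 25.0
```

The normalization lets a long gene be compared with a short one, and a deeply sequenced sample with a shallow one.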
This gel can separate sequences that differ by __ __ nucleotide
only one
9. Hybridization to ______ primer
other
5. DNA binds at the matching complementary ______
primer
4. ________ on the flow cell are _______ to the adapters added to the DNA fragments
Primers, complementary (on the glass flow cell)
Cufflinks: transcription reconstruction and _____: assign gene names and quantitate transcripts per gene
quantitation; uses an annotation file; for 2 different conditions, compare the amount of each gene
3. Transcript _______ and count the number of reads per gene
reconstruction
2. map reads to a _____ genome
reference
TopHat eliminates the problem of spliced out sites in
reference genome
Technology behind the human genome project: the sequences were sequenced by the ______ method
Sanger
Technology behind the human genome project: The cloned fragments were then _______ in labs around the world
sequenced
They all involve ______ machines that produce raw data at the end of the sequencing run
sequencing
NGS allows __________ of thousands to millions of ___ molecules simultaneously. Ex: compare tumor cells to normal cells
sequencing, DNA
Clustering: group of genes or samples that contain
similar sequences or have similar expression profiles. Groups are called clusters. Many different clustering algorithms exist. There are pros and cons to this method: there can be problems with the assignment of the clusters, which is sometimes quite arbitrary
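A small sketch of grouping genes by similar expression profiles, and of how arbitrary the assignment can be. It uses a simple greedy rule (join the first cluster containing a gene whose Pearson correlation is above a threshold), which is only one of many possible algorithms; the profiles, names, and thresholds are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
    return num / den

def greedy_clusters(profiles, threshold=0.9):
    """Put each gene into the first cluster holding a sufficiently correlated gene."""
    clusters = []
    for gene, profile in profiles.items():
        for members in clusters:
            if any(pearson(profile, profiles[m]) >= threshold for m in members):
                members.append(gene)
                break
        else:
            clusters.append([gene])  # no similar cluster found, start a new one
    return clusters

# Hypothetical expression of five genes across four samples.
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8.5],
    "g3": [4, 3, 2, 1],
    "g4": [8, 6, 4.5, 2],
    "g5": [1, 5, 1, 5],
}
print(greedy_clusters(profiles, threshold=0.9))  # [['g1', 'g2'], ['g3', 'g4'], ['g5']]
print(greedy_clusters(profiles, threshold=0.4))  # [['g1', 'g2', 'g5'], ['g3', 'g4']]
```

Changing only the threshold moves g5 from its own cluster into the first one, which is the kind of arbitrariness the card warns about.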
5. Run each tube on a gel which separates DNA by _____
size
Steps involved in NGS sequencing: 21 steps (check picture). 1. Break DNA into ____ pieces (50-150 nts); sequence ___ pieces, unlike the Sanger method
small
What to look for in DNA sequencing: variants SNVs
single nucleotide variants; small mutations
TopHat vs Bowtie: for RNA sequencing we want to use a splice-aware aligner. TopHat is a _____ aware aligner, best for
splice, RNA-seq
2. or TopHat
spliced short-read alignment
The Human genome project: project goals: ____ information in databases
store
Further analysis of RNA sequencing results: Gene by gene analysis
to determine which differentially expressed genes are most relevant. This requires extensive literature research and can be very time consuming
3. Cufflinks
transcript reconstruction from alignments
3. Transcript reconstruction and count the number of reads per gene --> Cufflinks
transcript reconstruction from alignments and quantitation
Types of NGS sequencing: ChIP-Seq: sequencing that identifies ______ _____ binding
transcription factor