Genome annotation
differences in annotation
- Illustrates differences in annotation between two strains
- This happens because no annotation pipeline (a software tool that consists of individual components) was used
- Without one, the same genes get annotated differently, under different names
- This is a huge problem and leads to a lot of confusion
Bacterial genome annotation?
- Bacterial genomes vary in size from 1 Mb to 8 Mb
- It is not practical to annotate them by hand
- Use an automatic pipeline to transfer information from a closely related reference genome
- Then (try to) manually curate the results
- Manual curation catches and removes errors
What is genome annotation?
- After a genome is sequenced, it is just a string of bases
- We need to assign meaning to the genome, that is, to annotate it
- Structural annotation: identify genes and their intron-exon structures
- Functional annotation: attach data that says what the gene does and what function it is involved in
steps of virus genome annotation
- Break the genome into 1000 bp fragments
- BLASTX the fragments against GenBank
- Find out which proteins these regions encode
- Find the hypothetical proteins using a gene-finding program
- Hypothetical protein: a protein whose function we don't know (this is the reason we want to annotate)
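The first step above (cutting the genome into 1000 bp fragments) can be sketched in a few lines of Python. This is only an illustration: the function name `fragment_genome` and the toy sequence are made up for the example, and the fragments would still need to be written out and submitted to BLASTX separately.

```python
def fragment_genome(sequence, size=1000):
    """Split a genome string into consecutive fragments of at most `size` bases.

    Returns a list of (start, fragment) tuples, where `start` is the
    0-based offset of the fragment in the original sequence.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [(start, sequence[start:start + size])
            for start in range(0, len(sequence), size)]


# Toy 2,500 bp "genome" (made-up sequence, for illustration only).
genome = "ACGT" * 625
fragments = fragment_genome(genome)

# Print each fragment as a FASTA-style record with 1-based coordinates.
for start, frag in fragments:
    print(f">fragment_{start + 1}_{start + len(frag)}")
    print(frag[:20] + "...")
```

The last fragment is simply shorter when the genome length is not a multiple of the fragment size.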
virus genome annotation:
- Virus genomes are not large
- Therefore, we can annotate them manually
- For an adenovirus genome of 35 kb, the process can take 2-4 weeks
- The process consists of several steps
software tools: Ab initio and evidence-drivable gene predictors
GeneMark - a self-training gene finder
GenomeScan - an extension of the popular GENSCAN algorithm
Eukaryotic genome annotation (how to predict)
Ab initio gene prediction
• Gene predictors became available in the 1990s and revolutionized genome analysis
• They need no external evidence to identify a gene or to determine its intron-exon structure
• Most do not report untranslated regions or alternatively spliced transcripts
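To make "no external evidence" concrete, here is a deliberately naive sketch that finds candidate genes from the sequence alone by scanning for open reading frames (ATG to the next in-frame stop codon). It is not how GeneMark or GENSCAN work: real ab initio predictors use trained statistical models and handle introns, which this toy does not. The function name `find_orfs`, the `min_codons` cutoff, and the test sequence are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Return (start, end, frame) for ATG..stop ORFs on the forward strand.

    `start` is 0-based inclusive, `end` is exclusive and includes the
    stop codon. Only the three forward reading frames are scanned.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the current open ATG, if any
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None
    return orfs


# Made-up toy sequence containing one short ORF.
toy = "CCATGAAATTTTAGCC"
orfs = find_orfs(toy)
for start, end, frame in orfs:
    print(frame, start, end, toy[start:end])
```

Like the predictors described above, this sketch reports only the coding stretch; it says nothing about untranslated regions or alternative splicing.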
software tools: EST, protein and RNA-seq aligners and assemblers
BLAST - Basic Local Alignment Search Tool
BLAT - faster than BLAST but has fewer features
Eukaryotic genome annotation
• The ultimate goal is to combine alignment-based evidence with ab initio prediction to obtain a final gene annotation set
• Human curation is too time-consuming and too expensive
• Run different gene finders on the genome and choose the best prediction
What are the steps of bacterial genome annotation?
- FASTA sequence: FASTA is a text-based format for representing either nucleotide or peptide sequences, using single-letter codes
- Predict genes: the process of identifying genes in the sequence
- Compare to the reference (genome/UniProt): compare each gene to your closely related reference genome using BLAST
- Is there a homologue? Determine whether a closely related protein is present or not
- If a homologue is present: take the reference annotation
- If no homologue is present: label the gene as a hypothetical protein and predict domains by running it through Pfam and PROSITE; the domains give some clue as to the function
- Add annotation: assign meaning to the genome
- Predict other features (e.g. tRNA)
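The homologue decision in these steps can be sketched as a small Python function. This is a hedged illustration only: it does not run BLAST, Pfam, or PROSITE. The BLAST result and domain list are passed in as plain values, and the function name `annotate_gene`, the dictionary keys, the 50% identity cutoff, and the example gene data are all assumptions made for the example, not part of any real pipeline.

```python
def annotate_gene(gene_id, best_hit, domains, min_identity=50.0):
    """Decide how to annotate one predicted gene.

    best_hit: dict with 'product' and 'identity' (percent) for the best
              reference match, or None if there was no match.
    domains:  list of domain names found for the protein (may be empty).
    min_identity: assumed cutoff for calling a hit a homologue.
    """
    # Homologue present: transfer the reference annotation.
    if best_hit is not None and best_hit["identity"] >= min_identity:
        return {"gene": gene_id,
                "product": best_hit["product"],
                "source": "reference"}
    # No homologue: label as hypothetical and keep any domain clues.
    note = "domains: " + ", ".join(domains) if domains else "no domains found"
    return {"gene": gene_id,
            "product": "hypothetical protein",
            "source": "domain prediction",
            "note": note}


# Example values are invented for illustration.
gene_a = annotate_gene("gene_0001",
                       {"product": "DNA gyrase subunit A", "identity": 92.5},
                       [])
gene_b = annotate_gene("gene_0002", None, ["ABC_tran"])
print(gene_a)
print(gene_b)
```

In this sketch the domain names are only stored as a note; manual curation would still decide what they say about function.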
software tools: choosers and combiners
JIGSAW
GLEAN
software tools: Genome annotation pipelines and why do we want to use a genome browser?
PASA, NCBI
- Annotation pipelines can do a lot of things
- A genome browser is used to find the list of genes in a given genomic region
Artemis:
- A genome browsing tool and also an annotation tool
- Reads GenBank files
- A FASTA sequence is: a title line starting with > (greater-than symbol), followed by the pure sequence
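The FASTA layout (a title line starting with >, then the pure sequence) is simple enough to parse by hand. The sketch below is a minimal reader for illustration; the function name `parse_fasta` and the two example records are made up, and a real project would more likely use an existing library.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {title: sequence} dictionary.

    A record starts with a line beginning with '>' (the title); all
    following lines up to the next '>' are joined into one sequence.
    """
    records = {}
    title = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            title = line[1:]
            records[title] = []
        elif title is not None:
            records[title].append(line)
    return {t: "".join(chunks) for t, chunks in records.items()}


# Invented example records.
example = """>seq1 toy nucleotide record
ACGTACGT
ACGT
>seq2 toy peptide record
MKVLA
"""
parsed = parse_fasta(example)
for title, seq in parsed.items():
    print(title, len(seq))
```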
Eukaryotic genome annotation
• Use annotation pipelines, since these genomes are large
• Mask genome repeats: replace repeats with Ns. Failure to mask repeats will lead to millions of spurious BLAST hits, which will provide false evidence for genome annotation
• Next, use evidence-based annotation
• Evidence-based annotation aligns ESTs (which give evidence of gene expression) and proteins to the genome using BLAST
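Masking itself is just a substitution: every base inside a repeat is replaced with N. The sketch below assumes the repeat coordinates are already known (in practice a repeat-finding tool supplies them); the function name `mask_repeats`, the 0-based half-open interval convention, and the toy data are assumptions made for the example.

```python
def mask_repeats(sequence, repeats):
    """Replace every base inside the given repeat intervals with 'N'.

    repeats: iterable of (start, end) pairs, 0-based, end exclusive.
    Intervals are clipped to the sequence and may overlap.
    """
    bases = list(sequence)
    for start, end in repeats:
        start = max(0, start)
        end = min(len(bases), end)
        for i in range(start, end):
            bases[i] = "N"
    return "".join(bases)


# Toy sequence with one invented repeat interval.
masked = mask_repeats("ACGTACGTAC", [(2, 5)])
print(masked)
```

The masked sequence keeps its original length, so gene coordinates found afterwards still refer to the unmasked genome.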