Module 2
Objects are identified in a database through accession numbers, which are unique identifiers. Gene names are not good identifiers.
Names have to be precise or the computer will not be able to find them; gene names aren't good because there are so many different names for the same gene.
Accession number
Number generated by the laboratory information system (LIS) when a specimen request is entered into the computer. Unique, permanent, and accurate (you don't want mistakes). Databases are designed to be searched through accession numbers or locus IDs. You don't always have an accession number, so sometimes you have to search with gene names or keywords. Ask: do they give you an accession number?
searching the best curated database
RefSeq: reference sequences. Swiss-Prot: proteins.
WP_
When the NCBI genome annotation pipeline annotates a bacterial protein that is 100% identical to, and the same length as, an existing non-redundant protein, NCBI annotates that protein on the genome by referencing the WP_ accession in the annotated CDS feature. This gets rid of duplicates.
Entrez
A biological database retrieval system. Allows text-based searches for a wide variety of data, including annotated genetic sequence information, structural information, citations and abstracts, full papers, and taxonomic data. Its key strength is the ability to integrate information, which comes from cross-referencing between NCBI databases based on preexisting and logical relationships between individual entries. Users do not have to visit multiple databases located in disparate places.
Field tags
Can be used to improve the efficiency of obtaining search results. They are identifiers for each field and are placed in square brackets: [AU] limits the search to the author name, [JID] to a journal.
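A minimal sketch of how a field tag attaches to a search term. `tag_term` is a hypothetical helper written for these notes, not part of any NCBI tool, and the author name is made up.

```python
def tag_term(term: str, field: str) -> str:
    """Return the search term followed by a field tag in square brackets."""
    return f"{term.strip()}[{field}]"


# Restrict one term to the author field and another to the organism field.
print(tag_term("Smith J", "AU"))
print(tag_term("escherichia coli k12", "orgn"))
```

The tagged strings can then be pasted into the search box as they are.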
Bioinformatics paradigm
Collect samples, run the analysis, get the results of the analysis; whatever data you get from the analysis goes into the bioinformatic pipeline (sometimes you run the analysis for other people). Or collect from a database: find the data, download it, reformat it, run the bioinformatic analysis, and collect the result (different format, different accession), which gives a nice publishable figure for your paper. You are dealing with accessions and reformatting all the time.
preview/index
connects different searches with the boolean operators and uses a string of logically connected keywords to perform a new search
MeSH
Consists of a collection of more than 20,000 controlled and standardized vocabulary terms used for indexing articles. It is a thesaurus that helps convert search keywords into standardized terms to describe a concept. Allows "smart" searches in which a group of accepted synonyms is employed, so that the user gets not only exact matches but also related matches on the same topic that otherwise might have been missed.
NCBI taxonomy
Contains the names and taxonomic positions of over 100,000 organisms with at least one nucleotide or protein sequence represented in the GenBank database. Hierarchical classification scheme; root level: archaea, eubacteria, and eukaryota. Allows the taxonomic tree for a particular organism to be displayed; the tree is based on molecular phylogenetic data, namely the small ribosomal RNA data.
gene ontology (GO)
Controlled vocabulary that will tell you: molecular function (specific tasks), biological process (broad biological goals, e.g. cell division), and cellular component (location). Organized as a hierarchy, from general to specific: tRNA modification gene > tRNA methylation gene > tRNA methylation of position 34 gene. Poorly developed for microbial metabolism; eukaryote people use it more.
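The hierarchy can be sketched as child-to-parent links. This is only an illustration using the labels from the example above: the labels are not real GO identifiers, and the sketch assumes one parent per term, whereas real GO terms can have several.

```python
# Child -> parent links for the example chain in the notes (labels only).
PARENT = {
    "tRNA methylation of position 34": "tRNA methylation",
    "tRNA methylation": "tRNA modification",
}


def ancestors(term: str) -> list[str]:
    """Walk up the hierarchy and return all broader terms, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


print(ancestors("tRNA methylation of position 34"))
```

A gene annotated with the most specific term is implicitly annotated with every broader term in the chain.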
Swissprot (Uni-Prot)
Curated database that contains high-quality annotation, is non-redundant, and is cross-referenced to many other databases. In May 2009, the Swiss-Prot database was merged into the UniProt database (merged with European Molecular Biology Laboratory (EMBL) data to create UniProt); entries in Swiss-Prot format are put into UniProt, and the human-curated entries are in Swiss-Prot. Created by Amos Bairoch in 1986. Advantages: a lot of manual curation and integration of literature. Disadvantage: it takes a long time for proteins to get into Swiss-Prot. Great entry-point database, as we will see throughout the course; in this module we are going to explore links to structures, pathways, and enzyme literature.
Abstract syntax notation one (ASN.1)
Data markup language with a structure specifically designed for accessing relational databases. Describes sequences with each item of information in a sequence record separated by tags, so that each subportion of the sequence record can be easily added to relational tables. Difficult for people to read, but this format makes it easy for computers to filter and parse the data. Facilitates the transmission and integration of data between databases.
problem in naming uniformity
Enzyme names: isoleucyl-tRNA synthetase = isoleucyl-tRNA ligase. Gene names: Kae1 = Gcp = TsaD = YgjD. Names have to be precise or the computer will not be able to find them; gene names aren't good because there are so many different names. Chemical names: 4-hydroxythreonine = (2S,3S)-2-amino-3,4-dihydroxybutanoic acid.
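One common workaround is a synonym table that maps every known name to a single preferred one. A minimal sketch using the gene names above; picking TsaD as the preferred name is an arbitrary choice for illustration.

```python
# All four names from the notes refer to the same gene.
SYNONYMS = {
    "kae1": "TsaD",
    "gcp": "TsaD",
    "tsad": "TsaD",
    "ygjd": "TsaD",
}


def canonical_name(name: str) -> str:
    """Map a gene name to its preferred form; unknown names pass through."""
    return SYNONYMS.get(name.strip().lower(), name)


print(canonical_name("YgjD"))
print(canonical_name("folE"))
```

A table like this has to be maintained by hand, which is exactly why accession numbers are preferred when they are available.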
Pubmed warm up
First find the publication date of the E. coli K12 genome: NCBI > Genome > escherichia coli k12 [orgn], then look at the publications. You find 1997, so search before 1997. Use quotes for "gtp cyclohydrolase I" because you're looking for the words together (don't use the numeral 1, use I). Search "GTP cyclohydrolase I" and escherichia coli because you're looking at it in E. coli; then "GTP cyclohydrolase I" and escherichia coli and gene because you're looking for the gene. As molecular biologists, we only want genes that are linked to this paper.
enzyme commission number (EC)
Four-number system; every enzyme gets a number that defines its function and can be used to search databases. Example: GTP cyclohydrolase I, EC 3.5.4.16. 3: hydrolase; 5: hydrolyses carbon-nitrogen bonds that are not peptide bonds; 4: hydrolyses cyclic ??; 16: gives you the substrate (GTP) of the enzyme. By comparison, a first number of 1 means oxidoreductase. The EC number is a numerical classification scheme for enzymes, based on the chemical reactions they catalyze. As a system of enzyme nomenclature, every EC number is associated with a recommended name for the respective enzyme. Strictly speaking, EC numbers do not specify enzymes, but enzyme-catalyzed reactions. If different enzymes (for instance from different organisms) catalyze the same reaction, then they receive the same EC number.
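A short sketch of splitting an EC number into its four levels and looking up the top-level class. `parse_ec` is a hypothetical helper; it assumes a complete four-part number and does not handle partial entries such as 3.5.4.-.

```python
# Top-level EC classes (first number).
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
    7: "translocases",
}


def parse_ec(ec: str) -> tuple[int, int, int, int]:
    """Split a string like 'EC 3.5.4.16' into its four numeric levels."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four numbers, got {ec!r}")
    a, b, c, d = (int(p) for p in parts)
    return (a, b, c, d)


levels = parse_ec("EC 3.5.4.16")
print(levels, EC_CLASSES[levels[0]])
```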
Readseq
There is a frequent need to convert between sequence formats, and there are computer programs for sequence format conversion. Readseq, by Don Gilbert at Indiana University, recognizes sequences in almost any format and writes a new file in an alternative format.
Refseq
GenBank is a dump of everything; RefSeq is a curated database subdivision of NCBI GenBank (curated by NCBI staff and collaborators, with reviewed records indicated). Features: non-redundancy; explicitly linked nucleotide and protein sequences; updates to reflect current knowledge of sequence data and biology; data validation and format consistency; a distinct accession series (all accessions include an underscore "_" character). The Reference Sequence (RefSeq) collection provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins.
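Because RefSeq accessions carry an underscore, a simple pattern check can tell them apart from other accessions. A minimal sketch, assuming the usual shape of a two-letter prefix, an underscore, and an optional version suffix; the accessions below are made-up examples.

```python
import re

# Two-letter prefix, underscore, identifier, optional ".version".
REFSEQ_PATTERN = re.compile(r"^[A-Z]{2}_[A-Z0-9]+(\.\d+)?$")


def looks_like_refseq(accession: str) -> bool:
    """Return True if the accession has the RefSeq prefix-underscore shape."""
    return bool(REFSEQ_PATTERN.match(accession.strip()))


print(looks_like_refseq("WP_012345678.1"))
print(looks_like_refseq("AB123456"))
```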
database search strategies
General search principles, not limited to sequence (or to biology): use accession numbers whenever possible (it saves time); start with broad keywords and narrow the search using more specific terms; try variants of spelling, numbers, etc.; search all relevant databases; be persistent.
In genbank field
gives you the accession number right after LOCUS and DEFINITION
Give the position(s) of the Histidine(s) that is/are phosphorylated in the histidine kinase(s) you identified.
Go to "Amino acid modifications" and look at "Phosphohistidine". Example: EvgS, position 721.
Go back to the Pubmed entry found above and in the associated proteins find the link to the structure from a protein coming from a pathogenic bacteria with Manganese in the structure. Find the PDB accession Number and post it. Follow the links until you are on the PDB site for this structure. Do the same exercise by navigating the corresponding entry through Uniprot/Swissprot.
Go to the PubMed article and press "Structure"; this brings you to 3D2O: Crystal Structure of Manganese-metallated GTP Cyclohydrolase Type IB. 3D2O is the accession number.
limits
Helps restrict the search to a subset of a particular database. Can be set to restrict a search to a particular field (author, date), a particular type of data (DNA/RNA), or a particular search field (gene name or accession number).
data formats
How to organize various types of genetic data? We need standard formats. DNA sequence: what do you say at the beginning, where does it end, and what do you write when you don't know the nucleotide? Proteins: are you going to use one-letter or three-letter abbreviations?
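The one-letter versus three-letter choice is a small example of why format conversion comes up so often. A sketch of converting between the two for the 20 standard amino acids; `to_three_letter` is a hypothetical helper and does not handle ambiguity codes such as X.

```python
# Standard one-letter to three-letter amino acid codes.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(seq: str) -> str:
    """Convert a one-letter protein sequence to dash-separated three-letter codes."""
    return "-".join(ONE_TO_THREE[aa] for aa in seq.upper())


print(to_three_letter("MHK"))
```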
features
includes annotation information about the gene and gene product, as well as regions of biological significance reported in the sequence, with identifiers and qualifiers
journal (header)
Includes the citation information as well as the date of sequence submission. Often hyperlinked to the PubMed record for access to the original literature information. The last part of the header is the contact information of the sequence submitter.
Organism (header)
Includes the source organism, with the scientific name of the species and sometimes the tissue type, plus information on the taxonomic classification of the organism. The different levels of classification are hyperlinked to the NCBI taxonomy database with more detailed descriptions.
gene field (features)
Information about the nucleotide coding sequence and its name. DNA entries: the CDS field gives information about the boundaries of the sequence that can be translated into amino acids. Eukaryotic DNA: the field contains information on the locations of exons, and the translated protein sequence is entered.
goal of database is two fold
information retrieval and knowledge discovery
KEGG
Kyoto Encyclopedia of Genes and Genomes. All linked to Swiss-Prot. Has gone commercial, and UniProt has started to hide it.
Use refseq
making sure you have no duplicate sequences in your NCBI search
Genbank
Most complete collection of annotated nucleic acid sequence data for almost every organism: genomic DNA, mRNA, cDNA, ESTs, high-throughput raw sequence data, and sequence polymorphisms. Search with text-based keywords, similar to a PubMed search, or use molecular sequences to search by sequence similarity using BLAST.
reformatting data files
Much of the routine work of bioinformatics involves messing around with data files to get them into formats that will work with various software, and messing around with the results produced by the software to create a useful summary. Tools: https://www.ebi.ac.uk/Tools/sfc/emboss_seqret/ http://biomodel.uah.es/en/lab/cybertory/analysis/massager.htm
Header
Origin of the sequence, identification of the organism, and unique identifiers associated with the record. Top line: LOCUS, which contains a unique database identifier for a sequence location in the database (not a chromosome locus), followed by the sequence length and molecule type (DNA or RNA), followed by a three-letter code for the GenBank division. There are 17 divisions in total, which were set up simply based on convenience of data storage without necessarily having a rigorous scientific basis. Examples: PLN: plant, fungal, and algal sequences; PRI: primate sequences; MAM: non-primate mammalian sequences; BCT: bacterial sequences; EST: EST sequences.
Databases use different formats such as GenBank or FASTA. Depending on the algorithm you are using, different formats might be required.
Other types of important medical and genetic data may not have universal standards: genotype/haplotype, clinical records, gene expression, protein structure, alignments, phylogenetic trees.
BRENDA
Out of Germany. Aims to capture all information on any enzyme from the literature. Look at the BRENDA database every time you look at a new enzyme.
fasta
Plain sequence information that is readable by many bioinformatics analysis programs. A single definition line starts with ">" followed by a sequence name; extra information such as the gi number or comments can be given, separated from the sequence name by a "|" symbol. The drawback is that much annotation information is lost. In the process of writing a similarity searching program (in 1985), William Pearson designed a simple text format for DNA and protein sequences. The FASTA format is now supported by nearly all databases and software that handle DNA and protein sequences.
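A minimal sketch of splitting a definition line on the "|" symbol. `parse_defline` is a hypothetical helper, and the gi number and accession in the example are invented.

```python
def parse_defline(line: str) -> list[str]:
    """Split a FASTA definition line into its '|'-separated fields."""
    if not line.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    # Drop the leading '>' and trim whitespace around every field.
    return [field.strip() for field in line[1:].strip().split("|")]


print(parse_defline(">gi|12345|gb|AB123456.1|example protein"))
```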
PDB
Protein Data Bank. Imposes a strict structure; this is where protein structures get deposited. From every Swiss-Prot entry, you can go see if there are any PDB structures.
Genpept
protein sequences, the majority of which are conceptual translations from DNA sequences, although a small number of the amino acid sequences are derived using peptide sequencing techniques
history
provides a record of the previous searches so that the user can review, revise, or combine the results of earlier searches
source (features)
Provides the length of the sequence, the scientific name of the organism, and the taxonomy identification number. Optional information: clone source, tissue type, cell line.
Reference (header)
Provides the publication citation related to the sequence entry. Includes the author and title information of the published work (or a tentative title for unpublished work).
Definition (in header)
Provides the summary information for the sequence record, including the name of the sequence, the name and taxonomy of the source organism if known, and whether the sequence is complete or partial. Followed by the accession number for the sequence, which is a unique number assigned to a piece of DNA when it is first submitted to GenBank and is permanently associated with that sequence; this is the number that should be cited in publications. Two different formats for a nucleotide sequence: one letter with five digits or two letters with six digits. For a nucleotide sequence that has been translated into a protein sequence, a new accession number is given in the form of a string of alphanumeric characters. There is also a version number and a GenInfo identifier (gi) number, whose purpose is to identify the current version of the sequence. If the sequence annotation is revised at a later date, the accession number remains the same, but the version number is incremented and the gi number changes. A translated protein sequence also has a different gi number from the DNA sequence it is derived from.
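The accession and version can be pulled apart with a few lines of code. This sketch assumes the classic nucleotide accession shapes of one letter plus five digits or two letters plus six digits; the function names and the example accession are hypothetical.

```python
import re

# One letter + five digits, or two letters + six digits.
CLASSIC_NUCLEOTIDE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")


def split_accession(acc_ver: str):
    """Split 'ACCESSION.VERSION' into (accession, version or None)."""
    accession, _, version = acc_ver.strip().partition(".")
    return accession, int(version) if version else None


def is_classic_nucleotide(accession: str) -> bool:
    """Check the accession (without version) against the classic shapes."""
    return bool(CLASSIC_NUCLEOTIDE.match(accession))


acc, ver = split_accession("AB123456.2")
print(acc, ver, is_classic_nucleotide(acc))
```

When a record is revised, only the number after the dot changes, so comparing the first element of the pair tells you whether two identifiers refer to the same record.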
Genbank sequence format
Relational database; search output for sequence files is produced as flat files for easy reading. Flat files contain 3 sections: header, features, and sequence entry. Each field has a unique identifier for easy indexing by computer software. Searches can be limited to annotation fields such as "organism", "accession number", "authors", and "publication date"; use a combination of "limits" and "preview/index". Field tags: [gene] for gene name, [auth] for author name, [orgn] for organism name.
base count (origin)
Report that includes the numbers of A, G, C, and T in the sequence. For both DNA and protein sequences, the sequence section ends with two forward slashes (//).
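The base count is easy to recompute from the sequence itself. A small sketch with a made-up sequence; `base_count` is a hypothetical helper that ignores any character other than a, c, g, and t.

```python
from collections import Counter


def base_count(seq: str) -> dict[str, int]:
    """Count a, c, g, and t in a DNA sequence, ignoring case."""
    counts = Counter(seq.lower())
    return {base: counts.get(base, 0) for base in "acgt"}


print(base_count("ATGCGA"))
```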
Sequence Retrieval System (SRS)
Retrieval system maintained by the EBI (comparable to NCBI Entrez); not as integrated as Entrez. Allows the user to query multiple databases simultaneously (database integration). Offers direct access to certain sequence analysis applications such as sequence similarity searching and cluster sequence alignment. Queries can be launched using "quick text search", with only one query box in which to enter information. Standard query form vs. extended query form: the standard query allows four criteria (fields) to be used, which are linked by boolean operators; the extended query allows many more diversified criteria and fields to be used. You can query sequences and sequence annotation, with links to literature, metabolic pathways, and other biological databases.
boolean operators
A series of keywords joined by logical terms such as AND, OR, and NOT to indicate relationships between keywords. AND: the search results must contain both words. OR: the results contain either word or both. NOT: excludes results containing the word that follows it. Parentheses define a concept when multiple words and relationships are involved, so that the computer knows which part of the search to execute first.
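A sketch of assembling a query string from these operators, using the GTP cyclohydrolase I search from the warm-up as the example. The helper names are hypothetical, and the "E. coli" synonym is added only to show how OR and parentheses combine.

```python
def all_of(*terms: str) -> str:
    """Join terms with AND: every term must be present."""
    return " AND ".join(terms)


def any_of(*terms: str) -> str:
    """Join terms with OR inside parentheses so they are evaluated first."""
    return "(" + " OR ".join(terms) + ")"


def exclude(query: str, term: str) -> str:
    """Append NOT to drop results that contain the given term."""
    return f"{query} NOT {term}"


query = all_of(
    '"GTP cyclohydrolase I"',
    any_of('"Escherichia coli"', '"E. coli"'),
    "gene",
)
print(query)
print(exclude(query, "review"))
```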
clipboard
Stores search results for later viewing for a limited time ("send to clipboard").
In uniprotkb:
The Swiss-Prot database is the curated part of UniProtKB; TrEMBL is the automatically annotated, unreviewed part.
genbank format
Text file where the fields are listed; each entry has a very set format. Not used by most computer programs; most require FASTA format. Example: ">URO1 uro1.seq length: 200", then press return, then the sequence.
origin
Third section of the flat file: the sequence itself, starting with the label "ORIGIN". The format of the sequence display can be changed by choosing options in the display pull-down menu at the upper left corner.
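A rough sketch of pulling the sequence out of the third section of a flat file: start reading after the ORIGIN label, stop at the closing //, and drop the position numbers and spaces. The record below is a made-up, heavily shortened example, not a real GenBank entry.

```python
def sequence_from_flatfile(text: str) -> str:
    """Return the sequence found between the ORIGIN line and the '//' line."""
    in_origin = False
    chunks = []
    for line in text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            break
        if in_origin:
            # Keep only letters; this removes position numbers and spaces.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(chunks)


RECORD = """LOCUS       EXAMPLE1      20 bp    DNA
ORIGIN
        1 atgcgtacgt tagcatgcat
//
"""
print(sequence_from_flatfile(RECORD))
```

Prefixing the result with a ">" name line would give a FASTA record, at the cost of losing the header and features sections.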
refseq option
use when you want to find "unique" sequences; gets rid of duplicates
complex search
user can use boolean operators or a combination of limits and preview/index features to conduct complex searches
Related articles
Uses a word-weight algorithm to identify related articles with similar words in the titles, abstracts, and MeSH terms. Articles on the same topic that were missed in the original search can be retrieved.
multi sequence fasta file
When you have multiple FASTA sequences in one file: ">URO1 uro1.seq length: 200" (sdsdfdfsgksdfgjdflg), then ">URO2 uro2.seq length: 80" (dfklasdlfdj). The text after the name can vary.
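A minimal sketch of reading a multi-sequence FASTA file into a dictionary keyed by sequence name. It reuses the URO1/URO2 names from the example, with invented, shortened sequences, and assumes the name is the first word after ">".

```python
def read_fasta(text: str) -> dict[str, str]:
    """Parse multi-sequence FASTA text into {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is the first word; the rest of the line is free text.
            parts = line[1:].split()
            name = parts[0] if parts else f"record{len(records) + 1}"
            records[name] = ""
        elif name is not None:
            # Sequence lines are concatenated until the next '>' line.
            records[name] += line
    return records


EXAMPLE = """>URO1 uro1.seq length: 200
ATGC
GGTA
>URO2 uro2.seq length: 80
TTAA
"""
print(read_fasta(EXAMPLE))
```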
PubMed
Entrez biomedical literature database. Contains abstracts and, in some cases, the full-text articles from nearly 4,000 journals. Retrieval of information is based on Medical Subject Headings (MeSH) terms.
Online mendelian inheritance in man (OMIM)
Entrez non-sequence-based database of human disease genes and human genetic disorders. Contains summary information about a particular disease as well as genes related to the disease. Contains numerous hyperlinks to literature citations, primary sequence records, and chromosome loci of the disease genes. Excellent starting point for studying genes related to a disease.