Bioinformatics Midterm
Entrez
"Coming in!" A text-based data retrieval system developed by the National Center for Biotechnology Information (NCBI) that provides integrated access to a wide range of data domains, including literature, nucleotide and protein sequences, complete genomes, three-dimensional structures, and more. Entrez connects database entries through neighboring links and hard links.
Which of the following is a valid variable in Perl? @10_numbers, $number-10, %MyHashTable, or STDIN
%MyHashTable. @10_numbers is invalid because a name can't start with a digit (@ means an array/list of elements); $number-10 is invalid because hyphens (-) are not allowed in names ($ means a scalar variable); STDIN is a filehandle, not a variable. A Perl variable needs a sigil ($, @, or %) followed by a valid identifier.
What percentage of the human genome consists of repetitive elements of various kinds?
50%
If the hash table, %student_grades = ("Andrew", 90, "Jennifer", 80, "Patrick", 70), what is the value of $student_grades{"Jennifer"}?
80
Ontology
A set of terms, relationships, and definitions that captures the knowledge of a certain domain; terms are linked by relationships. Ex. GO (Gene Ontology), which provides a controlled vocabulary for describing gene products and consists of 3 unlinked hierarchies: molecular function, biological process, and cellular component. *Many-to-many relationship between genes and GO terms.
Dotplot
A simple picture that gives an overview of the similarities between 2 sequences; it doesn't provide a robust statistical measure of alignment quality. - Dotplot diagrams can be used to find regions of local alignment, repeats, insertions, or deletions. - For nucleotide sequences, a simple scoring scheme gives +1 for a match and -1 for a mismatch. - For protein sequences, PAM and BLOSUM matrices are used as scoring schemes to measure sequence similarity (BLOSUM62 is the default scoring matrix because it is generally effective at detecting a broad range of similarities).
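A minimal Python sketch of the dotplot idea, assuming exact matching over a sliding window (the function names and the window parameter are illustrative, not taken from any particular tool). Diagonal runs of marks correspond to regions of similarity.

```python
def dotplot(seq_a, seq_b, window=1):
    """Return a boolean matrix: cell [i][j] is True when the window
    starting at seq_a[i] exactly matches the window starting at seq_b[j]."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    return [
        [seq_a[i:i + window] == seq_b[j:j + window] for j in range(cols)]
        for i in range(rows)
    ]


def render(matrix):
    """Draw the matrix as text: '*' marks a match, '.' a mismatch."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


if __name__ == "__main__":
    # Two short made-up sequences that differ at one position.
    print(render(dotplot("GATTACA", "GATCACA")))
```

A larger window filters out scattered single-letter matches, which is why real dotplot programs usually let the user tune it.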
Global vs Local Alignment Algorithms
-Global alignment algorithms compare 2 sequences along their entire length and are most applicable to highly similar sequences (Ex. the Needleman-Wunsch algorithm, based on the mathematical technique called dynamic programming). -Local alignment algorithms find the most similar regions in 2 sequences and are best for sequences that share only some regions of similarity (Ex. the Smith-Waterman algorithm, also based on dynamic programming; it is slow for database searches, so BLAST is used as a rapid, heuristic approximation of it).
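The difference between the two approaches shows up clearly in a score-only sketch. The Python below assumes the simple +1 match / -1 mismatch nucleotide scheme and a linear gap penalty of -1 (the gap value is an assumption for illustration); it returns only the optimal score, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score for aligning a and b end to end."""
    prev = [j * gap for j in range(len(b) + 1)]   # first row: all gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]                          # first column: all gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The extra 0 lets an alignment restart anywhere.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best
```

The only structural differences are the zero floor and taking the maximum over the whole table instead of the bottom-right cell; that is what turns a global alignment into a local one.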
Boolean Values
-If the value is a number, 0 means false and all other numbers mean true. -If the value is a string, the empty string "" and the string "0" mean false and all other strings mean true. -The not operator ! gives the opposite. Ex. 5 is true, so !5 is false. The Boolean operators in PubMed and Entrez searches, for ex., are AND, OR, NOT.
\n \t \" \\
\n is a newline; \t is a tab; \" is a double quote (inside double-quoted strings); \\ is a backslash
Bioinformatics
The field of science in which biology, computer science, and information technology merge into a single discipline. The ultimate goal is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.
GO (gene ontology) results interpretation
The symbols + and - indicate over or underrepresentation of a term. P-value is the probability or chance of seeing at least x number of genes out of the total n genes in the list annotated to a particular GO term, given the proportion of genes in the whole genome that are annotated to that GO Term. The closer the p-value is to zero, the more significant the particular GO term associated with the group of genes is (i.e. the less likely the observed annotation of the particular GO term to a group of genes occurs by chance)
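As a rough illustration of how such a p-value can be computed, here is a Python sketch that assumes a hypergeometric model (a common choice for GO enrichment; these notes do not say which test a given tool uses). The argument names are hypothetical: x annotated genes observed in a list of n genes, with K of the N genes in the genome annotated to the term.

```python
from math import comb


def enrichment_p_value(x, n, K, N):
    """Probability of seeing at least x annotated genes in a list of n,
    when K of the N genes in the genome carry the GO term."""
    total = comb(N, n)            # all ways to draw the gene list
    upper = min(n, K)             # cannot see more hits than this
    favourable = sum(
        comb(K, k) * comb(N - K, n - k) for k in range(x, upper + 1)
    )
    return favourable / total
```

For example, `enrichment_p_value(5, 5, 5, 10)` is 1/252: only one of the 252 possible five-gene lists consists entirely of the five annotated genes.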
If you do a BLAST search using a protein query and the best match in the target data base has an E-value of 3.5e-15, what does the result mean for the homology of the 2 matched sequences?
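The target sequence is a strong CANDIDATE for a homologous sequence (not 100% though). An E-value of 3.5e-15 means that about 3.5e-15 hits this good would be expected by chance in a search of this database, so the match is very unlikely to be a false positive; the smaller the E-value, the lower the chance of a false positive.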
Database tier
This is how data storage is handled in the three-tier architecture of a database system
use strict Pragma in Perl
To impose some discipline in Perl programming, put the use strict; pragma at the top of the program. Perl will then insist that you declare every new variable using the my operator: my ($m, $n);
Transcriptome Proteome Metabolome
Transcriptome = all RNA transcripts synthesized by an organism Proteome = entire set of proteins translated Metabolome = sum total of all the low-molecular-weight metabolites produced by an organism, cell, or tissue
Genetic Code
Unambiguous, 1 Start Codon (AUG) and 3 Stop codons (UAA, UAG, UGA), and Degenerate (multiple codons for most amino acids)
Which of the following is NOT one of the 3 genome browsers developed to facilitate human genome annotation? A) UCSC Genome Browser B) Ensembl C) NCBI Map Viewer D) UniProtKB
UniProtKB! It's the knowledgebase of UniProt (the Universal Protein Resource), a central repository of protein data created by combining the Swiss-Prot, TrEMBL, and PIR-PSD databases. The data type it captures is protein annotation, so it is not a genome browser.
Database Management System (DBMS)
Used to define, create, query, update, and administer databases. -Relational data model = organizes data into relations, aka tables; relationships between tables can be one to one, one to many, many to one, or many to many. -Relation (table) = contains rows, aka tuples; the data in each row correspond to a real-world entity or relationship, and each column (header) is called an attribute. A key is an attribute (or set of attributes) whose value uniquely identifies a row; the primary key is the chosen/main key, and a foreign key is an attribute in 1 table that is the primary key of a different table.
Foreign key
An attribute in one table that is the primary key of another table
Biological Databases (primary and secondary)
Archival (primary) databases contain experimental data, not modified by curators, can be highly redundant, Ex. GenBank, Protein Data Bank, etc. Derived (Secondary) databases contain info derived from primary and inferred from content analysis, often annotated by experts and curators, can contain computationally derived results, Ex. Pfam, PubMed, etc.
Secondary Databases
Ex. RefSeq or UniGene. RefSeq is an open-access, annotated, and curated collection of publicly available nucleotide sequences (DNA, RNA) and their protein products; it is non-redundant. This database is built by the National Center for Biotechnology Information (NCBI) and, unlike GenBank, provides only a single record for each natural biological molecule (i.e. DNA, RNA, or protein) for major organisms ranging from viruses to bacteria to eukaryotes. For each model organism, RefSeq aims to provide separate and linked records for the genomic DNA, the gene transcripts, and the proteins arising from those transcripts.
Hidden Markov Models (HMMs)
Class of probabilistic models (loaded dice example) that have been widely used in gene prediction programs, including GENSCAN, FGENESH, and AUGUSTUS - generally applicable to time series or linear sequences
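The loaded dice example can be sketched in Python with the Viterbi algorithm, which recovers the most probable sequence of hidden states (fair or loaded die) from the observed rolls. All probabilities below are assumed values chosen for illustration; they do not come from these notes.

```python
from math import log

STATES = ("F", "L")                       # F = fair die, L = loaded die
START = {"F": 0.5, "L": 0.5}              # assumed starting probabilities
TRANS = {                                 # assumed switching probabilities
    "F": {"F": 0.95, "L": 0.05},
    "L": {"F": 0.10, "L": 0.90},
}
EMIT = {
    "F": {r: 1 / 6 for r in "123456"},            # fair: uniform
    "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}, # loaded: favours 6
}


def viterbi(rolls):
    """Return the most probable state path for a string of rolls."""
    if not rolls:
        return ""
    # Log probabilities avoid underflow on long sequences.
    score = {s: log(START[s]) + log(EMIT[s][rolls[0]]) for s in STATES}
    back = []
    for r in rolls[1:]:
        pointers, new_score = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log(TRANS[p][s]))
            pointers[s] = best_prev
            new_score[s] = (
                score[best_prev] + log(TRANS[best_prev][s]) + log(EMIT[s][r])
            )
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

Gene finders apply the same idea with hidden states such as exon, intron, and intergenic region, and with nucleotides in place of die rolls.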
Computer Networks and the internet
Computer networks = set of computers that are connected and able to exchange data Internet = vast collection of interconnected networks Unifying principle = use the same protocol (TCP/IP) for network interconnection *Can locate a computer on the internet via its IP address, Domain name, or URL
Perl
Created by Larry Wall in the 1980s (first released in 1987); optimized for problems that are 90% text and 10% everything else. To run a Perl program, type "perl my_program.pl" at a command prompt (e.g., MS-DOS).
CASP program
Critical Assessment of Structure Prediction (CASP) uses blind tests to judge methods of protein structure prediction: structural biologists release the amino acid sequence months before the structure is published, so predictors can submit models to CASP before the answer is known.
Federated databases
Advantages: current data, flexible architecture, and no data consolidation. Disadvantages: slower queries, and the data stay at the sources. Ex. Entrez and IBM DiscoveryLink
Central Dogma of Molecular Biology
DNA --> Transcription --> RNA --> Translation --> Protein (in all organisms; the exception is some viruses with RNA genomes, which use reverse transcriptase to make DNA from RNA)
Sequence File Format - FASTA
Definition line starts with a greater than sign > followed by a unique sequence identifier and short description (>NM_139058.1 Homo sapiens aristaless related homeobox)
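Since the format is just a definition line followed by sequence lines, a reader can be sketched in a few lines of Python. The function names are illustrative, and the sequence used in the usage note is a made-up placeholder, not the real NM_139058.1 sequence.

```python
def parse_fasta(text):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []  # drop the leading '>'
        elif header is not None:
            chunks.append(line)            # sequence may span many lines
    if header is not None:
        yield _record(header, chunks)


def _record(header, chunks):
    """Split the definition line at the first space: ID, then description."""
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)
```

Calling `list(parse_fasta(">NM_139058.1 Homo sapiens aristaless related homeobox\nATGC\nGGTA\n"))` gives one record whose identifier is `NM_139058.1` and whose sequence lines have been joined into `ATGCGGTA`.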
UniGene
Derived from mRNA information
Database and software development
Development and implementation of tools that enable efficient access and management of different types of biological information
Which of the following provides an estimate of the number of false positives from a BLAST search? Bit Score, E-Value, Percent Identities, or Percent Positives
E-Value!
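To see why the E-value behaves like an expected count of chance hits, it helps to look at the standard relation between E-value and bit score, E = m x n x 2^(-S'), where m is the query length and n is the database length. The Python sketch below applies that relation directly; it ignores the effective-length corrections that real BLAST applies, so its numbers are only approximate.

```python
def expected_hits(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score.

    Uses E = m * n * 2**(-S'); real BLAST first shortens m and n to
    'effective' lengths, which this sketch does not do.
    """
    return query_length * database_length * 2.0 ** (-bit_score)
```

Each additional bit of score halves the E-value, and doubling the database size doubles it, which is why the same alignment can look less significant in a larger database.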
Primary Databases
Ex. GenBank; repositories for nucleotide sequence data from all organisms. It is a primary database because it houses original sequence data. Primary databases normally have high redundancy, which is one of their drawbacks.
Prokaryotic genomes
Smaller than eukaryotic genomes, with less DNA and fewer genes; the coding capacity is compact, with continuous genes; genes are organized into operons; and prokaryotes often contain plasmids
Standard WHERE clause
Specifies the condition for restricting the rows returned by the query
SQL
Structured Query Language - contains SELECT statements, FROM clauses, and WHERE clauses, among other things
Intrinsic (ab initio) gene-finding algorithms
Struggle in predicting protein-coding genes in eukaryotic genomic DNA because exon/intron borders are hard to predict; AUGUSTUS is one of the most accurate programs for ab initio gene prediction
The ENCODE Project
The Encyclopedia of DNA Elements (ENCODE) aims to understand the function of the human genome; it involved almost 500 scientists from 32 research groups. Findings: over 75% of the genome is transcribed, and the genome has 8.4 million regulatory sites (twice as much DNA as protein-coding genes) and 23,000 protein-coding genes (about 1.5% of the genome)
GenBank
NCBI GenBank contains an annotated collection of all publicly available DNA sequences (Ex. Could type PAX-6 gene into search and the nucleotide sequence would come up; accession version = NM_123456.7)
Filehandles in Perl
The name of an I/O connection for a Perl program; by convention the names are in all uppercase letters. <STDIN> reads from the standard input stream; __END__ indicates the Perl code is finished and what follows is input data; INFILE is a typical name for a filehandle opened for input; OUTFILE is a typical name for a filehandle opened for output
NCBI
National Center for Biotechnology Information - creates public databases for the biological community
If you want to access text information about human diseases, what is the best database to visit? OMIM, PubMed, UniGene, or UniProtKB
OMIM! (UniGene is used to check expression... and UniProtKB is a major source for high-quality protein sets, made of 2 parts: Swiss-Prot and TrEMBL)
Protein Data Bank (PDB)
The RCSB PDB has been the primary repository for protein structures - it contained 125,795 structures at the time these notes were written
UniProtKB
The Universal Protein Resource knowledgebase! A central repository of protein data created by combining the Swiss-Prot, TrEMBL, and PIR-PSD databases. The data type it captures is protein annotation. Its data resources include Swiss-Prot, TrEMBL, PIR (Protein Information Resource), and Proteomes.
PubMed
The best database for retrieving biomedical literature information; it's the literature component of Entrez at NCBI
chomp()
The chomp() function removes the trailing newline from the end of a string; if there is no trailing newline, chomp() does nothing
P-Value
The closer the p-value is to zero, the more significant the particular result is (i.e. the less likely the observed annotation of the particular result has occurred by chance)
Hash table
A hash table is a data structure that looks up values by keys; use the % sign to refer to the entire table. The hash table holds key-value pairs, and hash keys are unique strings. It is better than keeping two parallel arrays! Ex. (arrays) @students = ("Tim", "Jim"); @grades = (90, 95); (hash table) %student_grades = ("Tim", 90, "Jim", 95);
BLASTN can be used to search:
a nucleotide query against a nucleotide sequence database
BLASTP can be used to search:
a protein query against a protein sequence database
TBLASTN can be used to search:
a protein query against a translated nucleotide sequence database - (it translates every DNA sequence in the database into 6 potential proteins, and then compares your protein query against each of those translated proteins)
BLASTX can be used to search:
a translated nucleotide query against a protein sequence database - (translates a DNA sequence into 6 protein sequences using all 6 possible reading frames and then compares each of these proteins to a protein database)
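The six-frame translation step that BLASTX, TBLASTN, and TBLASTX rely on can be sketched in Python as follows. This assumes the standard genetic code and unambiguous DNA input; the function names are illustrative, and no database search is performed.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; '*' is a stop, 'X' an unrecognised codon.
    A trailing partial codon is dropped."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the 3 forward (+1..+3) and 3 reverse (-1..-3) translations."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

For the short made-up sequence `ATGGCCTAA`, frame +1 reads `MA*` (Met, Ala, stop) while frame -1 reads `LGH`, which shows how different the six candidate proteins from one DNA sequence can be.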
Arrays
An array is a variable that contains a list of data elements, and each element holds a scalar value (the index starts at zero and increases by 1 for each element); use the @ sign to refer to the entire array. Ex. $students[0] = "Andrew"; $students[1] = "Jennifer"; $students[2] = "Patrick"; is the same as @students = ("Andrew", "Jennifer", "Patrick");
JIGSAW
an integrative gene prediction program that combines the outputs from other gene finders, splice site predictors and sequence alignments.
Computational biology
analysis and interpretation of various types of biological data, including nucleotide and amino acid sequences, protein domains, protein structures, etc.
Standard SELECT statement
can include a FROM clause to indicate the tables to retrieve data from
Whole genome shotgun sequencing
it has become the most commonly used strategy for sequencing and assembling an entire genome; many software programs are available for assembling shotgun reads into contigs.
Eukaryotic genome
Larger; contains introns/exons; only 2-10% of the genome is protein-coding sequence, and repetitive sequences comprise >50% of the genome. It also contains gene regulatory sequences (promoters and enhancers), pseudogenes (dysfunctional gene copies with significant mutations, usually not transcribed), and noncoding RNA genes (rRNAs, tRNAs, microRNAs, etc.)
TBLASTX can be used to search:
a translated nucleotide query against a translated nucleotide sequence database - the most computationally intensive BLAST program because it translates the DNA of both the query and the database into 6 potential proteins each, then performs 6 x 6 = 36 protein-protein comparisons
RepeatMasker
The most widely used tool for characterizing repetitive DNA; can identify SINE, LINE, LTR, and DNA transposons. It searches a DNA sequence query against curated libraries of repeats such as Repbase (a database of prototypic sequences of repetitive DNA from different eukaryotes) and Dfam (collections of sequence alignments and Hidden Markov Models/HMMs of transposable elements).
print operator
The print operator takes a scalar argument and writes it to standard output. Ex. print "The answer is "; print 10 * 2; print ".\n"; #the output is: The answer is 20. followed by a newline
Database Development Process:
1) Requirement analysis 2) Data modeling 3) Database construction 4) Application development
Architecture of a Database System:
3-tiers: 1) Interface Tier (web interface, client-side) 2) Application Tier (app logic, database connection, server-side) 3) Database Tier (data storage, handling queries)
In a BLAST search, what type of P-value do you want?
A LOW P-value! Because that means a lesser chance that the results are due to chance.
TrEMBL
A computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT.
SWISS-PROT
A curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc), a minimal level of redundancy and a high level of integration with other databases.
Relational database
A foreign key (an attribute in one table and the primary key of another table) is used to establish a link between the data in 2 tables
PAX-6 Genes
A master regulatory gene controlling a complex cascade of events in eye development - mutations cause aniridia, where the iris of eye is absent or deformed
Intrinsic gene-finding algorithms
Ab initio gene prediction is an intrinsic method based on gene content and signal detection; the genomic DNA sequence alone is systematically searched for certain tell-tale signs of protein-coding genes. These signs can be broadly categorized as either signals, specific sequences that indicate the presence of a gene nearby, or content, statistical properties of the protein-coding sequence itself. Ab initio gene finding might be more accurately characterized as gene prediction, since extrinsic evidence is generally required to conclusively establish that a putative gene is functional... Programs such as AUGUSTUS and FGENESH take only the genomic sequence as input.
Genome annotation
Analysis of repetitive DNA, gene prediction, and gene functional annotation (sequence similarity analysis, protein domain analysis which use HMMs, Gene Ontology annotation)...
You have 2 distantly related proteins... which BLOSUM or PAM matrix is best suited to compare the 2 protein sequences?
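BLOSUM45 or PAM250 (a PAM number counts accepted point mutations per 100 residues, so a higher PAM number suits more divergent sequences; a BLOSUM number is the percent-identity threshold of the alignments used to build the matrix, so it works the opposite way and a lower number suits more divergent sequences)
If you want to align 2 protein sequences that are very similar, which BLOSUM or PAM matrix would be most appropriate?
BLOSUM90 or PAM30, because a high BLOSUM number means more similarity, and the PAM (Point Accepted Mutation) number should be low for fewer mutations and therefore more similarity.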
BLAST
Basic Local Alignment Search Tool (BLAST) is a rapid, heuristic version of the Smith-Waterman algorithm, which starts database searching with query words.
BLAST
Basic Local Alignment Search Tool: an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. Most widely used method for sequence similarity analysis (can identify homologous genes that are evolutionarily related) - starts with a dotplot - PSI-BLAST improves database search sensitivity (detection of true positives) and selectivity (rejection of false positives)
UCSC Table Browser
Can be used to query and retrieve genomic annotation data in text format
The Gene Ontology Hierarchies:
GO hierarchies include: Molecular function, biological process, and cellular component
Which of the following is NOT one of the Gene Ontology (GO) hierarchies? Molecular Function, Biological Process, Cellular Component, or Genetic Map
Genetic Map
NCBI Entrez
Good example of the database federation approach for data integration
HTML, XML, and NLP
HTML = HyperText Markup Language (used to create webpages) XML = EXtensible Markup Language (rules for encoding documents, used in info retrieval systems) NLP = Natural Language Processing, enables computers to derive meaning from natural language input
3 types of gene-finding algorithms
Intrinsic, Extrinsic, Combined
Protein Structure Prediction
Major problems in devising algorithms to accurately predict protein structures: -secondary structure prediction (predicting alpha helices and beta sheets) -fold recognition -homology modeling (predicting the 3D structure based on known structures of homologous proteins)
Gene Ontology
One of the main uses of the GO is to perform enrichment analysis on gene sets. For example, given a set of genes that are up-regulated under certain conditions, an enrichment analysis will find which GO terms are over-represented (or under-represented) using annotations for that gene set.
OMIM
Online Mendelian Inheritance in Man. An online catalog of human genes and genetic disorders. (Ex. can type Fragile X into search to learn about the disease)
Biological Networks
Physical Networks, Logical Networks
PSI-BLAST
Position-Specific Iterated BLAST is particularly well suited for identification of distantly related proteins through constructing a Position-Specific Scoring Matrix (PSSM) to facilitate the search
iHOP system
Processes PubMed abstracts and generates a hyperlinked set of data to identify protein interactions
Structured Query Language (SQL)
SELECT Attribute(s) FROM Table(s) WHERE Condition(s) Ex. SELECT Amino_acid, Three_letter FROM AAtable, DGtable WHERE (AAtable.Distal_group = DGtable.Distal_group) AND (H_bond_donor = "Yes")
Scalar Variables in Perl
Scalar variable names consist of $ and an identifier; an identifier has to begin with a letter or underscore. Ex. $name, $Name, $_n2, $A_very_Long_varable_name
UCSC Genome Browser
Supports both text-based query and sequence-based search (BLAT); organizes genomic annotations into tracks, where each track provides a different type of data, like CpG islands or SNPs. Users can also submit their own data as custom annotation tracks and change the presentation of the available tracks.
Pfam
a database of common protein domains (multiple sequence alignments and HMMs); used in protein domain analysis
Database Schema
The logical structure or expression of the interrelationships among the data, Ex: STUDENT (SSN, Name, Major, Birthdate) ENROLL (SSN, CourseID, Year, Semester, Grade) COURSE (CourseID, CourseName, DeptNum) DEPARTMENT (DeptNum, DeptName, Office, Head) -DeptNum is the primary key for DEPARTMENT and the foreign key for the COURSE table
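The DEPARTMENT/COURSE part of this schema can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the column types, the sample rows, and the helper names are all made up for illustration; only the table and attribute names come from the schema above.

```python
import sqlite3


def build_database():
    """Create an in-memory database with DEPARTMENT and COURSE tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite needs this to enforce FKs
    conn.executescript("""
        CREATE TABLE DEPARTMENT (
            DeptNum  INTEGER PRIMARY KEY,
            DeptName TEXT,
            Office   TEXT,
            Head     TEXT
        );
        CREATE TABLE COURSE (
            CourseID   TEXT PRIMARY KEY,
            CourseName TEXT,
            DeptNum    INTEGER REFERENCES DEPARTMENT(DeptNum)
        );
    """)
    # Hypothetical sample rows.
    conn.execute("INSERT INTO DEPARTMENT VALUES (10, 'Biology', 'Room 101', 'Dr. Example')")
    conn.execute("INSERT INTO COURSE VALUES ('BIO500', 'Bioinformatics', 10)")
    conn.commit()
    return conn


def courses_in(conn, dept_name):
    """SELECT ... FROM ... WHERE: join the tables through the foreign key."""
    query = """
        SELECT CourseName, DeptName
        FROM COURSE, DEPARTMENT
        WHERE COURSE.DeptNum = DEPARTMENT.DeptNum
          AND DeptName = ?
    """
    return conn.execute(query, (dept_name,)).fetchall()
```

With foreign keys enforced, trying to insert a COURSE row whose DeptNum has no matching DEPARTMENT row raises `sqlite3.IntegrityError`, which is the link between the two tables doing its job.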
A relational database
The primary key of a table is an attribute (or set of attributes) that has a unique value for each row in the table
Genome Annotation
The process of identifying genes, their regulatory sequences, and their functions - relies heavily on bioinformatics
Theoretical bioinformatics
development of new algorithms and statistics to assess relationships among members of large data sets
Database Warehouse
First consolidates all the data from different sources into a local database and then uses the local database for fast queries; cleaning, loading, etc. improve the quality of the data. Queries can be faster because all the data are kept in one local database. Ex. UniProt, UCSC Genome Browser, FlyBase
Gene Prediction
refers to the process of identifying the regions of genomic DNA that encode genes (Programs include GENSCAN, FGENESH, AUGUSTUS, etc.)
chop()
the chop() function removes the last character of a string
#!/usr/bin/perl
Tells the operating system where Perl is installed (the path to the Perl interpreter). Most Perl statements end with a semicolon ; and comments run from a # sign to the end of the line
Scalar Assignment
The = sign is the Perl assignment operator. Ex. $fred = 8; #Give $fred the value 8 $barney = $fred + 2; #Give $barney the value 10 $fred -= 5; #Same as $fred = $fred - 5; $fred is now 3
Extrinsic gene-finding algorithms
Collecting extrinsic evidence for most or all of the genes in a complex organism requires the study of many hundreds or thousands of cell types, which presents further difficulties... The RefSeq database contains transcript and protein sequences from many different species, and the Ensembl system comprehensively maps this evidence to human and several other genomes (data are presented as views, with each view showing a different level of detail). It is, however, likely that these databases are both incomplete and contain small but significant amounts of erroneous data, and there are new high-throughput sequencing technologies such as RNA-Seq (transcriptome sequencing) and ChIP-sequencing... Major challenges in gene prediction include dealing with sequencing errors in raw DNA data, dependence on the quality of the sequence assembly, handling short reads, frameshift mutations, overlapping genes, and incomplete genes. Extrinsic methods use additional information, such as protein sequences or other nucleotide (genomic) sequences, and compare it with the genome (e.g., using BLAST) to find the intron/exon junctions.